
ggplot2 interview questions

Top ggplot2 frequently asked interview questions

How to save a plot as image on the disk?

I plot a simple linear regression using R and would like to save the image as PNG or JPEG. Is it possible to do this automatically (via code)?

There are two different questions: First, I am already looking at the plot on my monitor and I would like to save it as is. Second, I have not yet generated the plot, but I would like to directly save it to disk when I execute my plotting code.
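A minimal sketch of both cases, assuming ggplot2 is installed. The data set (mtcars), the object names and the file paths are illustrative choices, not part of the question. ggsave() writes a ggplot object to disk (it defaults to the last plot displayed), while png() followed by dev.off() sends whatever is drawn in between to a file.

```r
library(ggplot2)

# Illustrative plot; mtcars and all names here are assumptions for the example.
fit_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Case 1: the plot object already exists, so write it to disk with ggsave().
ggsave_path <- file.path(tempdir(), "regression_ggsave.png")
ggsave(ggsave_path, plot = fit_plot, width = 6, height = 4, dpi = 150)

# Case 2: open a file device first, draw the plot, then close the device.
device_path <- file.path(tempdir(), "regression_device.png")
png(device_path, width = 800, height = 600)
plot(mpg ~ wt, data = mtcars)
abline(lm(mpg ~ wt, data = mtcars))
dev.off()
```

For JPEG output the same pattern should apply with jpeg() in place of png(), or with a ".jpg" file name passed to ggsave().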

Source: (StackOverflow)

Adding custom image to geom_polygon fill in ggplot

I was asked by a student if it was possible to recreate a plot similar to the one below using R:

This is from this paper....

This sort of stuff isn't my specialty, but using the following code I was able to create 95% CI ellipses and to plot them with geom_polygon(). I overlaid the ellipses with images I grabbed from the phylopic library using the rphylopic package.

#example data/ellipses
set.seed(101)
n <- 1000
x1 <- rnorm(n, mean=2)
y1 <- 1.75 + 0.4*x1 + rnorm(n)
df <- data.frame(x=x1, y=y1, group="A")
x2 <- rnorm(n, mean=8)
y2 <- 0.7*x2 + 2 + rnorm(n)
df <- rbind(df, data.frame(x=x2, y=y2, group="B"))
x3 <- rnorm(n, mean=6)
y3 <- x3 - 5 - rnorm(n)
df <- rbind(df, data.frame(x=x3, y=y3, group="C"))


#calculating ellipses
library(ellipse)
df_ell <- data.frame()
for(g in unique(as.character(df$group))){  # also works when group is a character column
  df_ell <- rbind(df_ell, cbind(as.data.frame(with(df[df$group==g,], ellipse(cor(x, y), 
                                                                             scale=c(sd(x),sd(y)), 
                                                                             centre=c(mean(x),mean(y))))),group=g))
}
#drawing
library(ggplot2)
p <- ggplot(data=df, aes(x=x, y=y,colour=group)) + 
  #geom_point(size=1.5, alpha=.6) +
  geom_polygon(data=df_ell, aes(x=x, y=y,colour=group, fill=group), alpha=0.1, size=1, linetype=1)


### get center points of ellipses
library(dplyr)
ell_center <- df_ell %>% group_by(group) %>% summarise(x=mean(x), y=mean(y))

### animal images
library(rphylopic)
lion <- get_image("e2015ba3-4f7e-4950-9bde-005e8678d77b", size = "512")[[1]]
mouse <- get_image("6b2b98f6-f879-445f-9ac2-2c2563157025", size="512")[[1]] 
bug <- get_image("136edfe2-2731-4acd-9a05-907262dd1311", size="512")[[1]]

### overlay images on center points
p + add_phylopic(lion, alpha=0.9, x=ell_center[[1,2]], y=ell_center[[1,3]], ysize=2, color="firebrick1")  + 
    add_phylopic(mouse, alpha=1, x=ell_center[[2,2]], y=ell_center[[2,3]], ysize=2, color="darkgreen") +
    add_phylopic(bug, alpha=0.9, x=ell_center[[3,2]], y=ell_center[[3,3]], ysize=2, color="mediumblue") + 
  theme_bw()

Which gives the following:

[Figure: the three ellipses with the animal silhouettes drawn at their centers]

This is ok, but what I'd really like to do is to add an image directly to the 'fill' command of geom_polygon. Is this possible?

Source: (StackOverflow)

List of ggplot2 theme options?

After some research I found the way to prevent an uninformative legend from displaying

... + theme(legend.position = "none")

Where can I find all of the available theme options and their default values for ggplot2?
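One way to inspect the options from code, sketched under the assumption that the default theme is theme_grey(): a theme object behaves like a named list, so its names are the available settings and each entry holds the default value. The variable names are illustrative. The help page ?theme documents each element.

```r
library(ggplot2)

# theme_grey() is ggplot2's default theme; theme_get() returns the theme currently in use.
default_theme <- theme_grey()

# The element names are the settings that theme() accepts.
option_names <- names(default_theme)
print(head(option_names, 10))

# Each entry holds the default for that setting.
print(default_theme$legend.position)
```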

Source: (StackOverflow)

Increase number of axis ticks in ggplot2

I'm generating ggplot plots for some data, but there are too few ticks; I need more 'precision' when reading values off the plot.

Is there some way to increase the number of axis ticks in ggplot2?

I know I can tell ggplot to use a vector as axis ticks, but what I want is to increase the number of ticks for all data. In other words, I want the tick number to be calculated from the data. Possibly ggplot does this internally with some algorithm, but I couldn't find how it does it, so I can't change it to do what I want.
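A possible sketch, assuming the scales package (installed alongside ggplot2) is available: the breaks argument of a continuous scale accepts a function, and scales::pretty_breaks(n = 10) returns one that picks roughly ten "pretty" breaks from the data range. The data set and the value 10 are illustrative.

```r
library(ggplot2)

# mtcars and n = 10 are assumptions for the example.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # The break positions are computed from the data range, not hard-coded.
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 10))

print(p)
```

The n value is a target rather than a guarantee, so the number of ticks actually drawn may differ slightly.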

Thanks!

Source: (StackOverflow)

How can I handle R CMD check "no visible binding for global variable" notes when my ggplot2 syntax is sensible?

EDIT: Hadley Wickham points out that I misspoke. R CMD check is throwing NOTES, not Warnings. I'm terribly sorry for the confusion. It was my oversight.

The short version

R CMD check throws this note every time I use sensible plot-creation syntax in ggplot2:

no visible binding for global variable [variable name]

I understand why R CMD check does that, but it seems to be criminalizing an entire vein of otherwise sensible syntax. I'm not sure what steps to take to get my package to pass R CMD check and get admitted to CRAN.

The background

Sascha Epskamp previously posted on essentially the same issue. The difference, I think, is that subset()'s manpage says it's designed for interactive use.

In my case, the issue is not over subset() but over a core feature of ggplot2: the data = argument.

An example of code I write that generates these notes

Here's a sub-function in my package that adds points to a plot:

JitteredResponsesByContrast <- function (data) {
  return(
    geom_point(
             aes(
               x = x.values, 
               y = y.values
             ),
             data     = data,
             position = position_jitter(height = 0, width = GetDegreeOfJitter(jj))
    )
  )
}

R CMD check, on parsing this code, will say

granovagg.contr : JitteredResponsesByContrast: no visible binding for
  global variable 'x.values'
granovagg.contr : JitteredResponsesByContrast: no visible binding for
  global variable 'y.values'

Why R CMD check is right

The check is technically correct. x.values and y.values:

1. aren't defined locally in the function JitteredResponsesByContrast()
2. aren't pre-defined in the form x.values <- [something], either globally or in the caller.

Instead, they're variables within a dataframe that gets defined earlier and passed into the function JitteredResponsesByContrast().

Why ggplot2 makes it difficult to appease R CMD check

ggplot2 seems to encourage the use of a data argument. The data argument, presumably, is why this code will execute

library(ggplot2)
p <- ggplot(aes(x = hwy, y = cty), data = mpg)
p + geom_point()

but this code will produce an object-not-found error:

library(ggplot2)
hwy # a variable in the mpg dataset

Two work-arounds, and why I'm happy with neither

The NULLing out strategy

Matthew Dowle recommends setting the problematic variables to NULL first, which in my case would look like this:

JitteredResponsesByContrast <- function (data) {
  x.values <- y.values <- NULL # Setting the variables to NULL first
  return(
    geom_point(
             aes(
               x = x.values, 
               y = y.values
             ),
             data     = data,
             position = position_jitter(height = 0, width = GetDegreeOfJitter(jj))
    )
  )
}

I appreciate this solution, but I dislike it for three reasons.

1. It serves no additional purpose beyond appeasing R CMD check.
2. It doesn't reflect intent. It raises the expectation that the aes() call will see our now-NULL variables (it won't), while obscuring the real purpose (making R CMD check aware of variables it apparently wouldn't otherwise know were bound).
3. The problems of 1 and 2 multiply because every time you write a function that returns a plot element, you have to add a confusing NULLing statement.

The with() strategy

You can use with() to explicitly signal that the variables in question can be found inside some larger environment. In my case, using with() looks like this:

JitteredResponsesByContrast <- function (data) {
  with(data, {
      geom_point(
               aes(
                 x = x.values, 
                 y = y.values
               ),
               data     = data,
               position = position_jitter(height = 0, width = GetDegreeOfJitter(jj))
      )
    }
  )
}

This solution works. But, I don't like this solution because it doesn't even work the way I would expect it to. If with() were really solving the problem of pointing the interpreter to where the variables are, then I shouldn't even need the data = argument. But, with() doesn't work that way:

library(ggplot2)
p <- ggplot()
p <- p + with(mpg, geom_point(aes(x = hwy, y = cty)))
p # will generate an error saying `hwy` is not found

So, again, I think this solution has similar flaws to the NULLing strategy:

I still have to go through every plot element function and wrap the logic in a with() call
The with() call is misleading. I still need to supply a data = argument; all with() is doing is appeasing R CMD check.

Conclusion

The way I see it, there are three options I could take:

1. Lobby CRAN to ignore the notes by arguing that they're "spurious" (pursuant to CRAN policy), and do that every time I submit a package.
2. Fix my code with one of two undesirable strategies (NULLing or with() blocks).
3. Hum really loudly and hope the problem goes away.

None of the three make me happy, and I'm wondering what people suggest I (and other package developers wanting to tap into ggplot2) should do. Thanks to all in advance. I really appreciate your even reading through this :-)
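One further option that later ggplot2 releases allow, sketched here as an assumption rather than as something the question covers: refer to data columns through the .data pronoun from rlang, so the column names are no longer bare symbols. The helper name add_points and the use of the mpg data set are hypothetical. In a package, the pronoun itself would presumably need to be imported (for example from rlang), and utils::globalVariables() is another way to declare the names to the checker.

```r
library(ggplot2)

# Hypothetical helper, modelled on the function in the question.
add_points <- function(data) {
  geom_point(
    # .data$hwy looks the column up in the layer's data at draw time,
    # so `hwy` and `cty` never appear as free variables in this function.
    aes(x = .data$hwy, y = .data$cty),
    data = data,
    position = position_jitter(height = 0, width = 0.2)
  )
}

p <- ggplot() + add_points(mpg)
print(p)
```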

Source: (StackOverflow)

Plotting two variables as lines using ggplot2 on the same graph

A very newbish question, but say I have data like this:

test_data <- data.frame(
var0 = 100 + c(0, cumsum(runif(49, -20, 20))),
var1 = 150 + c(0, cumsum(runif(49, -10, 10))),
date = seq.Date(as.Date("2002-01-01"), by="1 month", length.out=50))

How can I plot both time series var0 and var1 on the same graph, with date on the x-axis, using ggplot2? Bonus points if you make var0 and var1 different colours, and can include a legend!

I'm sure this is very simple, but I can't find any examples out there.
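A minimal sketch of one common approach: stack the two columns into "long" form so that a single colour mapping produces both the lines and the legend. The date column is cut to 50 values here so that all columns have the same length, and the names long_data, series and value are assumptions for the example.

```r
library(ggplot2)

set.seed(1)
test_data <- data.frame(
  var0 = 100 + c(0, cumsum(runif(49, -20, 20))),
  var1 = 150 + c(0, cumsum(runif(49, -10, 10))),
  date = seq(as.Date("2002-01-01"), by = "1 month", length.out = 50)
)

# One row per (date, series) pair; `series` records which column a value came from.
long_data <- rbind(
  data.frame(date = test_data$date, series = "var0", value = test_data$var0),
  data.frame(date = test_data$date, series = "var1", value = test_data$var1)
)

# Mapping colour to `series` draws one line per series and adds a legend.
p <- ggplot(long_data, aes(x = date, y = value, colour = series)) +
  geom_line()

print(p)
```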

Source: (StackOverflow)

ggplot: showing % instead of counts in charts of categorical variables

I'm plotting a categorical variable and instead of showing the counts for each category value, I'm looking for a way to get ggplot to display the percentage of values in that category. Of course, it is possible to create another variable with the calculated percentage and plot that one, but I have to do it several dozens of times and I hope to achieve that in one command.

I was experimenting with something like

qplot(mydataf) +
  stat_bin(aes(n = nrow(mydataf), y = ..count../n)) +
  scale_y_continuous(formatter = "percent")

but I must be using it incorrectly, as I got errors.

To easily reproduce the setup, here's a simplified example:

mydata <- c("aa", "bb", NA, "bb", "cc", "aa", "aa", "aa", "ee", NA, "cc")
mydataf <- factor(mydata)
qplot(mydataf)  # this shows the count; I'm looking to see % displayed

In the real case I'll probably use ggplot instead of qplot, but the right way to use stat_bin still eludes me.

I've also tried these four approaches:

ggplot(mydataf, aes(y = (..count..)/sum(..count..))) + 
  scale_y_continuous(formatter = 'percent');

ggplot(mydataf, aes(y = (..count..)/sum(..count..))) + 
  scale_y_continuous(formatter = 'percent') + geom_bar();

ggplot(mydataf, aes(x = levels(mydataf), y = (..count..)/sum(..count..))) + 
  scale_y_continuous(formatter = 'percent');

ggplot(mydataf, aes(x = levels(mydataf), y = (..count..)/sum(..count..))) + 
  scale_y_continuous(formatter = 'percent') + geom_bar();

but all 4 give:

Error: ggplot2 doesn't know how to deal with data of class factor

The same error appears for the simple case of

ggplot (data=mydataf, aes(levels(mydataf))) +
  geom_bar()

so it's clearly something about how ggplot interacts with a single vector. I'm scratching my head; googling for that error gives a single result.
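A possible sketch using current ggplot2 syntax, which differs from the formatter and ..count.. syntax tried above: ggplot() expects a data frame, so the factor is wrapped in one, and after_stat() (assumed available, ggplot2 3.3.0 or later) turns the computed counts into proportions. The missing values are left out of the example data, and the column name category is an assumption.

```r
library(ggplot2)

# Example values from the question, without the missing entries.
mydata <- c("aa", "bb", "bb", "cc", "aa", "aa", "aa", "ee", "cc")

# ggplot() needs a data frame, not a bare factor.
df <- data.frame(category = factor(mydata))

p <- ggplot(df, aes(x = category, y = after_stat(count / sum(count)))) +
  geom_bar() +
  # Label the proportions as percentages.
  scale_y_continuous(labels = scales::percent) +
  ylab("share of values")

print(p)
```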

Source: (StackOverflow)

Plot with 2 y axes, one y axis on the left, and another y axis on the right

I need to plot a bar chart showing counts and a line chart showing a rate, all in one chart. I can do both of them separately, but when I put them together, the scale of the first layer (i.e. the geom_bar) is overlapped by the second layer (i.e. the geom_line).

Can I move the axis of the geom_line to the right?
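One approach that later ggplot2 versions offer, sketched with made-up data: sec_axis() adds a right-hand axis that is a fixed transformation of the left one. The rate is multiplied by a scale factor so it shares the count scale, and the secondary axis divides by the same factor to label it in rate units. All names and numbers below are illustrative.

```r
library(ggplot2)

# Made-up data for illustration only.
d <- data.frame(
  year  = 2001:2005,
  count = c(120, 150, 90, 200, 170),
  rate  = c(0.10, 0.25, 0.15, 0.30, 0.20)
)

# Factor that maps the rate range onto the count range.
scale_factor <- max(d$count) / max(d$rate)

p <- ggplot(d, aes(x = year)) +
  geom_col(aes(y = count), fill = "grey70") +
  # Draw the rate on the count scale ...
  geom_line(aes(y = rate * scale_factor), colour = "red") +
  # ... and undo the scaling when labelling the right-hand axis.
  scale_y_continuous(
    name = "Count",
    sec.axis = sec_axis(~ . / scale_factor, name = "Rate")
  )

print(p)
```

The two axes stay tied together by the scale factor, so this only relabels one scale; it does not give two independent y scales.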

Source: (StackOverflow)

Error: could not find function "unit"

While trying to modify theme settings this simple code gives the following error:

library(ggplot2)
theme_nogrid <- theme_set(theme_update(  
                  plot.margin=unit(c(.25, .25, .25, .25), "in"),))

Error in do.call(theme, list(...)) : could not find function "unit"

R gives me this error for any element that uses 'unit'. Any other settings that do not call 'unit' work fine. I am running R v.2.15.2 (64-bit Windows).

I extensively searched online about this problem and found nothing. I appreciate any suggestions to the problem.
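A likely cause, offered as an assumption: unit() is defined in the grid package, which was not attached in the session above. A minimal sketch that attaches grid explicitly (newer ggplot2 releases also re-export unit(), as far as I know); the variable name old_theme is illustrative.

```r
library(ggplot2)
library(grid)  # unit() lives in the grid package

# theme_update() changes the current theme and returns the previous one.
old_theme <- theme_update(
  plot.margin = unit(c(0.25, 0.25, 0.25, 0.25), "in")
)

# Inspect the updated setting.
print(theme_get()$plot.margin)
```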

Source: (StackOverflow)

How can I arrange an arbitrary number of ggplots using grid.arrange?

This is cross-posted on the ggplot2 google group

My situation is that I'm working on a function that outputs an arbitrary number of plots (depending upon the input data supplied by the user). The function returns a list of n plots, and I'd like to lay those plots out in 2 x 2 formation. I'm struggling with the simultaneous problems of:

How can I allow the flexibility to be handed an arbitrary (n) number of plots?
How can I also specify I want them laid out 2 x 2

My current strategy uses grid.arrange from the gridExtra package. It's probably not optimal, especially since, and this is key, it totally doesn't work. Here's my commented sample code, experimenting with three plots:

library(ggplot2)
library(gridExtra)

x <- qplot(mpg, disp, data = mtcars)
y <- qplot(hp, wt, data = mtcars)
z <- qplot(qsec, wt, data = mtcars)

# A normal, plain-jane call to grid.arrange is fine for displaying all my plots
grid.arrange(x, y, z)

# But, for my purposes, I need a 2 x 2 layout. So the command below works acceptably.
grid.arrange(x, y, z, nrow = 2, ncol = 2)

# The problem is that the function I'm developing outputs a LIST of an arbitrary
# number of plots, and I'd like to be able to plot every plot in the list on a 2 x 2
# laid-out page. I can at least plot a list of plots by constructing a do.call()
# expression, below. (Note: it totally even surprises me that this do.call expression
# DOES work. I'm astounded.)
plot.list <- list(x, y, z)
do.call(grid.arrange, plot.list)

# But now I need 2 x 2 pages. No problem, right? Since do.call() is taking a list of
# arguments, I'll just add my grid.layout arguments to the list. Since grid.arrange is
# supposed to pass layout arguments along to grid.layout anyway, this should work.
args.list <- c(plot.list, "nrow = 2", "ncol = 2")

# Except that the line below is going to fail, producing an "input must be grobs!"
# error
do.call(grid.arrange, args.list)

As I am wont to do, I humbly huddle in the corner, eagerly awaiting the sagacious feedback of a community far wiser than I. Especially if I'm making this harder than it needs to be.
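A sketch of what seems to go wrong and one way around it: the strings "nrow = 2" and "ncol = 2" are passed as extra unnamed arguments, so grid.arrange treats them as things to draw. Passing them as named list elements, or using the grobs argument that gridExtra provides, avoids this. marrangeGrob() is included as a guess at how to spread more than four plots over several 2 x 2 pages.

```r
library(ggplot2)
library(gridExtra)

plot.list <- list(
  ggplot(mtcars, aes(mpg, disp)) + geom_point(),
  ggplot(mtcars, aes(hp, wt)) + geom_point(),
  ggplot(mtcars, aes(qsec, wt)) + geom_point()
)

# Named list elements become named arguments in do.call().
args.list <- c(plot.list, list(nrow = 2, ncol = 2))
do.call(grid.arrange, args.list)

# Equivalent without do.call(): hand the list over via `grobs`.
layout <- arrangeGrob(grobs = plot.list, nrow = 2, ncol = 2)
grid::grid.newpage()
grid::grid.draw(layout)

# For an arbitrary number of plots, this splits them into 2 x 2 pages.
pages <- marrangeGrob(grobs = plot.list, nrow = 2, ncol = 2)
```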

Source: (StackOverflow)

Add legend to ggplot2 line plot

I have a question about legends in ggplot2. I managed to plot three lines in the same graph and want to add a legend with the three colors used. This is the code used

library(ggplot2)    
require(RCurl)

link<-getURL("https://dl.dropbox.com/s/ds5zp9jonznpuwb/dat.txt")
datos<- read.csv(textConnection(link),header=TRUE,sep=";")
datos$fecha <- as.POSIXct(datos[,1], format="%d/%m/%Y")    

temp = ggplot(data=datos,aes(x=fecha, y=TempMax,colour="1")) + 
           geom_line(colour="red") + opts(title="TITULO") +
           ylab("Temperatura (C)") + xlab(" ") + 
           scale_y_continuous(limits = c(-10,40)) + 
           geom_line(aes(x=fecha, y=TempMedia,colour="2"),colour="green") + 
           geom_line(aes(x=fecha, y=TempMin,colour="2"),colour="blue") +
           scale_colour_manual(values=c("red","green","blue"))

temp

and the output

[Figure: ggplot with three lines]

I'd like to add a legend with the three colours used and the name of the variable (TempMax,TempMedia and TempMin). I have tried

scale_colour_manual

but can't find the exact way.

Unfortunately original data were deleted from linked site and could not be recovered. But they came from meteo data files with this format

"date","Tmax","Tmin","Tmed","Precip.diaria","Wmax","Wmed"
2000-07-31 00:00:00,-1.7,-1.7,-1.7,-99.9,20.4,20.4
2000-08-01 00:00:00,22.9,19,21.11,-99.9,6.3,2.83
2000-08-03 00:00:00,24.8,12.3,19.23,-99.9,6.8,3.87
2000-08-04 00:00:00,20.3,9.4,14.4,-99.9,8.3,5.29
2000-08-08 00:00:00,25.7,14.4,19.5,-99.9,7.9,3.22
2000-08-09 00:00:00,29.8,16.2,22.14,-99.9,8.5,3.27
2000-08-10 00:00:00,30,17.8,23.5,-99.9,7.7,3.61
2000-08-11 00:00:00,27.5,17,22.68,-99.9,8.8,3.85
2000-08-12 00:00:00,24,13.3,17.32,-99.9,8.4,3.49
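A possible sketch using a few rows of the sample format above: a legend is only generated for aesthetics mapped inside aes(), so the colour is mapped to a label there and the actual colours are assigned by name in scale_colour_manual(). Setting colour = "red" outside aes(), as in the code above, overrides the mapping and suppresses the legend. The current labs() syntax is assumed in place of opts().

```r
library(ggplot2)

# A few rows in the format shown above (only the temperature columns).
datos <- read.csv(text = '"date","Tmax","Tmin","Tmed"
2000-08-01 00:00:00,22.9,19,21.11
2000-08-03 00:00:00,24.8,12.3,19.23
2000-08-04 00:00:00,20.3,9.4,14.4
2000-08-08 00:00:00,25.7,14.4,19.5')
datos$date <- as.Date(substr(as.character(datos$date), 1, 10))

p <- ggplot(datos, aes(x = date)) +
  # The strings inside aes() become the legend entries.
  geom_line(aes(y = Tmax, colour = "Tmax")) +
  geom_line(aes(y = Tmed, colour = "Tmed")) +
  geom_line(aes(y = Tmin, colour = "Tmin")) +
  # Named values tie each legend entry to a colour.
  scale_colour_manual(
    name = "Variable",
    values = c(Tmax = "red", Tmed = "green", Tmin = "blue")
  ) +
  labs(title = "TITULO", x = NULL, y = "Temperatura (C)")

print(p)
```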

Source: (StackOverflow)

What do hjust and vjust do when making a plot using ggplot?

Every time I make a plot using ggplot, I spend a little while trying different values for hjust and vjust in a line like

+ opts(axis.text.x = theme_text(hjust = 0.5))

to get the axis labels to line up so that they almost touch the axis and are flush against it (justified to the axis, so to speak). However, I don't really understand what's going on. Often, hjust = 0.5 gives such dramatically different results from hjust = 0.6, for example, that I haven't been able to figure it out just by playing around with different values.

Can anyone point me to a comprehensive explanation of how hjust and vjust options work?
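A small sketch for experimenting, written with the newer theme()/element_text() syntax rather than opts()/theme_text(). In element_text(), hjust and vjust are justifications between 0 and 1: 0 means left (or bottom), 0.5 centered, and 1 right (or top). The data set, the rotation angle and the object names are illustrative.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Rotated labels, right-justified so that they end at the axis,
# and vertically centered on their tick marks.
p_flush <- base_plot +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

# Same rotation, but left-justified, for comparison.
p_left <- base_plot +
  theme(axis.text.x = element_text(angle = 90, hjust = 0, vjust = 0.5))

print(p_flush)
print(p_left)
```

Printing the two variants side by side is probably the quickest way to see which edge of the label each value pins.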

Source: (StackOverflow)

Scatterplot with marginal histograms in ggplot2

Is there a way of creating scatterplots with marginal histograms just like in the sample below in ggplot2? In Matlab it is the scatterhist() function and there exist equivalents for R as well. However, I haven't seen it for ggplot2.

[Figure: scatterplot with marginal histograms]

I started an attempt by creating the single graphs but don't know how to arrange them properly.

require(ggplot2)
x <- rnorm(300)
y <- rt(300, df=2)
xy <- data.frame(x, y)
xhist <- qplot(x, geom="histogram") +
  scale_x_continuous(limits=c(min(x),max(x))) +
  opts(axis.text.x = theme_blank(), axis.title.x=theme_blank(), axis.ticks = theme_blank(),
       aspect.ratio = 5/16, axis.text.y = theme_blank(), axis.title.y=theme_blank(),
       background.colour="white")
yhist <- qplot(y, geom="histogram") + coord_flip() +
  opts(background.fill = "white", background.color ="black")

yhist <- yhist + scale_x_continuous(limits=c(min(x),max(x))) +
  opts(axis.text.x = theme_blank(), axis.title.x=theme_blank(), axis.ticks = theme_blank(),
       aspect.ratio = 16/5, axis.text.y = theme_blank(), axis.title.y=theme_blank())

scatter <- qplot(x, y, data=xy) +
  scale_x_continuous(limits=c(min(x),max(x))) +
  scale_y_continuous(limits=c(min(y),max(y)))
none <- qplot(x, y, data=xy) + geom_blank()

I then arranged them with the function posted here. But to make a long story short: is there a way of creating these graphs?
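One way to arrange the pieces, sketched with the gridExtra package (an assumption; it is not used in the question) and current ggplot2 syntax: put the x histogram top left, an empty grob top right, the scatterplot bottom left and the flipped y histogram bottom right, with the scatterplot given most of the space. The bin count and the size ratios are arbitrary.

```r
library(ggplot2)
library(gridExtra)
library(grid)

set.seed(42)
xy <- data.frame(x = rnorm(300), y = rt(300, df = 2))

scatter <- ggplot(xy, aes(x, y)) + geom_point()

# theme_void() strips axes and background from the marginal plots.
xhist <- ggplot(xy, aes(x)) + geom_histogram(bins = 30) + theme_void()
yhist <- ggplot(xy, aes(y)) + geom_histogram(bins = 30) + coord_flip() + theme_void()

# 2 x 2 layout: the scatterplot gets four fifths of the width and height.
layout <- arrangeGrob(
  xhist, nullGrob(),
  scatter, yhist,
  nrow = 2, ncol = 2,
  widths = c(4, 1), heights = c(1, 4)
)

grid.newpage()
grid.draw(layout)
```

Because the scatterplot keeps its axis labels and the histograms do not, the panels will not line up exactly; treat this as a starting point rather than a finished layout.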

Source: (StackOverflow)

ggplot: How to change facet labels?

I have used the following ggplot command:

ggplot(survey,aes(x=age))+stat_bin(aes(n=nrow(h3),y=..count../n), binwidth=10) +
  scale_y_continuous(formatter = "percent", breaks=c(0, 0.1, 0.2)) +
  facet_grid(hospital ~ .) +
  opts(panel.background = theme_blank())

to produce

[Figure: histograms of age, one facet per hospital]

I'd like to change the facet labels, however, to something shorter (like Hosp 1, Hosp 2...) because they are too long now and look cramped (increasing the height of the graph is not an option, it would take too much space in the document). I looked at the facet_grid help page but cannot figure out how.

Thanks in advance for any pointers.
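A possible sketch with made-up data, using the labeller argument that later ggplot2 versions provide: a named character vector maps each original facet value to a shorter label, so the data itself stays unchanged. The hospital names and the short_names vector are hypothetical.

```r
library(ggplot2)

# Made-up survey data; the hospital names are hypothetical.
survey <- data.frame(
  age = c(23, 35, 47, 52, 61, 29, 44, 38),
  hospital = rep(
    c("Long Name General Hospital", "Another Very Long Hospital Name"),
    each = 4
  )
)

# Names are the values found in the data, values are the labels to show.
short_names <- c(
  "Long Name General Hospital"      = "Hosp 1",
  "Another Very Long Hospital Name" = "Hosp 2"
)

p <- ggplot(survey, aes(x = age)) +
  geom_histogram(binwidth = 10) +
  facet_grid(hospital ~ ., labeller = labeller(hospital = short_names))

print(p)
```

Recoding the hospital column itself (for example with factor(..., labels = ...)) before plotting would be an alternative that works in any ggplot2 version.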

Source: (StackOverflow)

Turning off some legends in a ggplot

Suppose I have a ggplot with more than one legend.

mov <- subset(movies, length != "")
(p0 <- ggplot(mov, aes(year, rating, colour = length, shape = mpaa)) +
  geom_point()
)

I can turn off the display of all the legends like this:

(p1 <- p0 + theme(legend.position = "none"))

Passing show_guide = FALSE to geom_point (as per this question) turns off the shape legend.

(p2 <- ggplot(mov, aes(year, rating, colour = length, shape = mpaa)) +
  geom_point(show_guide = FALSE)
)

But what if I want to turn off the colour legend instead? There doesn't seem to be a way of telling show_guide which legend to apply its behaviour to. And there is no show_guide argument for scales or aesthetics.

(p3 <- ggplot(mov, aes(year, rating, colour = length, shape = mpaa)) +
  scale_colour_discrete(show_guide = FALSE) +
  geom_point()
)
# Error in discrete_scale

(p4 <- ggplot(mov, aes(year, rating, shape = mpaa)) +
  aes(colour = length, show_guide = FALSE) +
  geom_point()
)
#draws both legends

This question suggests that the modern (since ggplot2 v0.9.2) way of controlling legends is with the guides function.

I want to be able to do something like

p0 + guides(
  colour = guide_legend(show = FALSE) 
)

but guide_legend doesn't have a show argument.

How do I specify which legends get displayed?
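A minimal sketch of one answer, using mtcars in place of the movies data (an assumption, since movies is not shipped with current ggplot2): guides() accepts "none" for an individual aesthetic, which drops that legend and keeps the others.

```r
library(ggplot2)

# Two legends: colour for cylinders, shape for gears.
p0 <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl), shape = factor(gear))) +
  geom_point()

# Drop only the colour legend; the shape legend is still drawn.
p_no_colour <- p0 + guides(colour = "none")

# Drop only the shape legend instead.
p_no_shape <- p0 + guides(shape = "none")

print(p_no_colour)
```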

Source: (StackOverflow)