EzDevInfo.com

r interview questions

Top r frequently asked interview questions

R list to data frame

I have a nested list of data. Its length is 132 and each item is a list of length 20. Is there a quick way to convert this structure into a data frame that has 132 rows and 20 columns of data?

I am new to R, so I figure there is probably an easy way to do this. I searched here on Stack Overflow and couldn’t find a similar question, so I apologize if I missed it. Some sample data:

l <- replicate(
  132,
  list(sample(letters, 20)),
  simplify = FALSE
)

Source: (StackOverflow)

How do I replace NA values with zeros in R?

I have a data.frame and some columns have NA values. I want to replace the NAs with zeros. How I do this?


Source: (StackOverflow)

Advertisements

Tools for making latex tables in R [closed]

On general request, a community wiki on producing latex tables in R. In this post I'll give an overview of the most commonly used packages and blogs with code for producing latex tables from less straight-forward objects. Please feel free to add any I missed, and/or give tips, hints and little tricks on how to produce nicely formatted latex tables with R.

Packages :

  • xtable : for standard tables of most simple objects. A nice gallery with examples can be found here.
  • memisc : tool for management of survey data, contains some tools for latex tables of (basic) regression model estimates.
  • Hmisc contains a function latex() that creates a tex file containing the object of choice. It is pretty flexible, and can also output longtable latex tables. There's a lot of info in the help file ?latex
  • miscFuncs has a neat function 'latextable' that converts matrix data with mixed alphabetic and numeric entries into a LaTeX table and prints them to the console, so they can be copied and pasted into a LaTeX document.
  • texreg package (JSS paper) converts statistical model output into LaTeX tables. Merges multiple models. Can cope with about 50 different model types, including network models and multilevel models (lme and lme4).
  • reporttools package (JSS paper) is another option for descriptive statistics on continuous, categorical and date variables.
  • tables package is perhaps the most general LaTeX table making package in R for descriptive statistics
  • stargazer package makes nice comparative statistical model summary tables

Blogs and code snippets

Related questions :


Source: (StackOverflow)

How to Correctly Use Lists in R?

Brief background: Many (most?) contemporary programming languages in widespread use have at least a handful of ADTs [abstract data types] in common, in particular,

  • string (a (sequence comprised of characters)

  • list (an ordered collection of values), and

  • map-based type (an unordered array that maps keys to values)

In the R programming language, the first two are implemented as character and vector, respectively.

When I began learning R, two things were obvious almost from the start: List is the most important data type in R (because it is the parent class for the R Data Frame), and second, I just couldn't understand how they worked, at least not well enough to use them correctly in my code.

For one thing, it seemed to me that R's List data type was a straightforward implementation of the map ADT (dictionary in Python, NSMutableDictionary in Objective C, hash in Perl and Ruby, object literal in Javascript, and so forth).

For instance, you create them just like you would a Python dictionary, by passing key-value pairs to a constructor (which in Python is dict not list):

>>> x = list("ev1"=10, "ev2"=15, "rv"="Group 1")

And you access the items of an R List just like you would those of a Python dictionary, e.g., x['ev1']. Likewise, you can retrieve just the 'keys' or just the 'values' by:

>>> names(x)    # fetch just the 'keys' of an R list
      "ev1" "ev2" "rv"

>>> unlist(x)   # fetch just the 'values' of an R list
      10 15 "Group1"

>>> x = list("a"=6, "b"=9, "c"=3)  

>>> sum(unlist(x))
      18

but R Lists are also unlike other map-type ADTs (from among the languages I've learned anyway). My guess is that this is a consequence of the initial spec for S, i.e., an intention to design a data/statistics DSL [domain-specific language] from the ground-up.

three significant differences between R Lists and mapping types in other languages in widespread use (e.g,. Python, Perl, JavaScript):

first, Lists in R are an ordered collection, just like vectors, even though the values are keyed (ie, the keys can be any hashable value not just sequential integers). Nearly always, the mapping data type in other languages is unordered.

second, Lists can be returned from functions even though you never passed in a List when you called the function, and even though the function that returned the list doesn't contain an (explicit) List constructor (Of course, you can deal with this in practice by wrapping the returned result in a call to unlist):

>>> x = strsplit(LETTERS[1:10], "")     # passing in an object of type 'character'

>>> class(x)                            # returns 'list', not a vector of length 2
      list

A third peculiar feature of R's Lists: it doesn't seem that they can be members of another ADT, and if you try to do that then the primary container is coerced to a list. E.g.,

>>> x = c(0.5, 0.8, 0.23, list(0.5, 0.2, 0.9), recursive=T)

>>> class(x)
      list

my intention here is not to criticize the language or how it is documented; likewise, I'm not suggesting there is anything wrong with the List data structure or how it behaves. All I'm after is to correct is my understanding of how they work so I can correctly use them in my code.

Here are the sorts of things I'd like to better understand:

  • What are the rules which determine when a function call will return a List (e.g., strsplit expression recited above)?

  • If i don't explicitly assign names to a list (e.g., list(10,20,30,40)) are the default names just sequential integers beginning with 1? (I assume, but i am far from certain that the answer is yes, otherwise we wouldn't be able to coerce this type of List to a vector w/ a call to unlist.

  • why do these two different operators, [], and [[]], return the same result?

    x = list(1, 2, 3, 4)

    both expressions return "1":

    x[1]
    
    x[[1]]
    
  • why do these two expressions not return the same result?

    x = list(1, 2, 3, 4)
    
    x2 = list(1:4)
    

please don't point me to the R Documentation (?list, R-intro)--i have read it carefully and it does not help me answer the type of questions i recited just above.

(lastly, I recently learned of and began using an R Package (available on CRAN) called hash which implements conventional map-type behavior via an S4 class; i can certainly recommend this Package.)


Source: (StackOverflow)

In R, what is the difference between the [] and [[]] notations for accessing the elements of a list?

R provides two different methods for accessing the elements of a list or data.frame- the [] and [[]] operators.

What is the difference between the two? In what situations should I use one over the other?


Source: (StackOverflow)

How to properly document S4 class slots using Roxygen2?

For documenting classes with roxygen(2), specifying a title and description/details appears to be the same as for functions, methods, data, etc. However, slots and inheritance are their own sort of animal. What is the best practice -- current or planned -- for documenting S4 classes in roxygen2?

Due Diligence:

I found mention of an @slot tag in early descriptions of roxygen. A 2008 R-forge mailing list post seems to indicate that this is dead, and there is no support for @slot in roxygen:

Is this true of roxygen2? The previously-mentioned post suggests a user should instead make their own itemized list with LaTeX markup. E.g. a new S4 class that extends the "character" class would be coded and documented like this:

#' The title for my S4 class that extends \code{"character"} class.
#'
#' Some details about this class and my plans for it in the body.
#'
#' \describe{
#'    \item{myslot1}{A logical keeping track of something.}
#'
#'    \item{myslot2}{An integer specifying something else.}
#' 
#'    \item{myslot3}{A data.frame holding some data.}
#'  }
#' @name mynewclass-class
#' @rdname mynewclass-class
#' @exportClass mynewclass
setClass("mynewclass",
    representation(myslot1="logical",
        myslot2="integer",
        myslot3="data.frame"),
    contains = "character"
)

However, although this works, this \describe , \item approach for documenting the slots seems inconsistent with the rest of roxygen(2), in that there are no @-delimited tags and slots could go undocumented with no objection from roxygenize(). It also says nothing about a consistent way to document inheritance of the class being defined. I imagine dependency still generally works fine (if a particular slot requires a non-base class from another package) using the @import tag.

So, to summarize, what is the current best-practice for roxygen(2) slots?

There seem to be three options to consider at the moment:

  • A -- Itemized list (as example above).
  • B -- @slot ... but with extra tags/implementation I missed. I was unable to get @slot to work with roxygen / roxygen2 in versions where it was included as a replacement for the itemized list in the example above. Again, the example above does work with roxygen(2).
  • C -- Some alternative tag for specifying slots, like @param, that would accomplish the same thing.

I'm borrowing/extending this question from a post I made to the roxygen2 development page on github.


Source: (StackOverflow)

Drop factor levels in a subsetted data frame

I have a data frame containing a factor. When I create a subset of this data frame using subset() or another indexing function, a new data frame is created. However, the factor variable retains all of its original levels -- even when they do not exist in the new data frame.

This creates headaches when doing faceted plotting or using functions that rely on factor levels.

What is the most succinct way to remove levels from a factor in my new data frame?

Here's my example:

df <- data.frame(letters=letters[1:5],
                    numbers=seq(1:5))

levels(df$letters)
## [1] "a" "b" "c" "d" "e"

subdf <- subset(df, numbers <= 3)
##   letters numbers
## 1       a       1
## 2       b       2
## 3       c       3    

## but the levels are still there!
levels(subdf$letters)
## [1] "a" "b" "c" "d" "e"

Source: (StackOverflow)

Quickly reading very large tables as dataframes in R

I have very large tables (30 million rows) that I would like to load as a dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.

I know that reading in a table as a list using scan() can be quite fast, e.g.:

datalist <- scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0)))

But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:

df <- as.data.frame(scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0))))

Is there a better way of doing this? Or quite possibly completely different approach to the problem?


Source: (StackOverflow)

Assignment operators in R: '=' and '<-'

What are differences in the assignment operators '=' and '<-' in R? I know that operators are slightly different as this example shows

> x <- y <- 5
> x = y = 5
> x = y <- 5
> x <- y = 5
Error in (x <- y) = 5 : could not find function "<-<-"

But is this the only difference?


Source: (StackOverflow)

Tricks to manage the available memory in an R session

What tricks do people use to manage the available memory of an interactive R session? I use the functions below [based on postings by Petr Pikal and David Hinds to the r-help list in 2004] to list (and/or sort) the largest objects and to occassionally rm() some of them. But by far the most effective solution was ... to run under 64-bit Linux with ample memory.

Any other nice tricks folks want to share? One per post, please.

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.dim)
    names(out) <- c("Type", "Size", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

Source: (StackOverflow)

Remove rows with NAs in data.frame

I'd like to remove the lines in this data frame that contain NAs across all columns. Below is my example data frame.

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   NA
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   NA   NA
4 ENSG00000207604    0   NA   NA   1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0   1    2    3    2

Basically, I'd like to get a data frame such as the following.

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2

Also, I'd like to know how to only filter for some columns, so I can also get a data frame like this:

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
4 ENSG00000207604    0   NA   NA   1    2
6 ENSG00000221312    0   1    2    3    2

Source: (StackOverflow)

What statistics should a programmer (or computer scientist) know? [closed]

I'm a programmer with a decent background in math and computer science. I've studied computability, graph theory, linear algebra, abstract algebra, algorithms, and a little probability and statistics (through a few CS classes) at an undergraduate level.

I feel, however, that I don't know enough about statistics. Statistics are increasingly useful in computing, with statistical natural language processing helping fuel some of Google's algorithms for search and machine translation, with performance analysis of hardware, software, and networks needing proper statistical grounding to be at all believable, and with fields like bioinformatics becoming more prevalent every day.

I've read about how "Google uses Bayesian filtering the way Microsoft uses the if statement", and I know the power of even fairly naïve, simple statistical approaches to problems from Paul Graham's A Plan for Spam and Better Bayesian Filtering, but I'd like to go beyond that.

I've tried to look into learning more statistics, but I've gotten a bit lost. The Wikipedia article has a long list of related topics, but I'm not sure which I should look into. I feel like from what I've seen, a lot of statistics makes the assumption that everything is a combination of factors that linearly combine, plus some random noise in a Gaussian distribution; I'm wondering what I should learn beyond linear regression, or if I should spend the time to really understand that before I move on to other techniques. I've found a few long lists of books to look at; where should I start?

So I'm wondering where to go from here; what to learn, and where to learn it. In particular, I'd like to know:

  1. What kind of problems in programming, software engineering, and computer science are statistical methods well suited for? Where am I going to get the biggest payoffs?
  2. What kind of statistical methods should I spend my time learning?
  3. What resources should I use to learn this? Books, papers, web sites. I'd appreciate a discussion of what each book (or other resource) is about, and why it's relevant.

To clarify what I am looking for, I am interested in what problems that programmers typically need to deal with can benefit from a statistical approach, and what kind of statistical tools can be useful. For instance:

  • Programmers frequently need to deal with large databases of text in natural languages, and help to categorize, classify, search, and otherwise process it. What statistical techniques are useful here?
  • More generally, artificial intelligence has been moving away from discrete, symbolic approaches and towards statistical techniques. What statistical AI approaches have the most to offer now, to the working programmer (as opposed to ongoing research that may or may not provide concrete results)?
  • Programmers are frequently asked to produce high-performance systems, that scale well under load. But you can't really talk about performance unless you can measure it. What kind of experimental design and statistical tools do you need to use to be able to say with confidence that the results are meaningful?
  • Simulation of physical systems, such as in computer graphics, frequently involves a stochastic approach.
  • Are there other problems commonly encountered by programmers that would benefit from a statistical approach?

Source: (StackOverflow)

Drop columns in R data frame

I have a number of columns that I would like to drop from a data frame. I know that we can drop them individually using something like:

df$x <- NULL

but I was hoping to do this with fewer commands.

Also, I know that I could use this:

df <- df[ -c(1,3:6, 12) ]

But I am concerned that the relative position of my variables may change.

Given how powerful R is, I figured there might be a better way than dropping each column 1 by 1.


Source: (StackOverflow)

How to join (merge) data frames (inner, outer, left, right)?

Given two data frames:

df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

df1
#  CustomerId Product
#           1 Toaster
#           2 Toaster
#           3 Toaster
#           4   Radio
#           5   Radio
#           6   Radio

df2
#  CustomerId   State
#           2 Alabama
#           4 Alabama
#           6    Ohio

How can I do database style, i.e., sql style, joins? That is, how do I get:

  • An inner join of df1 and df2:
    Return only the rows in which the left table have matching keys in the right table.
  • An outer join of df1 and df2:
    Returns all rows from both tables, join records from the left which have matching keys in the right table.
  • A left outer join (or simply left join) of df1 and df2
    Return all rows from the left table, and any rows with matching keys from the right table.
  • A right outer join of df1 and df2
    Return all rows from the right table, and any rows with matching keys from the left table.

Extra credit:

How can I do a sql style select statement?


Source: (StackOverflow)

How can we make xkcd style graphs?

Apparently, folk have figured out how to make xkcd style graphs in Mathematica and in LaTeX. Can we do it in R? Ggplot2-ers? A geom_xkcd and/or theme_xkcd?

I guess in base graphics, par(xkcd=TRUE)? How do I do it?

xkcd#1064

As a first stab (and as much more elegantly shown below) in ggplot2, adding the jitter argument to a line makes for a great hand-drawn look. So -

ggplot(mapping=aes(x=seq(1,10,.1), y=seq(1,10,.1))) + 
  geom_line(position="jitter", color="red", size=2) + theme_bw()

It makes for a nice example - but the axes and fonts appear trickier. Fonts appear solved (below), though. Is the only way to deal with axes to blank them out and draw them in by hand? Is there a more elegant solution? In particular, in ggplot2, can element_line in the new theme system be modified to take a jitter-like argument?


Source: (StackOverflow)