Friday, May 13, 2011

Describing Data: Frequently Used Commands

Obtaining a coherent numerical summary of data is a common task, and those summary statistics often need to end up in a table of results. When I am working with my data interactively, I apply the summary() command to my data frame. For example, the following code loads and summarizes a data frame on Yogurt advertising and prices:

library(Ecdat) ## Econometrics Data (useful!)
data(Yogurt) ## Loads Yogurt from Ecdat
summary(Yogurt) ## Summarizes Yogurt


For each quantitative variable, the summary() command provides a five-number summary (min, Q1, median, Q3, max) plus the mean. For categorical variables, it reports the count of each level. This is an excellent quick look at each variable, but you may prefer a richer set of statistics (especially when it comes to typing up tables).
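When the goal is a table rather than console output, it can help to collect the same statistics in a data frame. The following is a minimal sketch in base R. The helper name summary_table is hypothetical, and it only looks at the numeric columns of the data frame.

```r
library(Ecdat)  ## provides the Yogurt data used above
data(Yogurt)

## Hypothetical helper: one row per numeric variable, one column per statistic.
summary_table <- function(df) {
  num <- df[sapply(df, is.numeric)]   ## keep numeric columns only
  stats <- sapply(num, function(x) {
    c(min    = min(x, na.rm = TRUE),
      q1     = unname(quantile(x, 0.25, na.rm = TRUE)),
      median = median(x, na.rm = TRUE),
      mean   = mean(x, na.rm = TRUE),
      q3     = unname(quantile(x, 0.75, na.rm = TRUE)),
      max    = max(x, na.rm = TRUE))
  })
  as.data.frame(t(stats))             ## variables as rows, statistics as columns
}

tab <- summary_table(Yogurt)
round(tab, 2)
```

Because the result is an ordinary data frame, it can be subset, rounded, or written out with write.csv() before being typed up.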

I recently discovered a great way to obtain a richer set of information on a data frame: the psych package, which provides the functions describe() and describe.by(). Continuing with the code from above, here is the basic syntax:

library(psych)
describe(Yogurt) ## Describes in more detail the Yogurt data frame


Suppose you also want to break your summary statistics into two (or four) tables for comparison's sake (perhaps to illustrate stark differences across select subsets of your data). The describe.by() command is a convenient way to break the data down by the levels of a factor. Here's an example on the Yogurt data.

describe.by(Yogurt, Yogurt$choice) ## Named describeBy() in later releases of psych
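If a single table is easier to work with than a list of per-group tables, the following sketch may help. It assumes a psych release that provides describeBy() and its mat argument, and it restricts the data to numeric columns as a simplification of my own.

```r
library(Ecdat)
library(psych)
data(Yogurt)

## Restrict to numeric columns so every statistic is well defined.
num_cols <- sapply(Yogurt, is.numeric)

## mat = TRUE stacks the per-group results into one data frame;
## the grouping level is stored in the column group1.
by_choice <- describeBy(Yogurt[num_cols], group = Yogurt$choice, mat = TRUE)

head(by_choice[, c("group1", "n", "mean", "sd")])
```

Each row is one variable within one level of choice, so the usual data frame tools (subset(), order(), and so on) apply directly.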

Finally, you may want to port your data into LaTeX format and/or select particular summary statistics from the list. I wrote a function that serves as a convenience interface to describe.by() and toLatex(). As toLatex() does not work directly on objects created using describe.by(), you might find this helpful.



If you do not like knowing about the kurtosis of your data, you could read up on the options of describe.by() to learn about how to shut it down. If you're going to port it into a LaTeX table anyway, you could also just modify the code I wrote here to eliminate the summary statistics you don't want and produce LaTeX output.
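As a rough illustration of both ideas, here is a minimal sketch. It is not the function mentioned above: the helper to_tabular is hypothetical and hand-rolled in base R, and the only psych feature it relies on is the skew argument of describe(), which I am assuming is available in your version.

```r
library(Ecdat)
library(psych)
data(Yogurt)

## skew = FALSE asks describe() to skip skew and kurtosis.
desc <- describe(Yogurt[sapply(Yogurt, is.numeric)], skew = FALSE)

## Keep only the statistics wanted in the table.
keep <- c("n", "mean", "sd", "min", "max")
stats <- as.data.frame(desc)[, keep]

## Hypothetical helper: format a numeric data frame as LaTeX tabular lines.
to_tabular <- function(df, digits = 2) {
  header <- paste(c("Variable", names(df)), collapse = " & ")
  cells <- apply(df, 1, function(r) {
    paste(formatC(r, format = "f", digits = digits), collapse = " & ")
  })
  rows <- paste(rownames(df), cells, sep = " & ")
  c(paste0("\\begin{tabular}{l", strrep("r", ncol(df)), "}"),
    "\\hline",
    paste(header, "\\\\"),
    "\\hline",
    paste(rows, "\\\\"),
    "\\hline",
    "\\end{tabular}")
}

writeLines(to_tabular(stats))
```

Every cell, including n, is printed with the same number of decimals, and variable names are not escaped, so names containing underscores would need extra handling before the output compiles in LaTeX.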

FYI: Quick R has a nice summary of some other methods for summarizing data. Of the methods at Quick R that I didn't describe, pastecs looks most like a method I would use.

2 comments:

  1. Cool post.
    I often use str() (or ls.str()) for such things. Thanks for the post.

  2. Hi,

    There is also a 'describe' function in the "Hmisc" package, which gives a pretty good summary and can be converted to LaTeX output and then to PDF. There is also a 'contents' function, which gives a quick overview much like the str() function.
