Friday, April 29, 2011

Forming Formulas

One of the first functions a new R user learns how to use is the lm() command, which involves stating the model formula.

lm(y~x1+x2, data=mydata)

After a while, this just becomes a natural way to say "I want a regression of y on x1 and x2 using mydata." Even though it is natural, the underlying structure of the formula is not as it first appears. This becomes apparent as soon as you need to work with formula objects when writing your own function.

Does ~ define a string? No. You can ask R directly whether the formula is a string (of class "character") using is.character().


As we will soon see, a formula isn't easily coerced into a string either.

Should we think of ~ as a fancy = sign? No. R doesn't think of ~ as defining an equation to be evaluated.

y ~ x1 + x2 + x3

It is better to think of ~ as defining the entire statement as an object with a special class called formula, which conveys the idea of the model to other functions in R.

For the rest of the post, I will show some calculations that illustrate more fundamentally what a formula object is. For concreteness, imagine you have a formula object with the general form y~x1+x2+z1+z2+z3.

You could define this object with the command:

form.ex = y~x1+x2+z1+z2+z3

Suppose our goal is to extract the first two explanatory variables (x1 and x2) and use them in a linear model for the response (y). If you're reading this for help, you likely have a more noble goal in mind. Take this as an illustrative example.

To appreciate the power of a method, it is useful to see the inelegant code we are avoiding. The inelegant code also shines some light on how a formula object differs from a string.

An inelegant method

We could gain access to the elements of the formula by coercing it into a character string using as.character(). If we do this in our example, we obtain a character vector of length three:

[1] "~"
[2] "y"
[3] "x1 + x2 + z1 + z2 + z3"

Here, we see that the RHS of the formula is stored in the third element of this vector. The response is in the second element of the vector. Let's store these into vectors we can easily reference and use.

rhs= as.character(form.ex)[3]
resp = as.character(form.ex)[2]

Next, we can splice the rhs vector to obtain the individual variable names using the strsplit command. Note: we want to split the vector apart using a delimiter " + " with the spaces. Because the + is an operator, we need to use " \\+ " instead. In addition, strsplit() returns a list of length two. We really want the first element of the list... (remember, this is the inelegant way).

rhs.sp = strsplit(rhs," \\+ ")[[1]]
[1] "x1" "x2" "z1" "z2" "z3"

Now, we have a vector of variable names for the right hand side of the equation. As our goal is to construct a formula object relating the response to the first two explanatory variables, extract the first two elements of this vector. rhs.sp[1:2]
[1] "x1" "x2"

And, paste this right hand side of the equation to the response variable name in a formula style using the following combination of paste() commands.

[1] "y ~ x1+x2"

Great. The only problem? The object short.form is not a formula object. It is a character string. This won't be a problem if we input it into something like lm(), which will coerce it to a formula object before using it.

lm(short.form, data=mkt.df)

lm(formula = short.form, data = mkt.df)

(Intercept) x1 x2
-25.329 2.335 4.239

If you want to do the formula coercion yourself, you can use the as.formula() command as follows.

y ~ x1 + x2

A more elegant method

The all.vars() command does the splitting, splicing and extracting that comprised the bulk of what we did above. Starting with the formula object form.ex, we could create a vector of variable names in our model statement as:

vars.ex = all.vars(form.ex)
[1] "y" "x1" "x2" "z1" "z2" "z3"

Now, just define the rhs and resp vectors as we did in the less elegant method.

rhs = vars.ex[2:3]
resp = vars.ex[1]

Is there a better way recombine these vectors than to create our own paste collapser? Yes! The reformulate() function makes it easy.

myform=reformulate(rhs.av, response=resp)
y ~ x1 + x2

And, for good measure, let's see what happens in the lm() command:

lm(myform, data=mkt.df)

lm(formula = myform, data = mkt.df)

(Intercept) x1 x2
-25.329 2.335 4.239

Tuesday, April 26, 2011

Automatically Save Your Plots to a Folder

Suppose you're working on a problem that involves a loop for calculations. At each iteration inside the loop, you want to construct a plot. Not only do you want to see the plot, but you would like to save each plot for a presentation, report or paper. Furthermore, the loop goes on for a while (say through the 26-letters of the alphabet).

The last thing you want to do in this situation is: (1) produce each plot one-by-one, (2) right click on each singly-produced plot to save, (3) give the plot a unique name, and (4) repeat. You'll spend too much of your time saving plots and not enough time thinking about whether they are the right plots. Just imagine, what if something went wrong and you need to produce the whole set of plots again?

RStudio has a nice feature in that it saves all of your plots in the plotting pane. It's no problem if you just produce the plot inside your dreaded loop in RStudio because it keeps all of your plots in the pane. Even with RStudio, if you produce the plots inside the loop, you still need to save each one individually. This isn't ideal. If you don't believe me, imagine that you have 1000 plots instead of 26. This manual-saving method becomes impractical quickly.

How then can you automatically save plots to a folder without spending too much time? That's today's task. I'll start by describing several building-block commands and then at the end I'll put together a loop that does it all (by way of contrived example).

The first command you need to know is jpeg() (Alternatively, bmp(), png() or tiff(), depending on your file-type preferences) paired with For our purposes, jpeg() takes a path argument that allows us to save (at the location of our choosing via the path) output to a plotting window. For example, this code will save the next plotting object to a jpeg file called myplot.jpg located at "C://R//SAVEHERE"

jpeg(file = "C://R//SAVEHERE//myplot.jpg")

After running the plotting object, you need to be sure to turn off the plotting device you created (with the jpeg() command). To save a scatter plot of the vectors x versus y to the location described above, run these three lines:

jpeg(file = "C://R//SAVEHERE//myplot.jpg")

This code is a good building block for automatically saving to a folder inside a loop, but we still need to know how to dynamically create file names at which to save our plots. Suppose we have a vector that gives us a list of identifiers called names. Presumably, these identifiers mean something in your setting. For our working example, names is a list of letters from A to Z (we are trying to produce a separate scatter plot for each letter from A to Z).

We can use the file.path() and paste() commands to help us out here. I have found paste() to be incredibly useful for other applications as well. For example,

paste("myplot_", i, ".jpg", sep="")

produces the character string myplot_50.jpg when i = 50 and myplot_51.jpg when i=51. If i changes for each iteration in the loop, this paste() command will create a unique file name at each iteration.

We can get fancier with our pasting. If we have a vector of names, we can extract the ith name from the names vector with the paste command:

paste("myplot_", names[i], ".jpg", sep="")

The file.path() function helps us in a different way. It is a special paste function that helps us construct file paths really easily. For example,

mypath=file.path("C:", "R", "SAVEHERE", filename)

returns "C://R//SAVEHERE//filename" and stores it in an object called mypath. Replace filename with the paste command and we have a way to automatically generate file names in a folder called SAVEHERE.

Now, you have the tools to understand my example code of how to automatically generate and save plots to a folder. Make sure the folder exists before saving, but subject to that constraint, this procedure may make your life easier.

R Bloggers

I recently found a great resource for R in the blogosphere, the R Bloggers Blog Aggregator. Basically, the site aggregates posts from a bunch of blogs about R (like this one!) into a giant feed of uses for R. If you are interested in learning more about R, the best way is to see how other people interact with it. What better way than to see what the R blogosphere has to say?

Monday, April 25, 2011

Merging Data Video Tutorial

Here's a video tutorial where I walk through some code that does what the previous post describes.

The FRED data is used extensively for macroeconomics. You might these data useful for joining in graph fights in the blogosphere.

Sunday, April 24, 2011

Merging Multiple Data Files into One Data Frame

We often encounter situations where we have data in multiple files, at different frequencies and on different subsets of observations, but we would like to match them to one another as completely and systematically as possible. In R, the merge() command is a great way to match two data frames together.

Just read the two data frames into R

mydata1 = read.csv(path1, header=T)
mydata2 = read.csv(path2, header=T)

Then, merge

myfulldata = merge(mydata1, mydata2)

As long as mydata1 and mydata2 have at least one common column with an identical name (that allows matching observations in mydata1 to observations in mydata2), this will work like a charm. It also takes three lines.

What if I have 20 files with data that I want to match observation-to-observation? Assuming they all have a common column that allows merging, I would still have to read 20 files in (20 lines of code) and merge() works two-by-two... so I could merge the 20 data frames together with 19 merge statements like this:

mytempdata = merge(mydata1, mydata2)
mytempdata = merge(mytempdata, mydata3)
mytempdata = merge(mytempdata, mydata20)

That's tedious. You may be looking for a simpler way. If you are, I wrote a function to solve your woes called multmerge().* Here's the code to define the function:

multmerge = function(mypath){
filenames=list.files(path=mypath, full.names=TRUE)
datalist = lapply(filenames, function(x){read.csv(file=x,header=T)})
Reduce(function(x,y) {merge(x,y)}, datalist)

After running the code to define the function, you are all set to use it. The function takes a path. This path should be the name of a folder that contains all of the files you would like to read and merge together and only those files you would like to merge. With this in mind, I have two tips:
  1. Before you use this function, my suggestion is to create a new folder in a short directory (for example, the path for this folder could be "C://R//mergeme") and save all of the files you would like to merge in that folder.
  2. In addition, make sure that the column that will do the matching is formatted the same way (and has the same name) in each of the files.
Suppose you saved your 20 files into the mergeme folder at "C://R//mergeme" and you would like to read and merge them. To use my function, you use the following syntax:

mymergeddata = multmerge("C://R//mergeme")

After running this command, you have a fully merged data frame with all of your variables matched to each other. Of course, most of the details in matching and merging data come down to making sure that the common column is specified correctly, but given that, this function can save you a lot of typing.

*Maybe a function like this exists out there already, but I think it is entertaining to write helpful functions. I also trust the functions more if I write them myself and test them on my own problems.

Saturday, April 23, 2011

Dates in R and the First Day of the Month

I spent some time this morning learning about how R thinks about dates in R. I found this website to be a useful guide.

Imagine that your data are dates in a standard format [YEAR-MONTH-DAY (as in 2011-23-04) is one such format] and you want a vector of the dates that are the first day in the month of your data set. You might, for example, be interested in this (as an intermediate calculation) because you wish to match some S&P 500 returns data to monthly data. Frustratingly, the S&P 500 returns data do not report on the weekends or on holidays when the exchange is closed.

Here's a function that will do return a vector of first-days-of-the-month if you give it a vector of dates. Maybe you will find this useful.


Thursday, April 21, 2011

Good Looking Maps in R

I haven't yet tried this, but it looks like a nice package.

In one recent project I needed to draw several maps and visualize different kinds of geographical data on it. I found the combination of R/ggplot/maps package extremely flexible and powerful, and produce nice looking map based visualizations.

Here is a short tutorial, monospace font indicates the code you need to run in R. You probably need some basic understanding of R to work through this tutorial.


Tuesday, April 19, 2011

Common Data Creation Commands

Here is a video tutorial where I go through some of the most commonly used commands in creating and manipulating data. As soon as I want to do more than just running a single regression, I use these commands more than any other set of commands (in some of the other videos, you may have seen these).

Here is the code I use in the video if you would like to try out these commands for yourself.

Monday, April 18, 2011

A Population Regression

Here's a video on some of the theory behind simple linear regression.

There's no R involved with this video, but the video provides some theory behind what it is that R's lm() command estimates.

Sunday, April 17, 2011

A Creative Use of R

Update (5/18/2011): Looks like Freakonomics approves as well. Let the record show that I approved first :)

I approve: "I use the open-source program R to create the patterns."

But, I'm not sure I approve of calling these distributions "evil."

In case you were wondering, here are the commands to create the density plots for these "evil" distributions.

Poisson: dpois
Weibull: dweibull
Gumbel: dgumbel
Cauchy: dcauchy
Erlang: dgamma. It's a special case of the Gamma distribution. A special case of Erlang is the Chi-squared distribution.

Thursday, April 14, 2011

Computing Statistical Power

Today's task is to compute a statistical power function in R. For a good definition of power, go here. Thinking about the power of a testing procedure is an often-overlooked aspect of hypothesis testing.

I also explain what power (in a statistical sense) is in the video, as well as use my shadenorm() function to show what power looks like on the classical hypothesis testing graph. Along the way, I demonstrate how writing a function can make your life easier. Here's the video:

Here is the script file I walk through in the video (copy into notepad for a better view):

My script file uses a new version of my shadenorm() command that is a little easier to use (only change: now, there are no justbelow or justabove arguments. If you just specify below, that's justbelow=TRUE...).

This is available here:

Tuesday, April 12, 2011

Video Tutorial on Robust Standard Errors

Update: I have included a modified version of this summaryR() command as part of my package tonymisc, which extends mtable() to report robust standard errors. The tonymisc package is available on CRAN through the install.packages() command.

If you have the right R commands at your disposal, it is simple to correct for heteroskedasticity using the robust correction that is commonly-used among economists. I recorded a video tutorial to describe the simplest (and most flexible) way I know to get R to compute robust standard errors.

The key is to use a "summary-style" command that has an option to correct for heteroskedasticity. The command I like to use is called summaryR(). Here is the script file with the summaryR() command.

I found this function on an R-help discussion board where several people were answering someone's question about extending the summary.lm() command.

I deserve none of the credit for writing this (credit goes to John Fox), but I consider it my duty to point out how nice this function is. I demonstrate how to use the function in this video

Here are is the script file I used in the video:

Here's a link to the data set.

Sunday, April 10, 2011

Video Tutorial on IV Regression

Update: [1 May 2011] I am working on a better augmentation of the current IV regression functions (specifically ivreg() in AER) in R. I will post a link here to my new method/function for IV regression when I finish debugging the code.

Update 2: [15 May 2011] The function I wrote here does not work with factors due to the way in which I constructed fitted values for the standard error correction in the second stage.

I recorded a new video tutorial on how to conduct an IV regression. This time I used my new ivregress() command, which has much better syntax than my old one. There are several reasons I needed to do this:

  1. Relative to my previous post on IV regression, I have added the ability to conduct overidentifying tests. Now, my ivregress() command has all of the basic functionality of Stata's 2sls option to ivregress.
  2. Relative to my previous video tutorial on IV regression, this video uses my newer command, which as much better syntax. As such, I will use this video tutorial to replace the old one.

Here is the R script file I used to define my ivregress() command and its companion summary command sum.iv().

To get ivregress() and sum.iv() to work, (1) copy the text in the above text box into an R script, (2) run all (or just download and run the R script) and (3) save into your default workspace if you want to use the commands into the future. Then, you're good to go. See my video demonstrating how to use the command here:

Here is the code I used to produce the output in the video.

Finally, here is a link to the synthetic "market share" data I used in the video.

I hope you find this command helpful.

Saturday, April 9, 2011

How did I make this plot?

To make this plot, I used R's plot(), points() and lines() commands. If you have been wanting to learn how to plot in R, watch it unfold in this video tutorial:

Also, here's the code I used:

Sunday, April 3, 2011

How to Shade Under a Normal Density in R

The easiest-to-find method for shading under a normal density is to use the polygon() command. That link is to the first hit on Google for "Shading Under a Normal Curve in R." It works (like a charm), but it is not the most intuitive way to let users produce plots of normal densities.

In response to the standard polygon() approach, I wrote a function called shadenorm() that will produce a plot of a normal density with whatever area you want shaded. The code is available here (in a newer post; I updated the function slightly).

Copy into an R script and run all of it if you want to use the shadenorm() command. To show you how to use it, I also recorded a video tutorial in R to demonstrate how to use the shadenorm() command.

Here is the code I use in the video:

There are a lot of applications where you may want to produce a plot with a normal density with some region shaded. I hope you find this function useful.