Friday, April 29, 2011

Forming Formulas

One of the first functions a new R user learns is lm(), which involves stating the model formula.

lm(y~x1+x2, data=mydata)

After a while, this just becomes a natural way to say "I want a regression of y on x1 and x2 using mydata." Even though it is natural, the underlying structure of the formula is not as it first appears. This becomes apparent as soon as you need to work with formula objects when writing your own function.

Does ~ define a string? No. You can ask R directly whether the formula is a string (of class "character") using is.character().

is.character(y~x1+x2+x3)
[1] FALSE

As we will soon see, a formula isn't easily coerced into a string either.

Should we think of ~ as a fancy = sign? No. R doesn't treat ~ as defining an equation to be evaluated. Even after assigning values to x1, x2 and x3, entering the formula just echoes it back:

x1=5
x2=6
x3=7
y~x1+x2+x3
y ~ x1 + x2 + x3

It is better to think of ~ as defining the entire statement as an object with a special class called formula, which conveys the idea of the model to other functions in R.
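You can confirm the class directly:

class(y~x1+x2+x3)
[1] "formula"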

For the rest of the post, I will show some calculations that illustrate more fundamentally what a formula object is. For concreteness, imagine you have a formula object with the general form y~x1+x2+z1+z2+z3.

You could define this object with the command:

form.ex = y~x1+x2+z1+z2+z3

Suppose our goal is to extract the first two explanatory variables (x1 and x2) and use them in a linear model for the response (y). If you're reading this for help, you likely have a more noble goal in mind. Take this as an illustrative example.

To appreciate the power of a method, it is useful to see the inelegant code we are avoiding. The inelegant code also shines some light on how a formula object differs from a string.

An inelegant method

We could gain access to the elements of the formula by coercing it into a character string using as.character(). If we do this in our example, we obtain a character vector of length three:

as.character(form.ex)
[1] "~"                       "y"                       "x1 + x2 + z1 + z2 + z3"

Here, we see that the RHS of the formula is stored in the third element of this vector. The response is in the second element of the vector. Let's store these into vectors we can easily reference and use.

rhs= as.character(form.ex)[3]
resp = as.character(form.ex)[2]

Next, we can split the rhs string into the individual variable names using the strsplit() command. Note: we want to split the string on the delimiter " + " with the spaces. Because + is a special character in regular expressions, we need to escape it and use " \\+ " instead. In addition, strsplit() returns a list (one element per input string), so we really want the first element of the list... (remember, this is the inelegant way).

rhs.sp = strsplit(rhs," \\+ ")[[1]]
rhs.sp
[1] "x1" "x2" "z1" "z2" "z3"

Now, we have a vector of variable names for the right hand side of the equation. As our goal is to construct a formula object relating the response to the first two explanatory variables, extract the first two elements of this vector.

rhs.new= rhs.sp[1:2]
rhs.new
[1] "x1" "x2"

And, paste this right hand side of the equation to the response variable name in a formula style using the following combination of paste() commands.

short.form=paste(resp,"~",paste(rhs.new,collapse="+"))
short.form
[1] "y ~ x1+x2"

Great. The only problem? The object short.form is not a formula object. It is a character string. This won't be a problem if we input it into something like lm(), which will coerce it to a formula object before using it.

lm(short.form, data=mkt.df)

Call:
lm(formula = short.form, data = mkt.df)

Coefficients:
(Intercept)           x1           x2
    -25.329        2.335        4.239

If you want to do the formula coercion yourself, you can use the as.formula() command as follows.

short.form=as.formula(short.form)
short.form
y ~ x1 + x2

A more elegant method

The all.vars() command does the splitting, splicing and extracting that comprised the bulk of what we did above. Starting with the formula object form.ex, we could create a vector of variable names in our model statement as:

vars.ex = all.vars(form.ex)
vars.ex
[1] "y" "x1" "x2" "z1" "z2" "z3"

Now, just define the rhs and resp vectors as we did in the less elegant method.

rhs = vars.ex[2:3]
resp = vars.ex[1]

Is there a better way to recombine these vectors than to create our own paste collapser? Yes! The reformulate() function makes it easy.

myform = reformulate(rhs, response=resp)
myform
y ~ x1 + x2

And, for good measure, let's see what happens in the lm() command:

lm(myform, data=mkt.df)

Call:
lm(formula = myform, data = mkt.df)

Coefficients:
(Intercept)           x1           x2
    -25.329        2.335        4.239
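For good measure, the elegant pieces combine naturally into a small helper function. This wrapper (the name shorten.formula and its keep argument are my own illustrative choices, not anything built into R) keeps whichever explanatory variables you ask for:

shorten.formula = function(form, keep = 1:2){
  vars = all.vars(form)                        ## response first, then explanatory variables
  reformulate(vars[-1][keep], response = vars[1])
}
shorten.formula(y~x1+x2+z1+z2+z3)
y ~ x1 + x2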

Tuesday, April 26, 2011

Automatically Save Your Plots to a Folder

Suppose you're working on a problem that involves a loop for calculations. At each iteration inside the loop, you want to construct a plot. Not only do you want to see the plot, but you would like to save each plot for a presentation, report or paper. Furthermore, the loop goes on for a while (say, through the 26 letters of the alphabet).

The last thing you want to do in this situation is: (1) produce each plot one-by-one, (2) right click on each singly-produced plot to save, (3) give the plot a unique name, and (4) repeat. You'll spend too much of your time saving plots and not enough time thinking about whether they are the right plots. Just imagine, what if something went wrong and you need to produce the whole set of plots again?

RStudio has a nice feature in that it keeps all of your plots in the plotting pane, so producing plots inside the dreaded loop is no problem. But even with RStudio, you still need to save each plot individually. This isn't ideal. If you don't believe me, imagine that you have 1000 plots instead of 26. The manual-saving method becomes impractical quickly.

How then can you automatically save plots to a folder without spending too much time? That's today's task. I'll start by describing several building-block commands and then at the end I'll put together a loop that does it all (by way of contrived example).

The first command you need to know is jpeg() (alternatively, bmp(), png() or tiff(), depending on your file-type preferences), paired with dev.off(). For our purposes, jpeg() takes a file argument that directs subsequent plotting output to a jpeg file at the location of our choosing. For example, this code will send the next plotting object to a jpeg file called myplot.jpg located at "C://R//SAVEHERE":

jpeg(file = "C://R//SAVEHERE//myplot.jpg")

After producing the plot, you need to be sure to turn off the plotting device you created with the jpeg() command by calling dev.off(). To save a scatter plot of the vectors x versus y to the location described above, run these three lines:

jpeg(file = "C://R//SAVEHERE//myplot.jpg")
plot(x,y)
dev.off()


This code is a good building block for automatically saving to a folder inside a loop, but we still need to know how to dynamically create file names at which to save our plots. Suppose we have a vector that gives us a list of identifiers called names. Presumably, these identifiers mean something in your setting. For our working example, names is a list of letters from A to Z (we are trying to produce a separate scatter plot for each letter from A to Z).

We can use the file.path() and paste() commands to help us out here. I have found paste() to be incredibly useful for other applications as well. For example,

paste("myplot_", i, ".jpg", sep="")

produces the character string myplot_50.jpg when i = 50 and myplot_51.jpg when i=51. If i changes for each iteration in the loop, this paste() command will create a unique file name at each iteration.

We can get fancier with our pasting. If we have a vector of names, we can extract the ith name from the names vector with the paste command:

paste("myplot_", names[i], ".jpg", sep="")


The file.path() function helps us in a different way. It is a special paste function that helps us construct file paths really easily. For example,

mypath=file.path("C:", "R", "SAVEHERE", filename)

returns "C://R//SAVEHERE//filename" and stores it in an object called mypath. Replace filename with the paste command and we have a way to automatically generate file names in a folder called SAVEHERE.

Now, you have the tools to understand my example code of how to automatically generate and save plots to a folder. Make sure the folder exists before saving, but subject to that constraint, this procedure may make your life easier.

## ------------------------- ##
##  An Example of            ##
##  Automating Plot Output   ##
## ------------------------- ##
names = LETTERS[1:26]   ## A sequence of the letters of the alphabet
beta1 = rnorm(26, 5, 2) ## A vector of slopes (one for each letter)
beta0 = 10              ## A common intercept
for(i in 1:26){
  x = rnorm(500, 105, 10)
  y = beta0 + beta1[i]*x + 15*rnorm(500)
  mypath <- file.path("C:", "R", "SAVEHERE", paste("myplot_", names[i], ".jpg", sep = ""))
  jpeg(file = mypath)   ## Note: jpeg(), not jpg()
  mytitle = paste("my title is", names[i])
  plot(x, y, main = mytitle)
  dev.off()
}

R Bloggers

I recently found a great resource for R in the blogosphere, the R Bloggers Blog Aggregator. Basically, the site aggregates posts from a bunch of blogs about R (like this one!) into a giant feed of uses for R. If you are interested in learning more about R, the best way is to see how other people interact with it. What better way than to see what the R blogosphere has to say?

Monday, April 25, 2011

Merging Data Video Tutorial

Here's a video tutorial where I walk through some code that does what the previous post describes.



The FRED data are used extensively in macroeconomics. You might find these data useful for joining in graph fights in the blogosphere.

Sunday, April 24, 2011

Merging Multiple Data Files into One Data Frame

We often encounter situations where we have data in multiple files, at different frequencies and on different subsets of observations, but we would like to match them to one another as completely and systematically as possible. In R, the merge() command is a great way to match two data frames together.

Just read the two data frames into R


mydata1 = read.csv(path1, header=T)
mydata2 = read.csv(path2, header=T)


Then, merge

myfulldata = merge(mydata1, mydata2)

As long as mydata1 and mydata2 have at least one common column with an identical name (that allows matching observations in mydata1 to observations in mydata2), this will work like a charm. It also takes three lines.
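By default, merge() matches on all columns the two data frames have in common and keeps only the rows that match in both. If you need more control, merge() takes a few useful arguments; the column name "id" below is just a hypothetical example:

merge(mydata1, mydata2, by = "id")              ## match on one specific column
merge(mydata1, mydata2, by = "id", all = TRUE)  ## also keep non-matching rows (filled with NA)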

What if I have 20 files with data that I want to match observation-to-observation? Assuming they all have a common column that allows merging, I would still have to read 20 files in (20 lines of code) and merge() works two-by-two... so I could merge the 20 data frames together with 19 merge statements like this:

mytempdata = merge(mydata1, mydata2)
mytempdata = merge(mytempdata, mydata3)
.
.
.
mytempdata = merge(mytempdata, mydata20)


That's tedious. You may be looking for a simpler way. If you are, I wrote a function to solve your woes called multmerge().* Here's the code to define the function:

multmerge = function(mypath){
  filenames = list.files(path=mypath, full.names=TRUE)
  datalist = lapply(filenames, function(x){read.csv(file=x, header=TRUE)})
  Reduce(function(x,y) {merge(x,y)}, datalist)
}


After running the code to define the function, you are all set to use it. The function takes a path. This path should be the name of a folder that contains all of the files you would like to read and merge together and only those files you would like to merge. With this in mind, I have two tips:
  1. Before you use this function, my suggestion is to create a new folder in a short directory (for example, the path for this folder could be "C://R//mergeme") and save all of the files you would like to merge in that folder.
  2. In addition, make sure that the column that will do the matching is formatted the same way (and has the same name) in each of the files.
Suppose you saved your 20 files into the mergeme folder at "C://R//mergeme" and you would like to read and merge them. To use my function, you use the following syntax:

mymergeddata = multmerge("C://R//mergeme")

After running this command, you have a fully merged data frame with all of your variables matched to each other. Of course, most of the details in matching and merging data come down to making sure that the common column is specified correctly, but given that, this function can save you a lot of typing.
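One wrinkle to be aware of: merge() (and therefore multmerge()) matches on column names, so if the key column is named differently in two files you would need merge()'s by.x and by.y arguments in a two-by-two merge. The names "id" and "ID" here are hypothetical:

merge(mydata1, mydata2, by.x = "id", by.y = "ID")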

*Maybe a function like this exists out there already, but I think it is entertaining to write helpful functions. I also trust the functions more if I write them myself and test them on my own problems.

Saturday, April 23, 2011

Dates in R and the First Day of the Month

I spent some time this morning learning about how R thinks about dates. I found this website to be a useful guide.

Imagine that your data are dates in a standard format [YEAR-MONTH-DAY (as in 2011-04-23) is one such format] and you want a vector of the dates that are the first day in the month of your data set. You might, for example, be interested in this (as an intermediate calculation) because you wish to match some S&P 500 returns data to monthly data. Frustratingly, the S&P 500 returns data are not reported on weekends or on holidays when the exchange is closed.

Here's a function that will return a vector of first-days-of-the-month if you give it a vector of dates. Maybe you will find this useful.

## --------------------------- ##
##  Function takes a vector of ##
##  dates as its input.        ##
##                             ##
##  It produces a vector of    ##
##  dates that are the first   ##
##  in their respective months ##
## --------------------------- ##
firstDayMonth = function(x){
  x = as.Date(as.character(x))
  day = format(x, format="%d")
  monthYr = format(x, format="%Y-%m")
  y = tapply(day, monthYr, min)
  first = as.Date(paste(row.names(y), y, sep="-"))
  as.factor(first)
}
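To illustrate what the function does, here is a quick run on a made-up vector of dates (expected output shown as comments; note that the function returns a factor):

dates = c("2011-04-01", "2011-04-15", "2011-05-02", "2011-05-20")
firstDayMonth(dates)
## [1] 2011-04-01 2011-05-02
## Levels: 2011-04-01 2011-05-02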


Enjoy!

Thursday, April 21, 2011

Good Looking Maps in R

I haven't yet tried this, but it looks like a nice package.

In one recent project I needed to draw several maps and visualize different kinds of geographical data on them. I found the combination of R, ggplot and the maps package extremely flexible and powerful, and it produces nice-looking, map-based visualizations.

Here is a short tutorial; monospace font indicates the code you need to run in R. You probably need some basic understanding of R to work through this tutorial.

Enjoy!

Tuesday, April 19, 2011

Common Data Creation Commands

Here is a video tutorial where I go through some of the most commonly used commands for creating and manipulating data. As soon as I want to do more than just run a single regression, I use these commands more than any other set of commands (you may have seen them in some of the other videos).



Here is the code I use in the video if you would like to try out these commands for yourself.

## ------------------------ ##
## Data Creation in R ##
## Tutorial by Tony Cookson ##
## ------------------------ ##
## Simplest way to create a vector is to use the colon.
a = 1:4
b = 4:1
a
b
## But, they're just sequences
## c() can be useful for creating more interesting
## vectors
y = c(1,2,3,6) ## basic usage
y2 = c(a,b) ## binds vectors together into one vector
## More interesting... can create a vector
## recursively
weight = NULL ## It can be useful to initialize a null vector to define the vector recursively
for(i in 1:14){
  temp = rnorm(10, mean = i*5, sd = 2)
  weight = c(weight, mean(temp))
}
## seq() is a fancy way to create a vector
seq(1, 4) ## Same as 1:4
seq(0.5, 4) ## Default Step is 1
seq(1, 4, by = 0.5) ## Can Go in different number of steps
seq(1, 4, by = 0.4) ## Doesn't have to partition the interval (1,4)
## It can be useful to repeat a short sequence many times
## The rep() command is useful for this.
x1 = c(1,2,3,1,2,3,1,2,3,1,2,3) ## 1:3 repeated 4 times
rep(1:3, times = 4) ## More efficient
x2 = c(1,1,1,1, 2,2,2,2, 3,3,3,3) ## 1:3 each element repeated 4 times
c(rep(1,4), rep(2,4), rep(3,4)) ## Can do it this way
rep(1:3, each = 4) ## Or use the "each" option (Most Efficient)
rep(1:3, each = 4, times = 2)
## Can be useful to organize into a matrix
y = rnorm(12, 8, 2) ## Creates a length 12 vector with a mean of 8 and sd = 2.
## I could also use c() then give the matrix dimensions
x.vec = c(y, x1, x2)
Xalt = matrix(x.vec, ncol = 3) ## coerces vectors into matrices
Xalt2 = matrix(x.vec, nrow= 12) ## Equivalent
Xmess = matrix(x.vec, ncol= 3, byrow=TRUE) ## Can reverse the way the matrix is filled
Xmess = matrix(x.vec, ncol= 12, byrow=TRUE) ## using byrow
## If I want x1, x2, and y in a matrix,
## I can use cbind() and rbind()
X = cbind(y, x1, x2)
Xrow = rbind(y, x1, x2)
## Maybe I want to use the names, too.
## Then the data.frame() command is useful
Xdf = data.frame(y, x1, x2)
Xdf ## The matrix
Xdf$y ## the vector named y...
## nice to have variable names

Monday, April 18, 2011

A Population Regression

Here's a video on some of the theory behind simple linear regression.



There's no R involved with this video, but the video provides some theory behind what it is that R's lm() command estimates.

Sunday, April 17, 2011

A Creative Use of R

Update (5/18/2011): Looks like Freakonomics approves as well. Let the record show that I approved first :)

I approve: "I use the open-source program R to create the patterns."



But, I'm not sure I approve of calling these distributions "evil."

In case you were wondering, here are the commands to create the density plots for these "evil" distributions.

Poisson: dpois
Weibull: dweibull
Gumbel: dgumbel (not in base R; it comes from add-on packages such as evd)
Cauchy: dcauchy
Erlang: dgamma. The Erlang distribution is a special case of the Gamma distribution (integer shape parameter), and the Chi-squared distribution with an even number of degrees of freedom is a special case of the Erlang.
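If you want to try one of these yourself, a density curve takes only two lines in base R (the grid endpoints here are arbitrary choices):

x = seq(-5, 5, length = 500)
plot(x, dcauchy(x), type = "l", main = "Cauchy Density")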

Thursday, April 14, 2011

Computing Statistical Power

Today's task is to compute a statistical power function in R. For a good definition of power, go here. Thinking about the power of a testing procedure is an often-overlooked aspect of hypothesis testing.

I also explain what power (in a statistical sense) is in the video, as well as use my shadenorm() function to show what power looks like on the classical hypothesis testing graph. Along the way, I demonstrate how writing a function can make your life easier. Here's the video:



Here is the script file I walk through in the video (copy into notepad for a better view):

## ----------------------- ##
## Computing and Graphing ##
## Statistical Power ##
## ----------------------- ##
## Parameters of the Problem ##
mu0 = 6 ## Null-hypothesized value
alpha = 0.05 ## Significance Level
s.sq = 16 ## Variance from pilot study
n1 = 64 ## Sample Size
alt = 6.5
se = sqrt(s.sq/n1)
## -------------------------- ##
## What is Statistical Power? ##
## -------------------------- ##
cuts = c(1-alpha/2, alpha/2)
crits = qnorm(cuts, mu0, sqrt(s.sq/n1)) ## Critical Values Based on Null
shadenorm(mu=mu0, sig = se, outside=crits) ## My own function
shadenorm(mu = 6.5, sig = se, lines=TRUE, outside=crits, col="blue")
## ------------------------------ ##
## Write a Function to Compute it ##
## ------------------------------ ##
power = function(theta, mu, var, n, alpha=0.05){
  crit.l = qnorm(alpha/2, mu, sqrt(var/n))   ## Lower critical value based on null
  crit.h = qnorm(1-alpha/2, mu, sqrt(var/n)) ## Upper critical value based on null
  pr.high = pnorm(crit.h, theta, sd = sqrt(var/n), lower.tail=FALSE) ## Prob reject high
  pr.low = pnorm(crit.l, theta, sd = sqrt(var/n))                    ## Prob reject low
  pow = pr.low + pr.high
  pow
}
power(theta=6.5, mu=6, var=16, n=64)
thet = seq(3,9, by=0.01)
pow = power(thet, mu0, s.sq, n1)
plot(thet, pow, type="l", ylim=c(0,1), main="My Power Plot")
n2 = 256 ## New Larger Sample Size
pow2 = power(thet, mu0, s.sq,n2)
lines(thet, pow2, lty="dashed", col="blue")
abline(h=0.05, lty="dotted")


My script file uses a new version of my shadenorm() command that is a little easier to use. The only change: there are no longer justbelow or justabove arguments; if you specify only below (or only above), just that tail is shaded.
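In other words, with the new version:

shadenorm(below = -1.5)              ## now shades just the lower tail
shadenorm(below = -1.5, above = 1.5) ## shades both tails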

This is available here:

## -------------------------- ##
##  An R Function that Shades ##
##  under a normal density.   ##
##                            ##
##  This is a convenience     ##
##  function for polygon()    ##
## -------------------------- ##
shadenorm = function(below=NULL, above=NULL, pcts=c(0.025,0.975), mu=0, sig=1,
                     numpts=500, color="gray", dens=40,
                     lines=FALSE, between=NULL, outside=NULL){
  if(is.null(between)){
    bothnull = is.null(below) & is.null(above)
    if(bothnull==TRUE){
      below = ifelse(is.null(below), qnorm(pcts[1],mu,sig), below)
      above = ifelse(is.null(above), qnorm(pcts[2],mu,sig), above)
    }
  }
  if(is.null(outside)==FALSE){
    if(is.numeric(outside)==FALSE){if(outside==TRUE){outside=qnorm(pcts,mu,sig)}}
    below = min(outside)
    above = max(outside)
  }
  lowlim = mu - 4*sig
  uplim = mu + 4*sig
  x.grid = seq(lowlim, uplim, length=numpts)
  dens.all = dnorm(x.grid, mean=mu, sd=sig)
  if(lines==FALSE){
    plot(x.grid, dens.all, type="l", xlab="X", ylab="Density")
  }
  if(lines==TRUE){
    lines(x.grid, dens.all)
  }
  if(is.null(below)==FALSE){
    x.below = x.grid[x.grid<below]
    dens.below = dens.all[x.grid<below]
    polygon(c(x.below,rev(x.below)), c(rep(0,length(x.below)),rev(dens.below)), col=color, density=dens)
  }
  if(is.null(above)==FALSE){
    x.above = x.grid[x.grid>above]
    dens.above = dens.all[x.grid>above]
    polygon(c(x.above,rev(x.above)), c(rep(0,length(x.above)),rev(dens.above)), col=color, density=dens)
  }
  if(is.null(between)==FALSE){
    if(is.numeric(between)==FALSE){if(between==TRUE){between=qnorm(pcts,mu,sig)}}
    from = min(between)
    to = max(between)
    x.between = x.grid[x.grid>from & x.grid<to]
    dens.between = dens.all[x.grid>from & x.grid<to]
    polygon(c(x.between,rev(x.between)), c(rep(0,length(x.between)),rev(dens.between)), col=color, density=dens)
  }
}

Tuesday, April 12, 2011

Video Tutorial on Robust Standard Errors

Update: I have included a modified version of this summaryR() command as part of my package tonymisc, which extends mtable() to report robust standard errors. The tonymisc package is available on CRAN through the install.packages() command.

If you have the right R commands at your disposal, it is simple to correct for heteroskedasticity using the robust correction that is commonly used among economists. I recorded a video tutorial describing the simplest (and most flexible) way I know to get R to compute robust standard errors.
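As an aside that isn't covered in the video: the sandwich and lmtest packages offer another common route to the same correction. Here is a minimal sketch, assuming model is a fitted lm object:

library(sandwich)
library(lmtest)
## Robust coefficient table using the HC3 correction:
coeftest(model, vcov = vcovHC(model, type = "HC3"))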

The key is to use a "summary-style" command that has an option to correct for heteroskedasticity. The command I like to use is called summaryR(). Here is the script file with the summaryR() command.

## ---------------------------------------------------------------------------------------- ##
## Author: John Fox                                                                          ##
## Source: http://r.789695.n4.nabble.com/R-extend-summary-lm-for-hccm-td815004.html          ##
## Adapted by Tony Cookson.                                                                  ##
## -- Only Change Made: Changed the name of the function (unwisely maybe)                    ##
##    to summaryR from summaryHCCM.lm. I also changed the spelling of consistent             ##
## ---------------------------------------------------------------------------------------- ##
summaryR <- function(model, type=c("hc3", "hc0", "hc1", "hc2", "hc4"), ...){
  ## Defined as a plain function (not an S3 method) so that summaryR(educ.lm) works directly
  if (!require(car)) stop("Required car package is missing.")
  type <- match.arg(type)
  V <- hccm(model, type=type)
  sumry <- summary(model)
  table <- coef(sumry)
  table[,2] <- sqrt(diag(V))
  table[,3] <- table[,1]/table[,2]
  table[,4] <- 2*pt(abs(table[,3]), df.residual(model), lower.tail=FALSE)
  sumry$coefficients <- table
  p <- nrow(table)
  hyp <- cbind(0, diag(p - 1))
  sumry$fstatistic[1] <- linearHypothesis(model, hyp, white.adjust=type)[2,"F"]
  print(sumry)
  cat("Note: Heteroscedasticity-consistent standard errors using adjustment", type, "\n")
}


I found this function on an R-help discussion board where several people were answering someone's question about extending the summary.lm() command.

I deserve none of the credit for writing this (credit goes to John Fox), but I consider it my duty to point out how nice this function is. I demonstrate how to use the function in this video:



Here is the script file I used in the video:

## -------------------------------------- ##
## R Tutorial on Robust Standard Errors ##
## Author: Tony Cookson ##
## -------------------------------------- ##
## Read Data
library(foreign)
ed = read.dta("C://R//edudat2.dta")
## Obtain regression object
educ.lm = lm(log(wage)~gender + educ + pareduc, data=ed)
## Summarize Output Assuming Homoskedasticity
summary(educ.lm)
## Need to load the car library
library(car)
## If you don't have car installed you'll need to install it
## Use this command:
## install.packages("car", dependencies = TRUE)
## Then, run the library command library(car)
## Robust Standard errors
summaryR(educ.lm) ## Special Function written by John Fox
## Can Specify Type of Heteroskedasticity correction
summaryR(educ.lm, type="hc0") ## White SE
summaryR(educ.lm, type="hc1") ## Stata's Default
summaryR(educ.lm, type="hc2") ## Unbiased under homoskedasticity
summaryR(educ.lm, type="hc3") ## Default (conservative)
## Test using linearHypothesis
linearHypothesis(educ.lm, "educ+pareduc =1", white.adjust="hc3")
linearHypothesis(educ.lm, "educ+pareduc =1", white.adjust="hc0")


Here's a link to the data set.

Sunday, April 10, 2011

Video Tutorial on IV Regression

Update: [1 May 2011] I am working on a better augmentation of the current IV regression functions (specifically ivreg() in AER) in R. I will post a link here to my new method/function for IV regression when I finish debugging the code.

Update 2: [15 May 2011] The function I wrote here does not work with factors due to the way in which I constructed fitted values for the standard error correction in the second stage.

I recorded a new video tutorial on how to conduct an IV regression. This time I used my new ivregress() command, which has much better syntax than my old one. There are several reasons I needed to do this:

  1. Relative to my previous post on IV regression, I have added the ability to conduct overidentifying tests. Now, my ivregress() command has all of the basic functionality of Stata's 2sls option to ivregress.
  2. Relative to my previous video tutorial on IV regression, this video uses my newer command, which has much better syntax. As such, I will use this video tutorial to replace the old one.

Here is the R script file I used to define my ivregress() command and its companion summary command sum.iv().

## ----------------------------------- ##
##  A homegrown IV regression command  ##
##  that handles the case where there  ##
##  is one endogenous variable and     ##
##  many (or one) instruments.         ##
##                                     ##
##  Disadvantage: The command doesn't  ##
##  handle multiple endogenous vars    ##
##                                     ##
##  Advantage: testing for relevance,  ##
##  displaying the first stage and     ##
##  conducting overidentification      ##
##  tests are natural and easy.        ##
## ----------------------------------- ##
ivregress = function(second, first, data){
  s.names = all.vars(second)
  f.names = all.vars(first)
  data.names = names(data)
  N = length(data[,1])
  all.names = c(s.names, f.names)
  resp = s.names[1]
  endog = f.names[1]
  inst = f.names[-1]
  explan = s.names[-1]
  exog = explan[explan != endog]
  exog.f = paste(exog, collapse="+")
  inst.f = paste(inst, collapse="+")
  RHS = paste(exog.f, inst.f, sep="+")
  first.form = as.formula(paste(endog, "~", RHS))
  first.lm = lm(first.form, data)
  ftest = linearHypothesis(first.lm, inst, rep(0, length(inst)))
  x.hat = fitted(first.lm)
  data2 = cbind(data, x.hat)
  iname = paste(endog, ".hat", sep="")
  names(data2) = c(names(data), iname)
  data2.names = names(data2)
  RHS2 = paste(exog.f, iname, sep="+")
  second.form = as.formula(paste(resp, "~", RHS2))
  second.lm = lm(second.form, data2)
  Xmat = data2[c(exog, endog)]
  Xmat2 = data2[c(exog, iname)]
  z = summary(second.lm)
  X = as.matrix(cbind(1, Xmat))
  X2 = as.matrix(cbind(1, Xmat2))
  Y = data[resp]
  fit = X %*% second.lm$coefficients
  res = Y - fit
  ## Tests for overidentifying restrictions ##
  data3 = cbind(data2, res)
  names(data3) = c(names(data2), "res")
  ## J test
  J.test = as.data.frame(matrix(c(1,2), nrow=1))
  names(J.test) = c("J.stat", "P[J > J.stat ]")
  ## Sargan's Test
  S.test = as.data.frame(matrix(c(1,2), nrow=1))
  names(S.test) = c("S.stat", "P[S > S.stat ]")
  ## The statistics are computed only when the model is overidentified
  ## (more than one instrument); with exactly one instrument these
  ## tests do not apply.
  if(length(inst) > 1){
    J.form = as.formula(paste("res", "~", RHS))
    J.lm = lm(J.form, data3)
    f.test = linearHypothesis(J.lm, inst, rep(0, length(inst)))
    J.stat = length(inst)*f.test$F[2]
    S.stat = N*summary(J.lm)$r.squared
    J.test[1,1] = J.stat
    J.test[1,2] = 1 - pchisq(J.stat, length(inst)-1)
    S.test[1,1] = S.stat
    S.test[1,2] = 1 - pchisq(S.stat, length(inst)-1)
  }
  xPx = t(X2) %*% X2
  xPx.inv = solve(xPx)
  z$cov.unscaled = xPx.inv
  z$residuals = res
  z$sigma = sqrt(mean(res^2))
  varcovmat = z$cov.unscaled * z$sigma
  coef = z$coefficients[,1]
  IV.SE = z$sigma * sqrt(diag(xPx.inv))
  t.iv = coef/IV.SE
  p.val = 2*(1 - pnorm(abs(t.iv)))
  z$coefficients[,2] = IV.SE
  z$coefficients[,3] = t.iv
  z$coefficients[,4] = p.val
  result = list(summary(first.lm), z, ftest, J.test, S.test)
  names(result) = c("first", "second", "ftest", "Jtest", "Sargan")
  print("IV object successfully created. Use sum.iv() on object")
  print("to learn about your 2SLS Regression")
  return(invisible(result))
}
sum.iv = function(reg.iv, first=FALSE, second=TRUE, ftest=FALSE, overid=FALSE){
  x = rep(0, 5)
  if(first==TRUE) x[1] = 1
  if(second==TRUE) x[2] = 2
  if(ftest==TRUE) x[3] = 3
  if(overid==TRUE) x[4:5] = 4:5
  print(reg.iv[x])
}


To get ivregress() and sum.iv() to work, (1) copy the text in the above text box into an R script, (2) run all of it (or just download and run the R script) and (3) save your workspace if you want to use the commands in the future. Then, you're good to go. See my video demonstrating how to use the command here:



Here is the code I used to produce the output in the video.

## ------------------------ ##
## An Easy IV regression ##
## regression command for R ##
## ##
## Author: Tony Cookson ##
## Commands: ivregress() ##
## sum.iv() ##
## ------------------------ ##
## Read Data ##
library(foreign)
mkt.df = read.dta("C://R//mktshare.dta")
## Run IV Regression ##
library(car) ## Uses linearHypothesis() from car()
## Define the IV object
mkt.iv = ivregress(Y~x1+x2+p, p~z1+z2, mkt.df)
## Options to summarize the IV object
sum.iv(mkt.iv) ## Basic Output
sum.iv(mkt.iv, first=T) ## Show first stage
##-- like Stata's ", first"
sum.iv(mkt.iv, ftest=T) ## F-test for relevance
##-- same as Stata's "estat first"
sum.iv(mkt.iv, overid=T) ## Test for overidentifying restrictions
##-- same as Stata's "estat overid"
sum.iv(mkt.iv, overid=T, first=T, ftest=T)


Finally, here is a link to the synthetic "market share" data I used in the video.

I hope you find this command helpful.

Saturday, April 9, 2011

How did I make this plot?


To make this plot, I used R's plot(), points() and lines() commands. If you have been wanting to learn how to plot in R, watch it unfold in this video tutorial:



Also, here's the code I used:

## ---------------------------- ##
## Plotting Points *and* Lines ##
## ---------------------------- ##
## Define the Vectors I used in the video ##
pow = c(0.95, 0.6, 0.3, 0.15, 0.1, 0.05, 0.1, 0.15, 0.3, 0.6, 0.95)
pow2 = c(0.99, 0.75, 0.4, 0.2, 0.15, 0.05, 0.15, 0.2, 0.4, 0.75, 0.99)
thet = c(7:17)
plot(thet,pow) ## Plots points
plot(thet,pow2) ## Plots points on *new* plot
## But we want them on the same plot...
## Try options in plot
## Useful options in plot()
plot(thet,pow, type="l", col="red") ## Lines
plot(thet,pow, type="o", col="blue") ## Overplotted lines and points
plot(thet,pow, type="s", col="orange") ## Stair steps
## Less useful options
plot(thet,pow, type="b") ## "Both" lines and points
plot(thet,pow, type="c") ## "lines" part of "both"
plot(thet,pow, type="h", xlim=c(0, 20),ylim=c(0,1)) ## Histogram-like
plot(thet,pow, type="n",xlim=c(0, 20),ylim=c(0,1)) ## No plotting... ?
## Note: xlim = c(low,high) sets low to be the lower limit for the x-axis and high to be the upper limit.
## ylim = c(low,high) does the same for the y-axis. These can be useful for
## setting up an empty plot to which you then add points and lines.
## Points and Lines to put on top of the active plot
plot(thet,pow, type="n",xlim=c(0, 20),ylim=c(0,1)) ## No plotting... ?
points(thet,pow2, col="blue")
lines(thet,pow2, col="green")
lines(thet,pow, col="red", type="o") ## plot() options work for lines() and points()
title("My Power Plot") ## Sometimes it is nice to add a title

Sunday, April 3, 2011

How to Shade Under a Normal Density in R

The easiest-to-find method for shading under a normal density is to use the polygon() command. That link is to the first hit on Google for "Shading Under a Normal Curve in R." It works (like a charm), but it is not the most intuitive way to let users produce plots of normal densities.
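For reference, the standard polygon() approach looks roughly like this; a minimal sketch that shades the lower 2.5% tail of a standard normal (the grid size and cutoff are arbitrary choices):

x.grid = seq(-4, 4, length = 500)
dens = dnorm(x.grid)
plot(x.grid, dens, type = "l")
shade = x.grid <= qnorm(0.025)   ## TRUE for points in the lower tail
polygon(c(x.grid[shade], rev(x.grid[shade])),
        c(rep(0, sum(shade)), rev(dens[shade])),
        col = "gray")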

In response to the standard polygon() approach, I wrote a function called shadenorm() that will produce a plot of a normal density with whatever area you want shaded. The code is available here (in a newer post; I updated the function slightly).

Copy it into an R script and run all of it if you want to use the shadenorm() command. I also recorded a video tutorial demonstrating how to use it.




Here is the code I use in the video:

## ---------------------- ##
## Shading under a Normal ##
## in R ##
## ---------------------- ##
shadenorm() ## Defaults mu = 0, sig = 1, pcts = c(0.025, 0.975)
## Below, Above, Outside and Between
shadenorm(below=-1.5) ## Sets lower endpoint at -1.5
shadenorm(below=-1.5,justbelow=TRUE) ## Just plots the lower tail
shadenorm(above=1.5) ## Sets upper endpoint at 1.5
shadenorm(above=1.5, justabove=TRUE) ## Similar to below
shadenorm(below=-1.5,above=1.5) ## Plots "outside" of -1.5 and 1.5
shadenorm(outside = c(2.5, -2.5)) ## Alternative "outside" plotter
shadenorm(outside = c(1, 2, -1, -0.2, 2.4)) ## works by taking min and max
shadenorm(between = c(2.5, -2.5)) ## Plots between specified points
shadenorm(between = c(-1.75, 0, 2, 0.5, -1)) ## Plots between specified points
## Can specify different means and standard deviations with mu and sig arguments
shadenorm(mu=2, sig=10)
shadenorm(mu=2, sig=10, between=c(-2, 14)) ## works with between and outside
shadenorm(mu=2, sig=10, outside=c(-5, 14))
## Can Apply different colors and densities... color and dens
shadenorm(mu=2, sig=10, outside=c(-3, 12), col="blue")
shadenorm(mu=2, sig=10, outside=c(-3, 12), col="blue",dens=15) ## Dens default is 40
shadenorm(mu=2, sig=10, outside=c(-3, 12), col="blue",dens=150)
## Can plot one and then another ontop of it using lines = TRUE
shadenorm(mu=2, sig=10, outside=c(-3, 12), dens=15)
shadenorm(mu=2, sig=15, between=c(-3, 12),lines=TRUE, col="blue",dens=15)
## Example: Plotting a Hypothesis Test for the mean
## Truth: mu.true = 5
## Hypothesis: mu.ho = 6
## Generate Data Under Truth
mu.true = 5 ## Alternative Mean
mu.ho = 6
sig = 8
N = 250 ## Sample Size
std.err = sig/sqrt(N)
crits = qnorm(c(0.025,0.975),mean=mu.ho, sd = std.err)
shadenorm(outside = crits, mu = mu.ho, sig = std.err,dens=15)
shadenorm(between = crits, mu = mu.true, sig = std.err, lines=TRUE, color="green",dens=15)


There are a lot of applications where you may want to produce a plot with a normal density with some region shaded. I hope you find this function useful.