Sunday, May 23, 2010

Two-stage least squares

Update: Since writing this post, I discovered a bunch of better ways to conduct 2SLS regression in R. These methods do not allow the user to look under the hood to see what is going on (so this post may be instructive anyway). If you're interested, ivreg() in AER, tsls() in sem, and systemfit() in systemfit are good options. My package tonymisc also has a command, iv(), that works when some or all of these fail. I'm working on making that function better. (5/18/2011)
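For instance, here is a sketch of the syntax for ivreg() in AER, applied to the car demand example later in this post (cars.df is a hypothetical data frame; in AER's two-part formula, the structural equation comes before the | and the full set of exogenous variables and instruments comes after it):

library(AER)
demand.iv = ivreg(quantity ~ price + horsepower + mpg |
                    horsepower + mpg + regulations + cost + competition,
                  data = cars.df)
summary(demand.iv)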

Here is some code I wrote that allows you to easily perform two-stage least squares regression using R.

ivreg = function(Y0, X0, Z10 = NULL, Z20, data.df){

  ## Extract the dependent variable and the endogenous regressor
  Y = data.df[[Y0]]
  X = data.df[[X0]]

  ## Build the matrix of included exogenous regressors (if any)
  Z1 = NULL
  if(length(Z10) > 0){
    Z1 = data.df[[Z10[1]]]
    if(length(Z10) > 1){
      for(i in 2:length(Z10)){
        Z1 = cbind(Z1, data.df[[Z10[i]]])
      }
    }
  }

  ## Build the matrix of excluded instruments
  Z2 = data.df[[Z20[1]]]
  if(length(Z20) > 1){
    for(i in 2:length(Z20)){
      Z2 = cbind(Z2, data.df[[Z20[i]]])
    }
  }

  ## First stage: regress X on all exogenous variables
  Z = cbind(1, Z1, Z2)

  zPz = t(Z) %*% Z
  zPx = t(Z) %*% X
  zPz.inv = solve(zPz)
  beta.f = zPz.inv %*% zPx

  xhat = Z %*% beta.f
  SSE.f = sum((X - xhat)^2)
  df.d = length(data.df[, 1]) - length(Z[1, ])
  MSE.f = SSE.f/df.d
  RMSE.f = sqrt(MSE.f)
  se.f = RMSE.f * sqrt(diag(zPz.inv))
  t.f = beta.f/se.f
  pval.f = 2 * pnorm(abs(t.f), lower.tail = FALSE)  ## normal approximation

  ## F-test on the instruments: compare the first stage against a
  ## regression of X on the included exogenous variables only
  Z.d = cbind(rep(1, length(data.df[, 1])), Z1)
  zdPzd = t(Z.d) %*% Z.d
  zdPx = t(Z.d) %*% X
  zdPzd.inv = solve(zdPzd)
  bhat.d = zdPzd.inv %*% zdPx
  xhat.d = Z.d %*% bhat.d

  SSE.d = sum((X - xhat.d)^2)
  SSEx.z = SSE.d - SSE.f
  MSEx = SSEx.z/length(Z20)
  F.stat = MSEx/MSE.f
  df.n = length(Z[1, ]) - length(Z.d[1, ])
  pval.Ftest = pf(F.stat, df.n, df.d, lower.tail = FALSE)

  ftest = round(c(F.stat, pval.Ftest, df.n), 4)
  names(ftest) = c("F-stat", "P-value", "# of instruments")
  rn.f = c("constant", Z10, Z20)

  first = data.frame(cbind(beta.f, se.f, t.f, pval.f), row.names = rn.f)
  names(first) = c("Estimate", "Std. Error", "T-stat", "P-value")

  ## Second stage: replace X with its first-stage fitted values
  X.mat = cbind(1, Z1, xhat)
  X.mat2 = cbind(1, Z1, X)

  xPx = t(X.mat) %*% X.mat
  xPx.inv = solve(xPx)
  xPy = t(X.mat) %*% Y
  beta2sls = xPx.inv %*% xPy

  ## 2SLS residuals use the actual X, not the fitted xhat
  resid = Y - X.mat2 %*% beta2sls
  MSE = mean(resid^2)
  RMSE = sqrt(MSE)
  se2sls = RMSE * sqrt(diag(xPx.inv))
  tstat = beta2sls/se2sls
  pval = 2 * pnorm(abs(tstat), lower.tail = FALSE)

  rn = c("constant", Z10, X0)

  second = data.frame(cbind(beta2sls, se2sls, tstat, pval), row.names = rn)
  names(second) = c("Estimate", "Std. Error", "T-stat", "P-value")

  result = list(first, second, ftest)
  names(result) = c("first", "second", "ftest")

  print("IV object successfully created. Use sum.iv() on the object to learn about your 2SLS regression.")
  return(invisible(result))
}

The above function is ivreg(), which takes five arguments: (1) Y0 is the name of the dependent variable, (2) X0 is the name of the endogenous regressor, (3) Z10 is the set of names of exogenous regressors [second-stage variables assumed to be exogenous], (4) Z20 is the set of names of instruments, and (5) data.df is the data frame containing all of these variables.

The function does not print estimation output when run on its own. The appropriate syntax is to save the function's output into an "iv" object, which you can summarize using the following function:

sum.iv = function(reg.iv, first = FALSE, second = TRUE, ftest = FALSE){
  x = rep(0, 3)
  if(first == TRUE)  x[1] = 1
  if(second == TRUE) x[2] = 2
  if(ftest == TRUE)  x[3] = 3

  ## Zero indices are dropped, so only the requested pieces print
  print(reg.iv[x])
}

Here is an example of how to use the functions to perform a two-stage least squares regression. Imagine that you have data on prices and quantities of cars sold, along with some attributes that directly affect quantity sold (these fall into category Z10) and attributes that affect quantity sold only through the price (these fall into category Z20).

You could estimate the model and summarize its output using the following two commands (here, cars.df stands in for the data frame containing these variables):

demand.iv = ivreg("quantity", "price", c("horsepower", "mpg"), c("regulations", "cost", "competition"), cars.df)

sum.iv(demand.iv, first=FALSE, second=TRUE, ftest=FALSE)

Note: The quotes in the ivreg() command are necessary to get the function to work properly, because the function looks each variable up by name within the data frame. This is an unusual artifact of how I wrote the function.

What is nice is that the sum.iv() command allows the user to specify which output he/she wants to see from the two-stage least squares regression. You could leave off the logical arguments entirely and get the same output (because these are the defaults), but you can also choose to show the first-stage estimates and the F-test on the explanatory power of the instruments by specifying your own options, as in the example below.
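For example, to see the first-stage estimates and the instrument F-test along with the second stage, you could type:

sum.iv(demand.iv, first=TRUE, second=TRUE, ftest=TRUE)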

Friday, May 21, 2010

Testing multiple linear restrictions

I recently wrote a function that performs multiple linear hypothesis tests on a linear model object in R. As this is a common use of regression models, I am sure there are other easy-to-use packages out there, but my function works well (so feel free to use it).

The function is called test.lin(), and here is the code that you'll need to run to get it to work:

test.lin = function(R, co, model, ...){

  z = summary(model)

  cov = z$cov.unscaled          ## (X'X)^{-1}
  beta = z$coefficients[, 1]    ## estimated coefficients
  g = nrow(R)                   ## number of restrictions
  sig.sq = (z$sigma)^2          ## estimated error variance

  Rb.h = R %*% beta             ## estimated linear combinations
  diff = Rb.h - co              ## departure from the null value

  var.con = R %*% cov %*% t(R)
  var.con.inv = solve(var.con)

  ## F statistic for Ho: R Beta = co
  W = t(diff) %*% var.con.inv %*% diff/(g * sig.sq)

  df = model$df.residual
  p.val = pf(W, g, df, lower.tail = FALSE)

  d = list(Rb.h, co, W, p.val)
  names(d) = c("Prediction", "Ho Value", "Test Stat", "P-value")
  d
}

Once you run the above code, the function test.lin() can be used to test multiple linear hypotheses of the form:

Ho: RBeta = co
Ha: RBeta !=co

where R is a matrix and Beta is the vector of regression parameters. Each row of R defines a linear combination of the regression parameters, and the vector co specifies the hypothesized value of each linear combination.

Here's an example of how to use the function. To use my code, you will want to read in the data (found here) and estimate a linear model called school.lm:
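The estimation command itself is not shown in my transcript, but the model formula appears in the anova() output below. Assuming the data frame is called caschool.df, the command would look like:

school.lm = lm(math_scr ~ avginc + str + meal_pct + enrl_tot, data = caschool.df)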

> ## ----------------------------- ##
> ## Test that beta1=beta2=beta3=0 ##
> ## ----------------------------- ##
>
> R1 = c(0, 1, 0, 0, 0)
> R2 = c(0, 0, 1, 0, 0)
> R3 = c(0, 0, 0, 1, 0)
>
> R = rbind(R1, R2, R3)
> co = matrix(c(0,0,0),nrow=3)
>
> test.lin(R,co, school.lm)
$Prediction
[,1]
R1 0.6466691
R2 -0.4384634
R3 -0.4450893

$`Ho Value`
[,1]
[1,] 0
[2,] 0
[3,] 0

$`Test Stat`
[,1]
[1,] 341.2251

$`P-value`
[,1]
[1,] 1.276346e-111

There is a "canned" way to perform this hypothesis test (but not all linear hypothesis tests, read on). Namely, you could estimate the reduced model (i.e., the one where beta1=beta2=beta3=0), and then use the anova() command to compare the two models via a likelihood ratio test. This gives an identical test statistic.

> ## Alternative method: Estimate reduced model
> ## and use anova()
>
> school.red = lm(math_scr~enrl_tot,data=caschool.df)
>
> anova(school.lm, school.red)
Analysis of Variance Table

Model 1: math_scr ~ avginc + str + meal_pct + enrl_tot
Model 2: math_scr ~ enrl_tot
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    415  41988
2    418 145559  3    103571 341.23


But the value of my function is that it works for linear restrictions for which you cannot easily specify a full-model vs. reduced-model relationship. For example, you may want to test a more complicated null hypothesis:

Ho: beta1+beta2+beta3 =1 and beta4-beta3 = 0

This is simple to specify in my framework, but it is hard to figure out what reduced model imposes these joint restrictions. Therefore, using anova() isn't likely to get you very far. Here's how to implement this test using my code, along with the output:

> R1 = c(0,1,1,1,0)
> R2 = c(0,0,0,-1,1)
>
> R = rbind(R1,R2)
>
> co = matrix(c(1,0), nrow=2)
>
> test.lin(R,co,school.lm)
$Prediction
[,1]
R1 -0.2368836
R2 0.4449859

$`Ho Value`
[,1]
[1,] 1
[2,] 0

$`Test Stat`
[,1]
[1,] 156.0517

$`P-value`
[,1]
[1,] 2.909143e-51

As I mentioned at the outset, I think there must be functions that do what my function does. I just didn't conduct a search for such a function (In fact, I might have used one such function in the past). Feel free to share if you know the name and package of such a function.

UPDATE: I should have mentioned this before now, but I discovered a much more user-friendly method for conducting linear hypothesis tests: linearHypothesis() in the car package. Here is a video tutorial where I show how to use linearHypothesis(), and a quick sketch follows below.
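As a sketch of that approach (assuming the school.lm model fit above), the first joint test in this post can be written with character-string restrictions on the coefficient names:

library(car)
linearHypothesis(school.lm, c("avginc = 0", "str = 0", "meal_pct = 0"))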

Saturday, May 15, 2010

Simulating from a T1EV (Gumbel) Distribution

I have recently started writing my own functions in R, and this has been very helpful in getting some of my coding projects done. Because R allows you to write functions, you can use this feature to greatly augment R's already-powerful capabilities.

For example, I recently needed to simulate 1.5 million independent draws from a Type 1 Extreme Value distribution, which is known on Wikipedia as the Gumbel distribution. I tried using rgumbel(), but no such function exists in the standard packages. I also conducted a quick search of R help files... to no avail.

What's an R user to do? Write a function!

Fortunately, I remembered a really handy theorem: transforming any continuous random variable by its own CDF F() always produces a random variable that is uniformly distributed on the [0,1] interval. As a practical matter, running this logic in reverse means that you can simulate n random draws from a distribution by:
  1. Computing n random draws from a Uniform distribution on [0,1]
  2. Applying the inverse of the CDF to the uniformly drawn random variables.
I used this algorithm and the inverse of the standard Gumbel CDF, F(x) = exp(-exp(-x)), whose inverse is F^{-1}(u) = -log(-log(u)), to write the following function.

rt1ev = function(n){     ## Returns n independent draws from T1EV
  U = runif(n)           ## n draws from Uniform[0,1]
  t1ev = -log(-log(U))   ## Apply the inverse of the Gumbel CDF
  t1ev
}

Because IO economists are fond of using the T1EV distribution and I am studying IO now, I can see myself using this function repeatedly.

Here's how it works. Suppose you want a vector x to have 100 draws from a T1EV distribution. After loading my code, all you need to do to use this function is type:

x = rt1ev(100)
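As a quick sanity check, the mean of the standard Gumbel distribution is the Euler-Mascheroni constant (about 0.5772), so the average of a large sample of draws should land close to that value:

mean(rt1ev(1000000)) ## should be close to 0.5772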

Not only is this a useful function if you want to simulate from a Gumbel distribution, but it is a great illustration of why knowing how to write functions in R can make your life easier.

Thursday, May 13, 2010

A function that reports White Standard Errors

Update: Since writing this post, I found a natural extension written by John Fox that does what the function in this post does, and I have extended it in my CRAN-contributed library tonymisc. In my package, this modified and extended function is called summaryR().

One of the frustrating things about using R for econometrics is that R has some cumbersome functionality for computing the standard (in econometrics, anyway) White standard errors. As one of my favorite references on learning R points out, there are two ways to retrieve these precious standard errors.
  1. You could install and load the sandwich library. Then, apply the function vcovHC() to your lm object, specifying type "HC0". That will give you the White covariance matrix. Next, take the square root of the diagonal, and you have retrieved your standard errors.
  2. Install and load the car library. Then, apply the function hccm() to your lm object, specifying the option hc0. Again, this only gives the covariance matrix, which you have to transform to obtain standard errors. (Both routes are sketched below.)
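Here is a minimal sketch of both routes, assuming you have already fit a regression and stored it in a hypothetical lm object called your.lm:

library(sandwich)
library(car)

V1 = vcovHC(your.lm, type = "HC0")  ## White covariance matrix via sandwich
V2 = hccm(your.lm, type = "hc0")    ## the same matrix via car
sqrt(diag(V1))                      ## the White standard errors
sqrt(diag(V2))                      ## identical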
Both options produce exactly the same answers (I tried it myself), but both are clunky if you want to use White standard errors as a routine check for whether heteroskedasticity matters. To mitigate this issue (at least for myself), I wrote a function that does the work for you.

The function is called summarize.r(), and it works just like the summary command. In fact, my function calls summary.lm, and then modifies the contents that are printed using method 1 from above.

To run my function, (1) download and install the sandwich library, (2) run the code that defines the function (see below), and (3) after fitting your regression using lm(), you can use the function by calling summarize.r(yourregression.lm).

Without further ado, here's the code you should run.

library(sandwich)  ## provides vcovHC()

summarize.r = function(reg.lm, ...){
  z = summary.lm(reg.lm)
  W.VCOV = vcovHC(reg.lm, type = "HC0")  ## White covariance matrix
  sig.sq = (summary(reg.lm)$sigma)^2
  coef = z$coefficients[, 1]

  White.SE <- sqrt(diag(W.VCOV))
  t.robust <- coef/White.SE              ## Element-by-element division
  df <- reg.lm$df.residual
  p.val.robust <- 2*(1 - pt(abs(t.robust), df))

  ## Overwrite the usual summary contents with the robust versions
  z$cov.unscaled = W.VCOV/sig.sq
  z$coefficients[, 2] = White.SE
  z$coefficients[, 3] = t.robust
  z$coefficients[, 4] = p.val.robust
  z
}


And, here's an example of putting the code to use. First, note what the summary output gives (cutting the call, residuals, and significance codes). Next, see the output from my summarize.r command.

>  summary(mathcomp.lm)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   643.59       2.06 312.388   <2e-16
...

> summarize.r(mathcomp.lm)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  643.587      2.053 313.495   <2e-16
...

Tuesday, May 11, 2010

Summary Statistics in R

There are plenty of useful techniques for manipulating data frame objects in R. This post summarizes some of these useful techniques, and gives code to implement them. Before running the code in this post, you should read in the data set (which is available for download here). For tips on how to read Stata ".dta" files into an R data frame, see my previous post.
## Print the entire data frame to the screen by typing its name ##

caschool.df

## Look just at the variable names ##

names(caschool.df)

## Examine a particular variable using the $ extractor ##

caschool.df$testscr

## You can also think of the data frame as a matrix ##

caschool.df[,4] ## Extracts the 4th column
caschool.df[4,] ## Extracts the 4th observation
caschool.df[4,4:18]

## The last command extracts the 4th through 18th columns
## of the 4th observation.
##
## Note 4:18 is R shorthand for
## c(4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)

## Compute summary statistics on the entire data frame ##
## or just one variable ##

mean(caschool.df)          ## Returns NA for categorical variables
round(mean(caschool.df),2) ## Rounding gives easier-to-read output
var(caschool.df) ## Returns var-cov matrix
diag(var(caschool.df)) ## Returns just the variances
mean(caschool.df$testscr) ## Returns mean of testscr
var(caschool.df$testscr) ## Returns scalar variance

## Compute the summary statistics on columns 6 through 18 ##

mean(caschool.df[,6:18])
var(caschool.df[,6:18]) ## Returns var-cov matrix

## If you don't like typing the data frame's name every ##
## time you want to explore the variable, you can use ##
## the attach() command.

attach(caschool.df)

## Now, the variables in the data frame can be accessed ##
## without extracting them with the $

mean(testscr)


##-------------------------------------------------- ##
## Creating/Storing new variables ##
## Just using arithmetic definition of new variables ##
##-------------------------------------------------- ##

math_read_avg = (read_scr+math_scr)/2

## -------------------------------------------##
## You can attach it to the data frame ##
## In fact, for regression, you want to do so ##
## -------------------------------------------##

caschool.df=cbind(caschool.df,math_read_avg)

## -------------------------------------------------##
## ... and its name will be what you called it when ##
## you defined the variable ##
## -------------------------------------------------##

names(caschool.df)
caschool.df$math_read_avg

## ------------------------------------------------- ##
## Too many data sets in R's active memory can cause
## problems: math_scr might be a variable name in
## multiple data frames, especially if you work
## with a lot of data sets.
##
## In case of ambiguity, R will mask the old
## variable with the new definition.
##
## To avoid this problem, when you are done with an
## attached data set, you should use detach()
## ------------------------------------------------ ##

detach(caschool.df)

## ------------------------------------------------ ##
## An added note: I like to leave my workspace image
## clutter-free. After working on some code, I save
## my code in a text file, but I do *not* save my
## workspace image.
##
## Following this practice can help you avoid
## referencing a variable name that you created months
## ago for a separate project
## ------------------------------------------------ ##

Reading in data in R

Reading in data can be a chore in R if you haven't done it before. In this post, I describe how to read data into R's most common data object -- the data frame.

## ----------------------------------------------------##
## The most basic structure in R is a data frame. The ##
## goal here is to read data from some external source ##
## into an R data frame ##
## ----------------------------------------------------##

## --------------------------------------------------- ##
## The standard space-delimited function is read.table()##
##                                                      ##
## To read in the data, you will want to save the file  ##
## to a known directory. For me, this is "C:/R/"        ##
##                                                      ##
## Time-saving note: backslashes in file paths must be  ##
## doubled ("C:\\R\\data.txt"); forward slashes work    ##
## as-is. Out of habit, I type "C://R//data.txt",       ##
## which R also accepts.                                ##
## --------------------------------------------------- ##

## No header is default ##
traffic.df = read.table("C://R//traffic.txt")
traffic.df ## Display the data frame

## How to tell R to store variable names ##
traffic.df = read.table("C://R//traffic.txt",header=T)

## --------------------------------------------------- ##
## A common difficulty is that the basic command does ##
## not read in Excel Workbook files. For this, there ##
## are a couple of workarounds. My favorite is to ##
## use read.csv(). ##
## --------------------------------------------------- ##

## First, save your Excel file as type "csv"
## Second, use a command like the following:

insulin.df = read.csv("C://R//insulin.csv", header = T)

##----------------------------------------------------##
## Suppose you want to read in a Stata .dta file      ##
## The command is read.dta(). ##
## ##
## This requires loading a package. ##
## In this case, the "foreign" library. ##
## ##
## Because external packages are not part of the base ##
## group of packages, you'll have to tell R to use ##
## them with the command library() ##
##----------------------------------------------------##

library(foreign)

caschool.df = read.dta("C://R//caschool.dta")

------------------------------------------------------
If you came across this page looking for advice on how to read data into R, you likely have a data set that you want to use. If you want to use some of the data referenced in this post, the traffic data set is here (just copy and paste into Notepad, and save to your favorite directory):


density speed
20.4 38.8
27.4 31.5
106.2 10.6
80.4 16.1
141.3 7.7
130.9 8.3
121.7 8.5
106.5 11.1
130.5 8.6
101.1 11.1
123.9 9.8
144.2 7.8
29.5 31.8
30.8 31.6
26.5 34.0
35.7 28.9
30.0 28.8
106.2 10.5
97.0 12.3
90.1 13.2
106.7 11.4
99.3 11.2
107.2 10.3
109.1 11.4

First Post

Keeping track of good coding suggestions that are useful for doing econometrics is difficult. I frequently stumble across coding suggestions that make my life easier, but two months later, when I need to implement the method again, I have forgotten how and have to learn it all over again. Coding can be frustrating.

Usually, I resort to searching the internet for suggestions on my specific econometrics problems, and on some topics I find great suggestions. The R-help discussion boards are useful but difficult to search, and many of the applications there involve statistics from other fields. Because of this mismatch, it takes some translation to understand what a suggestion means in econometrics language.

I am a young economist who is fond of using R to do econometrics. I don't consider myself to be a foremost expert in R, but I have passable R programming skills. I am mostly writing this blog to keep track of my own ideas for future reference, but other people might find my coding suggestions useful. If this is the case, I am happy to help out.