## Sunday, May 23, 2010

### Two-stage least squares

Update: Since this post, I discovered a bunch of better ways to conduct 2SLS regression in R. These methods do not allow the user to look under the hood to see what is going on (so this post may be instructive anyway). If you're interested, ivreg() in AER, tsls() in sem and systemfit() in systemfit are good options. My package tonymisc also has a command, iv() that works when some or all of these fail. I'm working on making that function better (5/18/2011)

Here is some code I wrote that allows you to easily perform two-stage least squares regression using R.

`ivreg = function(Y0,X0,Z10=NULL,Z20,data.df){     Y  = data.df[[Y0]]  X  = data.df[[X0]]  Z1 = NULL  if(length(Z10)>0){    Z1 = data.df[[Z10]]  }  Z2 = data.df[[Z20]]  if(length(Z10)>1){     for(i in 2:length(Z10)){        Z1=cbind(Z1,data.df[[Z10[i]]])     }  }  if(length(Z20)>1){     for(i in 2:length(Z20)){        Z2 =cbind(Z2,data.df[[Z20[i]]])     }  }## First Stage  Z       = cbind(1,Z1,Z2)          zPz     = t(Z)%*%Z  zPx     = t(Z)%*%X  zPz.inv = solve(zPz)  beta.f  = zPz.inv%*%zPx  xhat    = Z%*%beta.f  SSE.f = sum((X - xhat)^2)  df.d    = length(data.df[,1])-length(Z[1,])  MSE.f   = SSE.f/df.d  RMSE.f  = sqrt(MSE.f)  se.f    = RMSE.f*sqrt(diag(zPz.inv))  t.f     = beta.f/se.f  pval.f  = 2*pnorm(abs(t.f),lower.tail=F)## Constructing F-test  Z.d       = cbind(rep(1,                       length(data.df[,1])),Z1)  zdPzd     = t(Z.d)%*%Z.d  zdPx      = t(Z.d)%*%X  zdPzd.inv = solve(zdPzd)  bhat.d    = zdPzd.inv%*%zdPx  xhat.d    = Z.d%*%bhat.d  SSE.d      = sum((X-xhat.d)^2)  SSEx.z     = SSE.d -SSE.f  MSEx       = SSEx.z/length(Z20)  F.stat     = MSEx/MSE.f  df.n       = length(Z[1,]) -                    length(Z.d[1,])  pval.Ftest = pf(F.stat, df.n, df.d ,                lower.tail=FALSE)  ftest  = round(c(F.stat,                   pval.Ftest,df.n), 4)  names(ftest) = c("F-stat", "P-value",                  "# of instruments")  rn.f = c("constant",Z10,Z20)  first   = data.frame(cbind(beta.f,se.f,                      t.f, pval.f),row.names=rn.f)  names(first) = c("Estimate", "Std. Error",                  "T-stat", "P-value")## Second Stage  X.mat   = cbind(1,Z1,xhat)  X.mat2  = cbind(1,Z1,X)  xPx      = t(X.mat)%*%X.mat  xPx.inv  = solve(xPx)  xPy      = t(X.mat)%*%Y  beta2sls = xPx.inv%*%xPy  resid    = Y-X.mat2%*%beta2sls  MSE      = mean(resid^2)  RMSE     = sqrt(MSE)  se2sls   = RMSE*sqrt(diag(xPx.inv))  tstat    = beta2sls/se2sls  pval     = 2*pnorm(abs(tstat),lower.tail=F)  rn = c("constant",Z10,X0)  second   = data.frame(cbind(beta2sls,se2sls,                      tstat, pval),row.names=rn)  names(second) = c("Estimate", "Std. Error",                    "T-stat", "P-value")  result = list(first,second,ftest)  names(result) = c("first", "second", "ftest")  print("IV object successfully created.           Use sum.iv() on object")  print("to learn about your 2SLS         Regression")  return(invisible(result))}`

The above function is ivreg(), which takes five arguments: (1) Y0 is the name of the dependent variable, (2) X0 is the name of the endogenous regressor, (3) Z10 is the set of names of exogenous regressors [second stage variables assumed to be exogenous], and (4) Z20 is the set of instruments.

The function will not print any output if run alone. The appropriate syntax is to save the function output into an "iv" object, which you can summarize using the following function:

`sum.iv = function(reg.iv, first=FALSE,        second=TRUE, ftest=FALSE) {x= rep(0,2)if(first==TRUE) x = 1if(second==TRUE) x= 2if(ftest==TRUE) x= 3print(reg.iv[x])}`

Here is an example of how to use the functions to perform a two-stage least squares regression. Imagine that you have a data on prices and quantities of cars sold, and you have some data on attributes that directly affect quantity sold (falls into category Z10) and attributes that only affect quantity sold through the price (falls into category Z20).

You could estimate the model, and summarize its output using the following two commands:

demand.iv = ivreg("quantity", "price", c("horsepower", "mpg"), c("regulations", "cost", "competition"))

sum.iv(demand.iv, first=FALSE, second=TRUE, ftest=FALSE)

Note: The quotes in the ivreg command are necessary to get the function to work properly. This is an unusual artifact of how I wrote the function.

What is nice is that the sum.iv() command allows the user to specify what output he/she wants to see from the two-stage least squares regression. For the sum.iv() command, you could just leave off the logical arguments to get the same output (because these are the defaults), but you can choose to show the first-stage estimates and the F-test on the explanatory power of the instruments by specifying your own options.