Suppose you have a vector of names such that the first three words in the vector contain relevant information, but there is a bunch of extraneous stuff. For example,
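For concreteness, here is a made-up vector with this structure (the names are hypothetical; the first three words of each element form the name we want and anything after that is clutter):

myvec = c("JOHN A SMITH ACME INC",
          "MARY B JONES GLOBAL LLC",
          "JAMES C BROWN NATIONAL BANK",
          "SUSAN D MILLER HOLDINGS")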
Our goal is to collapse the first three words into one contiguous string (without the spaces) and to discard the extra words (because they're extraneous information). For a situation like this, my first thought is to use the strsplit() command, which splits vectors of strings by any specified character pattern you wish. For example, suppose that an object named myvec contains the vector of strings we would like to split into separate words. Then, the command
strsplit(myvec, " ")
produces a list of vectors of words. This is a little cumbersome because the vectors have a different length for each element in the list. Fortunately, all we want in this application is the first three words concatenated together, so some trimming will still give us the right answer. Even better, if you don't mind seeing a warning message on account of trying to bind vectors of uneven lengths, we can rbind the list together using do.call() as follows:
do.call(rbind, strsplit(myvec, " "))
This still works because R recycles. For our example, we then obtain a fabulous 4x5 matrix of names:
Store this matrix in the object namemat, then reduce this matrix so that it contains only the first three words by selecting those columns:
namemat = namemat[, 1:3]
Think of each row of this resulting matrix as a vector. If we can paste this vector together into one string, we're done. Do this using the paste() command with collapse = "". To paste() simultaneously for each row, I like to use apply. The resulting command that goes from namemat to vector of cleaned-and-concatenated strings is:
apply(namemat, MARGIN = 1, FUN = function(x){ paste(x, collapse="")})
MARGIN = 1 is for rows. Putting the pieces together, the code we developed is:
namemat = do.call(rbind, strsplit(myvec, " "))
namemat = namemat[, 1:3]
apply(namemat, MARGIN = 1, FUN = function(x){ paste(x, collapse="")})
And, this should work whether your list of names is 4 lines long or 40 million lines long. That said, there was a lot of heavy lifting along the way (we created a list, collapsed a list, produced a warning message as a byproduct of binding the list into a matrix, pasted each row using the collapse option and pasted quickly using apply).
There's a strange way to do this same task that doesn't require the same heavy lifting (just different heavy lifting). Use write.table() and read.table(). Start by writing the unclean initial vector to a text file without row names.
write.table(myvec, "C://vec.txt", sep = "", quote = FALSE, row.names = FALSE)
Also, turn off quoting. We do this to trick R into thinking that the spaces we want to eliminate are space delimiters. Then, read in the file using read.table() and a space delimiter.
namemat = read.table("C://vec.txt", skip = 1, sep = " ", quote = "", fill = TRUE)
skip =1 is to avoid reading in the variable name. The fill = TRUE argument fills in the matrix so that read.table() doesn't return an error for trying to read a non-square table. Each row of the resulting matrix namemat has a full name to be collapsed, one word per column (similar to namemat from the previous method). To complete the task, we still need to paste the first three columns together for each row. For this, use the apply-the-paste technique we used in the first method.
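As a sketch, assuming namemat is the data frame returned by read.table() above, that last step looks like:

cleaned = apply(namemat[, 1:3], MARGIN = 1, FUN = function(x){ paste(x, collapse = "") })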
On a final note, if we don't mind pasting the entire vector together without the spaces (including the extraneous stuff), the easy solution is to use the gsub() command to replace spaces with nothing.
gsub(" ", "", myvec)
This method will produce an answer that is close to the one we wanted, and with a single line of code. Depending on your objective, this may be the best way to go -- that is, if you don't care about having extra words appended to each string.
Friday, July 29, 2011
Monday, July 18, 2011
Avoiding Loops in R: An Example with Principal Minors
Yesterday, I found myself wanting to compute a large subset of the second order principal minors of a matrix (diagonal-preserving minors; the ones for which the rows and columns kept are the same). Don't judge me for wanting to do this, and bear with me: there is a lesson about R programming here (mostly for me -- read the comments for the timing tests).
If you don't know what a principal minor is or forgot what it was (because most of us don't spend much time with principal minors), here is a short refresher on how to get a second-order principal minor (other orders are possible and there's a lot of related theory, but let's try to focus on this one task for now).
Step 1: Deletion. Take any matrix. Delete all but two rows and all but two columns. For example, suppose the matrix is 5x5 and let's keep rows 1 and 3 and columns 2 and 4 (lots of other combinations are possible).
In the back of your mind, think how easy this is in R.
Step 2: Calculation. Once we delete the irrelevant parts of the big matrix to get our 2x2 submatrix, obtain the principal minor by taking the determinant of the resulting 2x2 matrix. For our example, we compute 3x2 - 3x1 = 3.
Both steps in this calculation are easy to implement in R. For the first step, merely use R's fantastic syntax for extracting rows and columns of matrices. If the matrix is called mat, we efficiently perform the first step by:
little_mat = mat[c(1,3), c(2,4)]
That's easy enough. With little_mat in hand, the second step is even easier.
pm = det(little_mat)
The real trick to this programming problem is picking out all of the principal minors I want to compute (diagonal-preserving ones that keep the same-numbered rows and columns; little_mat won't do because c(1,3) is not c(2,4)), and doing it efficiently. I also don't want to rewrite my code as my matrix changes size.
So, what do we do? A really bad idea -- but one that will eventually compute the right answer -- is to write this using for() loops.
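A rough sketch of what that loop-based approach might look like (not necessarily the exact code; note that it grows the storage vector with c(), which is part of what makes it slow):

pms = c()
N = nrow(mat)
for(i in 1:(N-1)){
  for(j in (i+1):N){
    little_mat = mat[c(i,j), c(i,j)]      # keep the same-numbered rows and columns
    pms = c(pms, det(little_mat))         # growing the vector by copying is slow
  }
}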
Barring a coding error, this code should eventually work, but it will be slow (or fast depending on how you write it). If you want to run through this code many times (like I did yesterday), efficiency is at a premium and you should immediately think of faster techniques -- like using the apply() function (or preallocating your storage vector, which actually saves time).
Before moving to apply(), we need a quick way to enumerate all of the ways to keep two out of N rows. If we were counting, that's N choose 2, also known as a combination. Fortunately, this step has already been implemented in the gregmisc library with a function called combinations(). Conveniently, combinations(N, 2) will return a matrix with all of the ways to select two numbers less than or equal to N as its rows. There are other options, but this is the default, which is perfect for our application.
For example, here is some output for N=5.
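With the gregmisc library loaded, the call and its output look like this:

> library(gregmisc)
> combinations(5, 2)
      [,1] [,2]
 [1,]    1    2
 [2,]    1    3
 [3,]    1    4
 [4,]    1    5
 [5,]    2    3
 [6,]    2    4
 [7,]    2    5
 [8,]    3    4
 [9,]    3    5
[10,]    4    5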
With this way of computing the rows/columns we want to keep, all we need to do is put our simple code for computing a second order principal minor into a function. Let's call it minor2(). Then, use apply() to apply this function to each row (MARGIN=1 is for rows) of the combinations matrix. Here's the complete set of code that will work if you have a matrix called mat for which you want the diagonal-preserving second order principal minors.
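A sketch of that complete set of code (assuming mat is a square numeric matrix):

library(gregmisc)   # provides combinations()

# Second-order principal minor that keeps rows and columns idx[1] and idx[2]
minor2 = function(idx, mat){
  det(mat[idx, idx])
}

keep = combinations(nrow(mat), 2)                      # all pairs of rows/columns to keep
pms  = apply(keep, MARGIN = 1, FUN = minor2, mat = mat)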
Because we avoid doing loops, this method is much faster. For a big project, this added efficiency is worth it. More generally, this technique of using apply() to avoid repeatedly looping is a good lesson to apply to other programming problems.
Update: Thanks to Berend for going through the trouble of system.time() testing the code in this post (and spiceman for a good comment as well). Read on into the comments for more, but for a preview of what we learned: (1) The copying via c() is what slows the loop down, (2) apply is slightly slower than writing the loop yourself, (3) combn(), the combinations() alternative, is really slow, and (4) one should always stand behind system.time() methods before talking about the efficiency of a method.
I much appreciate the feedback. One of the reasons I have this blog is to learn more about R (and those comments were definitely informative).
Labels:
coding tips,
functions,
inelegant coding,
matrix notation,
R,
theory,
useful resources
Monday, June 20, 2011
Today's Assignment: Assignment
A new R user quickly discovers that there are multiple ways to store information into an object -- the technical term for this is assignment. There's = as in:
x = c(1,2,3)
and there's <- as in:
x <- c(1,2,3)
R help on assignOps offers this explanation of the difference:
The operators <- and = assign into the environment in which they are evaluated. The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions.
For catching subtle coding errors with =, this is a smart feature of R. When I am tired, I sometimes find myself writing a line of code that looks like this:
if(codename = "thedoctor") watch = "yes"
In this line of code, the problem is that codename = "thedoctor" isn't a logical statement. It is a statement that would assign the string "thedoctor" to an object called codename.... if the statement were allowed by R's language. Fortunately, R prevents this.
This feature of the R programming language helps me because I -- by default -- use = for assignment rather than <-. My style is not conventional, however. Popular R coding style guides (Google, 4D pie charts) suggest that users use the <- operator for assignment, rather than =.
Deep down, I think I use this unconventional default because I had the = versus == distinction (more generally, assignment versus logical statement) drilled into me in my first computer science class (not in R; Java). On reflection, I have three reasons to use = rather than <-. First, = is easier to type, using one keystroke and no shift keys. Second, the equals style looks cleaner to me (completely a personal preference). Third, I think assignment should take place out in the open (i.e., exactly where = is allowed). If the line of code is out in the open, both <- and = work the same.
On a final note, the <- is not without its problems for logical statements. For example, the statement
if(agediff<-2) status = "I'm older"
will cause problems and it won't let you know about these problems until you get some really strange results later on. Here's the code from an interactive R session where I show what can go wrong.
> agediff = -3
> if(agediff<-2) status = "I'm older"
> status
[1] "I'm older"
> agediff
[1] 2
> agediff = 4
> if(agediff<-2) status = "I'm older"
> status
[1] "I'm older"
> agediff
[1] 2
> status=NULL
> if(agediff<-2) status = "I'm older"
> status
[1] "I'm older"
> agediff
[1] 2
The problem is two-fold in this code. First, the if statement isn't acting as a logical test of whether agediff < -2. Second, the value of agediff is changed because "<-" means assignment. The result is that both status and agediff have (potentially) the wrong values at the end of running the code.
As this error is a problem whether you use = or <- for assignment, it is worth knowing regardless of your style preference. Adhering to good spacing guidelines is a good way to avoid this error.
> agediff = 4
> if(agediff < -2) status = "I'm older"
> status
NULL
> agediff
[1] 4
There are some other examples in the Stack Overflow forum discussion on this topic (actually multiple threads on this), all of which are useful. It is interesting reading to be sure.
Sunday, June 5, 2011
An application of aggregate() and merge()
Today, I encountered an interesting problem while processing a data set of mine. My data have observations on businesses that are repeated over time. My data set also contains information on longitude and latitude of the business location, but unfortunately, my source only started providing the longitude and latitude information in the middle of the data set. As a result, the information is missing for the first few periods for every business, but across businesses, I have latitude and longitude information for virtually all of my data set.
As long as the business hasn't shifted location, this isn't a problem. I can just substitute the lat-long information from the later periods into the earlier periods for each business. This sounds simple enough to do, but when one has a data set with 70,000 observations, an automatic process saves a lot of time and can help avoid mistakes.
Let's start by building common ground. Here is a script that you can use to simulate a data set to follow along.
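A sketch of such a script (the exact numbers and the cutoff period are illustrative; it blanks out the lat-long columns for the first 10 periods of every business):

set.seed(1234)

Nbus = 125   # number of businesses
Nper = 25    # number of time periods

mydat = data.frame(BusID = rep(1:Nbus, each = Nper),
                   per   = rep(1:Nper, times = Nbus))

# One true location per business, constant over time
lat  = runif(Nbus, 25, 49)
long = runif(Nbus, -124, -67)

mydat$lat  = lat[mydat$BusID]
mydat$long = long[mydat$BusID]

# The source only reports lat-long later in the sample (missing early on)
mydat$lat[mydat$per <= 10]  = NA
mydat$long[mydat$per <= 10] = NA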
Once you run this script, the data frame mydat is a balanced panel of 125 businesses observed over 25 time periods (my actual data set is an unbalanced panel, and the solution I present works just as well for unbalanced panels as for balanced ones; it is just easier to simulate a balanced panel). Each observation has a business identifier (BusID), a time period identifier (per), as well as simulated data on latitude and longitude.
To see the problem with this data set, examine the data for the first business using the command mydat[mydat$BusID==1,]. Here's a screenshot of the output.
This problem occurs throughout the data set and we would like to just substitute the lat-long values we have for the right missing ones.
How did I solve my NA problem? I used aggregate(), a function that quickly and conveniently collapses a data frame by a factor or list of factors. With aggregate(), you can take the mean, compute the max, compute the min, sum the values or compute the standard deviation by the levels of the collapsing factor. The aggregate() function also has a na.rm option that allows us to easily ignore the NA values (useful here). In our example, we could collapse by taking the mean or computing the max or the min; the result will be the same because latitude and longitude do not vary for the same business over time.
Let's collapse by taking the mean within each level of BusID. The result of my use of aggregate() is a data frame that, upon merging with the original data frame (using merge()), fills in the missing latitude and longitude values. The following code implements this strategy to fill in the NAs in our simulated data set.
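A sketch of that code, using the names referenced in the notes below (latlong for the collapsed data frame and mycleandat for the merged result):

# Collapse to one row per business, ignoring the NAs
latlong = aggregate(mydat[, c("lat", "long")],
                    by = list(BusID = mydat$BusID),
                    FUN = mean, na.rm = TRUE)

# Drop the original (partly missing) lat-long columns, then merge the
# complete ones back in by the common BusID column
mycleandat = merge(mydat[, -c(3,4)], latlong)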
Before leaving you to have fun with aggregate() and merge(), several points are worth mentioning.
First, the aggregate() function requires that the collapsing factor(s) is (are) stored in a list. This is why the code uses list(BusID=mydat$BusID) when mydat$BusID would seem more natural. The list() part is necessary to get aggregate() to work.
Second, it is useful for merging in the second step to give the attribute in the list the same name as the variable name in the original data frame. The BusID= part of the list(BusID=mydat$BusID) command is a convenient way to prepare the resulting data frame for merging.
Third, my code uses -c(3,4) to drop the 3rd and 4th columns of the original data frame (the ones with the original lat-long data) before merging it with the collapsed latlong data frame. This is important so that the merge() command doesn't get confused about which columns to use for matching. If you leave these columns in mydat, R's merge will try to match the values in those columns as well (because they have a common name with those columns in latlong).
Finally, the data in this example are in a different order now that we merged on BusID. If this bothers you, you can use order to sort the data set by time period as in
myfinaldat=mycleandat[order(mycleandat$per),]
As this last step is purely a cosmetic change to the data set that will not affect subsequent analyses, you can leave it out.
Although this example was my solution to a rather specific problem, it illustrates how to use some useful commands. To the extent that these commands are widely applicable to other problems, I hope you find uses for these commands in related applications.
Monday, May 30, 2011
Recreational R: Simulating a Card Trick
In this post, I simulate an interesting card trick, which was described by a friend of mine named Xan. Here's Xan's description of the card trick:
- I put a deck of cards down face up on the table. Meanwhile you think of a secret number between 1 and 10 -- for exposition let's say you pick 4.
- One by one, I discard cards from the top of the deck. When we get to the 4th card -- and let's call it your special card -- you look at the number. For exposition, let's say your special card happens to be a 7. Then 7 secretly becomes your new number. Note that I don't know your special card.
- I keep flipping cards, and 7 cards later, you have a new special card and number yet again. Note that I still don't know your special card.
- This process continues -- me flipping cards at a constant rate, you secretly updating your special card and counting up to the next one -- until I decide to stop.
But I don't just stop on any card. I stop on your current special card. Which I'm not supposed to know.
The following script file is a crude first approximation to simulating this card trick for the first 150 cards. This crude strategy is simple: (1) Simulate a sequence of 150 cards, (2) Pick a starting value -- might as well do all of them at once, and (3) Implement Xan's trick on all of them (that's what the for loop does).
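A sketch of that crude first pass (illustrative code: it samples 150 card values with replacement, caps face cards at 10, and follows the chain of special cards for each of the 10 possible starting numbers):

set.seed(42)

Ncards = 150
cards = pmin(sample(1:13, Ncards, replace = TRUE), 10)  # face cards count as 10

# secretnum[[s]] holds the sequence of special-card positions when you start at s
secretnum = vector("list", 10)
for(start in 1:10){
  pos  = start
  path = c()
  while(pos <= Ncards){
    path = c(path, pos)       # record the position of the current special card
    pos  = pos + cards[pos]   # count forward by the value of that card
  }
  secretnum[[start]] = path
}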
Here's the result of running this script three times and plotting 10 secretnum sequences: one for each starting value. Each run starts with 10 distinct secret numbers (with a default chance of 0.1 of the trick "working"), and in my three runs, every starting value of the secret number eventually converges on the same sequence of secret numbers. More on the intuition for this here.
There are a couple of things wrong with this simulation that may have stacked the deck in favor of it working. I fixed these before writing this post, but I think it is useful to see the quick and dirty solution before adding frills and realism (which is what I do now).
- I merely simulated my 150 cards using a uniform distribution on 1 to 13 (then forcing the face cards 11, 12 and 13 to be 10s). This is equivalent to sampling with replacement, which may have different properties than sampling without replacement.
- There are only 52 cards in a deck. How would this work with only one shuffle, but passing through the deck (almost) three times to reach 150 cards? This is a different problem. Maybe you wouldn't expect much improvement after one pass through the deck (because you've seen that sequence of cards before).
Once you define this function, we can save the probability of the trick working for 1000 different shuffles of the cards. This script does it for one deck and one pass through the deck.
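A sketch of such a function and script, under the assumption that the performer follows the chain starting at position 1 and that the trick "works" for a guess if its chain ends on the same card as the performer's:

# Final resting position of the chain that starts at position 'start'
final_pos = function(start, deck){
  pos = start
  while(pos + deck[pos] <= length(deck)){
    pos = pos + deck[pos]
  }
  pos
}

# Probability the trick works for one shuffled deck: the share of starting
# numbers 1..10 whose chain ends where the performer's chain ends
trick_prob = function(deck){
  finals = sapply(1:10, final_pos, deck = deck)
  mean(finals == finals[1])
}

# One deck (face cards count as 10), one pass, 1000 shuffles
probs = replicate(1000, trick_prob(pmin(sample(rep(1:13, 4)), 10)))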
Then, we can take the average probability, which feels like the probability the trick works (unconditional on the particular shuffle or guess of a secret number). In the simulations I have run, this number turns out to be around 0.74 for one deck and one pass through the deck and about 0.86 using one deck and passing through twice. There's not much improvement after the second pass through of the deck.
In case you're interested, here are the results for up to 3 decks and up to 5 passes through the set of cards. It is always better to have more decks, but two passes through two decks is about enough to virtually guarantee that the trick works.
Update: In case you are wondering, I also coded it up with a 0.01 probability of forgetting to update on any given flip of the card. Here's the table of results in this case.
It's no surprise: For forgetful people, the trick is less successful and there isn't much point in reaching for a second deck.
Saturday, May 28, 2011
An Application of boot() to IV regression
Bootstrapping standard errors can be a useful technique when obtaining closed form for the standard error formula is difficult or intractable. In this post, I give an example of how to use R to create a bootstrap sampling distribution in the context of IV regression. Specifically, I use boot() to automatically augment a function of mine to resample the indices of my data set with replacement (see the code in the function below).
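The general pattern is worth seeing on its own. Here is a generic sketch (not the ivboot() code itself) that uses boot() with AER's ivreg() on simulated data; boot() hands the statistic function a vector of resampled row indices:

library(boot)
library(AER)    # provides ivreg()

# Simulated data for illustration: z is the instrument for x
set.seed(1)
n = 100
z = rnorm(n)
x = 0.8*z + rnorm(n)
y = 1 + 2*x + rnorm(n)
mydata = data.frame(y, x, z)

# Statistic function: each bootstrap replication re-runs the IV regression
# on the rows picked out by 'indices'
iv_coefs = function(data, indices){
  coef(ivreg(y ~ x | z, data = data[indices, ]))
}

bootresults = boot(data = mydata, statistic = iv_coefs, R = 500)
apply(bootresults$t, 2, sd)   # bootstrap standard errors for the IV coefficients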
In my application, I present a function that uses the boot library to report bootstrap standard errors for an instrumental variables regression. My ivboot() function builds on the iv() command I wrote for the tonymisc library.*
Now, onto the ivboot() example. Start by loading the tonymisc library and some data from the library, as well as the boot library.
We can run the following R script to define the function ivboot().
When applied to data and an IV regression model, the ivboot() function creates an IV regression object that -- when tonymisc is loaded -- is compatible with mtable() output. The only difference between an ivboot()-created object and an iv() object is that the ivboot() object has standard errors that are based on the bootstrap distribution of the coefficient estimates (Statistics other than the second-stage coefficient estimates are not bootstrapped).
Here is some code to compare the bootstrap output to the analytical standard error output:
On this small sample (N=100) of simulated data, the ivboot() command took less than a minute to run on my computer (timing may vary depending on your computer). For much larger data sets, this will be slower. If you have a larger problem or lower standards (or higher standards, your choice), you can use the boots option to ivboot() to specify the number of bootstrap samples. Currently, I have set the default to 500, but you could specify boots = 200 if you want the command to run faster (boots = 10 will make it run even faster, but I don't recommend that!).
Here is the mtable() output, which can easily be ported into LaTeX using the toLatex() command.
*This standard output from an mtable() extension to my iv() command provides quite a bit of information in a convenient format. Another nice feature of iv() is that iv()-created objects have first-stage summary information readily stored in the object for extraction and analysis.
Friday, May 20, 2011
Defaults, Lists and Classes: A Functional Post
In this post, I demonstrate a couple of useful tricks for writing functions in R. The context is a function I wrote called samp() that allows for an easy demonstration of sampling distribution properties.
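A sketch of what such a function might look like (the body is illustrative, but the argument names match the discussion below):

samp = function(N, mu = 0, std = 1, K = 500){
  means = numeric(K)
  sds   = numeric(K)
  for(k in 1:K){
    x        = rnorm(N, mean = mu, sd = std)
    means[k] = mean(x)
    sds[k]   = sd(x)
  }
  result = list(means = means,                 # sample means, one per replication
                sds   = sds,                   # sample standard deviations
                ses   = sds / sqrt(N),         # standard errors
                mean.of.means = mean(means),
                sd.of.means   = sd(means),
                true.mu  = mu,
                true.std = std)
  class(result) = "samp"
  result
}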
Defaults
By default, this function draws K = 500 samples of size N from a normal distribution with a mean of mu = 0 and a standard deviation of std = 1. As you can see, the syntax for specifying a default for an argument to the function is to set it equal to its default value inside the function statement. Arguments can be given no default by naming them as an argument to the function without specifying their default value.
Sensibly chosen defaults can save a lot of time for people who use your function. For example, although this samp() function has four arguments, I don't need to type samp(100, 0, 1, 500) every time I wish to take 500 samples of size N=100 from a Normal(0,1) random variable. If sampling from a Normal(0,1) is especially common, setting the defaults this way can save a lot of keystrokes.
On the other hand, using defaults instead of hard coding the Normal(0,1) choice has the distinct advantage that users keep their flexibility: the function still gives them the option to specify arguments other than the default. The right use of defaults strikes a balance between ease of use and flexibility.
Returning a List... with Names
If your function warrants the special treatment, you have probably performed several useful calculations that you would like to return. Maybe you have conducted several related hypothesis tests that you would like to reference later (report and/or use in another calculation). A great way to do this is to store the results of your function into a list.
For example, in the samp() function, I wanted to store vectors of the sample means, sample standard deviations and standard errors (500, one for each replication). For good measure, I wrote the function to store the means and standard deviations of these vectors as well as the true parameter values that went into the function.
Although it is often a good idea to store the results in a list, it is a better idea to name the elements of the list for easy extraction. If I ran the command
mysamp = samp(100)
The object mysamp would contain a list of the results from the function samp(). Suppose I want to extract the vector of means (for example, to compute a histogram). Without running the names command, I would have to remember that the vector of means was in the first position in the list and use the command
hist(mysamp[[1]])
to plot a histogram of the means vector from my sampling object. As it is easy to forget the precise order in which the elements of a function are stored, this syntax can lead to too much thinking. It is better to name the first list element something like means. Naming the list elements allows the user to use the $ extractor on objects created with the function. That is, the syntax becomes
hist(mysamp$means)
which is much easier to remember and easier to read. This latter fact is often unappreciated, but if you get in the habit of naming elements in a list, you will be a better collaborative coder. Even if you're not into collaboration, when you return to that project after two weeks of doing something else, it is much easier to remember where you left off.
A Benefit of Keeping It Classy
Another coding practice that often goes unappreciated is the use of classes to make your life easier (and your console cleaner). At this point, if you type mysamp into your console window, R will bombard you with more than a screen's worth of output. Most of this is output that I don't want to see printed out. I might want to keep it around for later, but I don't want to see all of it every time I inspect the object.
This is where the class(result) = "samp" command comes in handy. By "classing" my object this way and writing a short print method for this class (named print.samp()), I can cut down on the amount of output I have to see when I inspect my object. Here's code to define a print method.
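A minimal sketch of such a print method (the summary lines it prints are illustrative):

print.samp = function(x, ...){
  cat("Sampling distribution based on", length(x$means), "replications\n")
  cat("Mean of sample means:", x$mean.of.means, "\n")
  cat("SD of sample means:  ", x$sd.of.means, "\n")
  cat("True mu:", x$true.mu, " True std:", x$true.std, "\n")
  invisible(x)
}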
Now, try typing mysamp into R. You will see much less output (only the output from print.samp), but the object mysamp still has all of the information you want to keep around. If you don't believe me, just type mysamp$means to be sure.
Just as we could "class and print," we can also "class and plot." This can be handy if you want there to be a standard meaning for the command plot(mysamp). In our working example, let's make this command mean plot three histograms, one for means, one for standard deviations and one for standard errors. Here's some code that imposes this meaning on objects with class "samp"
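A sketch of such a plot method, putting the three histograms side by side:

plot.samp = function(x, ...){
  par(mfrow = c(1, 3))
  hist(x$means, main = "Sample Means", xlab = "mean")
  hist(x$sds,   main = "Sample Standard Deviations", xlab = "sd")
  hist(x$ses,   main = "Standard Errors", xlab = "se")
  par(mfrow = c(1, 1))
}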
In case you were wondering what this plotting output looks like, here is the picture that is produced from the command plot(mysamp).
Labels:
basics,
coding tips,
functions,
output,
useful functions
Thursday, May 19, 2011
Porting Stata-like Marginal Effects to LaTeX
The TimeSeriesIreland Blog posted an excellent start on a function that automatically computes marginal effects for probit and logit models. As I have always wanted an easy-to-use function that computes and reports marginal effects in R, I was elated to see the function and I couldn't wait to use it.
My initial impression when I saw the post was: "I should write an mtable() extension for that to go with my other extensions." With this in mind, I started familiarizing myself with the code and I contacted the author to make sure he was fine with me extending his code (and adding it to my package with mtable() extensions). As he was fine with my goal of extending the function, here is a post where I introduce you to a new extension to mtable().
Important Caveat: At this point, the function does not do the technically "right" thing (delta method) to obtain the standard errors on the marginal effects. Instead, it supposes that the z-score will remain constant across the transformation, which isn't quite right. Even though it isn't perfect, in most of the examples I have tried, the results are almost identical to Stata output. You tend to get more divergence from the Stata solution for dummy variable regressors (where Stata computes a discrete jump rather than a derivative).
One would think that an extension of mfx() would use mfx(), but that didn't turn out to be the easy way to go. After many unsuccessful attempts (more about this in a future post; I learned some things worth sharing), here's the solution I landed on.
First, define an "mfx" object with a function that changes the class attribute of a fitted glm object to "mfx". Here's code to define a function to do this.
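A minimal sketch of one way to do this (prepending the new class so the usual glm methods still apply):

# Tag a fitted glm object with class "mfx" so that mtable() dispatches to
# the getSummary.mfx() method defined below
mfx = function(glmobject){
  class(glmobject) = c("mfx", class(glmobject))
  glmobject
}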
Second, make sure you have the memisc library loaded with the command library(memisc) -- or library(tonymisc), which loads memisc among other excellent packages.
Third, until I get the kinks worked out and port this into my package, you'll have to run code to define a getSummary.mfx() method to extend mtable. Here's some code that will do this.
Now, I suppose you want to know how this extension works. Here's an example that uses some data from the wonderful Ecdat library (a great source for Econometrics data sets).
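A sketch along those lines (the specification is illustrative and assumes the mfx() helper and getSummary.mfx() extension above are defined; check names(Yogurt) and levels(Yogurt$choice) for the exact names):

library(Ecdat)
data(Yogurt)

# Logit for whether Yoplait is chosen, as a function of its price and
# whether it was featured that week
yog = glm(I(choice == "yoplait") ~ price.yoplait + feat.yoplait,
          family = binomial(link = "logit"), data = Yogurt)

mtable(mfx(yog))   # marginal effects via the mtable() extension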
Here's the output to R's console. Note that feat.yoplait is a dummy variable for whether Yoplait had a featured advertisement that particular week. Based on my caveat, you wouldn't expect the output to be identical to Stata.
And, as you can see (compare with Stata output below), it isn't exactly the same, but if you are trying to understand the qualitative importance of your estimates, it isn't too bad.
Two bugs (quirks, kludges) remain: (1) I want to get the standard errors precisely right, (2) I want to convince mtable() to stop reporting the intercept for the marginal effects (as the intercept doesn't change, it can't have a marginal effect!). Even before I get around to this, I hope this code will have a positive marginal effect.
---
Update: The original post to which this post refers is no longer available because the TimeSeriesIreland Blog appears to have been taken down. I leave this post up subject to two caveats: (1) the standard errors are not quite right for the marginal effects and (2) the marginal effects for dummy regressors are not computed for a discrete jump from 0 to 1. Otherwise, you may find this tool useful.
Subject to the caveats in this post, if you want access to this function and a related function that is more summary()-like, I have included it in my CRAN-contributed package tonymisc (see the tab at the top of this page).
Labels:
coding tips,
inelegant coding,
output,
R,
useful functions,
useful resources
Friday, May 13, 2011
Describing Data: Frequently Used Commands
Obtaining a coherent numerical summary of data is a common task, and it is common to want to port these summary statistics into a table of results. When I am in interactive mode with my data, I use the summary() command applied to my data frame. For example, the following code loads and summarizes a data frame on Yogurt advertising and prices:
library(Ecdat) ## Econometrics Data (useful!)
data(Yogurt) ## Loads Yogurt from Ecdat
summary(Yogurt) ## Summarizes Yogurt
For each quantitative variable, the summary() command provides a five-number summary (min, max, Q1, Q3, median) plus the mean. For categorical variables, the counts of each level are provided. This provides an excellent summary measure of each variable, but you may prefer a richer set of information (especially when it comes to typing up tables).
I recently discovered a great way to obtain a richer set of information on a data frame. This method involves using the psych library, which contains functions describe() and describe.by(). Continuing with the code from above, here is the basic syntax:
library(psych)
describe(Yogurt) ## Describes in more detail the Yogurt data frame
Suppose you also want to break your summary statistics into two (or four) tables for comparison's sake (perhaps to illustrate stark differences across select subsets of your data). The describe.by() command is a convenient technique to break the data down by the levels of a factor. Here's an example on the Yogurt data.
describe.by(Yogurt, Yogurt$choice)
Finally, you may want to port your data into LaTeX format and/or select particular summary statistics from the list. I wrote a function that serves as a convenience interface to describe.by() and toLatex(). As toLatex() does not work directly on objects created using describe.by(), you might find this helpful.
If you do not like knowing about the kurtosis of your data, you could read up on the options of describe.by() to learn about how to shut it down. If you're going to port it into a LaTeX table anyway, you could also just modify the code I wrote here to eliminate the summary statistics you don't want and produce LaTeX output.
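A rough sketch of such a convenience function, using xtable() in place of toLatex() for the LaTeX step (the statistic names come from the columns that describe.by() returns when mat = TRUE):

library(psych)
library(xtable)

# Describe a data frame by the levels of a grouping factor, keep only the
# chosen statistics, and return an object ready to print as LaTeX
describe.to.latex = function(df, group, stats = c("n", "mean", "sd", "min", "max")){
  desc = describe.by(df, group, mat = TRUE)   # one row per variable/group combination
  xtable(desc[, c("group1", stats)])
}

# print(describe.to.latex(Yogurt, Yogurt$choice))   # emits the LaTeX code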
FYI: Quick R has a nice summary of some other methods for summarizing data. Of the methods at Quick R that I didn't describe, pastecs looks most like a method I would use.
Monday, May 9, 2011
Extending mtable() to ivreg, gls and robust standard errors
I have written several extensions of the mtable() command in the memisc library that may come in handy. The methods are available in a package called tonymisc (now available on CRAN). A zipped folder with my package files is available for download here.
My extensions to mtable() allow users to easily summarize (and port into LaTeX tables) lm() objects with robust standard errors, gls() objects (created using nlme's gls) and ivreg() objects (created using AER's ivreg). If you would like to see how the output looks, here's some example code:
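As a sketch of the kind of call these extensions allow (the data set and model specifications below are made up for illustration, not the original example):
library(tonymisc)   ## loads the mtable() extensions (memisc is a dependency)
library(AER)        ## for ivreg() and the example data set
library(nlme)       ## for gls()
data("CigarettesSW", package = "AER")   ## any data set would do
fit.ols = lm(packs ~ price + income, data = CigarettesSW)
fit.rob = robust(fit.ols)               ## lm() fit wrapped for robust standard errors
fit.gls = gls(packs ~ price + income, data = CigarettesSW)
fit.iv = ivreg(packs ~ price + income | tax + income, data = CigarettesSW)
toLatex(mtable(fit.rob, fit.gls, fit.iv))   ## LaTeX code for a three-column summary table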
This code produces the LaTeX code behind this table.
I also implemented a feature through my robust() command to allow the user to select which variables he/she would like to see in the table output. Here's some example code for that feature:
The output from this code is this table:
To get this package to work for you, simply use the command install.packages("tonymisc", dependencies=TRUE), as you would normally do to install packages from CRAN. Once it is installed, all you need to do is load it with the library(tonymisc) command.
I am still working on the details of the package (help files, extra features, extensions of what I have done, internal consistency, loading other packages by default, ease of use, etc.), but I would be happy if people find it useful at this stage. In particular, I still need to implement some ideas from the comments (the next version will contain these). If you have other suggestions, please leave a comment to let me know.
Tuesday, May 3, 2011
Putting Robust Standard Errors into LaTeX Tables: An Extension of mtable
I recently discovered the mtable() command in the memisc library and its use with toLatex() to produce nice summary output for lm and glm objects in a nicely formatted table like this:
Once you have your linear model objects, all you need is one command -- a composite of toLatex() and mtable(). Here's the R code I used to generate the LaTeX code:
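As a minimal sketch (the data frame mydata and the variables y, x1, x2 and x3 are placeholders, not from the original example):
library(memisc)
lm1 = lm(y ~ x1 + x2, data = mydata)        ## two nested specifications
lm2 = lm(y ~ x1 + x2 + x3, data = mydata)
toLatex(mtable(lm1, lm2))                   ## prints the LaTeX code for the table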
All you need to do from here is copy the code into your LaTeX document (or, in my case, a LyX document with evil red text), where you use the dcolumn and booktabs packages (specify this in the preamble using \usepackage{} for each package).
This is really straightforward ... as long as you just want to use lm or glm objects with no deviations from that baseline. But, what if you -- as many econometricians do -- want to estimate your linear model with heteroskedasticity-consistent standard errors?
How to do it with "robust" standard errors
My favorite way to robustify my regression in R is to use some code that John Fox wrote (and I found in an R-help forum). He called it summaryHCCM.lm(). In my own applications, I have renamed it summaryR() because "R" makes me think "robust" and it is fewer keystrokes than HCCM. I modified the output of the code to work better with what follows:
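The sketch below follows the same idea and is only an approximation of summaryR(), not John Fox's original code: it uses car's hccm() to swap heteroskedasticity-consistent standard errors into the usual summary() output.
library(car)   ## provides hccm() for heteroskedasticity-corrected covariance matrices
summaryR = function(model, type = "hc3") {
  V = hccm(model, type = type)              ## robust covariance matrix
  s = summary(model)
  ct = coef(s)
  ct[, 2] = sqrt(diag(V))                   ## replace the standard errors
  ct[, 3] = ct[, 1] / ct[, 2]               ## recompute the t statistics
  ct[, 4] = 2 * pt(abs(ct[, 3]), df.residual(model), lower.tail = FALSE)
  s$coefficients = ct
  s
}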
Suppose I would like to produce nicely-typed output like the above table, but with robust standard errors. Unfortunately, the summaryR() solution doesn't map naturally into our mtable() solution to producing nice LaTeX output. But, fortunately, there is a solution.
(If you just want this to work, copy and run all of the code from here through "so we are all set." There is no need to understand the implementation details. If you're interested in how I arrived at my solution, the details follow.)
It turns out that there is a way to map the summaryR() command to the mtable() command to produce nice LaTeX output with robust standard errors. Once we have the commands, the syntax is just as easy as using summaryR() directly and it makes creating LaTeX table output easy. So, what do we do?
First, we need to adapt the methods behind mtable() to allow it to recognize something that I call a "robust" object. To do this, we need to write a function called getSummary.robust()
and we need to set some formatting options using the setSummaryTemplate() command:
Second, we need to convert a linear model object into something called a "robust" object. The robust() command does this:
So, we are all set. If you run all of the preliminary code in this post, you're in good shape to produce LaTeX tables with robust standard errors like this table:
Once the functions are defined, the syntax is straightforward. Here is the code I used to put together the table (plus one step to copy the output into my LyX file):
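As a minimal sketch of that step (assuming two lm() fits named lm1 and lm2, as in the earlier example):
rob1 = robust(lm1)   ## wrap the lm() fits as "robust" objects
rob2 = robust(lm2)
toLatex(mtable(rob1, rob2))   ## prints LaTeX code to paste into the LaTeX/LyX document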
I hope this is helpful. If you're interested in other extensions of mtable(), here's an excellent post.
Finally, I have since learned how to bundle functions into a package. This set of functions is now part of a package called tonymisc, which (in its first version) is available on CRAN through install.packages("tonymisc").
Friday, April 29, 2011
Forming Formulas
One of the first functions a new R user learns how to use is the lm() command, which involves stating the model formula.
lm(y~x1+x2, data=mydata)
After a while, this just becomes a natural way to say "I want a regression of y on x1 and x2 using mydata." Even though it is natural, the underlying structure of the formula is not as it first appears. This becomes apparent as soon as you need to work with formula objects when writing your own function.
Does ~ define a string? No. You can ask R directly whether the formula is a string (of class "character") using is.character().
is.character(y~x1+x2+x3)
[1] FALSE
As we will soon see, a formula isn't easily coerced into a string either.
Should we think of ~ as a fancy = sign? No. R doesn't think of ~ as defining an equation to be evaluated.
x1=5
x2=6
x3=7
y~x1+x2+x3
y ~ x1 + x2 + x3
It is better to think of ~ as defining the entire statement as an object with a special class called formula, which conveys the idea of the model to other functions in R.
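You can check this directly; note that none of the variables need to exist, because nothing is evaluated:
class(y ~ x1 + x2 + x3)
[1] "formula"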
For the rest of the post, I will show some calculations that illustrate more fundamentally what a formula object is. For concreteness, imagine you have a formula object with the general form y~x1+x2+z1+z2+z3.
You could define this object with the command:
form.ex = y~x1+x2+z1+z2+z3
Suppose our goal is to extract the first two explanatory variables (x1 and x2) and use them in a linear model for the response (y). If you're reading this for help, you likely have a more noble goal in mind. Take this as an illustrative example.
To appreciate the power of a method, it is useful to see the inelegant code we are avoiding. The inelegant code also shines some light on how a formula object differs from a string.
An inelegant method
We could gain access to the elements of the formula by coercing it into a character string using as.character(). If we do this in our example, we obtain a character vector of length three:
as.character(form.ex)
[1] "~"
[2] "y"
[3] "x1 + x2 + z1 + z2 + z3"
Here, we see that the RHS of the formula is stored in the third element of this vector, and the response is in the second element. Let's store these in objects we can easily reference and use.
rhs = as.character(form.ex)[3]
resp = as.character(form.ex)[2]
Next, we can split the rhs string into the individual variable names using the strsplit() command. Note: we want to split the string apart using the delimiter " + " with the spaces. Because + is an operator, we need to escape it and use " \\+ " instead. In addition, strsplit() returns a list (here, of length one), and we really want the first element of that list... (remember, this is the inelegant way).
rhs.sp = strsplit(rhs," \\+ ")[[1]]
rhs.sp
[1] "x1" "x2" "z1" "z2" "z3"
Now, we have a vector of variable names for the right hand side of the equation. As our goal is to construct a formula object relating the response to the first two explanatory variables, extract the first two elements of this vector.
rhs.new = rhs.sp[1:2]
rhs.new
[1] "x1" "x2"
And, paste this right hand side of the equation to the response variable name in a formula style using the following combination of paste() commands.
short.form = paste(resp, "~", paste(rhs.new, collapse="+"))
short.form
[1] "y ~ x1+x2"
Great. The only problem? The object short.form is not a formula object. It is a character string. This won't be a problem if we input it into something like lm(), which will coerce it to a formula object before using it.
lm(short.form, data=mkt.df)
Call:
lm(formula = short.form, data = mkt.df)
Coefficients:
(Intercept) x1 x2
-25.329 2.335 4.239
If you want to do the formula coercion yourself, you can use the as.formula() command as follows.
short.form = as.formula(short.form)
short.form
y ~ x1 + x2
A more elegant method
The all.vars() command does the splitting, splicing and extracting that comprised the bulk of what we did above. Starting with the formula object form.ex, we could create a vector of variable names in our model statement as:
vars.ex = all.vars(form.ex)
vars.ex
[1] "y" "x1" "x2" "z1" "z2" "z3"
Now, just define the rhs and resp vectors as we did in the less elegant method.
rhs = vars.ex[2:3]
resp = vars.ex[1]
Is there a better way to recombine these vectors than to create our own paste() collapser? Yes! The reformulate() function makes it easy. And, for good measure, let's see what happens in the lm() command:
myform = reformulate(rhs, response=resp)
myform
y ~ x1 + x2
lm(myform, data=mkt.df)
Call:
lm(formula = myform, data = mkt.df)
Coefficients:
(Intercept) x1 x2
-25.329 2.335 4.239
Labels: coding tips, inelegant coding, potentially useful functions, R
Tuesday, April 26, 2011
Automatically Save Your Plots to a Folder
Suppose you're working on a problem that involves a loop for calculations. At each iteration inside the loop, you want to construct a plot. Not only do you want to see the plot, but you would like to save each plot for a presentation, report or paper. Furthermore, the loop goes on for a while (say through the 26-letters of the alphabet).
The last thing you want to do in this situation is: (1) produce each plot one-by-one, (2) right click on each singly-produced plot to save, (3) give the plot a unique name, and (4) repeat. You'll spend too much of your time saving plots and not enough time thinking about whether they are the right plots. Just imagine, what if something went wrong and you need to produce the whole set of plots again?
RStudio has a nice feature in that it saves all of your plots in the plotting pane. It's no problem if you just produce the plot inside your dreaded loop in RStudio because it keeps all of your plots in the pane. Even with RStudio, if you produce the plots inside the loop, you still need to save each one individually. This isn't ideal. If you don't believe me, imagine that you have 1000 plots instead of 26. This manual-saving method becomes impractical quickly.
How then can you automatically save plots to a folder without spending too much time? That's today's task. I'll start by describing several building-block commands and then at the end I'll put together a loop that does it all (by way of contrived example).
The first command you need to know is jpeg() (alternatively, bmp(), png() or tiff(), depending on your file-type preference), paired with dev.off(). For our purposes, jpeg() opens a plotting device that writes its output to a file at the path we give it. For example, this code will direct the next plotting object to a jpeg file called myplot.jpg located at "C://R//SAVEHERE":
jpeg(file = "C://R//SAVEHERE//myplot.jpg")
After running the plotting object, you need to be sure to turn off the plotting device you created (with the jpeg() command). To save a scatter plot of the vectors x versus y to the location described above, run these three lines:
jpeg(file = "C://R//SAVEHERE//myplot.jpg")
plot(x,y)
dev.off()
This code is a good building block for automatically saving to a folder inside a loop, but we still need to know how to dynamically create file names at which to save our plots. Suppose we have a vector of identifiers called names. Presumably, these identifiers mean something in your setting. For our working example, names is a vector of the letters A through Z (we are trying to produce a separate scatter plot for each letter).
We can use the file.path() and paste() commands to help us out here. I have found paste() to be incredibly useful for other applications as well. For example,
paste("myplot_", i, ".jpg", sep="")
produces the character string myplot_50.jpg when i = 50 and myplot_51.jpg when i=51. If i changes for each iteration in the loop, this paste() command will create a unique file name at each iteration.
We can get fancier with our pasting. If we have a vector of names, we can extract the ith name from the names vector with the paste command:
paste("myplot_", names[i], ".jpg", sep="")
The file.path() function helps us in a different way. It is a special paste function that helps us construct file paths really easily. For example,
mypath=file.path("C:", "R", "SAVEHERE", filename)
returns "C://R//SAVEHERE//filename" and stores it in an object called mypath. Replace filename with the paste command and we have a way to automatically generate file names in a folder called SAVEHERE.
Now, you have the tools to understand my example code of how to automatically generate and save plots to a folder. Make sure the folder exists before saving, but subject to that constraint, this procedure may make your life easier.
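As a sketch of how the pieces fit together (the data generated inside the loop are made up purely for illustration):
names = LETTERS   ## identifiers A through Z
for (i in seq_along(names)) {
  x = rnorm(50)                      ## fake data for the ith plot
  y = 2 * x + rnorm(50)
  mypath = file.path("C:", "R", "SAVEHERE",
                     paste("myplot_", names[i], ".jpg", sep = ""))
  jpeg(file = mypath)                ## open the jpeg device at that path
  plot(x, y, main = names[i])        ## the plot we want to keep
  dev.off()                          ## close the device to write the file
}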
R Bloggers
I recently found a great resource for R in the blogosphere, the R Bloggers Blog Aggregator. Basically, the site aggregates posts from a bunch of blogs about R (like this one!) into a giant feed of uses for R. If you are interested in learning more about R, the best way is to see how other people interact with it. What better way than to see what the R blogosphere has to say?
Monday, April 25, 2011
Merging Data Video Tutorial
Here's a video tutorial where I walk through some code that does what the previous post describes.
The FRED data are used extensively in macroeconomics. You might find these data useful for joining in graph fights in the blogosphere.
Sunday, April 24, 2011
Merging Multiple Data Files into One Data Frame
We often encounter situations where we have data in multiple files, at different frequencies and on different subsets of observations, but we would like to match them to one another as completely and systematically as possible. In R, the merge() command is a great way to match two data frames together.
Just read the two data frames into R
mydata1 = read.csv(path1, header=T)
mydata2 = read.csv(path2, header=T)
Then, merge
myfulldata = merge(mydata1, mydata2)
As long as mydata1 and mydata2 have at least one common column with an identical name (that allows matching observations in mydata1 to observations in mydata2), this will work like a charm. It also takes three lines.
What if I have 20 files with data that I want to match observation-to-observation? Assuming they all have a common column that allows merging, I would still have to read 20 files in (20 lines of code) and merge() works two-by-two... so I could merge the 20 data frames together with 19 merge statements like this:
mytempdata = merge(mydata1, mydata2)
mytempdata = merge(mytempdata, mydata3)
.
.
.
mytempdata = merge(mytempdata, mydata20)
That's tedious. You may be looking for a simpler way. If you are, I wrote a function to solve your woes called multmerge().* Here's the code to define the function:
multmerge = function(mypath){
  filenames = list.files(path=mypath, full.names=TRUE)   ## every file in the folder
  datalist = lapply(filenames, function(x){read.csv(file=x, header=TRUE)})   ## read each file
  Reduce(function(x,y) {merge(x,y)}, datalist)   ## merge them pairwise into one data frame
}
After running the code to define the function, you are all set to use it. The function takes a single argument, mypath: the path of a folder that contains all of the files you would like to read and merge together (and only those files). With this in mind, I have two tips:
- Before you use this function, my suggestion is to create a new folder in a short directory (for example, the path for this folder could be "C://R//mergeme") and save all of the files you would like to merge in that folder.
- In addition, make sure that the column that will do the matching is formatted the same way (and has the same name) in each of the files.
mymergeddata = multmerge("C://R//mergeme")
After running this command, you have a fully merged data frame with all of your variables matched to each other. Of course, most of the details in matching and merging data come down to making sure that the common column is specified correctly, but given that, this function can save you a lot of typing.
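As a toy illustration of that last point (with made-up data frames), merge() also lets you name the matching columns explicitly when they differ across files:
sales = data.frame(id = 1:3, revenue = c(10, 20, 30))
costs = data.frame(firm = c(1, 2, 3), cost = c(5, 8, 12))
merge(sales, costs, by.x = "id", by.y = "firm")   ## match id in sales to firm in costs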
*Maybe a function like this exists out there already, but I think it is entertaining to write helpful functions. I also trust the functions more if I write them myself and test them on my own problems.
Saturday, April 23, 2011
Dates in R and the First Day of the Month
I spent some time this morning learning about how R handles dates. I found this website to be a useful guide.
Imagine that your data are dates in a standard format [YEAR-MONTH-DAY (as in 2011-04-23) is one such format] and you want a vector of the dates that are the first day of the month for each observation in your data set. You might, for example, be interested in this (as an intermediate calculation) because you wish to match some S&P 500 returns data to monthly data. Frustratingly, the S&P 500 returns data are not reported on weekends or on holidays when the exchange is closed.
Here's a function that will return a vector of first-days-of-the-month if you give it a vector of dates. Maybe you will find this useful.
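The function itself is not reproduced here, so here is a sketch of one way to write it (the name firstDayOfMonth and the implementation are illustrative):
firstDayOfMonth = function(dates, format = "%Y-%m-%d") {
  d = as.Date(dates, format = format)   ## coerce the input to Date objects
  as.Date(format(d, "%Y-%m-01"))        ## reset the day-of-month to 01
}
firstDayOfMonth(c("2011-04-23", "2011-12-31"))
[1] "2011-04-01" "2011-12-01"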
Enjoy!
Labels: coding tips, data, fun stuff, potentially useful functions
Thursday, April 21, 2011
Good Looking Maps in R
I haven't yet tried this, but it looks like a nice package.
In one recent project I needed to draw several maps and visualize different kinds of geographical data on them. I found the combination of R, ggplot and the maps package extremely flexible and powerful, and it produces nice-looking, map-based visualizations.
Here is a short tutorial; monospace font indicates the code you need to run in R. You probably need some basic understanding of R to work through this tutorial.
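I have not worked through the tutorial myself, but a minimal sketch of the R/ggplot/maps combination (assuming the ggplot2 and maps packages are installed) looks something like this:
library(ggplot2)
library(maps)   ## supplies the map polygons used by map_data()
states = map_data("state")   ## US state outlines as a data frame of polygons
ggplot(states, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey90", colour = "grey40")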
Enjoy!
Tuesday, April 19, 2011
Common Data Creation Commands
Here is a video tutorial where I go through some of the most commonly used commands for creating and manipulating data. As soon as I want to do more than just run a single regression, I use these commands more than any other set of commands (you may have seen them in some of the other videos).
Here is the code I use in the video if you would like to try out these commands for yourself.
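As an illustrative stand-in for that code (the values are made up; this is not the exact script from the video):
x = c(1, 5, 9, 2)                  ## create a vector
m = matrix(1:6, nrow = 2)          ## a 2 x 3 matrix
mydf = data.frame(id = 1:4, x = x) ## a data frame built from vectors
mydf$y = 2 * mydf$x + rnorm(4)     ## add a new variable to the data frame
summary(mydf)                      ## quick check of the result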
Labels: basics, coding tips, matrix notation, useful functions, video tutorials