Sunday, June 5, 2011

An application of aggregate() and merge()

Today, I encountered an interesting problem while processing a data set of mine. My data have observations on businesses that are repeated over time. My data set also contains information on longitude and latitude of the business location, but unfortunately, my source only started providing the longitude and latitude information in the middle of the data set. As a result, the information is missing for the first few periods for every business, but across businesses, I have latitude and longitude information for virtually all of my data set.

As long as the business hasn't shifted location, this isn't a problem. I can just substitute the lat-long information from the later periods into the earlier periods for each business. This sounds simple enough to do, but when one has a data set with 70,000 observations, an automatic process saves a lot of time and can help avoid mistakes.

Let's start by building common ground. Here is a script that you can use to simulate a data set to follow along.



Once you run this script, the data frame mydat is a balanced panel of 125 businesses observed over 25 time periods (my actual data set is an unbalanced panel and the solution I present works just as well for unbalanced as with balanced. It is just easier to simulate a balanced panel). Each observation has a business identifier (BusID), a time period identifier (per), as well as simulated data on latitude and longitude.

To see the problem with this data set, examine the data for the first business using the command mydat[mydat$BusID==1,]. Here's a screenshot of the output.



This problem occurs throughout the data set and we would like to just substitute the lat-long values we have for the right missing ones.

How did I solve my NA problem? I used aggregate(), a function that quickly and conveniently collapses a data frame by a factor or list of factors. With aggregate(), you can take the mean, compute the max, compute the min, sum the values or compute the standard deviation by the levels of the collapsing factor. The aggregate() function also has a na.rm option that allows us to easily ignore the NA values (useful here). In our example, we could collapse by taking the mean or computing the max or the min; the result will be the same because latitude and longitude do not vary for the same business over time.

Let's collapse by taking the mean withing each level of BusID. The result of my use of aggregate() is a data frame that upon merging with the original data frame (using merge()), fills in the missing latitude and longitude values. The following code implements this strategy to fill in the NAs in our simulated data set.



Before leaving you to have fun with aggregate() and merge(), several points are worth mentioning.

First, the aggregate() function requires that the collapsing factor(s) is (are) stored in a list. This is why the code uses list(BusID=mydat$BusID) when mydat$BusID would seem more natural. The list() part is necessary to get aggregate() to work.

Second, it is useful for merging in the second step to give the attribute in the list the same name as the variable name in the original data frame. The BusID= part of the list(BusID=mydat$BusID) command is a convenient way to prepare the resulting data frame for merging.

Third, my code uses -c(3,4) to drop the 3rd and 4th columns of the original data frame (the ones with the original lat-long data) before merging it with the collapsed latlong data frame. This is important so that the merge() command doesn't get confused about which columns to use for matching. If you leave these columns in mydat, R's merge will try to match the values in those columns as well (because they have a common name with those columns in latlong).

Finally, the data in this example are in a different order now that we merged on BusID. If this bothers you, you can use order to sort the data set by time period as in

myfinaldat=mycleandat[order(mycleandat$per),]

As this last step is purely a cosmetic change to the data set that will not affect subsequent analyses, you can leave it out.

Although this example was my solution to a rather specific problem, it illustrates how to use some useful commands. To the extent that these commands are widely applicable to other problems, I hope you find uses for these commands in related applications.

2 comments:

  1. hey there, love your blog.. any idea what happened to timeseriesireland? he had great R blog also, was really useful stuff.

    ReplyDelete
  2. Thanks for the compliment.

    timeseriesireland had interesting stuff, indeed. I don't know all the details about what happened, but it is too bad he is not still R blogging.

    ReplyDelete