Showing posts with label matrix notation. Show all posts
Showing posts with label matrix notation. Show all posts

Monday, July 18, 2011

Avoiding Loops in R: An Example with Principal Minors

Yesterday, I found myself wanting to compute a large subset of the second order principal minors of a matrix (diagonal-preserving minors; the ones for which the rows and columns kept are the same). Don't judge me for wanting to do this, and bear with me, there is a lesson about R programming here (mostly for me, read the comments for testing).

If you don't know what a principal minor is or forgot what it was (because most of us don't spend much time with principal minors), here is a short refresher on how to get a second-order principal minor (other orders are possible and there's a lot of related theory, but let's try to focus on this one task for now).

Step 1: Deletion. Take any matrix. Delete all but two rows and all but two columns. For example, suppose the matrix is 5x5 and let's keep rows 1 and 3 and columns 2 and 4 (lots of other combinations are possible).

In the back of your mind, think how easy this is in R.

Step 2: Calculation. Once we delete the irrelevant parts of the big matrix to get our 2x2 submatrix, obtain the principal minor by taking the determinant of the resulting 2x2 matrix. For our example, we compute 3x2 - 3x1 = 3.

Both steps in this calculation are easy to implement in R. For the first step, merely use R's fantastic syntax for extracting rows and columns of matrices. If the matrix is called mat, we efficiently perform the first step by:

little_mat = mat[c(1,3), c(2,4)]

That's easy enough. With little_mat in hand, the second step is even easier.

pm = det(little_mat)

The real trick to this programming problem is picking out all of principal minors I want to compute (diagonal-preserving ones that keep the same-numbered rows and columns; little_mat won't do because c(1,3) is not c(2,4)), and doing it efficiently. I also don't want to rewrite my code as my matrix changes size.

So, what do we do? A really bad An idea that will eventually compute the right answer is to write this using for() loops.




Barring a coding error, this code should eventually work, but it will be slow (or fast depending on how you write it). If you want to run through this code many times (like I did yesterday), efficiency is at a premium and you should immediately think of faster techniques -- like using the apply() function (or preallocating your storage vector, which actually saves time).

Before moving to apply(), we need a quick way to enumerate all of the ways to keep two out of N rows. If we were counting, that's N choose 2, also known as a combination. Fortunately, this step has already been implemented in the gregmisc library with a function called combinations(). Conveniently, combinations(N, 2) will return a matrix with all of the ways to select two numbers less than or equal to N as its rows. There are other options, but this is the default, which is perfect for our application.

For example, here is some output for N=5.


With this way of computing the rows/ columns we want to keep, all we need to do is put our simple code for computing a second order principal minor into a function. Let's call it minor2(). Then, use apply() to apply this function to each row (MARGIN=1 is for rows) of the combinations matrix. Here's the complete set of code that will work if you have a matrix called mat for which you want the diagonal-second order principal minors.



Because we avoid doing loops, this method is much faster. For a big project, this added efficiency is worth it. More generally, this technique of using apply() to avoid repeatedly looping is a good lesson to apply to other programming problems.

Update: Thanks to Berend for going through the trouble of system.time() testing the code in this post (and spiceman for a good comment as well). Read on into the comments for more, but for a preview of what we learned: (1) The copying via c() is what slows the loop down, (2) apply is slightly slower than writing the loop yourself, (3) combn(), the combinations() alternative, is really slow, and (4) one should always stand behind system.time() methods before talking about the efficiency of a method.

I much appreciate the feedback. One of the reasons I have this blog is to learn more about R (and those comments were definitely informative).

Tuesday, April 19, 2011

Common Data Creation Commands

Here is a video tutorial where I go through some of the most commonly used commands in creating and manipulating data. As soon as I want to do more than just running a single regression, I use these commands more than any other set of commands (in some of the other videos, you may have seen these).



Here is the code I use in the video if you would like to try out these commands for yourself.

Tuesday, March 1, 2011

How Does R Deal with Categorical Explanatory Variables?

Here's a video tutorial where I demonstrate an answer to this question.



This video is designed to instill a baseline level of practical knowledge. There is more to how R treats factors in regression models. An interested reader should Google "R contrasts" for more information.

Here is the code I used in the video:



Here is a link to the data set.

Tuesday, May 11, 2010

Summary Statistics in R

There are plenty of useful techniques for manipulating data frame objects in R. This post summarizes some of these useful techniques, and gives code to implement them. Before running the code in this post, you should read in the data set (which is available for download here). For tips on how to read Stata ".dta" files into an R data frame, see my previous post.
## Print the entire data frame to the screen by typing its name ##

caschool.df

## Look just as the variable names ##

names(caschool.df)

## Examine a particular variable using the $ extractor ##

caschool.df$testscr

## You can also think of the data frame as a matrix ##

caschool.df[,4] ## Extracts the 4th column
caschool.df[4,] ## Extracts the 4th observation
caschool.df[4,4:18]

## The last command extracts the 4th through 18th columns
## of the 4th observation.
##
## Note 4:18 is R shorthand for
## c(4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)

## Compute summary statistics on the entire data frame ##
## or just one variable ##

mean(caschool.df) ## Returns NA for categorical
round(mean(caschool.df),2) ## Rounding returns easier
## to read formatting
var(caschool.df) ## Returns var-cov matrix
diag(var(caschool.df)) ## Returns just the variances
mean(caschool.df$testscr) ## Returns mean of testscr
var(caschool.df$testscr) ## Returns scalar variance

## Compute the summary statistics on columns 6 through 18 ##

mean(caschool.df[,6:18])
var(caschool.df[,6:18]) ## Returns var-cov matrix

## If you don't like typing the data frame's name every ##
## time you want to explore the variable, you can use ##
## the attach() command.

attach(caschool.df)

## Now, the variables in the data frame can be accessed ##
## without extracting them with the $

mean(testscr)


##-------------------------------------------------- ##
## Creating/Storing new variables ##
## Just using arithmetic definition of new variables ##
##-------------------------------------------------- ##

math_read_avg = (read_scr+math_scr)/2

## -------------------------------------------##
## You can attach it to the data frame ##
## In fact, for regression, you want to do so ##
## -------------------------------------------##

caschool.df=cbind(caschool.df,math_read_avg)

## -------------------------------------------------##
## ... and its name will be what you called it when ##
## you defined the variable ##
## -------------------------------------------------##

names(caschool.df)
caschool.df$math_read_avg

## ------------------------------------------------- ##
## Too many datasets in R's active memory can cause
## problems math_scr might be a variable name in
## multiple data frames ... especially if you work
## with a lot of data sets
##
## In case of ambiguity, R will write over the old
## variable name with the new definition.
##
## To avoid this problem, when you are done with an
## attached data set, you should use detach()
## ------------------------------------------------ ##

detach(caschool.df)

## ------------------------------------------------ ##
## An added note: I like to leave my workspace image
## clutter-free. After working on some code, I save
## my code in a text file, but I do *not* save my
## workspace image.
##
## Following this practice can help you avoid
## referencing a variable name that you created months
## ago for a separate project
## ------------------------------------------------ ##