Friday, April 29, 2011

Forming Formulas

One of the first functions a new R user learns how to use is the lm() command, which involves stating the model formula.

lm(y~x1+x2, data=mydata)

After a while, this just becomes a natural way to say "I want a regression of y on x1 and x2 using mydata." Even though it is natural, the underlying structure of the formula is not as it first appears. This becomes apparent as soon as you need to work with formula objects when writing your own function.

Does ~ define a string? No. You can ask R directly whether the formula is a string (of class "character") using is.character().


As we will soon see, a formula isn't easily coerced into a string either.

Should we think of ~ as a fancy = sign? No. R doesn't think of ~ as defining an equation to be evaluated.

y ~ x1 + x2 + x3

It is better to think of ~ as defining the entire statement as an object with a special class called formula, which conveys the idea of the model to other functions in R.

For the rest of the post, I will show some calculations that illustrate more fundamentally what a formula object is. For concreteness, imagine you have a formula object with the general form y~x1+x2+z1+z2+z3.

You could define this object with the command:

form.ex = y~x1+x2+z1+z2+z3

Suppose our goal is to extract the first two explanatory variables (x1 and x2) and use them in a linear model for the response (y). If you're reading this for help, you likely have a more noble goal in mind. Take this as an illustrative example.

To appreciate the power of a method, it is useful to see the inelegant code we are avoiding. The inelegant code also shines some light on how a formula object differs from a string.

An inelegant method

We could gain access to the elements of the formula by coercing it into a character string using as.character(). If we do this in our example, we obtain a character vector of length three:

[1] "~"
[2] "y"
[3] "x1 + x2 + z1 + z2 + z3"

Here, we see that the RHS of the formula is stored in the third element of this vector. The response is in the second element of the vector. Let's store these into vectors we can easily reference and use.

rhs= as.character(form.ex)[3]
resp = as.character(form.ex)[2]

Next, we can splice the rhs vector to obtain the individual variable names using the strsplit command. Note: we want to split the vector apart using a delimiter " + " with the spaces. Because the + is an operator, we need to use " \\+ " instead. In addition, strsplit() returns a list of length two. We really want the first element of the list... (remember, this is the inelegant way).

rhs.sp = strsplit(rhs," \\+ ")[[1]]
[1] "x1" "x2" "z1" "z2" "z3"

Now, we have a vector of variable names for the right hand side of the equation. As our goal is to construct a formula object relating the response to the first two explanatory variables, extract the first two elements of this vector. rhs.sp[1:2]
[1] "x1" "x2"

And, paste this right hand side of the equation to the response variable name in a formula style using the following combination of paste() commands.

[1] "y ~ x1+x2"

Great. The only problem? The object short.form is not a formula object. It is a character string. This won't be a problem if we input it into something like lm(), which will coerce it to a formula object before using it.

lm(short.form, data=mkt.df)

lm(formula = short.form, data = mkt.df)

(Intercept) x1 x2
-25.329 2.335 4.239

If you want to do the formula coercion yourself, you can use the as.formula() command as follows.

y ~ x1 + x2

A more elegant method

The all.vars() command does the splitting, splicing and extracting that comprised the bulk of what we did above. Starting with the formula object form.ex, we could create a vector of variable names in our model statement as:

vars.ex = all.vars(form.ex)
[1] "y" "x1" "x2" "z1" "z2" "z3"

Now, just define the rhs and resp vectors as we did in the less elegant method.

rhs = vars.ex[2:3]
resp = vars.ex[1]

Is there a better way recombine these vectors than to create our own paste collapser? Yes! The reformulate() function makes it easy.

myform=reformulate(rhs.av, response=resp)
y ~ x1 + x2

And, for good measure, let's see what happens in the lm() command:

lm(myform, data=mkt.df)

lm(formula = myform, data = mkt.df)

(Intercept) x1 x2
-25.329 2.335 4.239

1 comment:

  1. The all.vars approach does not necessarily return what you expect for the case below, whereas your "inelegant" method does. Perhaps you have a suggestion for an elegant solution?

    > all.vars(y~x1+I(x1^2))
    [1] "y" "x1" "x1"