# In R formulas, why do I have to use the I() function on power terms, like y ~ I(x^3)?

The tilde operator is actually a function that returns an unevaluated expression, a type of language object. Modeling functions then interpret that expression differently from the way the same operators are interpreted when they are applied to numeric objects.
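A small sketch of what "language object" means in practice. The names `f`, `y`, and `x` are illustrative only; none of them has to exist as a variable, because the formula is stored rather than evaluated.

```r
f <- y ~ x        # neither y nor x needs to exist; nothing is evaluated
class(f)          # "formula"
typeof(f)         # "language"
f[[1]]            # the function symbol `~`
f[[2]]            # left-hand side, the symbol y
f[[3]]            # right-hand side, the symbol x
all.vars(f)       # "y" "x"
```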

The issue here is how formulas, and specifically the “+”, “*”, “:”, and “^” operators in them, are interpreted. (A side note: the correct statistical procedure would be to use the function `poly` when attempting to make higher-order terms in a regression formula.) Within R formulas these infix operators have entirely different meanings than when used in calculations with numeric vectors. In a formula the tilde (`~`) separates the left-hand side from the right-hand side. The `^` and `:` operators are used to construct interactions, so `x` = `x^2` = `x^3` rather than the mathematical powers you might expect. (A variable interacting with itself is just the same variable.) If you had typed `(x+y)^2`, R would have produced (for its own internal use) not the mathematical `x^2 + 2xy + y^2` but rather the symbolic `x + y + x:y`, where `x:y` is an interaction term without its main effects. (The `^` gives you both main effects and interactions.)

```
?formula
```
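One way to see the expansion for yourself is to ask `terms()` for the term labels. This is only a sketch; the variable names `x`, `y`, and `z` are placeholders and do not need to exist, since `terms()` works on the symbols alone.

```r
# term.labels shows how each formula was expanded
attr(terms(y ~ x^2), "term.labels")          # "x"
attr(terms(y ~ (x + z)^2), "term.labels")    # "x" "z" "x:z"
attr(terms(y ~ x * z), "term.labels")        # "x" "z" "x:z"
attr(terms(y ~ x:z), "term.labels")          # "x:z"
attr(terms(y ~ x + I(x^2)), "term.labels")   # "x" "I(x^2)"
```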

The `I()` function acts to convert the argument to “as.is”, i.e. its contents are evaluated as ordinary R arithmetic, which is what you expect. So `I(x^2)` would return a vector of values raised to the second power.
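The difference shows up directly in the coefficient names of a fitted model. The following is a minimal sketch on simulated data; the data frame `d`, the seed, and the fit names are made up for illustration.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
d <- data.frame(x = 1:20)
d$y <- 2 + 0.5 * d$x^2 + rnorm(20)   # simulated quadratic relationship

fit_plain <- lm(y ~ x^2, data = d)         # `^` is formula syntax: same as y ~ x
fit_asis  <- lm(y ~ I(x^2), data = d)      # I() protects the arithmetic
fit_poly  <- lm(y ~ poly(x, 2), data = d)  # orthogonal polynomial of degree 2

names(coef(fit_plain))   # "(Intercept)" "x"
names(coef(fit_asis))    # "(Intercept)" "I(x^2)"
names(coef(fit_poly))    # "(Intercept)" "poly(x, 2)1" "poly(x, 2)2"
```

Note that `fit_plain` silently fits a straight line, which is why the missing `I()` is easy to overlook.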

The `~` should be thought of as saying “is distributed as” or “is dependent on” when seen in regression functions. The `~` is an infix function in its own right. You can see that `LHS ~ RHS` is almost shorthand for `formula(LHS, RHS)` by typing this at the console:

```
`~`(LHS,RHS)
#LHS ~ RHS

class( `~`(LHS,RHS) )
#[1] "formula"

identical( `~`(LHS,RHS), as.formula("LHS~RHS") )
#[1] TRUE   # cannot use `formula` since it interprets its first argument
```

In regression functions, the error term in a model description takes whatever form the regression function presumes, or whatever is specifically requested through the `family` argument. The mean for the base level will generally be labelled `(Intercept)`. The function context and arguments may also determine a link function such as log() or logit() from the `family` value, and it is also possible to have a non-canonical family/link combination.
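As a rough illustration of the family/link point, here is a sketch using the built-in `mtcars` data; the choice of `am ~ wt` and the object names are arbitrary.

```r
# Canonical logit link for a binomial response (am is coded 0/1 in mtcars)
fit_logit  <- glm(am ~ wt, family = binomial, data = mtcars)

# Non-canonical combination: same family, probit link
fit_probit <- glm(am ~ wt, family = binomial(link = "probit"), data = mtcars)

family(fit_logit)$link    # "logit"
family(fit_probit)$link   # "probit"
names(coef(fit_logit))    # "(Intercept)" "wt"
```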

The “+” symbol in a formula is not really adding two variables but is usually an implicit request to calculate one or more regression coefficients for that variable in the context of the rest of the variables on the RHS of the formula. The regression functions use `model.matrix`, and that function will recognize the presence of factors or character vectors in the formula and build a matrix that expands the levels of the discrete components of the formula.
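The expansion can be inspected by calling `model.matrix` directly. This is a sketch with a made-up data frame `d`, assuming the default treatment contrasts.

```r
d <- data.frame(
  y = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.4),
  x = 1:6,
  g = factor(c("a", "b", "c", "a", "b", "c"))
)

mm <- model.matrix(y ~ x + g, data = d)
colnames(mm)   # "(Intercept)" "x" "gb" "gc"
dim(mm)        # 6 rows, 4 columns
```

The three-level factor `g` becomes two indicator columns, with level "a" absorbed into the intercept as the base level.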

In plotting functions the formula basically reverses the usual `(x, y)` order of arguments that the plot function takes. A `plot.formula` method was written so that formulas could be used as a more “mathematical” mode of communicating with R. In `graphics::plot.formula`, `curve`, and the ‘lattice’ and ‘ggplot’ functions, it governs how multiple factors or numeric vectors are displayed and “facetted”.
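A short sketch of the reversed argument order, using the built-in `mtcars` data; both calls should draw the same scatterplot of `mpg` against `wt`.

```r
f <- mpg ~ wt                   # response ~ predictor
plot(f, data = mtcars)          # formula interface: y on the left of the tilde
plot(mtcars$wt, mtcars$mpg)     # default interface: x first, then y
```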

The “+” operator is also overloaded in the plotting packages ggplot2 and gridExtra, where it separates functions that deliver object results. There it acts as a pass-through and layering operator. Some aggregation functions have a formula method that uses “+” as an “arrangement” and grouping operator.
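For the aggregation case, the formula method of `aggregate` is one example. Here is a sketch with the built-in `ToothGrowth` data, where “+” on the right-hand side lists the grouping variables rather than adding them.

```r
# Mean tooth length for every combination of supplement type and dose
agg <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
names(agg)   # "supp" "dose" "len"
nrow(agg)    # 6: two supplement types times three doses
```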
