The tilde operator is actually a function that returns an unevaluated expression, a type of language object. Modeling functions then interpret that expression in a way that differs from how the same operators behave on numeric objects.

The issue here is *how* formulas, and specifically the “+”, “:”, and “^” operators in them, are interpreted. (A side note: the correct statistical procedure would be to use the function `poly` when attempting to make higher order terms in a regression formula.) Within R formulas the infix operators “+”, “*”, “:” and “^” have entirely different meanings than when used in calculations with numeric vectors. In a formula the tilde (`~`) separates the left hand side from the right hand side. The `^` and `:` operators are used to construct interactions, so `x` = `x^2` = `x^3` rather than becoming the perhaps expected mathematical powers. (A variable interacting with itself is just the same variable.) If you had typed `(x+y)^2` the R interpreter would have produced (for its own good internal use) not the mathematical `x^2 + 2xy + y^2` but rather the symbolic `x + y + x:y`, where `x:y` is an interaction term without its main effects. (The `^` gives you both main effects and interactions.)

```
?formula
```
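To check how R expands these operators, you can ask `terms()` for the term labels of a formula. This is a minimal sketch; `rhs_terms` is a hypothetical helper name introduced only for this illustration, and the variables `x`, `y` and `z` do not need to exist.

```r
# Hypothetical helper: list the terms R derives from a formula's right-hand side.
rhs_terms <- function(f) attr(terms(f), "term.labels")

rhs_terms(y ~ x^2)        # "x": a variable interacting with itself
rhs_terms(y ~ (x + z)^2)  # "x" "z" "x:z"
rhs_terms(y ~ x * z)      # same expansion as (x + z)^2
rhs_terms(y ~ x:z)        # "x:z" only, no main effects
```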

The `I()` function acts to convert the argument to “as is”, i.e. what you expect. So `I(x^2)` would return a vector of values raised to the second power.
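A small sketch of the difference, using a made-up vector `x` and `model.matrix()` to show which columns each formula produces:

```r
x <- 1:5  # illustrative data, not from the question

# Outside a formula, I() just tags the value with class "AsIs".
squares <- I(x^2)
class(squares)     # "AsIs"
unclass(squares)   # 1 4 9 16 25

# Inside a formula, I() protects ^ from the formula interpretation.
colnames(model.matrix(~ x + x^2))      # "(Intercept)" "x"
colnames(model.matrix(~ x + I(x^2)))   # "(Intercept)" "x" "I(x^2)"

# poly() builds polynomial columns directly (orthogonal ones by default).
ncol(model.matrix(~ poly(x, 2)))       # 3: intercept plus two polynomial columns
```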

The `~` should be thought of as saying “is distributed as” or “is dependent on” when seen in regression functions. The `~` is an infix function in its own right. You can see that `LHS ~ RHS` is almost shorthand for `formula(LHS, RHS)` by typing this at the console:

```
`~`(LHS,RHS)
#LHS ~ RHS
class( `~`(LHS,RHS) )
#[1] "formula"
identical( `~`(LHS,RHS), as.formula("LHS~RHS") )
#[1] TRUE # cannot use `formula` since it interprets its first argument
```
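Because a formula is a language object, it can also be taken apart like any other call. The following sketch uses placeholder names `y`, `x` and `z`, none of which need to exist:

```r
f <- y ~ x + z   # nothing is evaluated here

typeof(f)      # "language"
class(f)       # "formula"
f[[1]]         # `~`, the function being called
f[[2]]         # y, the left-hand side
f[[3]]         # x + z, the right-hand side
all.vars(f)    # "y" "x" "z"

# A one-sided formula has no left-hand side, so it has only two elements.
length(~ x + z)   # 2
```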

In regression functions, the error term in a model description takes whatever form that regression function presumes, or the form specifically called for in the `family` parameter. The mean for the base level will generally be labelled `(Intercept)`. The function context and arguments may also further determine a link function, such as log or logit, from the `family` value, and it is also possible to have a non-canonical family/link combination.
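As a rough illustration, the family objects in the stats package report their link, and a fitted model shows the `(Intercept)` label. The choice of the built-in `mtcars` data and of `am ~ wt` is only an assumption made for this example.

```r
# Each family object carries a default (canonical) link.
gaussian()$link   # "identity"
binomial()$link   # "logit"
poisson()$link    # "log"

# A non-canonical combination: binomial errors with a probit link.
binomial(link = "probit")$link   # "probit"

# The base-level mean shows up as "(Intercept)" in the coefficients.
fit <- glm(am ~ wt, data = mtcars, family = binomial)
names(coef(fit))   # "(Intercept)" "wt"
```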

The “+” symbol in a formula is not really adding two variables; it is usually an implicit request to calculate regression coefficient(s) for that variable in the context of the other variables on the RHS of the formula. The regression functions use `model.matrix`, and that function will recognize the presence of factors or character vectors in the formula and build a matrix that expands the levels of the discrete components of the formula.
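A minimal sketch of that expansion, using a made-up data frame `d` and assuming R's default treatment contrasts:

```r
# Illustrative data: a three-level factor and a numeric variable.
d <- data.frame(
  g = factor(c("a", "b", "c", "a")),
  x = c(1.5, 2.0, 3.5, 4.0)
)

# The factor g becomes indicator columns for its non-base levels;
# level "a" is absorbed into the intercept.
mm <- model.matrix(~ g + x, data = d)
colnames(mm)   # "(Intercept)" "gb" "gc" "x"
```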

In plotting functions a formula basically reverses the usual `(x, y)` order of arguments that the plot function takes. A `plot.formula` method was written so that formulas could be used as a more “mathematical” mode of communicating with R. In `graphics::plot.formula`, `curve`, and the ‘lattice’ and ‘ggplot2’ functions, the formula governs how multiple factors or numeric vectors are displayed and “facetted”.
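For base graphics, the reversal looks roughly like this. The sketch assumes the built-in `mtcars` data and opens a null PDF device only so that it can run without writing a file; at an interactive console you would omit the `pdf(NULL)` and `dev.off()` lines.

```r
pdf(NULL)  # assumption: null device, nothing is written to disk

plot(mtcars$wt, mtcars$mpg)        # default method: x first, then y
plot(mpg ~ wt, data = mtcars)      # formula method: y ~ x, same scatterplot

# With a factor on the right-hand side the formula method draws boxplots.
plot(mpg ~ factor(cyl), data = mtcars)

dev.off()
```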

The overloading of the “+” operator is discussed in the comments below. It is also done in the plotting packages ggplot2 and gridExtra, where it separates functions that deliver object results; there it acts as a pass-through and layering operator. Some aggregation functions have a formula method which uses “+” as an “arrangement” and grouping operator.
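One example of the grouping use is the formula method of `aggregate()`. This sketch assumes the built-in `mtcars` data; the object names `by_group` and `both` are made up for the illustration.

```r
# "+" on the right-hand side lists grouping variables; nothing is added.
by_group <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = mean)
names(by_group)   # "cyl" "gear" "mpg"

# cbind() on the left-hand side summarises several variables at once.
both <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
names(both)       # "cyl" "mpg" "hp"
```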