Chapter 21 ANOVA is just regression
You’ll hear the title of this chapter from certain statistics instructors and fans of linear regression. Sometimes ANOVA is taught completely in the context of linear regression, and not in the traditional way we have used in this book, with ratios of mean squares and F distributions. I think the main difference between the two methods of understanding and teaching ANOVA is cultural; the traditional way emphasizes statistical significance in the differences of the means, and the regression way emphasizes how those means are calculated.
In any case, I’ve found a good description of why “ANOVA is just regression” lacking. So this is my best shot at it.
21.1 3 Equations and 3 Unknowns
First we’ll start with a crash course on linear regression. We begin with basic matrix multiplication. For example, here’s how you compute the product of a 3x3 matrix with a 3x1 column vector. We’ll call the 3x3 matrix A, the 3x1 column vector x, and the product y. This is how to multiply A times x to get y:
$$
\underbrace{\begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ 1 & 0 & 1 \end{bmatrix}}_{A}
\times
\underbrace{\begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix}}_{x}
=
\begin{bmatrix} (1)(0)+(1)(1)+(1)(2) \\ (1)(0)+(2)(1)+(3)(2) \\ (1)(0)+(0)(1)+(1)(2) \end{bmatrix}
=
\underbrace{\begin{bmatrix} 3 \\ 8 \\ 2 \end{bmatrix}}_{y}
$$
If it has been a while, take a look at how this works. When multiplying a matrix by a vector, we take the dot product of each row of the matrix A with the column vector x.

Another way to think about it is that the product of A and x is equal to the sum of the columns of A scaled by the values in the vector x. In this example, the result is 0 times the first column plus 1 times the second column plus 2 times the third column of A. In other words, the product of a matrix A and a column vector x is a linear combination of the columns of A weighted by the values in x.
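If you'd like to check this on a computer, here's a minimal R sketch of the same multiplication, using the A and x from the example above (R's `%*%` operator does matrix multiplication):

```r
# the 3x3 matrix A and the column vector x from the example
A <- matrix(c(1, 1, 1,
              1, 2, 3,
              1, 0, 1), nrow = 3, byrow = TRUE)
x <- c(0, 1, 2)

# %*% is R's matrix multiplication operator
y <- A %*% x
y   # should be 3, 8, 2
```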
Suppose that you know the values in the matrix A and the answer y, but not the values in the vector x. If this were algebra with numbers, we’d just solve for x by dividing both sides of the equation by A. That is, if
$$Ax = y, \quad \text{then} \quad x = \frac{y}{A}$$
But since A is a matrix, instead of dividing by A we compute the inverse of A, which we then multiply on both sides of the equation $Ax = y$. The inverse of A, written as $A^{-1}$, is the matrix for which $A^{-1}A = I$, where $I$ is the identity matrix (a matrix of all zeros except for 1's along the diagonal). Remember, $I$ has the property that $Ix = x$.
Multiplying both sides of $Ax = y$ by $A^{-1}$ gives:

$$A^{-1}Ax = A^{-1}y$$

$$Ix = A^{-1}y$$

$$x = A^{-1}y$$
For our example, the inverse of A is:
$$
A^{-1} = \begin{bmatrix} 1 & -0.5 & 0.5 \\ 1 & 0 & -1 \\ -1 & 0.5 & 0.5 \end{bmatrix}
$$
You can verify that $A^{-1}A = I$:

$$
\underbrace{\begin{bmatrix} 1 & -0.5 & 0.5 \\ 1 & 0 & -1 \\ -1 & 0.5 & 0.5 \end{bmatrix}}_{A^{-1}}
\times
\underbrace{\begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ 1 & 0 & 1 \end{bmatrix}}_{A}
=
\underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{I}
$$
So, if we multiply $A^{-1}$ by y we get our original value of x:

$$
\underbrace{\begin{bmatrix} 1 & -0.5 & 0.5 \\ 1 & 0 & -1 \\ -1 & 0.5 & 0.5 \end{bmatrix}}_{A^{-1}}
\times
\underbrace{\begin{bmatrix} 3 \\ 8 \\ 2 \end{bmatrix}}_{y}
=
\begin{bmatrix} (1)(3)+(-0.5)(8)+(0.5)(2) \\ (1)(3)+(0)(8)+(-1)(2) \\ (-1)(3)+(0.5)(8)+(0.5)(2) \end{bmatrix}
=
\underbrace{\begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix}}_{x}
$$

This shows that you can use $A^{-1}$ to solve $Ax = y$ for $x$.
Where did $A^{-1}$ come from? You may have computed a matrix inverse in the past using methods like the Gauss-Jordan method or through numerical methods, which is how your computer does it.
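In R, the `solve` function finds the inverse numerically for you. Here's a quick sketch using the A and y from this example:

```r
# numerical inverse of A (this is what your computer does for you)
A_inv <- solve(A)
round(A_inv, 2)

# recover x by multiplying the inverse by y
y <- c(3, 8, 2)
A_inv %*% y        # should give back 0, 1, 2
# (solve(A, y) would solve for x in a single step)
```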
21.2 5 Equations and 2 Unknowns
Now consider the case where A is not square, but has more rows than columns. For example, consider this equation for $Ax = y$:

$$
\underbrace{\begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 3 & 0 \\ 4 & 1 \\ 5 & 0 \end{bmatrix}}_{A}
\times
\underbrace{\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}}_{x}
=
\underbrace{\begin{bmatrix} 0.5 \\ 0.5 \\ 1.5 \\ 1.4 \\ 2.5 \end{bmatrix}}_{y}
$$
To find the values $x_1$ and $x_2$ we need to satisfy 5 equations with only 2 unknowns. This is an overconstrained set of equations and may not have an exact solution. Indeed, for this example there are no values of $x_1$ and $x_2$ that satisfy all 5 equations.
The best we can do is find values $\hat{x}_1$ and $\hat{x}_2$ so that the product $A\hat{x}$ is as close as possible to the correct answer (we pronounce $\hat{x}$ as 'x-hat'). By close, we mean the answer that minimizes the sum of squared differences between $A\hat{x} = \hat{y}$ and $y$. This is called the least-squares solution to the problem. The process of finding these best values is called linear regression.
Matrix notation is really just shorthand for writing out systems of equations. For this example, linear regression is the process of finding the values of $\hat{x}_1$ and $\hat{x}_2$ that make the following sum as small as possible:

$$
(1\hat{x}_1 + 0\hat{x}_2 - 0.5)^2 + (2\hat{x}_1 + 1\hat{x}_2 - 0.5)^2 + (3\hat{x}_1 + 0\hat{x}_2 - 1.5)^2 + (4\hat{x}_1 + 1\hat{x}_2 - 1.4)^2 + (5\hat{x}_1 + 0\hat{x}_2 - 2.5)^2
$$
In the first example, when A was a 3x3 matrix, we solved the equation by computing the inverse of A, $A^{-1}$. In this second example A is not a square matrix, so it doesn't have an inverse. Instead, there's something called a pseudoinverse (sometimes called the Moore-Penrose generalized inverse) that can be used to find the least-squares solution. Like the true inverse, the pseudoinverse, denoted $A^+$, also has the property that $A^+A = I$.
$A^+$ is calculated by:

$$A^+ = (A^TA)^{-1}A^T$$

where $A^T$ is the transpose of A (the matrix A with rows and columns switched).
Calculating $A^+$ would be a nightmare by hand, but there are algorithms that allow computers to find it. For our example, $A^+$ is:

$$
A^+ = \begin{bmatrix}
0.027 & -0.027 & 0.0811 & 0.027 & 0.1351 \\
-0.0811 & 0.5811 & -0.2432 & 0.4189 & -0.4054
\end{bmatrix}
$$
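Here's a minimal R sketch that computes $A^+$ straight from the formula above, using `t()` for the transpose and `solve()` for the inverse of $A^TA$:

```r
# the 5x2 matrix A and the vector y from this example
A <- matrix(c(1, 0,
              2, 1,
              3, 0,
              4, 1,
              5, 0), nrow = 5, byrow = TRUE)
y <- c(0.5, 0.5, 1.5, 1.4, 2.5)

# pseudoinverse: (A^T A)^(-1) A^T
A_plus <- solve(t(A) %*% A) %*% t(A)
round(A_plus, 4)
```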
Analogous to what we did when A was square, we can use the pseudoinverse of A to find the least-squares solution, $\hat{x}$, by multiplying both sides of the equation $A\hat{x} = y$ by $A^+$ on the left:

$$A^+A\hat{x} = A^+y$$

Since $A^+A = I$,

$$\hat{x} = A^+y$$
For our example:
$$
\underbrace{\begin{bmatrix}
0.027 & -0.027 & 0.0811 & 0.027 & 0.1351 \\
-0.0811 & 0.5811 & -0.2432 & 0.4189 & -0.4054
\end{bmatrix}}_{A^+}
\times
\underbrace{\begin{bmatrix} 0.5 \\ 0.5 \\ 1.5 \\ 1.4 \\ 2.5 \end{bmatrix}}_{y}
=
\underbrace{\begin{bmatrix} 0.4973 \\ -0.5419 \end{bmatrix}}_{\hat{x}}
$$
So, if $\hat{x}_1 = 0.4973$ and $\hat{x}_2 = -0.5419$, then the product $A\hat{x}$ gives us a vector, which we call $\hat{y}$, that is as close to y (in terms of least squares) as possible. For our example, $A\hat{x} = \hat{y}$:

$$
\underbrace{\begin{bmatrix} 1 & 0 \\ 2 & 1 \\ 3 & 0 \\ 4 & 1 \\ 5 & 0 \end{bmatrix}}_{A}
\times
\underbrace{\begin{bmatrix} 0.4973 \\ -0.5419 \end{bmatrix}}_{\hat{x}}
=
\underbrace{\begin{bmatrix} 0.4973 \\ 0.4527 \\ 1.4919 \\ 1.4473 \\ 2.4865 \end{bmatrix}}_{\hat{y}}
$$
Notice that the values in $\hat{y}$ are pretty close to the desired values of y:

$$
y = \begin{bmatrix} 0.5 \\ 0.5 \\ 1.5 \\ 1.4 \\ 2.5 \end{bmatrix}
$$
How close? In statistics we like to measure differences as sums of squared deviations. The sum of squared differences between y and $\hat{y}$ is:

$$
(0.5 - 0.4973)^2 + (0.5 - 0.4527)^2 + ... + (2.5 - 2.4865)^2 = 0.00473
$$
You can search as long and hard as you want, but you won't find a pair of values that does better than $\hat{x}_1 = 0.4973$ and $\hat{x}_2 = -0.5419$: no other pair gives a sum of squared deviations smaller than 0.00473.
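Here's a sketch of this least-squares solution in R, continuing with the `A`, `A_plus`, and `y` defined above, along with a cross-check against R's own `lm` function:

```r
# least-squares solution via the pseudoinverse
x_hat <- A_plus %*% y        # should be about 0.4973 and -0.5419
y_hat <- A %*% x_hat         # predicted values
sum((y - y_hat)^2)           # should be about 0.00473

# cross-check: lm with no intercept, using the columns of A as the regressors
coef(lm(y ~ 0 + A))
```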
If you think of the matrix multiplication in $Ax$ as computing a linear combination of the columns of A weighted by x, then linear regression is finding the 'weights' for each column of A that give us an answer that is closest to y. Each column of A is called a regressor, and the values of $\hat{x}$ that give you the least-squares solution are called the regression coefficients (sometimes called beta weights, though that term should really only be used for 'standardized' linear regression).
21.3 Predicting students' heights from mothers' heights
When we think of linear regression we often think of fitting straight lines to a scatterplot like we did in the last chapter on correlation and regression. We’ll go through a similar example from that chapter on the heights of women and their mothers, but in the context of linear algebra and the pseudoinverse described above. From the survey, I’ve chosen women in the class that were born in 1999 to keep the sample size smallish.
Here's some standard R code to pull the students' heights into `student` and their mothers' heights into `mother`:
```r
# load the class survey
survey <- read.csv("http://www.courses.washington.edu/psy315/datasets/Psych315W21survey.csv")
survey$mheight <- as.numeric(survey$mheight)

# women born in 1999 with non-missing heights
id <- survey$year == "1999" & survey$gender == "Woman" &
  !is.na(survey$height) & !is.na(survey$mheight) & !is.na(survey$year)

student <- survey$height[id]
mother <- survey$mheight[id]
```
Here’s a table of their heights:
student | mother |
---|---|
66 | 62 |
62 | 60 |
67 | 65 |
62 | 66 |
62 | 60 |
62 | 64 |
64 | 63 |
61 | 65 |
62 | 61 |
61 | 60 |
63 | 61 |
62 | 66 |
62 | 61 |
67 | 66 |
66 | 62 |
68 | 68 |
65 | 60 |
64 | 69 |
66 | 66 |
65 | 66 |
67 | 72 |
61 | 63 |
68 | 63 |
64 | 60 |
67 | 64 |
67 | 65 |
65 | 62 |
68 | 65 |
63 | 65 |
66 | 69 |
65 | 67 |
66 | 65 |
63 | 64 |
Suppose you want to come up with a model that predicts each student’s height based on their mother’s height. A simple model would be a linear model like this:
$$\text{student} = x_1 + x_2 \times \text{mother}$$

where $x_1$ and $x_2$ are constants, or 'free parameters'.
We can write this linear model as a matrix multiplication like this:
$$
\underbrace{\begin{bmatrix}
1 & 62 \\ 1 & 60 \\ 1 & 65 \\ 1 & 66 \\ 1 & 60 \\ 1 & 64 \\ 1 & 63 \\ 1 & 65 \\ 1 & 61 \\ 1 & 60 \\ 1 & 61 \\
1 & 66 \\ 1 & 61 \\ 1 & 66 \\ 1 & 62 \\ 1 & 68 \\ 1 & 60 \\ 1 & 69 \\ 1 & 66 \\ 1 & 66 \\ 1 & 72 \\ 1 & 63 \\
1 & 63 \\ 1 & 60 \\ 1 & 64 \\ 1 & 65 \\ 1 & 62 \\ 1 & 65 \\ 1 & 65 \\ 1 & 69 \\ 1 & 67 \\ 1 & 65 \\ 1 & 64
\end{bmatrix}}_{A}
\times
\underbrace{\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}}_{x}
=
\underbrace{\begin{bmatrix}
66 \\ 62 \\ 67 \\ 62 \\ 62 \\ 62 \\ 64 \\ 61 \\ 62 \\ 61 \\ 63 \\
62 \\ 62 \\ 67 \\ 66 \\ 68 \\ 65 \\ 64 \\ 66 \\ 65 \\ 67 \\ 61 \\
68 \\ 64 \\ 67 \\ 67 \\ 65 \\ 68 \\ 63 \\ 66 \\ 65 \\ 66 \\ 63
\end{bmatrix}}_{\text{student}}
$$
Here the first column of A is all 1's and the second column contains the mothers' heights. See how this matrix multiplication performs the calculation in the linear model. For each student,

$$x_1 + \text{mother} \times x_2 = \text{student}$$
Notice how this is a set of 33 equations (# of students) with only 2 unknowns. We can use linear regression to find the values of $x_1$ and $x_2$ that minimize the sum of squared deviations between the predicted heights and the real student heights.
We can do this with the pseudoinverse, $A^+$, like we did before, calculating $\hat{x} = A^+ \times \text{student}$. If you get your computer to do this you get these regression weights:
$$
\hat{x} = \begin{bmatrix} 43.9715 \\ 0.3196 \end{bmatrix}
$$
The equation that best predicts student heights is therefore:

$$\text{student} = 43.9715 + 0.3196 \times \text{mother}$$
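As a check, here's a sketch that finds the same two weights in R, either with the pseudoinverse or with `lm`, using the `student` and `mother` vectors pulled out earlier:

```r
# regressor matrix: a column of 1's and a column of mother's heights
A <- cbind(1, mother)
A_plus <- solve(t(A) %*% A) %*% t(A)
x_hat <- A_plus %*% student     # should be about 43.9715 and 0.3196

# lm does the same least-squares fit
coef(lm(student ~ mother))
```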
With this best fit we can generate the predicted student heights, $\hat{y}$, by multiplying A by $\hat{x}$:
$$
\underbrace{\begin{bmatrix}
1 & 62 \\ 1 & 60 \\ 1 & 65 \\ 1 & 66 \\ 1 & 60 \\ 1 & 64 \\ 1 & 63 \\ 1 & 65 \\ 1 & 61 \\ 1 & 60 \\ 1 & 61 \\
1 & 66 \\ 1 & 61 \\ 1 & 66 \\ 1 & 62 \\ 1 & 68 \\ 1 & 60 \\ 1 & 69 \\ 1 & 66 \\ 1 & 66 \\ 1 & 72 \\ 1 & 63 \\
1 & 63 \\ 1 & 60 \\ 1 & 64 \\ 1 & 65 \\ 1 & 62 \\ 1 & 65 \\ 1 & 65 \\ 1 & 69 \\ 1 & 67 \\ 1 & 65 \\ 1 & 64
\end{bmatrix}}_{A}
\times
\underbrace{\begin{bmatrix} 43.9715 \\ 0.3196 \end{bmatrix}}_{\hat{x}}
=
\underbrace{\begin{bmatrix}
63.7863 \\ 63.1471 \\ 64.7451 \\ 65.0647 \\ 63.1471 \\ 64.4255 \\ 64.1059 \\ 64.7451 \\ 63.4667 \\ 63.1471 \\ 63.4667 \\
65.0647 \\ 63.4667 \\ 65.0647 \\ 63.7863 \\ 65.7039 \\ 63.1471 \\ 66.0235 \\ 65.0647 \\ 65.0647 \\ 66.9822 \\ 64.1059 \\
64.1059 \\ 63.1471 \\ 64.4255 \\ 64.7451 \\ 63.7863 \\ 64.7451 \\ 64.7451 \\ 66.0235 \\ 65.3843 \\ 64.7451 \\ 64.4255
\end{bmatrix}}_{\hat{y}}
$$
The sum of squared deviations between the students' heights and the predicted students' heights is:

$$(66 - 63.7863)^2 + (62 - 63.1471)^2 + ... + (63 - 64.4255)^2 = 138.8954$$
You’ll recognize that our model is simply the equation of a line. Given a scatterplot with mother’s heights on the x-axis and student height on the y-axis, solving for the regression weights finds the best-fitting line to the data in the scatterplot.
21.4 Statistical significance of adding regressors
Solving a linear regression problem will always give us a least-squares solution, even if the fit isn’t very good. How can we evaluate the goodness of fit? More specifically, how can we determine if we need all of the regressors to predict the data?
One way to think about the fit of our regression line is to ask if we really need a slope to predict the students' heights. Or in hypothesis-testing terms, is the slope significantly different from zero?

Let's see how well we can predict the students' heights with a simpler model that doesn't have a regressor for the slope. This is just finding the least-squares solution for $Ax = y$ where A is just a column of ones:
$$
\underbrace{\begin{bmatrix}
1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\
1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1
\end{bmatrix}}_{A}
\times
\underbrace{\begin{bmatrix} x_1 \end{bmatrix}}_{x}
=
\underbrace{\begin{bmatrix}
66 \\ 62 \\ 67 \\ 62 \\ 62 \\ 62 \\ 64 \\ 61 \\ 62 \\ 61 \\ 63 \\
62 \\ 62 \\ 67 \\ 66 \\ 68 \\ 65 \\ 64 \\ 66 \\ 65 \\ 67 \\ 61 \\
68 \\ 64 \\ 67 \\ 67 \\ 65 \\ 68 \\ 63 \\ 66 \\ 65 \\ 66 \\ 63
\end{bmatrix}}_{\text{student}}
$$
The least-squares solution to this is $x_1 = 64.4545$, which is just the mean of the student heights. The sum of squared deviations between the student heights and this single value is:

$$(66 - 64.4545)^2 + (62 - 64.4545)^2 + ... + (63 - 64.4545)^2 = 168.1818$$
Compare this to the sum of squared deviations for the regression with the slope: 138.8954.

The regression equation with the slope has a smaller sum of squared deviations. Does that mean it's a better predictor of the students' heights? Not necessarily. Adding regressors will always improve the fit, since the equation with the slope is a more flexible version of the equation without the slope. We call these two regression equations nested, with one a more general version of the other.
Since adding regressors always improves the fit, we need a hypothesis test to determine if the additional regressor improves the fit enough to justify it. This can be done by computing an F statistic and the corresponding p-value.
We define the model with more regressors as the 'complete' model, and the model with fewer regressors as the 'restricted' model. For our example, the complete model is the one with both an intercept and a slope, and the restricted model has just the one constant. The sums of squared deviations for the least-squares fits of the two models are defined as $SS_r$ and $SS_c$ for the restricted and complete models respectively. For this example,

$$SS_r = 168.1818$$

and

$$SS_c = 138.8954$$
Let n be the total number of data points ($n = 33$ students in our example), $k_r$ be the number of regressors in the restricted model ($k_r = 1$, the mean) and $k_c$ be the number of regressors in the complete model ($k_c = 2$, intercept and slope).
The F-statistic is computed as:
$$F = \frac{(SS_r - SS_c)/(k_c - k_r)}{SS_c/(n - k_c)}$$

with $k_c - k_r$ degrees of freedom for the numerator and $n - k_c$ degrees of freedom for the denominator.
For our example,
$$F = \frac{(168.1818 - 138.8954)/(2 - 1)}{138.8954/(33 - 2)} = 6.5364$$
The p-value for this value of F, with 1 and 31 degrees of freedom, is:

```
## [1] 0.01568644
```
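Here's a sketch of this nested F test computed directly from the two sums of squared deviations in R (again assuming the `student` and `mother` vectors from above):

```r
# nested F test computed by hand
n <- length(student)                          # 33 students
SS_c <- sum(resid(lm(student ~ mother))^2)    # complete model: intercept + slope
SS_r <- sum((student - mean(student))^2)      # restricted model: the mean only
F_stat <- ((SS_r - SS_c) / (2 - 1)) / (SS_c / (n - 2))
p_value <- 1 - pf(F_stat, 1, n - 2)           # should be about 0.0157
```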
A small p-value indicates that adding the slope regressor significantly improves the fit of the linear regression model to the data. In other words, if p is small, then the slope is significantly different from zero, because if the slope is zero we have the restricted model.
If we use our favorite criterion value of α=0.05, we conclude that adding the slope regressor does significantly improve the fit. We therefore conclude that the slope is significantly different from zero. If we assume causality, it means that there is a significant influence of the mother’s height on the student’s height.
Recall that in the previous chapter we showed that if you use R's `cor.test` function to run a two-tailed test of the null hypothesis that the correlation in the population between student and mother's heights is zero, you'll get the same p-value.
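If you want to verify this, the call (using the same `student` and `mother` vectors) is just:

```r
# two-tailed test of zero correlation; the p-value should match the nested F test above
cor.test(student, mother)
```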
21.5 ANOVA as regression
We're ready to talk about ANOVA. We'll first show that given a 1-factor ANOVA data set you can calculate the group means using linear regression. Consider the following data set, which contains UW GPAs for the men in the class, grouped by whether they prefer to sit in the front, middle, or back of the lecture hall. We can use a 1-factor ANOVA to determine if the means of the groups are significantly different from each other.
Near the front | In the middle | Toward the back |
---|---|---|
3.66 | 3.23 | 3.30 |
3.84 | 3.65 | 3.68 |
3.95 | 3.80 | 3.83 |
 | 3.90 | 2.60 |
 | 3.43 | 3.80 |
 | 3.85 | 2.00 |
 | 3.00 | 2.89 |
 | 3.35 | 4.00 |
 | 3.31 | 3.66 |
 | 3.91 | 3.51 |
 | 3.07 | |
 | 3.53 | |
 | 3.20 | |
 | 3.30 | |
 | 3.65 | |
Here are the summary statistics (mean, sample size, standard deviation, and standard error of the mean) for each of the 3 groups:
 | mean | n | sd | sem |
---|---|---|---|---|
Near the front | 3.8167 | 3 | 0.1464 | 0.0845 |
In the middle | 3.4787 | 15 | 0.3025 | 0.0781 |
Toward the back | 3.3270 | 10 | 0.6394 | 0.2022 |
And here’s the result of the ‘omnibus’ F test:
 | df | SS | MS | F | p |
---|---|---|---|---|---|
Between | 2 | 0.5637 | 0.2819 | 1.4083 | p = 0.2633 |
Within | 25 | 5.0037 | 0.2001 | | |
Total | 27 | 5.5674 | | | |
The p-value for this hypothesis test is 0.2633, so using our magic value of α=0.05, it seems that GPA does not vary with where these students prefer to sit in class.
Just in case you’re wondering, if we had included the whole class of 150, there is a significant difference in GPA at UW across where they like to sit.
Next we'll use regression to estimate the 3 means. Here we'll write a new equation $Ax = y$ where y is a column vector containing all of the GPAs. For this example, the first 3 values in y are the GPAs from the 'Near the front' group, the next 15 values are the GPAs from the second group, and so on. We'll generate a funny-looking matrix for A which contains zeros and ones like this:
$$
\underbrace{\begin{bmatrix}
1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\
0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\
0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\
0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1
\end{bmatrix}}_{A}
\times
\underbrace{\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}}_{x}
=
\underbrace{\begin{bmatrix}
3.66 \\ 3.84 \\ 3.95 \\
3.23 \\ 3.65 \\ 3.80 \\ 3.90 \\ 3.43 \\ 3.85 \\ 3.00 \\ 3.35 \\ 3.31 \\ 3.91 \\ 3.07 \\ 3.53 \\ 3.20 \\ 3.30 \\ 3.65 \\
3.30 \\ 3.68 \\ 3.83 \\ 2.60 \\ 3.80 \\ 2.00 \\ 2.89 \\ 4.00 \\ 3.66 \\ 3.51
\end{bmatrix}}_{y}
$$
Think about what happens when A is multiplied by x to predict y. Notice that the first 3 predicted values are all equal to $x_1$: $(1x_1 + 0x_2 + 0x_3)$. What single value of $x_1$ is closest, in the least-squares sense, to the first 3 GPAs? It's the mean of those 3 GPAs from the 'Near the front' group. Similarly, the best-fitting value of $x_2$ will be the mean of the 'In the middle' group and $x_3$ will be the mean of the 'Toward the back' group. Sure enough, if we use the pseudoinverse to calculate the regression weights, we get our three group means:
$$
\hat{x} = \begin{bmatrix} 3.8167 \\ 3.4787 \\ 3.327 \end{bmatrix}
$$
The predicted GPA’s based on these regression weights are just multiple copies of the group means:
$$
\underbrace{\begin{bmatrix}
1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\
0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\
0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\
0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1
\end{bmatrix}}_{A}
\times
\underbrace{\begin{bmatrix} 3.8167 \\ 3.4787 \\ 3.327 \end{bmatrix}}_{\hat{x}}
=
\underbrace{\begin{bmatrix}
3.8167 \\ 3.8167 \\ 3.8167 \\
3.4787 \\ 3.4787 \\ 3.4787 \\ 3.4787 \\ 3.4787 \\ 3.4787 \\ 3.4787 \\ 3.4787 \\ 3.4787 \\ 3.4787 \\ 3.4787 \\ 3.4787 \\ 3.4787 \\ 3.4787 \\ 3.4787 \\
3.327 \\ 3.327 \\ 3.327 \\ 3.327 \\ 3.327 \\ 3.327 \\ 3.327 \\ 3.327 \\ 3.327 \\ 3.327
\end{bmatrix}}_{\hat{y}}
$$

The sum of squared deviations between $\hat{y}$ and y is 5.0036. Can you find this number in the ANOVA table above? It's the same as $SS_{within}$. That's because $SS_{within}$ is the sum of squared deviations of each GPA from the mean of its group.
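Here's a self-contained sketch of this dummy-variable regression in R, with the GPAs entered in the same group order as above:

```r
# GPAs in group order: 3 'Near the front', 15 'In the middle', 10 'Toward the back'
gpa <- c(3.66, 3.84, 3.95,
         3.23, 3.65, 3.80, 3.90, 3.43, 3.85, 3.00, 3.35, 3.31, 3.91, 3.07, 3.53, 3.20, 3.30, 3.65,
         3.30, 3.68, 3.83, 2.60, 3.80, 2.00, 2.89, 4.00, 3.66, 3.51)

# dummy-coded regressor matrix: one column of 0's and 1's for each group
A <- cbind(rep(c(1, 0, 0), times = c(3, 15, 10)),
           rep(c(0, 1, 0), times = c(3, 15, 10)),
           rep(c(0, 0, 1), times = c(3, 15, 10)))

# the regression weights are the three group means
A_plus <- solve(t(A) %*% A) %*% t(A)
x_hat  <- A_plus %*% gpa            # should be about 3.8167, 3.4787, 3.327

# sum of squared deviations between the data and the predictions (SS within)
sum((gpa - A %*% x_hat)^2)          # should be about 5.0036
```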
So far, this is just a complicated way of calculating the group means. What about the statistical significance of the test? The trick is to think of this regression model as the 'complete' model, which has a regressor for each group mean.
We can compare the fit to this model to a ‘restricted’ model that does not allow for a different mean for each group. Instead, we’ll try to predict our GPA’s from just one mean. Here’s the restricted model:
$$
\underbrace{\begin{bmatrix}
1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\
1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1
\end{bmatrix}}_{A}
\times
\underbrace{\begin{bmatrix} x_1 \end{bmatrix}}_{x}
=
\underbrace{\begin{bmatrix}
3.66 \\ 3.84 \\ 3.95 \\
3.23 \\ 3.65 \\ 3.80 \\ 3.90 \\ 3.43 \\ 3.85 \\ 3.00 \\ 3.35 \\ 3.31 \\ 3.91 \\ 3.07 \\ 3.53 \\ 3.20 \\ 3.30 \\ 3.65 \\
3.30 \\ 3.68 \\ 3.83 \\ 2.60 \\ 3.80 \\ 2.00 \\ 2.89 \\ 4.00 \\ 3.66 \\ 3.51
\end{bmatrix}}_{y}
$$
Here's the least-squares solution:
$$
\hat{x} = \begin{bmatrix} 3.4607 \end{bmatrix}
$$
Can you guess what this number 3.4607 is? It’s the grand mean.
The sum of squared deviations for this restricted model is 5.5674. Can you find this value in the ANOVA summary table? It's $SS_{total}$, which is the sum of squared deviations of all the GPAs from the grand mean.
You may have guessed where we’re going. A nested F-test will determine if the improvement of fit with the complete model justifies including a separate mean for each group. In other words, do we really need separate means to predict the data, or can we only conclude that the data came from a single population with a fixed mean?
For a nested F-test, the F-statistic is:
$$F = \frac{(SS_r - SS_c)/(k_c - k_r)}{SS_c/(n - k_c)}$$

with $k_c - k_r$ degrees of freedom for the numerator and $n - k_c$ degrees of freedom for the denominator.
For this specific example, let's define k to be the number of groups. We can replace the terms in the equations with:

$$SS_c = SS_{within}$$

$$SS_r = SS_{total}$$

$$k_c = k$$

$$k_r = 1$$
And, for this example,
$$SS_r - SS_c = SS_{total} - SS_{within}$$
Remember,
$$SS_{total} - SS_{within} = SS_{between}$$
So we can re-write the F-statistic as:
$$F = \frac{SS_{between}/(k-1)}{SS_{within}/(n-k)}$$

with $k - 1$ degrees of freedom for the numerator and $n - k$ degrees of freedom for the denominator.
This is exactly the formula for computing the F statistic and degrees of freedom for the 1-factor ANOVA. So we get the same F-statistic as the original ANOVA analysis:
$$F = \frac{(5.5674 - 5.0036)/(3-1)}{5.0036/(28-3)} = \frac{0.5637/2}{5.0036/25} = \frac{0.2819}{0.2001} = 1.4083$$
Which, with 2 and 25 degrees of freedom gives us a p-value of 0.2633, the same as in the 1-factor ANOVA.
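This is also what R's built-in ANOVA machinery computes. Here's a quick sketch, assuming a `group` factor that labels each GPA in the `gpa` vector from above; it should reproduce the F of 1.4083 and the p-value of 0.2633:

```r
# a factor labeling each GPA, in the same order as the gpa vector
group <- factor(rep(c("Near the front", "In the middle", "Toward the back"),
                    times = c(3, 15, 10)))

# one-factor ANOVA via regression
anova(lm(gpa ~ group))
```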
In terms of the nested model, we conclude that including separate regressors for each group to predict group means does not significantly improve the fit of the regression model over the restricted model that just has a single mean.
One caveat. For this example, our complete model is not really a more general version of the restricted model, because it's not the restricted model with added regressors (there is no column of 1's in the complete model). However, if you run a different version of the complete model in which the first column is all 1's, you get a genuine generalization of the restricted model that fits exactly as well:
$$
\underbrace{\begin{bmatrix}
1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\
1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\
1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\
1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1
\end{bmatrix}}_{A}
\times
\underbrace{\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}}_{x}
=
\underbrace{\begin{bmatrix}
3.66 \\ 3.84 \\ 3.95 \\
3.23 \\ 3.65 \\ 3.80 \\ 3.90 \\ 3.43 \\ 3.85 \\ 3.00 \\ 3.35 \\ 3.31 \\ 3.91 \\ 3.07 \\ 3.53 \\ 3.20 \\ 3.30 \\ 3.65 \\
3.30 \\ 3.68 \\ 3.83 \\ 2.60 \\ 3.80 \\ 2.00 \\ 2.89 \\ 4.00 \\ 3.66 \\ 3.51
\end{bmatrix}}_{y}
$$
The regression weights for this complete model are:
$$
\hat{x} = \begin{bmatrix} 3.8167 \\ -0.338 \\ -0.4897 \end{bmatrix}
$$
The sum of squared deviations for this regression is the same as for the previous complete model: 5.0036.
But what do these regression weights mean? The first weight, 3.8167, is still the mean for the first, 'Near the front', group of GPAs.

If you look at the matrix A you'll see that the predicted GPA for the second group is the sum of the first two regression weights. So the predicted GPA for the second group is the mean for the first group plus $\hat{x}_2$. Check that $\hat{x}_1 + \hat{x}_2 = 3.8167 - 0.338 = 3.4787$, which is the mean for the second group. Similarly, $\hat{x}_1 + \hat{x}_3 = 3.8167 - 0.4897 = 3.327$, which is the mean for the third group.
This new complete model has the same information in the regression weights, and since it’s a true generalization of the restricted model, it’s technically the one that should be used for the nested model F test. But since the SS and df’s are the same, it will give you the same F statistic as for the first, more intuitive, complete model that gave us the group means.
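In fact, this intercept-plus-differences parameterization is what R's `lm` uses by default for factors ('treatment' coding). Here's a sketch, assuming the `gpa` and `group` variables from above, with 'Near the front' forced to be the reference level:

```r
# force 'Near the front' to be the reference (first) level
group2 <- factor(group, levels = c("Near the front", "In the middle", "Toward the back"))

# coefficients should be about 3.8167, -0.338, and -0.4897
coef(lm(gpa ~ group2))
```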
In summary, when someone tells you that 'ANOVA is just regression', you should now understand that you can use regression to find group means by finding the least-squares solution to $Ax = y$ with a clever A matrix of zeros and ones. The regressors of zeros and ones that turn ANOVA into regression go by many names, including 'dummy variables', 'indicator variables', 'design variables', 'Boolean indicators', 'binary variables', or 'qualitative variables'.
Using a nested F-test to compare the goodness of fit of this complete model with the fit of the restricted model provides the same measure of statistical significance as the traditional ANOVA. In fact, most software packages use this sort of dummy coding and regression to calculate F statistics for ANOVA. More complicated ANOVA tests, like main effects, interactions, simple effects, and random effects, just require more complicated dummy variables that the computer figures out for you.
21.6 Effect Size
Here I'll show you that $\eta^2$, one of three measures of effect size for ANOVA, is the square of the correlation between the predicted and real data.
$\eta^2$ is defined as:

$$\eta^2 = \frac{SS_{between}}{SS_{total}}$$
which is the proportion of the total sums of squares that is associated with the differences between the means.
For our example:
$$\eta^2 = \frac{SS_{between}}{SS_{total}} = \frac{0.5637}{5.5674} = 0.1013$$
There's another interpretation of $\eta^2$ with respect to regression. Here's a scatterplot showing the real GPAs on the x-axis and the predicted GPAs from the complete model on the y-axis:
The scatterplot looks a little funny because there are only three levels of predicted GPA’s. That’s because, of course, the model predicts a specific GPA for each of the three levels of where students like to sit.
I’ve labeled the names of each group at the height of their means on the right side of the y-axis, and I’ve placed a yellow triangle at the location of the mean for each group.
The scatterplot lets you visualize each data point within each of the three groups. See how this scatterplot is a nice graphical representation of how the total variability of the points along the x-axis, $SS_{total}$, is broken down into $SS_{within}$ and $SS_{between}$. $SS_{within}$ is seen as the spread of points along the x-axis within each of the three groups (yellow triangles). $SS_{between}$ is associated with the differences in the three means, which are the three values along the y-axis.
Consider the correlation between the real and predicted GPAs. If there were no variability between the means, all the predicted values would be the same on the y-axis and the correlation would be zero. On the other hand, if there were no variability within each group, then every GPA would be equal to the mean GPA for its group, so the points would all fall on a line through the three group means, and the correlation would be equal to 1.
For this example, the correlation is $r = 0.3182$ and its square is $r^2 = 0.1013$.

Note that $r^2 = \eta^2$.
So another interpretation of the effect size measure $\eta^2$ is that it's the square of the correlation between the predicted and real data. This is sometimes called the 'proportion of the variance explained'. In our case, the proportion of the variance in GPAs explained by where students choose to sit is 0.1013.
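Here's one last sketch showing that $\eta^2$ really is the squared correlation between the data and the model's predictions, using the `lm` fit from the sketch above:

```r
# correlation between the real GPAs and the model's predicted GPAs
fit <- lm(gpa ~ group2)
r <- cor(gpa, fitted(fit))   # should be about 0.3182
r^2                          # should be about 0.1013, which is eta squared
```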