
Residuals and R squared - teachers page

Secondary activities


The drift to the north activity

Curriculum links


Mathematics: Statistics strand level 8

  • Investigate relationships between two continuous variables, using graphical methods (including linear regression), calculate correlation coefficients to estimate the strength of linear relations, and discuss the appropriateness of any regression line or correlation.

Background

The new achievement standard AS 90645 has two significant changes for Merit. First, the coefficient of determination, R², is specifically mentioned, along with the correlation coefficient. Second, students are asked to use residuals.

This page is adapted from a workshop run by Mike Camden (Statistics New Zealand) and Ian Westbrooke (Department of Conservation) in the Friday session at NZAMT 9. It covers the interpretation of the coefficient of determination and the use of residuals to improve analysis of bivariate data. The full resource, with extra datasets, can be found here.





Coefficient of determination


The coefficient of determination (R²) is a measure of how well a model explains the data. Let's use a small dataset of just three values for x and y to show how it works.

To calculate R², we first calculate the total sum of squares (SST). This measures how far our y values are from their mean. As with calculating the standard deviation, we square the differences before adding them.


x      y      mean of y      (y – mean)²
2      2      4              4
4      5      4              1
6      5      4              1
                     SST     6
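
The SST calculation in the table above can be checked with a few lines of code. This is a minimal sketch in Python (the original resource uses a spreadsheet); the variable names are our own.

```python
# y values from the three-point dataset in the table above
y = [2, 5, 5]

# Mean of y
mean_y = sum(y) / len(y)

# Squared difference of each y value from the mean
squared_diffs = [(yi - mean_y) ** 2 for yi in y]

# SST is the total of the squared differences
sst = sum(squared_diffs)

print("mean of y:", mean_y)
print("squared differences:", squared_diffs)
print("SST:", sst)
```

The printed values should match the last two columns of the table.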



Here are some diagrams showing this process.


The difference of y from its mean


The difference squared



The next step is to fit a regression line to the points and follow a similar process. This time, however, we calculate the squared differences between the actual y values and the fitted values, ŷ, given by the regression line. Adding these gives the sum of the squares of the residuals (SSR).

x      y      ŷ        (y – ŷ)²
2      2      2.5      0.25
4      5      4        1
6      5      5.5      0.25
               SSR     1.5
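
The fitted values in the table come from the least squares line through the three points. The sketch below, in Python rather than the spreadsheet used in the original resource, works out that line from the standard slope and intercept formulas and then totals the squared residuals. The variable names are our own.

```python
# Three-point dataset from the table above
x = [2, 4, 6]
y = [2, 5, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares line: y_hat = intercept + slope * x
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Fitted values, residuals (actual minus fitted) and their squared total
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
ssr = sum(res ** 2 for res in residuals)

print("slope:", slope, "intercept:", intercept)
print("fitted values:", fitted)
print("residuals:", residuals)
print("SSR:", ssr)
```

For this dataset the line is ŷ = 1 + 0.75x, which reproduces the ŷ column of the table.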


A residual is just the difference between the actual value and the value calculated from our model.








The least squares regression process fits the model that makes the total area of the square boxes as small as possible. The area of the boxes in the top graph, SST, is larger than the area of the boxes in the bottom graph, SSR. The difference, SST - SSR, is the area that has been explained away by using the model. The coefficient of determination shows the percentage of the variation explained by the model; that is, (SST - SSR) / SST.

In this case, R² = (6 – 1.5) / 6 = 0.75, or 75%.
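
The same calculation can be wrapped up as a small function so that students can try other data values. This is an illustrative Python sketch; the function name `r_squared` is our own and is not part of the original resource.

```python
def r_squared(y, fitted):
    """Return (SST - SSR) / SST for actual values y and model values fitted."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)             # variation about the mean
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # variation about the model
    return (sst - ssr) / sst

# Actual and fitted y values from the worked example
y = [2, 5, 5]
fitted = [2.5, 4.0, 5.5]

result = r_squared(y, fitted)
print(f"R squared = {result:.0%}")
```

Changing a y value without refitting the line will generally lower the result, because the fitted values are then no longer the least squares ones.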

The spreadsheet below shows this example and gives the opportunity to play with the data values.



Correlation coefficients


The correlation coefficient (r) measures the strength of a linear relationship between two variables. It contains almost the same information as the R² value for the best fit linear model; in fact, R² = r². However, r can be positive or negative, which relates to whether the linear model has positive or negative slope.
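
To illustrate the link between r and the coefficient of determination, the sketch below computes r from first principles for the three-point dataset, and again with the y values in reverse order so that the trend slopes downward. It is written in Python as an assumption of ours; the function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient r, computed from sums of squares and products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [2, 4, 6]
y = [2, 5, 5]

r = pearson_r(x, y)                 # upward trend
r_reversed = pearson_r(x, y[::-1])  # same y values reversed: downward trend

print("r:", round(r, 3), " r squared:", round(r ** 2, 3))
print("r (reversed):", round(r_reversed, 3), " r squared:", round(r_reversed ** 2, 3))
```

Here r is about 0.87 for the original data and about -0.87 for the reversed data, yet squaring either gives the 75% found earlier: the sign of r carries the direction of the slope, which squaring discards.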

In Excel, the function CORREL will give the r value for the linear model. The R² value, by contrast, depends on the model and may vary as different models are fitted.
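
As a rough illustration of how the coefficient of determination changes with the model, the Python sketch below fits two models to the same points: a straight line in x, and a line in x squared (a simple curved model). The data are made up for this illustration and do not come from the workshop datasets; the function name is also our own.

```python
def r_squared_of_line_fit(u, y):
    """Fit y = a + b*u by least squares and return (SST - SSR) / SST."""
    n = len(u)
    mean_u = sum(u) / n
    mean_y = sum(y) / n
    slope = (sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, y))
             / sum((ui - mean_u) ** 2 for ui in u))
    intercept = mean_y - slope * mean_u
    sst = sum((yi - mean_y) ** 2 for yi in y)
    ssr = sum((yi - (intercept + slope * ui)) ** 2 for ui, yi in zip(u, y))
    return (sst - ssr) / sst

# Made-up data that follow a curve exactly (y = x squared)
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

r2_line = r_squared_of_line_fit(x, y)                      # model: y = a + b*x
r2_curve = r_squared_of_line_fit([xi ** 2 for xi in x], y)  # model: y = a + b*x^2

print("straight line model:", round(r2_line, 4))
print("curved model:", round(r2_curve, 4))
```

The straight line already explains most of the variation, but the curved model explains all of it. The r value from CORREL corresponds only to the straight line model.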

Residuals


As explained above, the residuals are found by subtracting the fitted y values, found by substituting in the regression equation, from the actual y values. A scatterplot of residuals against x will make the non-linear features of x and y's relationship more visible.
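
A short sketch of this idea follows, in Python and with made-up data (not the workshop datasets) that curve upward. It fits a straight line and lists the residuals in order of x; the variable names are our own.

```python
# Made-up data with a curved relationship
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares line: y_hat = intercept + slope * x
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

# Residual = actual y minus fitted y
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

for xi, res in zip(x, residuals):
    print(f"x = {xi}  residual = {res:+.1f}")
```

The residuals run positive, negative, negative, negative, positive as x increases. Plotted against x they form a U shape, which signals that a straight line is missing a curve in the data even though the original scatterplot may look close to linear.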

The PowerPoint below shows how to calculate residuals and gives a worked example of using them to help analyse some data. The data are given in the spreadsheets so that students can work through the example as it is demonstrated.

Residuals and R sq.ppt

Residuals and R sq data.xls

Solver.xls