
11.3 Calculating the correlation coefficient (r)

The correlation coefficient (r) measures how closely a set of data points falls along a straight line (the linear regression line) fitted to those points. It is calculated as the covariance of the two variables divided by the product of their standard deviations (the square roots of their variances). In practice, you are unlikely to calculate this value by hand, especially for a large dataset, but following a worked calculation helps to show how the coefficient is derived. The formula is:

 

  \displaystyle \text{r} = \frac {n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n \sum x^{2} - (\sum x)^{2}} \sqrt {n \sum y^{2} - (\sum y)^{2}}}
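
The worked example that follows uses an equivalent form of this formula, written in terms of deviations from the means. As a brief sketch of why the two forms agree (here \overline{x} and \overline{y} denote the means of x and y, notation assumed for this derivation), expanding the deviations gives:

 \displaystyle \sum (x - \overline{x})(y - \overline{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}

 \displaystyle \sum (x - \overline{x})^{2} = \sum x^{2} - \frac{(\sum x)^{2}}{n}

with the matching identity for y. Dividing the numerator and the denominator of the formula above by n therefore gives:

 \displaystyle r = \frac{\sum (x - \overline{x})(y - \overline{y})}{\sqrt{\sum (x - \overline{x})^{2}} \sqrt{\sum (y - \overline{y})^{2}}}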

 

So, to calculate r for the data for graph A in the previous figure, first calculate the mean of each variable.

|      | x    | y    |
| ---- | ---- | ---- |
|      | 1    | 2    |
|      | 3    | 4    |
|      | 4    | 7    |
|      | 5    | 5    |
|      | 6    | 7    |
|      | 7    | 9    |
|      | 8    | 10   |
| Mean | 4.86 | 6.29 |

Now calculate the deviation from the mean for each value, followed by the square of each deviation.

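| x | y | x – x_mean | y – y_mean | (x – x_mean)² | (y – y_mean)² |
| --- | --- | --- | --- | --- | --- |
| 1 | 2 | –3.9 | –4.3 | 14.9 | 18.4 |
| 3 | 4 | –1.9 | –2.3 | 3.4 | 5.2 |
| 4 | 7 | –0.9 | 0.7 | 0.7 | 0.5 |
| 5 | 5 | 0.1 | –1.3 | 0.0 | 1.7 |
| 6 | 7 | 1.1 | 0.7 | 1.3 | 0.5 |
| 7 | 9 | 2.1 | 2.7 | 4.6 | 7.4 |
| 8 | 10 | 3.1 | 3.7 | 9.9 | 13.8 |
| x_mean = 4.9 | y_mean = 6.3 | ∑ = 0 | ∑ = 0 | ∑ = 34.9 | ∑ = 47.4 |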

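Next, calculate the cross product of the two deviation scores for each row, then sum the cross products. For the first row this is –3.857 × –4.286 ≈ 16.5; the table shows the deviations rounded to one decimal place, but the cross products are calculated from the unrounded values.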

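| x | y | x – x_mean | y – y_mean | (x – x_mean)² | (y – y_mean)² | Cross product |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | –3.9 | –4.3 | 14.9 | 18.4 | 16.5 |
| 3 | 4 | –1.9 | –2.3 | 3.4 | 5.2 | 4.2 |
| 4 | 7 | –0.9 | 0.7 | 0.7 | 0.5 | –0.6 |
| 5 | 5 | 0.1 | –1.3 | 0.0 | 1.7 | –0.2 |
| 6 | 7 | 1.1 | 0.7 | 1.3 | 0.5 | 0.8 |
| 7 | 9 | 2.1 | 2.7 | 4.6 | 7.4 | 5.8 |
| 8 | 10 | 3.1 | 3.7 | 9.9 | 13.8 | 11.7 |
| x_mean = 4.9 | y_mean = 6.3 | ∑ = 0 | ∑ = 0 | ∑ = 34.9 | ∑ = 47.4 | ∑ = 38.3 |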

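We can now put these sums into the equation. Written in terms of deviations from the means, r is the sum of the cross products divided by the product of the square roots of the two sums of squares: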

 \displaystyle r = \frac{38.3}{(\sqrt{34.9})(\sqrt{47.4})}

 \displaystyle r = 0.94

A very strong positive correlation!
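
The hand calculation above can be checked with a short script. The following is a minimal Python sketch, not a definitive implementation: the variable names (`ss_x`, `ss_y`, `sp_xy`, `r_dev`, `r_raw`) are chosen here for illustration and do not come from the text. It computes r from the deviation scores, as in the tables, and again from the raw-sum formula given at the start of the section.

```python
from math import sqrt

# Data for graph A, as listed in the tables above
x = [1, 3, 4, 5, 6, 7, 8]
y = [2, 4, 7, 5, 7, 9, 10]
n = len(x)

# Means of each variable
x_mean = sum(x) / n
y_mean = sum(y) / n

# Deviations from the mean (unrounded)
dx = [xi - x_mean for xi in x]
dy = [yi - y_mean for yi in y]

# Sums of squared deviations and sum of cross products
ss_x = sum(d * d for d in dx)
ss_y = sum(d * d for d in dy)
sp_xy = sum(a * b for a, b in zip(dx, dy))

# r from the deviation scores
r_dev = sp_xy / (sqrt(ss_x) * sqrt(ss_y))

# r from the raw-sum formula
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
r_raw = (n * sum_xy - sum_x * sum_y) / (
    sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
)

print(round(ss_x, 1), round(ss_y, 1), round(sp_xy, 1))  # 34.9 47.4 38.3
print(round(r_dev, 2), round(r_raw, 2))                 # 0.94 0.94
```

Both routes give the same value because the two formulas are algebraically equivalent; the deviation form simply makes the intermediate sums in the tables visible.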

Note

To calculate the correlation coefficient, the data for both variables should be continuous or on an interval scale (you cannot calculate the correlation coefficient for categorical data). Also, the data for at least one variable should be normally distributed, and the two variables should have a linear relationship, which can be checked by looking at a scatter plot of the data.

Calculating linear regression

We can also use these values to help calculate the line of regression (the line of best fit for the data points). A straight line can be described using the equation:

 \displaystyle y = a + bx

where a = the intercept and b = the slope of the line.

The slope can be determined by the following equation:

 \displaystyle b = \frac {\sum (x - \overline{x})(y - \overline{y})}{\sum (x - \overline{x})^{2}}

Therefore, in this example:

 \displaystyle b = \frac {38.3}{34.9}

 \displaystyle  b = 1.09

We can then calculate the intercept by substituting the mean values for x and y into our linear regression equation, because the line of best fit always passes through the point given by the two means:

 \displaystyle  y = a + bx

 \displaystyle  6.3 = a + 1.09 \times 4.9

 \displaystyle a = 0.96

So now our equation is:

 \displaystyle y = 0.96 + 1.09x

And we could represent the data using the following graph.

A graph is shown along with a linear regression equation showing a very strong positive correlation (r = 0.94).
Figure 11.3: The linear regression equation is shown on the graph and its line has been plotted. This makes it easier to visualise how the actual data points deviate from the line. The correlation coefficient is also provided, indicating a strong positive correlation.
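
The slope and intercept calculation in this section can be sketched in Python as follows. This is an illustrative sketch only: the helper name `fit_line` is hypothetical and not part of the text. It takes the slope as the sum of cross products divided by the sum of squared x deviations, then finds the intercept by substituting the means into y = a + bx.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x.

    fit_line is a hypothetical helper name used for illustration.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # Sum of cross products and sum of squared x deviations
    sp_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_mean) ** 2 for xi in x)

    b = sp_xy / ss_x           # slope
    a = y_mean - b * x_mean    # intercept: the line passes through the means
    return a, b


# Data for graph A
x = [1, 3, 4, 5, 6, 7, 8]
y = [2, 4, 7, 5, 7, 9, 10]

a, b = fit_line(x, y)
print(f"y = {a:.2f} + {b:.2f}x")  # y = 0.95 + 1.10x
```

The script works with unrounded intermediate values, so its result (a ≈ 0.95, b ≈ 1.10) differs slightly in the second decimal place from the hand calculation, which uses sums and means rounded to one decimal place.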