15 Simple Linear Regression

Learning Outcomes

At the end of this chapter you should be able to:

  1. understand the concept of simple linear regression;
  2. fit a linear regression model to data in R;
  3. interpret the results of regression analysis;
  4. perform model diagnostics;
  5. evaluate models.

 

 

15.1 Introduction

Many situations require an assessment of which of the available variables affect a particular response or output.

Example 15.1 Case Study: Nelson beetle data

Nelson (1964) investigated the effects of starvation and humidity on weight loss in flour beetles. The data consist of the variables Weightloss (mg) and Humidity (%).

Objective:  To quantify the relationship between Weightloss and Humidity.

Reference: Nelson, V. E. (1964). The effects of starvation and humidity on water content in Tribolium confusum Duval (Coleoptera). PhD Thesis, University of Colorado.

Regression is the study of relationships between variables, and is a very important statistical tool because of its wide applicability. Simple linear regression involves only two variables:

X = independent or explanatory variable;

Y = dependent or response variable;

and they are related by a straight line.
The observations are (X_1,Y_1),(X_2,Y_2),\dots,(X_n,Y_n).
Example 15.2
Let X = height and Y = weight of person.
People of the same height can have different weights. On average as height increases, weight also increases. Given the height, the weight has random variations from some mean weight for that height.
Assumptions of the regression model. The graph is a scatterplot of Weight against Height. For a given height, several weights are plotted, with the weight increasing linearly with height. A simple linear regression line is also plotted. For each set of weights for a given height, a normal distribution curve is plotted. All the normal curves have the same spread, indicating that at different values of height, weight is normally distributed with a homogeneous variance.
Assumptions of the regression model.

The Model

The general univariate statistical model is

    \[y_ i = \mu_i + \epsilon_i,\]

where \mu_i is the mean of y_i and \epsilon_i is the random variation term. Linear regression assumes that \mu_i is a linear function of X, that is

    \[{\rm E}(Y_i)=\mu_i = \beta_0+\beta_1 X_i.\]

THE MODEL

    \[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad i=1,2,\cdots,n\]

where

Y_i =response (or dependent) variable,

X_i =explanatory (or independent) variable,

\beta_0 = intercept,
\beta_1 = slope,
\epsilon_i = error or random variation.

Model assumptions

Observed data \left(X_i,Y_i\right), and

    \[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad i = 1,2,\ldots,n.\]

We treat Y_i as random variables corresponding to observations X_i.

ASSUMPTIONS

  1. A linear model is appropriate, that is, {\rm E}\ (Y_i) = \beta_0 + \beta_1 X_i.
  2. The error terms \epsilon_i are normally distributed.
  3. The error terms \epsilon_i have constant variance.
  4. The error terms \epsilon_i are uncorrelated.

Variance

Note that

    \[{\rm Var}(Y_i) = {\rm Var}(\beta_0 + \beta_1 X_i + \epsilon_i).\]

But \beta_0 + \beta_1 X_i is a constant, so

    \[{\rm Var}(Y_i) = {\rm Var}(\epsilon_i) = \sigma^2,\]

assumed constant (that is, \sigma^2 does not depend on i). The assumptions then imply that

    \[Y_i \stackrel{iid}{\sim} {\rm N}\left(\beta_0 + \beta_1 X_i, \sigma^2\right).\]

Further,

    \[{\rm E}\left(\epsilon_i\right) = {\rm E}\left[Y_i-\left(\beta_0 + \beta_1 X_i\right)\right] =0,\]

since {\rm E}(X-\mu) = 0. This implies that

    \[\epsilon_i \stackrel{iid}{\sim} {\rm N}(0,\sigma^2).\]
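These assumptions can be visualised by simulating data from the model in R. The sketch below is illustrative only; the parameter values and the x values are arbitrary choices, not from the text.

# Simulate data from Y_i = beta0 + beta1*X_i + eps_i, with eps_i ~ N(0, sigma^2)
set.seed(1)
beta0 <- 2; beta1 <- 0.5; sigma <- 1       # illustrative parameter values only
x <- 1:20
y <- beta0 + beta1 * x + rnorm(length(x), mean = 0, sd = sigma)
plot(y ~ x)                                # scatter of the simulated data
abline(a = beta0, b = beta1)               # the true mean line E(Y) = beta0 + beta1*x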

15.2 Parameter Estimation — Method of Least Squares

The model has three parameters, \beta_0,\beta_1 and \sigma^2, and these need to be estimated from the data. Denote by B_0 and B_1 respectively the estimates of \beta_0 and \beta_1.

    \begin{align*} \textrm{Fitted value} \quad  \hat{Y}_i &= B_0+B_1 X_i\\ \textrm{Residual} \quad r_i &= Y_i-\hat{Y}_i\\ &= Y_i-(B_0+B_1 X_i) \end{align*}

The method of least squares chooses B_0 and B_1 to minimise \sum_{i=1}^n r_i^2.

Recall

  1. \sum_{i=1}^n (y_i-c)^2 is a minimum when c=\overline{y}, from Chapter 2.
  2. ax^2+bx+c is a minimum when x= -\dfrac{b}{2a}.
  3. (a \pm b)^2 = a^2 \pm 2ab+b^2.
Now

    \begin{align*} \sum_{i=1}^n r_i^2 &= \sum_{i=1}^n \left[Y_i-(B_0+B_1 X_i)\right]^2\\ &= \sum_{i=1}^n \left[(Y_i-B_1 X_i) - B_0\right]^2 \end{align*}

 

This is a minimum when B_0 = \overline Y - B_1 \overline X. Then

    \begin{align*} \sum_{i=1}^n r_i^2 &= \sum_{i=1}^n \left[Y_i-(\overline Y -B_1 \overline X)-B_1 X_i\right]^2\\ &= \sum_{i=1}^n \left[(Y_i- \overline Y) - B_1(X_i - \overline X)\right]^2 \end{align*}

Using (a-b)^2 = a^2 - 2ab + b^2, the last line gives

    \[\sum_{i=1}^n r_i^2= \sum_{i=1}^n\left(Y_i-\overline Y\right)^2 -2B_1 \sum_{i=1}^n\left(X_i-\overline X\right)\left(Y_i- \overline Y\right)+ B_1^2\sum_{i=1}^n\left(X_i-\overline X\right)^2.\]

Define

    \begin{align*} \textrm{SS}_Y  &= \sum_{i=1}^n (Y_i-\overline Y)^2\\ \textrm{SS}_X &= \sum_{i=1}^n (X_i- \overline X)^2\\ \textrm{SS}_{XY} &= \sum_{i=1}^n (X_i - \overline X)Y_i \end{align*}

    \[\sum_{i=1}^n r_i^2=B_1^2 \textrm{SS}_X-2B_1 \textrm{SS}_{XY}+ \textrm{SS}_Y\]

Now the quadratic function y=ax^2+bx+c has a minimum when x=-\dfrac{b}{2a}. The expression above is a quadratic function in B_1, so it has its minimum at \hat{B}_1=\dfrac{2\textrm{SS}_{XY}}{2\textrm{SS}_X}.
Thus the least squares estimates of the slope and intercept parameters are:

    \[\hat{B}_1=b_1=\dfrac{\textrm{SS}_{XY}}{\textrm{SS}_X} \hspace{1cm} \hat{B}_0=b_0=\overline{Y}-\hat{B}_1 \overline{X}\]

Exercise

Show that

    \begin{align*} \textrm{SS}_{XY} &= \sum_{i=1}^n (X_i - \overline X)(Y_i-\overline Y)\\ &= \sum_{i=1}^n (X_i - \overline{X}) Y_i\\ &= \sum_{i=1}^n (Y_i-\overline{Y}) X_i\\ &= \sum_{i=1}^n X_iY_i  - n {\overline{X}}\ {\overline{Y}}\\ \textrm{SS}_X &= \sum_{i=1}^n (X_i - \overline{X})^2\\ &= \sum_{i=1}^n (X_i - \overline{X})X_i\\ &= \sum_{i=1}^n X_i^2 - n \overline{X}^2\\ \textrm{SS}_Y &= \sum_{i=1}^n (Y_i-\overline{Y})^2\\ &= \sum_{i=1}^n (Y_i - \overline{Y})Y_i\\ &= \sum_{i=1}^n Y_i^2 - n\overline{Y}^2 \end{align*}

Note

    \[B_0=\overline{Y}-B_1 \overline{X} \Rightarrow \overline{Y}=B_0+B_1 \overline{X}.\]

That is, the point (\overline{X},\overline{Y}) satisfies the equation of regression, i.e., (\overline{X},\overline{Y}) ALWAYS lies on the line of regression.

Example 15.3 Beetle data

Nine beetles were used in an experiment to determine the effects of starvation and humidity on water loss in flour beetles. Here X = Humidity, Y = Weightloss.

Summary statistics

\sum_{i=1}^{9} x_i=453.5 \qquad \sum_{i=1}^{9} x_i^2=31 152.75 \qquad \sum_{i=1}^{9} y_i=54.2
\sum_{i=1}^{9} y_i^2=350.535 \qquad \sum_{i=1}^{9} x_iy_i=2 289.26 \qquad \overline x = 50.38889
\overline y = 6.02222

    \begin{align*} \textrm{SS}_Y  &= \sum_{i=1}^9 y_i^2 - 9\overline{y}^2 = 350.535 - 9\times 6.022^2 = 24.13056\\ \textrm{SS}_X &= \sum_{i=1}^9 x_i^2- 9\overline{x}^2 = 31 152.75 - 9\times 50.39^2 = 8 301.389\\ \textrm{SS}_{XY}  &= \sum_{i=1}^9 x_iy_i -9\times\overline{x}\ \overline{y} = 2 289.26 - 9 \times 50.39 \times 6.022 = -441.8178\\ b_1 &= \frac{SS_{XY}}{SS_X} = \frac{-441.8178}{8 301.389} = -0.05322\\ b_0 &=  \overline y - b_1 \overline x = 6.022 + 0.05322 \times 50.389 = 8.7040\end{align*}

    \[\widehat{\rm Weightloss} = 8.7040 - 0.05322 {\rm \ Humidity}\]

\widehat{\rm Weightloss} = Estimated average Weightloss for a given humidity

Intercept = 8.7040 = Average Weightloss when Humidity is zero.

Slope = -0.05322 = If Humidity increases by 1 %  then Weightloss decreases by 0.05322 mg.

That is, the beetles prefer a more humid environment in which the weightloss is lower.
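The hand calculations above can be checked in R using the summary statistics. This is a minimal sketch; the full analysis of the raw data is given in Section 15.10.

# Least squares estimates from the summary statistics of Example 15.3
n    <- 9
sx   <- 453.5;   sxx <- 31152.75
sy   <- 54.2;    sxy <- 2289.26
xbar <- sx / n;  ybar <- sy / n
SSX  <- sxx - n * xbar^2          # 8301.389
SSXY <- sxy - n * xbar * ybar     # -441.8178
(b1  <- SSXY / SSX)               # -0.05322
(b0  <- ybar - b1 * xbar)         #  8.7040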

15.3 Partitioning the sum of squares

    \[\textrm{Residual SS} = \textrm{SS}_{\textrm{Res}} = \sum_{i=1}^n r_i^2.\]

Now, from the least squares derivation and using b_1 = \textrm{SS}_{XY}/\textrm{SS}_X,

    \begin{align*} \textrm{SS}_{\textrm{Res}} = \sum_{i=1}^n r_i^2 &= b_1^2\, \textrm{SS}_X - 2 b_1\, \textrm{SS}_{XY} + \textrm{SS}_Y\\ &= b_1^2\, \textrm{SS}_X - 2 b_1^2\, \textrm{SS}_X + \textrm{SS}_Y\\ &= \textrm{SS}_Y - b_1^2\, \textrm{SS}_X. \end{align*}

Define

    \begin{align*} \textrm{SS}_{\textrm{Regression}} & = b_1^2\ \textrm{SS}_X\\ \textrm{SS}_{\textrm{Total}} &= \textrm{SS}_Y.\end{align*}

Then

    \[\textrm{SS Total} = \textrm{SS Reg} + \textrm{SS Res}\]

Notes

  1. The SS Reg can also be expressed as follows.

        \begin{align*} \sum_{i=1}^n (\hat{y}_i-\overline{y})^2 &= \sum_{i=1}^n (b_0+b_1x_i-\overline{y})^2 \\ &= \sum_{i=1}^n (\overline{y}-b_1\overline{x}+b_1x_i-\overline{y})^2 \\ &= b_1^2 \sum_{i=1}^n (x_i-\overline{x})^2 = \text{SS}_{\text{Reg}}. \end{align*}

  2. The proportion of variation explained by the regression is

        \[R^2=\dfrac{\text{SS}_{\text{Reg}}}{\text{SS}_{\text{Total}}}.\]

    Note 0 \le R^2 \le 1.
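As a check, the partition and R^2 can be computed for the Beetle data of Example 15.3 from the quantities already obtained. A minimal sketch; the same R^2 value appears in the regression output of Section 15.10.

# Sum of squares partition and R^2 for the Beetle data (Example 15.3)
SSX   <- 8301.389
SSY   <- 24.13056                 # total sum of squares
b1    <- -441.8178 / SSX          # slope estimate
SSReg <- b1^2 * SSX               # regression SS, about 23.51
SSRes <- SSY - SSReg              # residual SS, about 0.62
SSReg / SSY                       # R^2, about 0.974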

15.4 Hypothesis Test for \beta_1

To determine if there is a significant linear relationship between x and y, we test the hypotheses:

    \begin{align*} H_0: \beta_1 &=0 \qquad \text{No linear relationship} \\ H_1: \beta_1 &\neq 0 \qquad \text{Significant linear relationship} \\ \beta_1 &>0 \qquad \text{Significant positive linear relationship} \\ \beta_1 &<0 \qquad \text{Significant negative linear relationship} \end{align*}

If H_0 is true then \beta_1=0, so

    \[\text{SS}_{\text{Reg}}=b_1^2 \text{SS}_X \approx 0 \quad \text{and} \quad \text{SS}_{\text{Tot}} \approx \text{SS}_{\text{Res}}.\]

Thus we can use the ratio \text{SS}_{\text{Reg}}/\text{SS}_{\text{Res}} as a test statistic. However, each of these sums of squares needs to be adjusted by its degrees of freedom.

    \begin{align*} \text{Regression df} &= \text{Number of parameters -1} \\ &=k-1 =1 \; \text{for simple linear regression}.\\ \text{Total df} &= n-1 \\[12pt] \text{Residual df} &= n-k \\ &= n-2 \; \text{for simple linear regression}. \end{align*}

Note that two parameters are estimated from the data in simple linear regression, the intercept and the slope, so k=2. These calculations can be set up as a table.

ANOVA Table

Source df SS MS F
Regression k-1 \text{SS}_{\text{Reg}} \text{MS}_{\text{Reg}} = \text{SS}_{\text{Reg}}/(k-1) F = \text{MS}_{\text{Reg}}/\text{MS}_{\text{Res}} \sim F_{k-1,\, n-k}
Residual n-k \text{SS}_{\text{Res}} \text{MS}_{\text{Res}} = \text{SS}_{\text{Res}}/(n-k)
Total n-1 \text{SS}_{\text{Tot}} \text{MS}_{\text{Tot}} = \text{SS}_{\text{Tot}}/(n-1)

 

The test then proceeds as usual.

If p-value = P(F > f_{obs}) < \alpha, where F \sim F_{k-1,n-k}, then reject H_0, and conclude there is a significant linear relationship between Y and X.

Plot of the F(6,45) distribution, with the upper tail shaded corresponding to a probability of 0.05.

Example 15.4 Beetle data (ctd)

    \begin{align*} SS_T &= SS_Y = 24.13056\\ SS_{Reg} &= b_1^2 SS_{X} = (-0.05322)^2 \times 8301.389 = 23.51259. \end{align*}

The ANOVA table is given below; the small difference between the value of SS_{Reg} computed above and the value in the table is due to rounding of b_1. Note that the relationships in the ANOVA table are as in the chapter on ANOVA. That is,

    \begin{align*} SST &= SSReg + SSRes\\ df \ Total &= df\ Reg + df\ Res\\ MSReg &= \frac{SSReg}{df\ Reg}\\ MSRes &= \frac{SSRes}{df\ Res}\\ F &= \frac{MSReg}{MSRes} \end{align*}

Source df SS MS F
Regression 1 23.51449 23.51449 267.18
Residual 7 0.61607 0.088010
Total 8 24.13056

 

The hypotheses of interest are:

    \[H_0: \beta_1 = 0 \qquad H_1: \beta_1 \ne 0\]

p-value = P(F > 267.18) = 7.82 \times 10^{-7} << 0.05, so there is overwhelming evidence against the null hypothesis.

pf(267.18, 1, 7, lower.tail = F)
[1] 7.816436e-07

We conclude based on the linear regression analysis that there is a significant linear relationship between Weightloss and Humidity.

We could also conduct a left-sided test of hypothesis.

    \[H_0: \beta_1 = 0 \qquad H_1: \beta_1 < 0\]

Now the p-value is simply half of that for the two-sided test. That is,

    \[{\rm p-value\ }= 7.82 \times 10^{-7}/2 = 3.91 \times 10^{-7} << 0.025,\]

so there is overwhelming evidence against the null hypothesis. We conclude based on the linear regression analysis that there is a significant negative linear relationship between Weightloss and Humidity.

15.5 Estimate of \sigma^2

The final model parameter that needs to be estimated is the constant error variance term \sigma^2. Now

    \[{\rm Var}(Y_i) = {\rm Var}(\epsilon_i) =\sigma^2,\]

which can be estimated by the average of the residuals squared. That is,

    \[\hat{\sigma}^2 = \frac{1}{n-k}\sum_{i=1}^n r_i^2.\]

Note that here we have used the data to estimate k parameters/regression coefficients. For simple linear regression, k=2. Also note that as in ANOVA,

    \[s^2 = MS_{Res},\]

which can be read from the ANOVA table for regression.

For the Beetle data, s^2 = 0.08801.

15.6 Distribution of B_1

    \[B_1 = \dfrac{\sum_{i=1}^n (X_i-\overline{X})Y_i}{\sum_{i=1}^n (X_i-\overline{X})^2}=\dfrac{\text{SS}_{XY}}{\text{SS}_X}\]

The random variables here are Y_i, and X_i are considered constant. Put

    \begin{align*} C_i &= \frac{X_i-\overline{X}}{SS_X}\\ \text{so\ } \sum_{i=1}^n C_i &= \frac{1}{SS_X} \sum_{i=1}^n (X_i-\overline{X}) =0,\\ \sum_{i=1}^n C_i\ X_i &= \frac{1}{SS_X}\sum_{i=1}^n (X_i-\overline{X})X_i = \frac{SS_X}{SS_X} = 1,\\ \text{and\ } \sum_{i=1}^n C_i^2 &= \frac{1}{SS_X^2}\sum_{i=1}^n (X_i-\overline{X})^2 = \frac{1}{SS_X}. \end{align*}

Then

    \begin{align*} B_1 &= \sum_{i=1}^n C_i\ Y_i\\ \text{so\ } {\rm E}(B_1) &= \sum_{i=1}^n C_i\ {\rm E}(Y_i)\\ &= \sum_{i=1}^n C_i\ (\beta_0+\beta_1 X_i)\\ &= \beta_0 \underbrace{\sum_{i=1}^n C_i}_{=0} + \beta_1 \underbrace{\sum_{i=1}^n C_i\ X_i}_{=1}\\ &= \beta_1, \end{align*}

so B_1 is an unbiased estimator of \beta_1.

Next,

    \[{\rm Var}(B_1) = \sum_{i=1}^n C_i^2\ {\rm Var}(Y_i)= \frac{\sigma^2}{SS_X}.\]

Further, Y_i are normally distributed, so B_1=\sum_{i=1}^n C_iY_i is the sum of normal random variables; thus B_1 is also normal.

    \begin{align*} B_1 & \sim \text{N}\left(\beta_1, \dfrac{\sigma^2}{\text{SS}_X}\right)\\ \Rightarrow Z&=\dfrac{B_1-\beta_1}{\sigma/\sqrt{\text{SS}_X}} \sim \text{N}(0,1) \end{align*}

 

If \sigma is unknown (usually the case), we replace it by S, and then

    \begin{align*} T &= \dfrac{B_1-\beta_1}{S/\sqrt{\text{SS}_X}} \sim t_{n-2}\\ SE\left(B_1\right) &= \frac{s}{\sqrt{\text{SS}_X}} \end{align*}

We can thus use the t-distribution for (one-sided and two-sided) hypothesis tests and confidence intervals for \beta_1.

Example 15.4 Beetle data

b_1 = -0.0532222, \quad SS_X = 8301.389, \quad s= \sqrt{0.08801}

Is there a significant

  1. linear relationship
  2. negative linear relationship
    between Weightloss and Humidity?

Solution

The hypotheses are H_0: \beta_1 = 0 against the two-sided H_1: \beta_1 \ne 0 or the left-sided H_1: \beta_1 < 0.
Test statistic is

    \begin{align*} T &= \frac{B_1-\beta_1}{S/\sqrt{SS_X}} \sim t_{7}\\ \text{and\ } t_{Obs} &= \frac{-0.053222 - 0}{\sqrt{0.08801}/\sqrt{8301.389}} = -16.34563. \end{align*}

  1.  The two-sided p-value = 2\ P(T < -16.34563) = 7.82 \times 10^{-7} << 0.05 so there is overwhelming evidence against the null hypothesis. Same p-value and conclusion as before.
  2. One-sided test. p-value = P(T < -16.34563) = 3.91 \times 10^{-7} << 0.025 so there is overwhelming evidence against the null hypothesis. Same p-value and conclusion as before.

Note

  1. The p-values from the F-distribution and the t-distribution are the same here. Note that

        \[(t_{obs})^2 = (-16.34563)^2 = 267.1796 = F_{obs}\]

    from the ANOVA table. In fact, if T \sim t_k then

        \[T^2 \sim F_{1,k},\]

    which the sketch below verifies numerically.

  2. The one-sided p-value is simply half the two-sided p-value from either the t- or F-distribution.
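Both notes can be checked numerically in R; a minimal sketch using the Beetle data values quoted above:

# Equivalence of the t-test and F-test p-values for the Beetle data
t_obs <- -16.34563
t_obs^2                                            # 267.18, the F statistic
2 * pt(t_obs, df = 7)                              # two-sided p-value from t_7
pf(t_obs^2, df1 = 1, df2 = 7, lower.tail = FALSE)  # the same p-value from F_{1,7}
pt(t_obs, df = 7)                                  # one-sided p-value, half of the above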

Example 15.4 Beetle data – Confidence interval

A 95% confidence interval for \beta_1 is calculated as before.

    \begin{align*} 95\% \text{\ CL for\ } \beta_1 &= b_1 \pm t_{7}^{0.025} \times \frac{S}{\sqrt{SS_X}}\\ &= -0.053222 \pm 2.3646 \times \frac{\sqrt{0.08801}}{\sqrt{8301.389}}\\ &= -0.053222 \pm 0.007699\\ \text{so\ } 95\% \text{\ CI for\ } \beta_1 &= (-0.0609, -0.0455). \end{align*}
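A minimal sketch of this interval calculation in R; compare with the confint() output in Section 15.10.

# 95% confidence interval for beta_1, Beetle data
b1    <- -0.053222
se_b1 <- sqrt(0.08801) / sqrt(8301.389)   # standard error of b1, about 0.003256
tcrit <- qt(0.975, df = 7)                # 2.3646
b1 + c(-1, 1) * tcrit * se_b1             # (-0.0609, -0.0455)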

15.7 Model diagnostics

Model assumptions are verified by analysis of residuals. There are four assumptions.

  1. A linear model is appropriate.
    1. Scatterplot of y_i against x_i. This should show points roughly around a straight line. This is usually part of the data exploration before model fitting.
    2. Scatterplot of standardised residuals against fitted values. The plot should be patternless. This is the best plot to verify this assumption, and is also used in the next chapter on multiple regression.
    3. Scatterplot of standardised residuals against x_i. Plot should be without pattern. This is similar to the previous plot. It is also used in the next chapter on multiple regression for model building and selecting if some transformation of data is appropriate.
  2. The errors are normally distributed. These are verified as for ANOVA.
    1. Histogram of residuals.
    2. Normal probability plot.
    3. Chi-squared test of normality. We won’t cover this.
  3. Errors have constant variance. This is again by inspecting residual plots, as for ANOVA.
    1. Plot of standardised residuals against fitted values. Should have constant spread. This is the best plot and is also used in the next chapter on multiple regression.
    2. Plot of standardised residuals against x_i. Should have constant spread. In the case of multiple regression this plot may help identify variables that are related to inhomogeneous variance.
  4. Errors are uncorrelated. Durbin–Watson test statistic. This is beyond the scope of this text.

Example 15.5 (a) 

Evaluate the model with the given diagnostic plots.

Diagnostic plots for Example 15.5 (a). Four graphs; scatterplot, histogram of residuals, normal probability plot and residuals against fitted values. The scatterplot appears close to a straight line. The histogram of residuals peaks close to the center. The normal probability plot does not depart markedly from a straight line. The plot of residuals against fitted values shows no clear pattern.
Diagnostic plots for Example 15.5 (a).

 

Solution

The scatterplot appears close to a straight line, and the plot of residuals against fitted values shows no clear pattern, so a linear model is appropriate. The histogram of residuals is not too different from that expected for a normal distribution, showing only a slight asymmetry, and the normal probability plot does not depart markedly from a straight line. We conclude that the normality assumption is not violated. Finally, the plot of residuals against fitted values shows no change in spread, so there is no evidence against the homogeneous variance assumption.

Example 15.5 (b) 


Evaluate the model with the given diagnostic plots.

Diagnostic plots for Example 15.5 (b). Four graphs; scatterplot, histogram of residuals, normal probability plot and residuals against fitted values. The scatterplot is a slightly curved line, increasing on both x and y. The histogram is much higher at low x values and tapers off to the right. The normal probability plot shows values at a significant distance from the straight line and the residuals against fitted values shows an inverted curve shape.
Diagnostic plots for Example 15.5 (b).

Solution

The plot of residuals against fitted values shows a quadratic trend, so a linear model is not appropriate. The departure from normality is also evident from the histogram and normal probability plot, but this may be due to the linear model not being appropriate. The homogeneous variance assumption seems to be satisfied as there is no change in spread in the plot of residuals against fitted values.

Example 15.5 (c) 

Evaluate the model with the given diagnostic plots.

Diagnostic plots for Example 15.5 (c). Four graphs; scatterplot, histogram of residuals, normal probability plot and residuals against fitted values. The scatterplot shows values cluster around a diagonal line. The histogram peaks sharply for the center x values and tapers at low and high values. The normal probability plot shows departures from the straight line at upper and lower extremes. The residuals against fitted values has no clear pattern.
Diagnostic plots for Example 15.5 (c).

Solution

The plot of residuals against fitted values indicates a “fanning out”, so the homogeneous variance assumption is violated. A linear model seems appropriate and there is no evidence against the normality assumption.

Example 15.5 (d) 

Evaluate the model with the given diagnostic plots.

Diagnostic plots for Example 15.5 (d). Four graphs; scatterplot, histogram of residuals, normal probability plot and residuals against fitted values. The scatterplot shows values increasing linearly, but the spread is not uniform. The histogram is right skewed. The normal probability plot shows departures from the straight line at both extremes. The residuals against fitted values shows a non-uniform spread.
Diagnostic plots for Example 15.5 (d).

Solution

From the histogram and normal probability plot, it is clear that the normality assumption does not hold. The scatterplot of the data indicates that a linear model may not be appropriate. Similarly the plot of residuals against fitted values shows evidence against the homogeneous variance assumption. However, these issues may be due to a lack of normality.

15.8 Outliers and Points of High Leverage

Outliers are points away from the bulk of the data. They usually have large absolute residuals, that is, large positive or large negative residuals, and are easily identified from plots. Commonly |\text{standardised}\, r_i| > 2 is used to identify outliers.

Outliers are important because they may affect model fit. To investigate the effect of an outlier the regression model should be fitted with and without the point for comparison.

If in arriving at the final model any outliers have been omitted then this should be reported. One should never silently omit outliers. Indeed, if enough points are omitted then one can arrive at a perfect model.

Points of high leverage are those that strongly influence the goodness of fit of the model. These usually have small absolute residuals, close to zero. Again the model should be fitted with and without them.

Scatterplot illustrating a point of high leverage. A scatterplot graph with trend line. The graph shows one point (x 6.5, y 5) that is away from the bulk of the data, which is mostly within the ranges x=1 to x=2.5 and y=1 to y=3.
Scatterplot illustrating a point of high leverage.

The graph above shows a point that is away from the bulk of the data. This point is an outlier. It is also a point of high leverage, since the linear model is very strongly influenced by it. Removing it and re-fitting the model gives the graph below. This plot shows there really is no relationship between the variables here.

Scatterplot of data with the point of high leverage removed. Scatterplot graph with trend line. Trend line is almost horizontal.
Scatterplot of data with the point of high leverage removed.
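A minimal sketch of this fit-with-and-without comparison in R. The data below are made up for illustration (a cluster of points at small x plus one point near (6.5, 5), in the spirit of the plots above); they are not from the text.

# Hypothetical data: a cluster of points plus one high-leverage point near (6.5, 5)
dat <- data.frame(
  x = c(1.0, 1.3, 1.6, 1.9, 2.1, 2.4, 6.5),
  y = c(2.5, 1.5, 2.8, 1.6, 2.4, 2.0, 5.0)
)
fit_all     <- lm(y ~ x, data = dat)                  # fit with the high-leverage point
fit_reduced <- lm(y ~ x, data = dat, subset = x < 6)  # refit without it
coef(fit_all)       # slope largely determined by the single extreme point
coef(fit_reduced)   # much weaker relationship without it
plot(y ~ x, data = dat); abline(fit_all); abline(fit_reduced, lty = 2)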

15.9 Correlation coefficient

The correlation coefficient, r, is a measure of the strength of the linear relationship between x and y.

    \[r = \dfrac{\text{SS}_{XY}}{\sqrt{\text{SS}_X \text{SS}_Y}}\]

Then

    \begin{align*} r^2 &= \dfrac{(\text{SS}_{XY})^2}{\text{SS}_X \text{SS}_Y} = \left(\dfrac{\text{SS}_{XY}}{\text{SS}_X}\right)^2 \dfrac{\text{SS}_X}{\text{SS}_Y} \\ &= \dfrac{b_1^2 \text{SS}_X}{\text{SS}_Y}\\ &= \dfrac{\text{SS}_{\text{Reg}}}{\text{SS}_{\text{Total}}} = 1-\dfrac{\text{SS}_{\text{Res}}}{\text{SS}_{\text{Total}}} \end{align*}

Note that 0 \le r^2 \le 1, so -1 \le r \le 1.

|r|=1 \; \Rightarrow \; r^2=1=\dfrac{\text{SS}_{\text{Reg}}}{\text{SS}_{\text{Tot}}} \; \Rightarrow \; \text{SS}_{\text{Tot}}=\text{SS}_{\text{Reg}}
and \; \text{SS}_{\text{Res}}=\sum_{i=1}^n r_i^2=0 \; \Rightarrow \; r_i=0,
and \; y_i=b_0+b_1x_i, so the points (x_i,y_i) lie on a straight line.

If |r| is close to 1, then a strong linear relationship exists between x and y.
If |r| is close to 0, no linear relationship exists between x and y.

Note that

    \[r=b_1\sqrt{\dfrac{\text{SS}_X}{\text{SS}_Y}},\]

so r has the same sign as b_1. Testing H_0: \beta_1=0 is equivalent to testing H_0: r=0.
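For the Beetle data, r can be obtained from the summary statistics of Example 15.3. A minimal sketch; note that r^2 matches the Multiple R-squared reported in Section 15.10.

# Correlation coefficient for the Beetle data from the summary statistics
SSX  <- 8301.389;  SSY <- 24.13056;  SSXY <- -441.8178
(r   <- SSXY / sqrt(SSX * SSY))    # about -0.987, same sign as b1
r^2                                # about 0.9745, the R^2 of the regression
(SSXY / SSX) * sqrt(SSX / SSY)     # b1 * sqrt(SSX/SSY), equal to r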

CAUTION

|r| large means a strong linear relationship exists between x and y. It DOES NOT mean “x causes y”. It is possible that x and y are both related through a third variable. Spurious correlation is often misused to imply causation.

If |r| is small, it does not indicate that x and y are unrelated — it only indicates a lack of a linear relationship. x and y may still be related non-linearly.

If we reject H_0:\beta_1=0, then there is a significant linear relationship between x and y. This DOES NOT indicate that the relationship is a good one, or the best one. Residual analysis should be performed to verify model assumptions. A similar comment applies if |r| is large.

Example 15.6 Example of weak correlation but strong relationship

The scatterplot shows a very strong non-linear relationship between x and y, but a zero correlation.

Scatterplot indicating that zero correlation does not imply no relationship. Scatterplot graph shows values arranged in a symmetrical oval shape. At each x value there are 2 points with "mirrored" y values, one above the horizontal axis (eg; at 1.0) and one mirrored below the horizontal axis (eg; at negative 1.0). At each y value there are also two points on the graph with mirrored x values, so that the oval shape is also symmetrical to the left and right of the vertical y axis.
Scatterplot indicating that zero correlation does not imply no relationship.

ALWAYS examine a plot of the data!

15.10 Detailed analysis of the Beetle data: with R code

The data consists of observations for nine beetles for the variables  Weightloss (mg) and Humidity (%).

Objective:  To quantify the relationship between Weightloss and Humidity.

The figure below shows a scatterplot of the data and some summary statistics.

> nelson<-read.csv("nelson.csv", header=T)
 Humidity       Weightloss   
 Min.   : 0.00   Min.   :3.720  
 1st Qu.:29.50   1st Qu.:4.680  
 Median :53.00   Median :5.900  
 Mean   :50.39   Mean   :6.022  
 3rd Qu.:75.50   3rd Qu.:6.670  
 Max.   :93.00   Max.   :8.980  
 sd     : 32.21  sd     :1.74
> with(nelson, plot(Weightloss ~ Humidity))
Scatterplot with vertical axis 'Weightloss' and horizontal axis 'Humidity'. Values are roughly in a diagonal line trending down from Weightloss = 9 at Humidity = 0 to Weightloss = 4 at Humidity = 90
Scatterplot of Weightloss against Humidity.

Observations

The data range for Humidity is 0–93, with mean 50.39 and median 53.00. Weightloss has a range of 3.72–8.98, with mean 6.022 and median 5.900.

The scatter plot shows that a linear relationship is feasible.

The output of the regression analysis in R is given below. The function used to fit the linear model is lm(). The summary gives a table of coefficients with their standard errors and p-values, together with the residual standard error, the R-squared values and the overall F-test.

> nelson.lm<-lm(Weightloss ~ Humidity, data = nelson)# y ~ x is the model formula, 
#and the dataframe is specified by data = nelson.
> summary(nelson.lm)

Call:
lm(formula = Weightloss ~ Humidity, data = nelson)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46397 -0.03437  0.01675  0.07464  0.45236 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.704027   0.191565   45.44 6.54e-10 ***
Humidity    -0.053222   0.003256  -16.35 7.82e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2967 on 7 degrees of freedom
Multiple R-squared:  0.9745,	Adjusted R-squared:  0.9708 
F-statistic: 267.2 on 1 and 7 DF,  p-value: 7.816e-07

The coefficients of the model are given, along with the corresponding standard errors and p-values. The output below the coefficient table gives the residual standard error, which is the estimate of the model standard deviation \sigma. The degrees of freedom of the residual standard error are also given. It also gives the value of the correlation coefficient squared (Multiple R-squared). Finally, the F-value, its corresponding degrees of freedom and p-value are also given.

We will discuss the other output in the next chapter. Here we will calculate the Multiple R-Squared and Adjusted R-squared.

The Multiple R-squared and Adjusted R-squared are defined as

    \[R^2 = 1 - \frac{\text{Residual SS}}{\text{Total SS}},\]

and

    \begin{align*} \text{Adjusted\ } R^2 &= 1-\dfrac{\text{SS Res}/(n-k)}{\text{SS Total}/(n-1)} \\ &= 1 - \frac{\text{MS Res}}{\text{MS Total}}, \end{align*}

where n = number of observations and k = number of estimated parameters (regression coefficients), so k=2 for simple linear regression.

The calculations in R are below.

> (SST <- sum(anova(nelson.lm)[2]))
[1] 24.1306
> (SSRes <- anova(nelson.lm)[2][[1]][2])
[1] 0.616063
> (Rsq <- 1-SSRes/SST)
[1] 0.97447
> (Totdf <- sum(anova(nelson.lm)[1])) 
[1] 8
> (Resdf <- (anova(nelson.lm)[1][[1]][2]))
[1] 7
> (AdjRsq <- 1- (SSRes/Resdf)/(SST/Totdf))
[1] 0.970822

A scatterplot of the Beetle data is given below, with the regression line superimposed. The equation of the line and the correlation coefficient are also stated.

Scatterplot of Beetle data, with regression line superimposed. The equation of the line and the correlation coefficient are also stated.

Two-sided and one-sided hypothesis tests for \beta_1 can be conducted using the output.

    \[H_0: \beta_1 = 0 \qquad H_1: \beta_1 < 0\]

For the coefficient for Humidity, the one-sided p-value = 7.82 \times 10^{-7}/2 = 3.91\times 10^{-7} < 0.025. At the 2.5% level of significance we reject the null hypothesis and conclude that a significant negative linear relationship exists between Weightloss and Humidity.

Confidence Intervals can be easily obtained. Compare this with the calculation in Example 15.4.

> confint(nelson.lm)
                  2.5 %      97.5 %
(Intercept)  8.25104923  9.15700538
Humidity    -0.06092143 -0.04552287

The ANOVA table for the regression is given below.

> anova(nelson.lm)
Analysis of Variance Table

Response: Weightloss
          Df  Sum Sq Mean Sq F value    Pr(>F)    
Humidity   1 23.5145  23.515  267.18 7.816e-07 ***
Residuals  7  0.6161   0.088                      

Compare the ANOVA table values with those given in the regression output. The F-value, degrees of freedom and p-value are the same as in the regression output. Note that the square root of the residual mean square is

    \[\sqrt{0.088} = 0.2966,\]

which is equal to the residual standard error from the linear regression output.

The figure below  may be used to verify the assumptions of linear regression.

> oldpar<-par(mfrow=c(2,2)) #mfrow (multi figure row) specifies the number of rows
# and columns the plots should be arranged in.
> nelson.res<-rstandard(nelson.lm)
> hist(nelson.res,xlab="Standardised residuals", ylab="Frequency")
> box()
> plot(nelson.res~nelson.lm$fitted.values, xlab="Fitted values", 
ylab="Standardised residuals",main="Standardised residuals vs Fitted")
> qqnorm(nelson.res,xlab="Normal scores", ylab="Standardised residuals")
> par(oldpar)
Diagnostic plots for the Nelson linear model. Three graphs: Histogram of nelson.res, Standardised residuals vs Fitted and Normal Q-Q Plot. The histogram peaks in the center, not too different to that expected for a normal distribution, although there are gaps (where the frequency = 0) between the central peak and the tails on each side. The plot of standardised residuals against fitted values shows no obvious pattern. The normal probability plot is not too far from a straight line. The plot of standardised residuals against fitted values seems to indicate some "fanning" outward, but this perception is due to the small sample size.
Diagnostic plots for the Nelson linear model.

 

  1. A linear model is appropriate. The plot of standardised residuals against fitted values shows no obvious pattern, so we conclude that a linear model is appropriate.
  2. Residuals are normal. We have only nine observations so normality cannot be verified. Nonetheless, the normal probability plot is not too far from a straight line. Additionally, the histogram shows a plot that is not too different to that expected for a normal distribution, despite the gaps. We conclude that there is no evidence against the normality assumption.
  3. Residuals have a constant variance. The plot of standardised residuals against fitted values seems to indicate some “fanning” outward, but this perception is due to the small sample size.

Note that there are no outliers in the data since the standardised residuals are all between -2 and 2.
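This check can be done directly in R, assuming nelson.lm and nelson.res are as computed in the diagnostics code above:

range(nelson.res)            # all standardised residuals lie between -2 and 2
which(abs(nelson.res) > 2)   # no observations flagged as outliers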

15.11 Prediction

Predictions can be made using the predict() function in R. This is easy to use – check the online help if you are unsure. For example, to predict the Weightloss for Humidity = 1, 2, 3, 4, 5, and 6:

> new <- data.frame(Humidity = c(1:6))
> predict(nelson.lm,new)
       1        2        3        4        5        6 
8.650805 8.597583 8.544361 8.491139 8.437917 8.384694 
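These predictions are simply the fitted equation evaluated at the new Humidity values, which can be verified by hand:

# predict() is equivalent to evaluating b0 + b1 * Humidity
b0 <- 8.704027; b1 <- -0.053222    # coefficients from the fitted model
b0 + b1 * (1:6)                    # 8.650805 8.597583 ... as given by predict()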

15.12 Warning: Always perform model diagnostics

Given below is the output of a regression analysis.

Call:
lm(formula = y ~ x)

Residuals:
   Min     1Q Median     3Q    Max 
 -10.0   -7.5   -1.0    6.0   15.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -15.0000     5.5076  -2.724   0.0235 *  
x            10.0000     0.9309  10.742 1.97e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.764 on 9 degrees of freedom
Multiple R-squared:  0.9276,	Adjusted R-squared:  0.9196 
F-statistic: 115.4 on 1 and 9 DF,  p-value: 1.967e-06

Observations

  1. The coefficient of determination is 0.9276, so about 93% of the variation in the data is explained by the regression. The p-value is 1.97 \times 10^{-6}, so there is a significant linear relationship between the variables.
  2. This does not mean that the linear regression is a good one. We have not performed any model diagnostics yet. In particular, we do not even know whether a linear model is appropriate! It is very important to perform some EDA before the regression analysis.
Diagnostic plots for example of non-linear model. Four graphs: Scatterplot of data with fitted line, histogram of residuals, normal probability plot, residuals against fitted values. Values on the scatterplot are close to the fitted line. The histogram shows a peak on the left at the lowest Residuals value, with a gap (ie frequency = 0) and then equivalent low frequency values for the higher Residuals values. Values on the normal probability plot are close to the line. Values on the Residuals against fitted values form an inverted curved shape.
Diagnostic plots for example of non-linear model.

The scatter plot of the data shows that there is a problem with the linear model: the data points follow a pattern. This is much clearer in the plot of residuals against fitted values and against the x values (this is the reason for residual plots: any patterns are accentuated). Both show that a quadratic model is appropriate. In fact the data come from y=x^2, as you can check from the scatter plot of the data.

This example clearly shows that the ANOVA table, the correlation coefficient and the hypothesis test alone are not enough to justify a linear model.
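The output above is consistent with eleven points at x = 0, 1, ..., 10 and y = x^2 exactly (nine residual degrees of freedom). A minimal sketch to reproduce it and display the tell-tale residual pattern:

# Data from an exact quadratic: the straight-line fit still has a large R^2
x <- 0:10
y <- x^2
quad.lm <- lm(y ~ x)
summary(quad.lm)        # intercept -15, slope 10, R-squared 0.9276, as in the output above
oldpar <- par(mfrow = c(1, 2))
plot(y ~ x); abline(quad.lm)                       # data with the fitted straight line
plot(rstandard(quad.lm) ~ fitted(quad.lm),         # clear quadratic pattern in residuals
     xlab = "Fitted values", ylab = "Standardised residuals")
par(oldpar)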

15.13 A note on outliers

Outliers usually lead to  model misfit and are commonly omitted. However, outliers are still part of the data. It is important to understand the reasons for the outliers. These points should be investigated to see how they differ from the rest of the data. What are the particular characteristics of  outliers? Do they, for example, represent observations from a sub-population with some special features?

As an example, suppose a clinical trial is conducted for a weight-loss drug, and out of a sample of 500 people, ten report sharp abdominal pains. Are these outliers, and should they be ignored or omitted? In this case it is important to investigate these ten people further to understand whether they have any particular characteristics that may explain the pain. If, for example, all ten are females over 50, then a very strong caveat for the drug needs to be that females over 50 should not take this drug.

If these ten points are omitted and simply ignored, then when the sample is generalised to the entire population those ten people scale up to several hundred thousand or even millions, and this is a serious medical problem.

15.14 Summary

  1. State the linear regression model and explain the meaning of each term.
  2. Fit a linear regression model in R and interpret the output.
  3. Understand the ANOVA table for regression.
  4. Perform a test of hypothesis for the slope parameter \beta_1 using either the F-distribution or the t-distribution in the R output.
  5. Find confidence intervals for the slope parameter.
  6. Predict the values of y from the fitted equation for regression.
  7. Know when the prediction is reliable.
  8. Know some simple relationships between sums of squares, degrees of freedom and the coefficients of regression.

Licence


Statistics: Meaning from data Copyright © 2024 by Dr Nazim Khan is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
