15 Simple Linear Regression

Learning Outcomes

At the end of this chapter you should be able to:

  1. understand the concept of simple linear regression;
  2. fit a linear regression model to data in R;
  3. interpret the results of regression analysis;
  4. perform model diagnostics;
  5. evaluate models.

 

 

15.1 Introduction

Many situations require an assessment of which of the available variables affect a particular response or output.

Example 15.1 Case Study: Nelson beetle data

Nelson (1964) investigated the effects of starvation and humidity on weight loss in flour beetles. The data consist of the variables Weightloss (mg) and Humidity (%).

Objective:  To quantify the relationship between Weightloss and Humidity.

Reference: Nelson, V. E. (1964). The effects of starvation and humidity on water content in Tribolium confusum Duval (Coleoptera). PhD Thesis, University of Colorado.

Regression is the study of relationships between variables, and is a very important statistical tool because of its wide applicability. Simple linear regression involves only two variables:

X = independent or explanatory variable;

Y = dependent or response variable;

and they are related by a straight line.
The observations are (X_1,Y_1),(X_2,Y_2),\dots,(X_n,Y_n).
Example 15.2
Let X = height and Y = weight of person.
People of the same height can have different weights. On average as height increases, weight also increases. Given the height, the weight has random variations from some mean weight for that height.
Assumptions of the regression model. The graph is a scatterplot of Weight against Height. For a given height, several weights are plotted, with the weight increasing linearly with height. A simple linear regression line is also plotted. For each set of weights for a given height, a normal distribution curve is plotted. All the normal curves have the same spread, indicating that at different values of height, weight is normally distributed with a homogeneous variance.
Assumptions of the regression model.

The Model

The general univariate statistical model is

    \[y_ i = \mu_i + \epsilon_i,\]

where \mu_i is the mean of y_i and \epsilon_i is the random variation term. Linear regression assumes that \mu_i is a linear function of X, that is

    \[{\rm E}(Y_i)=\mu_i = \beta_0+\beta_1 X_i.\]

THE MODEL

    \[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad i=1,2,\cdots,n\]

where

Y_i =response (or dependent) variable,

X_i =explanatory (or independent) variable,

\beta_0 = intercept,
\beta_1 = slope,
\epsilon_i = error or random variation.

Model assumptions

Observed data \left(X_i,Y_i\right), and

    \[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad i = 1,2,\ldots,n.\]

We treat Y_i as random variables corresponding to observations X_i.

ASSUMPTIONS

  1. A linear model is appropriate, that is, {\rm E}\ (Y_i) = \beta_0 + \beta_1 X_i.
  2. The error terms \epsilon_i are normally distributed.
  3. The error terms \epsilon_i have constant variance.
  4. The error terms \epsilon_i are uncorrelated.

Variance

Note that

    \[{\rm Var}(Y_i) = {\rm Var}(\beta_0 + \beta_1 X_i + \epsilon_i).\]

But \beta_0 + \beta_1 X_i is a constant, so

    \[{\rm Var}(Y_i) = {\rm Var}(\epsilon_i) = \sigma^2,\]

assumed constant (that is, \sigma^2 does not depend on i). The assumptions then imply that

    \[Y_i \stackrel{iid}{\sim} {\rm N}\left(\beta_0 + \beta_1 X_i, \sigma^2\right).\]

Further,

    \[{\rm E}\left(\epsilon_i\right) = {\rm E}\left[Y_i-\left(\beta_0 + \beta_1 X_i\right)\right] =0,\]

since {\rm E}(X-\mu) = 0. This implies that

    \[\epsilon_i \stackrel{iid}{\sim} {\rm N}(0,\sigma^2).\]
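These assumptions can be visualised by simulating data from the model in R. The sketch below is illustrative only; the parameter values and the x values are arbitrary choices, not from the text.

# Simulate data from Y_i = beta0 + beta1*X_i + eps_i, with eps_i ~ N(0, sigma^2)
set.seed(1)
beta0 <- 2; beta1 <- 0.5; sigma <- 1       # illustrative parameter values only
x <- 1:20
y <- beta0 + beta1 * x + rnorm(length(x), mean = 0, sd = sigma)
plot(y ~ x)                                # scatter of the simulated data
abline(a = beta0, b = beta1)               # the true mean line E(Y) = beta0 + beta1*x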

15.2 Parameter Estimation — Method of Least Squares

The model has three parameters, \beta_0,\beta_1 and \sigma^2, and these need to be estimated from the data. Denote by B_0 and B_1 respectively the estimates of \beta_0 and \beta_1.

    \begin{align*} \textrm{Fitted value} \quad  \hat{Y}_i &= B_0+B_1 X_i\\ \textrm{Residual} \quad r_i &= Y_i-\hat{Y}_i\\ &= Y_i-(B_0+B_1 X_i) \end{align*}

The method of least squares chooses B_0 and B_1 to minimise \sum_{i=1}^n r_i^2.

Recall

  1. \sum_{i=1}^n (y_i-c)^2 is a minimum when c=\overline{y}, from Chapter 2.
  2. ax^2+bx+c is a minimum when x= -\dfrac{b}{2a}.
  3. (a \pm b)^2 = a^2 \pm 2ab+b^2.
Now

    \begin{align*} \sum_{i=1}^n r_i^2 &= \sum_{i=1}^n \left[Y_i-(B_0+B_1 X_i)\right]^2\\ &= \sum_{i=1}^n \left[(Y_i-B_1 X_i) - B_0\right]^2 \end{align*}

 

This is a minimum when B_0 = \overline Y - B_1 \overline X. Then

    \begin{align*} \sum_{i=1}^n r_i^2 &= \sum_{i=1}^n \left[Y_i-(\overline Y -B_1 \overline X)-B_1 X_i\right]^2\\ &= \sum_{i=1}^n \left[(Y_i- \overline Y) - B_1(X_i - \overline X)\right]^2 \end{align*}

Using (a-b)^2 = a^2 - 2ab + b^2, the last line gives

    \[\sum_{i=1}^n r_i^2= \sum_{i=1}^n\left(Y_i-\overline Y\right)^2 -2B_1 \sum_{i=1}^n\left(X_i-\overline X\right)\left(Y_i- \overline Y\right)+ B_1^2\sum_{i=1}^n\left(X_i-\overline X\right)^2.\]

Define

    \begin{align*} \textrm{SS}_Y  &= \sum_{i=1}^n (Y_i-\overline Y)^2\\ \textrm{SS}_X &= \sum_{i=1}^n (X_i- \overline X)^2\\ \textrm{SS}_{XY} &= \sum_{i=1}^n (X_i - \overline X)Y_i \end{align*}

    \[\sum_{i=1}^n r_i^2=B_1^2 \textrm{SS}_X-2B_1 \textrm{SS}_{XY}+ \textrm{SS}_Y\]

Now the quadratic function y=ax^2+bx+c has a minimum when x=-\dfrac{b}{2a}. The expression above is a quadratic function in B_1, so it has its minimum at \hat{B}_1=\dfrac{2\textrm{SS}_{XY}}{2\textrm{SS}_X}.
Thus the least squares estimates of the slope and intercept parameters are:

    \[\hat{B}_1=b_1=\dfrac{\textrm{SS}_{XY}}{\textrm{SS}_X} \hspace{1cm} \hat{B}_0=b_0=\overline{Y}-\hat{B}_1 \overline{X}\]

Exercise

Show that

    \begin{align*} \textrm{SS}_{XY} &= \sum_{i=1}^n (X_i - \overline X)(Y_i-\overline Y)\\ &= \sum_{i=1}^n (X_i - \overline{X}) Y_i\\ &= \sum_{i=1}^n (Y_i-\overline{Y}) X_i\\ &= \sum_{i=1}^n X_iY_i  - n {\overline{X}}\ {\overline{Y}}\\ \textrm{SS}_X &= \sum_{i=1}^n (X_i - \overline{X})^2\\ &= \sum_{i=1}^n (X_i - \overline{X})X_i\\ &= \sum_{i=1}^n X_i^2 - n \overline{X}^2\\ \textrm{SS}_Y &= \sum_{i=1}^n (Y_i-\overline{Y})^2\\ &= \sum_{i=1}^n (Y_i - \overline{Y})Y_i\\ &= \sum_{i=1}^n Y_i^2 - n\overline{Y}^2 \end{align*}

Note

    \[B_0=\overline{Y}-B_1 \overline{X} \Rightarrow \overline{Y}=B_0+B_1 \overline{X}.\]

That is, the point (\overline{X},\overline{Y}) satisfies the equation of regression, i.e., (\overline{X},\overline{Y}) ALWAYS lies on the line of regression.

Example 15.3 Beetle data

Nine beetles were used in an experiment to determine the effects of starvation and humidity on water loss in flour beetles. Here X = Humidity, Y = Weightloss.

Summary statistics

\sum_{i=1}^{9} x_i=453.5 \qquad \sum_{i=1}^{9} x_i^2=31 152.75 \qquad \sum_{i=1}^{9} y_i=54.2
\sum_{i=1}^{9} y_i^2=350.535 \qquad \sum_{i=1}^{9} x_iy_i=2 289.26 \qquad \overline x = 50.38889
\overline y = 6.02222

    \begin{align*} \textrm{SS}_Y  &= \sum_{i=1}^9 y_i^2 - 9\overline{y}^2 = 350.535 - 9\times 6.022^2 = 24.13056\\ \textrm{SS}_X &= \sum_{i=1}^9 x_i^2- 9\overline{x}^2 = 31 152.75 - 9\times 50.39^2 = 8 301.389\\ \textrm{SS}_{XY}  &= \sum_{i=1}^9 x_iy_i -9\times\overline{x}\ \overline{y} = 2 289.26 - 9 \times 50.39 \times 6.022 = -441.8178\\ b_1 &= \frac{SS_{XY}}{SS_X} = \frac{-441.8178}{8 301.389} = -0.05322\\ b_0 &=  \overline y - b_1 \overline x = 6.022 + 0.05322 \times 50.389 = 8.7040\end{align*}

    \[\widehat{\rm Weightloss} = 8.7040 - 0.05322 {\rm \ Humidity}\]

\widehat{\rm Weightloss} = Estimated average Weightloss for a given humidity

Intercept = 8.7040 = Average Weightloss when Humidity is zero.

Slope = -0.05322 = If Humidity increases by 1 %  then Weightloss decreases by 0.05322 mg.

That is, the beetles prefer a more humid environment in which the weightloss is lower.
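The hand calculations above can be checked in R using the summary statistics. This is a minimal sketch; the full analysis of the raw data is given in Section 15.10.

# Least squares estimates from the summary statistics of Example 15.3
n    <- 9
sx   <- 453.5;   sxx <- 31152.75
sy   <- 54.2;    sxy <- 2289.26
xbar <- sx / n;  ybar <- sy / n
SSX  <- sxx - n * xbar^2          # 8301.389
SSXY <- sxy - n * xbar * ybar     # -441.8178
(b1  <- SSXY / SSX)               # -0.05322
(b0  <- ybar - b1 * xbar)         #  8.7040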

15.3 Partitioning the sum of squares

    \[\textrm{Residual SS} = \textrm{SS}_{\textrm{Res}} = \sum_{i=1}^n r_i^2.\]

Now, from the least squares derivation and using b_1 = \textrm{SS}_{XY}/\textrm{SS}_X,

    \begin{align*} \textrm{SS}_{\textrm{Res}} = \sum_{i=1}^n r_i^2 &= b_1^2\, \textrm{SS}_X - 2 b_1\, \textrm{SS}_{XY} + \textrm{SS}_Y\\ &= b_1^2\, \textrm{SS}_X - 2 b_1^2\, \textrm{SS}_X + \textrm{SS}_Y\\ &= \textrm{SS}_Y - b_1^2\, \textrm{SS}_X. \end{align*}

Define

    \begin{align*} \textrm{SS}_{\textrm{Regression}} & = b_1^2\ \textrm{SS}_X\\ \textrm{SS}_{\textrm{Total}} &= \textrm{SS}_Y.\end{align*}

Then

    \[\textrm{SS Total} = \textrm{SS Reg} + \textrm{SS Res}\]

Notes

  1. The SS Reg can also be expressed as follows.

        \begin{align*} \sum_{i=1}^n (\hat{y}_i-\overline{y})^2 &= \sum_{i=1}^n (b_0+b_1x_i-\overline{y})^2 \\ &= \sum_{i=1}^n (\overline{y}-b_1\overline{x}+b_1x_i-\overline{y})^2 \\ &= b_1^2 \sum_{i=1}^n (x_i-\overline{x})^2 = \text{SS}_{\text{Reg}}. \end{align*}

  2. The proportion of variation explained by the regression is

        \[R^2=\dfrac{\text{SS}_{\text{Reg}}}{\text{SS}_{\text{Total}}}.\]

    Note 0 \le R^2 \le 1.
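As a check, the partition and R^2 can be computed for the Beetle data of Example 15.3 from the quantities already obtained. A minimal sketch; the same R^2 value appears in the regression output of Section 15.10.

# Sum of squares partition and R^2 for the Beetle data (Example 15.3)
SSX   <- 8301.389
SSY   <- 24.13056                 # total sum of squares
b1    <- -441.8178 / SSX          # slope estimate
SSReg <- b1^2 * SSX               # regression SS, about 23.51
SSRes <- SSY - SSReg              # residual SS, about 0.62
SSReg / SSY                       # R^2, about 0.974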

15.4 Hypothesis Test for \beta_1

To determine if there is a significant linear relationship between x and y, we test the hypotheses:

    \begin{align*} H_0: \beta_1 &=0 \qquad \text{No linear relationship} \\ H_1: \beta_1 &\neq 0 \qquad \text{Significant linear relationship} \\ \beta_1 &>0 \qquad \text{Significant positive linear relationship} \\ \beta_1 &<0 \qquad \text{Significant negative linear relationship} \end{align*}

If H_0 is true then \beta_1=0, so

    \[\text{SS}_{\text{Reg}}=b_1^2 \text{SS}_X \approx 0 \quad \text{and} \quad \text{SS}_{\text{Tot}} \approx \text{SS}_{\text{Res}}.\]

Thus we can use the ratio \text{SS}_{\text{Reg}}/\text{SS}_{\text{Res}} as a test statistic. However, each of these sums of squares needs to be adjusted by its degrees of freedom.

    \begin{align*} \text{Regression df} &= \text{Number of parameters -1} \\ &=k-1 =1 \; \text{for simple linear regression}.\\ \text{Total df} &= n-1 \\[12pt] \text{Residual df} &= n-k \\ &= n-2 \; \text{for simple linear regression}. \end{align*}

Note that two parameters are estimated from the data in simple linear regression, the intercept and the slope, so k=2. These calculations can be set up as a table.

ANOVA Table

Source df SS MS F
Regression k-1 \text{SS}_{\text{Reg}} \text{MS}_{\text{Reg}} = \text{SS}_{\text{Reg}}/(k-1) F = \text{MS}_{\text{Reg}}/\text{MS}_{\text{Res}} \sim F_{k-1,\, n-k}
Residual n-k \text{SS}_{\text{Res}} \text{MS}_{\text{Res}} = \text{SS}_{\text{Res}}/(n-k)
Total n-1 \text{SS}_{\text{Tot}} \text{MS}_{\text{Tot}} = \text{SS}_{\text{Tot}}/(n-1)

 

The test then proceeds as usual.

If p-value = P(F > f_{obs}) < \alpha, where F \sim F_{k-1,n-k}, then reject H_0, and conclude there is a significant linear relationship between Y and X.

Plot of the F(6,45) distribution, with the upper tail shaded corresponding to a probability of 0.05.

Example 15.4 Beetle data (ctd)

    \begin{align*} SS_T &= SS_Y = 24.13056\\ SS_{Reg} &= b_1^2 SS_{X} = (-0.05322)^2 \times 8301.389 = 23.51259. \end{align*}

The ANOVA table is given below; the small difference between the value of SS_{Reg} computed above and the value in the table is due to rounding of b_1. Note that the relationships in the ANOVA table are as in the chapter on ANOVA. That is,

    \begin{align*} SST &= SSReg + SSRes\\ df \ Total &= df\ Reg + df\ Res\\ MSReg &= \frac{SSReg}{df\ Reg}\\ MSRes &= \frac{SSRes}{df\ Res}\\ F &= \frac{MSReg}{MSRes} \end{align*}

Source df SS MS F
Regression 1 23.51449 23.51449 267.18
Residual 7 0.61607 0.088010
Total 8 24.13056

 

The hypotheses of interest are:

    \[H_0: \beta_1 = 0 \qquad H_1: \beta_1 \ne 0\]

p-value = P(F > 267.18) = 7.82 \times 10^{-7} << 0.05, so there is overwhelming evidence against the null hypothesis.

pf(267.18, 1, 7, lower.tail = F)
[1] 7.816436e-07

We conclude based on the linear regression analysis that there is a significant linear relationship between Weightloss and Humidity.

We could also conduct a left-sided test of hypothesis.

    \[H_0: \beta_1 = 0 \qquad H_1: \beta_1 < 0\]

Now the p-value is simply half of that for the two-sided test. That is,

    \[{\rm p-value\ }= 7.82 \times 10^{-7}/2 = 3.91 \times 10^{-7} << 0.025,\]

so there is overwhelming evidence against the null hypothesis. We conclude based on the linear regression analysis that there is a significant negative linear relationship between Weightloss and Humidity.

15.5 Estimate of \sigma^2

The final model parameter that needs to be estimated is the constant error variance term \sigma^2. Now

    \[{\rm Var}(Y_i) = {\rm Var}(\epsilon_i) =\sigma^2,\]

which can be estimated by the average of the residuals squared. That is,

    \[\hat{\sigma}^2 = \frac{1}{n-k}\sum_{i=1}^n r_i^2.\]

Note that here we have used the data to estimate k parameters/regression coefficients. For simple linear regression, k=2. Also note that as in ANOVA,

    \[s^2 = MS_{Res},\]

which can be read from the ANOVA table for regression.

For the Beetle data, s^2 = 0.08801.

15.6 Distribution of B_1

    \[B_1 = \dfrac{\sum_{i=1}^n (X_i-\overline{X})Y_i}{\sum_{i=1}^n (X_i-\overline{X})^2}=\dfrac{\text{SS}_{XY}}{\text{SS}_X}\]

The random variables here are Y_i, and X_i are considered constant. Put

    \begin{align*} C_i &= \frac{X_i-\overline{X}}{SS_X}\\ \text{so\ } \sum_{i=1}^n C_i &= \frac{1}{SS_X} \sum_{i=1}^n (X_i-\overline{X}) =0,\\ \sum_{i=1}^n C_i\ X_i &= \frac{1}{SS_X}\sum_{i=1}^n (X_i-\overline{X})X_i = \frac{SS_X}{SS_X} = 1,\\ \text{and\ } \sum_{i=1}^n C_i^2 &= \frac{1}{SS_X^2}\sum_{i=1}^n (X_i-\overline{X})^2 = \frac{1}{SS_X}. \end{align*}

Then

    \begin{align*} B_1 &= \sum_{i=1}^n C_i\ Y_i\\ \text{so\ } {\rm E}(B_1) &= \sum_{i=1}^n C_i\ {\rm E}(Y_i)\\ &= \sum_{i=1}^n C_i\ (\beta_0+\beta_1 X_i)\\ &= \beta_0 \underbrace{\sum_{i=1}^n C_i}_{=0} + \beta_1 \underbrace{\sum_{i=1}^n C_i\ X_i}_{=1}\\ &= \beta_1, \end{align*}

so B_1 is an unbiased estimator of \beta_1.

Next,

    \[{\rm Var}(B_1) = \sum_{i=1}^n C_i^2\ {\rm Var}(Y_i)= \frac{\sigma^2}{SS_X}.\]

Further, Y_i are normally distributed, so B_1=\sum_{i=1}^n C_iY_i is the sum of normal random variables; thus B_1 is also normal.

    \begin{align*} B_1 & \sim \text{N}\left(\beta_1, \dfrac{\sigma^2}{\text{SS}_X}\right)\\ \Rightarrow Z&=\dfrac{B_1-\beta_1}{\sigma/\sqrt{\text{SS}_X}} \sim \text{N}(0,1) \end{align*}

 

If \sigma is unknown (usually the case), we replace it by S, and then

    \begin{align*} T &= \dfrac{B_1-\beta_1}{S/\sqrt{\text{SS}_X}} \sim t_{n-2}\\ SE\left(B_1\right) &= \frac{s}{\sqrt{\text{SS}_X}} \end{align*}

We can thus use the t-distribution for (one-sided and two-sided) hypothesis tests and confidence intervals for \beta_1.

Example 15.4 Beetle data

b_1 = -0.0532222, \quad SS_X = 8301.389, \quad s= \sqrt{0.08801}

Is there a significant

  1. linear relationship
  2. negative linear relationship
    between Weightloss and Humidity?

Solution

The hypotheses are H_0: \beta_1 = 0 against the two-sided H_1: \beta_1 \ne 0 or the left-sided H_1: \beta_1 < 0.
Test statistic is

    \begin{align*} T &= \frac{B_1-\beta_1}{S/\sqrt{SS_X}} \sim t_{7}\\ \text{and\ } t_{Obs} &= \frac{-0.053222 - 0}{\sqrt{0.08801}/\sqrt{8301.389}} = -16.34563. \end{align*}

  1.  The two-sided p-value = 2\ P(T < -16.34563) = 7.82 \times 10^{-7} << 0.05 so there is overwhelming evidence against the null hypothesis. Same p-value and conclusion as before.
  2. One-sided test. p-value = P(T < -16.34563) = 3.91 \times 10^{-7} << 0.025 so there is overwhelming evidence against the null hypothesis. Same p-value and conclusion as before.

Note

  1. The p-values from the F-distribution and the t-distribution are the same here. Note that

        \[(t_{obs})^2 = (-16.34563)^2 = 267.1796 = F_{obs}\]

    from the ANOVA table. In fact, if T \sim t_k then

        \[T^2 \sim F_{1,k},\]

    which the sketch below verifies numerically.

  2. The one-sided p-value is simply half the two-sided p-value from either the t- or F-distribution.
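Both notes can be checked numerically in R; a minimal sketch using the Beetle data values quoted above:

# Equivalence of the t-test and F-test p-values for the Beetle data
t_obs <- -16.34563
t_obs^2                                            # 267.18, the F statistic
2 * pt(t_obs, df = 7)                              # two-sided p-value from t_7
pf(t_obs^2, df1 = 1, df2 = 7, lower.tail = FALSE)  # the same p-value from F_{1,7}
pt(t_obs, df = 7)                                  # one-sided p-value, half of the above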

Example 15.4 Beetle data – Confidence interval

A 95% confidence interval for \beta_1 is calculated as before.

    \begin{align*} 95\% \text{\ CL for\ } \beta_1 &= b_1 \pm t_{7}^{0.025} \times \frac{S}{\sqrt{SS_X}}\\ &= -0.053222 \pm 2.3646 \times \frac{\sqrt{0.08801}}{\sqrt{8301.389}}\\ &= -0.053222 \pm 0.007699\\ \text{so\ } 95\% \text{\ CI for\ } \beta_1 &= (-0.0609, -0.0455). \end{align*}
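A minimal sketch of this interval calculation in R; compare with the confint() output in Section 15.10.

# 95% confidence interval for beta_1, Beetle data
b1    <- -0.053222
se_b1 <- sqrt(0.08801) / sqrt(8301.389)   # standard error of b1, about 0.003256
tcrit <- qt(0.975, df = 7)                # 2.3646
b1 + c(-1, 1) * tcrit * se_b1             # (-0.0609, -0.0455)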

15.7 Model diagnostics

Model assumptions are verified by analysis of residuals. There are four assumptions.

  1. A linear model is appropriate.
    1. Scatterplot of y_i against x_i. This should show points roughly around a straight line. This is usually part of the data exploration before model fitting.
    2. Scatterplot of standardised residuals against fitted values. The plot should be patternless. This is the best plot to verify this assumption, and is also used in the next chapter on multiple regression.
    3. Scatterplot of standardised residuals against x_i. Plot should be without pattern. This is similar to the previous plot. It is also used in the next chapter on multiple regression for model building and selecting if some transformation of data is appropriate.
  2. The errors are normally distributed. These are verified as for ANOVA.
    1. Histogram of residuals.
    2. Normal probability plot.
    3. Chi-squared test of normality. We won’t cover this.
  3. Errors have constant variance. This is again by inspecting residual plots, as for ANOVA.
    1. Plot of standardised residuals against fitted values. Should have constant spread. This is the best plot and is also used in the next chapter on multiple regression.
    2. Plot of standardised residuals against x_i. Should have constant spread. In the case of multiple regression this plot may help identify variables that are related to inhomogeneous variance.
  4. Errors are uncorrelated. Durbin–Watson test statistic. This is beyond the scope of this text.

Example 15.5 (a) 

Evaluate the model with the given diagnostic plots.

Diagnostic plots for Example 15.5 (a). Four graphs; scatterplot, histogram of residuals, normal probability plot and residuals against fitted values. The scatterplot appears close to a straight line. The histogram of residuals peaks close to the center. The normal probability plot does not depart markedly from a straight line. The plot of residuals against fitted values shows no clear pattern.
Diagnostic plots for Example 15.5 (a).

 

Solution

The scatterplot appears close to a straight line, and the plot of residuals against fitted values shows no clear pattern, so a linear model is appropriate. The histogram of residuals is not too different from that expected for a normal distribution, showing only a slight asymmetry, and the normal probability plot does not depart markedly from a straight line. We conclude that the normality assumption is not violated. Finally, the plot of residuals against fitted values shows no change in spread, so there is no evidence against the homogeneous variance assumption.

Example 15.5 (b) 


Evaluate the model with the given diagnostic plots.

Diagnostic plots for Example 15.5 (b). Four graphs; scatterplot, histogram of residuals, normal probability plot and residuals against fitted values. The scatterplot is a slightly curved line, increasing on both x and y. The histogram is much higher at low x values and tapers off to the right. The normal probability plot shows values at a significant distance from the straight line and the residuals against fitted values shows an inverted curve shape.
Diagnostic plots for Example 15.5 (b).

Solution

The plot of residuals against fitted values shows a quadratic trend, so a linear model is not appropriate. The departure from normality is also evident from the histogram and normal probability plot, but this may be due to the linear model not being appropriate. The homogeneous variance assumption seems to be satisfied as there is no change in spread in the plot of residuals against fitted values.

Example 15.5 (c) 

Evaluate the model with the given diagnostic plots.

Diagnostic plots for Example 15.5 (c). Four graphs; scatterplot, histogram of residuals, normal probability plot and residuals against fitted values. The scatterplot shows values cluster around a diagonal line. The histogram peaks sharply for the center x values and tapers at low and high values. The normal probability plot shows departures from the straight line at upper and lower extremes. The residuals against fitted values has no clear pattern.
Diagnostic plots for Example 15.5 (c).

Solution

The plot of residuals against fitted values indicates a “fanning out”, so the homogeneous variance assumption is violated. A linear model seems appropriate and there is no evidence against the normality assumption.

Example 15.5 (d) 

Evaluate the model with the given diagnostic plots.

Diagnostic plots for Example 15.5 (d). Four graphs; scatterplot, histogram of residuals, normal probability plot and residuals against fitted values. The scatterplot shows values increasing linearly, but the spread is not uniform. The histogram is right skewed. The normal probability plot shows departures from the straight line at both extremes. The residuals against fitted values shows a non-uniform spread.
Diagnostic plots for Example 15.5 (d).

Solution

From the histogram and normal probability plot, it is clear that the normality assumption does not hold. The scatterplot of the data indicates that a linear model may not be appropriate. Similarly the plot of residuals against fitted values shows evidence against the homogeneous variance assumption. However, these issues may be due to a lack of normality.

15.8 Outliers and Points of High Leverage

Outliers are points away from the bulk of the data. They usually have large absolute residuals, that is, large positive or large negative residuals, and are easily identified from plots. Commonly |\text{standardised}\, r_i| > 2 is used to identify outliers.

Outliers are important because they may affect model fit. To investigate the effect of an outlier the regression model should be fitted with and without the point for comparison.

If in arriving at the final model any outliers have been omitted then this should be reported. One should never silently omit outliers. Indeed, if enough points are omitted then one can arrive at a perfect model.

Points of high leverage are those that strongly influence the goodness of fit of the model. These usually have small absolute residuals, close to zero. Again the model should be fitted with and without them.

Scatterplot illustrating a point of high leverage. A scatterplot graph with trend line. The graph shows one point (x 6.5, y 5) that is away from the bulk of the data, which is mostly within the ranges x=1 to x=2.5 and y=1 to y=3.
Scatterplot illustrating a point of high leverage.

The graph above shows a point that is away from the bulk of the data. This point is an outlier. It is also a point of high leverage, since the linear model is very strongly influenced by it. Removing it and re-fitting the model gives the graph below. This plot shows there really is no relationship between the variables here.

Scatterplot of data with the point of high leverage removed. Scatterplot graph with trend line. Trend line is almost horizontal.
Scatterplot of data with the point of high leverage removed.
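A minimal sketch of this fit-with-and-without comparison in R. The data below are made up for illustration (a cluster of points at small x plus one point near (6.5, 5), in the spirit of the plots above); they are not from the text.

# Hypothetical data: a cluster of points plus one high-leverage point near (6.5, 5)
dat <- data.frame(
  x = c(1.0, 1.3, 1.6, 1.9, 2.1, 2.4, 6.5),
  y = c(2.5, 1.5, 2.8, 1.6, 2.4, 2.0, 5.0)
)
fit_all     <- lm(y ~ x, data = dat)                  # fit with the high-leverage point
fit_reduced <- lm(y ~ x, data = dat, subset = x < 6)  # refit without it
coef(fit_all)       # slope largely determined by the single extreme point
coef(fit_reduced)   # much weaker relationship without it
plot(y ~ x, data = dat); abline(fit_all); abline(fit_reduced, lty = 2)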

15.9 Correlation coefficient

The correlation coefficient, r, is a measure of the strength of the linear relationship between x and y.

    \[r = \dfrac{\text{SS}_{XY}}{\sqrt{\text{SS}_X \text{SS}_Y}}\]

Then

    \begin{align*} r^2 &= \dfrac{(\text{SS}_{XY})^2}{\text{SS}_X \text{SS}_Y} = \left(\dfrac{\text{SS}_{XY}}{\text{SS}_X}\right)^2 \dfrac{\text{SS}_X}{\text{SS}_Y} \\ &= \dfrac{b_1^2 \text{SS}_X}{\text{SS}_Y}\\ &= \dfrac{\text{SS}_{\text{Reg}}}{\text{SS}_{\text{Total}}} = 1-\dfrac{\text{SS}_{\text{Res}}}{\text{SS}_{\text{Total}}} \end{align*}

Note that 0 \le r^2 \le 1, so -1 \le r \le 1.

|r|=1 \; \Rightarrow \; r^2=1=\dfrac{\text{SS}_{\text{Reg}}}{\text{SS}_{\text{Tot}}} \; \Rightarrow \; \text{SS}_{\text{Tot}}=\text{SS}_{\text{Reg}}
and \; \text{SS}_{\text{Res}}=\sum_{i=1}^n r_i^2=0 \; \Rightarrow \; r_i=0,
and \; y_i=b_0+b_1x_i, so the points (x_i,y_i) lie on a straight line.

If |r| is close to 1, then a strong linear relationship exists between x and y.
If |r| is close to 0, no linear relationship exists between x and y.

Note that

    \[r=b_1\sqrt{\dfrac{\text{SS}_X}{\text{SS}_Y}},\]

so r has the same sign as b_1. Testing H_0: \beta_1=0 is equivalent to testing H_0: r=0.
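For the Beetle data, r can be obtained from the summary statistics of Example 15.3. A minimal sketch; note that r^2 matches the Multiple R-squared reported in Section 15.10.

# Correlation coefficient for the Beetle data from the summary statistics
SSX  <- 8301.389;  SSY <- 24.13056;  SSXY <- -441.8178
(r   <- SSXY / sqrt(SSX * SSY))    # about -0.987, same sign as b1
r^2                                # about 0.9745, the R^2 of the regression
(SSXY / SSX) * sqrt(SSX / SSY)     # b1 * sqrt(SSX/SSY), equal to r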

CAUTION

|r| large means a strong linear relationship exists between x and y. It DOES NOT mean “x causes y”. It is possible that x and y are both related through a third variable. Spurious correlation is often misused to imply causation.

If |r| is small, it does not indicate that x and y are unrelated — it only indicates a lack of a linear relationship. x and y may still be related non-linearly.

If we reject H_0:\beta_1=0, then there is a significant linear relationship between x and y. This DOES NOT indicate that the relationship is a good one, or the best one. Residual analysis should be performed to verify model assumptions. A similar comment applies if |r| is large.

Example 15.6 Example of weak correlation but strong relationship

The scatterplot shows a very strong non-linear relationship between x and y, but a zero correlation.

Scatterplot indicating that zero correlation does not imply no relationship. Scatterplot graph shows values arranged in a symmetrical oval shape. At each x value there are 2 points with "mirrored" y values, one above the horizontal axis (eg; at 1.0) and one mirrored below the horizontal axis (eg; at negative 1.0). At each y value there are also two points on the graph with mirrored x values, so that the oval shape is also symmetrical to the left and right of the vertical y axis.
Scatterplot indicating that zero correlation does not imply no relationship.

ALWAYS examine a plot of the data!

15.10 Detailed analysis of the Beetle data: with R code

The data consists of observations for nine beetles for the variables  Weightloss (mg) and Humidity (%).

Objective:  To quantify the relationship between Weightloss and Humidity.

The figure below shows a scatterplot of the data and some summary statistics.

> nelson<-read.csv("nelson.csv", header=T)
 Humidity       Weightloss   
 Min.   : 0.00   Min.   :3.720  
 1st Qu.:29.50   1st Qu.:4.680  
 Median :53.00   Median :5.900  
 Mean   :50.39   Mean   :6.022  
 3rd Qu.:75.50   3rd Qu.:6.670  
 Max.   :93.00   Max.   :8.980  
 sd     : 32.21  sd     :1.74
> with(nelson, plot(Weightloss ~ Humidity))
Scatterplot with vertical axis 'Weightloss' and horizontal axis 'Humidity'. Values are roughly in a diagonal line trending down from Weightloss = 9 at Humidity = 0 to Weightloss = 4 at Humidity = 90
Scatterplot of Weightloss against Humidity.

Observations

The data range for Humidity is 0–93, with mean 50.39 and median 53.00. Weightloss has a range of 3.72–8.98, with mean 6.022 and median 5.900.

The scatter plot shows that a linear relationship is feasible.

The output of the regression analysis in R is given below. The function used to fit the linear model is lm(). The summary gives a table of coefficients with their standard errors and p-values, together with the residual standard error, the R-squared values and the overall F-test.

> nelson.lm<-lm(Weightloss ~ Humidity, data = nelson)# y ~ x is the model formula, 
#and the dataframe is specified by data = nelson.
> summary(nelson.lm)

Call:
lm(formula = Weightloss ~ Humidity, data = nelson)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46397 -0.03437  0.01675  0.07464  0.45236 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.704027   0.191565   45.44 6.54e-10 ***
Humidity    -0.053222   0.003256  -16.35 7.82e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2967 on 7 degrees of freedom
Multiple R-squared:  0.9745,	Adjusted R-squared:  0.9708 
F-statistic: 267.2 on 1 and 7 DF,  p-value: 7.816e-07

The coefficients of the model are given, along with the corresponding standard errors and p-values. The output below the coefficient table gives the residual standard error, which is the estimate of the model standard deviation \sigma. The degrees of freedom of the residual standard error are also given. It also gives the value of the correlation coefficient squared (Multiple R-squared). Finally, the F-value, its corresponding degrees of freedom and p-value are also given.

We will discuss the other output in the next chapter. Here we will calculate the Multiple R-Squared and Adjusted R-squared.

The Multiple R-squared and Adjusted R-squared are defined as

    \[R^2 = 1 - \frac{\text{Residual SS}}{\text{Total SS}},\]

and

    \begin{align*} \text{Adjusted\ } R^2 &= 1-\dfrac{\text{SS Res}/(n-k)}{\text{SS Total}/(n-1)} \\ &= 1 - \frac{\text{MS Res}}{\text{MS Total}}, \end{align*}

where n = number of observations and k = number of estimated parameters (regression coefficients), so k=2 for simple linear regression.

The calculations in R are below.

> (SST <- sum(anova(nelson.lm)[2]))
[1] 24.1306
> (SSRes <- anova(nelson.lm)[2][[1]][2])
[1] 0.616063
> (Rsq <- 1-SSRes/SST)
[1] 0.97447
> (Totdf <- sum(anova(nelson.lm)[1])) 
[1] 8
> (Resdf <- (anova(nelson.lm)[1][[1]][2]))
[1] 7
> (AdjRsq <- 1- (SSRes/Resdf)/(SST/Totdf))
[1] 0.970822

A scatterplot of the Beetle data is given below, with the regression line superimposed. The equation of the line and the correlation coefficient are also stated.

Scatterplot of Beetle data, with regression line superimposed. The equation of the line and the correlation coefficient are also stated.

Two-sided and one-sided hypothesis tests for \beta_1 can be conducted using the output.

    \[H_0: \beta_1 = 0 \qquad H_1: \beta_1 < 0\]

For the coefficient for Humidity, the one-sided p-value = 7.82 \times 10^{-7}/2 = 3.91\times 10^{-7} < 0.025. At the 2.5% level of significance we reject the null hypothesis and conclude that a significant negative linear relationship exists between Weightloss and Humidity.

Confidence Intervals can be easily obtained. Compare this with the calculation in Example 15.4.

> confint(nelson.lm)
                  2.5 %      97.5 %
(Intercept)  8.25104923  9.15700538
Humidity    -0.06092143 -0.04552287

The ANOVA table for the regression is given below.

> anova(nelson.lm)
Analysis of Variance Table

Response: Weightloss
          Df  Sum Sq Mean Sq F value    Pr(>F)    
Humidity   1 23.5145  23.515  267.18 7.816e-07 ***
Residuals  7  0.6161   0.088                      

Compare the ANOVA table values with those given in the regression output. The F-value, degrees of freedom and p-value are the same as in the regression output. Note that the square root of the residual mean square is

    \[\sqrt{0.088} = 0.2966,\]

which is equal to the residual standard error from the linear regression output.

The figure below  may be used to verify the assumptions of linear regression.

> oldpar<-par(mfrow=c(2,2)) #mfrow (multi figure row) specifies the number of rows
# and columns the plots should be arranged in.
> nelson.res<-rstandard(nelson.lm)
> hist(nelson.res,xlab="Standardised residuals", ylab="Frequency")
> box()
> plot(nelson.res~nelson.lm$fitted.values, xlab="Fitted values", 
ylab="Standardised residuals",main="Standardised residuals vs Fitted")
> qqnorm(nelson.res,xlab="Normal scores", ylab="Standardised residuals")
> par(oldpar)
Diagnostic plots for the Nelson linear model. Three graphs: Histogram of nelson.res, Standardised residuals vs Fitted and Normal Q-Q Plot. The histogram peaks in the center, not too different to that expected for a normal distribution, although there are gaps (where the frequency = 0) between the central peak and the tails on each side. The plot of standardised residuals against fitted values shows no obvious pattern. The normal probability plot is not too far from a straight line. The plot of standardised residuals against fitted values seems to indicate some "fanning" outward, but this perception is due to the small sample size.
Diagnostic plots for the Nelson linear model.

 

  1. A linear model is appropriate. The plot of standardised residuals against fitted values shows no obvious pattern, so we conclude that a linear model is appropriate.
  2. Residuals are normal. We have only nine observations so normality cannot be verified. Nonetheless, the normal probability plot is not too far from a straight line. Additionally, the histogram shows a plot that is not too different to that expected for a normal distribution, despite the gaps. We conclude that there is no evidence against the normality assumption.
  3. Residuals have a constant variance. The plot of standardised residuals against fitted values seems to indicate some “fanning” outward, but this perception is due to the small sample size.

Note that there are no outliers in the data since the standardised residuals are all between -2 and 2.
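This check can be done directly in R, assuming nelson.lm and nelson.res are as computed in the diagnostics code above:

range(nelson.res)            # all standardised residuals lie between -2 and 2
which(abs(nelson.res) > 2)   # no observations flagged as outliers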

15.11 Prediction

Predictions can be made using the predict() function in R. This is easy to use – check the online help if you are unsure. For example, to predict the Weightloss for Humidity = 1, 2, 3, 4, 5, and 6:

> new <- data.frame(Humidity = c(1:6))
> predict(nelson.lm,new)
       1        2        3        4        5        6 
8.650805 8.597583 8.544361 8.491139 8.437917 8.384694 
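These predictions are simply the fitted equation evaluated at the new Humidity values, which can be verified by hand:

# predict() is equivalent to evaluating b0 + b1 * Humidity
b0 <- 8.704027; b1 <- -0.053222    # coefficients from the fitted model
b0 + b1 * (1:6)                    # 8.650805 8.597583 ... as given by predict()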

15.12 Warning: Always perform model diagnostics

Given below is the output of a regression analysis.

Call:
lm(formula = y ~ x)

Residuals:
   Min     1Q Median     3Q    Max 
 -10.0   -7.5   -1.0    6.0   15.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -15.0000     5.5076  -2.724   0.0235 *  
x            10.0000     0.9309  10.742 1.97e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.764 on 9 degrees of freedom
Multiple R-squared:  0.9276,	Adjusted R-squared:  0.9196 
F-statistic: 115.4 on 1 and 9 DF,  p-value: 1.967e-06

Observations

  1. The coefficient of determination is 0.9276, so about 93% of the variation in the data is explained by the regression. The p-value is 1.97 \times 10^{-6}, so there is a significant linear relationship between the variables.
  2. This does not mean that the linear regression is a good one. We have not performed any model diagnostics yet. In particular, we do not even know whether a linear model is appropriate! It is very important to perform some EDA before the regression analysis.
Diagnostic plots for example of non-linear model. Four graphs: Scatterplot of data with fitted line, histogram of residuals, normal probability plot, residuals against fitted values. Values on the scatterplot are close to the fitted line. The histogram shows a peak on the left at the lowest Residuals value, with a gap (ie frequency = 0) and then equivalent low frequency values for the higher Residuals values. Values on the normal probability plot are close to the line. Values on the Residuals against fitted values form an inverted curved shape.
Diagnostic plots for example of non-linear model.

The scatter plot of the data shows that there is a problem with the linear model: the data points follow a pattern. This is much clearer in the plot of residuals against fitted values and against the x values (this is the reason for residual plots: any patterns are accentuated). Both show that a quadratic model is appropriate. In fact the data come from y=x^2, as you can check from the scatter plot of the data.

This example clearly shows that the ANOVA table, the correlation coefficient and the hypothesis test alone are not enough to justify a linear model.
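The output above is consistent with eleven points at x = 0, 1, ..., 10 and y = x^2 exactly (nine residual degrees of freedom). A minimal sketch to reproduce it and display the tell-tale residual pattern:

# Data from an exact quadratic: the straight-line fit still has a large R^2
x <- 0:10
y <- x^2
quad.lm <- lm(y ~ x)
summary(quad.lm)        # intercept -15, slope 10, R-squared 0.9276, as in the output above
oldpar <- par(mfrow = c(1, 2))
plot(y ~ x); abline(quad.lm)                       # data with the fitted straight line
plot(rstandard(quad.lm) ~ fitted(quad.lm),         # clear quadratic pattern in residuals
     xlab = "Fitted values", ylab = "Standardised residuals")
par(oldpar)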

15.13 A note on outliers

Outliers usually lead to  model misfit and are commonly omitted. However, outliers are still part of the data. It is important to understand the reasons for the outliers. These points should be investigated to see how they differ from the rest of the data. What are the particular characteristics of  outliers? Do they, for example, represent observations from a sub-population with some special features?

As an example, suppose a clinical trial is conducted for a weight-loss drug, and out of a sample of 500 people, ten report sharp abdominal pains. Are these outliers, and should they be ignored or omitted? In this case it is important to investigate these ten people further to understand whether they have any particular characteristics that may explain the pain. If, for example, all ten are females over 50, then a very strong caveat for the drug needs to be that females over 50 should not take this drug.

If these ten points are omitted and simply ignored, then when the sample is generalised to the entire population those ten people scale up to several hundred thousand or even millions, and this is a serious medical problem.

15.14 Summary

  1. State the linear regression model and explain the meaning of each term.
  2. Fit a linear regression model in R and interpret the output.
  3. Understand the ANOVA table for regression.
  4. Perform a test of hypothesis for the slope parameter \beta_1 using either the F-distribution or the t-distribution in the R output.
  5. Find confidence intervals for the slope parameter.
  6. Predict the values of y from the fitted equation for regression.
  7. Know when the prediction is reliable.
  8. Know some simple relationships between sums of squares, degrees of freedom and the coefficients of regression.

Licence


Statistics: Meaning from data Copyright © 2024 by Dr Nazim Khan is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
