10 Estimation

Learning Outcomes

At the end of this chapter you should be able to:

  1. explain the difference between point and interval estimates;
  2. understand the concept of a confidence interval;
  3. compute a confidence interval for a population mean;
  4. compute the confidence interval for a population proportion;
  5. explain the factors that affect the width of a confidence interval;
  6. interpret confidence intervals;
  7. compute the required sample size for a given error.

 

 

10.1 Introduction

Population parameters are usually unknown. What is the mean age of Australians? What is the median house price in Perth suburbs? How much do international students need for living expenses?

We use an appropriate sample statistic, W, to estimate a population parameter \theta. We call W the estimator of \theta. The observed value of W, denoted w, is an estimate of \theta. For example, \overline X is an estimator of the population mean \mu, and the observed value \overline x is an estimate of \mu.

W is an unbiased estimator of parameter \theta if {\rm E}(W) = \theta.

10.2 Common Estimators

We are interested mainly in estimating population mean \mu and population proportion p.

Population Mean \boldmath{\mu}

The sample mean

    \[\bar{X}=\frac{1}{n} \sum_{i=1}^n X_i\]

is an unbiased estimator of the population mean \mu.

    \[{\rm E}(\bar{X})=\mu, \qquad {\rm Var} (\bar{X})=\frac{\sigma^2}{n}\]

Population Variance \boldmath{\sigma^2}

The sample variance

    \[S^2=\frac{1}{n-1} \sum_{i=1}^n (X_i-\bar{X})^2\]

is an unbiased estimator of the population variance \sigma^2.

Proof

To prove that the sample variance is an unbiased estimator of the population variance we need to show that the expected value of the sample variance is equal to the population variance. First note that from properties of variance,

    \begin{equation*} {\rm E}\left(X^2\right) = {\rm Var}(X) + \left[{\rm E}(X)\right]^2. \end{equation*}

This gives the following two results, with X replaced by X_i and \overline{X} as appropriate.

    \begin{align*} {\rm E}(X_i^2) &= {\rm Var}(X_i) + \left[{\rm E}(X_i)\right]^2 = \sigma^2 + \mu^2,\\ {\rm E}({\overline X}^2) &= {\rm Var}(\overline X) + \left[{\rm E}(\overline X)\right]^2 = \frac{\sigma^2}{n} + \mu^2. \end{align*}

Next, note that from properties of sums from chapter 3,

    \begin{equation*} \sum_{i=1}^n \left(X_i-\overline X\right)^2 = \sum_{i=1}^n X_i^2 - n{\overline X}^2. \end{equation*}

Then

    \begin{align*} {\rm E}\left[(n-1)S^2\right] &= {\rm E}\left[\sum_{i=1}^n \left(X_i-\overline X\right)^2\right]= {\rm E}\left[\sum_{i=1}^n X_i^2 - n{\overline X}^2\right]\\ &= \sum_{i=1}^n {\rm E}\left(X_i^2\right) -n{\rm E}\left({\overline X}^2 \right)\\ &= \sum_{i=1}^n \left(\sigma^2 + \mu^2\right) - n \left(\frac{\sigma^2}{n} + \mu^2\right)\\ &= n\ \sigma^2 + n\ \mu^2 - \sigma^2 - n\ \mu^2= (n-1)\sigma^2,\\ \intertext{so} {\rm E}\left(S^2\right) &= \sigma^2. \end{align*}

That is, the sample variance is an unbiased estimator for the population variance.
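
This result can also be checked by simulation in R. The sketch below (the population values, sample size and number of replications are chosen only for illustration) averages the sample variance over many samples:

# Illustration: the average of many sample variances should be close
# to the population variance sigma^2.
set.seed(1)                      # for reproducibility
sigma2 <- 9                      # population variance (sigma = 3), illustrative
n      <- 10                     # sample size, illustrative
s2     <- replicate(10000, var(rnorm(n, mean = 5, sd = sqrt(sigma2))))
mean(s2)                         # close to 9, illustrating E(S^2) = sigma^2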

Population Proportion \boldmath{p}

The sample here consists of n iid Bern(p) trials (for example, toss a coin n times). Let X_1,X_2,\dots,X_n denote the outcomes of the trials, so

    \[X_i = \left\{ \begin{array}{rl} 1 &\mbox{ if the $i$th trial is a success,}\\ 0 &\mbox{ otherwise.} \end{array} \right.\]

Then the sample proportion

    \[\hat p = \frac{1}{n}\sum_{i=1}^n X_i = \frac{X}{n}\]

is an unbiased estimator of the population proportion, where

    \[X = \sum_{i=1}^n X_i\sim {\rm Bin}(n,p).\]

Proof

Note that X_i are iid and X_i \sim {\rm Bern}(p), \ i = 1, 2, \ldots, n. Put

    \[X = \sum_{i=1}^n X_i,\]

so X is the number of successes in n iid Bernoulli trials. Thus X \sim {\rm Bin}(n,p), so

    \[{\rm E}(X) = np {\rm\ and\ } {\rm Var}(X) = np(1-p).\]

Now put

    \[\hat p = \frac{X}{n}.\]

Note that \hat p denotes the sample proportion of successes. Then

    \begin{align*} {\rm E}\left(\hat p\right) &= {\rm E}\left(\frac{X}{n}\right)\\ &= \frac{{\rm E}(X)}{n}\\ &= \frac{np}{n}\\ &= p, \end{align*}

so \hat p is an unbiased estimator for the population proportion p. Further,

    \begin{align*} {\rm Var}\left(\hat p\right) &= {\rm Var}\left(\frac{X}{n}\right)\\ &= \frac{{\rm Var}(X)}{n^2}\\ &= \frac{np(1-p)}{n^2}\\ &= \frac{p(1-p)}{n}. \end{align*}

In practice p is unknown, so we estimate p by \hat p. Then the standard error of \hat p is

    \[{\rm SE}(\hat p) = \sqrt{\frac{\hat p(1-\hat p)}{n}}.\]

Distribution of \boldmath{\hat p}

Note that

    \[\hat p = \frac{1}{n} \sum_{i=1}^n X_i,\]

where X_i are iid Bern(p) random variables. So \hat p is simply a sample mean, and the results on sampling distributions from the previous chapter apply. In particular, for n \ge 30, by the CLT,

    \[\hat p \overset{\cdot}{\underset{\cdot}{\sim}} N\left(p, \frac{p (1-p)}{n}\right).\]

Standardising, exactly as for the sample mean, gives

    \begin{align*} Z &= \frac{\overline X - \mu}{s/\sqrt{n}}  \overset{\cdot}{\underset{\cdot}{\sim}} N(0,1),\\ Z &= \frac{\hat p - p}{\sqrt{p(1-p)/n}} \overset{\cdot}{\underset{\cdot}{\sim}}  N(0,1). \end{align*}

10.3 Confidence Intervals

A point estimate does not include any measure of variability. An interval estimate is often preferred, as it also includes a measure of the variability of the estimate.

10.3.1 Confidence Interval for Population Mean \mu

\overline X is a point estimator of \mu. We choose a confidence level, say 95%, and seek limits L and U, placed symmetrically about \mu, such that

    \begin{align*} P(L < \overline X < U) &= 0.95\\ \Rightarrow P(\overline X < L) &= P(\overline X > U)\\ \Rightarrow P\left(Z < \frac{L - \mu}{\sigma/\sqrt{n}}\right) &=P\left(Z > \frac{U - \mu}{\sigma/ \sqrt{n}}\right)\\ &=0.025. \end{align*}

Now for the normal distribution, P(Z< -1.96) = P(Z > 1.96) = 0.025, so we have

    \begin{align*} \frac{L - \mu}{\sigma/\sqrt{n}} &= -1.96 \Rightarrow L = \mu - 1.96 \frac{\sigma}{\sqrt{n}},\\ \frac{U - \mu}{\sigma/\sqrt{n}} &= 1.96 \Rightarrow U = \mu + 1.96 \frac{\sigma}{\sqrt{n}}. \end{align*}

Then

    \[P\left(L < \overline X < U\right) = P\left(\mu - 1.96 \frac{\sigma}{\sqrt{n}} < \overline X < \mu + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95.\]

We want end points that do not involve the unknown \mu, so we rearrange each inequality. Thus

    \[\overline X < \mu + 1.96 \frac{\sigma}{\sqrt{n}} \Rightarrow \overline X - 1.96 \frac{\sigma}{\sqrt{n}} < \mu.\]

Similarly,

    \[\mu - 1.96 \frac{\sigma}{\sqrt{n}} < \overline X  \Rightarrow \mu < \overline X + 1.96 \frac{\sigma}{\sqrt{n}}.\]

This gives

    \[\overline X - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \overline X + 1.96 \frac{\sigma}{\sqrt{n}},\]

and

    \[P\left(\overline X - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \overline X + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95.\]

From this we obtain a 95% confidence interval (CI) for \mu as

    \[\left(\overline X - 1.96 \frac{\sigma}{\sqrt{n}}, \overline X + 1.96 \frac{\sigma}{\sqrt{n}}\right).\]

Note that this is a random interval, since \overline X is a random variable. In practice, we replace \overline X by its observed value \bar{x}, giving an observed 95% CI for \mu as

    \[\left(\bar{x}-1.96\dfrac{\sigma}{\sqrt{n}},\bar{x}+1.96\dfrac{\sigma}{\sqrt{n}}\right).\]

In general, a 100(1-\alpha)% CI is given by

    \[\left(\bar{x}-z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}},\bar{x}+z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\right),\]

where \textrm{P}(|Z|>z_{\alpha/2})=\alpha,

or equivalently

    \[\textrm{P}\left(Z>z_{\alpha/2}\right)=\frac{\alpha}{2}.\]

The appropriate z_{\alpha/2} can be obtained using software.
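
For example, in R the function qnorm gives the required critical value. For a 95% CI (\alpha = 0.05) we need z_{0.025}:

qnorm(0.975)          # upper 0.025 critical value, z_{0.025}
[1] 1.959964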

Example 10.1 Compute a 95% confidence interval for a population mean \mu based on a sample of size n=10, with \overline x = 8.4 and \sigma = 3.

Solution

The confidence limits (CL) are

    \begin{align*}95\%\ {\rm CL\ for\ } \mu &= \overline x \pm 1.96 \frac{\sigma}{\sqrt{n}}\\ &= 8.4 \pm \frac{1.96 \times 3}{\sqrt{10}}\\ &= 8.4 \pm 1.86\\ \Rightarrow 95\%\ {\rm CI\ for\ } \mu &= (6.54, 10.26). \end{align*}
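
As a check, the same calculation can be done in R (a sketch using the values above):

xbar <- 8.4; sigma <- 3; n <- 10
z <- qnorm(0.975)                        # 1.96
xbar + c(-1, 1) * z * sigma / sqrt(n)    # 95% CI: approximately (6.54, 10.26)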

Frequentist interpretation of confidence interval

We still do not know the value of \mu. Thus we do not know if any given (observed) CI contains \mu or not. We cannot say in Example 10.1 that

    \[\textrm{P}(6.54<\mu<10.26)=0.95.\]

However, if many samples of size n are taken and a 95% CI is computed from each, then on average 95% of these intervals will contain the value of \mu. This is illustrated in the figure below. Twenty random samples, each of size 100, were taken from a N\left(15, 100\right) distribution. The 95% confidence intervals obtained from these samples are shown in the figure, with the horizontal line representing the population mean \mu. Of these, only one (that is, 5%) does not contain the value of \mu.

Figure: Frequentist interpretation of a 95% confidence interval. Vertical lines show the confidence interval from each sample; the horizontal line marks the true mean. Only 1 of the 20 intervals (5%) does not cover the true mean.
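
A simulation of this kind can be run in R. The sketch below (the seed and the plotting details are illustrative only) counts how many of 20 such intervals cover \mu:

# Illustration: coverage of 95% confidence intervals from repeated sampling.
set.seed(2)
mu <- 15; sigma <- 10; n <- 100          # N(15, 100) population, samples of size 100
covers <- replicate(20, {
  x  <- rnorm(n, mu, sigma)
  ci <- mean(x) + c(-1, 1) * qnorm(0.975) * sd(x) / sqrt(n)
  ci[1] < mu & mu < ci[2]                # TRUE if the interval covers mu
})
sum(covers)                              # typically around 19 of the 20 intervals cover mu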

Accuracy of estimate

The half-width of the interval is w = z_{\alpha/2} \times \dfrac{\sigma}{\sqrt{n}}. This depends on three things.

  1. Level of confidence. High confidence \Rightarrow larger value of z \Rightarrow wider interval.
  2. The standard deviation \sigma. Large \sigma \Rightarrow wide interval. Large \sigma indicates large variance in the population, and hence larger uncertainty in the estimate.
  3. Sample size n. Large sample size \Rightarrow narrow interval.

Of these three parameters, only the sample size can be controlled. A narrow (more accurate) interval with high confidence requires a larger sample size, which comes at a cost.
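
The effect of sample size on the half-width can be seen directly in R (the values of \sigma and n below are illustrative only):

sigma <- 20
n     <- c(30, 100, 400, 1600)
qnorm(0.975) * sigma / sqrt(n)    # half-width shrinks as n grows (halves when n quadruples)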

Sampling distribution again

In the previous chapter we covered three cases for the distribution of the sample mean. These three cases apply in the calculation of confidence intervals.

CASE 1  Normal Population — \sigma known (RARELY THE CASE)

100(1-\alpha)% CL = \bar{x} \pm z_{\alpha/2} \cdot \dfrac{\sigma}{\sqrt{n}}

CASE 2  Normal Population — \sigma unknown
100(1-\alpha)% CL = \bar{x} \pm t^{\alpha/2}_{n-1} \cdot \dfrac{s}{\sqrt{n}}

CASE 3  Population Distribution not Normal
Use CLT (n \ge 30).

100(1-\alpha)% CL = \bar{x} \pm t^{\alpha/2}_{n-1} \cdot \dfrac{s}{\sqrt{n}}

OR  if n is large (a few hundred) use \bar{x} \pm z_{\alpha/2} \cdot \dfrac{s}{\sqrt{n}}.

If n < 30 then we must assume that the population is normal and use Case 2 (since \sigma is usually unknown).

Note that R always uses the t-distribution. Historically the normal distribution was used for hand calculations for ease, since table values of the t-distribution are not available for every degree of freedom.

General form of confidence interval

General form of 100(1-\alpha)% CL for parameter \theta is

    \[\hat{\theta} \pm z_{\alpha/2} \times SE (\hat{\theta})\]

OR

    \[\hat{\theta} \pm t^{\alpha/2}_{n-1} \times SE (\hat{\theta})\]

Example 10.2

Chronic exposure to asbestos fibre is a well-known health hazard. The table below gives measurements of pulmonary compliance (a measure of how effectively the lung can inhale and exhale, in ml/cm H_2O) for 16 construction workers, 8 months after they had left a site on which they had suffered prolonged exposure to asbestos. (Data source: Harless, Watanabe and Renzetti Jr (1978). 'The Acute Effects of Chrysotile Asbestos Exposure on Lung Function', Environmental Research, 16, 360-372. ©Elsevier. Used with permission.)

Table: Pulmonary compliance (ml/cm H_2O) for 16 workers subjected to prolonged exposure to asbestos fibre.

167.9 180.8 184.8 189.8 194.8 200.2 201.9 206.9
207.2 208.4 226.3 227.7 228.5 232.4 239.8 258.6

Compute a point estimate and a 95% confidence interval for the population mean pulmonary compliance \mu for people with prolonged asbestos fibre exposure.

Solution

The mean of the data is \bar x = 209.75 and the sample standard deviation is s=24.16. The standard error for \bar x is SE(\bar x) = 24.16/\sqrt{16} = 6.04. To find a 95% confidence interval for \mu, we need the critical value t_{\alpha/2} with \nu = n-1=15 degrees of freedom, where \alpha = 0.05 (since 95% = 100(1-0.05)%). From R we find that t_{0.025}=2.131. Hence, a 95% confidence interval for \mu is given by

    \begin{eqnarray*} \lefteqn{\left ( \bar x - t_{\alpha/2} SE(\bar x), \, \bar x + t_{\alpha/2} SE(\bar x) \right ) }\\ & & = \left (209.75 - 2.131 \times 6.04, \, 209.75 + 2.131 \times 6.04 \right )\\ & & = ( 196.9 ,\, 222.6 ) \hspace{10mm} \mbox{(1 dp)} \end{eqnarray*}
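
The same interval can be obtained directly in R with t.test, assuming the 16 measurements are stored in a vector (here called pulm, a name used only for this sketch):

pulm <- c(167.9, 180.8, 184.8, 189.8, 194.8, 200.2, 201.9, 206.9,
          207.2, 208.4, 226.3, 227.7, 228.5, 232.4, 239.8, 258.6)
t.test(pulm)$conf.int      # 95% CI: approximately (196.9, 222.6)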

Example 10.3

A random sample of 200 light bulbs gives a mean lifetime of 2,000 hours with a standard deviation of 150 hours. Find a 95% confidence interval for the mean lifetime \mu of the bulbs.

Solution

n = 200, \overline x = 2,000 and s = 150. From R, qt(0.025, 199) = -1.972; this is the lower 0.025 critical value, so we use its magnitude, t_{199}^{0.025} = 1.972.

qt(0.025,199)
[1] -1.971957

Then

    \begin{align*} 95\%\ {\rm CL\ for\ } \mu &= \overline x \pm t_{199}^{0.025} \times \frac{s}{\sqrt{200}}\\ &= 2000 \pm 1.972 \times \frac{150}{\sqrt{200}}\\ &= 2000 \pm 20.92, \end{align*}

so 95% CI =(1979.1,2020.9) hours.
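
The same calculation in R from the summary statistics (a sketch):

xbar <- 2000; s <- 150; n <- 200
tcrit <- qt(0.975, n - 1)                  # 1.972
xbar + c(-1, 1) * tcrit * s / sqrt(n)      # 95% CI: approximately (1979.1, 2020.9)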

Example 10.4

To estimate the mean weekly income of restaurant waitresses in a large city, an investigator collects weekly income data from a random sample of 75 waitresses. The mean and standard deviation are found to be $200 and $30 respectively. Compute a 99% confidence interval for the mean weekly income.

Solution

n = 75, \overline x = 200 and s = 30. From R, qt(0.005, 74) = -2.6439; this is the lower 0.005 critical value, so we use its magnitude, t_{74}^{0.005} = 2.6439.

qt(0.005,74)
[1] -2.643913

Then

    \begin{align*} 99\%\ {\rm CL\ for\ } \mu &= \overline x \pm t_{74}^{0.005} \times \frac{s}{\sqrt{75}}\\ &= 200 \pm 2.6439 \times \frac{30}{\sqrt{75}}\\ &= 200 \pm 9.159, \end{align*}

so 99% CI =($190.84, $209.16).

10.3.2 Confidence Interval for Population Proportion

We consider only the case n \ge 30, so that by the CLT the standardised sample proportion is approximately standard normal:

    \[Z=\dfrac{\hat{p}-p}{\sqrt{\dfrac{p(1-p)}{n}}} \overset{\cdot}{\underset{\cdot}{\sim}} \textrm{N}(0,1).\]

Note that {\rm E}(\hat{p})=p and {\rm Var}(\hat{p})=\dfrac{p(1-p)}{n}. Since p is unknown, we estimate it by \hat{p}, and estimate

    \begin{align*} {\rm Var}(\hat{p}) &= \dfrac{\hat{p}(1-\hat{p})}{n}\\ {\rm and\ SE\ of \ }\hat{p} &= \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}. \end{align*}

    \[100(1-\alpha)\% {\rm\ CL\  for\ } p = \hat{p} \pm z_{\alpha/2} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}.\]

The half-width of the interval is w = z_{\alpha/2} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}},
and depends on three things.

  1. Level of confidence. High confidence \Rightarrow wider interval.
  2. \hat{p}. Close to 0.5 \Rightarrow wider interval, close to 0 or 1 \Rightarrow narrow interval.
  3. n. Large n \Rightarrow narrow interval.

Example 10.5

A random sample of 100 children in a developing country found that 15% of them were classified as  low birth weight (< 2500 g). Compute a 95% CI for the population proportion of babies that are low birth weight (LBW).

Solution

The sample proportion of LBW babies is \hat p = 0.15. Then

    \begin{align*} 95\% {\rm CL} &= 0.15 \pm 1.96 \times \sqrt{\frac{(0.15)(1-0.15)}{100}}\\ &= 0.15 \pm 0.069986\\ \Rightarrow 95\% {\rm\ CI\ } &= (0.08, 0.22). \end{align*}
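
This interval can also be computed in R. Note that R's built-in prop.test uses a somewhat different interval, so this sketch applies the formula above directly:

phat <- 0.15; n <- 100
phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)   # approximately (0.08, 0.22)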

10.4 Sample size calculations

Here we assume that the sample size is large (at least 30, but usually a few hundred) so we can use the normal distribution (instead of the t-distribution). The error in estimating a parameter by an interval estimate is half the interval width.

Population Mean

The half-width is z_{\alpha/2} \dfrac{s}{\sqrt{n}}. If we want this to be at most w, then

    \begin{align*} z_{\alpha/2} \dfrac{s}{\sqrt{n}} &\leq w\\ \Rightarrow z_{\alpha/2} \dfrac{s}{w} &\leq \sqrt{n}\\ \Rightarrow \sqrt{n} &\geq \frac{z_{\alpha/2}\, s}{w}\\ \Rightarrow n &\geq \left(\frac{z_{\alpha/2} \times s}{w}\right)^2. \end{align*}

Example 10.6

What minimum sample size is required to estimate a population mean to within 2 with 95% confidence if s=20?

Solution

Assume n \geq 30 so we can use the normal distribution. Then

    \[n \geq \left(\frac{1.96 \times 20}{2}\right)^2 = 384.16,\]

that is, n = 385.

In practice we would take n=400 as a round figure.
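
The calculation in R (a sketch using the values above):

s <- 20; w <- 2
ceiling((qnorm(0.975) * s / w)^2)    # minimum sample size, 385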

Population proportion

The half-width is z_{\alpha/2} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} = z_{\alpha/2} \dfrac{s}{\sqrt{n}}, where s = \sqrt{\hat p (1- \hat p)} estimates the standard deviation of the Bern(p) observations that make up the sample.

So the formula for the sample size is the same as before:

    \begin{align*} n &\geq \left(\frac{z_{\alpha/2} \times s}{w}\right)^2,\\ {\rm that\ is,\ } n &\geq \left(\frac{z_{\alpha/2} \times \sqrt{\hat p (1- \hat p)}}{w}\right)^2. \end{align*}

This depends on the value of \hat p, which is unknown before the sample is taken! However, the expression \hat p (1-\hat p) is greatest when \hat p = 0.5. Thus we can take:

    \[n \geq \left(\frac{0.5 \times z_{\alpha/2}}{w}\right)^2\]

Example 10.7

What minimum sample size is required to estimate a population proportion to within 0.05 with confidence 95%?

Solution

Assume n \geq 30 so we can use the normal distribution. Then

    \[n \geq \left(\frac{0.5 \times 1.96}{0.05}\right)^2 = 384.16,\]

that is, n = 385.

In practice we would take n=400 as a round figure.
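
The corresponding calculation in R (a sketch):

w <- 0.05
ceiling((0.5 * qnorm(0.975) / w)^2)    # minimum sample size, 385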

10.5 Summary

Population mean

    \begin{align*} 100(1-\alpha)\%\ {\rm CL\ for\ } \mu &= \bar{x} \pm t^{\alpha/2}_{n-1} \cdot \dfrac{s}{\sqrt{n}}\\ {\rm Sample\ size:\ } n &\geq \left(\frac{z_{\alpha/2} \times s}{w}\right)^2 \end{align*}

Population proportion

    \begin{align*} 100(1-\alpha)\%\ {\rm CL\ for\ } p &= \hat{p} \pm z_{\alpha/2} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\\ {\rm Sample\ size:\ } n &\geq \left(\frac{0.5 \times z_{\alpha/2}}{w}\right)^2 \end{align*}

Licence


Statistics: Meaning from data Copyright © 2024 by Dr Nazim Khan is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
