9 Sampling distribution of the sample mean

Learning Outcomes

At the end of this chapter you should be able to:

  1. explain the reasons for and advantages of sampling;
  2. explain the sources of bias in sampling;
  3. select the appropriate distribution of the sample mean for a simple random sample.

9.1 Introduction

Problem 

Information about populations is commonly needed for various purposes. Some examples are:

  • mean income of Australians;
  • proportion voting for the government at the next election;
  • market share of a major retailer;
  • proportion of faulty items produced in a factory;
  • proportion of children undergoing tonsillectomy who will have adverse respiratory events.

The above are population quantities. We want the mean income of all Australians; the proportion of all faulty items produced.

How can this be done? One way is to make measurements on every unit in the population – check each item for faults; ask every voter their voting intention. This is a census.

Issues with census

  • Expense. It is costly to make measurements on all units in a population.
  • Takes too much time.
  • Difficult to conduct.
  • Measurements may be destructive — how do you test a match for quality?
  • Impossible in some situations. For example:
    • How much gold/oil is present in this deposit? We will only know once we have mined the deposit.
    • What is the population of tuna, lions, elephants, turtles, whales, seals?

So how do we proceed?

Solution

Sampling. We make observations on a subset of the population, then generalise the results to the whole population. THIS REQUIRES CARE!

How should a sample be selected? The sample should “represent” the population. Inappropriate sampling techniques can lead to bias in the results.

BEWARE OF BIAS IN THE SAMPLE!!

Sources of bias in sampling were covered in Chapter 1: Data Collection.

Example 9.1

  • TV polls. This is a self-selection survey. Only viewers of the channel with strong feelings (and SMS capability at the time of viewing) will respond. Such polls are very unreliable.
  • Telephone polls. Only people with telephones and who are in at the time of calling will be in the sample. Non-response (no one at home, not answering the call or unwilling to participate) is a major problem here.

Some Definitions

  • A population is a set of units of interest.
  • A parameter is any population measure, such as the mean, variance, a proportion or the population size.
  • A sample is a subset of the population.

Several sampling schemes exist. An important idea in sampling theory is randomisation, that is, each unit in the sample is picked at random from the population. Estimates from samples will never be the same as the population quantities. That is, the estimates of population parameters based on samples will have error. In particular, we want to quantify the error in any estimates that are based on samples.

We need to define probability models for the above concepts so that probability theory can be used in estimation and inference for population parameters.

Why does sampling work?

What proportion of voters will vote Liberal at the next federal election? Ask a random sample of n voters. Suppose X of them say they will vote Liberal. So the sample proportion is \hat p=X/n, which I can use to estimate the true population proportion p.

Question: Can I assume that at the next federal election the proportion \hat p of the voters will vote Liberal? What assumptions are involved?

  • We need to assume that the sample represents the population of voters. That is, the voters who were not asked would answer in the same way as those in my sample.
  • BUT, different samples will give me different results! That is, there will be  sampling variation. Can we quantify this variation?
  • AND, how do I know if my sample results and your sample results are different only due to sampling variation?
  • Further, if I take another sample a week later, the results will be different. How do I know that this difference is within  sampling error?

These questions relate to randomness in sampling, which is related to probability theory. To answer these questions we need the sampling distribution of our estimator.

PROBABILITY MODEL

We want to measure a quantity X (such as salary, voting preference) for units in a population. We need a probability distribution for X; we call this the population distribution.

A parameter is any quantity associated with a population distribution, and is usually a summary measure of the population.

A sample is a set of  independent and identically distributed (iid) random variables X_1,X_2,\ldots,X_n, having the same distribution as X (that is, the population distribution).

Simple Random Sample

A simple random sample (SRS) is one in which every unit in the population has the same probability of being selected. In practice, the form of the population distribution is assumed known, but some parameters will be unknown. The only information available about the unknown parameters is the data x_1,x_2,\cdots,x_n, which are observations on the random variables X_1,X_2,\cdots,X_n in the sample.

ALL estimation and  inference is based on the population model and the data.

Example 9.2: Mean birthweight of babies in sub-Saharan Africa 

A paediatrician wants to estimate the mean birthweight of babies in sub-Saharan Africa. She records the birthweights of n babies. Let these weights be X_1, X_2, \ldots, X_n. Then we assume that X_i \sim {\rm N}(\mu, \sigma^2), where \mu is the mean birthweight and \sigma^2 is the variance. This is the population distribution, and the parameters in this model are \mu and \sigma^2.

Questions

  1. How do we estimate the mean?
  2. How accurate is our estimate?
  3. How do we select the sample size?
  4. How large a sample size do we need for a specified accuracy?

Statistic

A statistic is a random variable whose observed value depends only on the observed data. A statistic is usually a summary measure of the data.

Two common statistics are:

Sample Mean

    \[\overline X = \frac{1}{n} \sum_{i=1}^n X_i.\]

Observed value

    \[\overline x = \frac{1}{n} \sum_{i=1}^n x_i.\]

Sample Variance

    \[S^2=\frac{1}{n-1}\sum_{i=1}^n\left(X_i-\overline X\right)^2.\]

Observed value

    \[s^2=\frac{1}{n-1}\sum_{i=1}^n\left(x_i-\overline x\right)^2.\]

Note that \overline X and S^2 are random variables — their observed values (\overline x and s^2) depend on the sample selected. Note also that as always, we use upper case letters (\overline X) for the random variable and the corresponding lower case letter (\overline x) for its observed value.
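
In R, the observed values of these statistics are given by mean() and var(). A small sketch with made-up data (the numbers below are hypothetical, purely for illustration):

> x <- c(498, 502, 495, 501, 499) ## hypothetical sample of five weights
> mean(x) ## observed sample mean
[1] 499
> var(x) ## observed sample variance (divisor n - 1)
[1] 7.5
> sd(x) ## observed sample standard deviation
[1] 2.738613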

Example 9.3: Mean product weight

To determine the mean weight of a packed product, n items are selected at random from a warehouse. Let X denote the weight of a randomly selected item. Then X\sim {\rm N}(\mu,\sigma^2) is the assumed population model, where the model parameters are the mean weight \mu and variance \sigma^2 of the weights.

Let X_1,X_2,\cdots,X_n be the random variables denoting the weights of the items in the sample. Then X_i \overset {iid}\sim {\rm N}(\mu,\sigma^2), i=1,2,\cdots,n.

Suppose the weights of boxes of a cereal are normally distributed with mean 500 g and standard deviation 5 g. What is the probability that a random sample of 25 boxes has a  sample mean weight less than 495 g?

That is, we need

    \[P\left(\overline X < 495\right).\]

To compute this we need the distribution of  \overline X!

So what is the distribution of the sample mean? What if we use simulations?

Histogram of 100 sample means of samples of size 100 each from a N(500, 25) distribution, with a normal density curve superimposed.

Histogram of 100 sample means of samples of size 1000 each from a N(500, 25) distribution, with a normal density curve superimposed.

Shown above are relative frequency histograms of 100 simulated sample means, for sample sizes n_1=100 and n_2=1000, drawn from the N\left(500,5^2\right) distribution, with a normal distribution curve superimposed. The horizontal axis has the same scale in both plots. Both histograms closely resemble the normal distribution. Note the following.

  • Both histograms are centred at 500, which is the mean of the distribution simulated from.
  • The standard deviation of the sample means is much smaller than the population standard deviation of 5, as judged from the spread of the histograms. The range for n_1=100 is a little more than 2.5, while that for n_2 = 1000 is less than 1.
  • The standard deviation of the means of the simulated data with n_1=100 was 0.5, and that for n_2=1000 was 0.158.
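
The histograms above can be reproduced with a short simulation in R. The following is a sketch; the seed and object names are our own choices, and individual runs will vary:

> set.seed(1) ## for reproducibility
> xbar100 <- replicate(100, mean(rnorm(100, mean = 500, sd = 5)))
> xbar1000 <- replicate(100, mean(rnorm(1000, mean = 500, sd = 5)))
> sd(xbar100) ## close to 5/sqrt(100) = 0.5
> sd(xbar1000) ## close to 5/sqrt(1000) = 0.158
> hist(xbar100, freq = FALSE) ## relative frequency histogram of the 100 means
> curve(dnorm(x, mean = 500, sd = 0.5), add = TRUE) ## normal curve superimposed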

IMPORTANT RESULT

Let X_1,X_2,\cdots,X_n be independent and identically distributed random variables with mean \mu and variance \sigma^2, that is, {\rm E}(X_i)=\mu and {\rm Var}(X_i) = \sigma^2 for i=1,2,\cdots,n. Let \overline X be the sample mean,

    \[\overline X = \frac{1}{n} \sum_{i=1}^n X_i.\]

Then

    \[{\rm E}\left(\overline X\right) = \mu_{\overline X}=\mu\]

and

    \[{\rm Var}\left(\overline X\right) = \sigma^2_{\overline X}=\frac{\sigma^2}{n}.\]

Thus if n is large, \sigma_{\overline X} is small, so \overline X is likely to be close to its mean \mu.

Proof

First consider the mean of \overline X. We use properties of expectation and variance of a sum of independent random variables.

    \begin{align*} {\rm E}\left(\overline X\right)&={\rm E}\left(\frac{1}{n}\sum_{i=1}^nX_i\right)\\ &= \frac{1}{n}\sum_{i=1}^n{\rm E}\left(X_i\right)\\ &=\frac{1}{n}\sum_{i=1}^n \mu = \frac{1}{n} n\mu = \mu. \end{align*}

Similarly,

    \[{\rm Var}\left(\overline X\right) = {\rm Var}\left(\frac{1}{n}\sum_{i=1}^nX_i\right) = \frac{1}{n^2}\sum_{i=1}^n{\rm Var}\left(X_i\right) =\frac{1}{n^2}\sum_{i=1}^n \sigma^2 = \frac{1}{n^2} n\sigma^2 = \frac{\sigma^2}{n}.\]


Note: Be careful!

  1. The sample mean is the average calculated from a sample, and is denoted \overline X. It is a random variable, because its value depends on the sample chosen, and the sampling is random. Its observed value, calculated from the sample data, is denoted \overline x.
  2. The population mean \mu is a constant at any point in time. It may change with time, but at the time of sampling we assume it is constant. The value of \mu is usually unknown.

9.2 Sampling Distribution of the Sample Mean

  1. In many sampling situations the population mean \mu is unknown and is the parameter of interest.
  2. The natural way to estimate \mu is by the sample mean \overline X.
  3. For inference about \mu, we need the distribution of the sample mean \overline{X}.
  4. The distribution of \overline{X} depends on the population distribution and the sampling scheme, and so it is called the sampling distribution of the sample mean.
  5. The sampling distribution of the sample mean depends on the population variance \sigma^2.
  6. Different sampling distributions arise, depending on whether \sigma^2 is known or not.
  7. The population distribution is often unknown.

We consider three cases below. These three cases form the basis of estimation and inference and will be used throughout the rest of the book.

Case 1: Normal population—σ known

Let X_1,X_2,\cdots,X_n be iid {\rm N}(\mu,\sigma^2) random variables. Then

    \[\overline X \sim {\rm N}\left(\mu,\frac{\sigma^2}{n}\right)\]

or

    \[Z=\frac{\overline X-\mu}{\sigma/\sqrt{n}} \sim {\rm N}(0,1).\]

This follows from our earlier results that sums and linear scaling of normal random variables are also normal.

We saw this result in the simulation histograms earlier, where we had simulated from a normal distribution.
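
In fact we can now answer the question posed in Example 9.3. There \mu = 500, \sigma = 5 and n = 25, so \sigma/\sqrt{n} = 5/\sqrt{25} = 1 and

    \[P\left(\overline X < 495\right) = P\left(Z < \frac{495-500}{1}\right) = P(Z < -5) \approx 2.9\times 10^{-7},\]

a vanishingly small probability. The same calculation in R:

> pnorm(495, mean = 500, sd = 5/sqrt(25)) ## P(Xbar < 495)
[1] 2.866516e-07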

Example 9.4

The time it takes a bank teller to serve a customer is normally distributed with mean 150 seconds and variance 900 seconds^2.

(i) What is the probability that a random sample of 25 customers will have a mean service time greater than 160 seconds?

(ii) Repeat (i) above for a sample of 100 customers.

Solution

(i) Let \overline X denote the sample mean. Then

    \begin{align*} \overline X &\sim N\left(150, \frac{900}{25}\right).\\ P(\overline X > 160) &= P\left(Z > \frac{160-150}{\sqrt{900/25}}\right)\\ &= P(Z > 1.67)\\ &= 1- 0.9525 = 0.0475 \approx 5\%. \end{align*}

(ii) Now

    \begin{align*} \overline X &\sim N\left(150, \frac{900}{100}\right).\\ P(\overline X > 160) &= P\left(Z > \frac{160-150}{\sqrt{900/100}}\right)\\ &= P(Z > 3.33)\\ &= 1- 0.9996 = 0.0004 \approx 0.04\%. \end{align*}
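
Both parts can be checked in R with pnorm; a sketch, where the sd argument is the standard deviation of \overline X:

> 1 - pnorm(160, mean = 150, sd = sqrt(900/25)) ## part (i): approx. 0.0478
> 1 - pnorm(160, mean = 150, sd = sqrt(900/100)) ## part (ii): approx. 0.0004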

Example 9.5

A sample of size 100 is drawn from a N\left(\mu, 16\right) population. Find

(i) P(\overline X < \mu)

(ii) P\left(\left | \overline X  - \mu\right | < 0.2\right)

(iii) k such that P\left(\left|\overline X -\mu\right| < k \right) = 0.95.

Solution

(i) P(\overline X < \mu) = 0.5, since the normal distribution is symmetric about its mean.

(ii)

    \begin{align*} P\left(\left | \overline X - \mu\right | < 0.2\right) &= P\left(\left |\frac{\overline X - \mu}{\sqrt{16/100}}\right | < \frac{0.2}{\sqrt{16/100}}\right)\\ &= P\left(\left|Z\right| < 0.5\right) \\ &= 2\times [P(Z < 0.5) - 0.5]\\ &= 2\times(0.6915 - 0.5) = 0.3830. \end{align*}

(iii)

    \begin{align*} P\left(\left | \overline X - \mu\right | < k\right) &= P\left(\left |\frac{\overline X - \mu}{\sqrt{16/100}}\right | < \frac{k}{\sqrt{16/100}}\right)\\ &= P\left(\left|Z\right| < \frac{k}{\sqrt{16/100}}\right) \\ &= 0.95\\ \Rightarrow  \frac{k}{\sqrt{16/100}} &= 1.96\\ \Rightarrow k &= 1.96 \times \sqrt{16/100} = 0.784. \end{align*}
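
Parts (ii) and (iii) can be checked in R; a sketch:

> pnorm(0.5) - pnorm(-0.5) ## part (ii): P(|Z| < 0.5), approx. 0.383
> qnorm(0.975) * sqrt(16/100) ## part (iii): k, approx. 0.784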

Case 2: Normal Population—σ Unknown

Now \sigma is unknown, and we estimate it by the sample standard deviation S. Then

    \[T = \frac{\overline X - \mu}{S/\sqrt{n}} \sim t_{n-1}\]

where t_{n-1} denotes the t-distribution with n-1 degrees of freedom.

The t-distribution has a similar shape to that of the standard normal distribution, but is flatter. We say the t-distribution has “fatter tails”. For large values of degrees of freedom (df), the t-distribution is very close to the normal distribution. The plot below shows the density functions for the standard normal distribution with several t-distributions superimposed.

Plots of the t-distribution for various degrees of freedom, with the normal distribution superimposed. The t-distribution is more spread out; its spread decreases towards the normal distribution as the degrees of freedom increase.

Notes

  1.     \[S^2 = \frac{1}{n-1} \sum_{i= 1}^n \left(X_i - \overline X\right)^2\] is the estimator of \sigma^2. The divisor in this expression is n-1, which is the degrees of freedom of the t-distribution. The divisor in the variance estimator always gives the degrees of freedom of the corresponding t-distribution for the problem.
  2.     \[{\rm Var}\left(\overline X\right) = \frac{\sigma^2}{n}, \qquad {\rm so} \qquad {\rm SD}\left(\overline X\right) = \frac{\sigma}{\sqrt{n}},\]

    and this standard deviation is estimated by

        \[\frac{S}{\sqrt{n}} = {\rm \ standard\ error\ (SE)\ of\ the\ mean.}\]

t-tables

t-tables list t-values for only a few tail probabilities at given degrees of freedom, but R gives probabilities for all values and all degrees of freedom.

> pnorm(1.95) ##P(Z < 1.95) for Z ~ N(0,1)
[1] 0.9744119
> pt(1.95, 5) ##P(T < 1.95) for T ~ t_5 distribution
[1] 0.9456649
> pt(1.95, 20) ##P(T < 1.95) for T ~ t_20 distribution
[1] 0.9673328
> pt(1.95, 200) ##P(T < 1.95) for T ~ t_200 distribution
[1] 0.9737131
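
In the inverse direction, qt gives the t-value for a given probability (qnorm for the normal). For example, the upper 2.5% points:

> qnorm(0.975) ## upper 2.5% point of N(0,1)
[1] 1.959964
> qt(0.975, 5) ## upper 2.5% point of t_5
[1] 2.570582
> qt(0.975, 20) ## upper 2.5% point of t_20
[1] 2.085963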

Case 3: Population Not Normal—σ Unknown

CENTRAL LIMIT THEOREM (CLT)

Suppose X_1, X_2, \ldots, X_n are iid random variables with mean \mu and variance \sigma^2. If n is large enough, then

    \[Z = \frac{\overline X - \mu}{\sigma/\sqrt{n}} \overset{\cdot}{\underset{\cdot}{\sim}} N(0,1)\]

Notes

  1. We consider n \ge 30 to be large enough for the approximation to be good. As n increases the approximation improves.

If \sigma is unknown then

    \[T = \frac{\overline X - \mu}{S/\sqrt{n}} \overset{\cdot}{\underset{\cdot}{\sim}} t_{n-1}.\]

BUT we must have n \ge 30.

Notes

  1. If the sample size is large (a few hundred), then the t_{n-1} distribution can be approximated by the N(0,1) distribution. This is the most common situation.
  2. Although historically the N(0,1) distribution has been used in hand calculations, R ALWAYS uses the t_{n-1} distribution.
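
The CLT can be seen in a simulation. The sketch below uses an exponential population with mean 1 and standard deviation 1, which is far from normal; the seed and sample sizes are our own choices:

> set.seed(1)
> xbar <- replicate(10000, mean(rexp(30, rate = 1))) ## 10000 means of samples of size 30
> mean(xbar) ## close to mu = 1
> sd(xbar) ## close to 1/sqrt(30) = 0.183
> hist(xbar, freq = FALSE) ## roughly bell-shaped despite the skewed population
> curve(dnorm(x, mean = 1, sd = 1/sqrt(30)), add = TRUE)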

Example 9.6 

A particular long-life light bulb has a mean life of 1500 hours with a standard deviation of 300 hours. What is the probability that a random sample of 200 bulbs has a mean lifetime of less than 1450 hours?

Solution

Here the standardised sample mean is

    \[Z= \frac{\overline X - 1500}{300/\sqrt{200}} \overset{\cdot}{\underset{\cdot}{\sim}} N(0,1),\]

by the CLT (n = 200 \ge 30).

    \begin{align*} P(\overline X < 1450) &= P\left(Z < \frac{1450 - 1500}{300/\sqrt{200}}\right)\\ &= P(Z < -2.36)\\ &= 0.0091. \end{align*}
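
The same probability in R, without rounding the z-value (hence the small difference from the table answer):

> pnorm(1450, mean = 1500, sd = 300/sqrt(200)) ## approx. 0.0092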

9.3 Summary

Population standard deviation \sigma known

  • Normal population (exact distribution):

    \[Z = \frac{\overline{X} -\mu}{\sigma/\sqrt{n}} \sim {\rm N}(0,1).\]

  • Population not normal (approximate distribution, n \ge 30):

    \[Z = \frac{\overline{X} -\mu}{\sigma/\sqrt{n}} \overset{\cdot}{\underset{\cdot}{\sim}} {\rm N}(0,1).\]

Population standard deviation \sigma unknown

  • Normal population (exact distribution):

    \[T = \frac{\overline{X} -\mu}{S/\sqrt{n}} \sim {\rm t}_{n-1}.\]

  • Population not normal (approximate distribution, n \ge 30):

    \[T = \frac{\overline{X} -\mu}{S/\sqrt{n}} \overset{\cdot}{\underset{\cdot}{\sim}} {\rm t}_{n-1}.\]
