Sampling distribution of the sample mean

Dr R. Nazim Khan

9 Sampling distribution of the sample mean

Learning Outcomes

At the end of this chapter you should be able to:

explain the reasons and advantages of sampling;
explain the sources of bias in sampling;
select the appropriate distribution of the sample mean for a simple random sample.

9.1 Introduction

Problem

Information about populations is commonly needed for various purposes. Some examples are:

mean income of Australians;
proportion voting for the government at the next election;
market share of a major retailer;
proportion of faulty items produced in a factory;
proportion of children undergoing tonsillectomy who will have adverse respiratory events.

The above are population quantities. We want the mean income of all Australians; the proportion of all faulty items produced.

How can this be done? One way is to make measurements on every unit in the population: check each item for faults; ask every voter their voting intention. This is a census.

Issues with census

Expense. It is costly to make measurements on all units in a population.
Takes too much time.
Difficult to conduct.
Measurements may be destructive — how do you test a match for quality?
Impossible in some situations. For example:
- How much gold/oil is present in this deposit? We will only know once we have mined the deposit.
- What is the population of tuna, lions, elephants, turtles, whales, seals?

So how do we proceed?

Solution

Sampling. We make observations on a subset of the population, then generalise the results to the whole population. THIS REQUIRES CARE!

How should a sample be selected? The sample should “represent” the population. Inappropriate sampling techniques can lead to bias in the results.

BEWARE OF BIAS IN THE SAMPLE!!

Sampling techniques and the sources of bias were covered in Chapter 1: Data Collection.

Example 9.1

TV polls. This is a self-selection survey. Only viewers of the channel with strong feelings (and SMS capability at the time of viewing) will respond. Such polls are very unreliable.
Telephone polls. Only people with telephones and who are in at the time of calling will be in the sample. Non-response (no one at home, not answering the call or unwilling to participate) is a major problem here.

Some Definitions

A population is a set of units of interest.
A parameter is any population measure, such as the mean, variance, proportion or population size.
A sample is a subset of the population.

Several sampling schemes exist. An important idea in sampling theory is randomisation, that is, each unit in the sample is picked at random from the population. Estimates from samples will almost never be exactly the same as the population quantities. That is, the estimates of population parameters based on samples will have error. In particular, we want to quantify the error in any estimates that are based on samples.

We need to define probability models for the above concepts so that probability theory can be used in estimation and inference for population parameters.

Why does sampling work?

What proportion of voters will vote Liberal at the next federal election? Ask a random sample of $n$ voters. Suppose $X$ of them say they will vote Liberal. So the sample proportion is $\hat p=X/n$ , which I can use to estimate the true population proportion $p$ .

Question: Can I assume that at the next federal election the proportion $\hat p$ of the voters will vote Liberal? What assumptions are involved?

We need to assume that the sample represents the population of voters. That is, the voters who were not asked would answer in the same way as those in my sample.
BUT, different samples will give me different results! That is, there will be sampling variation. Can we quantify this variation?
AND, how do I know if my sample results and your sample results are different only due to sampling variation?
Further, if I take another sample a week later, the results will be different. How do I know that this difference is within sampling error?

These questions relate to randomness in sampling, which is related to probability theory. To answer these questions we need the sampling distribution of our estimator.
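The sampling variation described above can be illustrated with a small simulation. The following R sketch assumes, purely for illustration, a true proportion $p = 0.4$ and samples of $n = 500$ voters; both values are hypothetical and are not taken from any real poll.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
p_true <- 0.4                    # hypothetical population proportion (assumption)
n      <- 500                    # hypothetical sample size (assumption)

# Number answering "Liberal" in each of 10 independent random samples
x     <- rbinom(10, size = n, prob = p_true)
p_hat <- x / n                   # sample proportions, one per sample

print(p_hat)                     # each sample gives a different estimate
print(range(p_hat))              # spread due to sampling variation alone
```

Every sample is drawn from the same population, yet the ten estimates differ. The sampling distribution describes exactly this spread.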

PROBABILITY MODEL

We want to measure a quantity $X$ (such as salary, voting preference) for units in a population. We need a probability distribution for $X$ ; we call this the population distribution.

A parameter is any quantity associated with a population distribution, and is usually a summary measure of the population.

A sample is a set of independent and identically distributed (iid) random variables $X_1,X_2,\ldots,X_n$ , having the same distribution as $X$ (that is, the population distribution).

Simple Random Sample

A simple random sample (SRS) is one in which every unit in the population has the same probability of being selected. In practice, the form of the population distribution is usually assumed, but some of its parameters will be unknown. The only information available about the unknown parameters is the data $x_1,x_2,\cdots,x_n$ , which are observations on the random variables $X_1,X_2,\cdots,X_n$ in the sample.

ALL estimation and inference is based on the population model and the data.

Example 9.2: Mean birthweight of babies in sub-Saharan Africa

A paediatrician wants to estimate the mean birthweight of babies in sub-Saharan Africa. She takes the birthweight of $n$ babies. Let these weights be $X_1, X_2, \ldots, X_n$ . Then we assume that $X_i \sim {\rm N}(\mu, \sigma^2)$ , where $\mu$ is the mean birthweight and $\sigma^2$ is the variance. This is the population distribution, and the parameters in this model are $\mu$ and $\sigma^2$ .

Questions

How do we estimate the mean?
How accurate is our estimate?
How do we select the sample size?
How large a sample size do we need for a specified accuracy?

Statistic

A statistic is a random variable the observed value of which depends only on the observed data. A statistic is usually a summary measure of the data.

Two common statistics are:

Sample Mean

$\overline X = \frac{1}{n} \sum_{i=1}^n X_i.$

Observed value

$\overline x = \frac{1}{n} \sum_{i=1}^n x_i.$

Sample Variance

$S^2=\frac{1}{n-1}\sum_{i=1}^n\left(X_i-\overline X\right)^2.$

Observed value

$s^2=\frac{1}{n-1}\sum_{i=1}^n\left(x_i-\overline x\right)^2.$

Note that $\overline X$ and $S^2$ are random variables — their observed values ( $\overline x$ and $s^2$ ) depend on the sample selected. Note also that as always, we use upper case letters ( $\overline X$ ) for the random variable and the corresponding lower case letter ( $\overline x$ ) for its observed value.
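As a quick check of the two formulas, the R sketch below computes the observed sample mean and sample variance directly and compares them with the built-in functions `mean()` and `var()`. The five data values are made up for illustration.

```r
# Hypothetical data: weights (g) of five items (made up for illustration)
x <- c(498, 503, 501, 497, 506)
n <- length(x)

x_bar <- sum(x) / n                      # observed sample mean
s2    <- sum((x - x_bar)^2) / (n - 1)    # observed sample variance, divisor n - 1

# The built-in functions use the same definitions
all.equal(x_bar, mean(x))
all.equal(s2, var(x))
```

Note that `var()` in R uses the divisor $n-1$ , matching the definition of $s^2$ above.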

Example 9.3: Mean product weight

To determine the mean weight of a packed product, $n$ items are selected at random from a warehouse. Let $X$ denote the weight of a randomly selected item. Then $X\sim {\rm N}(\mu,\sigma^2)$ is the assumed population model, where the model parameters are the mean weight $\mu$ and variance $\sigma^2$ of the weights.

Let $X_1,X_2,\cdots,X_n$ be the random variables denoting the weights of the items in the sample. Then $X_i \overset {iid}\sim {\rm N}(\mu,\sigma^2)$ , $i=1,2,\cdots,n$ .

Suppose the weights of boxes of a cereal are normally distributed with mean 500 g and standard deviation 5 g. What is the probability that a random sample of 25 boxes has a sample mean weight less than 495 g?

That is, we need

$P\left(\overline X < 495\right).$

To compute this we need the distribution of $\overline X$ !

So what is the distribution of the sample mean? What if we use simulations?

Histogram of 100 sample means, each from a sample of size 100 from a N(500, 25) distribution, with a normal curve superimposed. Frequencies are low near the smallest (498.5) and largest (501.0) sample means and higher around the mean, so the histogram closely follows the normal curve.

Histogram of 100 sample means, each from a sample of size 1000 from a N(500, 25) distribution, with a normal curve superimposed. The histogram is narrower than the one for samples of size 100, with values ranging from 499.5 to 500.5, and it still follows the normal curve, with a peak in the centre.

Shown above are relative histograms of simulations of 100 means of sample sizes $n_1=100$ and $n_2 =1000$ , from the $N\left(500,5^2\right)$ distribution, with a normal distribution curve superimposed. The horizontal axis has the same scale in both plots. Both histograms closely resemble the normal distribution. Note the following.

Both histograms are centred at 500, which is the mean of the distribution simulated from.
The standard deviation of the sample means is much lower than the population standard deviation of 5, as judged from the spread in the histograms. The range for $n_1=100$ is a little more than 2.5, while that for $n_2 = 1000$ is less than 1.
The standard deviation of the means of the simulated data with $n_1=100$ was 0.5, and that for $n_2=1000$ was 0.158.
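A simulation of this kind can be sketched in R as follows. This is not the code used to produce the histograms above; the seed and the helper name `sim_means` are assumptions made for illustration, so the simulated values will differ slightly from those quoted.

```r
set.seed(2024)                   # arbitrary seed, for reproducibility
mu <- 500; sigma <- 5            # population mean and standard deviation

# Hypothetical helper: `reps` sample means, each from a sample of size n
sim_means <- function(n, reps = 100) {
  replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
}

means_100  <- sim_means(100)
means_1000 <- sim_means(1000)

# Compare the simulated standard deviations with sigma / sqrt(n)
print(c(simulated = sd(means_100),  theory = sigma / sqrt(100)))
print(c(simulated = sd(means_1000), theory = sigma / sqrt(1000)))
```

Calling `hist(means_100, freq = FALSE)` then draws a relative histogram of the simulated means.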

IMPORTANT RESULT

Let $X_1,X_2,\cdots,X_n$ be independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2$ , that is, ${\rm E}(X_i)=\mu$ and ${\rm Var}(X_i) = \sigma^2$ for $i=1,2,\cdots,n$ . Let $\overline X$ be the sample mean,

$\overline X = \frac{1}{n} \sum_{i=1}^n X_i.$

Then

${\rm E}\left(\overline X\right) = \mu_{\overline X}=\mu$

and

${\rm Var}\left(\overline X\right) = \sigma^2_{\overline X}=\frac{\sigma^2}{n}.$

Thus if $n$ is large, $\sigma_{\overline X}$ is small, so $\overline X$ is likely to be close to its mean $\mu$ .

Proof

First consider the mean of $\overline X$ . We use properties of expectation and variance of a sum of independent random variables.

$\begin{align*} {\rm E}\left(\overline X\right)&={\rm E}\left(\frac{1}{n}\sum_{i=1}^nX_i\right)\\ &= \frac{1}{n}\sum_{i=1}^n{\rm E}\left(X_i\right)\\ &=\frac{1}{n}\sum_{i=1}^n \mu = \frac{1}{n} n\mu = \mu. \end{align*}$

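Similarly, using ${\rm Var}(aX) = a^2\,{\rm Var}(X)$ for a constant $a$ and the fact that the variance of a sum of independent random variables is the sum of their variances,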

${\rm Var}\left(\overline X\right) = {\rm Var}\left(\frac{1}{n}\sum_{i=1}^nX_i\right) = \frac{1}{n^2}\sum_{i=1}^n{\rm Var}\left(X_i\right) =\frac{1}{n^2}\sum_{i=1}^n \sigma^2 = \frac{1}{n^2} n\sigma^2 = \frac{\sigma^2}{n}.$

Note: Be careful!

The sample mean is the average calculated from a sample, and is denoted ${\overline X}$ . This is a random variable, because its value depends on the sample chosen, and the sampling is random. Its observed value calculated from sample data is denoted $\overline x$ . This observed value depends on the sample data.
The population mean $\mu$ is a constant at any point in time. It may change with time, but at the time of sampling we assume it is a constant. The value of $\mu$ is usually unknown.

9.2 Sampling Distribution of the Sample Mean

In many sampling situations the population mean $\mu$ is unavailable and is the parameter of interest.
The natural way to estimate $\mu$ is by the sample mean $\overline X$ .
For inference about $\mu$ , we need the distribution of the sample mean $\overline{X}$ .
The distribution of $\overline{X}$ depends on the population distribution and the sampling scheme, and so it is called the sampling distribution of the sample mean.
The sampling distribution of the sample mean depends on the population variance $\sigma^2$ .
Different sampling distributions arise, depending on whether $\sigma^2$ is known or not.
The population distribution is often unknown.

We consider three cases below. These three cases form the basis of estimation and inference and will be used throughout the rest of the book.

Case 1: Normal population—σ known

Let $X_1,X_2,\cdots,X_n$ be iid ${\rm N}(\mu,\sigma^2)$ random variables. Then

$\overline X \sim {\rm N}\left(\mu,\frac{\sigma^2}{n}\right)$

or

$Z=\frac{\overline X-\mu}{\sigma/\sqrt{n}} \sim {\rm N}(0,1).$

This follows from our earlier results that sums and linear scaling of normal random variables are also normal.

We saw this result in the simulation histograms earlier, where we had simulated from a normal distribution.
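With this result, the cereal-box question posed earlier (normal weights with mean 500 g and standard deviation 5 g, sample of 25 boxes) can be answered. A minimal R sketch, using only the values stated in that question:

```r
mu <- 500; sigma <- 5; n <- 25
se <- sigma / sqrt(n)            # standard deviation of the sample mean: 1 g

# P(X-bar < 495) when X-bar ~ N(mu, sigma^2 / n)
p <- pnorm(495, mean = mu, sd = se)
print(p)                         # same as pnorm((495 - mu) / se)
```

Here 495 g is five standard deviations of $\overline X$ below the mean, so the probability is extremely small (of the order of $10^{-7}$ ).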

Example 9.4

The time it takes to serve a customer at a bank teller is normally distributed with mean 150 seconds and variance 900 seconds $^2$ .

(i) What is the probability that a random sample of 25 customers will have a mean service time greater than 160 seconds?

(ii) Repeat (i) above for a sample of 100 customers.

Solution

(i) Let $\overline X$ denote the sample mean. Then

$\begin{align*} \overline X &\sim N\left(150, \frac{900}{25}\right).\\ P(\overline X > 160) &= P\left(Z > \frac{160-150}{\sqrt{900/25}}\right)\\ &= P(Z > 1.67)\\ &= 1- 0.9525 = 0.0475 \approx 5\%. \end{align*}$

(ii) Now

$\begin{align*} \overline X &\sim N\left(150, \frac{900}{100}\right).\\ P(\overline X > 160) &= P\left(Z > \frac{160-150}{\sqrt{900/100}}\right)\\ &= P(Z > 3.33)\\ &= 1- 0.9996 = 0.0004 \approx 0.04\%. \end{align*}$
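The same probabilities can be obtained in R without tables. In the sketch below the function name `p_gt_160` is a hypothetical helper introduced for illustration.

```r
mu <- 150; sigma <- sqrt(900)    # mean and standard deviation, in seconds

# Upper-tail probability P(X-bar > 160) for a sample of size n
p_gt_160 <- function(n) {
  pnorm(160, mean = mu, sd = sigma / sqrt(n), lower.tail = FALSE)
}

print(p_gt_160(25))              # part (i)
print(p_gt_160(100))             # part (ii)
```

Because R works with the unrounded $z$ value, the results (about 0.048 and 0.0004) can differ in the last digit from the table-based answers.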

Example 9.5

A sample of size 100 is drawn from a ${\rm N}\left(\mu, 16\right)$ population. Find

(i) $P(\overline X < \mu)$

(ii) $P\left(\left | \overline X - \mu\right | < 0.2\right)$

(iii) $k$ such that $P\left(\mid\overline X -\mu\mid < k \right) = 0.95$ .

Solution

(i) $P(\overline X < \mu) = 0.5$ , since the normal distribution is symmetric about its mean.

(ii)

$\begin{align*} P\left(\left | \overline X - \mu\right | < 0.2\right) &= P\left(\left |\frac{\overline X - \mu}{\sqrt{16/100}}\right | < \frac{0.2}{\sqrt{16/100}}\right)\\ &= P(\left|Z\right| < 0.5) \\ &= 2\times [P(Z < 0.5) - 0.5]\\ &= 2\times(0.6915 - 0.5) = 0.3830. \end{align*}$

(iii)

$\begin{align*} P\left(\left | \overline X - \mu\right | < k\right) &= P\left(\left |\frac{\overline X - \mu}{\sqrt{16/100}}\right | < \frac{k}{\sqrt{16/100}}\right)\\ &= P\left(\left|Z\right| < \frac{k}{\sqrt{16/100}}\right) \\ &= 0.95\\ \Rightarrow \frac{k}{\sqrt{16/100}} &= 1.96\\ \Rightarrow k &= 1.96 \times \sqrt{16/100} = 0.784. \end{align*}$
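Parts (ii) and (iii) can be checked in R using `pnorm()` for probabilities and `qnorm()` for quantiles. The variable names below are chosen for illustration only.

```r
n  <- 100
se <- sqrt(16 / n)               # standard deviation of X-bar: 0.4

# (ii) P(|X-bar - mu| < 0.2) = P(|Z| < 0.2 / se)
p_ii <- pnorm(0.2 / se) - pnorm(-0.2 / se)

# (iii) k with P(|X-bar - mu| < k) = 0.95:
#       the 97.5% point of N(0,1) multiplied by se
k <- qnorm(0.975) * se

print(c(p_ii = p_ii, k = k))
```

Neither answer depends on the unknown value of $\mu$ , because only the difference $\overline X - \mu$ enters the calculation.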

Case 2: Normal Population—σ Unknown

Now we estimate the population standard deviation by the sample standard deviation $S$ . Then

$T = \frac{\overline X - \mu}{S/\sqrt{n}} \sim t_{n-1}$

where $t_{n-1}$ is a t distribution with $n-1$ degrees of freedom.

The $t$ -distribution has a similar shape to that of the standard normal distribution, but is flatter. We say the $t$ -distribution has “fatter tails”. For large values of degrees of freedom (df), the $t$ -distribution is very close to the normal distribution. The plot below shows the density functions for the standard normal distribution with several $t$ -distributions superimposed.

Plots of the t-distribution for various degrees of freedom, with the normal distribution superimposed. The t-distribution is more spread out than the normal distribution. Its spread decreases as the degrees of freedom increase, and it approaches the normal distribution.

Notes

$S^2 = \frac{1}{n-1} \sum_{i= 1}^n \left(X_i - \overline X\right)^2$ is the estimator of $\sigma^2$ . The divisor in this expression is $n-1$ , which is the degrees of freedom of the $t$ -distribution. The divisor in the estimator for the variance ALWAYS gives the df for the corresponding $t$ -distribution for the problem.
${\rm Var}\left(\overline X\right) = \frac{\sigma^2}{n},$

so the standard deviation of $\overline X$ is $\sigma/\sqrt{n}$ , and this is estimated by

$\frac{S}{\sqrt{n}} = {\rm \ standard\ error\ (SE)\ of\ the\ mean.}$
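As an illustration, the standard error and the degrees of freedom can be computed from data in R as sketched below. The eight observations are made up for the example.

```r
# Hypothetical sample of n = 8 measurements (made up for illustration)
x <- c(12.1, 11.8, 12.6, 12.0, 11.5, 12.3, 12.2, 11.9)
n <- length(x)

s  <- sd(x)                      # sample standard deviation (divisor n - 1)
se <- s / sqrt(n)                # standard error of the mean
df <- n - 1                      # degrees of freedom for the t distribution

print(c(mean = mean(x), s = s, se = se, df = df))
```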

t-tables

$t$ -tables list the $t$ -values for only the tail probabilities of given degrees of freedom. But R gives probabilities for all values for all degrees of freedom.

> pnorm(1.95) ##P(Z < 1.95) for Z ~ N(0,1)
[1] 0.9744119
> pt(1.95, 5) ##P(T < 1.95) for T ~ t_5 distribution
[1] 0.9456649
> pt(1.95, 20) ##P(T < 1.95) for T ~ t_20 distribution
[1] 0.9673328
> pt(1.95, 200) ##P(T < 1.95) for T ~ t_200 distribution
[1] 0.9737131
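The quantile function `qt()` goes in the other direction, returning the $t$ -value for a given probability, which is what a $t$ -table provides. A short sketch comparing the upper 2.5% points of several $t$ -distributions with that of the standard normal:

```r
# Upper 2.5% points: the value t such that P(T < t) = 0.975
z_crit <- qnorm(0.975)           # standard normal
df     <- c(5, 20, 200)
t_crit <- qt(0.975, df)          # t distributions with 5, 20 and 200 df
names(t_crit) <- paste0("df=", df)

print(z_crit)
print(t_crit)                    # decreases towards z_crit as df grows
```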

Case 3: Population Not Normal—σ Unknown

CENTRAL LIMIT THEOREM (CLT)

Suppose $X_1, X_2, \ldots, X_n$ are iid random variables with mean $\mu$ and variance $\sigma^2$ . If $n$ is large enough, then

$Z = \frac{\overline X - \mu}{\sigma/\sqrt{n}} \overset{\cdot}{\underset{\cdot}{\sim}} N(0,1)$

Notes

We consider $n \ge 30$ to be large enough for the approximation to be good. As $n$ increases the approximation improves.

If $\sigma$ is unknown then

$T = \frac{\overline X - \mu}{S/\sqrt{n}} \overset{\cdot}{\underset{\cdot}{\sim}} t_{n-1}.$

BUT we must have $n \ge 30$ .

Notes

If the sample size is large (a few hundred), then the $t_{n-1}$ distribution can be approximated by the ${\rm N}(0,1)$ distribution. This is the most common situation.
Although historically the N(0,1) distribution has been used in hand calculations, R ALWAYS uses the $t_{n-1}$ distribution.
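The CLT can be explored by simulation. The R sketch below draws samples of size $n = 30$ from an exponential population, which is strongly skewed, and standardises each sample mean. The choice of population, its mean of 10, the seed and the number of replications are all assumptions made for illustration.

```r
set.seed(42)                     # arbitrary seed, for reproducibility
rate  <- 1 / 10                  # hypothetical exponential population
mu    <- 1 / rate                # its mean is 10
sigma <- 1 / rate                # its standard deviation is also 10
n     <- 30

# 5000 standardised sample means from a strongly skewed population
z <- replicate(5000, (mean(rexp(n, rate)) - mu) / (sigma / sqrt(n)))

print(c(mean = mean(z), sd = sd(z)))   # should be close to 0 and 1
print(mean(z < 1.96))                  # compare with pnorm(1.96)
```

The agreement with the standard normal is close but not exact at $n = 30$ , and it improves as $n$ increases.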

Example 9.6

A particular long-life light bulb has a mean life of 1500 hours with a standard deviation of 300 hours. What is the probability that a random sample of 200 bulbs has a mean life time less than 1450 hours?

Solution

Here the standardised sample mean is

$Z= \frac{\overline X - 1500}{300/\sqrt{200}} \overset{\cdot}{\underset{\cdot}{\sim}} N(0,1),$

by the CLT ( $n = 200 \ge 30$ ).

$\begin{align*} P(\overline X < 1450) &= P\left(Z < \frac{1450 - 1500}{300/\sqrt{200}}\right)\\ &= P(Z < -2.36)\\ &= 0.0091. \end{align*}$
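The calculation can be reproduced in R as sketched below. It uses the unrounded standardised value, so the probability may differ slightly in the last digit from a table-based answer.

```r
mu <- 1500; sigma <- 300; n <- 200
se <- sigma / sqrt(n)            # approximate sd of the sample mean, about 21.2 hours

z <- (1450 - mu) / se            # standardised value, about -2.36
p <- pnorm(z)                    # approximate P(X-bar < 1450) by the CLT
print(c(z = z, p = p))
```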

9.3 Summary

| Population standard deviation $\sigma$ | Normal distribution | Not normal distribution |
|---|---|---|
| Known | $Z = \frac{\overline{X} -\mu}{\sigma/\sqrt{n}} \sim {\rm N}(0,1)$ (exact distribution) | $Z = \frac{\overline{X} -\mu}{\sigma/\sqrt{n}} \overset{\cdot}{\underset{\cdot}{\sim}} {\rm N}(0,1)$ (approximate distribution) |
| Unknown | $T = \frac{\overline{X} -\mu}{S/\sqrt{n}} \sim {\rm t}_{n-1}$ (exact distribution) | $T = \frac{\overline{X} -\mu}{S/\sqrt{n}} \overset{\cdot}{\underset{\cdot}{\sim}} {\rm t}_{n-1}$ (approximate distribution) |

Licence

Creative Commons Attribution-NonCommercial 4.0 International License
