7 Contingency Tables

Learning Outcomes

At the end of this chapter you should be able to:

  1. understand what a contingency table is;
  2. perform a chi squared test of independence for a contingency table;
  3. determine if two categorical variables are independent in a contingency table context;
  4. determine the form of dependence if any in a contingency table;
  5. explain your findings clearly.

 

 

7.1 Introduction

Inference for categorical or counts data was briefly covered in inference for population proportions and Poisson mean. We now consider categorical data in more detail and greater generality.

We consider the situation where two categorical variables are involved, and a joint table of observed frequencies is available. This is called a contingency table. We analyse the data for association between the variables, that is, we want to investigate if the variables are independent.

Such data often arises from a survey.

Example 7.1

A manufacturing facility has three different production lines, each of which produces the same product. The production manager wants to know if the proportion of defectives is the same for each production line. Data from each production line was collected, and is summarised in the table below.

Production line A B C Row Total
Defective 12 22 14 48
Good 188 148 196 532
Column total 200 170 210 580

At the 5% level of significance, determine if the proportion of defectives is the same for each production line.

Method

  1. Given a table of observed counts O_{ij} in row i and column j, we compute the expected counts E_{ij} under the null hypothesis.
  2. We then compute a test statistic based on the difference between the observed frequencies and the expected frequencies.
  3. We require that the expected frequency in each cell is at least 5; otherwise adjacent cells are pooled (merged, combined) until this requirement is satisfied.

The hypotheses

Here we are testing the hypotheses

    \begin{align*} H_0&:{\rm The\ two\ variables\ are\ independent.}\\ H_1&:{\rm The\ two\ variables\ are\ dependent.} \end{align*}

The observed frequency in each cell of the table is found assuming the null hypothesis to be true, that is, under the assumption of independence.

7.2 General Contingency Table

The general contingency table. In Row i and Column j the observed frequency is O_{ij}, the row total is R_i and the column total is C_j. The grand total, that is, the total of all the observations in the table is T.
\cdots col j \cdots Row Total
\cdots \cdots \cdots \cdots
row i \cdots O_{ij} \cdots R_i
\cdots \cdots \cdots \cdots
Column Total \cdots C_j \cdots T

We use ideas from joint distributions (see Chapter 6).
If the total of row i is R_i, the total of column j is C_j and the grand total is T, then the probability of being in row i

    \[r_i = R_i/T,\]

the probability of being in column j is

    \[c_j=C_j/T,\]

so, assuming independence, the probability of being in cell ij is

    \[p_{ij}=r_i\ c_j = \frac{R_i\ C_j}{T^2}.\]

Then the expected frequency of cell ij is

    \[e_{ij} = p_{ij}\ T = \frac{R_i\ C_j}{T^2} \times T = \frac{R_i\ C_j}{T}.\]

Test Statistic

    \[X^2 = \sum_{i=1}^r\sum_{j=1}^c\frac{\left(o_{ij}-e_{ij}\right)^2}{e_{ij}},\]

and this has a chi-squared distribution with degrees of freedom \nu, denoted \chi^2_{_{\nu}}, where the degrees of freedom is given by

    \[\nu = \left(r-1\right)\times\left(c-1\right),\]

where r and c are the number of rows and columns respectively.

Example 7.1

A manufacturing facility has three different production lines, each of which produces the same product. The production manager wants to know if the proportion of defectives is the same for each production line. Data from each production line was collected, and is summarised in the table below.

Production line A B C Row Total
Defective 12 22 14 48
Good 188 148 196 532
Column total 200 170 210 580

At the 5% level of significance, determine if the proportion of defectives is the same for each production line.

Solution

The hypotheses of interest are

H_0: Proportion of defectives is independent of machine (that is, the proportion of defectives is the same for all the machines).

H_1: H_0 is false.

The table below shows the expected frequencies (in brackets) for each cell under the null hypothesis, that is, under the assumption of independence. Note that these expected frequency sum to give the same row and column totals.

Production line A B C Row Total
Defective 12 (16.55) 22 (14.07) 14 (17.38) 48
Good 188 (183.45) 148 (155.93) 196 (192.62) 532
Column total 200 170 210 580

We provide the calculations for the expected frequencies for the first row to illustrate the process.

    \begin{align*} \frac{(48)(200)}{580} &= 16.55\\ \frac{(48)(170)}{580} &= 14.07\\ \frac{(48)(210)}{580} &= 17.38 \end{align*}

Exercise 7.1

Calculate the remaining expected frequencies.

The observed value of the test statistic is

    \begin{eqnarray*} X^2 &=& \frac{(12-16.55)^2}{16.55} + \frac{(22-14.07)^2}{14.07}  + \frac{(14-17.38)^2}{17.38}\\ && + \frac{(188-183.45)^2}{183.45} + \frac{(148-155.93)^2}{155.93} +\frac{(196-192.62)^2}{192.62}\\ &=& 1.2517 + 4.4709 + 0.6571\\ && 0.1129  + 0.4034 + 0.0593\\ &=& 6.9554 \end{eqnarray*}

The degrees of freedom for the chi squared distribution is (3-1)(2-1) = 2, and the p-value for the test is

    \[P(X > 6.9554) = 0.0309 < 0.05\]

(from R), where X \sim \chi^2_2, so we reject the null hypothesis at the 5 % level of significance. We conclude that the data provides evidence that the proportion of defectives is not the same for the production lines. A plot of the  \chi^2_2 is shown below, with the 5% upper tail  shaded.

The pdf of the chi-squared distribution with df = 2. The curve starts at the point 0.5 on the y-axis, and approaches the x-axis as a horizontal asymptote as x goes to infinity. The probability of 0.05 in the lower tail is shaded, and the chi-squared value corresponding to this value is 5.99, indicated on the graph.
The pdf of the chi-squared distribution with df = 2. The probability of 0.05 in the lower tail is shaded, and the chi-squared value corresponding to this value is 5.99, indicated on the graph.

We would also like to identify which production line is producing the higher proportion of defectives. In fact, looking that the chi square calculations, the largest contribution is in the cell corresponding to production line B Defectives. Examining the expected frequencies shows that it has a much higher number of defectives than that expected under the null hypothesis. Thus we conclude that production line 2 produces a higher proportion of defectives than the other two machines.

Exercise 7.2

A mobile technology provider has three different locations set up to provide customers with technical support for their products. The management wants to identify the call centre with the highest proportion of unsuccessfully resolved calls. They randomly select logs from each location and collect the following data.

Call centre A B C Row Total
Successful 257 264 283 804
Unsuccessful 43 86 97 226
Column total 300 350 380 1030

Based on this data, what should management conclude?

7.3 Analysis of contingency tables in R

The R code below is for the analysis for the data of Exercise 7.2.

The output gives the observed value of the test statistic and the p-value. The hypothesis test is conducted using the p-value; in this case the p-value =0.0007 < 0.05, so there is sufficient evidence to reject the null hypothesis.

M <- as.table(rbind(c(257, 264, 283), c(43, 86, 97)))
dimnames(M) <- list(Calls = c("Successful", "Unsuccessful"),
                    Location = c("1","2", "3"))
 (Xsq <- chisq.test(M))  # Prints test summary
#Enclosing a command in brackets runs the command and also prints out the output.
	Pearson's Chi-squared test

data:  M
X-squared = 14.404, df = 2, p-value = 0.0007453

Xsq$observed   # observed counts (same as M)
              Location
Calls            1   2   3
  Successful   257 264 283
  Unsuccessful  43  86  97
Xsq$expected   # expected counts under the null hypothesis of independence or no association
              Location
Calls                  1         2         3
  Successful   234.17476 273.20388 296.62136
  Unsuccessful  65.82524  76.79612  83.37864

Xsq$residuals  # Pearson residuals
              Location
Calls                   1          2          3
  Successful    1.4915759 -0.5568365 -0.7908957
  Unsuccessful -2.8133202  1.0502713  1.4917397

Xsq$stdres     # standardized residuals
              Location
Calls                  1         2         3
  Successful    3.782396 -1.463040 -2.125424
  Unsuccessful -3.782396  1.463040  2.125424

Note that the output contains (Pearson and standardised) residuals which makes it easier to determine any associations in the data. From this, we see that the centres with the largest (Pearson and standardised) residuals are 1 for successful calls and 3 for unsuccessful calls. So centre 1 is identified as having more successfully resolved calls than expected under independence, and centre 3 has more unsuccessfully resolved calls.

 

Licence

Icon for the Creative Commons Attribution-NonCommercial 4.0 International License

Statistics: Meaning from data Copyright © 2024 by Dr Nazim Khan is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.

Share This Book