1 Data Collection

Learning Outcomes

At the end of this chapter you should be able to:

  1. state the sources of data;
  2. identify and discuss aspects of sampling;
  3. explain the sources of bias in sampling;
  4. explain and identify self-selection bias;
  5. explain and identify voluntary response bias;
  6. explain the effects of non-response;
  7. understand the importance of and issues with a survey instrument.

 

 

1.1 Introduction

Decisions should be based on evidence, and in quantitative disciplines evidence takes the form of data. Obtaining data is therefore the first step in this process. Before any data is collected, the research question (or question of interest) should be formulated clearly and unambiguously. This step is often ignored, leading to meaningless inquiry. A well-posed question then determines what data is required. Failing to obtain the “correct” data to answer the question of interest is a common problem.

The next step is obtaining the data. In this short chapter we will consider some common sources of data.

1.2 Sources of data

Primary data is collected directly by the researcher for a particular purpose. Secondary data is collected by someone else for some other purpose, but is still useful for answering the question of interest. Obtaining primary data is expensive and time-consuming, so always check first whether the required data already exists and is available. Secondary data is sometimes free and sometimes attracts a cost: many agencies collect data and then sell it to interested parties. If suitable secondary data is not available then primary data must be obtained.

The following are some sources of free secondary business data.

  • The Reserve Bank. Data includes historical interest rates, inflation, various indices, and historical forecasts.
  • Australian Bureau of Statistics. General population data, such as average weekly income, unemployment rate and consumer price index. Census data contains detailed information about people, including population composition by ethnic group, religion, sex, education, residence, and age.
  • World Bank. Global finance data, food prices, climate data, statistical performance indicators.
  • Yahoo Finance. Live and historical stock market and exchange rate data.
  • Department of Climate Change, Energy, the Environment and Water. Environmental data: species profile and threats; Indigenous protected areas.
  • SILO Climate Database. Climate data.
  • OECD. Population demography.
  • DHS. Demographic and Health Surveys, which provide data on the health of populations in developing nations. In particular, the surveys collect data on child health and nutrition.

1.3 Surveys

Most data is obtained by survey. Two important aspects of a survey are selecting the sample and designing the survey. The sample must be selected in such a way that it is representative of the population. Three common sources of bias in samples are listed below.

  1. Spatial or location bias: only certain locations are sampled. The “location” can be physical or more general, such as the readership of a magazine.
  2. Temporal or time bias: samples are taken at fixed times or in a narrow window of time, so again only a certain part of the population is sampled.
  3. Incorrect population: the sample is drawn from a population other than the one of interest. For example, the question of interest is the attitude of Perth residents to COVID-19 vaccination, but the sample includes residents from other cities.
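The effect of location bias can be illustrated with a small simulation. The following is a minimal sketch only: the two suburbs, their sizes and their support rates are invented for illustration, and the names `population` and `support_rate` are hypothetical. It compares a simple random sample of the whole population with a sample drawn from one location only.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 residents in two suburbs with different
# levels of support for some proposal (all numbers are invented).
population = (
    [("inner", random.random() < 0.70) for _ in range(3_000)]
    + [("outer", random.random() < 0.40) for _ in range(7_000)]
)

def support_rate(people):
    """Proportion of people in the list who support the proposal."""
    return sum(supports for _, supports in people) / len(people)

true_rate = support_rate(population)

# Simple random sample: every resident has the same chance of selection.
random_sample = random.sample(population, 500)

# Location-biased sample: the surveyor only visits the inner suburb.
inner_only = [person for person in population if person[0] == "inner"]
location_sample = random.sample(inner_only, 500)

print(f"Population support:       {true_rate:.3f}")
print(f"Simple random sample:     {support_rate(random_sample):.3f}")
print(f"Inner-suburb-only sample: {support_rate(location_sample):.3f}")
```

Under these assumptions the simple random sample should land close to the population value, while the single-location sample reflects the inner suburb only and overstates support, no matter how many people are sampled there.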

1.4 The Survey Instrument

The survey questionnaire is called an instrument, as it measures quantities of interest. Questionnaire design is a large and complex area. The way a question is written and the overall questionnaire design can affect the responses.

Most surveys start with demographic questions, although these are sometimes placed at the end of the survey.

Discussion points

  • How does one ask questions on sensitive topics, such as drug use and infidelity? Do the respondents provide honest answers?
  • What about questions on satisfaction? Are the responses reliable?

Survey questionnaires should be designed, and surveys conducted, by experts in the area.

1.5 Issues with conducting surveys

If the survey is not correctly conducted then the results may be compromised. Three common issues with surveys and sampling are non-response, voluntary response and selection bias.

Non-response occurs when a respondent does not answer an item. This can lead to bias if the variable being measured is related to the non-response. For example, questions on salaries often lead to non-response, as people on higher incomes are less likely to respond. Often only a few items in a questionnaire suffer from non-response, but if the proportion of non-response is high in key questions then the survey may be rendered useless.
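The salary example can be sketched as a simulation. This is an illustration under stated assumptions, not a model of any real survey: the salary distribution and the response model in the hypothetical function `answers_salary_question` (the chance of answering falls as salary rises) are both invented.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical annual salaries in thousands of dollars (right-skewed).
salaries = [random.lognormvariate(4.3, 0.5) for _ in range(20_000)]

def answers_salary_question(salary):
    """Assumed response model: higher earners are less likely to answer."""
    p_answer = max(0.2, 0.95 - 0.004 * salary)
    return random.random() < p_answer

# Only the salaries of people who answered the item are observed.
observed = [s for s in salaries if answers_salary_question(s)]

true_mean = statistics.mean(salaries)
observed_mean = statistics.mean(observed)
response_rate = len(observed) / len(salaries)

print(f"Item response rate:        {response_rate:.1%}")
print(f"True mean salary:          {true_mean:.1f}")
print(f"Mean among those answered: {observed_mean:.1f}")
```

Because the chance of answering depends on the very quantity being measured, the mean among respondents falls below the true mean. Collecting more responses under the same mechanism would not remove this gap.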

In voluntary response surveys, the sample is selected by the surveyor, but people in the sample choose whether to respond or not. This differs from non-response to individual items: those who choose not to participate answer none of the items in the questionnaire.

In selection bias, a subset of the population is systematically selected according to some attribute. Equally, a subset of the population is excluded based on some attribute. Consequently, proper randomisation is not achieved.

A common form of selection bias is self-selection bias, which occurs when respondents select themselves into the sample. That is, the surveyor does not select the sample; people choose whether to respond or not. Examples are radio and TV station polls and online polls.

Issues with the sampling can invalidate the study and prevent the results from being generalised to the whole population.

Example 1.1 The 1936 US Presidential Election

The 1936 US presidential election was mainly a contest between the incumbent Democratic President Franklin D. Roosevelt and the Republican candidate Alf Landon. From 1916 onwards, The Literary Digest had run a poll for every US presidential election, and in the five elections from 1916 to 1932 the magazine had correctly predicted the outcome. Its methodology was to mail out millions of postcards to the magazine's subscribers and use the straw poll to predict a winner.

In 1936 the magazine sent out 10 million postcards to its subscribers and to people listed in automobile registrations and telephone directories. Of these, 2.4 million were returned. Based on the result of the poll, the magazine predicted a win for Landon with 57% of the votes. In the election Roosevelt received a landslide 61% of the votes, winning all but two states.
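The figures above can be put side by side in a few lines of Python. This is a rough sketch only: it treats the election as a two-candidate contest (an assumption, so Landon's actual share is taken to be at most 39%), and it computes the margin of error that would apply if the returned postcards had been a simple random sample, which they were not.

```python
import math

mailed = 10_000_000       # postcards sent out (from the text)
returned = 2_400_000      # postcards returned (from the text)
predicted_landon = 0.57   # Landon's share in the poll (from the text)
actual_roosevelt = 0.61   # Roosevelt's share in the election (from the text)

response_rate = returned / mailed

# Assumption: two candidates only, so Landon's actual share is at most
# 1 - Roosevelt's share. Votes for minor candidates would lower it further.
max_actual_landon = 1 - actual_roosevelt
min_error = predicted_landon - max_actual_landon

# Nominal 95% margin of error for a proportion, valid only for a simple
# random sample of this size. Shown for comparison, not as a valid result.
moe = 1.96 * math.sqrt(predicted_landon * (1 - predicted_landon) / returned)

print(f"Response rate: {response_rate:.0%}")
print(f"Poll overstated Landon by at least {min_error * 100:.0f} percentage points")
print(f"Nominal 95% margin of error: {moe * 100:.2f} percentage points")
```

Running this gives a response rate of 24%, an overstatement of at least 18 percentage points, and a nominal margin of error of about 0.06 percentage points. The contrast shows that a very large sample does not protect against bias: the sample size makes the estimate precise, but not accurate.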

Several analyses of the reasons for the incorrect prediction have been published (see, for example, Lusinchi, 2012). The basic error is that this was a voluntary response survey, in which the people in the sample decide for themselves whether to participate. Such surveys are biased because those who participate are mostly people with strong views, often negative ones. Following this disaster, the magazine went bankrupt.

Incorrect sampling techniques can lead to erroneous survey results, which can have serious human, social, political and economic consequences.

Reference

Lusinchi, D. (2012). “President” Landon and the 1936 Literary Digest Poll: Were Automobile and Telephone Owners to Blame? Social Science History, 36(1).

Licence


Statistics: Meaning from data Copyright © 2024 by Dr Nazim Khan is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
