Errors in research

Dr Aila Khan; Dr Munir Hossain; Dr Sabreena Amin


This chapter is adapted from material published by the Australian Bureau of Statistics, licensed under a Creative Commons Attribution 4.0 International license.

Learning Objectives

By the end of this chapter, students must be able to:

explain sampling errors and non-sampling errors in research
understand the sources leading to such errors
demonstrate an understanding of strategies to reduce such errors

Introduction

The accuracy of a survey estimate refers to the closeness of the estimate to the true population value. Where there is a discrepancy between the value of the survey estimate and the true population value, the difference between the two is referred to as the error of the survey estimate. The total error of the survey estimate results from two types of error:

sampling error, which arises when only a part of the population is used to represent the whole population, and which can be measured mathematically
non-sampling error, which can occur at any stage of a sample survey

It is important for a researcher to be aware of these errors, in particular non-sampling errors, so that they can be either minimised or eliminated from the survey. An introduction to sampling error and non-sampling error is provided in the following sections.

Sampling error

Sampling error reflects the difference between an estimate (e.g., an average) derived from a sample survey and the “true value” (i.e., the actual population average) that would be obtained if the whole survey population were enumerated. It could be measured exactly from the population values, but as these are usually unknown (or very difficult to obtain), it is estimated from the sample data instead. It is important to consider sampling error when publishing survey results, as it indicates the accuracy of the estimate and therefore how much weight can be placed on interpretations. If sampling principles are applied carefully within the constraints of available resources, sampling error can be accurately measured and kept to a minimum.
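As an illustration of estimating sampling error from the sample data, the following minimal sketch computes the standard error of a sample mean and a rough 95% interval. The data values and the function name are hypothetical, and the sketch assumes a simple random sample; with a sample this small, a t-multiplier would normally be preferred to 1.96.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimate the standard error of the sample mean from the sample itself."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(n)

# Hypothetical sample: number of fruit trees on eight surveyed farms.
sample = [320, 410, 365, 290, 455, 380, 340, 400]

mean = statistics.mean(sample)
se = standard_error_of_mean(sample)

# Approximate 95% interval using the normal multiplier 1.96 (an assumption).
lower, upper = mean - 1.96 * se, mean + 1.96 * se

print(f"estimate = {mean:.1f}, standard error = {se:.1f}")
print(f"approximate 95% interval: {lower:.1f} to {upper:.1f}")
```

The wider the interval, the less weight the estimate can bear in interpretation.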

Factors Affecting Sampling Error

Sampling error is affected by a number of factors, including sample size, sample design, the sampling fraction, and the variability within the population. In general, larger sample sizes decrease the sampling error; however, the decrease is not directly proportional. As a rough rule of thumb, you need to increase the sample size fourfold to halve the sampling error. The sampling fraction (the proportion of the population included in the sample) has much less influence, but as the sample size increases as a fraction of the population, the sampling error should decrease.
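The fourfold rule of thumb follows from the standard error of a mean being proportional to one over the square root of the sample size. The sketch below assumes simple random sampling and uses an illustrative population standard deviation; the function name and the numbers are assumptions for demonstration only. The optional finite population correction shows why the sampling fraction matters only when it is large.

```python
import math

def se_mean(pop_sd, n, population_size=None):
    """Standard error of a sample mean under simple random sampling.

    If population_size is given, apply the finite population correction
    sqrt(1 - n / N), which reflects the sampling fraction n / N.
    """
    se = pop_sd / math.sqrt(n)
    if population_size is not None:
        se *= math.sqrt(1 - n / population_size)
    return se

pop_sd = 100.0  # assumed population standard deviation (illustrative)

# Quadrupling the sample size (100 -> 400) halves the standard error.
se_100 = se_mean(pop_sd, 100)
se_400 = se_mean(pop_sd, 400)
print(se_100, se_400)

# The same sample of 400 drawn from a very large and a small population:
# the correction is negligible when the sampling fraction is tiny.
print(se_mean(pop_sd, 400, population_size=100_000))
print(se_mean(pop_sd, 400, population_size=1_000))
```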

The population variability also affects the sampling error. More variable populations give rise to larger errors because estimates calculated from different samples are likely to vary more. The effect of the variability within the population can be reduced by increasing the sample size to make it more representative of the survey population. Various sample design options also affect the size of the sampling error. For example, stratification reduces sampling error whereas cluster sampling tends to increase it. Sampling error can be estimated statistically and is used when interpreting statistical results.
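To illustrate how stratification can reduce sampling error, the following sketch compares the textbook variance of a sample mean under simple random sampling with that under stratified sampling with proportional allocation. The tiny population, the stratum labels and the sample size are all hypothetical; cluster sampling is not covered here.

```python
import statistics

# Hypothetical population split into two strata (e.g. small and large farms).
strata = {
    "small": [100, 110, 120, 130, 140, 150],
    "large": [400, 420, 440, 460, 480, 500],
}
population = [y for values in strata.values() for y in values]
N = len(population)
n = 4  # total sample size (assumed)

# Simple random sampling: Var(mean) = (1 - n/N) * S^2 / n
var_srs = (1 - n / N) * statistics.variance(population) / n

# Stratified sampling with proportional allocation:
# Var(mean) = sum over strata of W_h^2 * (1 - n_h/N_h) * S_h^2 / n_h
var_strat = 0.0
for values in strata.values():
    N_h = len(values)
    W_h = N_h / N          # stratum share of the population
    n_h = n * N_h // N     # proportional allocation (2 per stratum here)
    var_strat += W_h**2 * (1 - n_h / N_h) * statistics.variance(values) / n_h

print(f"variance of mean, simple random sample: {var_srs:.1f}")
print(f"variance of mean, stratified sample:    {var_strat:.1f}")
```

Because most of the variability in this made-up population lies between the strata rather than within them, the stratified design gives a much smaller variance for the same sample size.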

https://www.youtube.com/watch?v=XE7QDfdaQ68

Source: Frances Chumney ^[1]

Non-sampling error

Non-sampling error comprises all other errors in the estimate (e.g., a sample average). It includes every error that arises for reasons other than the sample plan or sample size. Some examples of causes of non-sampling error are a low response rate to the questionnaire, a badly designed questionnaire, respondent bias, and processing errors. Non-sampling errors can occur at any stage of the process, and they can be found in both censuses and sample surveys.

Sources of non-sampling errors are discussed below.

Non-Response Bias

Non-response refers to the situation in which respondents either do not respond to any of the survey questions (total non-response) or do not respond to some survey questions owing to sensitive questions, recall problems, inability to answer, etc. (partial non-response). To improve response rates, care should be taken in designing the questionnaires, training interviewers, assuring the respondent of confidentiality, and calling back at different times when the respondent is difficult to contact. “Call-backs” are successful in reducing non-response but can be expensive for personal interviews. For online surveys, a gentle email reminder is also used to improve response rates.

Questionnaire problems

The content and wording of the questionnaire may be misleading and the layout of the questionnaire may make it difficult to accurately record responses. Questions should not be loaded, double-barrelled, misleading, or ambiguous, and should be directly relevant to the objectives of the survey.

It is essential that questionnaires are tested on a sample of respondents before they are finalised, so that problems with questionnaire flow and question wording can be identified. Sufficient time should be allowed for improvements to be made to the questionnaire. The questionnaire should then be re-tested to ensure that the changes made do not introduce other problems.

Respondent Bias

At times, respondents may provide inaccurate information as they believe they are protecting their personal interests and integrity. Careful questionnaire design and effective questionnaire testing can overcome these problems to some extent. Given below are two types of situations that can be avoided through better design and implementation of surveys.

Sensitivity
If respondents are faced with a question that they find embarrassing, they may refuse to answer, or choose a response that prevents them from having to continue with the questions. For example, if asked the question: “Are you taking any oral contraceptive pills for any reason?”, and knowing that if they say “Yes” they will be asked for more details, respondents who are embarrassed by the question are likely to answer “No”, even if this is incorrect.
Fatigue
Fatigue can be a problem in surveys that require a high level of commitment from respondents. An example is a diary survey in which respondents have to record all expenses made in a two-week period. In these types of surveys, the level of accuracy and detail supplied may decrease as respondents become tired of recording all expenditures.

Processing Errors

There are four stages in the processing of the data where errors may occur: data grooming, data capture, editing, and estimation. Data grooming involves preliminary checking before the data is entered into the processing system in the capture stage. Inadequate checking and quality management at this stage can introduce data loss (where data is not entered into the system) and data duplication (where the same data is entered into the system more than once). Inappropriate edit checks and inaccurate weights in the estimation procedure can also introduce errors to the data. To minimise these errors, processing staff should be given adequate training and realistic workloads.
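A simple quality check at the capture stage can flag both data loss and data duplication. The sketch below assumes each returned form carries a unique identifier; the function name and the identifiers are hypothetical.

```python
from collections import Counter

def check_capture(expected_ids, captured_ids):
    """Compare form IDs received from the field with IDs entered in the system."""
    counts = Counter(captured_ids)
    # Data loss: forms received but never entered.
    missing = sorted(set(expected_ids) - set(counts))
    # Data duplication: forms entered more than once.
    duplicated = sorted(i for i, c in counts.items() if c > 1)
    return missing, duplicated

expected = [101, 102, 103, 104, 105]        # forms received (hypothetical)
captured = [101, 102, 102, 104, 105, 105]   # forms keyed into the system

missing, duplicated = check_capture(expected, captured)
print("missing:", missing)
print("duplicated:", duplicated)
```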

Misinterpretation of Results

This can occur if the researcher is not aware of certain factors that influence the characteristics under investigation. A researcher or any other user not involved in the collection stage of the data gathering may be unaware of trends built into the data due to the nature of the collection, such as its scope. For example, a survey that collected income as a data item, with a coverage and scope of all adult persons (i.e., 18 years or older), would be expected to produce a different estimate from that produced by the ABS Survey of Average Weekly Earnings (AWE), simply because AWE includes persons aged 16 and 17 years in its scope. Researchers should carefully investigate the methodology used in any given survey.

Time Period Bias

This occurs when a survey is conducted during an unrepresentative time period. For example, if a survey aims to collect details on ice-cream sales, but only collects a week’s worth of data during the hottest part of summer, it is unlikely to represent the average weekly sales of ice cream for the year.

https://www.youtube.com/watch?v=zF37RvnNHnk

Source: Frances Chumney ^[2]

Minimising non-sampling error

Non-sampling error can be difficult to measure accurately, but it can be minimised by

careful selection of the time the survey is conducted,
using an up-to-date and accurate sampling frame,
planning for follow up of non-respondents,
careful questionnaire design,
providing thorough training for interviewers and processing staff and
being aware of all the factors affecting the topic under consideration.

Since many surveys suffer from poor response rates, the following sections focus on ways of reducing non-response among potential respondents.

Minimising Non-Response

Response rates can be improved through good survey design: short, simple questions, good form design techniques, and a clear explanation of the survey's purposes and uses. Assurances of confidentiality are very important, as many respondents are unwilling to respond because they fear a lack of privacy. Targeted follow-ups of non-contacts or those initially unable to reply can increase response rates significantly. The following are some hints on how to minimise refusals in a personal or phone contact:

Find out the reasons for refusal and try to talk through them
Use positive language
State how and what you plan to do to help with the questionnaire
Stress the importance of the survey
Explain the importance of their response as a representative of other units
Emphasise the benefits of the survey results and explain how respondents can obtain them
Give assurance of the confidentiality of the responses

Other measures that can improve respondent cooperation and maximise response include:

Public awareness activities, including discussions with key organisations and interest groups, news releases, media interviews, and articles. These are aimed at informing the community about the survey, identifying issues of concern, and addressing them.
Advice to selected units by letter, giving them advance notice and explaining the purposes of the survey and how it is going to be conducted.

In the case of a mail survey, most of the points above can be stated in an introductory letter or through a publicity campaign.

Allowing for Non-Response

Where response rates are still low after all reasonable attempts at follow-up have been made, you can reduce bias by using population benchmarks to post-stratify the sample, by intensively following up a subsample of the non-respondents, or by imputing values for item non-response (non-response to a particular question).
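Post-stratification can be sketched as follows: respondents are grouped into strata, and each stratum mean is weighted by its known share of the population rather than by its share of the responses. The strata, benchmark counts and values below are hypothetical, and the function name is an assumption.

```python
def post_stratified_mean(respondents, benchmarks):
    """Weight each stratum's respondent mean by its known population share.

    respondents: dict mapping stratum -> list of reported values
    benchmarks:  dict mapping stratum -> known population count
    """
    total = sum(benchmarks.values())
    return sum(
        benchmarks[s] / total * (sum(vals) / len(vals))
        for s, vals in respondents.items()
    )

# Hypothetical responses: large farms responded far more readily.
respondents = {
    "large_farms": [500, 520, 480, 510, 490, 500],
    "small_farms": [200, 220],
}
# Hypothetical population benchmarks: small farms are actually the majority.
benchmarks = {"large_farms": 400, "small_farms": 600}

all_values = [v for vals in respondents.values() for v in vals]
unweighted = sum(all_values) / len(all_values)
adjusted = post_stratified_mean(respondents, benchmarks)

print(f"unweighted mean:      {unweighted:.1f}")
print(f"post-stratified mean: {adjusted:.1f}")
```

In this made-up case the unweighted mean is pulled upwards by the over-represented large farms, and the post-stratified mean corrects for that. The correction relies on respondents within each stratum resembling the non-respondents in that stratum.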

The main aim of imputation is to produce consistent data without going back to the respondent for the correct values, thus reducing both respondent burden and the costs associated with the survey. Broadly speaking, imputation methods fall into three groups:

the imputed value is derived from other information supplied by the unit;
values supplied by other units are used to derive a value for the non-respondent (e.g., an average);
an exact value from another unit (called the donor) is used as the value for the non-respondent (called the recipient).
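The three groups of imputation methods can be sketched in a few lines. The records, field names and matching rule (donor from the same region) below are hypothetical, chosen only to show one simple instance of each group.

```python
import statistics

# Hypothetical records; None marks item non-response.
records = [
    {"id": 1, "region": "north", "apple_trees": 200, "pear_trees": 100, "total_trees": 300},
    {"id": 2, "region": "north", "apple_trees": 150, "pear_trees": 250, "total_trees": None},
    {"id": 3, "region": "south", "apple_trees": None, "pear_trees": None, "total_trees": None},
    {"id": 4, "region": "south", "apple_trees": 300, "pear_trees": 200, "total_trees": 500},
]

def impute_from_own_data(rec):
    """Group 1: derive the value from other items the unit supplied."""
    if rec["total_trees"] is None and None not in (rec["apple_trees"], rec["pear_trees"]):
        return rec["apple_trees"] + rec["pear_trees"]
    return rec["total_trees"]

def impute_from_mean(all_records):
    """Group 2: derive a value (here the mean) from other units' responses."""
    reported = [r["total_trees"] for r in all_records if r["total_trees"] is not None]
    return statistics.mean(reported)

def impute_from_donor(rec, all_records):
    """Group 3: copy the exact value of a donor (first responder in the same region)."""
    for r in all_records:
        if r["region"] == rec["region"] and r["total_trees"] is not None:
            return r["total_trees"]
    return None  # no suitable donor found

print(impute_from_own_data(records[1]))        # record 2: sum of its components
print(impute_from_mean(records))               # mean of reported totals
print(impute_from_donor(records[2], records))  # record 3: value copied from record 4
```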

When deciding on the method of imputation, it is desirable to know what effect imputation will have on the final estimates. If a large amount of imputation is performed, the results can be misleading, particularly if the imputation used distorts the distribution of the data.
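One way imputation can distort the distribution is by shrinking its spread. The minimal sketch below uses hypothetical values and fills every missing response with the mean of the reported values: the mean is unchanged, but the standard deviation of the completed data is noticeably smaller than that of the reported data.

```python
import statistics

reported = [120, 250, 310, 480, 640]  # hypothetical reported values
n_missing = 5                         # hypothetical number of non-respondents

# Fill every missing value with the mean of the reported values.
mean_value = statistics.mean(reported)
completed = reported + [mean_value] * n_missing

sd_reported = statistics.stdev(reported)
sd_completed = statistics.stdev(completed)

print(f"mean before and after: {mean_value} / {statistics.mean(completed)}")
print(f"standard deviation, reported only:    {sd_reported:.1f}")
print(f"standard deviation, after imputation: {sd_completed:.1f}")
```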

If at the planning stage it is believed that there is likely to be a high non-response rate, then the sample size could be increased to allow for this. However, the non-response bias will not be overcome by just increasing the sample size, particularly if the non-responding units have different characteristics from the responding units. Post-stratification and imputation also fail to totally eliminate non-response bias from the results.
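Allowing for expected non-response at the planning stage amounts to dividing the number of completed responses required by the anticipated response rate. The function name and the figures below are illustrative assumptions; as noted above, this protects the sample size but does not remove non-response bias.

```python
import math

def initial_sample_size(target_responses, expected_response_rate):
    """Number of units to approach so that, at the expected response rate,
    roughly target_responses completed questionnaires are returned."""
    if not 0 < expected_response_rate <= 1:
        raise ValueError("expected_response_rate must be in (0, 1]")
    return math.ceil(target_responses / expected_response_rate)

# Hypothetical plan: 1000 completed responses wanted, 40% expected to respond.
print(initial_sample_size(1000, 0.4))
```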

Example: Effect of Non-Response

Suppose a postal survey of 3421 fruit growers was run to estimate the average number of fruit trees on a farm. There was an initial period for response and following low response rates, two series of follow-up reminders were sent out. The response and results were as follows:

Stage	Responses	Average no. of trees
Initial response	300	456
Added after 1 follow-up reminder	543	382
Added after 2 follow-up reminders	434	340
Total response	1277

After two follow-up reminders, the response rate was still only 37%. From other information, it was known that the true overall average was 329. The results based on this survey would have been:

Stage	Cumulative responses	Combined average
Initial response	300	456
After 1 follow-up reminder	843	408
After 2 follow-up reminders	1277	385

If results had been published without any follow-up, the average number of trees would have been too high, as farms with a greater number of trees appeared to have responded more readily. With follow-up, smaller farms sent back survey forms and the estimate moved closer to the true value.
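The cumulative figures in this example can be reproduced from the per-wave responses and averages. The sketch below uses only the numbers given in the example; the variable names are illustrative, and because the wave averages are themselves rounded, the combined averages are approximate.

```python
total_mailed = 3421

# (stage, responses added at this stage, average no. of trees for those responses)
waves = [
    ("Initial response", 300, 456),
    ("After 1 follow-up reminder", 543, 382),
    ("After 2 follow-up reminders", 434, 340),
]

cumulative_n = 0
cumulative_trees = 0
rows = []
for label, n, avg in waves:
    cumulative_n += n
    cumulative_trees += n * avg  # total trees reported so far
    combined_avg = round(cumulative_trees / cumulative_n)
    rows.append((label, cumulative_n, combined_avg))

response_rate = cumulative_n / total_mailed

for label, n, avg in rows:
    print(f"{label}: {n} responses, combined average {avg}")
print(f"final response rate: {response_rate:.0%}")
```

Each wave of follow-up brings in farms with fewer trees on average, so the combined average moves steadily towards the known value of 329 without reaching it.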

Media Attributions

1. Chumney, 2016, Introduction to error sources in survey research: sampling error, 14 August, online video, viewed 4 April 2022, <https://www.youtube.com/watch?v=XE7QDfdaQ68>.
2. Chumney, 2016, Introduction to error sources in survey research: measurement errors, 14 August, online video, viewed 4 April 2022, <https://www.youtube.com/watch?v=zF37RvnNHnk>.

Licence

Creative Commons Attribution-NonCommercial 4.0 International License