Introduction

Data is becoming increasingly important in our daily lives, and the amount of data is increasing exponentially. Recording and storing data has been greatly facilitated by advancements in technology. Data also varies widely in type: sales data (what, how much, how many), human data (who, commercial and social profiling), survey data (customer feedback and satisfaction), health data (GPs, medical centres and hospitals), and data from everyday transactions of various types, including traffic, social services and environmental data.

Statistics and data analytics are now a common feature of most human activity. This ranges from simply profiling personal expenditure to targeting particular demographics for a particular health service, such as routine mammograms. Users of data can benefit immensely from the correct use of statistics. The survival of many large corporations depends critically on obtaining and analysing data to extract important information, patterns and forecasts. Obtaining correct relationships in science and medicine is critical for making correct decisions that may affect lives, and these depend on the correct analysis and interpretation of data.

Statistics is the science of the collection, presentation, analysis, interpretation and explanation of data. Data contains information, and statistics extracts that information. Such information can be used to formulate procedures and policies and to make decisions that affect human lives and the life of the planet.

Statistics is not mathematics, but requires some mathematics to develop methods and ideas. It covers a set of methods and techniques for extracting information from data. Some concepts are common to several techniques.

Some areas of application of statistics are briefly discussed below.

  • Stock market data analysis. Investors are always looking for the edge in trading decisions that optimise market return. Many financial institutions are investigating algorithmic trading to protect against a fast-changing market.
  • Clinical trials in medicine to investigate the efficacy of a treatment.
  • The effect of a teaching strategy on the performance of students.
  • Determining the best cultivar and cultivation methods for higher yield in agriculture.
  • The effect of social media on opinions of people on various topics, such as immigration, house prices and interest rate increases.
  • The attitudes of the young generation on relationships and marriage.

Examples

The following examples highlight the use of probability and statistical methods.

Live Sports Betting

Live sports betting is based on placing and changing bets while a live sports event is in progress. A player can change the bets as the game progresses. Some systems allow several bets to be placed simultaneously, or over time. Some also allow the player to cash out a bet during the game if certain events have occurred. Others offer a no-loss condition: the stake is refunded if, at any time during the game, the team that was bet on was ahead. Several other systems are available.

The systems need accurate probability calculations based on data available for the teams so that the bookmaker will not lose in the long run. In addition, the regulatory institutions need to ensure that the bets are fair and not too biased in favour of the bookmaker. Similarly, the bookmaker needs to ensure that the bets are attractive to the players. All this requires probability and statistics.
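As a rough illustration of how odds relate to probabilities, the R sketch below converts decimal odds into implied probabilities and measures the bookmaker's margin. The odds are made-up values for a hypothetical three-outcome match, and the variable names (odds, implied, overround, fair) are chosen here for illustration only. Real bookmakers may set and adjust odds in more elaborate ways.

```r
# Hypothetical decimal odds for a three-outcome match (assumed, not real data)
odds <- c(home = 1.80, draw = 3.60, away = 4.50)

# The implied probability of each outcome is the reciprocal of its decimal odds
implied <- 1 / odds

# The implied probabilities sum to more than 1; the excess is the bookmaker's margin
overround <- sum(implied) - 1

# Rescaling the implied probabilities to sum to 1 removes the margin
fair <- implied / sum(implied)

print(round(implied, 3))
print(round(overround, 3))
print(round(fair, 3))
```

A larger margin favours the bookmaker more strongly, which is the kind of quantity a regulator might want to monitor.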

Questions

  1. How can a sports betting company be sure of making a profit if it also guarantees a refund if the team bet on is ahead at any stage of the game?
  2. What data would a company require to set the odds for a game?
  3. If a company allows twenty bets on a card, what is the probability that all the bets are paid out? What is the company's expected loss if this happens?
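One possible way to start on Question 3 is sketched below in R. It rests on strong assumptions that are not part of the question: the twenty bets are independent, each wins with the same probability p_win, and each has the same stake and decimal odds. All the numbers are hypothetical.

```r
n_bets <- 20    # number of bets on the card (from the question)
p_win  <- 0.5   # assumed probability that any single bet wins
stake  <- 10    # assumed stake per bet, in dollars
odds   <- 1.9   # assumed decimal odds on every bet

# Under independence, the probability that all bets win is the product
# of the individual winning probabilities
p_all <- p_win^n_bets

# The same value from the binomial distribution: P(X = 20) for X ~ Bin(20, p_win)
p_all_binom <- dbinom(n_bets, size = n_bets, prob = p_win)

# Company's net loss if every bet is paid out:
# each winning bet returns stake * odds, of which the stake was collected earlier
loss_if_all <- n_bets * stake * (odds - 1)

print(p_all)
print(p_all_binom)
print(loss_if_all)
```

With these assumed values the probability is 0.5^20, which is less than one in a million. Bets on the same game are rarely independent, so this should be read as a first approximation only.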

Human data

Human data relates to individuals as well as populations and profiles their behaviour. Examples of human data are:

  • Where do people park in the city? And for how long?
  • Who shops at Myer city outlet? Age, sex, education, employment, income, marital status.
  • When did Joe Citizen shop at Woolworths, and what did he buy? For example, does he shop on a particular day, at a particular time? Does what he buys depend on the day and time? How much does he spend, and when does he spend it?
  • How do people who work in the city travel to work? Private vehicle, public transport, and does this depend on their address and type of work?

Most of these data arise from surveys or tacit data collection, such as from parking stations and loyalty cards.

Questions

  1. An important aspect is the ethics of human data collection and use. Who owns this data, and how can it be used? Can this data be sold to third parties? How is the privacy of individuals protected from abuse?
  2. How is the identity of the individuals protected? How is the security of the data protected?
  3. How is the consent of individuals obtained before such data is collected? Is consent important?

The Data Analysis Paradigm

Data analysis is, broadly speaking, a two-step process:

  1. Exploratory Data Analysis (EDA): graphs and summary statistics are used to get a “feel” for the data.
  2. Modelling and Inference:
     (a) A model for the data involves assumptions regarding the data generation mechanism. These assumptions need to be verified.
     (b) An inference is a conclusion that the patterns in the data, that is, the model, hold in some broader context.

A statistical inference is an inference justified by a probability model linking the data to the broader context through the model.

These two steps are sequential, complementary and iterative. That is, EDA precedes modelling and inference, but further EDA usually follows modelling in an attempt to uncover any special features of the data that have been revealed by the models. The data may be re-modelled following this, and the whole process may be repeated until a satisfactory model is obtained that explains the data and its behaviour.
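The cycle of EDA, modelling and further EDA can be sketched in R as follows. The example uses cars, a small dataset that ships with R (speed and stopping distance), and a straight-line model fitted with lm(). Both the dataset and the choice of model are assumptions made here for illustration, not a recommended analysis.

```r
# cars is a built-in R dataset: speed (mph) and stopping distance (ft)
data(cars)

# Step 1: exploratory data analysis with summary statistics and a graph
print(summary(cars))
plot(dist ~ speed, data = cars, main = "Stopping distance against speed")

# Step 2: fit a simple model (a straight line, assumed here for illustration)
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))

# Back to EDA: plot the residuals to look for features the model has missed
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

If the residual plot shows a clear pattern, such as curvature or increasing spread, the model's assumptions are in doubt and the data may need to be re-modelled.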

Statistical inference needs a probability model that links the data to the broader population. Data is usually from a sample. The population is the set from which the sample was drawn. For example, to determine the median income of Australians, I take a random sample of Australians and find the sample median. Then I relate this to the median income of all Australians.
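The idea of using a sample median to learn about a population median can be mimicked in R with simulated data. The "population" below is entirely artificial: incomes are drawn from a log-normal distribution with made-up parameters, and the sample size of 500 is an arbitrary choice. It is a sketch of the sampling idea only, not real income data.

```r
set.seed(1)  # fix the random number generator so the run is reproducible

# Artificial population of 100,000 incomes (assumed log-normal, made-up parameters)
population <- rlnorm(100000, meanlog = log(60000), sdlog = 0.6)

# Draw a simple random sample of 500 individuals without replacement
smp <- sample(population, size = 500)

# Compare the sample median with the population median
pop_median    <- median(population)
sample_median <- median(smp)

print(pop_median)
print(sample_median)
```

In practice the population median is unknown, which is why a probability model is needed to say how far the sample median is likely to be from it.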

The 1936 US Presidential election example in the next chapter illustrates the issues with this process.

For statistical inference, we need some requisite probability theory. We will cover the basic probability rules, ideas of random variables, expectation and variance. Some standard distributions are introduced, and their properties are discussed. Throughout we illustrate the concepts with examples.
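As a small preview of expectation and variance, the R sketch below computes both for a single roll of a fair six-sided die, directly from the definitions E(X) = sum of x p(x) and Var(X) = sum of (x - E(X))^2 p(x). The die is an example chosen here for illustration.

```r
x <- 1:6            # possible outcomes of one roll
p <- rep(1 / 6, 6)  # each outcome is equally likely for a fair die

# Expectation: probability-weighted average of the outcomes
ex <- sum(x * p)

# Variance: probability-weighted average squared deviation from the expectation
vx <- sum((x - ex)^2 * p)

print(ex)  # 3.5
print(vx)  # 35/12, approximately 2.917
```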

 

Software is an important aspect of probability and statistics. We use the free R statistical environment. The student should download the software and install it. We will cover all the required steps in using the software by screen captures as well as videos. Datasets used in the book are available from the Appendix.

Licence

Statistics: Meaning from data Copyright © 2024 by Dr Nazim Khan is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
