February 29, 2020
A Guide to Statistics
In previous sections, you learned about the concepts involved in descriptive statistics. Specifically, we covered the measures of central tendency and variability and how to calculate each. We also walked you through the types of variables used in statistics, as well as the types of analysis and visualizations you can produce from data. Here, we’ll help you review everything related to descriptive statistics.
What are Descriptive Statistics?
The field of statistics is generally divided into two branches: descriptive and inferential statistics. Descriptive statistics is, luckily, exactly what it sounds like: it involves analysing data on a descriptive basis. If this sounds confusing, let’s contrast it with inferential statistics in the table below.
|Descriptive Statistics||Inferential Statistics|
|Makes statements about what is within the data||Makes predictions about data points outside the data set by using the information within the data|
|Conveys information through measures like mean and standard deviation||Conveys information through predictive models|
|Visualizations generally include: ||Visualizations generally include: |
While this general information is by no means exhaustive, it can be a great starting point for understanding the differences between the two branches of statistics. The goal of descriptive statistics is to either summarize the characteristics of a data set or to analyse a data set by utilizing its descriptive properties.
The units used in descriptive statistics can be anything. People using descriptive statistics can strive to measure things like:
- Rainfall
- Trees in parks
- Tourists at a beach
The analysis that can be done using descriptive statistics alone is not only diverse; it also accounts for most of the statistics that many people use. The units being measured, however, need to be clearly defined in order to properly understand any data.
In statistics, the elements people want to study are split into a population and a sample. A population is the actual group of elements that you want to study. A population could be anything and take on any form. In the previous examples, the population would take the following form.
|Units||Population|
|Rainfall||Total rain produced|
|Trees in a park||All the trees in a park|
|Tourists at a beach||Total number of tourists at a beach|
While this may seem simple, and it is, populations are notoriously hard to measure. Surveying the total number of trees in a local city park might be an easy task, but imagine the same task applied to a national forest. Often, there are not enough financial resources or time to measure an entire population. That is why in statistics you’ll often encounter samples.
A sample is a part of a population, made up of the same kind of elements and units. A sample is drawn from a population in order to make the data collection process cheaper and more time efficient. Taking the previous examples, let’s take a look at the differences between a population and a sample.
|Population||Sample|
|Total rain produced||Rainfall produced in an hour in one location of a city|
|All the trees in a park||Number of trees measured in a one-kilometre radius|
|Total number of tourists at a beach||Number of tourists arriving at the beach at three specific times in a day|
As you can guess, samples tend to include a fraction of the elements that are included in a population. There are many different methods for drawing a sample, which include:
- Simple Random Sampling
- Stratified Sampling
- Cluster Sampling
- Quota Sampling
As you can imagine, each sampling method has its advantages and disadvantages. The sampling method that is desired in most cases is simple random sampling, also known as SRS.
This is because it involves a completely random selection of elements from a population, which helps avoid selection bias in the estimation of statistical measures. An SRS can be conducted with or without replacement.
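To make the difference between the two kinds of SRS concrete, here is a minimal sketch using Python’s standard `random` module. The population of 20 numbered trees, the sample size of 5 and the seed are all made up for illustration.

```python
import random

# Hypothetical population: ID numbers for 20 trees in a park.
population = list(range(1, 21))

rng = random.Random(42)  # fixed seed so the draw is reproducible

# Without replacement: each tree can be selected at most once.
srs_without = rng.sample(population, k=5)

# With replacement: the same tree may be selected more than once.
srs_with = rng.choices(population, k=5)

print("Without replacement:", srs_without)
print("With replacement:   ", srs_with)
```

Sampling without replacement can never return the same element twice, while sampling with replacement can, which is why the two are treated separately.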
Because the true population measure, or the measure we would have calculated had we measured the entire population, is unknown, measures calculated from samples are always treated as estimates of the corresponding population measures. A measure from a population is called a “parameter” while a measure from a sample is called a “statistic.”
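The sketch below illustrates the distinction with simulated data. It assumes a hypothetical population of 500 tree heights generated at random, so that, unlike in practice, the parameter can be computed directly and compared with the statistic from a sample of 50.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the example is reproducible

# Hypothetical population: heights (in metres) of all 500 trees in a park.
population = [rng.gauss(12.0, 3.0) for _ in range(500)]

# Parameter: calculated from every element of the population.
population_mean = statistics.mean(population)

# Statistic: calculated from a simple random sample of 50 trees.
sample = rng.sample(population, k=50)
sample_mean = statistics.mean(sample)

print(f"Population mean (parameter): {population_mean:.2f}")
print(f"Sample mean (statistic):     {sample_mean:.2f}")
```

The two values will usually be close but not identical, which is what it means for the statistic to be an estimate of the parameter.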
Measures of Central Tendency
“Measures of central tendency” is a long name for something simple: measuring the centre. People like to measure the centre point of a data set because it generally indicates what the most “typical” value of the data looks like.
There are three basic measures of central tendency: the mean, median and mode. Some rules of thumb for remembering when each of them is used are:
- When the data includes extreme values or outliers, the median is better
- When the data doesn’t include outliers and you want to measure the average, use the mean
- When you want to know the value or category with the highest frequency, use the mode
Below are the formulas for each measure.
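|Measure||Sample||Population|
|Mean||$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$||$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$|
|Median||Midpoint of the ordered data points, or the average of the two middle values if there is an even number of values||Calculated the same as the sample|
|Mode||The value or category with the highest frequency||Calculated the same as the sample|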
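As a quick illustration of the three measures and of the rule of thumb about outliers, here is a small sketch using Python’s standard `statistics` module. The tourist counts are invented for the example.

```python
import statistics

# Hypothetical data: number of tourists counted at a beach on seven days.
counts = [120, 135, 135, 150, 160, 170, 180]

mean_value = statistics.mean(counts)      # sum of the values divided by how many there are
median_value = statistics.median(counts)  # middle value of the ordered data
mode_value = statistics.mode(counts)      # most frequent value

# Replace the largest value with an extreme one to see how each measure reacts.
with_outlier = counts[:-1] + [1200]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print("Mean:", mean_value, "Median:", median_value, "Mode:", mode_value)
print("With outlier -> Mean:", round(mean_outlier, 1), "Median:", median_outlier)
```

With these made-up counts, the mean and median are both 150 and the mode is 135. After the outlier is introduced, the mean jumps to roughly 295.7 while the median stays at 150, which is why the median is preferred when extreme values are present.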
Measures of Variability
Unlike measures of central tendency, measures of variability strive to capture how the data are spread around the centre values. The two most basic measures of variability are variance and standard deviation. Other common measures include:
- Coefficient of Variation
- Standard Error
The spread of a data set is how closely or how far apart the data lie around the centre. While variance is used throughout statistics, standard deviation tends to be preferred when describing the spread of a data set because it is expressed in the same units as the data, which makes it easier to interpret.
Below you’ll find the formulas for standard deviation and variance for populations and samples.
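|Measure||Sample||Population|
|Variance||$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$||$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$|
|Standard Deviation||$s = \sqrt{s^2}$||$\sigma = \sqrt{\sigma^2}$|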
Notice that the standard deviation is simply the square root of the variance.
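Here is a minimal sketch of these measures using Python’s standard `statistics` module. The rainfall values are made up, and the coefficient of variation and standard error are calculated with their usual textbook definitions (standard deviation divided by the mean, and standard deviation divided by the square root of the sample size), which this guide does not spell out.

```python
import math
import statistics

# Hypothetical data: rainfall (in mm) recorded on five days.
rainfall = [2.0, 4.0, 4.0, 5.0, 10.0]

# Treating the data as an entire population: divide by N.
pop_variance = statistics.pvariance(rainfall)
pop_std = statistics.pstdev(rainfall)

# Treating the data as a sample: divide by n - 1.
sample_variance = statistics.variance(rainfall)
sample_std = statistics.stdev(rainfall)

# Coefficient of variation: spread relative to the mean (unitless).
cv = sample_std / statistics.mean(rainfall)

# Standard error of the mean: sample standard deviation over sqrt(n).
standard_error = sample_std / math.sqrt(len(rainfall))

print("Population variance:", pop_variance, "std:", round(pop_std, 3))
print("Sample variance:", sample_variance, "std:", sample_std)
print("CV:", cv, "Standard error:", round(standard_error, 3))
```

For these values the mean is 5 and the squared deviations sum to 36, so the population variance is 36 / 5 = 7.2 and the sample variance is 36 / 4 = 9, giving a sample standard deviation of 3.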
Notation of Measures of Central Tendency and Variability
As you may have noticed, the measures for the population and sample have different notations. This notation is standardized throughout the statistical world, meaning you will encounter it everywhere from your textbooks to computer programs. Below, we’ve summarized the notations of the mean, standard deviation and variance.
|Measure||Sample||Population|
|Mean||$\bar{x}$||$\mu$|
|Standard Deviation||$s$||$\sigma$|
|Variance||$s^2$||$\sigma^2$|
Types of Variables
There are many variable types, each used in different statistical analyses. The most common distinction is made between two types of variables: qualitative and quantitative variables, also known as categorical and numerical variables.
Qualitative variables are those that involve categories. They are called qualitative because they describe a variable’s characteristics, or qualities. These include variables like:
Quantitative variables, on the other hand, are variables that measure quantities of something. These include variables like:
Quantitative and qualitative variables can be further broken down into sub-groups. Below you’ll find a summary.
|Type||Description|
|Data||A collection of observations, measurements or ideas on specific variables|
|Quantitative||Numeric information about a place, person or thing|
|Qualitative||Descriptive information about a place, person or thing|
|Ordinal||Ordered based on a specific scale|
|Nominal||Not ordered on a scale|
Data visualization, the practice of displaying information visually, is an integral part of descriptive statistics. The most common visualizations in descriptive statistics include:
- Bar charts
- Pie charts
- Line graphs