Patterns in Data

 

It is estimated that about 5 to 10 thousand stars are visible from the earth with the naked eye. A small fraction of those stars form the constellations we grow up searching for in the night sky. While drawing lines between several stars to form shapes such as the Big Dipper or a zodiac sign helps in locating certain patterns in the sky, all the stars taken as a whole form just one giant blob of light.

 

If stars were the plotted points in a data set, we might conclude that there is no pattern. But how exactly can you measure the strength of a particular observed pattern? Say, for example, that you instead plot the mass of stars against their energy output, or luminosity. You might observe that as the mass increases, so does the luminosity. You could then run a simple linear regression to obtain an equation that describes the relationship between mass and luminosity.

 

Alongside this regression model, the correlation describes the strength of the relationship between the two variables. While descriptive statistics and plots can be helpful, statistical measures such as these help us determine whether patterns in a data set exist beyond what we can see in a graph.

 

Correlation Definition

 

Correlation is a technique used in statistics to measure whether, and how strongly, two variables are related. Examples of correlation appear all the time in the real world, dealing with everything from the possible relationship between weight gain and level of work activity to violence in video games and violence in life.

 

There are a couple of common correlation techniques, but it is worth mentioning that correlation can only be calculated between two quantitative, or numerical, variables. Correlation cannot be calculated, for example, between the names of birth months or brand names.

 

The most common correlation technique is product-moment, or Pearson, correlation. Below, you will find the definition, notation and description of Pearson’s correlation coefficient.

 

| Type | Range | Description | Notation |
| --- | --- | --- | --- |
| Pearson or product-moment correlation | A statistic from -1 to 1 | Measures the linear relationship between two variables | \( r_{xy} \) |

 

While there are many programs that will calculate correlation coefficients, even including some online calculators, you can also calculate the correlation coefficient yourself. The formula for Pearson’s correlation can be found below.

 

    \[ r_{xy} = \frac{\sum_{i=1}^n (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i=1}^n (x_{i}-\bar{x})^2 \sum_{i=1}^n (y_{i}-\bar{y})^2}} \]

 

While this may look complicated at first, it is easier to understand when broken down into steps.

 

| Step | Description | Formula |
| --- | --- | --- |
| 1 | Find the mean of x and the mean of y | \( \bar{x} \) and \( \bar{y} \) |
| 2 | Subtract the mean of x from every value of x, and the mean of y from every value of y | \( (x_{i} - \bar{x}) \) and \( (y_{i} - \bar{y}) \) |
| 3 | For each observation, multiply its two subtracted values together, then take the sum of these products | \( \sum_{i=1}^n (x_{i} - \bar{x})(y_{i} - \bar{y}) \) |
| 4 | Square every subtracted value, then take the sum of the squares for x and the sum of the squares for y | \( \sum_{i=1}^n (x_{i}-\bar{x})^2 \) and \( \sum_{i=1}^n (y_{i}-\bar{y})^2 \) |
| 5 | Divide the sum of products from step 3 by the square root of the product of the two sums of squares from step 4 | \( \frac{\sum_{i=1}^n (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i=1}^n (x_{i}-\bar{x})^2 \sum_{i=1}^n (y_{i}-\bar{y})^2}} \) |
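
The five steps above can be sketched in Python as follows. This is only an illustrative sketch: the function name `pearson_r` and the small sample lists are made up for the example and are not part of the article's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 values")

    # Step 1: find the mean of x and the mean of y
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Step 2: subtract the mean from every value
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]

    # Step 3: multiply the paired deviations and sum the products
    sum_products = sum(a * b for a, b in zip(dx, dy))

    # Step 4: sum the squared deviations of x and of y
    sum_sq_x = sum(a * a for a in dx)
    sum_sq_y = sum(b * b for b in dy)

    # Step 5: divide by the square root of the product of the sums of squares
    # (this raises ZeroDivisionError if either list is constant)
    return sum_products / math.sqrt(sum_sq_x * sum_sq_y)

# Made-up illustrative data, not taken from the article
example_x = [1, 2, 3, 4, 5]
example_y = [2, 4, 5, 4, 5]
print(round(pearson_r(example_x, example_y), 2))  # 0.77
```

Note that the coefficient is undefined when one of the variables never changes, because the denominator is then zero.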

 

What Correlation Is and Is Not

Now that you understand a bit more about the definition and calculation of the correlation coefficient, it’s important to understand what correlation actually tells us. Take a look at the chart below, which graphs the hand measurements and corresponding heights of different people.

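[Figure: scatterplot of height versus hand size]

| Person | Hand | Height |
| --- | --- | --- |
| A | 10 | 140 |
| B | 12 | 142 |
| C | 15 | 150 |
| D | 19 | 169 |
| E | 20 | 171 |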

 

Using the observations shown in the graph, we can calculate the correlation coefficient and get a near-perfect value of about 0.98. This is a good illustration of the classic saying, “correlation does not equal causation.”

 

While there is a strong relationship between hand size and height, this could be for any number of reasons. If hands caused tallness, what would happen if someone lost their hands? Would their height shorten? Of course not, which is why correlation is only a measure of the strength of an association between two variables.

 

Properties of Correlation Coefficient

Let’s take a look at some more properties of the correlation coefficient. The correlation coefficient can be any number between -1 and 1. Take a look at the table below for a clearer idea as to what these different degrees mean.

 

| Correlation Coefficient | Interpretation |
| --- | --- |
| \( r_{xy} = 1 \) | Indicates a perfect, positive correlation. Meaning, as x increases, so does y. As x decreases, y also decreases. |
| \( r_{xy} = 0 \) | Indicates that there is no linear relationship between x and y. |
| \( r_{xy} = -1 \) | Indicates a perfect, negative correlation. Meaning, as x increases, y decreases. As x decreases, y increases. |

 

Getting exactly one of these three results with real data is highly unlikely. However, they are good markers for the degree and the direction of the relationship between two variables.
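
The three marker values can be reproduced with small, made-up data sets. The sketch below uses a hypothetical helper `pearson_r` that follows the formula given earlier; the three sets of y values are constructed for illustration only. The third case is worth a look: y depends entirely on x, yet the coefficient is 0, because Pearson's correlation only measures the linear relationship between two variables.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation, following the formula given earlier."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

xs = [-2, -1, 0, 1, 2]  # made-up x values

r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # points on a rising line
r_down = pearson_r(xs, [-3 * x + 4 for x in xs])  # points on a falling line
r_curve = pearson_r(xs, [x * x for x in xs])      # points on a symmetric curve

print(r_up, r_down, r_curve)  # 1.0 -1.0 0.0
```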

 

Practice Problem

Let’s take a deeper dive into correlation by going through a practice problem step by step. First, we take two variables like the ones listed in the table below.

 

| Observation | Happiness Score | Work Hours |
| --- | --- | --- |
| 1 | 89 | 30 |
| 2 | 90 | 35 |
| 3 | 54 | 40 |
| 4 | 60 | 35 |
| 5 | 73 | 40 |
| 6 | 40 | 70 |

 

Here, we have a fictional data set that looks at happiness scores out of 100 and the number of hours each individual works in a week. The first step in determining correlation is plotting the data set. This can give us an idea of whether or not the variables are related, because perfectly correlated points would all fall on a single straight line.

[Figure: scatterplot of work hours versus happiness score with regression line]

Now, let’s calculate the correlation coefficient of these two variables. We start by calculating the mean and move through the steps.

 

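| Observation | Happiness Score (x) | Work Hours (y) | \( e = x_{i}-\bar{x} \) | \( f = y_{i}-\bar{y} \) | \( e*f \) | \( e^2 \) | \( f^2 \) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 89 | 30 | 21.3 | -11.7 | -248.9 | 455.1 | 136.1 |
| 2 | 90 | 35 | 22.3 | -6.7 | -148.9 | 498.8 | 44.4 |
| 3 | 54 | 40 | -13.7 | -1.7 | 22.8 | 186.8 | 2.8 |
| 4 | 60 | 35 | -7.7 | -6.7 | 51.1 | 58.8 | 44.4 |
| 5 | 73 | 40 | 5.3 | -1.7 | -8.9 | 28.4 | 2.8 |
| 6 | 40 | 70 | -27.7 | 28.3 | -783.9 | 765.4 | 802.8 |
| Average | 67.7 | 41.7 | | | | | |
| Total | | | | | -1116.7 | 1993.3 | 1033.3 |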

 

Plugging this into the formula, we get:

    \[ r_{xy} = \frac{-1116.7}{\sqrt{1993.3*1033.3}} = -0.78 \]
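
The totals and the final coefficient can be double-checked with a short script. This is a sketch using the six observations from the practice problem; the variable names `e` and `f` follow the column labels of the calculation table, and the remaining names are chosen for illustration.

```python
import math

# Practice problem data: x is the happiness score, y is the weekly work hours
happiness = [89, 90, 54, 60, 73, 40]
work_hours = [30, 35, 40, 35, 40, 70]

n = len(happiness)
mean_x = sum(happiness) / n   # about 67.7
mean_y = sum(work_hours) / n  # about 41.7

# Deviations from the mean, matching the e and f columns
e = [x - mean_x for x in happiness]
f = [y - mean_y for y in work_hours]

sum_ef = sum(a * b for a, b in zip(e, f))  # total of the e*f column
sum_e2 = sum(a * a for a in e)             # total of the e^2 column
sum_f2 = sum(b * b for b in f)             # total of the f^2 column

r = sum_ef / math.sqrt(sum_e2 * sum_f2)

print(round(sum_ef, 1), round(sum_e2, 1), round(sum_f2, 1))  # -1116.7 1993.3 1033.3
print(round(r, 2))  # -0.78
```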

[Figure: scatterplot of work hours versus happiness score]

The graph, along with the correlation coefficient, tells us that there is a strong negative relationship between happiness scores and work hours. That is, as work hours go up, happiness scores tend to go down, and vice versa.

 

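If we wanted to use this relationship to make predictions, we would fit a regression line. Taking x as the happiness score and y as the work hours, as in the calculation table, we would have to find the following.

| Quantity | Calculation |
| --- | --- |
| Y standard deviation | \( \sqrt{\frac{1033.3}{5}} = 14.38 \) |
| X standard deviation | \( \sqrt{\frac{1993.3}{5}} = 19.97 \) |
| Slope of regression line | \( -0.78*(\frac{14.38}{19.97}) = -0.56 \) |
| y-intercept | \( 41.7 - (-0.56*67.7) \approx 79.57 \) |
| Regression line | \( y = 79.57 - 0.56x \) |

The slope is the correlation coefficient multiplied by the Y standard deviation divided by the X standard deviation, and the y-intercept is the mean of y minus the slope times the mean of x. The intercept of 79.57 comes from the unrounded slope and means, so the rounded figures shown do not multiply out exactly.

Because x is the happiness score and y is the work hours, this line predicts work hours from a happiness score. Predicting the work hours for someone with a happiness score of 20, we get,

    \[ y = 79.57 - 0.56*(20) = 68.4 \]

About 68.4 work hours per week.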
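
The regression figures can be reproduced with a few lines of Python. This is a sketch that takes x as the happiness score and y as the work hours, as in the calculation table, and uses the standard least-squares formulas: the slope is r multiplied by the y standard deviation over the x standard deviation, and the intercept is the mean of y minus the slope times the mean of x. The function name `predict` is hypothetical.

```python
import math

happiness = [89, 90, 54, 60, 73, 40]   # x values
work_hours = [30, 35, 40, 35, 40, 70]  # y values

n = len(happiness)
mean_x = sum(happiness) / n
mean_y = sum(work_hours) / n

sum_e2 = sum((x - mean_x) ** 2 for x in happiness)
sum_f2 = sum((y - mean_y) ** 2 for y in work_hours)
sum_ef = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(happiness, work_hours))

# Sample standard deviations (divide by n - 1 = 5, as in the article)
sd_x = math.sqrt(sum_e2 / (n - 1))  # about 19.97
sd_y = math.sqrt(sum_f2 / (n - 1))  # about 14.38

r = sum_ef / math.sqrt(sum_e2 * sum_f2)

slope = r * sd_y / sd_x               # about -0.56
intercept = mean_y - slope * mean_x   # about 79.57

def predict(x):
    """Evaluate the fitted line y = intercept + slope * x."""
    return intercept + slope * x

print(round(slope, 2), round(intercept, 2))  # -0.56 79.57
print(round(predict(20), 1))                 # 68.4
```

Working with unrounded intermediate values, as here, is what yields the intercept of 79.57.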

[Figure: scatterplot showing the prediction]

We just used our model to make a prediction from a value that lies outside the range of our original data set. We could also pick a value within that range and see what our model predicts. These two kinds of prediction have special names in statistics and are summarized below.

 

| Term | Definition | Example |
| --- | --- | --- |
| Interpolation | Predictions for values within the range of our dataset | A happiness score of 60 |
| Extrapolation | Predictions for values outside the range of our dataset | A happiness score of 100 |
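
One simple way to tell the two apart is to compare the input value with the smallest and largest observed values. The sketch below does this for the two examples in the table; the function name `prediction_type` is hypothetical and the range check is an illustrative rule of thumb.

```python
# Observed happiness scores from the practice problem
observed_happiness = [89, 90, 54, 60, 73, 40]

def prediction_type(value, observed):
    """Label a prediction input by whether it lies inside the observed range."""
    if min(observed) <= value <= max(observed):
        return "interpolation"
    return "extrapolation"

print(prediction_type(60, observed_happiness))   # interpolation
print(prediction_type(100, observed_happiness))  # extrapolation
```

Extrapolated predictions deserve extra caution, since the data give no evidence that the linear pattern continues outside the observed range.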
 

Danica

Located in Prague and studying to become a Statistician, I enjoy reading, writing, and exploring new places.