
The linear correlation coefficient is one of the fundamental concepts behind the interpretation of regression models. To understand the mathematics and ideas behind the correlation coefficient, try to solve the following problem using what you already know. If you're encountering this concept for the first time, read through this guide for a **step-by-step** walk-through.

## Problem 3

You are interested in the relationship between weather and tourism levels. To investigate, you collect data from the tourist centre of a city over one month in the summer, counting the number of people who arrive at the square at the same time every day. Given the data set below, what is the **correlation** between temperature and tourism? Interpret the correlation and name a few other reasons why these two variables may or may not be related.

| Temperature | Number of Visitors |
|---|---|
| 12 | 87 |
| 21 | 150 |
| 20 | 110 |
| 25 | 90 |
| 17 | 85 |
| 15 | 70 |
| 13 | 90 |

## What is the Correlation Coefficient?

The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient, is one of the most widely used statistics in the field. Be careful not to confuse it with the **coefficient of determination**, which is also known as the "R squared" value. The correlation coefficient is a statistic that measures the strength of the linear relationship between two variables. A **linear relationship** between two variables means that when the two variables are graphed against each other, the points follow a straight line. In other words, an increase or decrease in one variable is matched by a corresponding increase or decrease in the other.

You should also be careful not to confuse correlation with causation. Simply because two variables exhibit a strong linear correlation doesn't mean one causes the other. A classic example is the strong linear relationship between shark attacks and ice cream sales. As ice cream sales increase, there is a **corresponding increase** in shark attacks as well. This does not mean that an increase in ice cream sales causes an increase in shark attacks. Correlation simply signals a relationship between two variables. However, those two variables might have an underlying, common relationship to a third variable which explains why they are related in the first place. In this example, ice cream sales and shark attacks can exhibit a **strong relationship** because of hot weather: the hotter it is, the more people buy ice cream and swim in the ocean.

## Derivation of Formula

The formula for the correlation coefficient is the following:

\[ \rho_{xy} = \frac{\mathrm{Cov}(x,y)}{\sigma_{x} \sigma_{y}} \]

While this **formula** may seem confusing at first, it is actually quite simple to understand when breaking down each element of the formula.

| Symbol | Meaning |
|---|---|
| \( \rho_{xy} \) | Pearson product-moment correlation |
| \( \mathrm{Cov}(x,y) \) | Covariance between x and y |
| \( \sigma_{x} \) | Standard deviation of x |
| \( \sigma_{y} \) | Standard deviation of y |

Let's take the first element, which is the covariance. The covariance of two variables measures the direction of the relationship between them. In other words, the **covariance** measures how two variables move together. Next, let's look at the two elements in the denominator of the correlation coefficient. The standard deviation is a statistic that measures how far a variable's values are spread from its mean. The **formulas** for all three elements can be seen below.

\[ \mathrm{Cov}(x,y) = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1} \]

\[ \sigma_{x} = \sqrt{\frac{\sum(x_{i}-\bar{x})^2}{n-1}} \]

\[ \sigma_{y} = \sqrt{\frac{\sum(y_{i}-\bar{y})^2}{n-1}} \]
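To make these three building blocks concrete, here is a small Python sketch that computes the sample covariance and the two standard deviations for the temperature and visitor data from the problem above (the function names are only illustrative):

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance: summed products of deviations, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

def std_dev(x):
    """Sample standard deviation: root of the average squared deviation."""
    n = len(x)
    mean_x = sum(x) / n
    return sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))

# Temperature / visitor data from the problem above
temp = [12, 21, 20, 25, 17, 15, 13]
visitors = [87, 150, 110, 90, 85, 70, 90]

# The correlation is the covariance divided by the product of the standard deviations
r = covariance(temp, visitors) / (std_dev(temp) * std_dev(visitors))
```

Dividing the covariance by the product of the standard deviations rescales it so that the result always lies between -1 and 1, which is what makes the correlation coefficient comparable across data sets.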

As you can see, these three elements are what go into deriving the correlation formula. In the **numerator**, you have the measure of the direction of the relationship between the two variables. This relationship can be either positive or negative. If, for example, the relationship is positive, a decrease in one variable corresponds to a decrease in the other variable, and vice versa. On the other hand, a negative covariance means that a decrease in one variable corresponds to an increase in the other, and again vice versa. The **denominator** is the product of the standard deviations of the two variables. The standard deviation of a variable is a measure of dispersion: it measures the spread of a variable around its mean.

To derive the correlation coefficient formula, you first plug the three elements into the correlation coefficient formula:

\[ \rho_{xy} = \frac{\dfrac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1}}{\sqrt{\dfrac{\sum(x_{i}-\bar{x})^2}{n-1}} \cdot \sqrt{\dfrac{\sum(y_{i}-\bar{y})^2}{n-1}}} \]

Recall that in mathematics, the square root of a fraction is simply the **square root** of the numerator divided by the square root of the denominator. This means that the expression becomes:

\[ \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1} \div \left( \frac{\sqrt{\sum(x_{i}-\bar{x})^2}}{\sqrt{n-1}} \cdot \frac{\sqrt{\sum(y_{i}-\bar{y})^2}}{\sqrt{n-1}} \right) \]
Recall that a square root multiplied by itself gives back the original number: for example, \( \sqrt{3} \times \sqrt{3} = 3 \). Also, keep in mind that when multiplying fractions, they become one fraction whose numerator is the product of the two numerators and whose denominator is the product of the two denominators: for example, \( \frac{1}{3} \times \frac{1}{4} = \frac{1 \times 1}{3 \times 4} = \frac{1}{12} \).
Putting these two **characteristics** together, we can see that the denominator of the correlation coefficient formula becomes the following:

\[ \frac{\sqrt{\sum(x_{i}-\bar{x})^2}}{\sqrt{n-1}} \cdot \frac{\sqrt{\sum(y_{i}-\bar{y})^2}}{\sqrt{n-1}} = \frac{\sqrt{\sum(x_{i}-\bar{x})^2} \cdot \sqrt{\sum(y_{i}-\bar{y})^2}}{\sqrt{n-1} \cdot \sqrt{n-1}} = \frac{\sqrt{\sum(x_{i}-\bar{x})^2} \cdot \sqrt{\sum(y_{i}-\bar{y})^2}}{n-1} \]

When dividing the numerator by this expression, remember that a fraction divided by a fraction is the **same thing** as the first fraction multiplied by the inverse of the second. Taking the same example from above, one-third divided by one-fourth is the same thing as one-third multiplied by four over one: \( \frac{1}{3} \div \frac{1}{4} = \frac{1}{3} \times \frac{4}{1} = \frac{4}{3} \).
\[ \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1} \times \frac{n-1}{\sqrt{\sum(x_{i}-\bar{x})^2} \cdot \sqrt{\sum(y_{i}-\bar{y})^2}} \]

Cancelling out the \( n-1 \) in the denominator against the \( n-1 \) in the numerator, as they are both the same factor, gives

\[ r_{xy} = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum(x_{i}-\bar{x})^2} \cdot \sqrt{\sum(y_{i}-\bar{y})^2}} \]

Expanding each sum of deviations in terms of the raw totals \( \sum x \), \( \sum y \), \( \sum xy \), \( \sum x^2 \) and \( \sum y^2 \), then simplifying both the numerator and denominator, we get:

\[ \frac{n \sum xy - \sum x \sum y}{n} \times \frac{n}{\sqrt{(n \sum x^2 - (\sum x)^2)(n \sum y^2 - (\sum y)^2)}} \]

\[ r_{xy} = \frac{n \sum xy - \sum x \sum y}{\sqrt{(n \sum x^2 - (\sum x)^2)(n \sum y^2 - (\sum y)^2)}} \]
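The computational form above is convenient in code because it needs only the raw totals and a single pass over the data. Here is a minimal Python sketch (the function name is illustrative):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r via the computational formula, using only raw totals."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator
```

For the problem's temperature and visitor data this returns roughly 0.45, the same value the covariance form of the formula produces, since the two forms are algebraically identical.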

## Interpretation of Correlation Coefficient

The **interpretation** of the correlation coefficient is quite simple and can be summarized by the table below.

| Value | Direction | Strength | Interpretation |
|---|---|---|---|
| -1 | Negative | Very strong | Perfect negative correlation |
| -0.3 | Negative | Weak | Weak negative correlation |
| 0 | None | None | No correlation |
| 0.3 | Positive | Weak | Weak positive correlation |
| 1 | Positive | Very strong | Perfect positive correlation |
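The table can be turned into a small helper that maps a coefficient to a rough verbal label. The cut-off points used below (0.3 and 0.7) are common rules of thumb rather than a fixed standard, so treat them as an assumption:

```python
def interpret_r(r):
    """Map a correlation coefficient to a rough verbal label.
    The 0.3 / 0.7 cut-offs are conventional rules of thumb, not a standard."""
    if not -1 <= r <= 1:
        raise ValueError("correlation must lie between -1 and 1")
    size = abs(r)
    if size == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    if size == 1:
        return f"perfect {direction} correlation"
    if size < 0.3:
        return f"very weak {direction} correlation"
    if size < 0.7:
        return f"weak {direction} correlation"
    return f"strong {direction} correlation"
```

Note that the sign only tells you the direction of the relationship; the absolute value tells you its strength.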

## Step by Step Solution

Using the formula derived above, the **correlation** is calculated below for a sample data set relating a happiness score to weekly work hours; the same steps apply to the temperature and tourism data from the problem.

| Observation | Happiness Score \( (x_i) \) | Work Hours \( (y_i) \) | \( x_i-\bar{x} \) | \( y_i-\bar{y} \) | \( (x_i-\bar{x})(y_i-\bar{y}) \) | \( (x_i-\bar{x})^2 \) | \( (y_i-\bar{y})^2 \) |
|---|---|---|---|---|---|---|---|
| 1 | 89 | 30 | 21.3 | -11.7 | -248.9 | 455.1 | 136.1 |
| 2 | 90 | 35 | 22.3 | -6.7 | -148.9 | 498.8 | 44.4 |
| 3 | 54 | 40 | -13.7 | -1.7 | 22.8 | 186.8 | 2.8 |
| 4 | 60 | 35 | -7.7 | -6.7 | 51.1 | 58.8 | 44.4 |
| 5 | 73 | 40 | 5.3 | -1.7 | -8.9 | 28.4 | 2.8 |
| 6 | 40 | 70 | -27.7 | 28.3 | -783.9 | 765.4 | 802.8 |
| Average | 67.7 | 41.7 | | Total | -1116.7 | 1993.3 | 1033.3 |

Plugging this into the formula, we get:

\[ r_{xy} = \frac{-1116.7}{\sqrt{1993.3 \times 1033.3}} = -0.78 \]
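As a check, the same calculation can be reproduced in a few lines of Python using the deviation form of the formula; this verifies the table's result and also answers the temperature-and-tourism question from Problem 3 (the function name is illustrative):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r computed directly from the deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - mean_x) ** 2 for xi in x)
               * sum((yi - mean_y) ** 2 for yi in y))
    return num / den

# Data from the worked table above (happiness score vs. weekly work hours)
happiness = [89, 90, 54, 60, 73, 40]
hours = [30, 35, 40, 35, 40, 70]
print(round(pearson_r(happiness, hours), 2))  # -0.78, matching the table

# The same function applied to Problem 3 (temperature vs. number of visitors)
temp = [12, 21, 20, 25, 17, 15, 13]
visitors = [87, 150, 110, 90, 85, 70, 90]
print(round(pearson_r(temp, visitors), 2))  # 0.45
```

A coefficient of about 0.45 is a weak-to-moderate positive correlation: warmer days do tend to draw more visitors, but the relationship is far from perfect, and other factors such as weekends, holidays, or local events could drive visitor numbers as well.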
