Regression Definition

 

If you’ve ever heard about popular conspiracy theories, you might be astounded by the level of detail groups have gone to in order to explain the unlikely relationships between events or phenomena. While on the surface conspiracy theories and statistics may seem like they’re on opposite ends of the spectrum, they have both arisen out of the well-documented tendency of humans to see patterns everywhere.

Patterns can be predictable, but they can also be very subjective. One dataset, for example, can be interpreted in a vast number of ways by researchers or students depending on their interests, abilities, and more. The beauty of statistics is that it has many different tools for discovering and analysing these patterns.

Regression is one of these tools. The most basic form of regression is linear regression, which models the relationship between one dependent variable and one or more independent variables and asks whether the independent variables can be used to predict the dependent one.

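Ordinary least squares (OLS) is the most common way to fit a linear regression. OLS chooses the line that minimizes the sum of the squared errors, that is, the squared differences between the observed values and the values the line predicts. With one independent variable, the equation for OLS regression is: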

    \[ y \; = \hat{\alpha} \; + \hat{\beta}*x \]
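
Written out, minimizing the squared errors means choosing the intercept and slope that make the sum of squared residuals as small as possible. As a sketch, assuming n observations written as (x_{i}, y_{i}):

    \[ (\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^n (y_{i} - \alpha - \beta*x_{i})^2 \]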

The OLS estimators, \hat{\alpha} and \hat{\beta}, can be calculated with the following equations:

    \[ \hat{\beta} \; = \frac{\sum_{i=1}^n (x_{i}-\bar{x}) (y_{i} - \bar{y})}{\sum_{i=1}^n (x_{i} - \bar{x})^2} = \frac{\sigma_{xy}}{\sigma_{x}^2} = \rho_{xy} \frac{\sigma_{y}}{\sigma_{x}} \]

    \[ \hat{\alpha} = \bar{y} - \hat{\beta}*\bar{x} \]
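
The two estimator equations translate almost directly into code. The following is a minimal sketch in plain Python; the function name `ols_fit` and the sample data are made up for illustration and are not part of the article.

```python
def ols_fit(xs, ys):
    """Return (alpha_hat, beta_hat) for a simple linear regression of ys on xs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator of beta_hat: sum of the products of the deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator of beta_hat: sum of the squared x deviations.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical data lying exactly on the line y = 1 + 2x.
alpha_hat, beta_hat = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(alpha_hat, beta_hat)
```

Because the made-up points lie exactly on a line, the fitted intercept and slope recover that line; with real data the fit only approximates the points.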


Correlation Definition

You might recognize the word correlation. While the statistical term is often used in the media as if it were a sure sign of a relationship between two variables, it might not mean what you think it means. Take a look at the graph below.

SharkAttackvIceCreamScatterplot

For each observation, we have ice cream sales on a given day and shark attacks on the same day. Notice that as ice cream sales go up, so do shark attacks. In fact, it looks like there is a near-perfect correlation. While correlation does measure the strength of the relationship between two variables, it does not mean that one causes the other.

| Type | Description | Formula |
|---|---|---|
| Pearson’s correlation coefficient | Describes the strength of a linear relationship between two variables | r_{xy} = \frac{\sum_{i=1}^n (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i=1}^n (x_{i}-\bar{x})^2 \sum_{i=1}^n (y_{i}-\bar{y})^2}} |
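
Pearson’s formula can be sketched in a few lines of plain Python. The function name `pearson_r` and the sample values are illustrative assumptions, not something taken from the article.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical samples: one rises with x, the other falls with x.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

The result always lies between -1 and 1, and the function refuses constant inputs because the denominator would be zero.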

The best way to learn how to interpret correlation is by looking at correlation coefficients beside their graphs. Below is a series of graphs, each plotting two variables.

 

 

 

 

| Image | r_{xy} | Interpretation |
|---|---|---|
| A | 1 | Perfect positive correlation, as one variable increases, so does the other |
| B | 0.3 | Low positive correlation |
| C | 0 | No correlation, no relationship between the two variables |
| D | -0.3 | Low negative correlation |
| E | -1 | Perfect negative correlation, as one variable increases, the other decreases |

We can also use regression models to make predictions about the events we’ve modelled. Check out the table below to understand the two main categories of predictions.

| Type | Definition | How it’s Done |
|---|---|---|
| Extrapolation | The estimation of a value that is outside the range of the data set | Plug the desired value into the regression model |
| Interpolation | The estimation of a value that is inside the range of the data set | Plug the desired value into the regression model |
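
Both kinds of prediction use the same arithmetic; the only difference is whether the input falls inside the observed range. Here is a small sketch of that check, where the helper names and the model y = 1 + 2x are hypothetical.

```python
def predict(alpha_hat, beta_hat, x_new):
    """Plug a value into the fitted line y = alpha_hat + beta_hat * x."""
    return alpha_hat + beta_hat * x_new


def prediction_type(x_new, observed_xs):
    """Label a prediction by whether x_new lies within the observed x range."""
    lowest, highest = min(observed_xs), max(observed_xs)
    if lowest <= x_new <= highest:
        return "interpolation"
    return "extrapolation"


# Hypothetical fitted model y = 1 + 2x, observed for x between 0 and 3.
observed_xs = [0, 1, 2, 3]
for x_new in (2.5, 10):
    print(x_new, predict(1, 2, x_new), prediction_type(x_new, observed_xs))
```

Extrapolated values deserve more caution, since nothing in the data confirms that the linear pattern continues outside the observed range.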

Problem 1

Calculate and interpret the correlation coefficient of the two variables below.

| Person | Hand | Height |
|---|---|---|
| A | 17 | 150 |
| B | 15 | 154 |
| C | 19 | 169 |
| D | 17 | 172 |
| E | 21 | 175 |

Problem 2

The graph below represents each individual’s weight and corresponding blood pressure. Recall the formulas for calculating a regression line from the previous sections. Using the correlation coefficient and regression line, interpret the graph.

BloodPressurevWeightScatterplot

| Person | Weight | Blood Pressure |
|---|---|---|
| A | 150 | 125 |
| B | 169 | 130 |
| C | 175 | 160 |
| D | 180 | 169 |
| E | 200 | 150 |

Problem 3

The following graph shows the regression model for age and salary. You are given the following regression model:

    \[ y = -14,448.8 + 2,552.5*x \]

Use the information given below to give an example of interpolation and an example of extrapolation based on this model.

SalaryvAge

| Person | Age | Salary |
|---|---|---|
| A | 18 | 15000 |
| B | 21 | 60000 |
| C | 24 | 35000 |
| D | 30 | 75000 |
| E | 45 | 95000 |

Solution Problem 1

In order to solve this problem, let’s take it step-by-step.

  1. Calculate the mean of each variable
  2. Subtract the mean from every value of that variable
  3. Multiply each pair of differences together, and square each difference
  4. Sum the products and the squared differences
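
| Person | Hand | Height | e = x_{i}-\bar{x} | f = y_{i}-\bar{y} | e*f | e^2 | f^2 |
|---|---|---|---|---|---|---|---|
| A | 17 | 150 | -0.8 | -14.0 | 11.2 | 0.6 | 196.0 |
| B | 15 | 154 | -2.8 | -10.0 | 28.0 | 7.8 | 100.0 |
| C | 19 | 169 | 1.2 | 5.0 | 6.0 | 1.4 | 25.0 |
| D | 17 | 172 | -0.8 | 8.0 | -6.4 | 0.6 | 64.0 |
| E | 21 | 175 | 3.2 | 11.0 | 35.2 | 10.2 | 121.0 |
| Average | 17.8 | 164 | | Total | 74.0 | 20.8 | 506.0 |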

Lastly, you plug everything into the formula. Check out the table below for this calculation.

| Formula | Result |
|---|---|
| \sum (x_{i}-\bar{x})*(y_{i}-\bar{y}) | 74 |
| \sum (x_{i}-\bar{x})^2 | 20.8 |
| \sum (y_{i}-\bar{y})^2 | 506 |
| r_{xy} | \frac{74}{\sqrt{20.8*506}} |

The formula gives us a correlation coefficient of 0.72, which is a high, positive correlation. This means that in this data set, people with larger hand measurements tend to be taller.
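
The four steps and the final formula can be checked with a short script. This is only a sketch that follows the steps in order using the Problem 1 data; the variable names `e` and `f` mirror the column labels in the working table.

```python
import math

hand = [17, 15, 19, 17, 21]
height = [150, 154, 169, 172, 175]

# Step 1: calculate the means.
x_bar = sum(hand) / len(hand)
y_bar = sum(height) / len(height)

# Step 2: subtract the means from every value.
e = [x - x_bar for x in hand]
f = [y - y_bar for y in height]

# Steps 3 and 4: multiply and square the differences, then sum them.
sum_ef = sum(ei * fi for ei, fi in zip(e, f))
sum_e2 = sum(ei ** 2 for ei in e)
sum_f2 = sum(fi ** 2 for fi in f)

# Plug the three sums into the correlation formula.
r = sum_ef / math.sqrt(sum_e2 * sum_f2)
print(round(sum_ef, 1), round(sum_e2, 1), round(sum_f2, 1), round(r, 2))
```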

Solution Problem 2

First we calculate the correlation coefficient.

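| Person | Weight | Blood Pressure | e = x_{i}-\bar{x} | f = y_{i}-\bar{y} | e*f | e^2 | f^2 |
|---|---|---|---|---|---|---|---|
| A | 150 | 125 | -24.8 | -21.8 | 540.6 | 615.0 | 475.2 |
| B | 169 | 130 | -5.8 | -16.8 | 97.4 | 33.6 | 282.2 |
| C | 175 | 160 | 0.2 | 13.2 | 2.6 | 0.0 | 174.2 |
| D | 180 | 169 | 5.2 | 22.2 | 115.4 | 27.0 | 492.8 |
| E | 200 | 150 | 25.2 | 3.2 | 80.6 | 635.0 | 10.2 |
| Average | 174.8 | 146.8 | | Total | 836.8 | 1310.8 | 1434.8 |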

This yields a correlation coefficient of:

    \[ r_{xy} = \frac{836.8}{\sqrt{1310.8*1434.8}} = 0.61 \]

Next we calculate the regression line.

| Formula | Calculation |
|---|---|
| S_{y} = \sqrt{\frac{\sum (y-\bar{y})^2}{n-1}} | \sqrt{\frac{1434.8}{5-1}} = 18.94 |
| S_{x} = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}} | \sqrt{\frac{1310.8}{5-1}} = 18.10 |
| b = r \frac{S_{y}}{S_{x}} | 0.61 * \frac{18.94}{18.10} = 0.64 |
| a = \bar{y} - b*\bar{x} | 146.8 - 0.6384*174.8 = 35.21 (using the unrounded slope) |
| y = a + b*x | y = 35.21 + 0.64x |
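
The regression-line calculation can be reproduced as a sketch using Python’s standard `statistics` module, whose `stdev` function computes the sample standard deviation with n - 1 in the denominator. The script also computes the slope directly from the sums, to show that both routes give the same value.

```python
import math
import statistics

weight = [150, 169, 175, 180, 200]
pressure = [125, 130, 160, 169, 150]

x_bar = statistics.mean(weight)
y_bar = statistics.mean(pressure)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, pressure))
s_xx = sum((x - x_bar) ** 2 for x in weight)
s_yy = sum((y - y_bar) ** 2 for y in pressure)

r = s_xy / math.sqrt(s_xx * s_yy)   # correlation coefficient
s_x = statistics.stdev(weight)      # sample standard deviation of x
s_y = statistics.stdev(pressure)    # sample standard deviation of y

b = r * s_y / s_x                   # slope from r and the standard deviations
a = y_bar - b * x_bar               # intercept
b_direct = s_xy / s_xx              # the same slope from the sums directly

print(f"r = {r:.2f}, S_x = {s_x:.2f}, S_y = {s_y:.2f}")
print(f"y = {a:.2f} + {b:.2f}x")
```

Keeping the slope unrounded until the end matters here: the intercept is sensitive to small changes in b because the mean weight is large.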

This information is summarized below.

BloodPressurevWeightScatterplot

Weight and blood pressure have a moderate, positive correlation. The slope tells us that for every 1 kg increase in weight, the predicted blood pressure goes up by about 0.64 units.

Solution Problem 3

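To give an example of interpolation and extrapolation, simply plug values from inside and outside the range of the data set into the regression model. The ages in the data run from 18 to 45, so an age of 24 is an interpolation and an age of 60 is an extrapolation. Below are some examples.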

| Age | Result | Type |
|---|---|---|
| 24 | 46,811.02 | Interpolation |
| 60 | 138,700.80 | Extrapolation |
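
As a sketch, the predictions can be reproduced by refitting the line from the Problem 3 data and comparing it with the rounded model given in the problem. The function names are illustrative, and small differences between the two versions come only from rounding the coefficients.

```python
age = [18, 21, 24, 30, 45]
salary = [15000, 60000, 35000, 75000, 95000]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(salary) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, salary))
s_xx = sum((x - x_bar) ** 2 for x in age)
b = s_xy / s_xx          # unrounded slope
a = y_bar - b * x_bar    # unrounded intercept


def predict(x):
    """Predicted salary from the refitted, unrounded model."""
    return a + b * x


def predict_rounded(x):
    """Predicted salary from the rounded model stated in Problem 3."""
    return -14448.8 + 2552.5 * x


for x in (24, 60):
    inside = min(age) <= x <= max(age)
    kind = "interpolation" if inside else "extrapolation"
    print(x, kind, round(predict(x), 2), round(predict_rounded(x), 2))
```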

Danica

Located in Prague and studying to become a Statistician, I enjoy reading, writing, and exploring new places.