What is Regression?

Take a look at the following table, which is a dataset from a sample of high school students. 
| Hours Spent on Phone | Hours Spent Outside |
| --- | --- |
| 2 | 0.5 |
| 1 | 0.6 |
| 5 | 0.2 |
| 3 | 0.4 |

 

With descriptive statistics, which measure the centre and spread of the data, we could calculate the mean number of hours the students spent on their phones or outside. We could also calculate the variance of each variable or plot the number of hours spent on the phone with a bar chart.
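As a small illustration, the mean and variance of the table above could be computed with Python's standard `statistics` module. This is only a sketch: the variable names are made up for the example, and it assumes the sample variance (dividing by n - 1) is the one wanted.

```python
from statistics import mean, variance

# Data from the table above (hours per day)
phone_hours = [2, 1, 5, 3]
outside_hours = [0.5, 0.6, 0.2, 0.4]

# Centre: arithmetic mean of each column
phone_mean = mean(phone_hours)
outside_mean = mean(outside_hours)

# Spread: sample variance (divides by n - 1). This is an assumption;
# statistics.pvariance would treat the four students as the whole population.
phone_var = variance(phone_hours)
outside_var = variance(outside_hours)

print(f"Phone:   mean = {phone_mean:.3f} h, variance = {phone_var:.3f}")
print(f"Outside: mean = {outside_mean:.3f} h, variance = {outside_var:.3f}")
```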

 

While descriptive statistics are very powerful, inferential statistics can help us predict what is not included in our dataset. Regression analysis is one of the tools of inferential statistics. Linear regression, the focus of this guide, models the linear relationship between two or more variables. Take a look at the image below, which you’ll be able to interpret by the end of this guide.

regression_line

 


Simple Linear Regression

Simple linear regression is a form of linear regression in which there is only one independent and one dependent variable. To understand these variables, take a look at the image below.

slr_formula

This is the sample SLR equation. It closely follows the equation of a line, which can be seen below.

line_formula

As you can see, the SLR equation is composed of four main components. These components are explained in the table below.

 

| Component | Definition | Interpretation |
| --- | --- | --- |
| y | Response variable | The variable that increases or decreases in response to changes in x |
| x | Explanatory variable | The variable that explains the variation in y |
| b_{0} | Constant | The value of y when x is zero |
| b_{1} | Slope | The amount by which y increases (if positive) or decreases (if negative) following a 1 unit change in x |

 

SLR Interpretation

To interpret an SLR model, you should know that besides the four elements mentioned above, a regression model typically reports two more elements, summarized below.

 

| Component | Definition | Interpretation |
| --- | --- | --- |
| R-squared | Proportion of variance of the response variable explained by the explanatory variables | A high R-squared indicates the regression model is good at explaining the variance in y |
| Standard error of regression | The standard error between the data points and the predicted values | A low SE of the regression means that the data points and predicted values are close together |
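As a rough illustration of these two measures, the sketch below fits a least-squares line to the income and energy consumption data used in Problem 1 later in this guide, then computes R-squared and the standard error of the regression. The function names `least_squares` and `fit_quality` are made up for this example, and the standard error is assumed to use the common n - 2 divisor for simple linear regression, which the text does not specify.

```python
import math

def least_squares(xs, ys):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def fit_quality(xs, ys, b0, b1):
    """Return (R-squared, standard error of regression) for y = b0 + b1 * x."""
    n = len(xs)
    y_bar = sum(ys) / n
    predicted = [b0 + b1 * x for x in xs]
    # Unexplained variation: squared gaps between data and predictions
    sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    # Total variation of y around its mean
    sst = sum((y - y_bar) ** 2 for y in ys)
    r_squared = 1 - sse / sst
    # Assumption: n - 2 degrees of freedom (two estimated coefficients)
    std_error = math.sqrt(sse / (n - 2))
    return r_squared, std_error

income = [35, 46, 52, 60, 85]   # thousands of dollars
energy = [9, 10, 11, 12, 16]    # MWh

b0, b1 = least_squares(income, energy)
r_squared, std_error = fit_quality(income, energy, b0, b1)
print(f"R-squared = {r_squared:.3f}, standard error = {std_error:.3f} MWh")
```

A value of R-squared close to 1 together with a small standard error would suggest that the line describes these five observations well.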

 

Problem 1

You are interested in studying the relationship between income level and energy consumption. In order to do this, you are given a data set that includes the variables income and energy consumption. The income variable is in thousands of dollars, while the energy consumption variable is in megawatt hours, or MWh.

 

Calculate the covariance and correlation coefficient of these variables. Using this information, interpret the graph of both variables which is given below.

slr_regression_graph
| Income (thousands of dollars) | Energy Consumption (MWh) |
| --- | --- |
| 35 | 9 |
| 46 | 10 |
| 52 | 11 |
| 60 | 12 |
| 85 | 16 |

 

Solution to Problem 1

In this problem, you were asked to calculate the correlation coefficient and then interpret the variables in the graph using this information. To calculate the correlation coefficient, you first have to calculate the mean of both x and y. Next, you subtract this value from each observation and plug the results into the formula.
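The formula itself is not written out in the text. Judging by the calculation that follows, it appears to be the usual Pearson correlation coefficient, which can be written as:

\[ r(x,y) = \dfrac{\sum (x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum (x_{i}-\bar{x})^2 \sum (y_{i}-\bar{y})^2}} \]

Here \bar{x} and \bar{y} are the means of the two variables, and each sum runs over all observations.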

 

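| Income (x) | Energy Consumption (y) | x-\bar{x} | y-\bar{y} | (x-\bar{x})*(y-\bar{y}) | (x-\bar{x})^2 | (y-\bar{y})^2 |
| --- | --- | --- | --- | --- | --- | --- |
| 35 | 9 | -20.6 | -2.6 | 53.56 | 424.36 | 6.76 |
| 46 | 10 | -9.6 | -1.6 | 15.36 | 92.16 | 2.56 |
| 52 | 11 | -3.6 | -0.6 | 2.16 | 12.96 | 0.36 |
| 60 | 12 | 4.4 | 0.4 | 1.76 | 19.36 | 0.16 |
| 85 | 16 | 29.4 | 4.4 | 129.36 | 864.36 | 19.36 |
| Mean = 55.6 | Mean = 11.6 | | Total | 202.2 | 1413.2 | 29.2 |

 

    \[ r(x,y) = \dfrac{202.2}{\sqrt{1413.2*29.2}} \approx 0.995 \]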
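The calculation above can be checked with a few lines of Python. The sketch also computes the covariance that the problem asks for. It assumes the sample covariance (dividing by n - 1) is intended, which the text does not state. Dividing by n instead would give the population covariance.

```python
import math

income = [35, 46, 52, 60, 85]   # thousands of dollars
energy = [9, 10, 11, 12, 16]    # MWh

n = len(income)
x_bar = sum(income) / n
y_bar = sum(energy) / n

# Sums of products and squares of the deviations from the means
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, energy))
sxx = sum((x - x_bar) ** 2 for x in income)
syy = sum((y - y_bar) ** 2 for y in energy)

# Assumption: sample covariance, so divide by n - 1
sample_cov = sxy / (n - 1)

# Pearson correlation coefficient
r = sxy / math.sqrt(sxx * syy)

print(f"covariance = {sample_cov:.2f}, r = {r:.3f}")
```

A positive covariance and a correlation close to 1 both point to a strong positive linear relationship: in this data, higher income goes together with higher energy consumption.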

 

Problem 2

In the previous problem, you were asked to explore the relationship between the two variables of energy consumption and income level using the covariance and correlation coefficient. Now, you want to see if there is another factor in determining energy consumption. You are given a data set that, for the same energy consumption observations, has data on the average temperature in that region.

 

Given the following graph of the two variables, calculate their correlation coefficient and compare it to the one from the previous problem. In other words, find out whether income level and energy consumption are more or less strongly correlated than average temperature and energy consumption.

slr_regression_example

 

| Average Temperature | Energy Consumption (MWh) |
| --- | --- |
| 20 | 9 |
| 19 | 10 |
| 10 | 11 |
| 4 | 12 |
| 28 | 16 |

 

Solution to Problem 2

In order to compare the two variables, we need to find the correlation between average temperature and energy consumption.

 

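| Average Temperature (x) | Energy Consumption (y) | x-\bar{x} | y-\bar{y} | (x-\bar{x})*(y-\bar{y}) | (x-\bar{x})^2 | (y-\bar{y})^2 |
| --- | --- | --- | --- | --- | --- | --- |
| 20 | 9 | 3.8 | -2.6 | -9.88 | 14.44 | 6.76 |
| 19 | 10 | 2.8 | -1.6 | -4.48 | 7.84 | 2.56 |
| 10 | 11 | -6.2 | -0.6 | 3.72 | 38.44 | 0.36 |
| 4 | 12 | -12.2 | 0.4 | -4.88 | 148.84 | 0.16 |
| 28 | 16 | 11.8 | 4.4 | 51.92 | 139.24 | 19.36 |
| Mean = 16.2 | Mean = 11.6 | | Total | 36.4 | 348.8 | 29.2 |

 

    \[ r(x,y) = \dfrac{36.4}{\sqrt{348.8*29.2}} \approx 0.361 \]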

 

Income and energy consumption are more highly correlated than average temperature and energy consumption.
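The comparison can be reproduced with a small helper function. This is a sketch only: `pearson_r` is a name chosen for this example, not something defined in the guide.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

income = [35, 46, 52, 60, 85]       # thousands of dollars
temperature = [20, 19, 10, 4, 28]   # average temperature
energy = [9, 10, 11, 12, 16]        # MWh

r_income = pearson_r(income, energy)
r_temperature = pearson_r(temperature, energy)

print(f"income vs energy:      r = {r_income:.3f}")
print(f"temperature vs energy: r = {r_temperature:.3f}")

# The variable with the larger |r| is the more strongly correlated one
stronger = "income" if abs(r_income) > abs(r_temperature) else "temperature"
print(f"More strongly correlated with energy consumption: {stronger}")
```

Comparing absolute values matters in general, because a correlation near -1 is just as strong as one near +1.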

regression_line_example

 

Problem 3

You have now determined which variables are more strongly correlated. In order to be able to use this information, you need to use the data to model energy consumption. That is, you need to build an SLR model with the data that you have. Recall that there are two main elements you need to calculate in order to build an SLR model: the constant and the regression coefficient. You can find these formulas below.

regression_formulas_estimators

Find the SLR model using the data of the most strongly correlated variables. Next, perform an interpolation and an extrapolation using any values. Recall that interpolation is when you predict y using an x value that lies within the range of your data set. Extrapolation, on the other hand, is when you predict y using an x value that is outside the range of your data. The picture below should give a clearer idea.

interpolation_extrapoliation

 

Solution to Problem 3

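In order to build a regression model, we must find the values for b_{0} and b_{1}. Recall the information we already calculated for income (x) and energy consumption (y), using the unrounded means and totals.

 

| Quantity | Value |
| --- | --- |
| \bar{x} | 55.6 |
| \bar{y} | 11.6 |
| \sum (x-\bar{x})*(y-\bar{y}) | 202.2 |
| \sum (x-\bar{x})^2 | 1413.2 |
| \sum (y-\bar{y})^2 | 29.2 |

 

    \[ b_{1} = \dfrac{ \sum (x_{i}-\bar{x}) (y_{i}-\bar{y}) }{ \sum (x_{i}-\bar{x})^2} = \dfrac{202.2}{1413.2} \approx 0.1431 \]

 

    \[ b_{0} = \bar{y} - b_{1}\bar{x} = 11.6-(0.1431*55.6) \approx 3.64 \]

 

The fitted SLR model is therefore \hat{y} = 3.64 + 0.143x. Each additional thousand dollars of income is associated with an increase of about 0.143 MWh in energy consumption.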
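The problem also asks for an interpolation and an extrapolation. A minimal sketch of both is given below, computing the coefficients from the unrounded data. The x values 50 and 100 are arbitrary choices made for this example, and `predict` is a hypothetical helper name.

```python
income = [35, 46, 52, 60, 85]   # thousands of dollars
energy = [9, 10, 11, 12, 16]    # MWh

n = len(income)
x_bar = sum(income) / n
y_bar = sum(energy) / n

# Slope: sum of products of deviations over sum of squared x deviations
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(income, energy))
      / sum((x - x_bar) ** 2 for x in income))
# Constant: forces the line through the point of means
b0 = y_bar - b1 * x_bar

def predict(x):
    """Predicted energy consumption (MWh) for an income of x thousand dollars."""
    return b0 + b1 * x

# Arbitrary example values: 50 lies inside the observed range, 100 outside it
for x in (50, 100):
    inside = min(income) <= x <= max(income)
    kind = "interpolation" if inside else "extrapolation"
    print(f"x = {x}: predicted y = {predict(x):.2f} MWh ({kind})")
```

Predictions from extrapolation deserve more caution, since the data give no evidence that the linear pattern continues beyond incomes of 85 thousand dollars.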

 


Danica

Located in Prague and studying to become a Statistician, I enjoy reading, writing, and exploring new places.