What is Linear Regression?

 

Have you ever wondered how statistics are calculated? For example, according to Statista, between 2017 and 2018 people in the UK drove, on average, about 16,000 km. But how exactly do statisticians arrive at such a number?

 
Statistics, and maths in general, have been the source of much contention these days. In the face of alternative facts, understanding how statistics are calculated is an extremely important skill. Statistical analysis revolves around one main concept: we cannot access all data all the time. This is why we rely on statistical analysis to estimate certain phenomena, such as how many kilometres people drive on average in the UK every year.

 

population_statistics

 

sample_selection

 

The image above is a visual representation of what a sample is, along with some of the questions one must consider when taking a sample. The table below gives the definitions of population and sample, as well as what they mean in our specific example.

 

Term | Definition | In our example
Population | All things included in the phenomenon we want to study | All of the cars driven in the UK
Sample | A subset of the population | A sample of 1,000 cars driven in the UK

 

Multiple linear regression, otherwise known as MLR, attempts to build a model that uses several explanatory variables to predict one response variable. Most of the time, the data used in an MLR come from a sample. This means that the MLR model we calculate is only an estimate of the MLR model that exists for the entire population.

 

Take a moment to refresh your memory on what measurements are called, and what they mean, for both the population and the sample.

 

Group | Definition | Measurements | MLR model
Population | The entire group of things, ideas, or people we want to study | Population parameters are measurements from the population | Population MLR model
Sample | A subset of the population | Sample statistics are measurements from the sample that strive to estimate the true population values | Sample MLR model that estimates the population MLR model

 


Applications of Linear Regression

The applications of MLR are vast. Take a look at the MLR equations below that correspond to the population and the sample.

explanatory_variable

As you can see, the four main components of an MLR model are present. These include:

  • The response variable
  • The explanatory variables
  • The parameters of the model
  • The constant
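For reference, the two equations are commonly written as follows. This is standard textbook notation rather than a transcription of the image; the number of explanatory variables k and the error term ε are assumptions of this sketch.

```latex
% Population model: true (unknown) parameters
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon

% Sample model: estimates computed from the data
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k
```

Here y is the response variable, x_1 to x_k are the explanatory variables, β_1 to β_k are the parameters of the model, and β_0 is the constant. The b values are the sample estimates of the corresponding β values.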

 

An example of an MLR model in real life can be seen in the image below.

disease_multiple_regression

In this example, we take the COVID-19 pandemic into account. This model strives to predict the number of COVID-19 cases in a given region from the population, number of hospitals, and average flights per day in that region. The four main components can be broken down as in the table below.

 

Component | In this example
Response variable | Number of COVID-19 cases
Explanatory variables | Population, number of hospitals, average flights per day
Parameters of the model | Beta values, which will be estimated using a data set
The constant | The β0 value, which is the constant (intercept) in the model
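To make the idea of estimating the beta values from a data set concrete, here is a minimal sketch in Python using ordinary least squares. Everything in it is an assumption made for illustration: the data are synthetic, the "true" beta values are picked arbitrarily, and the variable names are hypothetical. It is not a model of real COVID-19 data.

```python
import numpy as np

# Synthetic, made-up data purely for illustration (not real COVID-19 figures).
rng = np.random.default_rng(0)
n = 200
population = rng.uniform(50, 500, n)   # hypothetical: thousands of people per region
hospitals = rng.uniform(1, 20, n)      # hypothetical: hospitals per region
flights = rng.uniform(0, 100, n)       # hypothetical: average flights per day

# "True" population parameters, chosen arbitrarily for this sketch:
# the constant first, then one beta per explanatory variable.
true_beta = np.array([10.0, 0.8, -3.0, 1.5])

# Design matrix: a column of ones for the constant plus the explanatory variables.
X = np.column_stack([np.ones(n), population, hospitals, flights])
cases = X @ true_beta + rng.normal(0, 5, n)  # response variable with random noise

# Ordinary least squares: sample estimates b0..b3 of the population betas.
b, *_ = np.linalg.lstsq(X, cases, rcond=None)
print(dict(zip(["b0", "b1_population", "b2_hospitals", "b3_flights"], b.round(2))))
```

Because the estimates come from a sample of 200 simulated regions, they land close to the chosen beta values but do not match them exactly, which is the sample-versus-population distinction described earlier.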

 

Transformed Variables

In many cases, the data from our sample contain moderately to highly skewed variables. A skewed variable is one whose distribution has a longer tail on one side than on the other. Take the image below as an example.

skewed_distribution

This distribution plot shows the average annual income of a sample of adults. As we would expect, far fewer people earn higher annual salaries. However, if you are trying to predict which factors go into determining annual salaries, it's not a variable you can just scrap.

 

This is where the idea of transforming variables comes into play. If you want to use a variable that has a highly skewed distribution, like the one above, you can transform it so that its distribution is closer to a normal distribution.

 

Skewness coefficient | Interpretation
skew > 1 or skew < -1 | Highly skewed
0.5 < skew < 1 or -1 < skew < -0.5 | Moderately skewed
-0.5 < skew < 0.5 | Approximately symmetric

 

As you can see, if the skewness coefficient is greater than 1 or less than -1, then you should think about transforming your variable. The common types of transformations are summarized below.
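The thresholds above can be applied in a few lines of Python. This is a rough sketch: the article does not say which skewness formula it has in mind, so the moment-based coefficient used here is an assumption, and the income figures are made up.

```python
import numpy as np

def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (one common definition, assumed here)."""
    x = np.asarray(values, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)  # second central moment (variance)
    m3 = np.mean(d ** 3)  # third central moment
    return m3 / m2 ** 1.5

def classify_skew(g1):
    """Apply the rule-of-thumb thresholds from the table above."""
    if abs(g1) > 1:
        return "highly skewed"
    if abs(g1) > 0.5:
        return "moderately skewed"
    return "approximately symmetric"

# Made-up incomes (in thousands) with one large value in the right tail.
incomes = [20, 22, 25, 25, 27, 30, 250]
g1 = skewness(incomes)
print(round(g1, 2), classify_skew(g1))
```

A positive coefficient indicates a longer right tail and a negative one a longer left tail, so the sign tells you which direction the skew runs in.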

 

Transformation | How to apply it
Logarithm | Take the log of the variable
Cube root | Take the cube root of the variable
Square root | Take the square root of the variable
Square | Square the variable

 

Taking the log of the values above yields the following distribution. As you can see, the variable is now far less skewed and is better suited for use in our MLR.

log_transformation
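The effect of these transformations can be checked numerically. The sketch below compares the skewness of a synthetic right-skewed income variable before and after transformation. The data are simulated from a lognormal distribution, which is an assumption made so that the log transform works well; real income data will not behave this cleanly.

```python
import numpy as np

def skewness(x):
    """Moment-based skewness g1 = m3 / m2**1.5 (assumed definition)."""
    d = x - x.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

# Synthetic right-skewed "incomes"; the distribution and its settings are made up.
rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=0.8, size=5000)

# Squaring is left out: it stretches the right tail further, so it is
# typically reserved for left-skewed variables.
transforms = {
    "original": income,
    "square root": np.sqrt(income),
    "cube root": np.cbrt(income),
    "logarithm": np.log(income),
}

skews = {name: skewness(v) for name, v in transforms.items()}
for name, s in skews.items():
    print(f"{name:12s} skewness = {s:.2f}")
```

On data like this, each transformation in the list pulls the right tail in more strongly than the one before it, with the logarithm having the strongest effect.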

 

Transformed Variables Interpretation

As we discussed, highly skewed variables are generally transformed. This is because skewed variables can lead to misleading statistics. Recall our previous example, which used annual income. Because the original variable is highly right-skewed, the majority of the data sit around lower salary values, with only a few observations around higher salary values.

 

If we were to take the mean of this highly skewed data, you can imagine it would give us a higher value than what is actually the most common salary in our dataset. The higher salary values act as inflators, pushing the mean up because of their magnitude. Now imagine this effect in a regression!
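A tiny numerical illustration of this inflating effect, using made-up salaries (in thousands) where a single high earner sits far from the rest:

```python
import statistics

# Made-up salaries in thousands; one very high value in the right tail.
salaries = [20, 22, 25, 25, 27, 30, 250]

mean_salary = statistics.mean(salaries)      # pulled up by the 250
median_salary = statistics.median(salaries)  # middle value, barely affected
mode_salary = statistics.mode(salaries)      # most common value

print(mean_salary, median_salary, mode_salary)
```

The mean comes out at 57, more than double the median and the most common salary, which are both 25.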

 

Take a look at the table below to see how log transformed variables are interpreted for common types of linear regression models.

 

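Type | Model form | Interpretation of the regression coefficient
Log-log | log(y) on log(x) | A 1% increase in x is associated with a change of about (coefficient)% in y
Linear-log | y on log(x) | A 1% increase in x is associated with a change of about (coefficient/100) units in y
Log-linear | log(y) on x | A 1-unit increase in x is associated with a change of about (100 × coefficient)% in y

The linear-log and log-linear readings assume natural logarithms, and all three are approximations that work best for small changes.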
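These rules of thumb are approximations, and it can help to compare them with the exact effect. The sketch below assumes natural logarithms and a simple model with one explanatory variable; the function names and coefficient values are hypothetical, chosen only for illustration.

```python
import math

def log_log_effect(b, pct=1.0):
    """Exact % change in y when x rises by pct %, for ln(y) = a + b*ln(x)."""
    return ((1 + pct / 100) ** b - 1) * 100

def linear_log_effect(b, pct=1.0):
    """Exact change in y (in units) when x rises by pct %, for y = a + b*ln(x)."""
    return b * math.log(1 + pct / 100)

def log_linear_effect(b, units=1.0):
    """Exact % change in y when x rises by `units`, for ln(y) = a + b*x."""
    return (math.exp(b * units) - 1) * 100

# Rule of thumb versus exact effect, for made-up coefficients.
print("log-log,    b = 0.5 : rule 0.5 %,   exact", round(log_log_effect(0.5), 3), "%")
print("linear-log, b = 200 : rule 2 units, exact", round(linear_log_effect(200), 3), "units")
print("log-linear, b = 0.03: rule 3 %,     exact", round(log_linear_effect(0.03), 3), "%")
```

The exact values stay close to the rules of thumb here because the coefficients and changes are small. The gap widens as they grow.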

 

Transformed Variables Example

Let's take a look at an example to put the above interpretations into perspective. The following graph plots the association between area and population.

transformed_variable

 

As you can see, the area is in 1,000 square kilometres while the population is in millions. From this graph, it is unclear what relationship there is, if any. This is because most of the values for population and area are clustered in a narrow range, while a few extreme values stretch the axes far beyond it.

log_transformed_graph

As you can see, taking the logarithm of both variables allows us to see a more clearly defined relationship between the two variables. This is because a logarithmic scale is not linear: moving one unit on a base-10 log scale means multiplying by 10 each time. The step from 1 to 2 on the log scale therefore covers a much smaller range of original values than the step from 6 to 7.
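A short sketch of this property, assuming a base-10 logarithm as the "multiplying by 10" description implies:

```python
import math

# Each value is 10 times the previous one within its group.
values = [1, 10, 100, 1_000, 1_000_000, 10_000_000]
for v in values:
    print(f"{v:>12,} -> log10 = {math.log10(v):.0f}")

# One unit on the log scale covers very different ranges of original values.
step_1_to_2 = 10**2 - 10**1   # original values between log10 = 1 and log10 = 2
step_6_to_7 = 10**7 - 10**6   # original values between log10 = 6 and log10 = 7
print(step_1_to_2, step_6_to_7)
```

This compression of large values is what pulls the few extreme observations back towards the rest of the data in the log-transformed graph.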

 

Problem 1

A model has been run based on the previous example. The following table summarizes each element of the regression output.

 

Element | Value
y | Log of area
b0 | The constant in the model is 3,000
b1 | The regression coefficient for population is 9.8
x | Log of population

 

Interpret the results of the regression.

 

Solution to Problem 1

Using the information given in this section, we can interpret the regression in the following way.

 

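Value | Interpretation
3,000 | The constant is the predicted log of area when the log of population is zero, that is, when the population equals 1 in its units (1 million people here). It is not the area at a population of zero, because the log of zero is undefined
9.8 | Because this is a log-log model, a 1% increase in population is associated with an increase of roughly 9.8% in area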
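Reading the coefficient directly as a percentage is a first-order approximation. As a rough check, the exact change implied by a coefficient of 9.8 for a 1% rise in population can be worked out as follows. This holds for any logarithm base, as long as both variables use the same one.

```latex
\log(\text{area}) = b_0 + 9.8\,\log(\text{population})
\quad\Longrightarrow\quad
\frac{\text{area}_{\text{new}}}{\text{area}_{\text{old}}}
  = \left(\frac{\text{population}_{\text{new}}}{\text{population}_{\text{old}}}\right)^{9.8}
  = 1.01^{9.8} \approx 1.102
```

So the exact increase is about 10.2%. The approximation is less tight here than usual because the coefficient is large.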

 


Danica

Located in Prague and studying to become a Statistician, I enjoy reading, writing, and exploring new places.