The classical OLS assumptions are what allow a least squares regression to be taken seriously: without them, the estimates and any inferences built on them are unreliable. Test your knowledge of the OLS assumptions by answering the problem below. If you find that you’re having trouble answering, or are encountering these concepts for the first time, read through this guide first.

 

Problem 6

You are given the following dataset and multiple regression model, which explores the relationship between blood pressure and weight, height and age. You’d like to conduct a multiple regression analysis, but first want to check the 6 OLS assumptions. Using graphs and statistics, do you think this model passes each assumption? Explain why or why not.

 

Blood Pressure Weight Height Age
105 75 172 19
106 80 175 18
108 89 170 20
110 90 174 20
113 93 178 21
115 95 179 22
118 96 180 24
119 99 183 25
120 101 185 29
122 102 188 30

 


OLS

Recall that ordinary least squares, or OLS, is a regression method that minimizes the sum of squared errors. These population errors are estimated by sample statistics called “residuals.” The most accurate regression line is the one that minimizes the distance between the observed y values in the data and the estimated y values.

 

1 y_{i} = b_{0} + b_{1}x_{i} + u_{i} OLS estimated regression equation
2 u_{i} = y_{i} - (b_{0} + b_{1}x_{i}) OLS equation rearranged to get the residual formula
3 u_{i} = y_{i} - \hat{y}_{i} The residual is the difference between the observed and predicted value
4 \sum u_{i}^2 Residuals are squared so that positive and negative residuals don’t cancel out
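As a rough illustration, the residual and sum-of-squares formulas above can be computed directly with NumPy. The data here is made up purely for demonstration:

```python
import numpy as np

# Invented toy data: y is roughly linear in x with some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# OLS slope and intercept from the normal equations
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()

# Residuals u_i = y_i - (b0 + b1 * x_i), then the sum of squared residuals
u = y - (b0 + b1 * x)
ssr = np.sum(u ** 2)
```

Because the model includes an intercept, the residuals sum to zero; OLS picks the (b0, b1) pair that makes `ssr` as small as possible.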

 

OLS Assumptions

There are some assumptions that all linear models should pass in order to be taken seriously. These classical linear regression model, or CLRM, assumptions underlie the Gauss-Markov theorem. This theorem states that when a model satisfies the six assumptions, OLS produces the Best Linear Unbiased Estimates, or BLUE. Check out the assumptions below.

 

# Assumption Description
1 y = b_{0} + b_{1}x_{1}^2 Linear in parameters
2 {x_{i}, y_{i}} are a random sample Each individual in the population is equally likely to be picked for the sample
3 E(u_{i}|x_{i}) = 0 Zero conditional mean of the error term
4 Weak correlation between x_{n}'s No perfect collinearity
5 Var(u_{i}) = \sigma^2 Homoskedasticity
6 Cov(u_{i}, u_{j}) = 0 No autocorrelation

 

CLRM Assumption 1

The first assumption states that a regression model should be linear in parameters. Keep in mind that there is a difference between a model’s parameters and its variables.

 

[Image: the regression parameters are the coefficients b_{n}]

 

These parameters are called regression coefficients, and they are estimated from the sample. The dependent variable y and the independent variables x_{n}, on the other hand, are the model’s variables. The model’s parameters must enter linearly; the model’s variables, however, can be non-linear. To check this assumption, simply check the parameters.

 

Linear in Parameters: log(y) = b_{0} + b_{1}x_{1}^2 + b_{2}\sqrt{x_{2}} + b_{3}log(x_{3})
Not Linear in Parameters: y = b_{0} + b_{1}^2x_{1} + \sqrt{b_{2}}x_{2} + log(b_{3})x_{3}

 

CLRM Assumption 2

This assumption states that the sample must be a random sample. This means that the sample is drawn from the same population where each sampled individual or value has an equal probability of being chosen. This also means that the sampled values are independent of each other.

 

To check this assumption, you simply have to find out the sampling methodology behind the data if you were not the one who took the sample. Check out the difference between sampling methodologies below.

 

Random Samples Non-random samples
SRS (Simple Random Sample) Quota, Voluntary, Expert, or Convenience Sampling

 

CLRM Assumption 3

The third assumption deals with the error term. This assumption states that the expected value, or mean, of the error term given x_{i} has to be zero. In other words, the x values cannot be correlated with the residual values. When the error term is correlated to any independent variables, this is called endogeneity.

 

Endogeneity often arises from omitted variable bias, where a crucial independent variable is left out of your model. That variable is then absorbed into the error term, and because it is correlated with the included independent variables, the error term becomes correlated with those variables.

 

You can check endogeneity with the Durbin-Wu-Hausman (DWH) test on any statistical program.

 

E(u_{i}|x_{i}) = 0 No endogeneity DWH p-value > 0.05 No endogeneity
E(u_{i}|x_{i}) \neq 0 Endogeneity DWH p-value < 0.05 Endogeneity
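Omitted variable bias can be illustrated with a small simulation: when a relevant regressor is dropped, the slope on the remaining variable absorbs part of its effect. All variables and coefficients below are invented for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# x2 is correlated with x1, and both genuinely affect y
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Correct model: regress y on both x1 and x2
X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Misspecified model: omit x2, so it ends up inside the error term
X_short = np.column_stack([np.ones(n), x1])
b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)
```

In the full model the slope on x1 is close to its true value of 2, while in the short model it is badly biased upward, because the error term now contains x2 and is therefore correlated with x1.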

 

CLRM Assumption 4

The next assumption deals with collinearity. Perfect collinearity occurs when there is an exact linear relationship between two variables. Perfect collinearity can happen for many reasons, summarized below.

 

One variable is a multiple of another variable One explanatory variable (EV) x_{1} is in feet and another EV x_{2} is in meters
One variable is a linear combination of the others If x_{1} is questions answered right, x_{2} answered wrong and x_{3} left unanswered, and every test has the same total number of questions, then x_{3} = total - x_{1} - x_{2}
One variable is a transformation of the other x_{1} is price and x_{2} is the logarithm of price

 

This can be checked with a correlation matrix, which lists all variables and their correlations.

 

Height Weight Heartbeat Rate
Height 1 0.96 0.61
Weight 0.96 1 0.72
Heartbeat Rate 0.61 0.72 1
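A correlation matrix like the one above can be produced with NumPy’s corrcoef. The measurements below are invented for illustration, with weight deliberately tracking height closely:

```python
import numpy as np

# Hypothetical measurements (rows of corrcoef input = variables)
height = np.array([172.0, 175.0, 170.0, 174.0, 178.0, 179.0])
weight = np.array([65.0, 70.0, 62.0, 68.0, 75.0, 77.0])
heart_rate = np.array([62.0, 60.0, 71.0, 68.0, 74.0, 70.0])

# Pairwise correlation matrix: entry [i, j] is corr(variable i, variable j)
corr = np.corrcoef([height, weight, heart_rate])

# Flag pairs whose correlation is suspiciously close to 1 in magnitude
high = np.abs(corr) > 0.9
```

Any off-diagonal entry flagged by `high` is a warning sign of (near-)collinearity between that pair of explanatory variables.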

 

CLRM Assumption 5

The fifth assumption states that there should be no heteroskedasticity, meaning the variance of the error term shouldn’t follow a pattern.

 

[Graph: a heteroscedastic scatter plot]

 

The above graph represents a heteroscedastic data set, where the pattern is that as the x value increases, the variance increases. As the x-value decreases, the variance decreases. Below are several ways to test for homoscedasticity.

 

Residuals v Predicted Values Plot Levene Test Rule of thumb
Homoscedastic Null plot - no pattern P-value > 0.05 Largest variance is less than 3 to 5 times the smallest variance
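The variance rule of thumb can be sketched numerically: split the residuals by fitted value and compare the group variances. The residuals below are made up, with the second half visibly more spread out:

```python
import numpy as np

# Hypothetical residuals, ordered by fitted value, split into two halves
residuals = np.array([-1.0, 0.8, -0.6, 0.9, -2.5, 3.1, -2.8, 2.9])
low_half, high_half = residuals[:4], residuals[4:]

# Sample variances of each group
var_low = np.var(low_half, ddof=1)
var_high = np.var(high_half, ddof=1)
ratio = max(var_low, var_high) / min(var_low, var_high)

# Rule of thumb: a ratio above roughly 3 to 5 suggests heteroskedasticity
suspect = ratio > 3
```

Here the variance of the second half dwarfs that of the first, so the rule of thumb flags heteroskedasticity; a formal test such as Levene’s would be the next step.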

 

CLRM Assumption 6

The sixth assumption states there shouldn’t be any autocorrelation, or serial correlation. Meaning, different error terms cannot be correlated. Take the graph below as an example.

 

[Graph: residuals showing a seasonal trend]

 

The residuals on some months are correlated - meaning, they move together. Here, this is because of seasonal trends, where shopping increases during holiday periods. Below are some ways to test for autocorrelation.

 

Durbin-Watson Rule of Thumb
No autocorrelation DW statistic close to 2 (the statistic ranges from 0 to 4) Residual v. predicted value plot is null, or no pattern
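The Durbin-Watson statistic itself is easy to compute from the residuals: it is the sum of squared successive differences divided by the sum of squared residuals. The two series below are simulated to show both ends of the scale:

```python
import numpy as np

def durbin_watson(residuals):
    # DW = sum of squared successive differences / sum of squared residuals
    diffs = np.diff(residuals)
    return np.sum(diffs ** 2) / np.sum(residuals ** 2)

# Independent noise should give a statistic near 2 ...
rng = np.random.default_rng(0)
dw_white = durbin_watson(rng.normal(size=500))

# ... while a slow-moving, positively autocorrelated series sits near 0
dw_trend = durbin_watson(np.sin(np.linspace(0, 3, 500)))
```

Values near 0 indicate strong positive autocorrelation, values near 4 strong negative autocorrelation, and values near 2 none.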
 

OLS Assumptions in Real Life

Let’s start with the first two assumptions, checked below.

 

# Assumption Answer
1 Linear in parameters The model is blood \; pressure = b_{0} + b_{1}weight + b_{2}height + b_{3}age
2 Random sample While we’re not told about the sampling method, we can assume it is a random sample

 

Next, we check assumption #4 by looking at the correlation matrix. As you probably guessed, our independent variables are highly correlated. Strictly speaking this is not perfect collinearity, but correlations this close to 1 signal severe multicollinearity, so the model effectively fails this assumption.

 

BP Weight Height Age
BP 1
Weight 0.947891 1
Height 0.93656 0.822583 1
Age 0.930753 0.852695 0.922422 1

 

Next, check for heteroscedasticity visually. Height exhibits a bit more variability for lower values.

 

[Graph: residual plots used to check homoscedasticity]

The last two assumptions we can check together, as they both involve the residuals. You can run this MLR in any statistical program to get the following regression coefficients.

Intercept 1.270524
Weight 0.355794
Height 0.422362
Age 0.186278

 

With this information, you can calculate the Durbin-Watson statistic by obtaining the residuals from the regression model. The Durbin-Watson test statistic here is 0.0543, far below the no-autocorrelation benchmark of 2, so we can say the residuals are strongly positively autocorrelated.
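As a sketch, the problem’s regression and the Durbin-Watson computation can be reproduced with NumPy’s least squares solver (exact coefficient values depend on the fitting routine used):

```python
import numpy as np

# The problem's dataset: blood pressure explained by weight, height and age
bp = np.array([105, 106, 108, 110, 113, 115, 118, 119, 120, 122], dtype=float)
weight = np.array([75, 80, 89, 90, 93, 95, 96, 99, 101, 102], dtype=float)
height = np.array([172, 175, 170, 174, 178, 179, 180, 183, 185, 188], dtype=float)
age = np.array([19, 18, 20, 20, 21, 22, 24, 25, 29, 30], dtype=float)

# Design matrix with an intercept column, solved by least squares
X = np.column_stack([np.ones_like(bp), weight, height, age])
coef, *_ = np.linalg.lstsq(X, bp, rcond=None)

# Residuals and the Durbin-Watson statistic computed from them
residuals = bp - X @ coef
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
```

With an intercept in the model the residuals sum to zero, and the resulting Durbin-Watson statistic must lie between 0 and 4; a value far below 2, as reported here, points to positive autocorrelation.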

 

The table below gives the results of the DWH test.

 

p-value = 1.3 Fail to reject the null hypothesis of no endogeneity (p-value > 0.05)

 

Based on the multiple assumption violations, we should not use this model.

Danica

Located in Prague and studying to become a Statistician, I enjoy reading, writing, and exploring new places.