What is Regression?

The statistics involved in data analysis all have one thing in common: trying to find patterns within data. Patterns are essential, not only because they help us process everyday phenomena, but also because they can help us try to make predictions about the future.

 

One of the tools you can use to make predictions about the future is regression modelling. This section gives a brief overview of the concepts involved in regression, as well as practice problems you can use to test what you’ve learned.

 


Linear Correlation Coefficient

Most people have heard of the concept of correlation, but not many understand what it actually is. The most common correlation coefficient is the Pearson product-moment correlation coefficient, which is simply a statistic that tells us how closely related two variables are.

 

It is called a linear correlation coefficient because it measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1. In a perfect linear relationship, every data point lies exactly on a straight line, so a one-unit change in one variable always comes with the same fixed change in the other. A perfect positive linear relationship corresponds to a correlation coefficient of 1, a perfect negative one to -1, and a value of 0 indicates no linear relationship.
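
As a rough illustration of the idea above, the following Python sketch computes the sample Pearson correlation coefficient as the covariance divided by the product of the standard deviations. The function name `pearson_r` and the data values are made up for this example and are not part of the guide.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation: Cov(x, y) / (s_x * s_y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and sample standard deviations (divide by n - 1)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (s_x * s_y)


# Hypothetical data: both series lie exactly on a straight line
hours = [1, 2, 3, 4, 5]
rising = [3, 5, 7, 9, 11]    # increases by 2 for every extra hour
falling = [10, 8, 6, 4, 2]   # decreases by 2 for every extra hour

print(pearson_r(hours, rising))   # close to 1 (perfect positive)
print(pearson_r(hours, falling))  # close to -1 (perfect negative)
```

Real data rarely falls exactly on a line, so in practice the coefficient usually lands somewhere strictly between -1 and 1.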

 

Simple Linear Regression

The correlation coefficient is only one part of analysing the relationship between variables. Linear regression modelling is another tool you can employ to analyse variables. In fact, it is one of the most powerful and commonly used methods of analysis in statistics.

 

Simple linear regression is a linear regression model that has only one independent variable and one dependent variable. An independent variable is the variable that you want to use to study a dependent variable. You can think of the dependent variable as the one you’re interested in studying.

 

The most common form of regression is ordinary least squares (OLS), which finds the line of best fit by minimising the sum of the squared vertical distances (the residuals) between the data points and the regression line.
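
A minimal sketch of how OLS estimates can be computed for one independent variable is shown below. It uses the standard closed-form solution (slope equals the sum of cross-deviations divided by the sum of squared deviations of x). The function name `fit_simple_ols` and the data are assumptions made for illustration only.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-deviations and squared deviations from the means
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    if s_xx == 0:
        raise ValueError("x must take at least two different values")
    b1 = s_xy / s_xx             # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


# Hypothetical data, not taken from the practice problems
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]
b0, b1 = fit_simple_ols(x, y)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

The slope is read as the estimated change in the dependent variable for a one-unit increase in the independent variable.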

 

Multiple Linear Regression

Often, you’ll be interested in seeing how more than one independent variable affects a dependent variable. When your regression model includes one dependent variable and two or more independent variables, it is called multiple linear regression.

 

You can think of multiple linear regression as the extension of simple linear regression and OLS. The concept is that each independent variable should be able to explain more variability, or changes, in the dependent variable.
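
For the case of exactly two independent variables, the estimates can be sketched with the deviation-sum formulas for b_{1} and b_{2} listed in the equation table of this guide. The function name `fit_two_predictor_ols` and the data below are hypothetical; the data were built so that y = 1 + 2x_{1} + 3x_{2} holds exactly, which makes the result easy to check.

```python
def fit_two_predictor_ols(x1, x2, y):
    """Return (b0, b1, b2) for y = b0 + b1*x1 + b2*x2 using deviation sums."""
    n = len(y)
    if not (len(x1) == len(x2) == n) or n < 3:
        raise ValueError("need three equally long lists with at least 3 rows")
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Deviations from the mean for each variable
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    denom = s11 * s22 - s12 ** 2
    if denom == 0:
        raise ValueError("x1 and x2 are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / denom
    b2 = (s11 * s2y - s12 * s1y) / denom
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Hypothetical data constructed from y = 1 + 2*x1 + 3*x2 (no noise)
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [9, 8, 19, 18, 29]
print(fit_two_predictor_ols(x1, x2, y))  # approximately (1, 2, 3)
```

Each slope is interpreted holding the other independent variable fixed, which is the main difference from running two separate simple regressions.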

 

Gauss-Markov Theorem

The Gauss-Markov theorem is a concept that you will encounter a lot when dealing with linear regression models. This theorem states that if a specific set of assumptions is met, then the OLS estimators are unbiased and have the smallest variance of all linear unbiased estimators. This is why OLS estimators are called BLUE: the best linear unbiased estimators.

 

The six classic assumptions of the Gauss-Markov theorem are:

  1. The regression model is linear in the coefficients and the error term
  2. The error term has an expected value of zero
  3. Homoscedasticity: the conditional variance of the error term is constant for all observations
  4. The error terms are independently distributed and are not correlated with each other
  5. No independent variable is correlated with the error term
  6. No perfect multicollinearity among the independent variables
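
What the theorem promises can be illustrated with a small simulation. The sketch below is only an illustration under made-up settings (true intercept 1, true slope 2, normally distributed errors with constant variance, a fixed seed), and the "endpoint" estimator is a hypothetical rival chosen for this example: it is also linear and unbiased, but it uses only the first and last observations.

```python
import random


def ols_slope(x, y):
    """OLS slope for a simple regression with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx


def endpoint_slope(x, y):
    """Rival estimator: slope of the line through the first and last points."""
    return (y[-1] - y[0]) / (x[-1] - x[0])


def variance(values):
    """Sample variance (divides by n - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def simulate(true_b0, true_b1, x, sigma, n_samples, seed):
    """Draw many samples from the same model and estimate the slope each time."""
    rng = random.Random(seed)
    ols, rival = [], []
    for _ in range(n_samples):
        # Errors have mean zero, constant variance and are independent
        y = [true_b0 + true_b1 * xi + rng.gauss(0.0, sigma) for xi in x]
        ols.append(ols_slope(x, y))
        rival.append(endpoint_slope(x, y))
    return ols, rival


x = list(range(10))
ols, rival = simulate(1.0, 2.0, x, 1.0, 2000, seed=0)
print("mean of OLS slopes:     ", round(sum(ols) / len(ols), 3))
print("mean of endpoint slopes:", round(sum(rival) / len(rival), 3))
print("variance of OLS slopes:     ", round(variance(ols), 4))
print("variance of endpoint slopes:", round(variance(rival), 4))
```

Both estimators should average out close to the true slope of 2, which is what unbiased means, while the OLS estimates should be spread less widely around it, which is what "best" refers to. A simulation like this does not prove the theorem; it only shows the behaviour the theorem describes for one particular rival.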

 

Matrix Multiple Linear Regression

In higher-level statistics, you’re likely to encounter the equation below, where capital letters indicate matrices.

    \[ Y = XA + U \]

 

While it may look very similar to the regular SLR model, there is one key difference: all variables and parameters are in matrix notation. Matrices are extremely important to understand if you want to delve deeper into statistics - specifically matrix properties and matrix multiplication.
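
Written out for n observations and k independent variables, the matrix form of the model can be sketched as follows. The symbols follow the equation table in this guide (Y for the dependent variable, X for the independent variables, A for the coefficients, U for the residuals); the column of ones in X, which carries the intercept b_{0}, and the ordering XA, chosen so that the dimensions conform, are assumptions of this sketch.

    \[ \underbrace{\begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{pmatrix}}_{Y \; (n \times 1)} = \underbrace{\begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}}_{X \; (n \times (k+1))} \underbrace{\begin{pmatrix} b_{0} \\ b_{1} \\ \vdots \\ b_{k} \end{pmatrix}}_{A \; ((k+1) \times 1)} + \underbrace{\begin{pmatrix} u_{1} \\ u_{2} \\ \vdots \\ u_{n} \end{pmatrix}}_{U \; (n \times 1)} \]

Each row of this matrix equation is one observation: y_{i} = b_{0} + b_{1}x_{i1} + … + b_{k}x_{ik} + u_{i}.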

 

Regression Problems

In this section, you will find a range of practice problems that you can use to solidify or test your knowledge of basic to advanced regression concepts. Try to solve the problems on your own with the help of the other guides on this resource site and the equations below. If you’re struggling to reach a solution, check out the step-by-step answers on each problem’s respective page.

 

Problem 3

You are interested in knowing the relationship between the weather and tourism levels. To investigate, you collect data from the tourist centre of a city during one month in the summer, counting the number of people that arrive at the square at the same time every day. Given the data set below, what is the correlation between temperature and tourism? Interpret the correlation and name a few other reasons why these two variables may or may not be related.

 

Temperature   Number of Visitors
12            87
21            150
20            110
25            90
17            85
15            70
13            90

 

Problem 4

There are two variables that need to be studied: weight loss and days spent exercising in one month. You are given a data set in which individuals have been asked the number of days on which they exercised for more than half an hour in one month. What kind of regression model can you use here? What are the results of this regression given the data set below? Interpret the model’s estimators.

 

Exercise Days   Weight Loss (in kg)
0               4
4               1
8               1.5
12              2
16              4
20              5
24              2

 

Problem 5

You’re curious about which factors play into the salary people earn. In order to find out, you’d like to conduct a multiple linear regression analysis on data that has the salary, education level in years, and work experience in years for 15 individuals. Conduct a multiple regression analysis by finding the regression model for the following data set.

 

Education (years)   Experience (years)   Salary
11                  10                   30000
11                  6                    27000
12                  10                   20000
12                  5                    25000
13                  5                    29000
14                  6                    35000
14                  5                    38000
16                  8                    40000
16                  7                    45000
16                  2                    28000
18                  6                    30000
18                  2                    55000
22                  5                    65000
23                  2                    25000
24                  1                    75000

 

Problem 5.2

In the last problem, you were asked to build a multiple regression model based on the given data set. This data set covered 15 individuals and recorded each person’s salary, education level in years and work experience in years.

 

Now that you have the multiple regression model, interpret what these results mean. Explain the meaning of each linear estimator, providing one example of interpolation and one of extrapolation.

 

Next, see what would happen to the interpretation of your results if you transformed the salary variable into logarithms. Give an example of what this might do to the interpretation of your regression model.

 

Problem 6

A classmate of yours is having trouble understanding what makes ordinary least squares, under the Gauss-Markov theorem, the best linear estimators as opposed to all the other estimators. Given the following example, explain what BLUE estimators mean and why they are important.

 

A sample is taken from a population that measures the money the observed companies spend on advertising and the amount of sales that they make in one month.

 

Problem 6.2

You are given the following dataset and a multiple regression model that explores the relationship between blood pressure, weight, height and age. You’d like to conduct a multiple regression analysis but first want to check the 6 OLS assumptions. Through the use of graphs and statistics, do you think this model passes each assumption? Explain why or why not.

 

Blood Pressure   Weight   Height   Age
105              75       172      19
106              80       175      18
108              89       170      20
110              90       174      20
113              93       178      21
115              95       179      22
118              96       180      24
119              99       183      25
120              101      185      29
122              102      188      30

 

Problem 8

A shop owner is interested in understanding the demand for certain goods in her store based on their price. In order to help the shop owner, you’re tasked with conducting a simple regression analysis. Given the following data set, your first task is to explain how to format these observations into matrices, along with the benefits of using matrices in statistics.

 

Price   Demand
120     110
125     100
130     90
135     45
140     20

 

Problem 9

In the previous problem, you were asked to format the data into matrices. Now, using these matrices, find the regression model equation and interpret the results in terms of what this means for the shop owner.

 

Equation Table

In this table, you will find all the equations you will need to use in order to solve the practice problems. If you’re having trouble understanding any of the formulas, make sure to review each page dealing with these formulas and the concepts behind them.

 

Hypothesis Type   Description   Test Result
H_{0}   The null hypothesis: states that a parameter is equal to (or at most, or at least) a hypothesized value   H_{0} is rejected when p-value < 0.05 (at the 5% significance level)
H_{1}, H_{a}   The alternative hypothesis: states that a parameter is not equal to, less than, or greater than the hypothesized value   H_{1}, H_{a} is accepted when p-value < 0.05 (at the 5% significance level)

 

Center & Spread Formulas   Equations
\rho_{xy}

    \[ \frac{Cov(x,y)}{\sigma_{x} \sigma_{y}} \]

Cov(x,y)

    \[ \frac{\sum_{i=1}^{n}(x_{i}- \bar{x})(y_{i}- \bar{y})}{n-1} \]

\sigma_{x}

    \[ \sqrt{\frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^2}{n-1}} \]

\sigma_{y}

    \[ \sqrt{\frac{\sum_{i=1}^{n}(y_{i} - \bar{y})^2}{n-1}} \]

 

Equations   Regression Formulas
y_{i} = b_{0} + b_{1}x_{i} + u_{i}   Simple linear regression
y_{i} = b_{0} + b_{1}x_{1i} + … + b_{k}x_{ki} + u_{i}   Multiple linear regression
u_{i} = y_{i} - \hat{y}_{i}   Residual
Y = XA + U   MLR Matrix
SSE = U^{T}U   SSE Matrix
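
The residual and SSE rows can be made concrete with a short sketch. The helper names `residuals` and `sse`, the data and the candidate line \hat{y} = 1 + 2x are all made up for illustration; the line is not claimed to be the OLS fit for these points.

```python
def residuals(x, y, b0, b1):
    """Residual u_i = y_i - y_hat_i for the line y_hat = b0 + b1 * x."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def sse(u):
    """Sum of squared errors, the scalar U^T U for a residual vector U."""
    return sum(ui * ui for ui in u)


# Hypothetical observations and a hypothetical candidate line
x = [1, 2, 3]
y = [3.5, 4.5, 7.5]
b0, b1 = 1.0, 2.0

u = residuals(x, y, b0, b1)
print(u)       # [0.5, -0.5, 0.5]
print(sse(u))  # 0.75
```

OLS chooses the b_{0} and b_{1} that make this SSE as small as possible, so trying other candidate lines on the same data is a quick way to see what "best fit" means.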

 

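Parameters and Variables   Equations
b_{1}   \frac{ (\sum x_{2}^2) (\sum x_{1}y) - (\sum x_{1}x_{2})(\sum x_{2}y) }{ (\sum x_{1}^2)(\sum x_{2}^2) - (\sum x_{1} x_{2})^2 }
b_{2}   \frac{ (\sum x_{1}^2) (\sum x_{2}y) - (\sum x_{1}x_{2})(\sum x_{1}y) }{ (\sum x_{1}^2)(\sum x_{2}^2) - (\sum x_{1} x_{2})^2 }
\sum x_{i}^2   \sum x_{i}^2 - \frac{(\sum x_{i})^2}{N}
\sum x_{i}y   \sum x_{i}y - \frac{(\sum x_{i})(\sum y)}{N}
A   (X^{T}X)^{-1}X^{T}Y

Note: the formulas for b_{1} and b_{2} apply to a model with two independent variables, and the sums in them are sums of deviations from the mean. In the third and fourth rows, the left-hand side is that deviation sum and the right-hand side shows how to compute it from the raw data values.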

 

Matrix Operations   Rules
Matrix Multiplication   The inner dimensions have to match: (n x k) x (k x p)

Resulting matrix is (n x p)

Each entry is the sum of the products of a row of the first matrix with a column of the second

Inverse of 2 x 2   For the matrix with rows (a, b) and (c, d), find the determinant ad - bc, which must not be zero

Swap a and d

Make b and c negative

Multiply every entry by \frac{1}{ad - bc}

Inverse of 3 x 3 and above   Perform a set of elementary row operations (subtraction, addition, multiplication and division)
Transpose   The matrix (n x k) becomes (k x n)

Rows become columns
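
The matrix rules above can be tried out with a few lines of plain Python. The sketch below assumes the usual matrix formula for the coefficients, A = (X^{T}X)^{-1}X^{T}Y, and uses hypothetical helper names (`transpose`, `matmul`, `inverse_2x2`) and made-up data that follow y = 1 + 2x exactly. Matrices are represented as lists of rows.

```python
def transpose(m):
    """Rows become columns: an (n x k) matrix becomes (k x n)."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x p) matrix, giving (n x p)."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def inverse_2x2(m):
    """Swap a and d, negate b and c, then divide by the determinant."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular and has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical data with y = 1 + 2x; the column of ones carries the intercept
X = [[1, 0], [1, 1], [1, 2], [1, 3]]   # 4 x 2
Y = [[1], [3], [5], [7]]               # 4 x 1

Xt = transpose(X)                      # 2 x 4
XtX_inv = inverse_2x2(matmul(Xt, X))   # 2 x 2
A = matmul(XtX_inv, matmul(Xt, Y))     # 2 x 1: intercept, then slope
print(A)  # approximately [[1.0], [2.0]]
```

With one independent variable, X^{T}X is only 2 x 2, so the simple inverse rule is enough. Larger models need a general inversion method, which is normally left to a numerical library.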

 


Danica

Located in Prague and studying to become a Statistician, I enjoy reading, writing, and exploring new places.