September 30, 2020
What is Regression?
The statistics involved in data analysis all have one thing in common: trying to find patterns within data. Patterns are essential, not only because they help us process everyday phenomena, but also because they can help us try to make predictions about the future.
One of the tools you can use to make predictions about the future is regression modelling. This section gives a brief overview of the concepts involved in regression, as well as practice problems you can use to test what you’ve learned.
Linear Correlation Coefficient
Most people have heard of the concept of correlation, but not many understand what it actually is. The most common correlation coefficient is the Pearson product-moment correlation coefficient, which is simply a statistic that tells us how closely related two variables are.
It is called a linear correlation coefficient because it measures the strength of the linear relationship between two variables. A perfect linear relationship describes a situation in which every change in one variable is matched by an exactly proportional change in the other, so that all data points fall on a straight line. A perfect positive linear relationship corresponds to a correlation coefficient of 1, a perfect negative one to a coefficient of -1, and a coefficient of 0 indicates no linear relationship.
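As a minimal sketch of how the coefficient behaves, the Python snippet below computes it from scratch for two small samples. The function name `pearson_r` and the data (hours studied and test scores) are made up for illustration and do not come from any of the practice problems.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of deviations from the means
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    # Note: undefined (division by zero) if either sample is constant
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: scores that rise or fall by exactly 2 per hour
hours = [1, 2, 3, 4, 5]
score_up = [52, 54, 56, 58, 60]
score_down = [60, 58, 56, 54, 52]

r_up = pearson_r(hours, score_up)      # perfect positive linear relationship
r_down = pearson_r(hours, score_down)  # perfect negative linear relationship
print(r_up, r_down)
```

Because both made-up samples lie exactly on a straight line, the two coefficients sit at the extremes of the scale; real data will usually land somewhere in between.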
Simple Linear Regression
The correlation coefficient is only one part of analysing the relationship between variables. Linear regression modelling is another tool you can employ to analyse variables. In fact, it is one of the most powerful and commonly used methods of analysis in statistics.
Simple linear regression is a linear regression model that has only one independent variable and one dependent variable. An independent variable is the variable that you want to use to study a dependent variable. You can think of the dependent variable as the one you’re interested in studying.
The most common form of regression is ordinary least squares (OLS), which is a type of regression model that finds the best fit for the data by minimising the sum of the squared vertical distances (the residuals) between the regression line and the data points.
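To make the least squares idea concrete, here is a small sketch in Python that fits a simple linear regression using the usual closed-form estimates (slope equal to the sum of cross-products divided by the sum of squares of x, intercept equal to the mean of y minus the slope times the mean of x). The function names and the data are hypothetical and chosen only for illustration.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimise the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Hypothetical data, roughly following y = 2x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

b0, b1 = fit_simple_ols(x, y)
sse_fit = sum_squared_residuals(x, y, b0, b1)
# Any other line, for example one with a slightly steeper slope, fits worse
sse_other = sum_squared_residuals(x, y, b0, b1 + 0.1)
print(b0, b1, sse_fit, sse_other)
```

Comparing `sse_fit` with `sse_other` shows what "best fit" means here: moving the line away from the OLS estimates can only increase the sum of squared residuals.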
Multiple Linear Regression
Oftentimes, you’ll be interested in seeing how more than one independent variable affects a dependent variable. When your regression model includes one dependent variable and two or more independent variables, it is called multiple linear regression.
You can think of multiple linear regression as the extension of simple linear regression and OLS. The idea is that each additional independent variable should help explain more of the variability, or changes, in the dependent variable.
Gauss-Markov Theorem
The Gauss-Markov theorem is a concept that you will encounter a lot when dealing with linear regression models. This theorem states that if a specific set of assumptions is met, then the OLS estimators will be unbiased and have the smallest variance out of all of the possible linear unbiased estimators.
The six classic assumptions of the Gauss-Markov theorem are:
 Regression model is linear in coefficients and the error term
 The error term has an expected value of zero
 Homoscedasticity: the conditional variance of the error term is constant for all observations
 The error terms are independently distributed and are not correlated with each other
 No independent variable is correlated with the error term
 No multicollinearity
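Some of these assumptions can be screened with simple statistics. As one rough sketch, the Python snippet below flags pairs of independent variables that are very highly correlated with each other, which is a common first check for multicollinearity. The helper names, the threshold of 0.9 and the data are all assumptions made for illustration; pairwise correlations can also miss multicollinearity that involves three or more variables at once.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def flag_collinear_pairs(predictors, threshold=0.9):
    """Return (name, name, r) for predictor pairs with |r| above the threshold.

    The 0.9 default is only a rule-of-thumb assumption, not a formal test.
    """
    names = list(predictors)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson_r(predictors[names[i]], predictors[names[j]])
            if abs(r) > threshold:
                flagged.append((names[i], names[j], round(r, 3)))
    return flagged

# Hypothetical predictors: the two height columns carry almost the same information
predictors = {
    "height_cm": [160, 165, 170, 175, 180, 185],
    "height_in": [63.0, 65.0, 66.9, 68.9, 70.9, 72.8],
    "age": [34, 21, 45, 29, 52, 38],
}

flagged = flag_collinear_pairs(predictors)
print(flagged)
```

In this made-up example only the two height columns are flagged, which suggests dropping one of them before fitting a model.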
Matrix Multiple Linear Regression
In higher level statistics, you’re likely to encounter the regression equation written in matrix form, where capital letters indicate a matrix.
While it may look very similar to the regular simple linear regression (SLR) model, there is one key difference: all variables and parameters are in matrix notation. Matrices are extremely important to understand if you want to delve deeper into statistics, specifically matrix properties and matrix multiplication.
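The sketch below shows one way the matrix form can be used in practice. It assumes the model is written in the usual textbook way, Y = Xβ + ε, where X has a leading column of ones for the intercept, and it estimates β by solving the normal equations (X'X)b = X'Y. The function names and the data are hypothetical; the data were generated exactly from y = 1 + 2*x1 + 3*x2 so that the result is easy to check.

```python
def transpose(m):
    """Rows become columns."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x p) matrix."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    b_t = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in b_t] for row in a]

def solve(a, rhs):
    """Solve a @ x = rhs (a is k x k, rhs is k x 1) by Gauss-Jordan elimination."""
    k = len(a)
    aug = [list(a[i]) + [rhs[i][0]] for i in range(k)]
    for col in range(k):
        # Partial pivoting: use the row with the largest entry in this column
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular (perfect multicollinearity?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [[aug[i][k] / aug[i][i]] for i in range(k)]

def ols_matrix(X, y):
    """Estimate the coefficients by solving the normal equations (X'X) b = X'y."""
    X_t = transpose(X)
    return solve(matmul(X_t, X), matmul(X_t, y))

# Hypothetical data: columns of X are [1 (intercept), x1, x2]
X = [[1, 1, 2], [1, 2, 1], [1, 3, 4], [1, 4, 3], [1, 5, 6]]
y = [[9], [8], [19], [18], [29]]

beta_hat = ols_matrix(X, y)
print(beta_hat)
```

Solving the normal equations directly is fine for small textbook examples like this one; statistical software typically relies on more numerically stable decompositions.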
Regression Problems
In this section, you will find a range of practice problems that you can use to solidify or test your knowledge of basic to advanced regression concepts. Try to solve the problems on your own, using the other guides on this resource site and the equations below. If you’re struggling to reach a solution, check out the step-by-step answers on each problem’s respective page.
Problem 3
You are interested in knowing the relationship between the weather and tourism levels. To investigate, you collect data from the touristic centre in a city during one month in the summer, counting the number of people that arrive at the square at the same time every day. Given the data set below, what is the correlation between temperature and tourism? Interpret the correlation and name a few other reasons why these two variables are or are not related.
Temperature  Number of Visitors 
12  87 
21  150 
20  110 
25  90 
17  85 
15  70 
13  90 
Problem 4
There are two variables that need to be studied: weight loss and days spent exercising in one month. You are given a data set in which individuals have been asked how many days in one month they exercise for more than half an hour. What kind of regression model can you use here? What are the results of this regression given the data set below? Interpret the model’s estimators.
Exercise Days  Weight Loss (in kg) 
0  4 
4  1 
8  1.5 
12  2 
16  4 
20  5 
24  2 
Problem 5
You’re curious about which factors play into the salary people earn. In order to find out, you’d like to conduct a multiple linear regression analysis on data that has the salary, education level in years, and work experience in years for 15 individuals. Conduct a multiple regression analysis by finding the regression model for the following data set.
Education  Experience  Salary 
11  10  30000 
11  6  27000 
12  10  20000 
12  5  25000 
13  5  29000 
14  6  35000 
14  5  38000 
16  8  40000 
16  7  45000 
16  2  28000 
18  6  30000 
18  2  55000 
22  5  65000 
23  2  25000 
24  1  75000 
Problem 5.2
In the last problem, you were asked to build a multiple regression model based on the given data set. This data set dealt with information on 15 individuals and recorded, for each of them, their salary, education level in years and work experience in years.
Now that you have the multiple regression model, interpret what these results mean. Explain the meaning of each linear estimator, providing one example of interpolation and one of extrapolation.
Next, see what would happen to the interpretation of your results if you transformed the salary variable into logarithms. Give an example of what this might do to the interpretation of your regression model.
Problem 6
A classmate of yours is having trouble understanding what makes the ordinary least squares estimators, under the Gauss-Markov theorem, the best linear unbiased estimators (BLUE) as opposed to all the other estimators. Given the following example, explain what BLUE estimators mean and why they are important.
A sample is taken from a population that measures the money the observed companies spend on advertising and the amount of sales that they make in one month.
Problem 6.2
You are given the following data set and a multiple regression model that explores the relationship between blood pressure, weight, height and age. You’d like to conduct a multiple regression analysis but first want to check the six OLS assumptions. Through the use of graphs and statistics, do you think this model passes each assumption? Explain why or why not.
Blood Pressure  Weight  Height  Age 
105  75  172  19 
106  80  175  18 
108  89  170  20 
110  90  174  20 
113  93  178  21 
115  95  179  22 
118  96  180  24 
119  99  183  25 
120  101  185  29 
122  102  188  30 
Problem 8
A shop owner is interested in understanding the demand for certain goods in her store based on their price. In order to help the shop owner, you’re tasked with conducting a simple regression analysis. Given the following data set, your first task is to explain how to format these observations into matrices, along with the benefits of using matrices in statistics.
Price  Demand 
120  110 
125  100 
130  90 
135  45 
140  20 
Problem 9
In the previous problem, you were asked to format the data into matrices. Now, using these matrices, find the regression model equation and interpret the results in terms of what this means for the shop owner.
Equation Table
In this table, you will find all the equations you will need to use in order to solve the practice problems. If you’re having trouble understanding any of the formulas, make sure to review each page dealing with these formulas and the concepts behind them.
Hypothesis Type  Description  Test Result 
Null hypothesis  States that a parameter is equal to a hypothesized value (or, for one-sided tests, at most or at least that value)  Rejected when p-value < 0.05 
Alternative hypothesis  States that a parameter is not equal to, less than, or greater than a hypothesized value  Accepted when p-value < 0.05 
Center & Spread Formulas  Equations 
 
 
 

Equations  Regression Formulas 
Simple linear regression  
Multiple linear regression  
Residual  
MLR Matrix  
SSE Matrix 
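For reference, the regression formulas named in this table are usually written as follows in standard textbook notation. The symbols are assumptions and may differ from the notation used on the individual pages: $\beta$ denotes the coefficients, $\varepsilon$ the error term, $e$ the residuals, and a hat marks an estimated or fitted value.

```latex
\begin{align*}
\text{Simple linear regression:}\quad & y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \\
\text{Multiple linear regression:}\quad & y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i \\
\text{Residual:}\quad & e_i = y_i - \hat{y}_i \\
\text{MLR in matrix form:}\quad & Y = X\beta + \varepsilon \\
\text{SSE in matrix form:}\quad & e^{\top} e = (Y - X\hat{\beta})^{\top} (Y - X\hat{\beta})
\end{align*}
```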
Parameters and Variables  Equations 
Matrix Operations  Rules 
Matrix Multiplication  The inner dimensions have to match: (n x k) x (k x p). The resulting matrix is (n x p). Row values are multiplied by column values and summed. 
Inverse of 2 x 2  For a matrix with rows (a, b) and (c, d): find the determinant ad - bc, switch a and d, make b and c negative, then divide every entry by the determinant. 
Inverse of 3 x 3 and above  Perform a set of elementary row operations (subtraction, addition, multiplication and division). 
Transpose  The matrix (n x k) becomes (k x n). Rows become columns. 
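The matrix rules above can be sketched in a few lines of plain Python. The function names and example matrices are illustrative assumptions; the 2 x 2 inverse follows the standard formula of swapping the diagonal entries, negating the off-diagonal entries and dividing by the determinant.

```python
def transpose(m):
    """An (n x k) matrix becomes (k x n): rows become columns."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply (n x k) by (k x p); the result is (n x p)."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    b_t = transpose(b)
    # Each entry is a row of a multiplied by a column of b, then summed
    return [[sum(p * q for p, q in zip(row, col)) for col in b_t] for row in a]

def inverse_2x2(m):
    """Inverse of [[a, b], [c, d]] using the determinant ad - bc."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular and has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[1, 2], [3, 4], [5, 6]]   # 3 x 2
B = [[1, 0, 2], [0, 1, 3]]     # 2 x 3
AB = matmul(A, B)              # 3 x 3
A_t = transpose(A)             # 2 x 3

M = [[4, 7], [2, 6]]
M_inv = inverse_2x2(M)
identity = matmul(M, M_inv)    # approximately the 2 x 2 identity matrix
print(AB, A_t, M_inv, identity)
```

Multiplying a matrix by its inverse should return the identity matrix (up to floating-point rounding), which is a quick way to check an inverse computed by hand.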