Understanding Linear Regression

Sometimes it’s amazing how very simple concepts can actually be made so complicated, and for little reason. For those of us who don’t have a stats background but want or need to understand statistical concepts, it doesn’t help that they are often taught in a way that makes them hard to follow. Here I will touch on understanding linear regression and how it’s used.

Linear regression consists of finding the best-fitting straight line through the points. The best-fitting line is called a regression line. The black diagonal line below is the regression line and consists of the predicted score on Y for each possible value of X. The vertical lines from the points to the regression line represent the errors of prediction. As you can see, the red point is very near the regression line; its error of prediction is small. By contrast, the yellow point is much higher than the regression line and therefore its error of prediction is large.

The error of prediction for a point is the value of the point minus the predicted value (the value on the line). The table below shows the predicted values (Y’) and the errors of prediction (Y-Y’). For example, the first point has a Y of 1.00 and a predicted Y (called Y’) of 1.21. Therefore, its error of prediction is -0.21.
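As a quick check, the error for that first point can be worked out in a couple of lines of Python. This is only a sketch: the function name prediction_error is made up for illustration, and the numbers are the ones quoted above.

```python
def prediction_error(y, y_pred):
    """Error of prediction: the actual value minus the value on the line."""
    return y - y_pred

# First point from the text: Y = 1.00, predicted Y' = 1.21
error = prediction_error(1.00, 1.21)
print(round(error, 2))  # -0.21
```

A negative error means the point sits below the regression line; a positive error means it sits above.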



You may have noticed that we did not specify what is meant by “best-fitting line.” By far the most commonly used criterion is that the best-fitting line is the one that minimizes the sum of the squared errors of prediction. That is the criterion that was used to find the line in the graph above.
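To make the criterion concrete, here is a rough Python sketch that adds up the squared errors for a few candidate lines. Two things are assumptions on my part: the five data points (chosen so that they reproduce the means, standard deviations and correlation quoted in this post, since the original table is not shown here) and the helper name sum_squared_errors. The slope 0.425 and intercept 0.785 are the regression line values worked out in this post.

```python
# Assumed data: five (X, Y) points consistent with the statistics
# quoted in this post; the original table is not reproduced here.
xs = [1, 2, 3, 4, 5]
ys = [1.00, 2.00, 1.30, 3.75, 2.25]

def sum_squared_errors(xs, ys, b, a):
    """Sum of squared errors of prediction for the line Y' = b*X + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

# The regression line from this post, plus two nearby lines for comparison
best = sum_squared_errors(xs, ys, 0.425, 0.785)
steeper = sum_squared_errors(xs, ys, 0.500, 0.785)  # same intercept, steeper slope
higher = sum_squared_errors(xs, ys, 0.425, 1.000)   # same slope, higher intercept

print(best, steeper, higher)
```

On this data the regression line gives the smallest total of the three, and any other slope or intercept you try will give a larger one.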

The formula for a regression line is:  Y’ = bX + A where Y’ is the predicted score, b is the slope of the line, and A is the Y intercept.

Computing the Regression Line

In the age of computers, the regression line is typically computed with statistical software. However, the calculations are relatively easy, and are given here for anyone who is interested. The calculations are based on the statistics shown in the table below. MX is the mean of X, MY is the mean of Y, sX is the standard deviation of X, sY is the standard deviation of Y, and r is the correlation between X and Y.

Statistics for computing regression line:

MX = 3, MY = 2.06, sX = 1.581, sY = 1.072, r = 0.627


The slope (b) can be calculated as follows: b = r sY/sX, and the intercept (A) can be calculated as: A = MY – bMX. For these data, b = (0.627)(1.072)/1.581 = 0.425 and A = 2.06 – (0.425)(3) = 0.785.
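The same arithmetic can be sketched in Python. The function name regression_line and its parameter names are my own, chosen for illustration; the inputs are the statistics quoted above.

```python
def regression_line(r, s_x, s_y, m_x, m_y):
    """Return (slope, intercept) of the regression line Y' = b*X + A."""
    b = r * s_y / s_x   # slope: correlation times the ratio of standard deviations
    a = m_y - b * m_x   # intercept: makes the line pass through (MX, MY)
    return b, a

b, a = regression_line(r=0.627, s_x=1.581, s_y=1.072, m_x=3, m_y=2.06)
print(round(b, 3), round(a, 3))  # 0.425 0.785
```

Note that the intercept formula forces the line through the point (MX, MY), so predicting Y at the mean of X always gives the mean of Y.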

A Real Example

The case study “SAT and College GPA” contains high school and university grades for 105 computer science majors at a local state school. We now consider how we could predict a student’s university GPA if we knew his or her high school GPA. The scatter plot below shows the University GPA as a function of High School GPA. You can see from the figure that there is a strong positive relationship. The correlation is 0.78. The regression equation is:

University GPA’ = (0.675)(High School GPA) + 1.097

Therefore, a student with a high school GPA of 3 would be predicted to have a university GPA of

University GPA’ = (0.675)(3) + 1.097 = 3.12.
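If you want to try other values, the regression equation is easy to wrap in a small Python function. This is just a sketch: the name predict_university_gpa is hypothetical, and the slope and intercept are the ones from the equation above.

```python
SLOPE = 0.675      # b, from the regression equation above
INTERCEPT = 1.097  # A, from the regression equation above

def predict_university_gpa(high_school_gpa):
    """Predicted university GPA for a given high school GPA."""
    return SLOPE * high_school_gpa + INTERCEPT

print(round(predict_university_gpa(3), 2))  # 3.12
```

Keep in mind that the equation was fit to the students in the case study, so predictions for GPAs far outside the range of that data should be treated with caution.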

In conclusion, you get x and y data, plot it, then use the statistics above to compute the regression line and plot the line. Then, importantly, in order to judge how much to trust a predicted value (as in the example above, where you predict a university GPA from a high school GPA), you need to know the correlation between X and Y. This is important because if the correlation is very low, your predictions will not be accurate; if it is higher than, say, 0.75 or 0.80, your predictions will be much more accurate.

Please let me know if this helps, is confusing, or if you have any questions :)

Source: StatBook