Sometimes it’s amazing how very simple concepts can actually be made so complicated, and for little reason. For those of us who don’t have a stats background but want or need to understand statistical concepts, it doesn’t help that they are often taught in a way that makes them hard to understand. Here I will touch on understanding linear regression and how it’s used.
Linear regression consists of finding the best-fitting straight line through the points. The best-fitting line is called a regression line. The black diagonal line below is the regression line and consists of the predicted score on Y for each possible value of X. The vertical lines from the points to the regression line represent the errors of prediction. As you can see, the red point is very near the regression line; its error of prediction is small. By contrast, the yellow point is much higher than the regression line and therefore its error of prediction is large.
The error of prediction for a point is the value of the point minus the predicted value (the value on the line). The table below shows the predicted values (Y’) and the errors of prediction (Y-Y’). For example, the first point has a Y of 1.00 and a predicted Y (called Y’) of 1.21. Therefore, its error of prediction is -0.21.
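To make the definition concrete, here is a minimal Python sketch that reproduces the first point's error from the text. The function name `prediction_error` is my own, chosen for illustration only.

```python
def prediction_error(y, y_pred):
    """Error of prediction: the observed value minus the predicted value."""
    return y - y_pred


# First point from the text: Y = 1.00 and Y' = 1.21
error = prediction_error(1.00, 1.21)
print(round(error, 2))  # -0.21
```

A negative error means the point lies below the regression line; a positive error means it lies above.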
You may have noticed that we did not specify what is meant by “best-fitting line.” By far the most commonly used criterion is that the best-fitting line is the one that minimizes the sum of the squared errors of prediction. That is the criterion that was used to find the line in the graph above.
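One way to see the criterion in action is to score two candidate lines by their sum of squared errors and keep the one with the smaller total. The sketch below does that in Python. The data points and both candidate lines are made up for illustration and are not the data behind the graph.

```python
def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared errors of prediction for the line y' = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for this example only
xs = [1, 2, 3, 4]
ys = [1.0, 2.5, 2.5, 4.0]

# Two hypothetical candidate lines
sse_a = sum_squared_errors(xs, ys, slope=0.9, intercept=0.25)
sse_b = sum_squared_errors(xs, ys, slope=1.0, intercept=0.0)

# The candidate with the smaller sum of squared errors fits better
better = "A" if sse_a < sse_b else "B"
print(better)
```

Squaring the errors keeps positive and negative errors from cancelling each other out, and it penalizes large misses more heavily than small ones.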
The formula for a regression line is: Y’ = bX + A where Y’ is the predicted score, b is the slope of the line, and A is the Y intercept.
Computing the Regression Line
In the age of computers, the regression line is typically computed with statistical software. However, the calculations are relatively easy, and are given here for anyone who is interested. The calculations are based on the statistics shown in the table below. MX is the mean of X, MY is the mean of Y, sX is the standard deviation of X, sY is the standard deviation of Y, and r is the correlation between X and Y.
Statistics for computing regression line:
The slope (b) can be calculated as b = r sY/sX, and the intercept (A) as A = MY – bMX. For these data, b = (0.627)(1.072)/1.581 = 0.425 and A = 2.06 – (0.425)(3) = 0.785.
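The same arithmetic can be sketched in a few lines of Python. The function name `regression_from_summary` and its parameter names are my own. The numbers are the ones quoted above.

```python
def regression_from_summary(r, s_x, s_y, m_x, m_y):
    """Slope and intercept of the regression line from summary statistics.

    slope     b = r * sY / sX
    intercept A = MY - b * MX
    """
    slope = r * s_y / s_x
    intercept = m_y - slope * m_x
    return slope, intercept


# Values quoted in the text: r = 0.627, sX = 1.581, sY = 1.072, MX = 3, MY = 2.06
slope, intercept = regression_from_summary(r=0.627, s_x=1.581, s_y=1.072, m_x=3, m_y=2.06)
print(round(slope, 3), round(intercept, 3))  # 0.425 0.785
```

Note that the code carries the unrounded slope into the intercept calculation, while the text rounds b to 0.425 first. Both give 0.785 to three decimal places.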
A Real Example
The case study “SAT and College GPA” contains high school and university grades for 105 computer science majors at a local state school. We now consider how we could predict a student’s university GPA if we knew his or her high school GPA. The scatter plot below shows the University GPA as a function of High School GPA. You can see from the figure that there is a strong positive relationship. The correlation is 0.78. The regression equation is:
University GPA’ = (0.675)(High School GPA) + 1.097
Therefore, a student with a high school GPA of 3 would be predicted to have a university GPA of
University GPA’ = (0.675)(3) + 1.097 = 3.12.
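Once the equation is known, making a prediction is a one-line calculation. Here is a small Python sketch using the slope and intercept from the equation above. The function name `predict_university_gpa` is hypothetical.

```python
def predict_university_gpa(high_school_gpa, slope=0.675, intercept=1.097):
    """Predicted university GPA from the regression equation quoted in the text."""
    return slope * high_school_gpa + intercept


predicted = predict_university_gpa(3)
print(round(predicted, 2))  # 3.12
```

Keep in mind that this is a prediction for a typical student with that high school GPA. Individual students will fall above or below the line, which is exactly what the errors of prediction measure.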
Joshua is an experienced analytics professional with a focus on areas such as Analytics, Big Data, Business Intelligence, Data Science and Statistics. He has more than 13 years of experience in Business Intelligence & Data Warehousing, Analytics, IT Management, Software Engineering and Supply Chain Performance Management with Fortune 500 companies. He has specializations in building Analytics organizations, Mobile Reporting, Performance Management, and Business Analysis.