Understanding Linear Regression
Sometimes it's amazing how very simple concepts can actually be made so complicated, and for little reason. For those of us who don't have a stats background but want or need to understand statistical concepts, it doesn't help that they are often taught in a way that is hard to follow. Here I will touch on understanding linear regression and how it is used.
Linear regression consists of finding the best-fitting straight line through the points. The best-fitting line is called a regression line. The black diagonal line below is the regression line and consists of the predicted score on Y for each possible value of X. The vertical lines from the points to the regression line represent the errors of prediction. As you can see, the red point is very near the regression line; its error of prediction is small. By contrast, the yellow point is much higher than the regression line, and therefore its error of prediction is large.
The error of prediction for a point is the value of the point minus the predicted value (the value on the line). The table below shows the predicted values (Y') and the errors of prediction (Y − Y'). For example, the first point has a Y of 1.00 and a predicted Y (called Y') of 1.21. Therefore, its error of prediction is −0.21.
X     Y     Y'     Y − Y'   (Y − Y')^2
1.00  1.00  1.210  −0.210   0.044
2.00  2.00  1.635   0.365   0.133
3.00  1.30  2.060  −0.760   0.578
4.00  3.75  2.485   1.265   1.600
5.00  2.25  2.910  −0.660   0.436
You may have noticed that we did not specify what is meant by "best-fitting line." By far, the most commonly used criterion for the best-fitting line is the line that minimizes the sum of the squared errors of prediction. That is the criterion that was used to find the line in the graph above.
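To make the least-squares criterion concrete, here is a minimal Python sketch using the X and Y values from the table above and the fitted line Y' = 0.425X + 0.785 (the slope and intercept computed later in this post). It reproduces the predicted values, the errors of prediction, and their squared sum:

```python
# Data from the table above
X = [1.00, 2.00, 3.00, 4.00, 5.00]
Y = [1.00, 2.00, 1.30, 3.75, 2.25]

b, A = 0.425, 0.785  # slope and intercept of the regression line

predictions = [b * x + A for x in X]                 # Y' for each X
errors = [y - yp for y, yp in zip(Y, predictions)]   # errors of prediction, Y - Y'
sse = sum(e ** 2 for e in errors)                    # sum of squared errors

for x, y, yp, e in zip(X, Y, predictions, errors):
    print(f"{x:.2f}  {y:.2f}  {yp:.3f}  {e:+.3f}  {e**2:.3f}")
print(f"Sum of squared errors: {sse:.3f}")
```

Any other straight line through these points would produce a larger sum of squared errors; that is what makes this line "best-fitting" under the least-squares criterion.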
The formula for a regression line is Y' = bX + A, where Y' is the predicted score, b is the slope of the line, and A is the Y intercept.
Computing the Regression Line
In the age of computers, the regression line is typically computed with statistical software. However, the calculations are relatively easy, and are given here for anyone who is interested. The calculations are based on the statistics shown in the table below. M_{X} is the mean of X, M_{Y} is the mean of Y, s_{X} is the standard deviation of X, s_{Y} is the standard deviation of Y, and r is the correlation between X and Y.
Statistics for computing regression line:
M_{X}   M_{Y}   s_{X}   s_{Y}   r
3       2.06    1.581   1.072   0.627
The slope (b) can be calculated as follows: b = r s_{Y}/s_{X}, and the intercept (A) can be calculated as: A = M_{Y} – bM_{X}. For these data, b = (0.627)(1.072)/1.581 = 0.425 and A = 2.06 – (0.425)(3) = 0.785.
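As a sanity check, the same slope and intercept can be computed directly from the raw X and Y values. This Python sketch recreates the statistics in the table above (means, sample standard deviations, and correlation) and then applies the formulas b = r s_{Y}/s_{X} and A = M_{Y} − bM_{X}:

```python
import math

# Raw data from the example above
X = [1.00, 2.00, 3.00, 4.00, 5.00]
Y = [1.00, 2.00, 1.30, 3.75, 2.25]
n = len(X)

mean_x = sum(X) / n  # M_X = 3
mean_y = sum(Y) / n  # M_Y = 2.06

# Sample standard deviations (n - 1 in the denominator)
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in X) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in Y) / (n - 1))

# Pearson correlation between X and Y
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / (n - 1)
r = cov / (s_x * s_y)

b = r * s_y / s_x       # slope
A = mean_y - b * mean_x  # intercept

print(f"b = {b:.3f}, A = {A:.3f}")  # b = 0.425, A = 0.785
```

This recovers the same line, Y' = 0.425X + 0.785, as the hand calculation.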
A Real Example
The case study “SAT and College GPA” contains high school and university grades for 105 computer science majors at a local state school. We now consider how we could predict a student’s university GPA if we knew his or her high school GPA. The scatter plot below shows the University GPA as a function of High School GPA. You can see from the figure that there is a strong positive relationship. The correlation is 0.78. The regression equation is:
University GPA’ = (0.675)(High School GPA) + 1.097
Therefore, a student with a high school GPA of 3 would be predicted to have a university GPA of
University GPA’ = (0.675)(3) + 1.097 = 3.12.
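Applying the regression equation in code is equally direct. Here is a small sketch using the coefficients above; the function name is just for illustration:

```python
def predict_university_gpa(high_school_gpa: float) -> float:
    """Predicted university GPA from the regression equation above."""
    return 0.675 * high_school_gpa + 1.097

print(round(predict_university_gpa(3.0), 2))  # 3.12
```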
Joshua Burkhow
Joshua is an experienced analytics professional with a focus on areas such as Analytics, Big Data, Business Intelligence, Data Science, and Statistics. He has more than 13 years of experience in Business Intelligence & Data Warehousing, Analytics, IT Management, Software Engineering, and Supply Chain Performance Management with Fortune 500 companies. He has specializations in building Analytics organizations, Mobile Reporting, Performance Management, and Business Analysis.