Math 28 Project II: Linear Regression and Correlation
If we are given two ordered pairs \((x_1, y_1)\) and \((x_2, y_2)\), and we assume that x and y are linearly related, we can “work backwards” and find the equation of the line passing through them. The process of finding the equation of a line passing through given points is called linear regression.
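As a minimal sketch of the two-point case, the snippet below computes the slope and y-intercept of the line through two points. The function name `line_through` and the sample points are assumptions made for illustration; they are not part of the project.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        # A vertical line has no slope, so it cannot be written as y = mx + b
        raise ValueError("vertical line: slope is undefined")
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# Hypothetical points chosen for illustration (not from the project data)
slope, intercept = line_through((1, 3), (3, 7))
print(f"y = {slope}x + {intercept}")
```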
If we are given more than two points, the points may not be collinear (meaning all on a line). In fact, when collecting real-world data with more than two data points, the points are rarely collinear. However, after plotting the scatter of points in the xy-plane, it may appear that they almost fit on a line, displaying a linear trend:
Figure 1 - Sample data indicating a roughly linear trend
When a scatter of points exhibits a linear trend, we construct a line that best approximates that trend. This line is called the Least-Squares Regression Line, or simply the Line of Best Fit. We denote it by \(\hat{y} = ax + b\), where the “hat” over the y indicates that the calculated value of y is a prediction based on linear regression.
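CALCULATING THE LINE OF BEST FIT
Given a sample of n ordered pairs \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\), the line of best fit is denoted by \(\hat{y} = ax + b\), where the slope a and the y-intercept b are given by
\[ a = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} \]
and
\[ b = \bar{y} - a\bar{x}, \]
where \(\bar{x}\) and \(\bar{y}\) denote the means of the x- and y-coordinates, respectively.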
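The calculation can be sketched in a few lines of Python. This assumes the standard least-squares formulas written in terms of the sums of x, y, x², and xy, with b equal to the mean of y minus a times the mean of x. The function name `least_squares_line` and the test points are assumptions made for illustration.

```python
def least_squares_line(points):
    """Return (a, b) for the line y_hat = a*x + b fitted to (x, y) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)

    denominator = n * sum_x2 - sum_x ** 2
    if n < 2 or denominator == 0:
        # The slope is undefined when all x-values are identical
        raise ValueError("need at least two points with distinct x-values")

    a = (n * sum_xy - sum_x * sum_y) / denominator
    b = sum_y / n - a * sum_x / n  # b = y_bar - a * x_bar
    return a, b


# Hypothetical collinear points on y = 2x + 1 (not from the project data)
a, b = least_squares_line([(0, 1), (1, 3), (2, 5)])
print(a, b)
```

When the points are exactly collinear, as in this check, the line of best fit is simply the line through them.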
Example
Let’s look at an example: Given the ordered pairs (5, 14), (9, 17), (12, 16), (14, 18), and (17, 23), find the equation of the line of best fit, and graph it along with the data on the same coordinate system.
Example Solution
We need to organize the data and compute the appropriate sums:
| (x, y) | x | \(x^2\) | y | xy |
|---|---|---|---|---|
| (5, 14) | 5 | 25 | 14 | 70 |
| (9, 17) | 9 | 81 | 17 | 153 |
| (12, 16) | 12 | 144 | 16 | 192 |
| (14, 18) | 14 | 196 | 18 | 252 |
| (17, 23) | 17 | 289 | 23 | 391 |
| Sum | 57 | 735 | 88 | 1058 |
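First, we find the slope a, using n = 5 and the column sums \(\sum x = 57\), \(\sum x^2 = 735\), \(\sum y = 88\), and \(\sum xy = 1058\):
\[ a = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{5(1058) - (57)(88)}{5(735) - (57)^2} = \frac{274}{426} \approx 0.643. \]
Now we compute b, using \(\bar{x} = 57/5 = 11.4\) and \(\bar{y} = 88/5 = 17.6\):
\[ b = \bar{y} - a\bar{x} \approx 17.6 - (0.643)(11.4) \approx 10.27. \]
Therefore, the line of best fit is
\[ \hat{y} = 0.643x + 10.27. \]
Rounding off to one decimal place, we have
\[ \hat{y} = 0.6x + 10.3. \]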
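The arithmetic in this example can be checked with a short script. It is only a sketch that mirrors the table columns, using the sum-based slope formula and b as the mean of y minus a times the mean of x. The variable names are chosen here for illustration.

```python
# Data from the example
points = [(5, 14), (9, 17), (12, 16), (14, 18), (17, 23)]
n = len(points)

# Column sums, matching the x, x^2, y, and xy columns of the table
sum_x = sum(x for x, _ in points)
sum_x2 = sum(x ** 2 for x, _ in points)
sum_y = sum(y for _, y in points)
sum_xy = sum(x * y for x, y in points)

# Slope and y-intercept of the line of best fit
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - a * (sum_x / n)

print(sum_x, sum_x2, sum_y, sum_xy)  # 57 735 88 1058
print(round(a, 1), round(b, 1))      # 0.6 10.3
```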
To graph the line, we need to plot two points. One point is the y-intercept (0, b) = (0, 10.3). To find another point, we pick an x-value just past the largest x-value in the data set (x = 17), say x = 18, and calculate \(\hat{y}\):
\[ \hat{y} = 0.6(18) + 10.3 = 21.1. \]
Therefore the point (18, 21.1) is on the line of best fit. We plot the line using the two points below, along with the data:
Figure 2 - Line of Best Fit for Example 1
If our line fits the data well, we can use the line of best fit to interpolate y-values, given some x within the range of the x-values of our data set. Note that we CANNOT extrapolate y-values whose x-values are OUTSIDE of the range of our given x-values. In other words, we can use the line to “fill in” missing y-values BETWEEN our given points, but assuming that the linear trend continues outside of our data range is a very strong and likely false assumption.
For instance, since our line appears to fit our data well, we can estimate \(\hat{y}\) when x = 10:
\[ \hat{y} = 0.6(10) + 10.3 = 16.3. \]
This says that our line estimates the point (10, 16.3) between our given data points.
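The distinction between interpolating and extrapolating can be built into a small helper. The sketch below evaluates the rounded line from the example, with slope 0.6 and y-intercept 10.3, and refuses to predict outside the range of the observed x-values. The function name `predict` and its error message are assumptions made for illustration.

```python
def predict(a, b, x, x_values):
    """Estimate y_hat = a*x + b, refusing to extrapolate beyond the data."""
    lo, hi = min(x_values), max(x_values)
    if not lo <= x <= hi:
        raise ValueError(f"x = {x} is outside the data range [{lo}, {hi}]")
    return a * x + b


xs = [5, 9, 12, 14, 17]             # x-values from the example
y_hat = predict(0.6, 10.3, 10, xs)  # rounded line from the example
print(round(y_hat, 1))              # 16.3
```

Raising an error is one possible design choice. A gentler version could return the value together with a warning.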
Assignment:
Based on the above discussion, complete the problem below as a team, showing your work in the spaces provided.
Suppose data on the average hourly wage and the unemployment rate in the United States are given below (from the Federal Reserve Economic Data):
Figure 3 - Year, Average Hourly Wage, and Unemployment Rate
| Year | Average Hourly Wage ($) | Unemployment Rate (%) |
|---|---|---|
| 1992 | 10.57 | 7.5 |
| 1993 | 10.83 | 6.9 |
| 1994 | 11.12 | 6.1 |
| 1995 | 11.43 | 5.6 |
| 1996 | 11.82 | 5.4 |
| 1997 | 12.28 | 4.9 |
| 1998 | 12.77 | 4.5 |
- Letting x = average hourly wage, and y = unemployment rate, plot the data. Do the data exhibit a linear trend?
- Find the line of best fit. Use the table below to help with the computations:
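| (x, y) | x | \(x^2\) | y | xy |
|---|---|---|---|---|
| (10.57, 7.5) | | | | |
| (10.83, 6.9) | | | | |
| (11.12, 6.1) | | | | |
| (11.43, 5.6) | | | | |
| (11.82, 5.4) | | | | |
| (12.28, 4.9) | | | | |
| (12.77, 4.5) | | | | |
| Sum | | | | |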
- Using Excel or a similar tool, graph the line of best fit, along with the data, on the same coordinate system. Does the line appear to fit the data well? Copy your graph below.
- Predict the unemployment rate when the average hourly wage is $12.00.
- Can we predict the unemployment rate when the average hourly wage is $14.00? Explain.
References
Data Source: FRED, Federal Reserve Economic Data, Federal Reserve Bank of St. Louis: Civilian Unemployment Rate [UNRATE], Average Hourly Earnings of Production and Nonsupervisory Employees [AHETPI]; U.S. Department of Labor: Bureau of Labor Statistics; http://research.stlouisfed.org/fred2/; accessed October 14, 2014.