WEEK 7 HW 7 PART 1: OVERVIEW of REGRESSION/CORRELATION

Keep in mind that with REGRESSION (correlation) we are working with TWO variables, not just one. With a single variable, x just referred to a data point, like the heights of 10 students where the 3rd student had a height of 65 inches. With regression the x-values are themselves a variable (the INDEPENDENT VARIABLE), like years of education, and the y-values (the DEPENDENT VARIABLE) are the salaries associated with those years of education. EACH DATA POINT IS REPRESENTED BY ITS X AND Y VALUES: (X, Y)

When we did the exercise on MEANS for a single variable, you drew in the horizontal line you thought was the best fit: the one that minimized the distances from all the data points to that line. In REGRESSION we are trying to do the same thing: get the equation of a STRAIGHT LINE that minimizes the distances between each data point and that line. As we did with the means, we minimize the sum of the SQUARED distances (squaring makes them all positive; we then add them up, trying to get the lowest possible total - this was the "variance"). With regression we call it the "error of prediction," but again we try to make it as small as possible.

Unlike the line on the graph of a single variable, where the mean is a horizontal line, the regression line is intended to show the relationship (CORRELATION) between TWO variables and will have a SLOPE (be tilted). It can have a positive slope ( / ), indicating that as X increases, Y also increases (like speed and distance travelled). Or it can have a negative slope ( \ ), where Y decreases as X increases (like car speed and gas mileage).

The purpose of a regression line is to allow us to predict a y-value for a given x-value. BUT, since we are not typically dealing with "perfect" correlations, there is likely ERROR in the prediction. Here we get back into confidence intervals. We cover this "ERROR" later.
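(A minimal sketch of the "minimize the squared distances" idea, using a tiny MADE-UP data set and an eyeballed line - the numbers here are illustrative only, not from the homework data below.)

```python
# Compare the total SQUARED vertical error around a horizontal line at the
# mean versus around a tilted line, for a small made-up data set.
def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared vertical distances from each point to y = slope*x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]            # roughly increasing, so a tilted line should fit better
mean_y = sum(ys) / len(ys)      # 4.2

sse_mean   = sum_squared_error(xs, ys, 0.0, mean_y)  # horizontal line at the mean
sse_tilted = sum_squared_error(xs, ys, 0.9, 1.5)     # an eyeballed tilted line

print(sse_mean, sse_tilted)     # the tilted line gives the smaller total
```

Regression simply finds the ONE tilted line whose total squared error is the smallest possible.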
Here is a set of (x, y) data to be used below. These are the number of hours 10 students spent per week studying statistics and their final point grades:

Student   Hours (X)   Final Points (Y)
   1         10            65
   2         20            85
   3         12            70
   4         25            93
   5         15            75
   6          9            50
   7         22            90
   8         14            78
   9         16            80
  10         10            99

THE FORMULA FOR A REGRESSION LINE IS: Y' = bX + a

where Y' is the predicted point total based on the regression equation, which has a slope "b" and a y-intercept "a".

HERE IS THE EQUATION FOR THE SLOPE:

SLOPE (b) = Ʃ[ (X − Mx) * (Y − My) ] / Ʃ(X − Mx)²

1. Calculate the MEAN (M) and the Standard Deviation (SD) for BOTH the x and y values:

Mx = _15.3___;  My = _78.5___;  SDx = __5.48___;  SDy = __14.48__

2. Fill in the following chart with the calculated values:

 X    (X − Mx)   (X − Mx)²    Y    (Y − My)   (X − Mx)*(Y − My)
10                           65
20                           85
12                           70
25                           93
15                           75
 9                           50
22                           90
14                           78
16                           80
10                           99
           TOTAL Ʃ = ____               TOTAL Ʃ = ____

NOW, DO THE FINAL SLOPE CALCULATION: b = Ʃ / Ʃ = _______?

NEXT, DETERMINE THE y-INTERCEPT "a". (This is the value on the vertical Y-axis at which our "best fit" line crosses it, because there x = 0.)

a = My − b * Mx

You already have My and Mx and now have "b". Simply plug in those numbers to calculate "a".

a = __________ ?

OUR "BEST FIT" LINE EQUATION IS NOW COMPLETE:

Y' = (b) X + (a) = ______________________

WHERE Y' IS THE PREDICTED SCORE FOR A GIVEN X, "b" IS THE SLOPE OF THE LINE, AND "a" IS THE Y-INTERCEPT. (Keep in mind that Y-prime values are estimates based on our line equation and NOT real data.)

Another thought: are there any UNUSUAL values that might be throwing off the line equation? How could you handle such data? Would it make a big difference if deleted? Something to think about.

3a) Using the equation, what would be YOUR predicted point total if you were to study 13 hours a week? ____________ What if you put in 29 hours a week? _____________

Be advised that predicting performance BEYOND the range of the x-values is called EXTRAPOLATION and is NOT a good idea, as the estimated Y' value might be way off.
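(If you want to double-check your hand calculation, here is a short script that applies the slope and intercept formulas above to the worksheet data. Do the chart by hand first; this is only a sketch for verifying your arithmetic.)

```python
# Check of the slope (b), intercept (a), and predictions, using the worksheet data.
hours  = [10, 20, 12, 25, 15, 9, 22, 14, 16, 10]   # X
points = [65, 85, 70, 93, 75, 50, 90, 78, 80, 99]  # Y

n = len(hours)
mx = sum(hours) / n        # Mx = 15.3
my = sum(points) / n       # My = 78.5

# b = Ʃ[(X − Mx)(Y − My)] / Ʃ(X − Mx)²  -- the two chart totals
sxy = sum((x - mx) * (y - my) for x, y in zip(hours, points))
sxx = sum((x - mx) ** 2 for x in hours)
b = sxy / sxx
a = my - b * mx            # a = My − b * Mx

def predict(x):
    """Y' = bX + a"""
    return b * x + a

print(f"b = {b:.4f}, a = {a:.4f}")
print(f"Y'(13) = {predict(13):.1f}")   # inside the x-range (9 to 25)
print(f"Y'(29) = {predict(29):.1f}")   # beyond x = 25: EXTRAPOLATION
```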
3b) Are either of these generated Y' values an extrapolation, and if so, do you feel they are still realistic?

BUT, HOW "GOOD" IS OUR LINE EQUATION AT EXPLAINING THE RELATIONSHIP BETWEEN OUR INDEPENDENT VARIABLE (X) AND OUR DEPENDENT VARIABLE (Y)? HOW MUCH ERROR IS THERE? WE MUST CALCULATE THE CORRELATION COEFFICIENT (r) FOR OUR LINE EQUATION.

This "r" value not only allows us to determine the slope of the line, but also the strength (significance) of the relationship between variable X and variable Y. The closer "r" is to +1 or −1, the stronger the correlation (relationship). If r = 0, there is no statistical relationship.

To calculate "r" we basically STANDARDIZE (as in calculating z-values) each X and Y value. Remember the formula for "z"? For the X values it is Zx1 = (X1 − Mx) / Sx, and we do this for every X value and likewise for every Y value. The working formula below does not look that simple, but it is equivalent.

HERE IS THE FORMULA FOR CALCULATING "r". (Remember that Ʃ means SUM, and that Ʃ(X²) means squaring each X value and then adding up (summing) all those squared numbers, whereas (ƩX)² means adding up the X values and then squaring that sum.)

r = [ nƩ(X*Y) − (ƩX)*(ƩY) ] / √{ [nƩX² − (ƩX)²] * [nƩY² − (ƩY)²] }

USE THIS TABLE OF GIVEN AND CALCULATED VALUES TO INSERT INTO THE EQUATION (one row per student, with the TOTALS Ʃ in the last row):

 X     X²     Y     Y²     X*Y
(fill in each row from the data)
TOTALS Ʃ =

The "n" equals the number of data pairs (x, y), which is 10 in this case.

4) SHOW your setup in the equation and then calculate the resulting r:  r = _____________

Keep in mind that "r" MUST be between −1 and +1. If you get a value outside that range, it's a math error. Always double-check your calculation anyway.
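(Again, a short check-your-work sketch: this applies the "raw score" formula for r, exactly as written above, to the worksheet data. Work the table by hand first.)

```python
# Check of the correlation coefficient r using the raw-score formula.
from math import sqrt

hours  = [10, 20, 12, 25, 15, 9, 22, 14, 16, 10]   # X
points = [65, 85, 70, 93, 75, 50, 90, 78, 80, 99]  # Y
n = len(hours)

sum_x  = sum(hours)                                  # ƩX
sum_y  = sum(points)                                 # ƩY
sum_x2 = sum(x * x for x in hours)                   # ƩX²  (square, THEN sum)
sum_y2 = sum(y * y for y in points)                  # ƩY²
sum_xy = sum(x * y for x, y in zip(hours, points))   # Ʃ(X*Y)

numerator   = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

assert -1 <= r <= 1    # a value outside [−1, +1] means a math error
print(f"r = {r:.4f}")
```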
NOW, calculate the COEFFICIENT OF DETERMINATION (r²) = _____%. This is how much of the variation in Y is explained by the variation in X (expressed as a %).

WE CAN ALSO USE THE "r" VALUE ALONG WITH THE STANDARD DEVIATIONS OF X AND Y TO CALCULATE THE "BEST FIT" LINE SLOPE (b):

SLOPE: b = r * (Sy/Sx)

IF WE SQUARE r, WE GET THE COEFFICIENT OF DETERMINATION (r²):
• r², as a percent, is the percent of variation in the dependent variable Y that CAN be explained by variation in the independent variable X using the regression (best-fit) line.
• 1 − r², as a percent, is the percent of variation in Y that is NOT explained by variation in X using the regression line. In other words, the more "scattered" the data points are, the less of the variation is explained by the regression line. Make sense?

If there is too much "scatter", meaning too much UNEXPLAINED variation, we perform a hypothesis test of the "significance of the correlation coefficient" to decide whether the linear relationship in the SAMPLE data is strong enough to use to model the relationship in the POPULATION.

The following is OPTIONAL reading; no problems need to be worked.

ONCE WE HAVE OUR BEST FIT LINE EQUATION WE CAN DETERMINE CONFIDENCE INTERVALS AND CONDUCT HYPOTHESIS TESTS ON THE CORRELATION COEFFICIENT (r) AND ON THE SLOPE (b). YOU WOULD FILL IN THE FOLLOWING TABLE. NOTE THAT WE WOULD BE USING OUR BEST FIT LINE EQUATION TO CALCULATE PREDICTED VALUES (Y') FOR THE ORIGINAL X DATA VALUES.

 X    Y    (Y − My)   (Y − My)²    Y'   (Y' − My')   (Y' − My')²   (Y − Y')   (Y − Y')²
                      [SSY: Ʃ = ]                    [SSY': Ʃ = ]             [SSE: Ʃ = ]
(total each of the three squared columns)

So, what do we learn from these numbers?

SSY = the Sum of Squares of Y, the SUM of the (Y − My)² column. The SSY can be divided into SSY' (Sum of Squares Predicted) + SSE (Sum of Squares Error), where SSY' = Ʃ(Y' − My')² is the "EXPLAINED" variation and SSE = Ʃ(Y − Y')² is the UNexplained variation.
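(A sketch of the table above in code, using the worksheet data: it verifies that the total variation really does split into explained plus unexplained, and that the explained fraction equals r².)

```python
# Verify the variance partition SSY = SSY' + SSE, and that SSY'/SSY = r².
hours  = [10, 20, 12, 25, 15, 9, 22, 14, 16, 10]   # X
points = [65, 85, 70, 93, 75, 50, 90, 78, 80, 99]  # Y
n = len(hours)
mx, my = sum(hours) / n, sum(points) / n

# Best-fit line from the earlier formulas
b = sum((x - mx) * (y - my) for x, y in zip(hours, points)) / sum((x - mx) ** 2 for x in hours)
a = my - b * mx

predicted = [b * x + a for x in hours]             # Y' for each original X
my_p = sum(predicted) / n                          # mean of the Y' values

ssy   = sum((y - my) ** 2 for y in points)                        # total variation
ssy_p = sum((yp - my_p) ** 2 for yp in predicted)                 # SSY': explained
sse   = sum((y - yp) ** 2 for y, yp in zip(points, predicted))    # SSE: unexplained

print(f"SSY = {ssy:.1f}, SSY' = {ssy_p:.1f}, SSE = {sse:.1f}")
print(f"SSY' + SSE = {ssy_p + sse:.1f}")           # equals SSY
print(f"r² = SSY'/SSY = {ssy_p / ssy:.4f}")
```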
SSY'/SSY = the proportion (percentage) of variation explained, and this equals r² (where "r" is the correlation coefficient). Think about it: if r = 0 then NONE of the variation is explained, and if r = 1.0 then ALL of the variation is explained.

Let's talk about the STANDARD ERROR of the estimate (Sest). This is like a Standard Deviation, while the SQUARED sums above were like Variances.

Sest = √[ Ʃ(Y − Y')² / (N − 2) ]

It's N − 2 since we estimated both the slope (b) and the Y-intercept (a). We can also calculate Sest from:

Sest = √[ (1 − r²) * SSY / (N − 2) ]

Take your pick.

We can conduct a SIGNIFICANCE TEST on the SLOPE (b) of our best-fit line equation. Use the t-test:

t = (some statistic − hypothesized value) / (estimated std error of that statistic)

In this case the STATISTIC
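(One last check-your-work sketch: computing the standard error of the estimate BOTH ways with the worksheet data, to confirm the two formulas above really do give the same answer.)

```python
# Standard error of the estimate, computed two equivalent ways.
from math import sqrt

hours  = [10, 20, 12, 25, 15, 9, 22, 14, 16, 10]   # X
points = [65, 85, 70, 93, 75, 50, 90, 78, 80, 99]  # Y
n = len(hours)
mx, my = sum(hours) / n, sum(points) / n

b = sum((x - mx) * (y - my) for x, y in zip(hours, points)) / sum((x - mx) ** 2 for x in hours)
a = my - b * mx

sse = sum((y - (b * x + a)) ** 2 for x, y in zip(hours, points))  # Ʃ(Y − Y')²
ssy = sum((y - my) ** 2 for y in points)                          # Ʃ(Y − My)²
r2  = 1 - sse / ssy                                               # equals r squared

s_est_1 = sqrt(sse / (n - 2))                 # √[Ʃ(Y − Y')² / (N − 2)]
s_est_2 = sqrt((1 - r2) * ssy / (n - 2))      # √[(1 − r²) * SSY / (N − 2)]

print(f"Sest (formula 1) = {s_est_1:.3f}")
print(f"Sest (formula 2) = {s_est_2:.3f}")    # the same; take your pick
```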