·
Take this Word document and insert your answer following each question.
·
Answer questions in sequence.
·
You will be asked to take screen shots of your Excel spreadsheet and paste them into the document.
Do not send a picture of your whole screen.
The screenshot should just include what is necessary for the question.
If you need guidance on capturing pictures, please watch the screen capture video in the assignment folder.
Assignment #4 Due: Friday, March 24th at 11:59pm Name: _______________________ Section: __________ Instructions: · Take this Word document and insert your answer following each question. · Answer questions in sequence. · You will be asked to take screen shots of your Excel spreadsheet and paste them into the document. Do not send a picture of your whole screen. The screenshot should just include what is necessary for the question. If you need guidance on capturing pictures, please watch the screen capture video in the assignment folder. · When you are finished, save the document as a single pdf document and upload it to TopHat. If you have problems submitting, send your instructor an email. Assignment #3 looked at building a model to predict fuel economy based upon published vehicle characteristics. The data set we will use can be found at https://www.kaggle.com/datasets/tymekurban/new-cars-usa-202223-dataset It was posted by Tymoteusz Urban. He has named the New Cars USA 2022/23 dataset and describes the data set as follows. Data scraped from ccarprice.com and thoroughly cleaned. More than unique 2500 cars (only those for which price info was available) and almost 80 columns. All important technical data such as drivetrain, power or torque as well as photo url, body, car dimensions and additional equipment. Each row is one car. There may be some mistakes and NAs, so when manipulating data remember some values in columns may be irrational. In Assignment 3, we built a model that took the form Estimated Average MPG = 44.48 -0.0285*Torque -0.1801*Height This model “explained” 61% of the variation in MPG and predictions would typically be off by 3.1 mpg, more or less. In Assignment 4 we will endeavor to improve predictions by adding variables, looking at possible non-linearities, and incorporating categorical variables, such as body type, transmission, engine type, and drivetrain. Open the file Assignment 4 Data.xlxs. 1. Go to the tab labeled Torque Height. This tab shows the results of the model using Torque and Height to predict AvgMPG that was built in Assignment 3. a. - Copy the Residuals to a new sheet. - From the Assignment 3 Data tab, copy the numeric variables that were not used (all variables from Price to Wheelbase, excluding Torque and Height). · Construct a correlation table (matrix). Paste a copy into your assignment. b. You should find that Cylinders has the highest correlation with residuals. - Copy Torque, Height, Cylinders and AvgMPG into a new sheet. - Use Regression to fit the model AvgMPG = intercept + coeff1*Torque + coeff2*Height + coeff3*Cylinders. - Request Residuals and Residual Plot for this model. - Copy the Regression Summary Output into your assignment. Do not copy the Residual Output or the Residual Chart. c. Write out the estimated regression formula. d. In Assignment #3, it was suggested that the strong correlation among Power, Torque, Displacement and Cylinders would give rise to issues of multicollinearity, if more than one was included in the model. Comment on the degree to which the coefficient for Torque has changed after Cylinders was added. Did it shift by much more than would have been expected based upon the standard error of the estimate in Assignment 3? The standard error reflects of how much error there might have been in the estimate. e. Has including Cylinders improved the “performance” of the model predictions? This is not simply an opinion. Back up your conclusion using model performance measures. 2. Let us try to improve the model further. - Copy the Residuals from question 1 to a new sheet. - Copy the unused numeric variables (all variables from Price to Wheelbase, excluding Torque, Height and Cylinders) from the Assignment 3 Data tab. a. Construct a correlation table (matrix). Paste a copy into your assignment. b. You should find that Doors has the highest correlation. Construct a scatter chart with Residuals on the vertical axis and Doors on the horizontal axis. Label the axes. Paste a copy of the chart into your assignment. c. Copy Torque, Height, Cylinders, Doors and AvgMPG into a new sheet and fit the model AvgMPG = intercept + coeff1*Torque + coeff2*Height + coeff3*Cylinders + coeff4*Doors. Request Residuals and Residual Plot for this model. Copy the Regression Summary Output into your assignment. Do not copy the Residual Output or the Residual Chart. d. In simple language, explain what the coefficient for Cylinders represents. e. In contrast to Torque and Cylinders, Doors was slightly correlated with Height, but did not have a strong correlation with engine variables. Did the inclusion of Doors have any impact on the coefficient for Torque? That is, did the Torque coefficient change significantly from what it was in Q2b? 3. If we construct a correlation table of residuals and unused variables, Width has the strongest correlation. When you add Width and examine the residuals, you may want to add Displacement. The model is getting more complicated but performance improves only marginally. Adding Displacement and Width increases the R-sq from 67.3% to 67.6%, and the standard error decreases from 2.85mpg to 2.84mpg. It appears that we are making no substantive improvement by adding more numeric variables. However, we have yet to look at categorical variables. In Assignment 3, it was suggested that Hybrid engines are more efficient than Petrol (gas) engines. - Copy the Residuals from Q2c into a new sheet. - From the tab Assignment 3 Data, Copy the variables Body, Fuel, Gearbox, and Drivetrain to this sheet. a. Construct a Box and Whisker Chart for Residuals. Copy a picture of this chart into your assignment. b. Edit the horizontal axis and select the values for Fuel type. Copy a picture of this chart into your assignment. c. Edit the horizontal axis and select Body type. Copy a picture of this chart into your assignment. d. The chart in Q3b should show a major difference between Hybrid and Petrol vehicles. The chart in Q3c should show no strong differences among Body types. If you had made charts for Gearbox and Drivetrain, you would see only small differences. Let us add Fuel type to the model. - Copy the column of Fuel type values into a new sheet. - Label the adjacent column Hybrid. - In cell B2, enter the formula =IF(A2=”Hybrid”, 1, 0). This will code all Hybrid vehicles as 1 and all others as 0. - Copy Torque, Height, Displacement, Cylinders, Doors and AvgMPG into adjacent columns. - Fit the model AvgMPG = intercept + coeff1*Hybrid + coeff2*Torque + coeff3*Height + coeff4*Cylinders + coeff5*Doors - Request Residuals and Residual Plot for this model. - Copy the Regression Summary Output into your assignment. Do not copy the Residual Output or the Residual Chart. e. Has including Fuel type improved the model? Justify your answer using model performance measures. f. Can you quantify the benefit of having a Hybrid? How much does it improve fuel efficiency? Remember that higher values of AvgMPG means that you can drive a greater distance wit the same amount of fuel. g. Your answer in Q3f is just an estimate. If you had a different sample, you would obtain a different estimate. The average benefit of driving a Hybrid could be more or less than this estimate. Based upon using the Empirical Rule, what is your lowest reasonable estimate of the benefit of a Hybrid? Explain why you think that the improvement in AvgMPG for a Hybrid is unlikely to be less than this amount. h. In Assignment 3, you estimated the AvgMPG for two Honda Accords. The estimated AvgMPG was 27.2 for the Hybrid and 26.8 for the Sedan. Data for these two vehicles are below: Using the model in Q3d, estimate the AvgMPG for the two Hondas. 4. We have assumed that there is a linear relationship among variables and AvgMPG, but in Assignment 3 you may have noted that the residual plot with Torque might have looked curved. a. We have added many variables, so this relationship may have changed. Copy the residual plot for Torque from the analysis done for Question 3d into your assignment. b. Although in class we suggested scaling variables when building nonlinear models, let us keep it simple and just create a new variable (Torque)2. - Copy the variables Hybrid to AvgMPG from question 3d into a new sheet. - Insert a new column adjacent to Torque and label it T2. - Assuming Torque is in column C and T2 is in column D, in cell D2 enter the formula =C2^2 and copy this formula into all cells in column D. - Fit a new model that includes all terms in Q3d as well as this new variable T2. Request Residuals and Residual Plot for this model. - Copy the Regression Summary Output into your assignment. Do not copy the Residual Output or the Residual Chart. c. Did adding this squared term for Torque improve the model? Justify your answer. d. It is possible to make some small further improvements in the model by adding more numeric and categorical variables as well as looking at nonlinearities or possibly interactions among variables. We have suggested that predictions with any model are usually within 2 standard errors. Our standard error is now close to 2mpg, so our prediction errors should be within 4mpg most of the time. Are they? Select the Residuals and Insert a Histogram. We only want the chart and not the frequency distribution, so you do not need to go to Data Analysis and using the Histogram function. The scale is awful though. - Select the horizontal labels and right click. - Select Format Axis. - Click on Bin Width and enter 0.5. - Click on Underflow bin and enter -8.0. This should give you a chart with a scale that you can interpret. The middle of the chart should be 0, since residuals should be centred around 0. Paste a copy of your chart into your assignment. e. In your opinion, are most residuals (prediction errors) between -4 and +4? Does the general guidance that most predictions are within 2 standard errors seem reasonable? As you can see, building models is not simple task. Different researchers may build different models using the same data, simply because they explored different patterns in the data. Rarely, is there just one or 2 variables that explain a target outcome. There may be many variables that each makes a small contribution. Improving student performance, reducing CO2 emissions, finding a family doctor for every Nova Scotian, … are all very complex problems that are not “fixed” with a few simple actions. And as we make an improvement, it becomes increasingly difficult to find the next action (variable) that will help us make further improvements. In many cases, we have already implemented the easy actions. 5. In Assignment 2, we looked at the different reasons that students had for choosing Saint Mary’s. We looked at whether there were differences in reasons between Domestic and International students, but not other factors. Did students with different reasons for choosing Saint Mary’s have different experiences. Below is a summary of reasons and whether their experiences met expectations. Which reason was the most important? Overall, my experience last fall Program Location Other Cost Grand Total was below my expectations 141 59 46 55 301 met my expectations 313 91 88 66 558 exceeded my expectations 116 32 27 23 198 Grand Total 570 182 161 144 1057 a. What is the probability that a randomly selected student said that their experience met expectations? b. What is the probability that a randomly selected student said that their main reason was Program AND their experience met expectations? c. What is the probability that a randomly selected student said that their main reason was Program OR their experience met expectations? d. Among students who said that their main reason was Program, what is the probability that a randomly selected student said their experience met or exceeded expectations? e. Among students who said that their main reason was Cost, what is the probability that a randomly selected student said their experience met or exceeded expectations? f. Among those that said that their experience was below expectations, what is the probability that a randomly selected student said their main reason was Program? g. Are the events “main reason is Program” and “experience Below Expectations” independent? Use probabilities to justify your conclusion. Assignment 3 Data IDBrandModel.NumberPhotoEngine.TypeBody.TypeFuel.TypeGearbox.TypeDrivetrainPricePower.hpDisplacement.lTorque.lbftCylindersSeatsDoorsHeight.inLength.inWidth.inWheelbase.inMPG.CityMPG.HighwayAvgMPG 1HondaAccord Hybrid Sport-L 2023https://www.ccarprice.com/products/Honda_Accord_Hybrid_Sport-L_2023.jpg2.0L Inline-4 HybridSedanHybridCVTfront31670204224745457.1196.173.3111.4444142.5 2HondaAccord Sedan 2023https://www.ccarprice.com/products/Honda_Accord_Sedan_2023.jpg1.5L Turbo Inline-4 GasSedanPetrolCVTfront280001921.526045457.1196