Question in Lab 4 PDF
Lab 4 – Using Cross-Validation to Evaluate Your Past Models

Due date: 11:59 PM Tuesday, June 2nd

In this lab you will use k-fold cross-validation on the models you built in the past two labs in order to test how they are likely to perform on data they have never seen before. This lab is a chance to practice cross-validation, but also to improve your models based on the feedback from me and the TA regarding the strength and/or logic of your past models. If they weren't done very well, you have permission to recreate totally different or better models on a different data set.

You have full permission, of course, to change your data set from the one you used in previous labs. Please send me a one- or two-line description of your proposed set and a screenshot of the Excel file so I get a sense of what it is. In the final project you will be using as many tools as possible to explore the same and/or related data sets (typically the Covid data sets, but not restricted to these), so I do recommend staying with a topic similar to the one you have been working with as much as possible, to help you with the final project. However, it can also be beneficial to experiment with new data and see the various tools in different contexts, so trying something new can be an advantage as well.

Some suggestions on content:
• Choose the top 3 models (or so) you have created in past labs.
• Evaluate their test errors with cross-validation techniques, comparing against similar models of lesser and greater complexity.
• Use other approximations as well if available (Cp, BIC, AIC) and comment on similarities or differences between the results these approximations give and those of the CV process.
• Comment on what you believe to be the strengths of your model(s).

Hints on process:
• Give a brief recap of your models and explain again why they are important. Make sure you take time to improve old ones or try something new if you did not score well on a previous lab. You will still be marked on whether the models and your interpretations of them make sense. You don't have to go as in depth this time around, but the models should still make sense, you should still use a sufficient number of data points (I suggest at least 100, but really at least 1,000 is appropriate given all the data we have access to), and so on (see rubric for more details).
• The rest of the lab is more technical than the others: we want to see that:
o you have executed cross-validation properly and understand the difference between test and training sets;
o you understand how to perform k-fold cross-validation properly, and that it is the preferred method of calculating realistic (although imperfect) test error rates;
o you may want to use other estimates of test error such as Cp, AIC, and BIC, but you realize these are not actual test errors, just training errors that have been modified to reflect what the true test errors likely are;
o you use and understand the one-standard-error rule to choose the "best" model (a sketch of k-fold CV with the 1SE rule follows below).

Note: there is a 0 (ZERO) tolerance policy on cheating and plagiarism. If any student is found duplicating all or part of another student's assignment, they will be sent to the AIO (Academic Integrity Officer). The AIO will then begin the student discipline process as they see fit. This may include failing the assignment or the course.
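The core technique the lab asks for, k-fold cross-validation combined with the one-standard-error rule, can be sketched as follows. This is a minimal illustration in Python (the lab does not prescribe a language); the synthetic data set, the 10-fold split, and the polynomial model family are all assumptions for demonstration, not part of the assignment.

```python
# Minimal sketch: 10-fold CV across models of increasing complexity,
# then the one-standard-error (1SE) rule to pick the "best" model.
# The data, fold count, and model family here are illustrative assumptions.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))           # at least 100 points, per the lab
y = 3 - 0.5 * X[:, 0] + rng.normal(0, 1, 200)   # noisy decreasing trend

kf = KFold(n_splits=10, shuffle=True, random_state=0)
results = []
for degree in range(1, 7):                      # candidate models, low to high complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # one MSE per fold; scikit-learn reports negated MSE, so flip the sign
    mse = -cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
    results.append((degree, mse.mean(), mse.std() / np.sqrt(kf.get_n_splits())))

# 1SE rule: take the SIMPLEST model whose mean CV error is within one
# standard error of the best (lowest) mean CV error.
best_mean, best_se = min((m, se) for _, m, se in results)
chosen = min(d for d, m, _ in results if m <= best_mean + best_se)
print(f"1SE choice: polynomial degree {chosen}")
```

The 1SE rule prefers parsimony: when several models have statistically indistinguishable CV error, it keeps the simplest one. Training-error-based estimates such as Cp, AIC, and BIC (available, for example, on statsmodels OLS results) can then be computed on the same candidates to confirm the CV ranking, remembering that they are modified training errors, not actual test errors.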
You will be graded according to the following rubric (each criterion scored 0 to 3):

Robustness of Models
0 – Only 1 or 2 of the categories under score 3 completed at all, or all 3 completed insufficiently well.
1 – 1–3 of the categories under score 3 completed only reasonably well.
2 – At least 2 of the categories under score 3 done quite well.
3 – The student's models are: 1. based on a sufficient number of data points (at the very least 100, but ideally at least 500 or 1,000); 2. interpreted well; 3. helpful to the reader in understanding the data better, possibly even of practical use to decision makers.

Model performance vs. model complexity
0 – The student hasn't illustrated that they understand the idea of complexity vs. performance.
1 – The student's performance vs. complexity graph contains significant logical or structural errors.
2 – The student has stopped short of creating enough different models, so their performance vs. complexity graph has too few points and is of limited usefulness.
3 – The student has taken a model that shows promise, tried various combinations of possible predictors, and created a proper performance vs. complexity graph.

Choosing the "best" model
0 – Little to no understanding of the 1SE rule.
1 – Understanding and use of the 1SE rule is weak.
2 – The student seems to have understood the 1SE rule, but the data looks doubtful.
3 – The student has understood and utilized the one-standard-error rule properly.

Methods
0 – The student has very little understanding of the purpose of CV or test error estimation.
1 – The student does the process mainly correctly, but their language shows they are unsure of what CV is or what Cp, AIC, BIC, etc. are estimating.
2 – The student does not perform k-fold CV quite correctly, and somewhat confuses the test error estimate calculated with CV with those created by modifying training error rates.
3 – The student understands that k-fold CV is best and depends on it the most. They realize Cp, AIC, BIC, etc. are estimates based on modifications of calculated training error and use these to confirm their k-fold CV results.

Lab 2 – Data Analytics
Professor: Scott Flemming
Student: Sleiman Yammine (B00819918)

Which of the two most impacted provinces in Canada is flattening the Covid-19 curve more efficiently?

We have seen Covid-19 cases begin to decrease drastically around the world and cities slowly beginning to reopen, especially in Canada. According to the Chief Public Health Officer of Canada, Dr. Theresa Tam, an estimated 50% of all Covid-19 cases in Canada have fully recovered; however, which provinces are doing the most for their people, and which aren't? Link: 50% Recovered in Canada

In the first pie chart, we can see the case distribution among major provinces (those with more than 1,000 cases). Each percentage represents the share of Canada's total cases located in that province. We can see that the two major hotspots in Canada are Ontario and Quebec. For the sake of demonstration, the information below covers these two most affected provinces, analyzing their respective data and comparing them to each other, all while asking the question: which of these two provinces is acting more efficiently to flatten the curve of new cases?

A. Infectivity
I. Quebec: total infected – 41,420 cases; population (2020) – 8.45 million
II. Ontario: total infected – 23,147 cases; population (2020) – 14.57 million

In Graph 1, we can see that cases per capita in Quebec are higher than in Ontario, suggesting Ontario has been making stronger political policies regarding lockdowns; however, let us see how Quebec performs in recovery.
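As a quick arithmetic check on the per-capita claim, the totals quoted above can be normalized by population. This small sketch uses only the figures given in the text:

```python
# Quick check of the per-capita claim, using only the totals quoted above.
cases = {"Quebec": 41_420, "Ontario": 23_147}
population = {"Quebec": 8.45e6, "Ontario": 14.57e6}

for province in cases:
    per_million = cases[province] / population[province] * 1e6
    print(f"{province}: {per_million:,.0f} cases per million")
```

Quebec's roughly 4,900 cases per million versus Ontario's roughly 1,590 per million is consistent with Quebec's higher per-capita curve in Graph 1.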
(Source: https://www.citynews1130.com/2020/05/17/50-percent-of-canada-covid-cases-recovered/)

[Image: https://1.bp.blogspot.com/-9pUP1WxAelU/XsLhmkM12bI/AAAAAAAAEOo/dPNnp4XfhgEp7PFuk0gRo-n8Gl8LgLozACLcBGAsYHQ/s1600/Pie%2Bchart%2Bcases.png]
Graph 1 – Cumulative Covid-19 Cases (Ontario vs. Quebec)

B. Recovered Cases
I. Quebec: total recovered – 11,039 cases; percentage recovered = 26.65%
II. Ontario: total recovered – 16,641 cases; percentage recovered = 71.89%

In Graph 2, we can see that recoveries per capita in Quebec are lower than in Ontario. Can we suggest that Ontario has a more efficient health care system, or is it a result of better social distancing laws within the province?

[Image: https://1.bp.blogspot.com/-fselFLLEZ9E/XsLl3si86tI/AAAAAAAAEPE/Akk0KawlhBcLR-TaIcJ6PAWA5arrCIkywCLcBGAsYHQ/s1600/OnVsQc-Cases.png]
Graph 2 – Cumulative Covid-19 Recoveries (Ontario vs. Quebec)

C. Linear Models – Ontario
From Graph 1 and Graph 2, we can begin to see that Ontario is doing a much better job; however, is Ontario truly flattening the curve by decreasing its number of new cases per day?

[Image: https://1.bp.blogspot.com/-6F2mUtO0gFg/XsLl3p8SfKI/AAAAAAAAEPM/kMStslCZ8mMTu6P4B84inZZMijKYTSOIgCPcBGAYYCw/s1600/OnVsQu-Rec.png]
Graph 3 – Ontario Covid-19 New Cases

In Graph 3, we see the new cases per day from March until May. We notice that the number of new cases decreases as we move forward in time. Compared to a simple linear fit (blue line), the goodness of fit is 66.8%, indicating that the trend is decreasing over time (good news!). Looking closer at the red and green lines, which attempt to fit the data more appropriately in a decreasing manner, the goodness of fit is 74.5% and 81.87% respectively. This indicates that cases are in fact decreasing, and Ontario is successfully flattening the curve.

D. Linear Models – Quebec
From Graph 1 and Graph 2, we can begin to question whether Quebec is actually flattening the curve. Let's take a closer look.

[Image: https://1.bp.blogspot.com/-uJLJZQcXOC4/XsLl3TqcuPI/AAAAAAAAEPM/He0SHmYWZZYZYVZtwNH-Rx7zPMNRsmd3gCPcBGAYYCw/s1600/Onatario%2BCases%2Bflattening%2Bthe%2Bcurve.png]
Graph 4 – Quebec Covid-19 New Cases

In Graph 4, we see the new cases per day from March until May. We notice that the new cases (much higher than Ontario's) decrease slightly as we move forward in time, but not enough. Comparing this to a simple linear fit (blue line), the goodness of fit is 72.3%, indicating that the trend is slowly decreasing over time; the fact that a straight line fits Quebec's data more closely than Ontario's suggests that Ontario is performing better. Now, if we look at the red and green lines, which attempt to
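To make the goodness-of-fit figures above concrete, here is a minimal sketch of fitting a straight line to a daily new-case series and reporting R², the "goodness of fit" quoted in Sections C and D. The new_cases array below is illustrative dummy data, not the actual Ontario or Quebec counts.

```python
# Sketch: fit a straight line to daily new cases and report R-squared.
# new_cases is illustrative dummy data, NOT the actual provincial series.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

days = np.arange(60).reshape(-1, 1)   # roughly March through May
rng = np.random.default_rng(1)
new_cases = 600 - 5 * days[:, 0] + rng.normal(0, 60, 60)

fit = LinearRegression().fit(days, new_cases)
r2 = r2_score(new_cases, fit.predict(days))
print(f"slope = {fit.coef_[0]:.1f} new cases/day, R^2 = {r2:.1%}")
```

A negative slope with a reasonable R² supports a "flattening" claim; the red and green curves discussed in the text would be higher-order fits evaluated the same way, and Lab 4's cross-validation is the natural tool for comparing them fairly.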