pdf format
small.csv

X1,X2,Y
S,-0.1,19.19
S,2.53,22.74
S,4.86,23.91
M,0.26,7.07
M,2.55,7.93
M,4.87,8.93
L,0.08,20.63
L,2.62,23.46
L,5.09,25.75

part2.csv

Month,Year,sales
January,2012,
February,2012,
March,2012,
April,2012,
May,2012,
June,2012,
July,2012,
August,2012,
September,2012,1.71
October,2012,1.9
November,2012,2.74
December,2012,4.2
January,2013,1.45
February,2013,1.8
March,2013,2.03
April,2013,1.99
May,2013,2.32
June,2013,2.2
July,2013,2.13
August,2013,2.43
September,2013,1.9
October,2013,2.13
November,2013,2.56
December,2013,4.16
January,2014,2.31
February,2014,1.89
March,2014,2.02
April,2014,2.23
May,2014,2.39
June,2014,2.14
July,2014,2.27
August,2014,2.21
September,2014,1.89
October,2014,2.29
November,2014,2.83
December,2014,4.04
January,2015,2.31
February,2015,1.99
March,2015,2.42
April,2015,2.45
May,2015,2.57
June,2015,2.42
July,2015,2.4
August,2015,2.5
September,2015,2.09
October,2015,2.54
November,2015,2.97
December,2015,4.35
January,2016,2.56
February,2016,2.28
March,2016,2.69
April,2016,2.48
May,2016,2.73
June,2016,2.37
July,2016,2.31
August,2016,2.23
September,2016,
October,2016,
November,2016,
December,2016,

categoricals.pptx

CATEGORICAL VARIABLES – ENCODING

We will use data visualization to understand the results of regression models with a categorical variable, and to show the model performance.

NUMERICAL AND CATEGORICAL PREDICTORS – EXAMPLE 1

Consider the following dataset, with a categorical predictor X1, a numerical predictor X2, and a numerical response Y:

    X1     X2      Y
    S    -0.10  19.19
    S     2.53  22.74
    S     4.86  23.91
    M     0.26   7.07
    M     2.55   7.93
    M     4.87   8.93
    L     0.08  20.63
    L     2.62  23.46
    L     5.09  25.75

LABEL ENCODING

Label encoding replaces each category of X1 with an integer code (S = 0, M = 1, L = 2):

    X1     X2      Y
    0    -0.10  19.19
    0     2.53  22.74
    0     4.86  23.91
    1     0.26   7.07
    1     2.55   7.93
    1     4.87   8.93
    2     0.08  20.63
    2     2.62  23.46
    2     5.09  25.75

With label encoding, X1 and X2 both enter the model as continuous variables:

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  15.1678     5.6816   2.670    0.037 *
    x1            0.6019     3.4742   0.173    0.868
    x2            0.7769     1.4275   0.544    0.606

    Residual standard error: 8.505 on 6 degrees of freedom
    Multiple R-squared: 0.05259, Adjusted R-squared: -0.2632
    F-statistic: 0.1665 on 2 and 6 DF, p-value: 0.8504

R-squared is close to 0.05, so the fitted equation explains a negligible share of the variation in the response, and the adjusted R-squared is negative (-0.2632). Neither X1 nor X2 appears useful for predicting Y in this model.

The fitted plane is

    Ŷ = 15.1678 + 0.6019 X1 + 0.7769 X2
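The regression summary above is printed in the format of R's lm(); the code used in class is not included in these materials. As a rough, minimal sketch of the same label-encoding fit in Python (assuming the small.csv file, the X1, X2, Y column names listed above, and the S = 0, M = 1, L = 2 coding from the table), one could write:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Load the example data: X1 categorical, X2 numeric, Y numeric response
    df = pd.read_csv("small.csv")

    # Label encoding: replace the categories of X1 with arbitrary integer codes
    df["x1"] = df["X1"].map({"S": 0, "M": 1, "L": 2})

    # Fit Y on the label-encoded X1 and on X2, both treated as continuous
    label_model = smf.ols("Y ~ x1 + X2", data=df).fit()
    print(label_model.summary())   # R-squared comes out close to 0.05

Because the integer codes are treated as a single continuous predictor, the model is forced to fit one common slope across the arbitrary ordering S, M, L; the slides below contrast this with one-hot encoding.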
ONE-HOT ENCODING

One-hot encoding replaces X1 with two binary variables, X11 and X12 (X11 = 1 for category M, X12 = 1 for category L, and X11 = X12 = 0 for category S):

    X11  X12    X2      Y
     0    0   -0.10  19.19
     0    0    2.53  22.74
     0    0    4.86  23.91
     1    0    0.26   7.07
     1    0    2.55   7.93
     1    0    4.87   8.93
     0    1    0.08  20.63
     0    1    2.62  23.46
     0    1    5.09  25.75

Fitting Y on X11, X12 and X2 gives:

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  19.9650     0.5802  34.413 3.90e-07 ***
    x11         -14.0760     0.6703 -20.998 4.54e-06 ***
    x12           1.1974     0.6705   1.786  0.13418
    x2            0.8155     0.1378   5.920  0.00196 **

    Residual standard error: 0.8207 on 5 degrees of freedom
    Multiple R-squared: 0.9926, Adjusted R-squared: 0.9882
    F-statistic: 225 on 3 and 5 DF, p-value: 9.416e-06

The fitted equations for each level of X1 are

    S (X11 = 0, X12 = 0):  Ŷ = 19.9650 + 0.8155 X2
    M (X11 = 1, X12 = 0):  Ŷ = 19.9650 - 14.0760 + 0.8155 X2 =  5.8890 + 0.8155 X2
    L (X11 = 0, X12 = 1):  Ŷ = 19.9650 +  1.1974 + 0.8155 X2 = 21.1624 + 0.8155 X2

WHICH ENCODING IS BETTER?

                          Label encoding   One-hot encoding
    R-squared                    0.05259             0.9926
    Adjusted R-squared           -0.2632             0.9882

WHY ARE THE MODELS DIFFERENT?

The label-encoding model has a single prediction equation, the plane

    Ŷ = 15.1678 + 0.6019 X1 + 0.7769 X2

whereas the one-hot-encoding model has one prediction equation per category, the three lines listed above. Label encoding therefore results in a regression plane, while one-hot encoding results in a set of regression lines, one for each category of the categorical variable. If the observations lie close to a plane, then both label encoding and one-hot encoding give good results.

With a large number of variables in the model it is not possible to inspect a display like this, so we may rely on R-squared or on the cross-validation error to choose the best model.
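The comparison above uses in-sample R-squared. As a minimal sketch of how both encodings could also be compared by leave-one-out cross-validation in Python (again assuming the small.csv file and column names shown earlier; the Treatment(reference='S') contrast reproduces the X11/X12 coding with S as the baseline level), one might write:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("small.csv").rename(columns=str.lower)
    df["x1_label"] = df["x1"].map({"S": 0, "M": 1, "L": 2})

    # Two candidate formulas: label encoding vs one-hot encoding
    # (C() makes patsy build the X11/X12 dummy columns, with S as the baseline)
    formulas = {
        "label encoding":   "y ~ x1_label + x2",
        "one-hot encoding": "y ~ C(x1, Treatment(reference='S')) + x2",
    }

    for name, formula in formulas.items():
        fit = smf.ols(formula, data=df).fit()
        # Leave-one-out cross-validation (only 9 rows, so this is cheap)
        errors = []
        for i in range(len(df)):
            train, test = df.drop(df.index[i]), df.iloc[[i]]
            pred = smf.ols(formula, data=train).fit().predict(test)
            errors.append(float(test["y"].iloc[0] - pred.iloc[0]))
        print(name, "R2 =", round(fit.rsquared, 4),
              "LOO RMSE =", round(float(np.sqrt(np.mean(np.square(errors)))), 3))

On this small dataset the one-hot model should come out clearly ahead on both criteria, in line with the R-squared comparison in the slides.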
EXAMPLE 2 – FORECASTING

Consider the monthly demand (sales) data shown in part2.csv above. The original demand table is converted from wide format to long format, with one row per Month and Year combination. Three regression models are then compared (a code sketch of these three models is given at the end of this document).

EXAMPLE 2 – MODEL 1: LINEAR REGRESSION

Use linear regression to predict sales using time as the predictor.

Plot: sales vs Period.

EXAMPLE 2 – MODEL 2: CATEGORICAL VARIABLE AND LABEL ENCODING

Predict sales using Year and Month. Month is label encoded as a numeric Period, so the model uses Year and Period, both numeric.

Plot: sales vs Year and Period.

EXAMPLE 2 – MODEL 3: CATEGORICAL VARIABLE AND ONE-HOT ENCODING

Predict sales using Year and Month, with Month one-hot encoded.

Plot: sales vs Year and Month.

Day 6.docx

Day 6: Assignment

Submit Assignment

· Submitting a file upload
· File Types: pdf, doc, and docx

Instructions

Write and run the code developed in class to predict the total sales from the file sales.csv, and submit the Jupyter notebook in pdf format.

Submission

Click on the blue button in the top right corner to submit your assignment. Click Next (below) to progress through the course.

Assignment Rubric

Coding (60.0 pts)
· Full Marks (60.0 to >30.0 pts): No coding errors
· Partial Marks (30.0 to >20.0 pts): One to three coding errors
· No Marks (20.0 to >0 pts): More than three coding errors

Format & Editing (40.0 pts)
· Full Marks (40.0 to >20.0 pts): Code is clear and easy to follow. Plots display effective visualization.
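The code developed in class is not included in these materials, and the sales.csv file referred to in the assignment is not shown. As a rough Python sketch of the Example 2 workflow that the assignment builds on, applied here to the part2.csv data listed earlier (the Period and t variable names, the month-to-number mapping, and the dropping of months with missing sales are assumptions, not taken from the slides), one could write:

    import pandas as pd
    import statsmodels.formula.api as smf

    # part2.csv is already in long format: one row per Month and Year combination
    df = pd.read_csv("part2.csv").dropna(subset=["sales"]).reset_index(drop=True)

    # Label-encode Month as a numeric Period (1 = January, ..., 12 = December)
    months = ["January", "February", "March", "April", "May", "June",
              "July", "August", "September", "October", "November", "December"]
    df["Period"] = df["Month"].map({m: i + 1 for i, m in enumerate(months)})

    # Running time index over the whole series, used as the time predictor in Model 1
    df["t"] = list(range(1, len(df) + 1))

    # Model 1: linear regression of sales on time
    m1 = smf.ols("sales ~ t", data=df).fit()

    # Model 2: Year and label-encoded Month (both numeric)
    m2 = smf.ols("sales ~ Year + Period", data=df).fit()

    # Model 3: Year and one-hot-encoded Month
    m3 = smf.ols("sales ~ Year + C(Month)", data=df).fit()

    for name, m in [("Model 1", m1), ("Model 2", m2), ("Model 3", m3)]:
        print(name, "adjusted R-squared =", round(m.rsquared_adj, 3))

Plotting the observed and fitted sales against the time index would correspond to the "sales vs Period", "sales vs Year and Period", and "sales vs Year and Month" views mentioned in the slides.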