Analytics and Computing for Industrial Engineers
Data Preprocessing and Linear Regression
1 University at Buffalo, Industrial and Systems Engineering IE322 Analytics and Computing for Industrial Engineers Lab#2 Fall 2022 Data Preprocessing and Linear Regression (This is an individual lab) Name: ________________ Due 23:59 October 23th, 2022 Description: The dataset for this lab is SalesData.csv (the same one as used in Lab 1), and it is available on UBlearns. The dataset is used to record the sales information of a company. Requirements: Draft a report to document your R code and results (or partial results if there are too many) in each step. Note that your report will be graded on both technical content (70%) and report quality (30%). Submit two files to UBLearns: 1) your report, and 2) your R script. 1. Data preprocessing (30 pts) Read the SalesData.csv data into R console as D0. (a) Identify the variable(s) that have missing data, and determine the number of missing values for each variable identified in your report. (b) Install R package “imputeTS” into your R environment, and import it using library(imputeTS). Note: this package is used for data imputation. (c) Perform data imputation by following methods: i. Using D0, delete all rows with missing data, and name the new dataset as S1. ii. Using D0, replace the missing values with the mean value of each numerical variable, and name the new dataset as S2. (Hint: you may use na_mean() in “imputeTS” package, or any other method for this data imputation). (d) Calculate the correlation between “Profit” and “Sales” for both S1 and S2. To compare how those two correlations are affected by different data imputation methods. Document your findings in the report. (e) Use any method to identify any outliers in “Shipping.Cost” column. Count how many outliers identified. Document your findings in the report. (f) Delete those outliers you identified in 2(e), and name the new dataset as S3. Abdullah Fahad Abdullah Fahad Abdullah Fahad Abdullah Fahad 2 2. EDA (20 pts) Using D0 for the following tasks. (a) Draw the histogram for the “Shipping.Cost” column, and change the y-axis as frequency. Then, add a density curve on it. Present the plot into your report. (b) Draw a single scatter plot to show the relationship between “Profit” (as in y-axis) and “Sales” (as in x-axis). Based on this plot: a. add a color of your choice for each “Ship.Mode” (i.e., Delivery Truck, Express Air, Regular Air). Make sure you choose different colors; b. add a legend. Present this plot into your report. (c) Draw a scatter plot in a matrix form (see slides Lecture 5, page 16) for "Order.Quantity", "Profit", and "Shipping.Cost". Present this plot into your plot. 3. Linear regression (50 pts) For the following tasks, use the “Profit” as dependent variable, and four independent variables: “Order.Quantity”, “Sales”, “Unit.Price”, “Shipping.Cost”. (a) Build four linear regression models (name as m0, m1, m2, m3) on dataset D0, S1, S2, and S3 respectively, using lm() function. (b) Fill out the following table in your report using summary(). Models Dataset & description Model performance RMSE R2 m0 D0: raw data m1 S1: excluded NAs m2 S2: imputed Nas with mean m3 S3: excluded outliers Your proposed model in (e) (c) Compare the model performance among m0, m1 and m2, and discuss how the data imputation methods affects the model performance. (d) Compare the model performance between m0 and m3, and discuss how the exclusion of outliers affects the model performance. (e) Propose another way to further improve the model performance in your report. Justify your method by building such a linear regression model. You may consider add additional variables (e.g., Store, company), or interaction/polynomial terms, based on the dataset S3. Fill out the table in (b) with your model results.