Below are a series of tasks for you to undertake for Part A. 1. Undertake EDA on this dataset. a. Do...

Question

Below are a series of tasks for you to undertake for Part A.

1. Undertake EDA on this dataset.
   a. Do you need to clean the data in any way? Justify what you decide to do (or not do).
   b. Describe two insights gained just from EDA that would be of interest to the sales manager.
2. Basic model fitting:
   a. Creating the model:
      i. Create an aggregated data set using the fields date, industry and location, with a mean of monthly_amount.
      ii. Create a line plot of the variable monthly_amount for industry = 1 and location = 1. Note the seasonality by month in this time series.
      iii. For industry = 1 and location = 1, train a linear regression model with monthly_amount as the target.
         i. Note 1: Remember that time is very important in this model, so be sure to include variable(s) for the time sequence. (Hint: on your plot you may see a local trend such as seasonality. Consider how you could craft a variable to capture this. You may also see a global upwards or downwards slope; could you craft a variable to capture this? There are therefore two simple variables you could create to capture time. Could you craft more complex ones, perhaps with polynomials, to capture local or global trends? Experiment and see!)
         ii. Note 2: Carefully think about how you split your test and train sets. (Hint: Random is not appropriate!)
      iv. Create a prediction for monthly_amount in December 2016. Comment on how reasonable this prediction is. For example, if you were to plot it on the same plot as 2aii, would it sit somewhere reasonable?
   b. Describe the model:
      i. How well does your model fit the data it is trained on in a statistical sense? Define & describe an appropriate quantitative measure. Justify your choice of measure.
      ii. How well does your model predict out-of-sample? Define & describe an appropriate quantitative measure. Justify your choice of measure.
3. Advanced model fitting:
   a. Apply the modelling process you built for industry 1 and location 1 to all industries and locations programmatically.
   b. Calculate your evaluation measure for the training data and your testing data, for all models. Identify the two industries and two locations for which your method performs worst.
      i. Ensure your models all make a prediction for December 2016.
   c. What might be causing the models on these two industries and locations to be performing poorly (HINT: Some plots may help here…)? How might you fix this in future?
4. Reporting
   a. Using all the notes and answers you have above, wrap up all your work into a report for the sales manager that follows the CRISP-DM methodology. Whilst the sales manager is not a data scientist, they are intelligent and have some experience in data analytics. Therefore it will be an important task to ensure you include enough technical details to withstand QA and technical scrutiny whilst positioning for a business audience.
   b. Ensure you include your predictions on the test set and for December 2016 as an appendix (your predictions, the actual, the difference), which can be referenced in your report.
5. Submission.
   a. You must submit your report and professionally commented R-code.

Mohd · Accepted Answer

Analysis Summary:
We conducted exploratory data analysis and linear regression modelling on the transaction dataset, which has 94,248 rows and five columns. We plotted bar charts of the number of observations per location and per industry.
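The modelling step in the question (task 2a) can be illustrated with a minimal sketch. The assignment asks for R; the Python below is only a language-neutral illustration of the idea, not the accepted answer's code. Everything in it is an assumption made for illustration: the series is synthetic (36 made-up months, not the transaction data), the helper names `design_row`, `fit_ols`, `predict` and `rmse` are hypothetical, and RMSE is just one possible evaluation measure. It builds a linear trend variable and month dummies, splits chronologically rather than randomly, and forecasts the month after the series ends.

```python
import math

# Made-up month effects (January..December); NOT taken from the real data.
SEASONAL = [0, -5, 3, 8, 2, -4, -6, 1, 7, 4, 12, 20]

def make_series(n_months):
    """Synthetic stand-in for one industry/location: (t, month, monthly_amount)."""
    rows = []
    for t in range(1, n_months + 1):
        month = (t - 1) % 12 + 1
        wiggle = (t * 7) % 5 - 2  # deterministic pseudo-noise in [-2, 2]
        rows.append((t, month, 100.0 + 2.0 * t + SEASONAL[month - 1] + wiggle))
    return rows

def design_row(t, month):
    """Intercept, global trend (t), and 11 month dummies (January is the baseline)."""
    return [1.0, float(t)] + [1.0 if month == m else 0.0 for m in range(2, 13)]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(beta, row):
    return sum(b * x for b, x in zip(beta, row))

def rmse(actual, fitted):
    """Root mean squared error, in the same units as monthly_amount."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, fitted)) / len(actual))

series = make_series(36)
# Chronological split: the last 6 months are held out, never a random sample.
train, test = series[:30], series[30:]

X_train = [design_row(t, m) for t, m, _ in train]
y_train = [y for _, _, y in train]
beta = fit_ols(X_train, y_train)

train_rmse = rmse(y_train, [predict(beta, r) for r in X_train])
y_test = [y for _, _, y in test]
test_rmse = rmse(y_test, [predict(beta, design_row(t, m)) for t, m, _ in test])

# Forecast the first month after the series ends (t = 37, which is a January here).
next_month_forecast = predict(beta, design_row(37, 1))

print("train RMSE:", round(train_rmse, 3))
print("test RMSE:", round(test_rmse, 3))
print("next-month forecast:", round(next_month_forecast, 2))
```

Comparing the training RMSE with the held-out RMSE gives a rough view of in-sample fit versus out-of-sample performance. Repeating the same steps for every industry and location pair would be one way to approach task 3.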

