Project Assignment as attached
PROJECT PROPOSAL

The following dataset, https://www.kaggle.com/taranmarley/perth-temperatures-and-rainfall, is a time series dataset measuring daily minimum and maximum temperature and rainfall in Perth, Australia from 1944 to 2020.

Proposed solution: Model and forecast the temperature to better understand the rate of climate change and encourage regulatory bodies to take action to reduce the effects of global warming.

ECON 3343/6645 BUSINESS FORECASTING
Fall 2020

PROJECT GUIDELINES

Please make sure you follow the steps below in your project. You can add more steps if you feel it is necessary. Each project should be original and different from the others. Your submission will consist of a written report and the R script of the code used in your report. Remember to keep the R code and the written analysis separate: the R script should contain all of your code, and the written report should contain only analysis and graphs (no code). The deadline for submission is Tuesday, December 15, 2020 at 11:59pm. All files will be submitted through Canvas under the Final Project entry listed in the assignments tab.

Report Format: The format of the report will adhere to APA standards. For information on formatting a paper to APA, please see the guide below:
https://owl.purdue.edu/owl/research_and_citation/apa_style/apa_formatting_and_style_guide/general_format.html

1. Introduction: Introduce the data you chose to examine. What problem are you trying to solve by using this data?

2. Data:
· What is your data?
· What is the source? Where did you obtain it (give the link, source, etc.)?
· What is the time span of your data (ex: 2001.1 to 2016.10; daily, quarterly, or annual; how many observations)?
· What are the variables, and what do they measure?

3. Plot your data. Include the graph for the whole period.
· What can you identify from the time series plot (trend, cyclicity, seasonality)?
· If your data has multiple variables, plot a relational scatter plot between the X and Y variables. Is there evidence of a relationship?

4. Dividing the data: Partition the data into two sets: training and testing. The distribution of the split is up to you, but some common splits are 70/30, 75/25, and 80/20. The training set is the data you will use to train your models later. The test data is what will be used to measure the accuracy of your model.

5. Test whether there is seasonality and/or trend in the data (hint: decomposition, ACF function). If there is evidence of seasonality or trend, what information does it show? If there is no evidence, comment on that as well: how would a lack of seasonality affect our model selection?

6. Establish a baseline accuracy measure.
· Use a benchmark forecast model (meanf, drift, snaive, etc.) to establish a baseline accuracy.

7. Use exponential smoothing (ETS) techniques to train your model with the training data, and then check the accuracy of your model against the test data.
· If there is no seasonality or trend, smooth the data with moving averages (MA).
· Use all the commands in the chapter to see which model fits best (Holt, Holt-Winters, damped model, etc.).
· Which model do you prefer to use? Why? Does the composition of the data make one model more applicable than others?
· Check the residuals of your preferred model. Do they look like white noise? What do the residuals indicate about your model?
· Forecast for the test period.
· Compare your forecast to the test data. What is the forecast performance?
· Decide whether this is an acceptable model to use. Why is it acceptable?

8. Use ARIMA(p,d,q) modeling to forecast.
· Explain when you can use ARIMA modeling. Is the series stationary for the training period?
· Test whether the series is stationary. What did you decide? Is the series stationary or not?
· If the data is not stationary, make the data stationary (differencing).
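The differencing step just described can be sketched as follows. This is only an illustrative example in Python (the course itself uses R, where diff() and the forecast package's ndiffs() do the same job); the toy series and the difference() helper are hypothetical, not part of the assignment.

```python
# Illustrative sketch only: differencing to make a series stationary.

def difference(series, lag=1):
    """Return the lag-differenced series: y[t] - y[t-lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# A toy series with a linear trend (slope 2) plus a period-4 pattern:
y = [2 * t + (t % 4) for t in range(12)]

dy = difference(y)           # first difference removes the linear trend;
print(dy)                    # the result fluctuates around the slope, 2

sdy = difference(y, lag=4)   # lag-4 (seasonal) differencing removes the
print(sdy)                   # period-4 pattern as well: a constant series
```

A series whose differences fluctuate around a constant mean, as here, is the kind of result the stationarity tests in step 8 look for.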
· Use the necessary command to fit an ARIMA(p,q) model to the series. What do you find? What is the model? What are p and q? Is it integrated (differenced)?
· Check whether the residuals are normally distributed.
· Plot the ACF of the residuals. What do you observe?
· Decide whether the model is acceptable to use for forecasting.
· Forecast the test period.
· What is the forecast error?

9. Compare the accuracy of both models (ETS and ARIMA). Which model should you use to forecast?

10. Run the ARIMA(p,q) model on the natural logarithms of the series. Did your results change? (Use lambda=0 in your R command and show your results.)

Time Series Sales Forecasting
James J. Pao*, Danielle S. Sullivan**
*[email protected], **
[email protected]

Abstract—The ability to accurately forecast data is highly desirable in a wide variety of fields such as sales, stocks, sports performance, and natural phenomena. Presented here is a study of several time series forecasting methods applied to retail sales data, comprising weekly sales figures from various Walmart department stores across the United States over a period of approximately two and a half years. Significant surges in sales are noticeable in the data during pre-holiday and holiday weeks, which present a challenge for any developed forecasting model. The prediction models implemented herein are regression decision trees, Seasonal-Trend Decomposition using Loess with Autoregressive Integrated Moving-Average models (STL + ARIMA), and time-lagged feed-forward neural networks (FFNNs). In particular, the STL + ARIMA models and the time-lagged FFNNs performed reasonably well in forecasting the weekly sales data. The best FFNN implementation, using a time-lag value d = 4 and mean weekly sales as inputs, achieved a mean absolute error of 1252; weekly sales for the store departments are in the tens of thousands. It is also notable that the results achieved by the time-lagged FFNNs did not require any deseasonalizing of the sales data, indicating that neural networks may be able to effectively detect and consider any seasonality during training and prediction.

1 INTRODUCTION

In a world today where competitive margins are becoming increasingly narrow and actions must be decisive yet informed, the ability to accurately make forecasts is of premier importance. This is certainly true in the forecasting of numerical data such as the health of a country's economy or the movements of a stock market from day to day. Forecasting is even beneficial in domains such as environmental monitoring or sports performance, and, accordingly, much forecasting work has been done across a broad swath of exciting fields and disciplines.
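The "time-lagged" inputs mentioned in the abstract (e.g., d = 4) amount to a sliding window over the series: each training input is the previous d observations and the target is the next one. A minimal, hypothetical sketch of that windowing (the helper name and toy data are illustrative, not from the paper):

```python
# Hypothetical sketch: building time-lagged input/target pairs of the kind
# fed to the feed-forward neural networks (the paper's best model uses d = 4).

def make_lagged_pairs(series, d):
    """Each input is the previous d values; the target is the next value."""
    inputs, targets = [], []
    for t in range(d, len(series)):
        inputs.append(series[t - d:t])
        targets.append(series[t])
    return inputs, targets

sales = [10, 12, 11, 15, 14, 18, 17]
X, y = make_lagged_pairs(sales, d=4)
print(X[0], y[0])   # [10, 12, 11, 15] 14
```

Because each window carries several consecutive past values, seasonal patterns remain visible to the network, which is consistent with the paper's observation that no explicit deseasonalizing was required.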
A more traditional yet still thoroughly compelling application of forecasting is sales prediction, which is the focus of this work. As markets become more and more global and competition grows ruthless, optimizing an organization's operational efficiency is of premium importance. When companies must spread their resources broadly and consumers have a surfeit of choices, every advantage a company can squeeze out will make a difference. If a company can match the demand of a product with just the right amount of supply, then there will be no lost sales due to a lack of inventory as well as no costs from overstocking. Sales forecasting uses patterns gleaned from historical data to predict future sales, allowing for informed courses of action such as allocating or diverting existing inventory, or increasing or decreasing future production.

This work investigates the performance of a variety of predictive models for the application of departmental sales forecasting. As a baseline method, a regression decision tree is implemented. Then, the more sophisticated models of Seasonal-Trend Decomposition using Loess with Autoregressive Integrated Moving-Average (STL + ARIMA) and feed-forward neural networks using time-lagged inputs were used.

2 RELATED WORK

Two currently popular approaches to nonlinear time series prediction problems are statistical approaches using ARIMA and machine learning approaches using Artificial Neural Networks (ANNs). ANNs have been shown to perform well in time series forecasting because of their ability to accurately represent non-linear data [1]. Both of these approaches have had success when applied to sales forecasting and stock prediction [2]. When applied to financial data, the ARIMA model is able to leverage the fact that financial time series data is generally related to past values [3]. Provided there are no sudden changes in value or behavior, an ARIMA model will also be very effective for financial time series forecasting [4].
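The idea that "financial time series data is generally related to past values" is the autoregressive core of ARIMA: each value is modeled as a linear function of its predecessors. A minimal sketch of that idea, fitting an AR(1) coefficient by least squares on a synthetic series (an illustration only, not the implementation used in this paper or in the cited works):

```python
# Sketch of the autoregressive component underlying ARIMA: model
# y[t] ~= phi * y[t-1] and estimate phi by ordinary least squares.

def fit_ar1(series):
    """Least-squares estimate of phi in y[t] ~= phi * y[t-1]."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den

# A noiseless series generated by y[t] = 0.8 * y[t-1], so phi should be 0.8:
y = [100.0]
for _ in range(20):
    y.append(0.8 * y[-1])

print(round(fit_ar1(y), 3))   # 0.8
```

Full ARIMA estimation adds moving-average terms and differencing on top of this, but the dependence on past values shown here is what the model exploits in financial data.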
In his 2010 paper, Adebiyi [4] applies the ARIMA model to accurately forecast Nokia stock prices. It is important to note that the linear assumptions of the ARIMA model have resulted in poor forecasting models in cases of stock price prediction when the dataset includes values coming into and coming out of an economic recession (changing properties).

In a separate paper, Adebiyi [2] implements an ANN model and an ARIMA model to predict Dell stock prices. In his model comparison, the ANN slightly outperforms the ARIMA model. Adebiyi attributes this partially to the fact that the ARIMA model assumes that the time series is generated from a linear process.

3 DATASET

The dataset used was provided by Walmart Inc., an American multinational retail corporation, for a 2014 data science competition (Kaggle). The dataset contains historical weekly sales data from 45 Walmart department stores in different regions across the United States. The training set has 421,570 samples. Each sample has the following features: departmental weekly sales, the associated department (81 departments, each listed as a number), the associated store (listed as a number), the store type, the date of the week's start day, and a flag indicating whether the week contains a major holiday (Super Bowl, Labor Day, Thanksgiving, Christmas).

Also supplied is a corresponding set of features for each week-store combination, which includes temperature, fuel price, CPI, unemployment rate, and promotional markdown data.

There is no publicly available test set. Specifically, the ground-truth values for the test set are not available, so assessing each model against the official test set must be done by making test predictions and submitting to Kaggle's online platform. Hold-out sets are generated from the provided training samples for local validation, but for some models (namely the neural