Please review this assignment and let me know whether you can accurately write all of the commands and provide all of the required screenshots with explanations. The course is DBST 667: Data Mining.
Due date: Sunday, October 4, 2020.

Section 1

This section involves running some R commands and providing a screenshot of the output of every command. At the end, generate an R script containing all commands in this lab. Every command in the R script must have a descriptive comment explaining what the command does. Submit the R script alongside this Word document, which contains all of your commands, screenshots, and explanations for all included questions. All commands in this exercise must be executed using RStudio only.

Deliverables: Two files:
(1) Submit this lab report with answers to all questions, including output screenshots.
(2) Submit an R script that contains all commands, with comments that briefly describe each command's purpose.

Grading: This exercise is worth 80% of the course grade. All questions must be answered in your own words, with any paraphrased references properly cited using in-text citations and a reference list as needed. In addition, grammatical and spelling errors may affect the grade.

Part 2 – Run an exercise on the imports-85 dataset, completing this report and providing the commands, output screenshots, and discussion/interpretation as requested. Ensure that all commands are saved in this report AND in an R script. For reference: UCI Machine Learning Repository: Imports 85

a. Introduction:
   i. Identify the dependent variable and independent variables in the imports-85 dataset.
   ii. Based on what you have learned this week about multiple linear regression, provide a one-paragraph, master's-level response describing what you anticipate the lm algorithm will accomplish for the imports-85 data. Be specific about the behavior and structure of the multiple linear regression model.

b. Data Pre-Processing: Load the imports-85 data into RStudio using the read.csv command (do not use File > Import Dataset > From CSV in the RStudio GUI, as this uses read_csv() and results in significantly different variable types).
   i. Run the commands to remove the following variables: engine_type, make, num_of_cylinders, fuel_system. Include the commands and output screenshot.
      Command(s): >
      Output:
   ii. What additional data pre-processing (if any) does the lm() method require for the imports-85 data? Include the commands you ran and the output screenshot.
      Command(s): >
      Output:

c. Multiple Linear Regression – Running the Method with Training Data:
   i. Run 'set.seed(12345)' and then split the data into a training set consisting of 70% of the instances and a test set containing the remaining 30% of the instances. Include the commands below.
      Commands: >
   ii. Run the lm() function to build the multiple linear regression model, storing the results in a variable called 'mlr_model'. Include the command you ran and a brief discussion of the default input parameters used.
      Command: >
      Discussion:
   iii. Run the command 'summary(mlr_model)'. Include the output screenshot and answer the following questions:
      Output:
      How does the model represent the relationship between dependent and independent variables in the imports-85 dataset?
      The model represents the relationship as a linear equation: the predicted value of the dependent variable is the intercept plus the sum of each independent variable multiplied by its estimated coefficient. The asterisks in the summary output mark the statistical significance of individual coefficients; they do not define the relationship.
      How does the method handle categorical variables?
      By default, lm() handles categorical variables by dummy coding them: a categorical variable with k levels is expanded into k - 1 indicator (0/1) variables, with the first level used as the reference level.
      What does the residuals section of the output mean?
      What are the coefficients and what do they mean?
      What is an intercept and what does it mean?
      What do the p-values tell about the significance of each variable?
      What is the overall accuracy of the model?

d. Multiple Linear Regression – Evaluate the Model with Test Data:
   i. Run the command to evaluate 'mlr_model' on the imports-85 test data. Include the command below.
      Command: >
   ii. Run the command to build the predicted vs. actual (observed) value scatter plot. Add a diagonal line to this plot. Include the commands and the final plot with the diagonal line below.
      Commands: >
      Output:
   iii. What does the distance between the points and the diagonal line tell us about the accuracy of the prediction?

e. Multiple Linear Regression – Residual Plots:
   i. Run the 'plot(mlr_model)' command to build the residual plots. Interpret at least one of the plots. Include the command, the plot, and the interpretation of that plot below.
      Command:
      Output:
      Interpretation:

f. Multiple Linear Regression – Minimum Adequate Model:
   i. What is the minimum adequate model? Why do we build it? Provide a one-paragraph, master's-level response.
   ii. Run the command to build the minimum adequate model and store the model in a variable named 'mlr_model_min'. Include the command and output screenshot.
      Command: >
      Output:
   iii. Run the 'summary(mlr_model_min)' command. Include the command, output screenshot, and answers to the following questions:
      Command: >
      Output:
      Which variables were eliminated and which variables remain?
      What are the coefficients and the intercept?
      What do the coefficients and intercept mean?
      Compare the prediction accuracy of the minimum adequate model with the prediction accuracy of the original model. Provide a one-paragraph, master's-level response.

g. New Instance:
   i. Suppose that a new car is added to the imports-85 dataset and we know the values of its independent variables. How would you use the model to predict the value of the dependent variable for the new car? (Hint: Use the lessons learned and hints from the prior week to complete this exercise.) Include the command you would run below:
      Command: >

h. Summary:
   i. Is the multiple linear regression method appropriate for predicting the values of dependent variables in the imports-85 dataset? Explain why or why not. Provide a one-paragraph, master's-level response.

References

Section 2

The purpose of this exercise is to make practical sense of mining data streams. This section DOES NOT require using RStudio. The assignment consists of the two parts below. Put all of your answers in the spaces provided.
Answer all questions in your own words. This assignment is designed to be free from the need for external research. Should the need arise to include external sources, ensure that you properly cite and attribute all non-original content.

Part 1 – Website Optimization

a. Visit http://www.websiteoptimization.com/services/analyze/ and analyze any web page of your choice using the 'Enter URL to diagnose' field. For the purpose of this exercise, it would be beneficial and easier for analysis if you choose a web page that has a perceptible latency when responding to your web requests.
   i. Discuss which page you chose to analyze and what the website optimization report reveals about that web page. Include the report and your interpretation of it below. (Hint: Focus on the 'Analysis and Recommendations' section.)
   ii. Imagine that you are a website manager. Explain how you could use the website optimization report to improve the performance of your website. Provide at least two examples from the website optimization report.

Part 2 – Time Series Analysis

a. Visit https://datamarket.com/data/list/?q=provider:tsdl and choose one of the datasets in the list that would be good for time series analysis.
   i. Discuss at a high level what data is in the dataset and why you selected this data for time series analysis.
   ii. Imagine that your course project will use this dataset for time series analysis. Now, imagine what insight or purpose your project would try to uncover by exploring this dataset with time series analysis. (Note: This is a purposefully abstract prompt. The only wrong answers are those that do not use your imagination and analytical abilities.)
   iii. Choose one of the following time series methods and discuss how you would use the method in your study:
      · Lossy Counting
      · Random Sampling
      · Very Fast Decision Tree (VFDT)
      · Concept-Adapting Very Fast Decision Tree (CVFDT)
      · Hoeffding Tree
      · CluStream
      · Sequential Pattern Mining
      (Hint: Your weekly lab projects and course project have been building towards this form of thought, so look to the structure/implementation of these for your answer.)
   iv. Would the method you selected meet the purpose of the study? Are there any potential drawbacks and/or any additional considerations that must be made?

References
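For the Lossy Counting option listed in Section 2, Part 2 a.iii, the core idea can be sketched in a few lines of R. This is a minimal sketch under stated assumptions, not a prescribed solution for the assignment: the function names `lossy_count` and `frequent_items`, the toy stream, and the values chosen for `epsilon` and `support` are all hypothetical and used only for illustration. The sketch processes a stream one item at a time, groups positions into buckets of width ceiling(1 / epsilon), and prunes rarely seen items at each bucket boundary so that memory stays bounded.

```r
# Lossy Counting sketch: approximate item frequencies over a stream.
# All names here (lossy_count, frequent_items, toy_stream) are hypothetical.
lossy_count <- function(stream, epsilon) {
  width <- ceiling(1 / epsilon)   # bucket width
  freq  <- list()                 # item -> count observed since insertion
  delta <- list()                 # item -> maximum count missed before insertion
  n <- 0
  for (item in stream) {
    n <- n + 1
    bucket <- ceiling(n / width)  # id of the bucket that position n falls in
    key <- as.character(item)
    if (is.null(freq[[key]])) {
      # New item: start counting, and record how many occurrences
      # could have been pruned in earlier buckets.
      freq[[key]]  <- 1
      delta[[key]] <- bucket - 1
    } else {
      freq[[key]] <- freq[[key]] + 1
    }
    # At each bucket boundary, drop entries that cannot be frequent.
    if (n %% width == 0) {
      for (k in names(freq)) {
        if (freq[[k]] + delta[[k]] <= bucket) {
          freq[[k]]  <- NULL
          delta[[k]] <- NULL
        }
      }
    }
  }
  list(n = n, freq = unlist(freq), delta = unlist(delta))
}

# Report items whose retained count clears the (support - epsilon) threshold.
frequent_items <- function(result, support, epsilon) {
  keep <- result$freq >= (support - epsilon) * result$n
  names(result$freq)[keep]
}

# Toy stream (an assumption for illustration): "a" dominates, others are rare.
toy_stream <- c("a", "b", "a", "c", "a", "a", "d", "a", "b", "a", "a", "e")
result <- lossy_count(toy_stream, epsilon = 0.25)
print(result$freq)
print(frequent_items(result, support = 0.5, epsilon = 0.25))
```

A smaller `epsilon` gives tighter counts but keeps more entries in memory, which is the main trade-off to discuss when judging whether the method fits a given study. For a real time series, the values would first need to be discretized into items (for example, binned levels or event labels), since Lossy Counting counts occurrences of discrete items.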