Need to be filled and solve in R studio and then knit in a html document (with my name on the top).I provide the pdf version of the homework
5/19/2019 Math 270: Lab # 5 Solution file:///C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression (2) (1).html 1/13 Math 270: Lab # 5 Solution YOUR NAME HERE Due May 23, 2018 Task #1: Exploratory data analysis – Honey production in the US from 1998-2012 Task 2: Simple linear regression. 2a. Briefly explore the data 2b. Correlation 2c . Build the linear model 2d. Summarize and evaluate the linear regression 2e. Analyze residuals. 2f. Make predictions and evaluate the model on a new set of data. Task 3: Understanding Linear Regression Task 4: how is goodness of fit related to residuals? 4a. Zero error 4b. Non-zero error 4c. Create your own Task #1: Exploratory data analysis – Honey production in the US from 1998-2012 For this task you will want to import the honeyproduction.csv file available through the Canvas page. It is very important that you understand the contents of this file and the meaning of each variable. This information is available through Kaggle (a site that hosts machine learning/data science competitions) at https://www.kaggle.com/jessicali9530/honey-production (https://www.kaggle.com/jessicali9530/honey-production) The easiest method to import a dataset in RStudio is to use the “Environment” panel – Import Dataset and choose the “text” option (as this is a csv file). An interface for you to select the file and fine tune settings will pop up. For cases where you’d want to automate this process, there are (as with most programming languages) also read() file commands. There are several questions the Kaggle site posed for exploring this data. They are: 1. How has honey production yield changed from 1998 to 2012? 2. Over time, which states produce the most honey? Which produce the least? Which have experienced the most change in honey yield? 3. Does the data show any trends in terms of the number of honey producing colonies and yield per colony before vs. after 2006, which was when concern over Colony Collapse Disorder spread nationwide? 4. Are there any patterns that can be observed between total honey production and value of production over this time period? 5. How has value of production, which in some sense could be tied to demand, changed over this time period? https://www.kaggle.com/jessicali9530/honey-production 5/19/2019 Math 270: Lab # 5 Solution file:///C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression (2) (1).html 2/13 For this task, please write a report such as one you would turn in to a manager at a company. This means that you should have (1) succinct but meaningful writing, (2) nice visuals that are easily understood and whose relevance is summarized clearly, (3) no code visible (set echo=FALSE ) (4) professional language and attention to layout and design. This task is meant to simulate (to a limited extent, with fairly clean data) the experience you’d have if you were working as a data scientist.** Please note: I know about the code on Kaggle. If you refer to it and get ideas for analysis or visuals please refer to the author of the code within your report. Task 2: Simple linear regression. You don’t need to write this as an official report– just like a normal lab. Consider the cars dataset built in to R. 2a. Briefly explore the data Write a short preliminary analysis of the cars data set. This should include the following: - A short paragraph describing the contents of this data set. [Use ?cars ] - A scatterplot of the data, side-by-side boxplots of the variables, and a description of visual trends evident from both plots. 2b. Correlation Find the covariance and correlation of the speed to the stopping distance. What do these provide evidence for? 2c . Build the linear model Build a linear model using the following command: CarsModel=lm(dist~speed,data=cars) . This automatically creates the least-squares (best fit) linear model predicting the cars$dist variable from the cars$speed data. Then create a plot of the line on top of the scatterplot. To do this, first generate the scatterplot, and then use the command abline(CarsModel) . To make plots prettier, let’s use a plotting library: ggplot. You’ll first have to install it. In your R console, enter the code install.packages("ggplot2") We’ve avoided using it until now because the arguments for the function are a bit more opaque. The following “cheat sheet” may be helpful: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf (https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) library(ggplot2) data("cars") The following plot includes a 45% confidence interval around the predicted line– this accounts for some uncertainty regarding the slope and intercept by shading in a region that, given your data, has a 45% chance of containing the “true” regression line you’d find from the full population. Change the code to build a 90% confidence interval– what happens to the shaded region, and why does that make sense? Also, why does the confidence interval widen at the extremities? ggplot2::ggplot(cars,aes(x=speed,y=dist))+geom_point(color='#2980B9',size=4)+geom_smooth(method= lm,se=TRUE, color='#2C3E50',level=0.45) https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf 5/19/2019 Math 270: Lab # 5 Solution file:///C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression (2) (1).html 3/13 CarsModel=lm(dist~speed,data=cars) ggplot2::ggplot(cars,aes(CarsModel$residuals))+geom_histogram(binwidth=5) 5/19/2019 Math 270: Lab # 5 Solution file:///C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression (2) (1).html 4/13 2d. Summarize and evaluate the linear regression Use the command summary(CarsModel) to get a numeric summary of the linear model you generated. Use it to write out answers to the following questions. As you do this, you may find the description of the linear model summary (midway down the page at http://blog.yhat.com/posts/r-lm-summary.html (http://blog.yhat.com/posts/r-lm- summary.html)) to be helpful. It’s not mathematically rigorous, but gives some good rules of thumb for interpretting model fits– if you have questions aobut where those rules of thumb come from, please ask. i. The middle 50% of residuals (model errors) are between what two values? ii. The equation of the line (in format) is… iii. Interpret the Pr(>|t|) column. iv. Give and fully interpret (as a proportion of variance explained) the R-squared value. v. What is the residual standard error, and what does it mean? 2e. Analyze residuals. Residuals are the error in estimating an output as a function of the input. This is often denoted as : if , then . In an “ideal” situation the follow a normal distribution and have a constant standard deviation. The reasoning is below: y = mx + b ϵ Y ≈ f(X) Y − f(X) = ϵ ϵ http://blog.yhat.com/posts/r-lm-summary.html 5/19/2019 Math 270: Lab # 5 Solution file:///C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression (2) (1).html 5/13 In order to reasonably bound the error of your predictions based on a model, it’s useful to be able to rely on the normal distribution (so that you can be 95% sure the true value is within 2 standard deviations of the predicted value). We like the property of the normal distribution where errors are most frequently closest to 0 (meaning the points are close to the predicted value). To test this assumption of normality, let’s start by examining the distribution of residuals– we’ll do this by looking at a histogram. Use hist(CarsModel$residuals) and explain your findings: do the residuals look roughly normally distributed? In order to be able to easily evaluate the accuracy of a prediction, most analyses assume the residuals are “homoscedastic” (their standard deviation does not rely on , and so the error of prediction is constant across all inputs). To see if the residuals here are homoscedastic, plot the residuals vs. the speed variable. The residuals should form a (relatively) constant band around 0– they should not balloon out as speed gets higher, for example. Decide: are the residuals homoscedastic? There are other useful plots and statistics for testing that residuals (or any data) are normally distributed. One common one is a Quantile-Quantile plot, which plots the ordered data (in this case residuals) against the theoretical quantiles for a normal distribution with the same mean and variance. Generally speaking, the 25% mark in the ordered data should match the 25% mark in the normal distribution: so if the residuals are normally distributed the QQ-plot should be a straight line. The QQ-plot is the second (of four useful plots the rest of which we won’t discuss here) available from calling plot(CarsModel) . Call this function and examine the QQ-plot: If there are any causes for concern, how do they match up with your histogram? 2f. Make predictions and evaluate the model on a new set of data. Suppose you want to test the model against another set of data. I’ll have you use this set of data: testData=data.frame(speed=c(10,15,22,31,44,45,78,82,90),dist=c(12,50,40,90,154,139,286,284,345)) Use the command predictedDist=predict.lm(CarsModel,newdata=testData,interval="prediction",level=0.9) and also output the predicted values and 90% intervals of prediction. These intervals are built using the normal distribution centered on the regression line with standard deviation equal to that of the residuals. Just as we did earlier in class, the normal distribution allows you to predict a range that contains the middle X% of values (e.g 68% of values are within one standard deviation). So these intervals should contain approximately 9 of 10 predictions. How do the fit values compare to the actual distance values of the testData set? Are the actual values at least contained in the prediction intervals? To visualize this, run the following code by setting eval=TRUE . I’m providing it to you because it’s unfortunately messy to include error bars in R. There are some packages you can install to make it easier, but alas… it’s never as easy as it should be. x 5/19/2019 Math 270: Lab # 5 Solution file:///C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression (2) (1).html 6/13 ##First we'll just plot the x values vs. their predicted values: plot(testData$speed,predictedDist[,1]) #predictedDist[,1] is the `fit' column of the prediction output. ##Next we will do the messy part: adding `error bars' based on the 90% prediction interval. I'l l use the "arrow" function which draws an arrow from (x0,y0) to (x1,y1)-- here we just make the y0 and y1 the lwr and upr bounds of the prediction interval for each x value. arrows(x0=testData$speed, y0=predictedDist[,2], x1=testData$speed, y1=predictedDist[,3], length=0.05, angle=90, code=3) #length, angle, code are all just hacky ways of getting the "arrow" to instead have a flat head. ## Finally, let's add the `actual' stopping distance of the testData in red to see how the predi ctions match up: points(testData,col="red") What conclusions do you draw? Does the model reasonably fit the new (test) data? Task 3: Understanding Linear Regression One way of thinking about models (like linear regression), is that they are functions that estimate the output ( ) values based on input data . Hence we are estimating . More precisely for each data point where is the index of the data point and is the (hopefully small) error– also called residual for that data point. With linear regression, is linear– but the accuracy of the model all has to do with the