*** FOR THE ASSIGNMENT, I JUST NEED ASSISTANCE WITH QUESTIONS 3, 4, 5. I ALREADY DID 1 AND 2 AND I CAN FINISH THE REST ***
Option #1: Linear Regression Model
In this Critical Thinking Assignment, you will install R Markdown, explore and summarize a dataset as well as create a linear regression model. Your assignment submission will be an R Markdown generated Word document.
Install R Markdown. Create a new R Markdown file by performing the following steps.
- Open R Studio
- Select File | New | R Markdown
- Use Module 3 CT Option 1 as the Title
- Use your name as the Author
- Select the Word output format
- Delete all default content after the R Setup block of code, which is all content from line 12 through the end of the file.
Explore the Boston housing data in the BostonHousing.csv file by performing the following steps.
- Apply what you learned in Modules 1 and 2 about data exploration by selecting and running appropriate data exploration functions. Run at least five functions.
- For your assignment submission, copy your commands into your R Markdown file.
- Include R comments on all your code.
- Separate sections of R code by using appropriate R Markdown headings.
- Fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM by following the process under Example: Predicting the Price of Used Toyota Corolla Cars in section 6.3.
- Use the R code example shown in Figure 6.3.
Hint: You will need to remove the categorical variable CAT..MEDV prior to fitting the multiple linear regression model.
- Create a scatter plot with the plot() function with the following attributes.
- Use MEDV as the y-axis
- Use the most significant attribute as the x-axis.
- Use the abline() function to add the fitted regression line to the scatter plot. Pass the model's intercept coefficient as the intercept (a) argument and the coefficient of the most significant attribute as the slope (b) argument.
- For your assignment submission, copy your commands into your R Markdown file.
- Include R comments on all your code.
- Separate sections of R code by using appropriate R Markdown headings.
- Use the R Markdown Knit drop-down menu to select Knit to Word to create the Word document for your assignment submission.
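A rough sketch of the exploration and modeling steps above (assuming BostonHousing.csv sits in your working directory and uses the standard column names MEDV, CAT..MEDV, CRIM, CHAS, and RM; adjust the names to match your file):

```r
# Load the Boston housing data (assumes BostonHousing.csv is in the working directory)
housing.df <- read.csv("BostonHousing.csv")

# --- Data exploration: at least five functions ---
dim(housing.df)      # number of rows and columns
head(housing.df)     # first six rows
str(housing.df)      # variable names and types
summary(housing.df)  # quartiles and means for each variable
colMeans(housing.df[sapply(housing.df, is.numeric)])  # column means

# --- Regression: MEDV as a function of CRIM, CHAS, and RM ---
# Remove the categorical variable CAT..MEDV before fitting
housing.df <- housing.df[, !(names(housing.df) %in% c("CAT..MEDV"))]
housing.lm <- lm(MEDV ~ CRIM + CHAS + RM, data = housing.df)
summary(housing.lm)

# --- Scatter plot with fitted line ---
# RM is often the most significant predictor here; check your own
# summary() output and substitute a different attribute if needed.
plot(housing.df$RM, housing.df$MEDV, xlab = "RM", ylab = "MEDV")
abline(a = coef(housing.lm)["(Intercept)"], b = coef(housing.lm)["RM"])
```

This is a sketch, not a finished submission; it still needs R Markdown headings and comments in your own words.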
Your assignment submission must be one Word document that meets the following requirements:
- Is an R Markdown generated Word document containing all R code used in this assignment, appropriate R comments on code, and appropriate R Markdown headings.
- Does not include a cover page.
- Does not include an abstract.
- Includes a one-page description of what you did and what you learned. Add this description to the end of the R Markdown document as a new page. This page must conform to APA guidelines in the CSU Global Writing Center.
**************************************************
"6.3 Estimating the Regression Equation and Prediction
Once we determine the predictors to include and their form, we estimate the coefficients of the regression formula from the data using a method called ordinary least squares (OLS). This method finds values $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ that minimize the sum of squared deviations between the actual outcome values ($Y$) and their predicted values based on that model ($\hat{Y}$).
To predict the value of the outcome variable for a record with predictor values $x_1, x_2, \ldots, x_p$, we use the equation

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \qquad (6.2)$$"
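As a small numeric illustration of this prediction equation (the coefficient and predictor values below are made up, not from any dataset in this document):

```r
# Hypothetical fitted coefficients (for illustration only)
b0 <- 2.5           # intercept
b  <- c(1.2, -0.4)  # coefficients for x1, x2
x  <- c(3, 5)       # predictor values for a new record

# Predicted outcome: yhat = b0 + b1*x1 + ... + bp*xp
yhat <- b0 + sum(b * x)
yhat  # 2.5 + 1.2*3 - 0.4*5 = 4.1
```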
"Predictions based on this equation are the best predictions possible in the sense that they will be unbiased (equal to the true values on average) and will have the smallest mean squared error compared to any unbiased estimates if we make the following assumptions:
The noise ε (or equivalently, Y) follows a normal distribution.
The choice of predictors and their form is correct (linearity).
The records are independent of each other.
The variability in the outcome values for a given set of predictors is the same regardless of the values of the predictors (homoskedasticity).
An important and interesting fact for the predictive goal is that even if we drop the first assumption and allow the noise to follow an arbitrary distribution, these estimates are very good for prediction, in the sense that among all linear models, as defined by equation (6.1), the model using the least squares estimates, $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$, will have the smallest mean squared errors. The assumption of a normal distribution is required in explanatory modeling, where it is used for constructing confidence intervals and statistical tests for the model parameters.
Even if the other assumptions are violated, it is still possible that the resulting predictions are sufficiently accurate and precise for the purpose they are intended for. The key is to evaluate predictive performance of the model, which is the main priority. Satisfying assumptions is of secondary interest and residual analysis can give clues to potential improved models to examine.
Example: Predicting the Price of Used Toyota Corolla Cars
A large Toyota car dealership offers purchasers of new Toyota cars the option to buy their used car as part of a trade-in. In particular, a new promotion promises to pay high prices for used Toyota Corolla cars for purchasers of a new car. The dealer then sells the used cars for a small profit. To ensure a reasonable profit, the dealer needs to be able to predict the price that the dealership will get for the used cars. For that reason, data were collected on all previous sales of used Toyota Corollas at the dealership. The data include the sales price and other information on the car, such as its age, mileage, fuel type, and engine size. A description of each of these variables is given in Table 6.1. A sample of this dataset is shown in Table 6.2. The total number of records in the dataset is 1000 cars (we use the first 1000 cars from the dataset ToyotaCorolla.csv). After partitioning the data into training (60%) and validation (40%) sets, we fit a multiple linear regression model between price (the outcome variable) and the other variables (as predictors) using only the training set. Table 6.3 shows the estimated coefficients. Notice that the Fuel Type predictor has three categories (Petrol, Diesel, and CNG). We therefore have two dummy variables in the model: Fuel_TypePetrol (0/1) and Fuel_TypeDiesel (0/1); the third, for CNG (0/1), is redundant given the information on the first two dummies. Including the redundant dummy would cause the regression to fail, since the redundant dummy will be a perfect linear combination of the other two; R’s “lm” routine handles this issue automatically.
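The dummy coding described above can be inspected directly. A small sketch with a made-up factor (not the actual dataset) shows how R encodes a three-level factor as two dummies plus a baseline:

```r
# A small factor with three levels, mimicking Fuel_Type
fuel <- factor(c("Petrol", "Diesel", "CNG", "Petrol"))

# model.matrix() shows the encoding lm() uses: the first level
# (CNG, alphabetically) becomes the baseline, and only two dummy
# columns are created for the remaining levels
mm <- model.matrix(~ fuel)
colnames(mm)  # "(Intercept)" "fuelDiesel" "fuelPetrol"
```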
Table 6.1 Variables in the Toyota Corolla Example

| Variable | Description |
|----------|-------------|
| Price | Offer price in Euros |
| Age | Age in months as of August 2004 |
| Kilometers | Accumulated kilometers on odometer |
| Fuel Type | Fuel type (Petrol, Diesel, CNG) |
| HP | Horsepower |
| Metallic | Metallic color? (Yes = 1, No = 0) |
| Automatic | Automatic (Yes = 1, No = 0) |
| CC | Cylinder volume in cubic centimeters |
| Doors | Number of doors |
| QuartTax | Quarterly road tax in Euros |
| Weight | Weight in kilograms |"
"Table 6.2 Prices and Attributes for Used Toyota Corolla Cars (selected rows and columns only)

| Price | Age | Kilometers | Fuel Type | HP | Metallic | Automatic | CC | Doors | Quart Tax | Weight |
|-------|-----|------------|-----------|----|----------|-----------|------|-------|-----------|--------|
| 13500 | 23 | 46986 | Diesel | 90 | 1 | 0 | 2000 | 3 | 210 | 1165 |
| 13750 | 23 | 72937 | Diesel | 90 | 1 | 0 | 2000 | 3 | 210 | 1165 |
| 13950 | 24 | 41711 | Diesel | 90 | 1 | 0 | 2000 | 3 | 210 | 1165 |
| 14950 | 26 | 48000 | Diesel | 90 | 0 | 0 | 2000 | 3 | 210 | 1165 |
| 13750 | 30 | 38500 | Diesel | 90 | 0 | 0 | 2000 | 3 | 210 | 1170 |
| 12950 | 32 | 61000 | Diesel | 90 | 0 | 0 | 2000 | 3 | 210 | 1170 |
| 16900 | 27 | 94612 | Diesel | 90 | 1 | 0 | 2000 | 3 | 210 | 1245 |
| 18600 | 30 | 75889 | Diesel | 90 | 1 | 0 | 2000 | 3 | 210 | 1245 |
| 21500 | 27 | 19700 | Petrol | 192 | 0 | 0 | 1800 | 3 | 100 | 1185 |
| 12950 | 23 | 71138 | Diesel | 69 | 0 | 0 | 1900 | 3 | 185 | 1105 |
| 20950 | 25 | 31461 | Petrol | 192 | 0 | 0 | 1800 | 3 | 100 | 1185 |
| 19950 | 22 | 43610 | Petrol | 192 | 0 | 0 | 1800 | 3 | 100 | 1185 |
| 19600 | 25 | 32189 | Petrol | 192 | 0 | 0 | 1800 | 3 | 100 | 1185 |
| 21500 | 31 | 23000 | Petrol | 192 | 1 | 0 | 1800 | 3 | 100 | 1185 |
| 22500 | 32 | 34131 | Petrol | 192 | 1 | 0 | 1800 | 3 | 100 | 1185 |
| 22000 | 28 | 18739 | Petrol | 192 | 0 | 0 | 1800 | 3 | 100 | 1185 |
| 22750 | 30 | 34000 | Petrol | 192 | 1 | 0 | 1800 | 3 | 100 | 1185 |
| 17950 | 24 | 21716 | Petrol | 110 | 1 | 0 | 1600 | 3 | 85 | 1105 |
| 16750 | 24 | 25563 | Petrol | 110 | 0 | 0 | 1600 | 3 | 19 | 1065 |
| 16950 | 30 | 64359 | Petrol | 110 | 1 | 0 | 1600 | 3 | 85 | 1105 |
| 15950 | 30 | 67660 | Petrol | 110 | 1 | 0 | 1600 | 3 | 85 | 1105 |
| 16950 | 29 | 43905 | Petrol | 110 | 0 | 1 | 1600 | 3 | 100 | 1170 |
| 15950 | 28 | 56349 | Petrol | 110 | 1 | 0 | 1600 | 3 | 85 | 1120 |
| 16950 | 28 | 32220 | Petrol | 110 | 1 | 0 | 1600 | 3 | 85 | 1120 |
| 16250 | 29 | 25813 | Petrol | 110 | 1 | 0 | 1600 | 3 | 85 | 1120 |
| 15950 | 25 | 28450 | Petrol | 110 | 1 | 0 | 1600 | 3 | 85 | 1120 |
| 17495 | 27 | 34545 | Petrol | 110 | 1 | 0 | 1600 | 3 | 85 | 1120 |
| 15750 | 29 | 41415 | Petrol | 110 | 1 | 0 | 1600 | 3 | 85 | 1120 |
| 11950 | 39 | 98823 | CNG | 110 | 1 | 0 | 1600 | 5 | 197 | 1119 |
Table 6.3 Linear regression model of price vs. car attributes
Code for fitting a regression model

car.df <- read.csv("ToyotaCorolla.csv")
# use first 1000 rows of data
car.df <- car.df[1:1000, ]
# select variables for regression
selected.var <- c(3, 4, 7, 8, 9, 10, 12, 13, 14, 17, 18)
# partition data
set.seed(1)  # set seed for reproducing the partition
train.index <- sample(c(1:1000), 600)
train.df <- car.df[train.index, selected.var]
valid.df <- car.df[-train.index, selected.var]
# use lm() to run a linear regression of Price on all the predictors in the
# training set (it will automatically turn Fuel_Type into dummies).
# use . after ~ to include all the remaining columns in train.df as predictors.
car.lm <- lm(Price ~ ., data = train.df)
# use options() to ensure numbers are not displayed in scientific notation.
options(scipen = 999)
summary(car.lm)
Partial Output
> summary(car.lm)
Call:
lm(formula = Price ~ ., data = train.df)
Residuals:
Min 1Q Median 3Q Max
-8212.5 -839.2 -14.3 831.5 7270.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1774.877829 1643.744823 -1.080 0.2807
Age_08_04 -135.430875 4.875906 -27.776 < 0.0000000000000002 ***
KM -0.019003 0.002341 -8.116 0.00000000000000283 ***
Fuel_TypeDiesel 1208.339159 534.431400 2.261 0.0241 *
Fuel_TypePetrol 2425.876714 520.587979 4.660 0.00000391697679667 ***
HP 38.985537 5.587183 6.978 0.00000000000811621 ***
Met_Color 84.792715 126.883452 0.668 0.5042
Automatic 306.684154 289.433138 1.060 0.2898
CC 0.031966 0.099075 0.323 0.7471
Doors -44.157742 64.056530 -0.689 0.4909
Quarterly_Tax 16.677343 2.602668 6.408 0.00000000030287017 ***
Weight 12.667487 1.536587 8.244 0.00000000000000109 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 1406 on 588 degrees of freedom
Multiple R-squared: 0.8567, Adjusted R-squared: 0.854
F-statistic: 319.6 on 11 and 588 DF, p-value: < 0.00000000000000022
"The regression coefficients are then used to predict prices of individual used Toyota Corolla cars based on their age, mileage, and so on. Table 6.4 shows a sample of predicted prices for 20 cars in the validation set, using the estimated model. It gives the predictions and their errors (relative to the actual prices) for these 20 cars. Below the predictions, we have overall measures of predictive accuracy. Note that the mean error (ME) is $−40 and RMSE = $1321. A histogram of the residuals (Figure 6.1) shows that most of the errors are between ± $2000. This error magnitude might be small relative to the car price, but should be taken into account when considering the profit. Another observation of interest is the large positive residuals (under-predictions), which may or may not be a concern, depending on the application. Measures such as the mean error and error percentiles are used to assess the predictive performance of a model and to compare models."
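The ME and RMSE measures discussed above can be computed directly from validation-set residuals. A self-contained sketch with synthetic data (made up for illustration, not the Toyota Corolla dataset) shows the mechanics:

```r
# Synthetic data: a linear relationship plus noise (illustration only)
set.seed(1)
df <- data.frame(x = 1:100)
df$y <- 3 + 2 * df$x + rnorm(100, sd = 5)

# 60% training / 40% validation partition, as in the example
train.index <- sample(1:100, 60)
train.df <- df[train.index, ]
valid.df <- df[-train.index, ]

# Fit on the training set, predict on the validation set
fit <- lm(y ~ x, data = train.df)
pred <- predict(fit, valid.df)

# Mean error (bias) and root mean squared error on the validation set
resid.valid <- valid.df$y - pred
ME <- mean(resid.valid)
RMSE <- sqrt(mean(resid.valid^2))
c(ME = ME, RMSE = RMSE)
```

A histogram of `resid.valid` (via `hist()`) would play the same role as Figure 6.1 for this toy model.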