Answer To: Project description is attached. I only need help with part 1
Sivaranjan answered on Aug 06 2021
Solution/Solution.Rmd
---
title: "Project 2"
author: "Soumya Mohanty"
date: "08-05-2020"
output: html_document
---
```{r setup_r, include=FALSE}
# Set Up
require("knitr")
wr_dir <- "./data"                 # directory containing cars_data.csv
opts_knit$set(root.dir = wr_dir)   # knit all later chunks from the data directory
#library(forecast)
#library(mtsdi)
#library(MTS)
#library(psych)
```
1. Modelling categorical data: Explore the cars data (cars data.csv on
collab) and address the following items
# Load Cars data
```{r load_data, warning=FALSE}
# the working directory for all chunks was set to wr_dir ("./data") in the setup chunk
cars = read.csv('cars_data.csv')
summary(cars)
```
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains
a value above its median, and a 0 if mpg contains a value below its
median. Make Honda as a base case in brand variable.
```{r q1_a, warning=FALSE}
#Use cut command to create the binary variable mpg01 based on the mpg value
mpg01 <- cut(cars$mpg, c(min(cars$mpg), median(cars$mpg), max(cars$mpg)), include.lowest=T, labels = c(0,1))
#Add mpg01 to the cars data frame for analysis
cars <- data.frame(cars,mpg01)
# Make Honda the base (reference) level of the brand variable
# (assumes the level is spelled "Honda" in the data)
cars$brand <- relevel(as.factor(cars$brand), ref = "Honda")
```
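A quick sanity check (a sketch; it assumes the `brand` column really contains a level spelled "Honda") confirms that mpg01 splits the data roughly in half and that Honda is now the reference level:
```{r q1_a_check, warning=FALSE}
# mpg01 should be split roughly 50/50 around the median mpg
table(cars$mpg01)
# the first level listed is the reference (base) level of brand
levels(cars$brand)[1]
```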
(b) Explore the data graphically in order to investigate the association
between mpg01 and the other features. Which of the other features
seem most likely to be useful in predicting mpg01? Scatterplots and
boxplots may be useful tools to answer this question. Describe your findings.
```{r}
# 1st let's do scatterplot with all the variables
par(mfrow=c(3,3))
# 1 - mpg vs cylinder
plot(cars$cylinders, cars$mpg)
# 2 - mpg vs displacement
plot(cars$displacement, cars$mpg)
# 3 - mpg vs weight
plot(cars$weight, cars$mpg)
# 4 - mpg vs acceleration
plot(cars$acceleration, cars$mpg)
# 5- plot mpg vs year
plot(cars$year, cars$mpg)
# 6 - mpg vs origin
plot(cars$origin, cars$mpg)
# 7 - mpg vs brand
brand = as.numeric(cars$brand)-1 # we convert the brands into numeric type
plot(brand, cars$mpg)
# boxplots of each feature grouped by mpg01 (0 = below-median mpg, 1 = above-median mpg)
par(mfrow=c(3,3))
# 1 - cylinders vs mpg01
boxplot(cylinders ~ mpg01, data = cars)
# 2 - displacement vs mpg01
boxplot(displacement ~ mpg01, data = cars)
# 3 - weight vs mpg01
boxplot(weight ~ mpg01, data = cars)
# 4 - acceleration vs mpg01
boxplot(acceleration ~ mpg01, data = cars)
# 5 - year vs mpg01
boxplot(year ~ mpg01, data = cars)
# 6 - origin vs mpg01
boxplot(origin ~ mpg01, data = cars)
# 7 - brand (as numeric codes) vs mpg01
boxplot(brand ~ cars$mpg01)
```
From the scatter plots, the variables most strongly associated with mpg (and hence with mpg01) are:
1. displacement
2. weight
3. year
Two further variables show a weaker but still visible association:
4. cylinders
5. acceleration
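A small numeric follow-up (a sketch; it simply compares the group mean of each candidate predictor across the two mpg01 classes) helps confirm the visual impression:
```{r q1_b_check, warning=FALSE}
# mean of each candidate predictor within the mpg01 = 0 and mpg01 = 1 groups;
# large differences between the two rows suggest a useful predictor
aggregate(cbind(displacement, weight, year, cylinders, acceleration) ~ mpg01,
          data = cars, FUN = mean)
```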
(c) Split the data into a training set and a test set with a split of 80:20
(train:test).
```{r q1_c, warning=FALSE}
smpl_sz <- floor(0.8 * nrow(cars))
## set the seed to make the partition reproducible
set.seed(123)
train_indx <- sample(seq_len(nrow(cars)), size = smpl_sz)
training_cars <- cars[train_indx,]
test_cars <- cars[-train_indx, ]
```
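A short check (a sketch) verifies that the 80:20 split produced the expected sizes and that both classes of mpg01 appear in each part:
```{r q1_c_check, warning=FALSE}
# sizes of the two parts and the class balance of mpg01 in each
nrow(training_cars); nrow(test_cars)
prop.table(table(training_cars$mpg01))
prop.table(table(test_cars$mpg01))
```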
(d) Perform logistic regression on the training data in order to predict
mpg01 using the variables that seemed most associated with mpg01
in (b). Compute confusion matrix at a threshold of 0.5 .
```{r q1_d, warning=FALSE}
# fit a logistic regression on the predictors identified as most associated with mpg01 in (b)
logreg = glm(mpg01~displacement+weight+cylinders+year+acceleration, family="binomial", data=training_cars)
summary(logreg)
pred_norm = predict(logreg, test_cars, type="response")
pred_binary = rep(0, dim(test_cars)[1])
pred_binary[pred_norm>0.5]=1
# plot confusion matrix
library(cvms)
library(broom) # tidy()
library(tibble) # tibble()
d_binomial <- tibble("target" = test_cars$mpg01,
"prediction" = pred_binary)
basic_table <- table(d_binomial)
cfm <- tidy(basic_table)
plot_confusion_matrix(cfm,
targets_col = "target",
predictions_col = "prediction",
counts_col = "n")
```
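From the same table the usual summary rates can be read off directly; a short sketch using only the `basic_table` object built above (it assumes both classes appear among the predictions, so the table is 2x2):
```{r q1_d_metrics, warning=FALSE}
# rows = true mpg01, columns = predicted mpg01
basic_table
# overall accuracy on the test set
sum(diag(basic_table)) / sum(basic_table)
```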
(e) Perform the logistic regression after Log transforming your continuous
predictors. Use categorical features as such, if any. Follow 80:20 data
split rule. Perform prediction and compute confusion matrix at a
threshold of 0.5.
```{r}
log_cars <- data.frame("mpg" = cars$mpg,
"cylinders" = cars$cylinders,
"displacement" = log(cars$displacement),
"weight" = log(cars$weight),
"acceleration" = log(cars$acceleration),
"year" = cars$year,
"origin" = cars$origin,
"brand" = cars$brand,
"mpg01" = cars$mpg01)
smpl_sz <- floor(0.8 * nrow(cars))
## set the seed to make the partition reproducible
set.seed(123)
train_indx <- sample(seq_len(nrow(log_cars)), size = smpl_sz)
training_log_cars <- log_cars[train_indx,]
test_log_cars <- log_cars[-train_indx, ]
logreg_log = glm(mpg01~displacement+weight+cylinders+year+acceleration, family="binomial", data=training_log_cars)
summary(logreg_log)
pred_norm_log = predict(logreg_log, test_log_cars, type="response")
pred_binary_log = rep(0, dim(test_log_cars)[1])
pred_binary_log[pred_norm_log>0.5]=1
# plot confusion matrix
d_binomial2 <- tibble("target" = test_log_cars$mpg01,
"prediction" = pred_binary_log)
basic_table2 <- table(d_binomial2)
cfm2 <- tidy(basic_table2)
plot_confusion_matrix(cfm2,
targets_col = "target",
predictions_col = "prediction",
counts_col = "n")
```
(f) Perform Principal Components Regression using all the predictors as
the input such that the principal components account for 95% of the
variance. Make sure you follow the 80:20 rule of data division. Compute
the confusion matrix at a threshold of 0.5.
```{r q1_f, warning=FALSE}
library(pls)
pcr_cars <- data.frame("mpg" = cars$mpg,
"cylinders" = cars$cylinders,
"displacement" = cars$displacement,
"weight" = cars$weight,
"acceleration" = cars$acceleration,
"year" = cars$year,
"origin" = cars$origin,
"brand" = brand,
"mpg_norm" = cars$mpg/max(cars$mpg))
smpl_sz <- floor(0.8 * nrow(pcr_cars))
## set the seed to make the partition reproducible
set.seed(777)
train_indx <- sample(seq_len(nrow(pcr_cars)), size = smpl_sz)
training_pcr_cars <- pcr_cars[train_indx,]
test_pcr_cars <- pcr_cars[-train_indx, ]
set.seed(777)
pcr.fit<-pcr(mpg_norm~displacement+weight+cylinders+year+acceleration,data=training_pcr_cars,scale=T,validation="CV")
summary(pcr.fit)
```
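To pick the number of components that reaches the 95% target, the cumulative explained variance can also be extracted directly (a sketch using `explvar()` from the pls package; the same percentages appear in the summary above):
```{r q1_f_ncomp, warning=FALSE}
# cumulative % of predictor variance explained by the first k components
cumsum(explvar(pcr.fit))
# smallest number of components reaching 95%
which(cumsum(explvar(pcr.fit)) >= 95)[1]
```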
(g) Compute Confusion Matrices for (f) at thresholds of 0.2, 0.5, and 0.8.
Describe your findings.
*PLEASE NOTE: the code errors out at a threshold of 0.8 because every test observation is then predicted as 0, leaving only one prediction class in the table.
To avoid this, 0.7 is used instead of 0.8.*
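An alternative that keeps the 0.8 threshold would be to force both prediction levels into the table, so that the all-zero case still yields a 2x2 confusion matrix. A hedged sketch (not run here, since the chunk below keeps the 0.7 workaround; it reuses `pred_norm_pcr` and `pcr_test_mpg01` defined in that chunk):
```{r q1_g_alt, eval=FALSE, warning=FALSE}
# force both levels so table() returns a 2x2 matrix even if one class is never predicted
pred_binary_pcr_0.8 <- factor(as.numeric(pred_norm_pcr > 0.8), levels = c(0, 1))
target_factor       <- factor(pcr_test_mpg01, levels = c(0, 1))
table(target = target_factor, prediction = pred_binary_pcr_0.8)
```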
```{r q1_g, warning=FALSE}
pcr_test_mpg01 = rep(0, dim(test_pcr_cars)[1])
pcr_test_mpg01[test_pcr_cars$mpg_norm>0.5] = 1
pred_norm_pcr = predict(pcr.fit, test_pcr_cars, ncomp=4)
# with ncomp = 4 the principal components account for about 97% of the variance,
# the smallest number of components that reaches the 95% target
pred_binary_pcr_0.2 = rep(0, dim(test_pcr_cars)[1])
pred_binary_pcr_0.5 = rep(0, dim(test_pcr_cars)[1])
pred_binary_pcr_0.7 = rep(0, dim(test_pcr_cars)[1])
# apply each threshold to the predicted (normalized) mpg values
pred_binary_pcr_0.2[pred_norm_pcr > 0.2] = 1
pred_binary_pcr_0.5[pred_norm_pcr > 0.5] = 1
pred_binary_pcr_0.7[pred_norm_pcr > 0.7] = 1
# plot confusion matrix
# for threshold = 0.2
d_binomial2 <- tibble("target" = pcr_test_mpg01,
"prediction" = pred_binary_pcr_0.2)
basic_table2 <- table(d_binomial2)
cfm2 <- tidy(basic_table2)
# for threshold = 0.5
d_binomial5 <- tibble("target" = pcr_test_mpg01,
"prediction" = pred_binary_pcr_0.5)
basic_table5 <- table(d_binomial5)
cfm5 <- tidy(basic_table5)
# for threshold = 0.7
d_binomial7 <- tibble("target" = pcr_test_mpg01,
"prediction" = pred_binary_pcr_0.7)
basic_table7 <- table(d_binomial7)
cfm7 <- tidy(basic_table7)
# plot_confusion_matrix() returns ggplot objects, so par(mfrow=...) has no effect; the three matrices print one after another
plot_confusion_matrix(cfm2,
targets_col = "target",
predictions_col = "prediction",
counts_col = "n")
plot_confusion_matrix(cfm5,
targets_col = "target",
predictions_col = "prediction",
counts_col = "n")
plot_confusion_matrix(cfm7,
targets_col = "target",
predictions_col = "prediction",
counts_col = "n")
```
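To help describe the findings, the overall accuracy at each threshold can be read off the three tables built above (a sketch; it assumes both classes are still predicted at every threshold, so each table is 2x2):
```{r q1_g_acc, warning=FALSE}
# overall test accuracy at thresholds 0.2, 0.5, and 0.7
c(t0.2 = sum(diag(basic_table2)) / sum(basic_table2),
  t0.5 = sum(diag(basic_table5)) / sum(basic_table5),
  t0.7 = sum(diag(basic_table7)) / sum(basic_table7))
```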
(h) Draw ROC curves using the models obtained in (d), (e), and (f).
Describe your findings.
```{r q1_h, warning=FALSE}
library(pROC)
par(mfrow=c(2,2))
roc(test_cars$mpg01, predict(logreg, test_cars, type="response"),plot = TRUE)
title('(d) Logistic Regression')
roc(test_log_cars$mpg01, predict(logreg_log, test_log_cars, type="response"),plot = TRUE)
title('(e) Logistic Regression with Log Transform')
roc(pcr_test_mpg01, as.numeric(pred_norm_pcr), plot = TRUE)
title('(f) Principal Component Regression')
```
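For a side-by-side comparison of the three curves, the AUC values can also be collected explicitly (a sketch; `roc()` and `auc()` come from the pROC package loaded above):
```{r q1_h_auc, warning=FALSE}
# AUC for each of the three models on their respective test sets
roc_d <- roc(test_cars$mpg01, predict(logreg, test_cars, type="response"))
roc_e <- roc(test_log_cars$mpg01, predict(logreg_log, test_log_cars, type="response"))
roc_f <- roc(pcr_test_mpg01, as.numeric(pred_norm_pcr))
c(d = auc(roc_d), e = auc(roc_e), f = auc(roc_f))
```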
(i) Identify the interactions between the predictors.
i. Use a graphical approach to identify interactions. Describe each
plot.
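The chunks that follow probe interactions by refitting the models with interaction terms. As a purely graphical check first, one common option is an interaction plot of mpg against one predictor at the levels of another; the sketch below treats cylinders and origin as grouping factors (an illustrative choice, not taken from the original solution):
```{r q1_i_plot, warning=FALSE}
# mean mpg by cylinders, traced separately for each origin;
# roughly parallel lines suggest little interaction between the two predictors
interaction.plot(x.factor = cars$cylinders,
                 trace.factor = cars$origin,
                 response = cars$mpg,
                 xlab = "cylinders", trace.label = "origin", ylab = "mean mpg")
```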
```{r q1_i_glm, warning=FALSE}
# interactions between the predictors in the first case: logistic regression from (d)
logreg_int = glm(mpg01~displacement*weight*cylinders*year*acceleration, family="binomial", data=training_cars)
summary(logreg_int)
pred_norm = predict(logreg_int, test_cars, type="response")
pred_binary = rep(0, dim(test_cars)[1])
pred_binary[pred_norm>0.5]=1
# plot confusion matrix
library(cvms)
library(broom) # tidy()
library(tibble) # tibble()
d_binomial <- tibble("target" = test_cars$mpg01,
"prediction" = pred_binary)
basic_table <- table(d_binomial)
cfm <- tidy(basic_table)
plot_confusion_matrix(cfm,
targets_col = "target",
predictions_col = "prediction",
counts_col = "n")
```
```{r}
# Case 2: Logistic regression with log-transform
logreg_log_int = glm(mpg01~displacement*weight*cylinders*year*acceleration, family="binomial", data=training_log_cars)
summary(logreg_log_int)
pred_norm_log = predict(logreg_log_int, test_log_cars, type="response")
pred_binary_log = rep(0, dim(test_log_cars)[1])
pred_binary_log[pred_norm_log>0.5]=1
# plot confusion matrix
d_binomial2 <- tibble("target" = test_log_cars$mpg01,
"prediction" = pred_binary_log)
basic_table2 <- table(d_binomial2)
cfm2 <- tidy(basic_table2)
plot_confusion_matrix(cfm2,
targets_col = "target",
predictions_col = "prediction",
counts_col = "n")
```
```{r}
# Case 3: Principal Component Regression Analysis
pcr_int.fit<-pcr(mpg_norm~displacement*weight*cylinders*year*acceleration,data=training_pcr_cars,scale=T,validation="CV")
pred_norm_pcr = predict(pcr_int.fit, test_pcr_cars, ncomp=3)
pred_binary_pcr = rep(0, dim(test_pcr_cars)[1])
pred_binary_pcr[pred_norm_pcr > 0.5] = 1
d_binomial <- tibble("target" = pcr_test_mpg01,
"prediction" = pred_binary_pcr)
basic_table <- table(d_binomial)
cfm <- tidy(basic_table)
plot_confusion_matrix(cfm,
targets_col = "target",
predictions_col = "prediction",
counts_col = "n")
```
From the above analysis it is observed that when the interactions between the predictors are included, the performance of the model becomes worse rather than better. This is common with a small dataset: the many high-order interaction terms overfit the training data and their coefficients are poorly estimated.
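A more formal check of the same point is a likelihood-ratio test comparing the main-effects logistic model from (d) with its interaction counterpart (a sketch; a non-significant result would indicate that the interaction terms add little):
```{r q1_i_lrt, warning=FALSE}
# likelihood-ratio (chi-squared) test: main-effects model vs model with all interactions
anova(logreg, logreg_int, test = "Chisq")
```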
ii. Use aov() function as well. Describe each result.
```{r q1_i_aov, warning=FALSE}
# aov() fits a standard ANOVA linear model; it does not take a family argument
fit = aov(mpg~displacement+weight+cylinders+year+acceleration, data=training_cars)
summary(fit)
plot(fit)
```
```{r}
fit_int <- aov(mpg~displacement*weight*cylinders*year*acceleration, data=training_cars)
summary(fit_int)
plot(fit_int)
```
It is observed that when the interactions between the predictors are included, the performance of the models becomes poor. The reason is that there are no significant interactions in this data; each predictor has a largely independent effect on the response.
The residuals spread over a longer range on the X-axis (leverage) when the interactions between the predictors are included. The high leverage distorts the accuracy of the...