Introduction


For this week’s take-home lab, you will work on the same data set used in the Week 4/5 Take-Home Labs. You will solve the same problem studied in this week’s in-class lab, but on a much larger and more interesting dataset. The file UCI_Credit_Card.csv contains 30,000 consumer records with 24 different variables. You can read a detailed description of the fields at the following website: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Several fields contain codes that the UCI description does not document. Treat every such unknown code as NA:

  • Marriage: the UCI description lists marital status as 1 = married; 2 = single; 3 = others. However, the data contain levels 0, 1, 2, 3. Treat 0 as unknown.

  • Education: the UCI description lists 1 = graduate school; 2 = university; 3 = high school; 4 = others. However, the data contain levels 1 to 6. Treat 5 and 6 as unknown.

  • X6-X11 (repayment status, the PAY_0 and PAY_2 to PAY_6 columns): the measurement scale is -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. However, many values are -2. Treat -2 as unknown as well.
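The recoding rules above can be sketched in a few lines of base R. This is only an illustration on a made-up toy data frame (the name `credit` and its values are assumptions, not the real file); the column names follow those used in the answer further down.

```r
# Toy data frame standing in for a few columns of UCI_Credit_Card.csv.
# The values are made up for illustration only.
credit <- data.frame(
  MARRIAGE  = c(1, 2, 0, 3),
  EDUCATION = c(1, 5, 6, 2),
  PAY_0     = c(-2, -1, 0, 2),
  PAY_2     = c(0, -2, 1, -1)
)

# Recode the "unknown" codes as NA.
credit$MARRIAGE[credit$MARRIAGE == 0] <- NA
credit$EDUCATION[credit$EDUCATION %in% c(5, 6)] <- NA

# Apply the -2 rule to every repayment-status column in one pass.
# The pattern matches PAY_0, PAY_2, ... but not PAY_AMT1, PAY_AMT2, ...
pay_cols <- grep("^PAY_[0-9]+$", names(credit), value = TRUE)
credit[pay_cols] <- lapply(credit[pay_cols],
                           function(x) replace(x, x == -2, NA))

# Number of NA values introduced per column.
na_counts <- colSums(is.na(credit))
print(na_counts)
```

Selecting the repayment columns by pattern avoids repeating the same statement six times, at the cost of relying on the column naming convention.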


Your task is to build the best possible model for predicting whether or not a consumer will default on their credit card payment for the next month (the last column in the dataset).


Assignment


Perform the following tasks:




  • Conduct a training/test split of the data, holding out 20% as a test dataset.




  • Fit the best RF model you can (consider feature selection etc.) to the data to predict consumer default.




  • Then plot ROC curves for the logistic regression, SVM, KNN, CART, and RF models, and compare their performance.




  • Compute the AUC for the logistic regression, SVM, KNN, CART, and RF models, and compare their performance.




  • Provide a summary and discussion of your work in written form (.docx or .pdf) that includes the following:




    • Q1 Summarize the model/feature selection process you used to fit your RF model




    • Q2 Provide a summary of the fitted RF model (i.e., model summary).




    • Q3 Provide performance evaluation of the fitted RF model using confusion matrix.




    • Q4 How well do you think the RF model fitted to this dataset works?




    • Q5 Using ROC curves and AUC, which of the logistic regression, SVM, KNN, CART, and RF models works best on the dataset overall?
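The ROC and AUC comparison asked for above can be illustrated without any extra packages. The following is a minimal sketch in base R, assuming each model yields a numeric score per test row (higher means more likely to default). The helper names `auc_rank` and `roc_points`, and the scores for the two hypothetical models, are made up for illustration; in practice a package such as pROC (loaded in the answer below) would typically be used instead.

```r
# AUC via the rank (Mann-Whitney) formulation: the probability that a
# randomly chosen positive gets a higher score than a random negative.
auc_rank <- function(score, label) {
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(rank(score)[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# ROC points obtained by lowering the threshold one observation at a time.
# Assumes there are no tied scores.
roc_points <- function(score, label) {
  ord <- order(score, decreasing = TRUE)
  data.frame(
    fpr = c(0, cumsum(label[ord] == 0) / sum(label == 0)),
    tpr = c(0, cumsum(label[ord] == 1) / sum(label == 1))
  )
}

# Made-up labels and scores for two hypothetical models.
label  <- c(0, 0, 1, 1)
scores <- list(
  model_a = c(0.10, 0.40, 0.35, 0.80),
  model_b = c(0.20, 0.30, 0.60, 0.90)
)

aucs <- sapply(scores, auc_rank, label = label)
roc_a <- roc_points(scores$model_a, label)
print(aucs)
```

To draw a curve, `plot(roc_a$fpr, roc_a$tpr, type = "l")` followed by `lines()` for each further model puts all curves on one set of axes, which makes the comparison in Q5 easier to read.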






Submission Instructions


For this weekly lab assignment, you should submit:




  • An R script file (or Rmd file)




  • A written summary/discussion of your work (as discussed above) in .docx or .pdf format.



Answered 1 day after Feb 20, 2022


Mohd answered on Feb 22 2022
2/21/2022
Loading Packages
library(dplyr)
library(caret)
library(MASS)
library(e1071)
library(magrittr)
library(rmarkdown)
library(readxl)
library(pROC)
ucicreditcard <- read_excel("~/ucicreditcard.xlsx")
#View(ucicreditcard)
ucicreditcard$default_payment<-ucicreditcard$`default payment next month`
ucicreditcard<-ucicreditcard[,-25]
#Assigning values to NA
ucicreditcard$MARRIAGE<-replace(ucicreditcard$MARRIAGE,ucicreditcard$MARRIAGE==0,NA)
ucicreditcard%>%
count(EDUCATION)
## # A tibble: 7 x 2
## EDUCATION n
##
## 1 0 14
## 2 1 10585
## 3 2 14030
## 4 3 4917
## 5 4 123
## 6 5 280
## 7 6 51
ucicreditcard$EDUCATION<-replace(ucicreditcard$EDUCATION,ucicreditcard$EDUCATION==6,NA)
ucicreditcard$EDUCATION<-replace(ucicreditcard$EDUCATION,ucicreditcard$EDUCATION==5,NA)
ucicreditcard%>%
count(PAY_0)
## # A tibble: 11 x 2
## PAY_0 n
##
## 1 -2 2759
## 2 -1 5686
## 3 0 14737
## 4 1 3688
## 5 2 2667
## 6 3 322
## 7 4 76
## 8 5 26
## 9 6 11
## 10 7 9
## 11 8 19
Checking Null Values
sum(is.na(ucicreditcard$PAY_0))
## [1] 0
ucicreditcard$PAY_0<-replace(ucicreditcard$PAY_0,ucicreditcard$PAY_0==-2,NA)
ucicreditcard$PAY_2<-replace(ucicreditcard$PAY_2,ucicreditcard$PAY_2==-2,NA)
ucicreditcard$PAY_3<-replace(ucicreditcard$PAY_3,ucicreditcard$PAY_3==-2,NA)
ucicreditcard$PAY_4<-replace(ucicreditcard$PAY_4,ucicreditcard$PAY_4==-2,NA)
ucicreditcard$PAY_5<-replace(ucicreditcard$PAY_5,ucicreditcard$PAY_5==-2,NA)
ucicreditcard$PAY_6<-replace(ucicreditcard$PAY_6,ucicreditcard$PAY_6==-2,NA)
ucicreditcard%>%
count(PAY_0)
## # A tibble: 11 x 2
## PAY_0 n
##
## 1 -1 5686
## 2 0 14737
## 3 1 3688
## 4 2 2667
## 5 3 322
## 6 4 76
## 7 5 26
## 8 6 11
## 9 7 9
## 10 8 19
## 11 NA 2759
sum(is.na(ucicreditcard$PAY_0))
## [1] 2759
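Because the next step drops every row that contains any NA, it can be worth checking how many rows would survive before calling na.omit. A small sketch on made-up data (the data frame `toy` is an assumption for illustration):

```r
# Made-up toy data; NA marks the recoded "unknown" values.
toy <- data.frame(
  MARRIAGE = c(1, NA, 2, 1, 3),
  PAY_0    = c(NA, 0, 1, -1, NA),
  PAY_2    = c(0, 0, NA, -1, 2)
)

# Missing values per column.
na_per_column <- colSums(is.na(toy))

# Rows with no missing value in any column: these are what na.omit keeps.
rows_kept    <- sum(complete.cases(toy))
rows_dropped <- nrow(toy) - rows_kept

print(na_per_column)
print(c(kept = rows_kept, dropped = rows_dropped))
```

When missing values are spread across several columns, the number of dropped rows can be much larger than any single column's NA count suggests.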
Training/test partition of the dataset
#removing NA
ucicreditcard<-na.omit(ucicreditcard)
set.seed(549)
ucicreditcard<-ucicreditcard[,2:25]
inp <- sample(2, nrow(ucicreditcard), replace = TRUE, prob = c(0.8, 0.2))
training_data <- ucicreditcard[inp==1, ]
test_data <- ucicreditcard[inp==2, ]
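Note that sample(2, ..., prob = c(0.8, 0.2)) assigns each row independently, so the split above is only approximately 80/20. If an exact 20% hold-out with the same default rate in both parts is wanted, one option is a stratified split. The sketch below is an alternative, not the method used in this answer, and runs on synthetic data (`toy` and its columns are made up):

```r
set.seed(549)

# Synthetic data with roughly 22% positives (made up for illustration).
toy <- data.frame(
  x               = rnorm(1000),
  default_payment = rbinom(1000, 1, 0.22)
)

# Within each class, draw 20% of the row indices for the test set.
idx_by_class <- split(seq_len(nrow(toy)), toy$default_payment)
test_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.2 * length(idx)))]
}))

test_data     <- toy[test_idx, ]
training_data <- toy[-test_idx, ]

# The class proportions should be close in both parts.
print(mean(training_data$default_payment))
print(mean(test_data$default_payment))
```

Stratifying matters most when the positive class is small, since a purely random split can leave the test set with noticeably more or fewer defaults than the training set.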
Fitting the best KNN model and CART model
train.dep<-training_data$default_payment
test.dep<-test_data$default_payment
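# Predictors: every column except the target, selected by name
train.indep<-training_data[,setdiff(names(training_data),"default_payment")]
test.indep<-test_data[,setdiff(names(test_data),"default_payment")]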
Loading packages
library(class)
library(rpart)
library(rpart.plot)
library(gridExtra)
library(ISLR)
KNN Model with Summary
knn.1<-knn(train.indep,test.indep,train.dep,k=1)
knn.5<-knn(train.indep,test.indep,train.dep,k=5)
knn.15<-knn(train.indep,test.indep,train.dep,k=15)
Accuracy at different K
sum(test.dep==knn.1)/length(test.dep)
## [1] 0.6833814
sum(test.dep==knn.5)/length(test.dep)
## [1] 0.7503883
sum(test.dep==knn.15)/length(test.dep)
## [1] 0.770801
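Accuracy alone can hide how a classifier treats the minority class, so a confusion matrix is a useful companion to the figures above. A minimal sketch in base R, using made-up vectors `actual` and `predicted` in place of test.dep and a prediction such as knn.15:

```r
# Made-up labels and predictions standing in for test.dep and knn.15.
actual    <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1), levels = c(0, 1))

# Rows are predictions, columns are the true classes.
cm <- table(Predicted = predicted, Actual = actual)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # share of actual defaults caught
specificity <- cm["0", "0"] / sum(cm[, "0"])  # share of non-defaults kept

print(cm)
print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity))
```

In this toy case the accuracy is 0.7 while only half of the actual defaults are detected. That gap is typical of imbalanced data and is the reason Q3 asks for the confusion matrix. The caret package loaded earlier also provides confusionMatrix() for the same purpose.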
Hyperparameter tuning of the KNN model for better accuracy
class(train.dep)
## [1] "numeric"
train.dep<-as.factor(train.dep)
# 10-fold cross-validation over k = 1..40; the fold count is set in tune.control()
knn_cross<-tune.knn(x=train.indep,y=train.dep,k=1:40,tunecontrol=tune.control(sampling="cross",cross=10))
summary(knn_cross)
##
## Parameter tuning of 'knn.wrapper':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## k
## 34
##
## - best performance: 0.2269582
##
## - Detailed performance results:
## k error dispersion
## 1 1 0.3064167 0.010980007
## 2 2 0.3087763 0.011192464
## 3 3 0.2673313 0.010152256
## 4 4 0.2667953 0.012618763
## 5 5 0.2504415 0.010790730
## 6 6 0.2504416 0.010458181
## 7 7 0.2412733 0.008250576
## 8 8 0.2402546 0.009672923
## 9 9 0.2386459 0.010230342
## 10 10 0.2373058 0.008317957
## 11 11 0.2357510 0.007349066
## 12 12 0.2351612 0.007564908
## 13 13 0.2318371 0.008142404
## 14 14 0.2326413...
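The table above reports a cross-validated error for each candidate k. To make the mechanics explicit, here is a minimal hand-written sketch of the same idea. It is not e1071's tune.knn implementation: it uses synthetic data, 5 folds, and a small grid, all chosen for illustration. It relies only on class::knn, from the recommended package class that is bundled with most R distributions.

```r
library(class)  # provides knn()

set.seed(549)

# Synthetic two-class data (made up), with standardized predictors.
n <- 200
x <- scale(matrix(rnorm(n * 2), ncol = 2))
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, 1, 0))

# Assign each row to one of 5 folds of equal size.
n_folds <- 5
folds   <- sample(rep(seq_len(n_folds), length.out = n))
k_grid  <- c(1, 5, 15)

# For each k, average the misclassification rate over the held-out folds.
cv_error <- sapply(k_grid, function(k) {
  mean(sapply(seq_len(n_folds), function(f) {
    pred <- knn(train = x[folds != f, ], test = x[folds == f, ],
                cl = y[folds != f], k = k)
    mean(pred != y[folds == f])
  }))
})

best_k <- k_grid[which.min(cv_error)]
print(data.frame(k = k_grid, error = cv_error))
```

The predictors are standardized here because KNN is distance based, so variables on large scales (such as bill amounts) would otherwise dominate the neighbour search.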