See attached file

HOMEWORK

Homework: This homework will help you understand why data balancing matters for predictive modeling and its applications. Please submit your answers to each question in a Word document, with explanations, together with the supporting R script file you used for this assignment.

1. Install and load the package ROSE in RStudio. The package comes with a built-in imbalanced data set named “hacide”. It comprises two files: hacide.train and hacide.test. Load these data files into your working environment with the following code.

install.packages("ROSE")
library(ROSE)
data(hacide)
str(hacide.train)

As you can see, the training data set contains 1000 observations of 3 variables. “cls” is the binary response variable; x1 and x2 are the independent (predictor) variables. Visualize the distribution of all three variables using an appropriate visualization tool (histogram, etc.).

2. What is the imbalance severity in this data set?

3. Build a decision tree model D1 to predict “cls” from all other variables in the data set. Use the hacide.test data to assess the accuracy, F1 score, precision and recall of model D1. Hint: D1 = rpart(cls ~ ., data = hacide.train)

4. Use over-sampling with the ovun.sample() function of the ROSE package to balance the data. What is the total number of observations in the new data set?

5. Build a decision tree model D2 to predict “cls”, again from all other variables, using the balanced data set. Use the hacide.test data to assess the accuracy, F1 score, precision and recall of model D2.

6. What are the AUC statistics of both models? Plot the ROC curves for both models.

homework-lcoevszz.docx

Answered 1 day after Oct 06, 2021

Charulata Anil answered on Oct 07 2021

HOMEWORK
#install packages
> install.packages("ROSE")
> library(ROSE)
The ROSE package comes with a built-in imbalanced data set named hacide. It comprises two files: hacide.train and hacide.test. Let’s load it into the R environment:
> data(hacide)
> str(hacide.train)
'data.frame': 1000 obs. of 3 variables:
 $ cls: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ x1 : num 0.2008 0.0166 0.2287 0.1264 0.6008 ...
 $ x2 : num 0.678 1.5766 -0.5595 -0.0938 -0.2984 ...
> hist(hacide.train$x1)
> hist(hacide.train$x2)
As you can see, the data set contains 1000 observations of 3 variables. cls is the response variable; x1 and x2 are the independent (predictor) variables. Let’s check the severity of imbalance in this data set:
#check table
> table(hacide.train$cls)
  0   1
980  20
#check classes distribution
> prop.table(table(hacide.train$cls))
   0    1
0.98 0.02
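The counts above can also be turned into an imbalance ratio and a quick sanity check. The following is a minimal base-R sketch, not part of the original answer: the "always predict 0" baseline and the helper `class_metrics()` are hypothetical names introduced here for illustration, and the counts are copied from the `table()` output rather than read from the data.

```r
# Class counts copied from table(hacide.train$cls) above
counts <- c(negative = 980, positive = 20)

# Imbalance ratio: majority cases per minority case (980 / 20 = 49, i.e. 49:1)
imbalance_ratio <- max(counts) / min(counts)

# Hypothetical baseline: a "model" that predicts class "0" for every row
actual    <- factor(rep(c("0", "1"), times = counts), levels = c("0", "1"))
predicted <- factor(rep("0", length(actual)), levels = c("0", "1"))

# Hypothetical helper: accuracy, precision, recall and F1 for the positive class
class_metrics <- function(actual, predicted, positive = "1") {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  tn <- sum(predicted != positive & actual != positive)

  accuracy  <- (tp + tn) / length(actual)
  # Precision is undefined when nothing is predicted positive
  precision <- if (tp + fp == 0) NA_real_ else tp / (tp + fp)
  recall    <- if (tp + fn == 0) NA_real_ else tp / (tp + fn)
  f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }
  c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
}

baseline <- class_metrics(actual, predicted)
print(imbalance_ratio)
print(baseline)
```

Under these assumptions the baseline reaches 98% accuracy while its recall on the positive class is 0, which is why accuracy alone is a poor measure on data this skewed.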
As we can see, this data set contains only 2% positive cases and 98% negative cases. This is a severely imbalanced data set. So, how badly can this affect our prediction accuracy?...
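One way to approach the remaining questions (D1, over-sampling, D2, and the ROC comparison) is sketched below. This is an illustration under stated assumptions, not the original answer's code: it assumes the ROSE and rpart packages are installed, uses a 0.5 cutoff for accuracy, picks `N` as twice the majority count so the classes come out equal, and fixes `seed = 1` only for reproducibility. The variable names (`prob1`, `balanced`, and so on) are illustrative.

```r
library(ROSE)   # hacide data, ovun.sample(), accuracy.meas(), roc.curve()
library(rpart)  # rpart() decision trees

data(hacide)

# D1: decision tree trained on the original, imbalanced training data
D1 <- rpart(cls ~ ., data = hacide.train)
prob1 <- predict(D1, newdata = hacide.test)[, "1"]  # P(cls = 1) per test row

# Over-sample the minority class until the total reaches N rows.
# Assumption: N = 2 * majority count, which yields equal class sizes.
n_major <- max(table(hacide.train$cls))
balanced <- ovun.sample(cls ~ ., data = hacide.train,
                        method = "over", N = 2 * n_major, seed = 1)$data
print(table(balanced$cls))
print(nrow(balanced))

# D2: same model specification, trained on the balanced data
D2 <- rpart(cls ~ ., data = balanced)
prob2 <- predict(D2, newdata = hacide.test)[, "1"]

# Precision, recall and F at the default 0.5 threshold
print(accuracy.meas(hacide.test$cls, prob1))
print(accuracy.meas(hacide.test$cls, prob2))

# Accuracy at the same 0.5 cutoff (assumed cutoff, for illustration)
acc1 <- mean((prob1 > 0.5) == (hacide.test$cls == "1"))
acc2 <- mean((prob2 > 0.5) == (hacide.test$cls == "1"))
print(c(D1 = acc1, D2 = acc2))

# ROC curves on one plot; roc.curve() also returns the AUC
roc1 <- roc.curve(hacide.test$cls, prob1)
roc2 <- roc.curve(hacide.test$cls, prob2, add.roc = TRUE, col = 2)
print(c(D1 = roc1$auc, D2 = roc2$auc))
```

Note that only the training data is resampled; both models are evaluated on the untouched hacide.test set, so the comparison reflects the original class distribution.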
