
HOMEWORK

This homework will help you develop an understanding of the importance of data balancing for predictive modeling and its applications. Please submit your answers to each question in a Word document, with explanations, along with the supporting R script file you used for this assignment.

1. Install and load the ROSE package in RStudio. The package comes with a built-in imbalanced data set named “hacide”, which consists of two data frames: hacide.train and hacide.test. Load them into your working environment with the following code:
install.packages("ROSE")
library(ROSE)
data(hacide)
str(hacide.train)
As you can see, the training data set contains 1000 observations of 3 variables. “cls” is the binary response variable; x1 and x2 are the independent (predictor) variables. Visualize the distribution of all three variables using an appropriate visualization tool (histogram, etc.).
2. What is the imbalance severity in this data set?
3. Build a decision tree model D1 to predict “cls” from all other variables in the data set. Use the hacide.test data to assess the accuracy, F1 score, precision and recall of model D1. Hint: D1 = rpart(cls ~ ., data = hacide.train)
4. Use oversampling with the ovun.sample() function of the ROSE package to balance the data. What is the total number of observations in the new data set?
5. Build a decision tree model D2 on the balanced data to predict “cls”, again from all other variables in the data set. Use the hacide.test data to assess the accuracy, F1 score, precision and recall of model D2.
6. What are the AUC statistics of both models? Plot the ROC curves for both models.
Answered 1 day after Oct 06, 2021


Charulata Anil answered on Oct 07 2021
HOMEWORK
# install and load the package
install.packages("ROSE")
library(ROSE)
The ROSE package comes with a built-in imbalanced data set named hacide. It consists of two data frames: hacide.train and hacide.test. Let’s load it into the R environment:
data(hacide)
str(hacide.train)
'data.frame': 1000 obs. of 3 variables:
$ cls: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ x1 : num 0.2008 0.0166 0.2287 0.1264 0.6008 ...
$ x2 : num 0.678 1.5766 -0.5595 -0.0938 -0.2984 ...
# histograms of the two predictors
hist(hacide.train$x1)
hist(hacide.train$x2)
As the str() output shows, the data set contains 1000 observations of 3 variables. cls is the response variable; x1 and x2 are the independent (predictor) variables. Let’s check the severity of imbalance in this data set:
#check table
table(hacide.train$cls)
  0     1
980    20
#check classes distribution
prop.table(table(hacide.train$cls))
  0      1
0.98   0.02
As we can see, only 2% of the cases in this data set are positive and 98% are negative, so the data set is severely imbalanced. So, how badly can this affect our prediction accuracy?...
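To make that question concrete, here is a minimal sketch in base R of why accuracy alone can mislead on data this skewed. It is not part of the original answer: the labels are synthetic (they only mirror the 980/20 class counts above, not the real hacide.test data), the "model" is a hypothetical baseline that always predicts the majority class, and class_metrics is an illustrative helper name, not a ROSE or rpart function.

```r
# Illustrative labels only: 980 negatives and 20 positives, mirroring the
# class counts of hacide.train (assumption: not the actual test data).
truth <- factor(c(rep("0", 980), rep("1", 20)), levels = c("0", "1"))

# Hypothetical baseline "model": always predict the majority class "0".
pred <- factor(rep("0", length(truth)), levels = c("0", "1"))

# Hypothetical helper: accuracy, precision, recall and F1 for the positive
# class. Returns NA where a ratio is undefined (zero denominator).
class_metrics <- function(truth, pred, positive = "1") {
  tp <- sum(pred == positive & truth == positive)  # true positives
  fp <- sum(pred == positive & truth != positive)  # false positives
  fn <- sum(pred != positive & truth == positive)  # false negatives

  accuracy  <- mean(pred == truth)
  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall    <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }

  c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
}

# Confusion matrix and metrics for the majority-class baseline.
print(table(truth = truth, pred = pred))
m <- class_metrics(truth, pred)
print(m)
```

Under these assumptions the baseline reaches 98% accuracy while its recall on the positive class is 0 and its precision and F1 are undefined, which is why the assignment asks for precision, recall and F1 alongside accuracy. The same helper could be applied to the predictions of a fitted tree, for example after converting rpart class predictions on hacide.test to a factor with the same levels.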