Coding Homework 2
For the following exercises, use either the heart failure data set again or any data set of your choice. Make sure to submit proof of your code along with your answers.
Exercise 1
Use all available variables to predict death (or your binary outcome of choice if you use your own data set). Split the data into training and testing sets (75%/25% split) before answering the following questions.
1. Fit and plot a decision tree with the training data. Which predictors seem to be the most important based on how the tree looks?
2. What is the ROC AUC of the model on the training set? On the testing set?
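The workflow in this exercise (split, fit, plot, score) can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are simulated, every column name (age, ef, creat, diabetes, death) is a hypothetical stand-in rather than the actual heart failure variables, and the auc() helper is a hand-written rank-based calculation, not a function from any package.

```r
library(rpart)  # ships with standard R distributions as a recommended package

set.seed(1)

# Simulated stand-in data; every column name here is hypothetical.
n <- 300
dat <- data.frame(
  age      = round(rnorm(n, mean = 60, sd = 12)),
  ef       = round(rnorm(n, mean = 38, sd = 11)),
  creat    = round(rlnorm(n, meanlog = 0.2, sdlog = 0.4), 2),
  diabetes = rbinom(n, 1, 0.4)
)
lin <- -1 + 0.04 * (dat$age - 60) - 0.06 * (dat$ef - 38) + 0.8 * (dat$creat - 1.2)
dat$death <- factor(rbinom(n, 1, plogis(lin)), levels = c(0, 1))

# 75%/25% train/test split
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Classification tree using all available predictors
fit <- rpart(death ~ ., data = train, method = "class")

# Plot the tree; plot() errors on a root-only tree, so check for splits first
if (nrow(fit$frame) > 1) {
  plot(fit, margin = 0.1)
  text(fit, use.n = TRUE, cex = 0.8)
}

# ROC AUC via the rank-based (Mann-Whitney) formula; hypothetical helper
auc <- function(score, truth) {
  r <- rank(score)  # average ranks, so tied scores are handled
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Predicted probability of the positive class ("1")
p_train <- predict(fit, newdata = train, type = "prob")[, "1"]
p_test  <- predict(fit, newdata = test,  type = "prob")[, "1"]

auc_train <- auc(p_train, train$death)
auc_test  <- auc(p_test,  test$death)
print(c(train = auc_train, test = auc_test))
```

Besides reading the plot, fit$variable.importance gives a numeric summary of which predictors the tree relied on (it is NULL when the tree has no splits). A training AUC well above the testing AUC is a common sign of overfitting.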
Exercise 2
Using the training set, use LASSO to perform variable selection with death as the outcome variable (or your binary outcome of choice if you use your own data set). What are the selected variables? Do these variables match what you were expecting?
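One possible way to approach LASSO variable selection is sketched below. It assumes the glmnet package is installed and uses simulated data with hypothetical predictor names (x1 to x6), so it is not tied to the heart failure variables. The choice of lambda.1se rather than lambda.min is also an assumption made for illustration.

```r
library(glmnet)  # assumption: the glmnet package is installed

set.seed(2)

# Simulated stand-in training data; predictor names are hypothetical.
n <- 300
x <- matrix(rnorm(n * 6), nrow = n,
            dimnames = list(NULL, paste0("x", 1:6)))
# Only x1 and x2 carry signal in this toy example
y <- rbinom(n, 1, plogis(1.5 * x[, "x1"] - 1.5 * x[, "x2"]))

# alpha = 1 selects the LASSO penalty; lambda is chosen by cross-validation
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the more conservative lambda.1se
coefs <- as.matrix(coef(cv_fit, s = "lambda.1se"))

# Variables with a nonzero coefficient are the ones LASSO kept
selected <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")
print(selected)
```

glmnet expects a numeric matrix rather than a data frame. With a training data frame, something like model.matrix(death ~ ., data = train)[, -1] can build the predictor matrix, where death and train are placeholder names for the outcome and the training set.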
Bonus
1. Attempt to build a decision tree that achieves a better ROC AUC on the testing set than the one you made in Exercise 1. You can try adjusting the optional minsplit (the minimum number of observations that must exist in a node in order for a split to be attempted), minbucket (the minimum number of observations in any terminal leaf node), maxdepth (the maximum depth of any node of the final tree), and/or cp (complexity parameter) arguments of the rpart() function to do this. If you were successful, why do you think you were? If you were not successful, explain the thought process behind your attempt.
2. Create a prediction model that you have not already made in either homework (e.g., random forest [ranger], neural network [nnet], SVM [kernlab], gradient boosted trees [xgboost]) to predict an outcome of your choice with a data set of your choice (including the Alzheimer’s data). Why did you choose to make this model? If you searched for help with the implementation of your chosen model, how difficult was this process for you?
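For the first bonus item, a small grid over maxdepth and cp might look like the sketch below. The data are simulated, the column names (x1, x2, x3, y) are hypothetical, the grid values are arbitrary, and auc() is a hand-written helper rather than a package function. minsplit and minbucket are held at the rpart defaults here but could be varied in the same way.

```r
library(rpart)

set.seed(3)

# Simulated stand-in data; column names are hypothetical.
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- factor(rbinom(n, 1, plogis(dat$x1 - dat$x2)), levels = c(0, 1))

train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Rank-based (Mann-Whitney) ROC AUC; hypothetical helper
auc <- function(score, truth) {
  r <- rank(score)
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Candidate settings (arbitrary values chosen for illustration)
grid <- expand.grid(maxdepth = c(2, 4, 6), cp = c(0.001, 0.01, 0.05))
grid$auc_test <- NA_real_

for (i in seq_len(nrow(grid))) {
  ctrl <- rpart.control(minsplit = 20, minbucket = 7,
                        maxdepth = grid$maxdepth[i], cp = grid$cp[i])
  fit <- rpart(y ~ ., data = train, method = "class", control = ctrl)
  p <- predict(fit, newdata = test, type = "prob")[, "1"]
  grid$auc_test[i] <- auc(p, test$y)
}

# Settings sorted from best to worst testing AUC
print(grid[order(-grid$auc_test), ])
```

Picking the setting with the best testing AUC makes that AUC an optimistic estimate, since the testing set was used for selection. Cross-validating within the training set is a common way to choose settings before a single final check on the testing set.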