You have the chance to explore a dataset you have not seen before in the course. This dataset is a real credit approval dataset available at the UCI Repository and includes an outcome variable "Default payment" as a binary outcome (0/1), as in past binary classification problems. The other variables in the data include credit amount, education level, gender, age, repayment status by month (-1 for paid, 1 for delayed 1 month, 2 for delayed 2 months, etc.), amount of bill per month, and amount of payment per month (both for prior months). The website presents a study based on this dataset. This dataset has been carefully picked from a set of potential candidate datasets based on the feasibility of the tasks and the completeness of the data.
The analysis should include the following parts:
- Summary statistics table and data cleaning steps including: missing values (if any); transformations of the data (if required, such as normalizing or converting to factor variables); a narrative of which variables you think make sense as predictors for default
- Exploratory analysis including: a regression model on the entire dataset, with the outcome default. Here you should run a logit and a probit and select the best fitting model. You can try various sets of predictors and comment on which attempts you tried and which model seems to fit the best.
- Training a classifier using the CARET package. We suggest you try at least 3 algorithm types (for example, knn, naive_bayes, svmLinear, svmRadial, rf, nnet; the latter two are very slow, so be patient while running), pick the one with the best overall accuracy, and report which model you picked. Based on our own runtime, a cross-validation with 5 folds and a tuneLength of 10 in CARET is reasonable, but feel free to experiment if you wish. Some of the models (like random forest or neural net) can take a long time to run (hours in some settings). If a particular algorithm is too slow on your machine, reduce the number of runs or move to another algorithm.
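As an illustration of the cleaning and logit/probit steps listed above, here is a minimal sketch in base R. It runs on simulated toy data, not the real credit file: the column names (`credit_amount`, `age`, `education`, `repay_status`, `default`), the chosen predictors, and the use of AIC as the fit criterion are all assumptions made for the example, so adapt them to the actual variables and to whichever fit measure you prefer.

```r
# Toy data standing in for the real credit file; all column names are made up.
set.seed(1)
n <- 500
toy <- data.frame(
  credit_amount = round(runif(n, 10000, 500000)),
  age           = sample(21:70, n, replace = TRUE),
  education     = sample(1:4, n, replace = TRUE),
  repay_status  = sample(-1:3, n, replace = TRUE)
)
# Simulate defaults that become more likely with longer repayment delays.
toy$default <- rbinom(n, 1, plogis(-1.5 + 0.8 * toy$repay_status))

# Cleaning: count missing values per column, then treat education as a category.
missing_per_column <- colSums(is.na(toy))
toy$education <- factor(toy$education)
print(missing_per_column)
print(summary(toy))

# Fit the same predictors with two link functions.
form <- default ~ repay_status + age + education + log(credit_amount)
fit_logit  <- glm(form, data = toy, family = binomial(link = "logit"))
fit_probit <- glm(form, data = toy, family = binomial(link = "probit"))

# A lower AIC suggests a better trade-off between fit and complexity.
aic <- c(logit = AIC(fit_logit), probit = AIC(fit_probit))
print(aic)
best <- names(which.min(aic))
print(best)
```

Logit and probit often give very similar fits, so a small AIC gap is not strong evidence for either; trying different predictor sets in `form` usually matters more than the choice of link.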
This will be graded holistically. For example, you can choose only six predictors out of the many available and use a tuneLength of 5 and still get pretty good accuracy. Just make sure to explain why you used certain predictors (e.g., it's likely that repayment history helps predict ... or education could help predict because ...).
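The cross-validation setup described above (5 folds, a small tuneLength, a handful of predictors) could be sketched roughly as follows. This is only an illustration on simulated toy data with made-up column names; it assumes the caret package is installed and trains only `knn` so that it runs without extra packages. Other methods can be added to the `methods` vector, but they need their own packages (for instance, `svmLinear` relies on kernlab and `rf` on randomForest).

```r
library(caret)

# Toy data standing in for the real credit file; all column names are made up.
set.seed(42)
n <- 300
toy <- data.frame(
  credit_amount = runif(n, 10000, 500000),
  age           = sample(21:70, n, replace = TRUE),
  repay_status  = sample(-1:3, n, replace = TRUE)
)
p <- plogis(-1.5 + 0.8 * toy$repay_status)
# caret expects a factor outcome for classification.
toy$default <- factor(ifelse(rbinom(n, 1, p) == 1, "yes", "no"),
                      levels = c("no", "yes"))

# 5-fold cross-validation, shared by every algorithm.
ctrl <- trainControl(method = "cv", number = 5)

# Extend this vector to compare more algorithms.
methods <- c("knn")

fits <- lapply(methods, function(m) {
  set.seed(42)  # same seed so each algorithm sees the same folds
  train(default ~ ., data = toy,
        method = m,
        trControl = ctrl,
        tuneLength = 5,                      # number of tuning values to try
        preProcess = c("center", "scale"))   # put predictors on a common scale
})
names(fits) <- methods

# Best cross-validated accuracy per algorithm.
accuracy <- sapply(fits, function(f) max(f$results$Accuracy))
print(accuracy)
```

Centering and scaling matters for distance-based methods such as knn, because otherwise a large-valued predictor like the credit amount dominates the distance calculation.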
Deliverables:
- One Word, PDF, or LaTeX formatted file including a writeup of the work completed (a few pages including tables). Being concise is OK.
- One R code file (with comments as appropriate)
- Any ancillary files required to run the R code (e.g., the modified source file, if you convert it from Excel to CSV and remove the first row, which is X1, X2, ...)
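If you would rather not edit the source file by hand, one possible alternative is to skip the extra first row when reading the CSV. The sketch below assumes the file has already been exported from Excel to CSV; it writes a tiny stand-in file with invented column names so that it runs on its own, and uses the `skip` argument of `read.csv` to drop the X1, X2, ... row.

```r
# Write a tiny stand-in CSV: the first row mimics the X1, X2, ... row,
# the second row holds the (made-up) variable names.
tmp <- tempfile(fileext = ".csv")
writeLines(c("X1,X2,Y",
             "credit_amount,age,default",
             "20000,24,1",
             "120000,26,0"), tmp)

# skip = 1 drops the first line, so the second line is used as the header.
credit <- read.csv(tmp, skip = 1)
str(credit)
```

With this approach the original file stays untouched, and the R code documents the step that would otherwise be a manual edit.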
In all, this assignment can be done in an afternoon of concentrated effort, but we suggest you start it early. Please do reach out to the TAs if you have questions. If you have incomplete responses or have bugs, we will still grade for partial credit, so please do submit the requested files. And finally, explore the new data and have fun trying the powerful CARET package!