{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "9OBvBOCkPrga"
},
"source": [
"## Assignment 4"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "bEmSTWZSPrgb"
},
"source": [
"This assignment is based on the content discussed in Module 8 and covers the use of Decision Trees and Ensemble Models in classification and regression problems."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "1cUoTzQLPrgc"
},
"source": [
"## Learning outcomes "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Q1ygYVo_Prgc"
},
"source": [
"- Understand how to use decision trees on a dataset to make predictions\n",
"- Learn how to tune decision-tree hyper-parameters using RandomGrid search\n",
"- Learn how effective ensemble algorithms are (Random Forest, AdaBoost, Extra Trees classifier, Gradient Boosted Trees)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "9hjVbQlVPrgd"
},
"source": [
"In the first part of this assignment, you will use Classification Trees to predict whether a client has a default payment or not. You can find the data needed for this assignment [here](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) \n",
"\n",
"This dataset covers customer default payments in Taiwan. From a risk-management perspective, an accurate estimate of the probability of default is more valuable than a binary classification of clients as credible or not credible. Because the real probability of default is unknown, the original study introduced the Sorting Smoothing Method to estimate it.\n",
"\n",
"The imports required for this project are given below. Make sure all of the required libraries are installed. You may use conda or pip, depending on your setup.\n",
"\n",
"__NOTE:__ Since the data is in Excel format, you need to install `xlrd` to read the Excel file into a pandas DataFrame. You can install it with `pip install xlrd`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "R376ZBnBPrge"
},
"outputs": [],
"source": [
"#required imports\n",
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ddF9R5pdPrgi"
},
"source": [
"After installing the necessary libraries, download the data. The Excel file has two header rows, so reading it leaves an unnamed ID column and a first row that holds the descriptive column names instead of data. Two extra operations remove that column and that row."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "CtNCjjr7Prgj"
},
"outputs": [],
"source": [
"# Load the data directly from the UCI repository\n",
"dataset = pd.read_excel(\"https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls\")\n",
"#dataset.columns = dataset.iloc[0]\n",
"#dataset.drop(['ID'], inplace=True)\n",
"# Drop the unnamed ID column\n",
"dataset.drop(dataset.columns[dataset.columns.str.contains('unnamed', case=False)], axis=1, inplace=True)\n",
"# Drop row 0, which holds the descriptive column names rather than data\n",
"# (drop with inplace=True returns None, so there is nothing useful to print)\n",
"dataset.drop(0, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "cMh-sEIdPrgl"
},
"source": [
"The first ten rows of the dataset are shown below."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "E0lAPOXQPrgl",
"outputId": "ea66ba57-f32c-4b39-c60a-e52402acbca1"
},
"outputs": [
{
"data": {
"text/plain": [
" X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 ... X15 X16 X17 \\\n",
"1 20000 2 2 1 24 2 2 -1 -1 -2 ... 0 0 0 \n",
"2 120000 2 2 2 26 -1 2 0 0 0 ... 3272 3455 3261 \n",
"3 90000 2 2 2 34 0 0 0 0 0 ... 14331 14948 15549 \n",
"4 50000 2 2 1 37 0 0 0 0 0 ... 28314 28959 29547 \n",
"5 50000 1 2 1 57 -1 0 -1 0 0 ... 20940 19146 19131 \n",
"6 50000 1 1 2 37 0 0 0 0 0 ... 19394 19619 20024 \n",
"7 500000 1 1 2 29 0 0 0 0 0 ... 542653 483003 473944 \n",
"8 100000 2 2 2 23 0 -1 -1 0 0 ... 221 -159 567 \n",
"9 140000 2 3 1 28 0 0 2 0 0 ... 12211 11793 3719 \n",
"10 20000 1 3 2 35 -2 -2 -2 -2 -1 ... 0 13007 13912 \n",
"\n",
" X18 X19 X20 X21 X22 X23 Y \n",
"1 0 689 0 0 0 0 1 \n",
"2 0 1000 1000 1000 0 2000 1 \n",
"3 1518 1500 1000 1000 1000 5000 0 \n",
"4 2000 2019 1200 1100 1069 1000 0 \n",
"5 2000 36681 10000 9000 689 679 0 \n",
"6 2500 1815 657 1000 1000 800 0 \n",
"7 55000 40000 38000 20239 13750 13770 0 \n",
"8 380 601 0 581 1687 1542 0 \n",
"9 3329 0 432 1000 1000 1000 0 \n",
"10 0 0 0 13007 1122 0 0 \n",
"\n",
"[10 rows x 24 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "r4jchSRoPrgr"
},
"source": [
"## Questions (15 points total)\n",
"\n",
"#### Question 1 (2 pts)\n",
"Build a classifier using a decision tree and calculate its confusion matrix. Try at least two different hyper-parameter settings and discuss the results."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "1Qr1SPGlPrgr"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 30000 entries, 1 to 30000\n",
"Data columns (total 24 columns):\n",
"X1 30000 non-null object\n",
"X2 30000 non-null object\n",
"X3 30000 non-null object\n",
"X4 30000 non-null object\n",
"X5 30000 non-null object\n",
"X6 30000 non-null object\n",
"X7 30000 non-null object\n",
"X8 30000 non-null object\n",
"X9 30000 non-null object\n",
"X10 30000 non-null object\n",
"X11 30000 non-null object\n",
"X12 30000 non-null object\n",
"X13 30000 non-null object\n",
"X14 30000 non-null object\n",
"X15 30000 non-null object\n",
"X16 30000 non-null object\n",
"X17 30000 non-null object\n",
"X18 30000 non-null object\n",
"X19 30000 non-null object\n",
"X20 30000 non-null object\n",
"X21 30000 non-null object\n",
"X22 30000 non-null object\n",
"X23 30000 non-null object\n",
"Y 30000 non-null object\n",
"dtypes: object(24)\n",
"memory usage: 5.7+ MB\n",
"[[14306 3261]\n",
" [ 2883 2050]]\n",
"<class 'numpy.ndarray'>\n",
"<class 'numpy.ndarray'>\n",
"[[16868 699]\n",
" [ 3316 1617]]\n",
"[[16669 898]\n",
" [ 3122 1811]]\n"
]
}
],
"source": [
"# YOUR CODE HERE\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import accuracy_score,confusion_matrix\n",
"dataset.info()\n",
"dataset.describe()\n",
"# Split the data into features (X1..X23) and the target (Y), cast from object to int\n",
"ind=dataset.iloc[:,0:23].values.astype('int')\n",
"dep=dataset.iloc[:,23].values.astype('int')\n",
"# Split into train and test sets; note that test_size=0.75 holds out 75% of the rows for testing\n",
"x_train,x_test,y_train,y_test=train_test_split(ind,dep,test_size=0.75,random_state=0)\n",
"# Baseline: default hyper-parameters (fully grown tree)\n",
"tree=DecisionTreeClassifier()\n",
"tree.fit(x_train,y_train)\n",
"pred=tree.predict(x_test)\n",
"print(confusion_matrix(y_test,pred))\n",
"# Second setting: entropy criterion with a shallow tree (max_depth=2)\n",
"tree=DecisionTreeClassifier(criterion=\"entropy\",max_depth=2,min_samples_leaf=1,min_samples_split=2)\n",
"tree.fit(x_train,y_train)\n",
"pred=tree.predict(x_test)\n",
"print(type(pred))\n",
"print(type(y_test))\n",
"print(confusion_matrix(y_test,pred))\n",
"# Third setting: gini criterion, max_depth=4, larger leaf and split minimums\n",
"tree=DecisionTreeClassifier(criterion=\"gini\",max_depth=4,min_samples_leaf=2,min_samples_split=3)\n",
"tree.fit(x_train,y_train)\n",
"pred=tree.predict(x_test)\n",
"print(confusion_matrix(y_test,pred))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "QwcecRukPrgw"
},
"source": [
"#### Question 2 (4 pts)\n",
"\n",
"Rebuild the decision tree from the previous question, but this time choose the hyper-parameters with a RandomGrid search. Compare the results."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "4XHRmsWOPrgx"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'numpy.ndarray'>\n",
"<class 'numpy.ndarray'>\n",
"[[16654 913]\n",
" [ 3112 1821]]\n"
]
}
],
"source": [
"# YOUR CODE HERE\n",
"from sklearn.model_selection import GridSearchCV\n",
"# Search space: split criterion, tree depth and minimum leaf size\n",
"parameters = {'criterion':('gini','entropy'),'max_depth':(2,3,4,5,6,7,8),'min_samples_leaf':(2,3,4,5,6,7,8)}\n",
"# Exhaustive grid search with 3-fold cross-validation on the training set\n",
"grid=GridSearchCV(DecisionTreeClassifier(),param_grid=parameters,cv=3)\n",
"grid_model=grid.fit(x_train,y_train)\n",
"grid_model.best_estimator_\n",
"# Refit a tree with the best settings reported by the search (gini, max_depth=3, min_samples_leaf=4).\n",
"# Arguments left at their defaults are omitted, including min_impurity_split and presort,\n",
"# which newer scikit-learn releases no longer accept.\n",
"tree=DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_leaf=4)\n",
"tree.fit(x_train,y_train)\n",
"pred=tree.predict(x_test)\n",
"print(type(pred))\n",
"print(type(y_test))\n",
"print(confusion_matrix(y_test,pred))"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "dEvsYwiXPrg3"
},
"source": [
"#### Question 3 (6 pts)\n",
"\n",
"Try to build the same classifier...