

The Readme.md files for both E09 and E10 contain questions that need to be answered by sending in code, plots or text. This is all explained in the “Readme.md” for E09 and E10.


There are 18 questions in E09 and 7 questions in E10












09_random_forests/images/feature_importance_demo.png
09_random_forests/README.md

# Random Forests

This exercise focuses on ensemble methods using decision trees. In particular we will explore Random Forests and Extremely Randomized Trees, which are methods that take extra steps to create variety among the base learners (the decision trees) in the ensemble.

In this assignment we will continue to use the cancer dataset that we used in the previous assignment.

## Section 1

In the first section we will take steps towards ensemble classifier methods using vanilla decision trees.

### Section 1.1

Before we try out an ensemble classifier based on trees, let's first apply a single decision tree to see what we can improve on.

In this assignment we will be using the class `CancerClassifier` to carry out experiments. The only two input arguments to `CancerClassifier` are `classifier` and `train_ratio`. In general, `classifier` can be any type of `sklearn` classifier like `sklearn.tree.DecisionTreeClassifier`.

To start with, finish implementing the following methods of the `CancerClassifier` class:

* `confusion_matrix()`
* `accuracy()`
* `precision()`
* `recall()`
* `cross_validation_accuracy()`

The last method should perform 10-fold cross validation on the entire cancer dataset and return the average accuracy of those 10 folds.

Example of usage:

```
classifier_type = sklearn.some.Classifier(param=3)
cc = CancerClassifier(classifier_type)
print(cc.accuracy())
```

### Section 1.2

*This question should be answered on Mimir*

Now run `CancerClassifier` with a `DecisionTreeClassifier` and evaluate the performance with the methods that you have finished implementing. Answer the following questions:

1. Upload the result for each metric (confusion matrix, accuracy, precision, recall, cross validation accuracy)
2. What do the precision and recall tell us that the accuracy can't?
3. What could possibly explain the difference between accuracy and cross validation accuracy?
4. How would you suggest a confusion matrix, precision and recall for cross validation would be formulated?

## Section 2

Bagging where the base learners are decision trees is often referred to as *tree bagging*. Each decision tree is trained using a bootstrap subset of the training set to create variety amongst the base learners.

Random forests go even further and create more variety by sampling the features when making splits in each node in each decision tree. This means that in a given node, instead of using for example all $D=10$ features, the algorithm will choose a random subset of a fixed size, say $d=3$, in this instance features 4, 6 and 10. Another node would use a different subset, say features 2, 5 and 7. A typical choice of $d$ is $\sqrt{D}$ or $\log_2(D)$.

The number of features (attributes) for our problem is of course 30, the dimensionality of each feature vector in the dataset.

### Section 2.1

*This question should be answered via Mimir*

Implement a vanilla random forest and apply it to the problem in the same way that the decision tree was evaluated, using your `CancerClassifier`.

1. Upload the result for each metric (confusion matrix, accuracy, precision, recall, cross validation accuracy)
2. What is the best combination of a total number of trees in the forest (`n_estimators`) and the maximum number of features considered in each split (`max_features`) that you can find? What are the metric results for this parameter selection?

### Section 2.2

Add a method to your class called `feature_importance` that plots a bar plot of feature importances for the random forest and returns a list of feature indices sorted by importance (highest first). You can access the feature importances of a `sklearn` classifier with `classifier.feature_importances_`.

Example inputs and outputs:

```
cc = CancerClassifier(my_classifier)
feature_idx = cc.feature_importance()
```

First a plot should appear; once the user has closed the plot window, the user can access the feature indices.

```
>> feature_idx
[22, 6, 7, ...]
```

For the first 5 features, this plot looks like the following:

![Feature Importance](images/feature_importance_demo.png)

Upload your plot as `2_2_1.png`

### Section 2.3

*This question should be answered via Mimir*

1. Describe how feature importance is calculated
2. Which feature is the most important and which is the least important? Use information from either [assignment 8](../08_SVM/README.md) or [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data) to name these features.

### Section 2.4

The out-of-bag (OOB) error (see [OOB Errors for Random Forests](https://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html#sphx-glr-auto-examples-ensemble-plot-ensemble-oob-py)) is the average error for each training example, calculated using predictions from the trees that do not contain that training example in their respective bootstrap sample. This means that a random forest can be trained and validated simultaneously. The validation data is precisely the data that is **not** used to determine the parameters in the decision trees.

Let's examine the development of the out-of-bag error as we increase the size of the forest. Adapt the code in the link to the cancer dataset and generate the plot. You can use `_plot_oob_error` for this. Turn in your plot as `2_4_1.png`.

### Section 2.5

*This question should be answered via Mimir*

1. What can be said about the relationship between the OOB error rate and the number of estimators?
2. Do all three types of ensembles follow this correlation?
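
To make the out-of-bag idea from Section 2.4 more concrete, here is a minimal sketch that only simulates the bootstrap sampling step, using the Python standard library rather than `sklearn`. The helper names `bootstrap_indices` and `out_of_bag_indices` and the seed are hypothetical, chosen for illustration; 569 is the number of samples in the breast cancer dataset. It is not how `RandomForestClassifier` is implemented, only an illustration of which samples a tree never sees.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one tree's bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def out_of_bag_indices(n_samples, in_bag):
    """Return the indices that never appear in the bootstrap sample."""
    seen = set(in_bag)
    return [i for i in range(n_samples) if i not in seen]


rng = random.Random(0)  # hypothetical seed, used only for reproducibility
n_samples = 569         # number of samples in the breast cancer dataset
n_trees = 100           # illustrative forest size

oob_fractions = []
for _ in range(n_trees):
    in_bag = bootstrap_indices(n_samples, rng)
    oob = out_of_bag_indices(n_samples, in_bag)
    oob_fractions.append(len(oob) / n_samples)

mean_oob_fraction = sum(oob_fractions) / n_trees
print(f"mean out-of-bag fraction per tree: {mean_oob_fraction:.3f}")
```

Each tree leaves out roughly a fraction $(1 - 1/n)^n \approx 1/e$ of the training samples, so with enough trees every sample is out-of-bag for several of them, and those trees can act as its validators.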

## Section 3

Extremely Randomized Trees are just like Random Forests in that they are an ensemble of decision trees that are trained on bootstrap subsets of the training data, and in each split a random subset of features is taken into consideration. But instead of searching for an optimum threshold amongst the candidate features in each split, random thresholds are tested, one per candidate feature, and then the best split is chosen amongst those.

### Section 3.1

*This question should be answered via Mimir*

Now run `CancerClassifier` with an `ExtraTreesClassifier`. Plot the same feature importance bar plot as before. Upload it as `3_1_1.png`.

1. Upload the result for each metric (confusion matrix, accuracy, precision, recall, cross validation accuracy)
2. What is the most important feature and the least important feature?

### Section 3.2

Now make a similar plot to the one in section 2.4 but now with extremely randomized trees. You can use `_plot_extreme_oob_error` for this. Upload your plot as `3_2_1.png`.

### Independent section

This is an open-ended independent section. You are welcome to analyze these models further, test them out on other datasets, compare parameters w.r.t. accuracy, recall, etc.
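
As a possible starting point for further analysis, the split rule described in Section 3 (one random threshold per candidate feature, then keep the best) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the `sklearn` implementation: the function names `gini`, `weighted_impurity` and `extra_trees_split`, the toy data, and the choice of Gini impurity as the split criterion are all assumptions made for the example.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def weighted_impurity(values, labels, threshold):
    """Size-weighted Gini impurity after splitting on value <= threshold."""
    left = [t for v, t in zip(values, labels) if v <= threshold]
    right = [t for v, t in zip(values, labels) if v > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)


def extra_trees_split(X, t, n_candidates, rng):
    """Return the best (feature, threshold, impurity) among random candidates."""
    n_features = len(X[0])
    best = None
    # Sample the candidate features, then draw ONE random threshold for each.
    for j in rng.sample(range(n_features), n_candidates):
        column = [row[j] for row in X]
        threshold = rng.uniform(min(column), max(column))
        score = weighted_impurity(column, t, threshold)
        if best is None or score < best[2]:
            best = (j, threshold, score)
    return best


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.8, 2.0], [0.9, 5.5], [1.0, 1.5]]
t = [0, 0, 0, 1, 1, 1]

feature, threshold, score = extra_trees_split(X, t, 2, random.Random(1))
print(f"feature={feature}, threshold={threshold:.3f}, impurity={score:.3f}")
```

A random forest would instead scan every possible threshold of each candidate feature; drawing a single random threshold per feature is cheaper and adds further variety between the trees.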
09_random_forests/template.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (confusion_matrix, accuracy_score, recall_score,
                             precision_score)

from collections import OrderedDict


class CancerClassifier:
    '''
    A general class to try out different sklearn classifiers
    on the cancer dataset
    '''
    def __init__(self, classifier, train_ratio: float = 0.7):
        self.classifier = classifier
        cancer = load_breast_cancer()
        self.X = cancer.data  # all feature vectors
        self.t = cancer.target  # all corresponding labels
        self.X_train, self.X_test, self.t_train, self.t_test =\
            train_test_split(
                cancer.data, cancer.target,
                test_size=1-train_ratio, random_state=109)

        # Fit the classifier to the training data here
        ...

    def confusion_matrix(self) -> np.ndarray:
        '''Returns the confusion matrix on the test data
        '''
        ...

    def accuracy(self) -> float:
        '''Returns the accuracy on the test data
        '''
        ...

    def precision(self) -> float:
        '''Returns the precision on the test data
        '''
        ...

    def recall(self) -> float:
        '''Returns the recall on the test data
        '''
        ...

    def cross_validation_accuracy(self) -> float:
        '''Returns the average 10-fold cross validation
        accuracy on the entire dataset.
        '''
        ...

    def feature_importance(self) -> list:
        '''
        Draw and show a barplot of feature importances
        for the current classifier and return a list of
        indices, sorted by feature importance (high to low).
        '''
        ...


def _plot_oob_error():
    RANDOM_STATE = 1337
    ensemble_clfs = [
        ("RandomForestClassifier, max_features='sqrt'",
            RandomForestClassifier(
                n_estimators=100,
                warm_start=True,
                oob_score=True,
                max_features="sqrt",
                random_state=RANDOM_STATE)),
        ("RandomForestClassifier, max_features='log2'",
            RandomForestClassifier(
                n_estimators=100,
                warm_start=True,
                max_features='log2',
                oob_score=True,
                random_state=RANDOM_STATE)),
Answered 3 days after Oct 18, 2021


Vicky answered on Oct 22 2021
09_random_forests/.ipynb_checkpoints/Untitled-checkpoint.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.datasets import load_breast_cancer\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.model_selection import train_test_split, cross_val_score\n",
"from sklearn.metrics import (confusion_matrix, accuracy_score, recall_score,\n",
"                             precision_score)\n",
"\n",
"from collections import OrderedDict\n",
"\n",
"\n",
"class CancerClassifier:\n",
"    '''\n",
"    A general class to try out different sklearn classifiers\n",
"    on the cancer dataset\n",
"    '''\n",
"    def __init__(self, classifier, train_ratio: float = 0.7):\n",
"        self.classifier = classifier\n",
"        cancer = load_breast_cancer()\n",
"        self.X = cancer.data  # all feature vectors\n",
"        self.t = cancer.target  # all corresponding labels\n",
"        self.X_train, self.X_test, self.t_train, self.t_test =\\\n",
"            train_test_split(\n",
"                cancer.data, cancer.target,\n",
"                test_size=1-train_ratio, random_state=109)\n",
"\n",
"        # Fit the classifier to the training data here\n",
"        self.classifier.fit(self.X_train, self.t_train)\n",
"\n",
"    def confusion_matrix(self) -> np.ndarray:\n",
"        '''Returns the confusion matrix on the test data\n",
"        '''\n",
"        return confusion_matrix(self.t_test, self.classifier.predict(self.X_test))\n",
"\n",
"    def accuracy(self) -> float:\n",
"        '''Returns the accuracy on the test data\n",
"        '''\n",
"        return accuracy_score(self.t_test, self.classifier.predict(self.X_test))\n",
"\n",
"    def precision(self) -> float:\n",
"        '''Returns the precision on the test data\n",
"        '''\n",
"        return precision_score(self.t_test, self.classifier.predict(self.X_test))\n",
"\n",
"    def recall(self) -> float:\n",
"        '''Returns the recall on the test data\n",
"        '''\n",
"        return recall_score(self.t_test, self.classifier.predict(self.X_test))\n",
"\n",
"    def cross_validation_accuracy(self) -> float:\n",
"        '''Returns the average 10-fold cross validation\n",
"        accuracy on the entire dataset.\n",
"        '''\n",
"        return np.mean(cross_val_score(self.classifier, self.X, self.t, cv=10))\n",
"\n",
"    def feature_importance(self) -> list:\n",
"        '''\n",
"        Draw and show a barplot of feature importances\n",
"        for the current classifier and return a list of\n",
"        indices, sorted by feature importance (high to low).\n",
"        '''\n",
"        plt.bar(range(len(self.classifier.feature_importances_[:5])), self.classifier.feature_importances_[:5])\n",
"        plt.xlabel(\"Feature index\")\n",
"        plt.ylabel(\"Feature importance\")\n",
"        plt.show()\n",
"        return np.argsort(self.classifier.feature_importances_)[::-1]\n",
"\n",
"\n",
"def _plot_oob_error():\n",
"    RANDOM_STATE = 1337\n",
"    ensemble_clfs = [\n",
"        (\"RandomForestClassifier, max_features='sqrt'\",\n",
"            RandomForestClassifier(\n",
"                n_estimators=100,\n",
"                warm_start=True,\n",
"                oob_score=True,\n",
"                max_features=\"sqrt\",\n",
"                random_state=RANDOM_STATE)),\n",
"        (\"RandomForestClassifier, max_features='log2'\",\n",
"            RandomForestClassifier(\n",
"                n_estimators=100,\n",
"                warm_start=True,\n",
"                max_features='log2',\n",
"                oob_score=True,\n",
"                random_state=RANDOM_STATE)),\n",
"        (\"RandomForestClassifier, max_features=None\",\n",
"            RandomForestClassifier(\n",
"                n_estimators=100,\n",
"                warm_start=True,\n",
"                max_features=None,\n",
"                oob_score=True,\n",
"                random_state=RANDOM_STATE))]\n",
"\n",
"    # Map a classifier name to a list of (n_estimators, error rate) pairs.\n",
"    error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)\n",
"\n",
"    min_estimators = 30\n",
"    max_estimators = 175\n",
"\n",
"    for label, clf in ensemble_clfs:\n",
"        for i in range(min_estimators, max_estimators + 1):\n",
"            clf.set_params(n_estimators=i)\n",
"            cancer = load_breast_cancer()\n",
"            clf.fit(cancer.data, cancer.target)  # Use cancer data here\n",
"            oob_error = 1 - clf.oob_score_\n",
"            error_rate[label].append((i, oob_error))\n",
"\n",
"    # Generate the \"OOB error rate\" vs. \"n_estimators\" plot.\n",
"    for label, clf_err in error_rate.items():\n",
"        xs, ys = zip(*clf_err)\n",
"        plt.plot(xs, ys, label=label)\n",
"\n",
"    plt.xlim(min_estimators, max_estimators)\n",
"    plt.xlabel(\"n_estimators\")\n",
"    plt.ylabel(\"OOB error rate\")\n",
"    plt.legend(loc=\"upper right\")\n",
"    plt.show()\n",
"\n",
"\n",
"def _plot_extreme_oob_error():\n",
"    RANDOM_STATE = 1337\n",
"    ensemble_clfs = [\n",
"        (\"ExtraTreesClassifier, max_features='sqrt'\",\n",
"            ExtraTreesClassifier(\n",
"                n_estimators=100,\n",
"                warm_start=True,\n",
"                bootstrap=True,\n",
"                oob_score=True,\n",
"                max_features=\"sqrt\",\n",
"                random_state=RANDOM_STATE)),\n",
"        (\"ExtraTreesClassifier, max_features='log2'\",\n",
"            ExtraTreesClassifier(\n",
"                n_estimators=100,\n",
"                warm_start=True,\n",
"                bootstrap=True,\n",
"                max_features='log2',\n",
"                oob_score=True,\n",
"                random_state=RANDOM_STATE)),\n",
"        (\"ExtraTreesClassifier, max_features=None\",\n",
"            ExtraTreesClassifier(\n",
"                n_estimators=100,\n",
"                warm_start=True,\n",
"                bootstrap=True,\n",
"                max_features=None,\n",
"                oob_score=True,\n",
"                random_state=RANDOM_STATE))]\n",
"\n",
"    # Map a classifier name to a list of (n_estimators, error rate) pairs.\n",
"    error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)\n",
"\n",
"    min_estimators = 30\n",
"    max_estimators = 175\n",
"\n",
"    for label, clf in ensemble_clfs:\n",
"        for i in range(min_estimators, max_estimators + 1):\n",
"            clf.set_params(n_estimators=i)\n",
"            cancer = load_breast_cancer()\n",
"            clf.fit(cancer.data, cancer.target)  # Use cancer data here\n",
"            oob_error = 1 - clf.oob_score_\n",
"            error_rate[label].append((i, oob_error))\n",
"\n",
"    # Generate the \"OOB error rate\" vs. \"n_estimators\" plot.\n",
"    for label, clf_err in error_rate.items():\n",
"        xs, ys = zip(*clf_err)\n",
"        plt.plot(xs, ys, label=label)\n",
"\n",
"    plt.xlim(min_estimators, max_estimators)\n",
"    plt.xlabel(\"n_estimators\")\n",
"    plt.ylabel(\"OOB error rate\")\n",
"    plt.legend(loc=\"upper right\")\n",
"    plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Now run `CancerClassifier` with a `DecisionTreeClassifier` and evaluate the performance with the methods that you have finished implementing. Answer the following questions:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Upload the result for each metric (confusion matrix, accuracy, precision, recall, cross validation accuracy)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 59 4]\n",
" [ 3 105]]\n",
"0.9590643274853801\n",
"0.963302752293578\n",
"0.9722222222222222\n",
"0.9209586466165414\n"
]
}
],
"source": [
"classifier_type = DecisionTreeClassifier()  # imported above; the bare name `sklearn` is never imported\n",
"cc = CancerClassifier(classifier_type)\n",
"print(cc.confusion_matrix())\n",
"print(cc.accuracy())\n",
"print(cc.precision())\n",
"print(cc.recall())\n",
"print(cc.cross_validation_accuracy())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Confusion Matrix = [[ 59 4]\n",
" [ 3 105]]\n",
" \n",
"Accuracy = 0.9590643274853801\n",
"\n",
"Precision = 0.963302752293578\n",
"\n",
"Recall = 0.9722222222222222\n",
"\n",
"Cross Validation Accuracy = 0.9209586466165414"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. What does the precision and recall tell us that the accuracy can't?"
]
},
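{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall: The ability of a model to find all the relevant cases within a data set. Mathematically, recall is the number of true positives divided by the number of true positives plus the number of false negatives.\n",
"\n",
"Precision: The ability of a classification model to identify only the relevant data points. Mathematically, precision is the number of true positives divided by the number of true positives plus the number of false positives.\n",
"\n",
"Accuracy counts all errors together, so it cannot show whether the mistakes are false positives or false negatives. Precision and recall separate the two, which matters when the classes are imbalanced or when one kind of error is more costly than the other."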
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. What could possibly explain the difference between accuracy and cross validation accuracy?"
]
},
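{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cross validation accuracy is the average of the accuracies of the individual folds, each evaluated on a different held-out part of the entire dataset, whereas the plain accuracy comes from a single train/test split. A single split can be easier or harder than average by chance, so the two numbers can differ. Cross validation is a procedure used to detect overfitting and to estimate the skill of the model on new data, and the estimate also depends on the value of k chosen for the dataset."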
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. How would you suggest a confusion matrix, precision and recall for cross validation would be formulated?"
]
},
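{
"cell_type": "markdown",
"metadata": {},
"source": [
"Confusion Matrix = [[TN FP] \n",
"[FN TP]]\n",
"\n",
"Precision = TP/(TP+FP)\n",
"\n",
"Recall = TP/(TP+FN)\n",
"\n",
"This is the layout that `sklearn.metrics.confusion_matrix` uses for the labels 0 and 1 (rows are true classes, columns are predicted classes), and it matches the precision and recall values reported above. For cross validation, one option is to sum the confusion matrices of the 10 folds and compute precision and recall from the summed counts."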
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Upload the result for each metric (confusion matrix, accuracy, precision, recall, cross validation accuracy)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 58 5]\n",
" [ 0 108]]\n",
"0.9707602339181286\n",
"0.9557522123893806\n",
"1.0\n",
"0.9596491228070176\n"
]
}
],
"source": [
"classifier_type = RandomForestClassifier()  # imported above; the bare name `sklearn` is never imported\n",
"cc = CancerClassifier(classifier_type)\n",
"print(cc.confusion_matrix())\n",
"print(cc.accuracy())\n",
"print(cc.precision())\n",
"print(cc.recall())\n",
"print(cc.cross_validation_accuracy())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Confusion Matrix = [[ 58 5]\n",
" [ 0 108]]\n",
" \n",
"Accuracy = 0.9707602339181286\n",
"\n",
"Precision = 0.9557522123893806\n",
"\n",
"Recall = 1.0\n",
"\n",
"Cross Validation Accuracy = 0.9596491228070176"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. What is the best combination of a total number of trees in the forest (`n_estimators`) and the maximum number of features considered in each split (`max_features`) that you can find? What are the metric results for this parameter selection?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"n_estimators = 135, max_features=None"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "(base64-encoded PNG omitted: bar plot of feature importance against feature index)",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"array([ 7, 27, 6, 22, 20, 23, 26, 3, 0, 2, 13, 1, 24, 21, 5, 25, 19,\n",
" 10, 28, 4, 12, 11, 29, 16, 9, 17, 14, 15, 8, 18], dtype=int64)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cc = CancerClassifier(classifier_type)\n",
"feature_idx = cc.feature_importance()\n",
"feature_idx"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Describe how feature importance is calculated"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples. The higher the value the more important the feature."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Which feature is the most important and which is the least important. Use information from either [assignment 8](../08_SVM/README.md) or [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data) to name these features."
]
},
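{
"cell_type": "markdown",
"metadata": {},
"source": [
"Features 7, 27, 6, 22 and 20 are the most important, with feature 7 (mean concave points in scikit-learn's `feature_names`) ranked first. Features 17, 14, 15, 8 and 18 are the least important, with feature 18 (symmetry error) ranked last."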
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"_plot_oob_error()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. What can be said about the relationship between the OOB error rate and the number of estimators?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the number of estimators increases, the OOB error rate decreases."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Do all three types of ensembles follow this correlation?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yes, overall all three types of ensembles follow this trend."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Upload the result for each metric (confusion matrix, accuracy, precision, recall, cross validation accuracy)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 59 4]\n",
" [ 0 108]]\n",
"0.9766081871345029\n",
"0.9642857142857143\n",
"1.0\n",
"0.9683583959899748\n"
]
}
],
"source": [
"classifier_type = ExtraTreesClassifier()  # imported above; the bare name `sklearn` is never imported\n",
"cc = CancerClassifier(classifier_type)\n",
"print(cc.confusion_matrix())\n",
"print(cc.accuracy())\n",
"print(cc.precision())\n",
"print(cc.recall())\n",
"print(cc.cross_validation_accuracy())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Confusion Matrix = [[ 59 4]\n",
" [ 0 108]]\n",
" \n",
"Accuracy = 0.9766081871345029\n",
"\n",
"Precision = 0.9642857142857143\n",
"\n",
"Recall = 1.0\n",
"\n",
"Cross Validation Accuracy = 0.9683583959899748"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. What is the most important feature and the least important feature?"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"image/png":...