Please see attached PDF.


ISYE 6740 Homework 7 (Last Homework)

Total 100 points. As usual, please submit a report with sufficient explanation of your answers to each of the questions, together with your code, in a zip folder.

1 Random forest for email spam classifier (30 points)

Your task for this question is to build a spam classifier using the UCI email spam dataset, https://archive.ics.uci.edu/ml/datasets/Spambase. The collection of spam e-mails came from the postmaster and individuals who had filed spam. The collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ’george’ and the area code ’650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general-purpose spam filter. Load the data.

1. (5 points) How many instances of spam versus regular emails are there in the data? How many data points are there? How many features are there? Note: there may be some missing values; you can just fill in zero.

2. (10 points) Build a classification tree model (also known as the CART model). In Python, this can be done using sklearn.tree.DecisionTreeClassifier. In your answer, you should report the fitted tree model, similar to the tree plot shown in the “Random forest” lecture, Page 16. In Python, this plot can be produced with the sklearn.tree.plot_tree function.

3. (15 points) Also build a random forest model. In Python, this can be done using sklearn.ensemble.RandomForestClassifier. Now partition the data to use the first 80% for training and the remaining 20% for testing. Your task is to compare and report the AUC for your classification tree and random forest models on the testing data, respectively. To report your results, please try different tree sizes. Plot the curve of AUC versus tree size, similar to Page 15 of the lecture slides on “Random Forest”.

Background information: In classification problems, we use AUC (Area Under the Curve) as a performance measure. It is one of the most important evaluation metrics for checking any classification model's performance. The ROC (Receiver Operating Characteristics) curve measures classification accuracy at various threshold settings, and AUC measures the total area under the ROC curve. The higher the AUC, the better the model is at distinguishing the two classes. If you want to read a bit more about the AUC curve, check out this link: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5. For instance, in Python, AUC can be computed using sklearn.metrics.roc_auc_score, and you will have to figure out the details.
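To make the sklearn calls named above concrete (DecisionTreeClassifier, plot_tree, RandomForestClassifier, roc_auc_score), here is a minimal sketch of one way to produce the tree plot and the AUC-versus-tree-size comparison. It is only an illustration, not the posted solution; the data path, the unshuffled 80/20 split, and the grid of tree sizes are assumptions.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# data path is an assumption; the last column of Spambase is the 0/1 class label (1 = spam)
df = pd.read_csv('spambase.data', header=None).fillna(0)
X, y = df.iloc[:, :-1].values, df.iloc[:, -1].values

# first 80% for training, remaining 20% for testing (no shuffling, per the prompt)
split = int(0.8 * len(df))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# CART model, its tree plot, and its test AUC
cart = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
plt.figure(figsize=(20, 10))
plot_tree(cart, max_depth=2, filled=True)   # depth truncated only for readability
cart_auc = roc_auc_score(y_test, cart.predict_proba(X_test)[:, 1])

# random forest test AUC as a function of the number of trees
tree_sizes = [1, 5, 10, 25, 50, 100, 200]   # an arbitrary grid; adjust as needed
rf_aucs = []
for n in tree_sizes:
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_train, y_train)
    rf_aucs.append(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))

plt.figure()
plt.plot(tree_sizes, rf_aucs, marker='o', label='random forest')
plt.axhline(cart_auc, color='r', linestyle='--', label='single CART tree')
plt.xlabel('number of trees')
plt.ylabel('test AUC')
plt.legend()
plt.show()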
2 Nonlinear regression and cross-validation (30 points)

The coefficient of thermal expansion y changes with temperature x. An experiment to relate y to x was done, with temperature measured in degrees Kelvin. (The Kelvin temperature is the Celsius temperature plus 273.15.) The raw data file is copper-new.txt.

[Figure: scatter plot of the coefficient of thermal expansion versus temperature, with temperature roughly 0 to 1000 K on the horizontal axis and the coefficient roughly 0 to 25 on the vertical axis.]

1. (10 points) Perform linear regression on the data. Report the fitted model and the fitting error.

2. (10 points) Perform nonlinear regression with a polynomial regression function up to degree n = 10 and use ridge regression (see the lecture slides for “Bias-Variance Tradeoff”). Write down your formulation and strategy for doing this, i.e., the form of the ridge regression.

3. (5 points) Use 5-fold cross-validation to select the optimal regularization parameter λ. Plot the cross-validation curve and report the optimal λ.

4. (5 points) Predict the coefficient at 400 degrees Kelvin using both models. Comment on how you would compare the accuracy of the predictions.

3 Regression, bias-variance tradeoff (40 points)

Consider a dataset with n data points $(x_i, y_i)$, $x_i \in \mathbb{R}^p$, drawn from the following linear model:
\[ y = x^T \beta^* + \epsilon, \]
where $\epsilon$ is Gaussian noise and the star sign is used to differentiate the true parameter from the estimators that will be introduced later. Consider the regularized linear regression as follows:
\[ \hat{\beta}(\lambda) = \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda \|\beta\|_2^2 \Big\}, \]
where $\lambda \ge 0$ is the regularization parameter. Let $X \in \mathbb{R}^{n \times p}$ denote the matrix obtained by stacking $x_i^T$ in each row.

1. (10 points) Find the closed-form solution for $\hat{\beta}(\lambda)$ and its distribution.

2. (10 points) Calculate the bias $E[x^T \hat{\beta}(\lambda)] - x^T \beta^*$ as a function of $\lambda$ and some fixed test point $x$.

3. (10 points) Calculate the variance term $E\big[\big(x^T \hat{\beta}(\lambda) - E[x^T \hat{\beta}(\lambda)]\big)^2\big]$.

4. (10 points) Use the results from parts (b) and (c) and the bias-variance decomposition to analyze the impact of $\lambda$ on the squared error. Specifically, which term dominates when $\lambda$ is small, and large, respectively?

(Hint: Properties of an affine transformation of a Gaussian random variable will be useful throughout this problem.)
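For Problem 2, the following is a rough sketch of degree-10 polynomial ridge regression with 5-fold cross-validation over λ, using standard scikit-learn pieces. The layout of copper-new.txt (two whitespace-separated columns, coefficient first and temperature second), the λ grid, and the scaling step are all assumptions to be checked against the actual file and your own formulation.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score

# column order is an assumption: coefficient of thermal expansion y, then temperature x
data = pd.read_csv('copper-new.txt', sep=r'\s+', header=None)
y = data.iloc[:, 0].values
x = data.iloc[:, 1].values.reshape(-1, 1)

# Part 1: plain linear regression and its mean squared fitting error
lin = LinearRegression().fit(x, y)
print('linear fit MSE:', np.mean((lin.predict(x) - y) ** 2))

# Parts 2-3: degree-10 polynomial features with a ridge penalty;
# lambda chosen by 5-fold cross-validation on mean squared error
lambdas = np.logspace(-6, 2, 30)   # arbitrary grid; widen or refine as needed
cv_mse = []
for lam in lambdas:
    model = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                          StandardScaler(),
                          Ridge(alpha=lam))
    scores = cross_val_score(model, x, y,
                             cv=KFold(n_splits=5, shuffle=True, random_state=0),
                             scoring='neg_mean_squared_error')
    cv_mse.append(-scores.mean())

best_lam = lambdas[int(np.argmin(cv_mse))]
plt.semilogx(lambdas, cv_mse, marker='.')
plt.xlabel('lambda')
plt.ylabel('5-fold CV MSE')
plt.show()
print('selected lambda:', best_lam)

# Part 4: predict the coefficient at 400 K with both models
ridge_fit = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                          StandardScaler(),
                          Ridge(alpha=best_lam)).fit(x, y)
print('linear prediction at 400 K:', lin.predict([[400.0]])[0])
print('ridge prediction at 400 K:', ridge_fit.predict([[400.0]])[0])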
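For Problem 3, the LaTeX below records only the standard ridge-regression algebra implied by the objective above, assuming $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$ and treating $X$ and the test point $x$ as fixed. It is an outline of the calculation, not the graded derivation, so the constants (in particular the $n\lambda$ that comes from the $1/n$ scaling) should be re-derived rather than taken on faith.

% Sketch only: assumes \epsilon \sim \mathcal{N}(0, \sigma^2 I_n); X and x are fixed.
\[
\hat{\beta}(\lambda)
  = \arg\min_{\beta}\Big\{\tfrac{1}{n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2\Big\}
  = (X^T X + n\lambda I)^{-1} X^T y .
\]
% Substituting y = X\beta^* + \epsilon, \hat{\beta}(\lambda) is an affine map of a Gaussian:
\[
\hat{\beta}(\lambda) \sim \mathcal{N}\!\Big( (X^T X + n\lambda I)^{-1} X^T X\, \beta^*,\;
  \sigma^2 (X^T X + n\lambda I)^{-1} X^T X\, (X^T X + n\lambda I)^{-1} \Big).
\]
% Bias and variance at the fixed test point x then read off directly:
\[
E[x^T \hat{\beta}(\lambda)] - x^T \beta^*
  = x^T \big[(X^T X + n\lambda I)^{-1} X^T X - I\big]\beta^*,
\]
\[
E\big[(x^T \hat{\beta}(\lambda) - E[x^T \hat{\beta}(\lambda)])^2\big]
  = \sigma^2\, x^T (X^T X + n\lambda I)^{-1} X^T X\, (X^T X + n\lambda I)^{-1} x .
\]
% As \lambda \to 0 the bias term vanishes (when X^T X is invertible) and the variance dominates;
% as \lambda grows the variance shrinks toward zero and the squared bias dominates.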
Ximi answered on Apr 13 2021
# Solution notebook (cells reproduced as a Python script; the notebook's printed output is shown in comments)

# In[1]:
import pandas as pd

# In[2]:
# Question 1.1: load the Spambase data (57 feature columns plus the class label)
# and fill any missing values with zero
df = pd.read_csv('spambase/spambase.data', names=list(range(57)) + ['class'])
df.head()
df = df.fillna(0)

# In[3]:
print("total count", df.shape[0])
print("spam count", df[df['class'] == 1].shape[0])
print("regular count", df[df['class'] == 0].shape[0])
# Output:
#   total count 4601
#   spam count 1813
#   regular count 2788

# In[4]:
from sklearn.model_selection import train_test_split, cross_val_score, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier

# In[5]:
# Question 1.2: fit a classification tree (CART) on the data
X = df.iloc[:, :-1]   # all 57 feature columns
y = df.iloc[:, -1]
clf = DecisionTreeClassifier()
clf.fit(X, y)

# In[6]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# The next cell rendered the tree plot as an embedded PNG; that image and the remaining
# cells are truncated in this preview, and the full write-up is in the attached SOLUTION.PDF.