python


ICT112 Week 4 Lab
ICT707 Big Data Assignment

Big Data Assignment Marking Criteria

The Big Data Assignment comprises two parts:

· The first part is to create the algorithms in the tasks, namely Decision Tree, Gradient Boosted Tree and Linear regression, and then to apply them to the bike sharing dataset provided. Try to produce the output given in the task sections (also given in the Big-Data Assignment.docx provided on Blackboard).
· The second part is to use the algorithms created in the first part and apply them to another dataset chosen from Kaggle (other than the bike sharing dataset provided).

Rubric

Datasets: (a) bike sharing [provided], (b) Student selected dataset [from Kaggle.com]

Item                                     (a)   (b)
Decision Tree
  Decision Tree                            5     5
  Decision Tree Categorical features       5     5
  Decision Tree Log                        5     5
  Decision Tree Max Bins                   5     5
  Decision Tree Max Depth                  5     5
Gradient Boosted Tree
  Gradient Boosted Tree                    5     5
  Gradient boost tree iterations           5     5
  Gradient boost tree Max Bins             5     5
Linear regression
  Linear regression                        5     5
  Linear regression Cross Validation
    Intercept                              5     5
    Iterations                             5     5
    Step size                              5     5
    L1 Regularization                      5     5
    L2 Regularization                      5     5
  Linear regression Log                    5     5
Subtotal                                  75    75
Total mark                               150

What needs to be submitted for marking:

For the Decision tree section, a .py or .ipynb file for each of the following:
· Decision Tree
· Decision Tree Categorical features
· Decision Tree Log
· Decision Tree Max Bins
· Decision Tree Max Depth

For the Gradient boost tree section, a .py or .ipynb file for each of the following:
· Gradient boost tree
· Gradient boost tree iterations
· Gradient boost tree Max Bins

For the Linear regression section, a .py or .ipynb file for each of the following:
· Linear regression
· Linear regression Cross Validation
· Intercept
· Iterations
· Step size
· L1 Regularization
· L2 Regularization
· Linear regression Log

Each of the files submitted will be tested with the following datasets:
· bike sharing [which is provided on Blackboard]
· A dataset of the student's choice downloaded from Kaggle.com

[Hint] Write each algorithm so that it can take in a dataset name. For example:

raw_data = sc.textFile("/home/spark/data/hour.csv")

In this manner both datasets can be run with the same files.

Assignment

1. Utilising Python 3, build the following regression models:
   · Decision Tree
   · Gradient Boosted Tree
   · Linear regression
2. Select a dataset (other than the example dataset given in section 3) and apply the Decision Tree and Linear regression models created above. Choose a dataset from Kaggle https://www.kaggle.com/datasets
3. Build the following in relation to the gradient boost tree and the dataset chosen in step 2:
   a) Gradient boost tree iterations (see Big-Data Assignment.docx section 6.1)
   b) Gradient boost tree Max Bins (see Big-Data Assignment.docx section 7.2)
4. Build the following in relation to the decision tree and the dataset chosen in step 2:
   a) Decision Tree Categorical features
   b) Decision Tree Log (see Big-Data Assignment.docx section 5.4)
   c) Decision Tree Max Bins (see Big-Data Assignment.docx section 7.2)
   d) Decision Tree Max Depth (see Big-Data Assignment.docx section 7.1)
5. Build the following in relation to the linear regression and the dataset chosen in step 2:
   a) Linear regression Cross Validation
      i. Intercept (see Big-Data Assignment.docx section 6.5)
      ii. Iterations (see Big-Data Assignment.docx section 6.1)
      iii. Step size (see Big-Data Assignment.docx section 6.2)
      iv. L1 Regularization (see Big-Data Assignment.docx section 6.4)
      v. L2 Regularization (see Big-Data Assignment.docx section 6.3)
   b) Linear regression Log (see Big-Data Assignment.docx section 5.4)
6. Follow the provided example of the Bike sharing data set and the guidelines in the sections that follow this section to develop the requirements given in steps 1, 3, 4 and 5.

Task 1

Task 1 comprises developing:

1. Decision Tree
   a) Decision Tree Categorical features
   b) Decision Tree Log (see Big-Data Assignment.docx section 5.4)
   c) Decision Tree Max Bins (see Big-Data Assignment.docx section 7.2)
   d) Decision Tree Max Depth (see Big-Data Assignment.docx section 7.1)

The output for this task and all the sub tasks is based on the Bike sharing data set as input. Utilise the Bike sharing data set as input to test that the Decision Tree task and sub tasks (i.e. steps 1 and 4 from the assignment) are working and producing the correct output before applying them to your selected data set.

Decision Tree

Output 1:
Feature vector length for categorical features: 57
Feature vector length for numerical features: 4
Total feature vector length: 61
Decision Tree feature vector: [1.0,0.0,1.0,0.0,0.0,6.0,0.0,1.0,0.24,0.2879,0.81,0.0]
Decision Tree feature vector length: 12
Decision Tree predictions: [(16.0, 54.913223140495866), (40.0, 54.913223140495866), (32.0, 53.171052631578945), (13.0, 14.284023668639053), (1.0, 14.284023668639053)]
Decision Tree depth: 5
Decision Tree number of nodes: 63
Decision Tree - Mean Squared Error: 11611.4860
Decision Tree - Mean Absolute Error: 71.1502
Decision Tree - Root Mean Squared Log Error: 0.6251

Output 2:
Decision Tree feature vector: [1.0,0.0,1.0,0.0,0.0,6.0,0.0,1.0,0.24,0.2879,0.81,0.0]
Decision Tree feature vector length: 12
Decision Tree predictions: [(16.0, 54.913223140495866), (40.0, 54.913223140495866), (32.0, 53.171052631578945), (13.0, 14.284023668639053), (1.0, 14.284023668639053)]
Decision Tree depth: 5
Decision Tree number of nodes: 63
Decision Tree - Mean Squared Error: 11611.4860
Decision Tree - Mean Absolute Error: 71.1502
Decision Tree - Root Mean Squared Log Error: 0.6251

Categorical features Output:
Mapping of first categorical feature column: {'1': 0, '4': 1, '2': 2, '3': 3}
Categorical feature size mapping {0: 5, 1: 3, 2: 13, 3: 25, 4: 3, 5: 8, 6: 3, 7: 5}
Decision Tree Categorical Features - Mean Squared Error: 7912.5642
Decision Tree Categorical Features - Mean Absolute Error: 59.4409
Decision Tree Categorical Features - Root Mean Squared Log Error: 0.6192

Decision Tree Log Output:
Decision Tree Log - Mean Squared Error: 14781.5760
Decision Tree Log - Mean Absolute Error: 76.4131
Decision Tree Log - Root Mean Squared Log Error: 0.6406

Decision Tree Max Bins Output:

Decision Tree Max Depth Output:

Task 2

Task 2 comprises developing:

1. Gradient boost tree
   a) Gradient boost tree iterations (see Big-Data Assignment.docx section 6.1)
   b) Gradient boost tree Max Bins (see Big-Data Assignment.docx section 7.2)
   c) Gradient boost tree Max Depth (see Big-Data Assignment.docx section 7.1)

Gradient Boosted Tree Output:
GradientBoosted Trees predictions: [(16.0, 103.33972087713495), (40.0, 103.33972087713495), (32.0, 103.33972087713495), (13.0, 103.33972087713495), (1.0, 103.33972087713495)]
Gradient Boosted Trees - Mean Squared Error = 325939579.98366314
Gradient Boosted Trees - Mean Absolute Error = 1845603.969
Gradient Boosted Trees - Mean Root Mean Squared Log Error = 32155.5757154

Gradient boost tree iterations Output:

Gradient boost tree Max Bins Output:

Task 3

Task 3 comprises developing:

1. Linear regression model
   a) Linear regression Cross Validation
      i. Intercept (see Big-Data Assignment.docx section 6.5)
      ii. Iterations (see Big-Data Assignment.docx section 6.1)
      iii. Step size (see Big-Data Assignment.docx section 6.2)
      iv. L1 Regularization (see Big-Data Assignment.docx section 6.4)
      v. L2 Regularization (see Big-Data Assignment.docx section 6.3)
   b) Linear regression Log (see Big-Data Assignment.docx section 5.4)

Linear regression model Output:
Mapping of first categorical feature column: {'1': 0, '4': 1, '2': 2, '3': 3}
Feature vector length for categorical features: 57
Feature vector length for numerical features: 4
Total feature vector length: 61
Linear Model feature vector: [1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.24,0.2879,0.81,0.0]
Linear Model feature vector length: 61
Gradient Boosted Trees - Mean Root Mean Squared Log Error = 32155.5757154

Output 2:
Linear Model predictions: [(16.0, 53.183375554478182), (40.0, 52.572149013454187), (32.0, 52.517786871472346), (13.0, 52.312352839640027), (1.0, 52.285323002218234)]
Linear Regression - Mean Squared Error: 46565.6666
Linear Regression - Mean Absolute Error: 148.3472
Linear Regression - Root Mean Squared Log Error: 1.4284

Linear regression Cross Validation Output:
Training data size: 13869
Test data size: 3510
Total data size: 17379
Train + Test size : 17379

Intercept Output:

Iterations Output:

Step size Output:

L1 Regularization Output:

L2 Regularization Output:

Linear regression Log Output:
Linear Regression Log - Mean Squared Error: 50685.5559
Linear Regression Log - Mean Absolute Error: 155.2955
Linear Regression Log - Root Mean Squared Log Error: 1.5411

Regression Models

Regression models are concerned with target variables that can take any real value. The underlying principle is to find a model that maps input features to predicted target variables. Regression is also a form of supervised learning. Regression models can be used to predict just about any variable of interest. A few examples include the following:

· Predicting stock returns and other economic variables
· Predicting loss amounts for loan defaults (this can be combined with a classification model that predicts the probability of default, while the regression model predicts the amount in the case of a default)
· Recommendations (the Alternating Least Squares factorization model from Chapter 5, Building a Recommendation Engine with Spark, uses linear regression in each iteration)
· Predicting customer lifetime value (CLTV) in a retail, mobile, or other business, based on user behavior and spending patterns

In the different sections of this chapter, we will do the following:

· Introduce the various types of regression models available in ML
· Explore feature extraction and target variable transformation for regression models
· Train a number of regression models using ML
· Building a Regression Model with Spark
· See how to make predictions using the trained model
· Investigate the impact on performance of various parameter settings for regression using cross-validation

Types of regression models

The core idea of linear models (or generalized linear models) is that we model the predicted outcome of interest (often called the target or dependent variable) as a function of a simple linear predictor applied to the input variables (also referred to as
Answered Same Day | Sep 22, 2020 | ICT707 | University of the Sunshine Coast

Akash answered on Oct 08 2020
assign2/.ipynb_checkpoints/bike-checkpoint.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = \"/Users/priya/Desktop/Bike-Sharing-Dataset/bike.csv\"\n",
"data_df = sc.textFile(path)\n",
"data_count= data_df.count()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['1', '2011-01-01', '1', '0', '1', '0', '0', '6', '0', '1', '0.24', '0.2879', '0.81', '0', '3', '13', '16']\n"
]
}
],
"source": [
"data_rec = data_df.map(lambda x: x.split(\",\"))\n",
"first = data_rec.first()\n",
"print (first)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"17379\n"
]
}
],
"source": [
"print (data_count)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have 17,379 hourly records. The header row (column names) was already removed with the following Unix command:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`sed 1d hour.csv > new_hour.csv`\n",
"\n",
"We will ignore the record ID and raw date columns. \n",
"We will also ignore the casual and registered count target variables and focus on the \n",
"overall count variable, cnt (which is the sum of the other two counts)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The command below caches the parsed records, since we will use them several times."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PythonRDD[6] at RDD at PythonRDD.scala:48"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_rec.cache()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we convert each categorical variable into binary vector form. \n",
"Let's define a function that will extract the value-to-index mapping from our dataset for a given column:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def get_mapping(rdd, idx):\n",
" return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function above first maps each record to its value in the given column, keeps only the distinct values, \n",
"and then uses the zipWithIndex transformation to build a key-value RDD \n",
"in which the key is the category value and the value is its index; collectAsMap returns the result as a Python dict.\n",
"We can test the function on the third column (index 2), \n",
"passing data_rec as the RDD and 2 as the column index:\n"
]
},
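To make the mapping step concrete without a Spark cluster, here is a minimal plain-Python sketch of the same idea. The helper name `get_mapping_local` and the toy rows are assumptions made up for illustration; they are not part of the notebook. One difference worth noting: this sketch numbers values in first-seen order, whereas the order produced by `distinct().zipWithIndex()` on an RDD depends on partitioning (which is why the notebook output maps '4' to index 1).

```python
def get_mapping_local(rows, idx):
    """Hypothetical local stand-in for the RDD-based get_mapping.

    Assigns each distinct value in column `idx` a consecutive integer index.
    """
    mapping = {}
    for fields in rows:
        value = fields[idx]
        if value not in mapping:
            # The next free index is simply the number of values seen so far
            mapping[value] = len(mapping)
    return mapping


# Toy records: (record id, date, season) as strings, like the split CSV rows
rows = [
    ["1", "2011-01-01", "1"],
    ["2", "2011-04-01", "2"],
    ["3", "2011-04-02", "2"],
    ["4", "2011-07-01", "3"],
]

season_mapping = get_mapping_local(rows, 2)
print(season_mapping)
```

The length of each mapping is the number of slots that column needs in the one-hot encoded feature vector.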
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mapping of first categorical feature column: {'1': 0, '4': 1, '2': 2, '3': 3}\n"
]
}
],
"source": [
"print(\"Mapping of first categorical feature column: %s\" % get_mapping(data_rec, 2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now apply the function to each categorical column \n",
"(column indices 2 to 9):"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"mappings = [get_mapping(data_rec, i) for i in range(2,10)]\n",
"catagorical_len = sum(map(len, mappings))\n",
"# The numerical features are columns 10 to 13, the same slice that extract_feat uses\n",
"num_len = len(data_rec.first()[10:14])\n",
"total_length = num_len + catagorical_len"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have the mappings for each variable, \n",
"and we can see how many values in total we need \n",
"for our binary vector representation:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Feature vector length for categorical features: 57\n",
"Feature vector length for numerical features: 4\n",
"Total feature vector length: 61\n"
]
}
],
"source": [
"print (\"Feature vector length for categorical features: %d\" % catagorical_len)\n",
"print (\"Feature vector length for numerical features: %d\" % num_len)\n",
"print (\"Total feature vector length: %d\" % total_length)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating the feature vector for the linear model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We again use the mappings to convert the categorical values to binary-encoded features.\n",
"We import numpy for linear algebra utilities and the MLlib LabeledPoint class to wrap our feature vectors and target variables:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.mllib.regression import LabeledPoint\n",
"import numpy as np"
]
},
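{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"def extract_feat(record):\n",
"    # One-hot encode the categorical columns into a single binary vector\n",
"    catagorical_vector = np.zeros(catagorical_len)\n",
"    j = 0\n",
"    steps = 0\n",
"    # NOTE: record[2:9] covers 7 of the 8 mapped columns (index 9 is left out).\n",
"    # record[2:10] would encode all 8, but the outputs shown below were produced with [2:9].\n",
"    for fields in record[2:9]:\n",
"        mapp = mappings[j]\n",
"        idx = mapp[fields]\n",
"        catagorical_vector[idx + steps] = 1\n",
"        j = j + 1\n",
"        steps = steps + len(mapp)\n",
"    # Convert the numerical columns to floats and append them\n",
"    number_vector = np.array([float(field) for field in record[10:14]])\n",
"    return np.concatenate((catagorical_vector, number_vector))"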
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def ex_label(record):\n",
" return float(record[-1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the extract_feat function, we loop over the categorical columns in the row of data. \n",
"For each one, we look up the index of its value in the corresponding mapping created previously \n",
"and set that position in the binary vector to 1. \n",
"The steps variable is a running offset that ensures that the nonzero feature index in the full feature vector is correct \n",
"(and is somewhat more efficient than, say, creating many smaller binary vectors and \n",
"concatenating them). The numeric vector is created directly by first converting the data \n",
"to floating point numbers and wrapping these in a numpy array. The resulting two vectors \n",
"are then concatenated. The ex_label function simply converts the last column variable \n",
"(the count) into a float. With our utility functions defined, we can proceed with extracting \n",
"feature vectors and labels from our data records:"
]
},
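The offset logic described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: `one_hot_encode` and `toy_mappings` are hypothetical names and data, and a Python list is used in place of the numpy array so the example has no dependencies.

```python
def one_hot_encode(values, mappings):
    """Encode one categorical value per mapping into a single binary vector.

    Hypothetical helper: `values[i]` is looked up in `mappings[i]`, and the
    running `offset` shifts each column's index into its own block of slots.
    """
    total_length = sum(len(m) for m in mappings)
    vector = [0.0] * total_length
    offset = 0
    for value, mapping in zip(values, mappings):
        vector[offset + mapping[value]] = 1.0
        offset += len(mapping)  # the next column starts after this block
    return vector


# Two toy categorical columns: one with 3 categories, one with 2
toy_mappings = [
    {"1": 0, "2": 1, "3": 2},
    {"0": 0, "1": 1},
]

encoded = one_hot_encode(["2", "1"], toy_mappings)
print(encoded)
```

With 3 + 2 = 5 slots in total, the value "2" sets slot 1 in the first block and the value "1" sets slot 1 of the second block, which sits at position 3 + 1 = 4 in the full vector.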
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"data = data_rec.map(lambda r: LabeledPoint(ex_label(r), extract_feat(r)))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Raw data: ['1', '0', '1', '0', '0', '6', '0', '1', '0.24', '0.2879', '0.81', '0', '3', '13', '16']\n",
"Label: 16.0\n",
"Linear Model feature vector:\n",
"[1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.24,0.2879,0.81,0.0]\n",
"Linear Model feature vector length: 61\n"
]
}
],
"source": [
"first_point = data.first()\n",
"print (\"Raw data: \" + str(first[2:]))\n",
"print (\"Label: \" + str(first_point.label))\n",
"print (\"Linear Model feature vector:\\n\" + str(first_point.features))\n",
"print (\"Linear Model feature vector length: \" + str(len(first_point.features)))"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.mllib.regression import LinearRegressionWithSGD"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n"
]
}
],
"source": [
"linear_model = LinearRegressionWithSGD.train(data, iterations=10,step=0.1, intercept=False)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Linear Model predictions: [(16.0, 117.89250386724846), (40.0, 116.2249612319211), (32.0, 116.02369145779235), (13.0, 115.67088016754433), (1.0, 115.56315650834317)]\n"
]
}
],
"source": [
"true_vs_predicted = data.map(lambda p: (p.label, linear_model.predict(p.features)))\n",
"print (\"Linear Model predictions: \" + str(true_vs_predicted.take(5)))"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Linear Model - Mean Squared Error: 30679.4539\n"
]
}
],
"source": [
"li=[]\n",
"for i in true_vs_predicted.collect():\n",
" true,pred=i[0],i[1]\n",
" val=(pred - true)**2\n",
" li.append(val)\n",
"lenth=len(li)\n",
"su=sum(li)\n",
"mean=su/lenth\n",
"print (\"Linear Model - Mean Squared Error: %2.4f\" % mean)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"targets = data_rec.map(lambda r: float(r[-1])).collect()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"import pylab"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Populating the interactive namespace from numpy and matplotlib\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/anaconda3/lib/python3.6/site-packages/IPython/core/magics/pylab.py:160: UserWarning: pylab import has clobbered these variables: ['mean', 'pylab']\n",
"`%matplotlib` prevents importing * from pylab and numpy\n",
" \"\\n`%matplotlib` prevents importing * from pylab and numpy\"\n"
]
}
],
"source": [
"%pylab inline"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"hist(targets, bins=45, color='lightblue', density=True)\n",
"\n",
"fig = matplotlib.pyplot.gcf()\n",
"\n",
"fig.set_size_inches(17, 11)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "
VDVXXnoodhLl6Q5FySX57ctv/Oqnruoodirg4l+fVFD/EUsXppaot9/lYfBlVVX5Pk/Ul+rLs/v+h5uHzd/WR3vzDJniS3VJVb969yVfWDST7b3Q8tehbm7sXd/d1JXpnkDZOv3XB125nku5P8u+6+Ocn/S+IZLkticlv3bUl+Y9GzPEWsXpr1JNdPbe9JcmZBswDPYPKdxvcn+bXuPrroeZivyW1nH05yYMGjcPlenOS2yfcbjyT5/qr694sdiXno7jOTf342yW9m4+tUXN3Wk6xP3dXyvmzEK8vhlUk+2t3/e9GDPEWsXpoTSfZV1d7J3zwcSnJswTMBm0wexPOuJJ/s7p9d9DzMR1Xtqqqvn7z+O0luTfLHi52Ky9Xdb+7uPd19Yzb+u/rB7v7hBY/FZaqq504ecJfJbaIvT+Kp+1e57v5fSR6tqv2TXS9N4uGFy+O1GegW4GTjUj4z6u7zVXVXkgeS7EhyuLtPLngsLlNV/XqSf5Tk2qpaT/KW7n7XYqfiMr04yT9J8onJ9xuT5N909/EFzsTl253k3ZMnFX5Fkvd2t19zAmP6u0l+c+PvDrMzyXu6+z8udiTm5F8k+bXJhZvTSf7pgudhDqoF1kOPAAAATElEQVTqq7PxG0/+2aJnmeZX1wAAADActwEDAAAwHLEKAADAcMQqAAAAwxGrAAAADEesAgAAMByxCgAAwHDEKgAAAMMRqwAAAAzn/wPfSjq58IpfOgAAAABJRU5ErkJggg==\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"log_targets = data_rec.map(lambda r: np.log(float(r[-1]))).collect()\n",
"\n",
"hist(log_targets, bins=40, color='lightblue', density=True)\n",
"\n",
"fig = matplotlib.pyplot.gcf()\n",
"\n",
"fig.set_size_inches(16, 10)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"data_log = data.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n"
]
}
],
"source": [
"model_log = LinearRegressionWithSGD.train(data_log, iterations=10, step=0.1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have transformed the target variable, the predictions of the model will be on the log scale,\n",
"as will the target values of the transformed dataset. Therefore, in order to use our model and \n",
"evaluate its performance, we must first transform the log data back into the original scale by \n",
"taking the exponent of both the predicted and true values using the numpy exp function.\n",
"The code below shows how to do this:"
]
},
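The log-transform round trip described above can be sketched without Spark. This is a minimal illustration using only the standard library: the label values are the first five counts that appear in the prediction output, while `log_predictions` is made-up data standing in for the output of a model trained on log-scale labels.

```python
import math

# First five target counts from the dataset (as shown in the prediction output)
counts = [16.0, 40.0, 32.0, 13.0, 1.0]

# Forward transform: train on log(label). This requires every label to be > 0.
log_counts = [math.log(c) for c in counts]

# Made-up log-scale predictions, standing in for model_log.predict(...)
log_predictions = [3.0, 3.5, 3.4, 2.5, 0.5]

# Inverse transform: exponentiate both labels and predictions before
# computing error metrics on the original scale
restored_counts = [math.exp(v) for v in log_counts]
predictions = [math.exp(p) for p in log_predictions]

for actual, predicted in zip(restored_counts, predictions):
    print("actual=%.4f predicted=%.4f" % (actual, predicted))
```

Because the round trip goes through floating-point arithmetic, a restored label can differ from the original in the last digit, which is why the notebook output further down shows 15.999999999999998 rather than 16.0.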
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"true_vs_predicted_log = data_log.map(lambda p: (np.exp(p.label), np.exp(model_log.predict(p.features))))"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"17379\n",
"log - Mean Squared Error: 50685.5559\n",
"log - Mean Absolute Error: 155.2955\n",
"Root Mean Squared Log Error: 1.5411\n"
]
}
],
"source": [
"nn=[]\n",
"ab=[]\n",
"s_log=[]\n",
"for i in true_vs_predicted_log.collect():\n",
"    real,predict=i[0],i[1]\n",
"    value=(predict - real)**2\n",
"    value1=np.abs(predict - real)\n",
"    value2=(np.log(predict + 1) - np.log(real + 1))**2\n",
"    nn.append(value)\n",
"    ab.append(value1)\n",
"    s_log.append(value2)\n",
"value_len=len(nn)\n",
"print( value_len)\n",
"ss=sum(nn)\n",
"t=ss/value_len\n",
"ab_sum=sum(ab)\n",
"ab_mean=ab_sum/value_len\n",
"s_log_sum=sum(s_log)\n",
"s_log_mean=np.sqrt(s_log_sum/value_len)\n",
"print (\"log - Mean Squared Error: %2.4f\" % t)\n",
"print(\"log - Mean Absolute Error: %2.4f\" % ab_mean)\n",
"print(\"Root Mean Squared Log Error: %2.4f\" % s_log_mean)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Non log-transformed predictions:\n",
"[(16.0, 117.89250386724846), (40.0, 116.2249612319211), (32.0, 116.02369145779235)]\n",
"Log-transformed predictions:\n",
"[(15.999999999999998, 28.080291845456212), (40.0, 26.959480191001784), (32.0, 26.65472562945802)]\n"
]
}
],
"source": [
"print (\"Non log-transformed predictions:\\n\" + str(true_vs_predicted.take(3)))\n",
"\n",
"print (\"Log-transformed predictions:\\n\" + str(true_vs_predicted_log.take(3)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tuning model parameters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To evaluate the impact of different parameter settings, we need separate training and test sets. One relatively easy way to create them is to take a random sample of, say, 20 percent of our data as our test set and to use the remaining elements of the original RDD as our training set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Splitting the data into training and test sets for cross-validation"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"train, test = data.randomSplit([0.8, 0.2], seed=12345)"
]
},
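As a rough intuition for why the split sizes reported below are close to, but not exactly, 80 and 20 percent, here is a minimal plain-Python sketch of a seeded random split. It assumes each element is assigned independently at random, which is how this kind of split commonly works; `random_split` is a hypothetical helper written for this illustration, not Spark's implementation, and it will not reproduce Spark's exact counts.

```python
import random


def random_split(items, test_fraction, seed):
    """Hypothetical helper: send each item to the test set with probability
    `test_fraction`, otherwise to the training set."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for item in items:
        if rng.random() < test_fraction:
            test.append(item)
        else:
            train.append(item)
    return train, test


records = list(range(1000))
train_rows, test_rows = random_split(records, 0.2, seed=12345)

print("Training data size: %d" % len(train_rows))
print("Test data size: %d" % len(test_rows))
print("Train + Test size : %d" % (len(train_rows) + len(test_rows)))
```

The two parts are disjoint and always add up to the original size, but the individual sizes vary with the seed, so a different seed or implementation can produce different training and test counts.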
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"train_size=train.count()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"test_size=test.count()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training data size: 13834\n"
]
}
],
"source": [
"print (\"Training data size: %d\" % train_size)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Test data size: 3545\n"
]
}
],
"source": [
"print (\"Test data size: %d\" % test_size)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train + Test size : 17379\n"
]
}
],
"source": [
"print (\"Train + Test size : %d\" % (train_size + test_size))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can confirm that we now have two distinct datasets that add up to the original dataset in total:\n",
"\n",
"Training data size: 13834\n",
"\n",
"Test data size: 3545\n",
"\n",
"Total data size: 17379\n",
"\n",
"Train + Test size : 17379\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The impact of parameter settings for linear models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have prepared our training and test sets, we are ready to investigate the impact of different parameter settings on model performance. We will first carry out this evaluation for the linear model. We will create a convenience function to evaluate the relevant performance metric by training the model on the training set and evaluating it on the test set for different parameter settings.\n",
"\n",
"We will use the RMSLE evaluation metric, as it is the one used in the Kaggle competition with this dataset, and this allows us to compare our model results against the competition leaderboard to see how we perform.\n",
"\n",
"The evaluation function is defined here:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"def squared_log_error(pred, actual):\n",
" return (np.log(pred + 1) - np.log(actual + 1))**2"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"def evaluate(train, test, iterations, step, regParam, regType, intercept):\n",
"    # Train on the training set with the given parameter settings\n",
"    model = LinearRegressionWithSGD.train(train, iterations, step, regParam=regParam, regType=regType, intercept=intercept)\n",
"\n",
"    # Pair each true test label with the model's prediction\n",
"    tp = test.map(lambda p: (p.label, model.predict(p.features)))\n",
"\n",
"    # Squared log error of every (actual, predicted) pair, collected on the driver\n",
"    new_val = []\n",
"    for i in tp.collect():\n",
"        actual = i[0]\n",
"        pred = i[1]\n",
"        va = (np.log(pred + 1) - np.log(actual + 1))**2\n",
"        new_val.append(va)\n",
"    length = len(new_val)\n",
"    s_new_val = sum(new_val)\n",
"    mean_new_val = s_new_val/length\n",
"    # RMSLE is the square root of the mean squared log error\n",
"    rmsle = np.sqrt(mean_new_val)\n",
"    return rmsle"
]
},
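The RMSLE metric used by the evaluation function is the square root of the mean of (log(pred + 1) - log(actual + 1))^2 over all test points. The following standalone sketch computes it with the standard library only; the function name `rmsle` and the sample pairs are assumptions chosen for illustration.

```python
import math


def rmsle(pairs):
    """Root mean squared log error for (actual, predicted) pairs.

    Hypothetical standalone helper. Unlike numpy's log, math.log raises
    ValueError when pred + 1 is not positive instead of returning nan.
    """
    squared_log_errors = [
        (math.log(pred + 1) - math.log(actual + 1)) ** 2
        for actual, pred in pairs
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Perfect predictions give an error of exactly zero
perfect = rmsle([(16.0, 16.0), (40.0, 40.0)])

# The error depends on the ratio (pred + 1) / (actual + 1), not the absolute gap:
# both pairs below are off by a factor of 2 after adding 1
small_scale = rmsle([(1.0, 3.0)])      # log(4) - log(2)
large_scale = rmsle([(99.0, 199.0)])   # log(200) - log(100)

print(perfect, small_scale, large_scale)
```

Because the metric works on ratios, being off by 100 rentals in a busy hour is penalized about as much as being off by 2 in a quiet one, which suits count data that spans several orders of magnitude.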
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Iterations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we saw when evaluating our classification models, we generally expect that a model trained with SGD will achieve better performance as the number of iterations increases, although the increase in performance will slow down as the number of iterations goes above some minimum number. Note that here, we will set the step size to 0.01 to better illustrate the impact at higher iteration numbers:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1, 5, 10, 20, 50, 100]\n",
"[2.9204455616016656, 2.0695085222669265, 1.79815897170536, 1.594156705081269, 1.43308397524522, 1.3878383528812235]\n"
]
}
],
"source": [
"params = [1, 5, 10, 20, 50, 100]\n",
"\n",
"metrics = [evaluate(train, test, param, 0.01, 0.0, 'l2', False) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we will use the matplotlib library to plot a graph of the RMSLE metric against the number of iterations. We will use a log scale for the x axis to make the output easier to visualize:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot(params, metrics)\n",
"\n",
"fig = matplotlib.pyplot.gcf()\n",
"pyplot.xlabel('Metrics for varying number of iterations')\n",
"pyplot.xscale('log')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step size"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will perform a similar analysis for step size in the following code:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"params = [0.01, 0.025, 0.05, 0.1, 1.0]"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n",
"/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:11: RuntimeWarning: invalid value encountered in log\n",
" # This is added back by InteractiveShellApp.init_path()\n"
]
}
],
"source": [
"metrics = [evaluate(train, test, 10, param, 0.0, 'l2', False) for param in params]"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.01, 0.025, 0.05, 0.1, 1.0]\n",
"[1.79815897170536, 1.432660677663247, 1.3921046531899715, 1.463373357714063, nan]\n"
]
}
],
"source": [
"print (params)\n",
"print (metrics)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"why we avoided using the default step size when training the linear model originally. The default is set to 1.0, which, in this case, results in a nan output for the RMSLE metric. This typically means that the SGD model has converged to a very poor local minimum in the error function that it is optimizing. This can happen when the step size is relatively large, as it is easier for the optimization algorithm to overshoot good solutions.\n",
"\n",
"We can also see that for low step sizes and a relatively low number of iterations (we used 10 here), the model performance is slightly poorer. However, in the preceding Iterations section, we saw that for the lower step-size setting, a higher number of iterations will generally converge to a better solution"
]
},
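The overshooting behaviour described above can be reproduced on a toy problem. The following is a minimal sketch, not Spark's SGD implementation: it runs plain gradient descent on a single made-up data point (`x = 3`, `y = 6`, so the ideal weight is 2), and the function name `gradient_descent` is hypothetical.

```python
def gradient_descent(step_size, num_iterations=10, x=3.0, y=6.0):
    """Fit y ~ w * x by gradient descent on the squared error of one toy point."""
    w = 0.0
    for _ in range(num_iterations):
        gradient = 2.0 * (w * x - y) * x  # derivative of (w*x - y)**2 w.r.t. w
        w -= step_size * gradient
    return w


# Small steps approach w = 2 slowly, a moderate step gets there quickly,
# and a step of 1.0 overshoots further on every iteration.
for step in [0.01, 0.05, 0.1, 1.0]:
    print(step, gradient_descent(step))
```

With a step size of 1.0 the weight grows without bound instead of settling, which is the same failure mode that leads to the nan RMSLE above once the predictions are passed through a logarithm.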
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting the best parameter settings can be an intensive process that involves training a model on many combinations of parameter settings and selecting the best outcome. Each instance of model training involves a number of iterations, so this process can be very expensive and time consuming when performed on very large datasets.\n",
"\n",
"The output is plotted here, again using a log scale for the step-size axis:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot(params, metrics)\n",
"\n",
"fig = matplotlib.pyplot.gcf()\n",
"pyplot.xlabel('Metrics for varying values of step size')\n",
"pyplot.xscale('log')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# L2 regularization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"egularization has the effect of penalizing model complexity in the form of an additional loss term that is a function of the model weight vector. L2 regularization penalizes the L2-norm of the weight vector, while L1 regularization penalizes the L1-norm."
]
},
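To make the two penalty terms concrete, here is a small sketch in plain Python. The helper names `l1_penalty` and `l2_penalty` are hypothetical, and the factor of 1/2 in the L2 term is one common convention that is assumed here rather than taken from the notebook.

```python
def l1_penalty(weights, reg_param):
    """reg_param times the L1-norm (sum of absolute values) of the weights."""
    return reg_param * sum(abs(w) for w in weights)


def l2_penalty(weights, reg_param):
    """reg_param times half the squared L2-norm of the weights (assumed convention)."""
    return 0.5 * reg_param * sum(w * w for w in weights)


weights = [3.0, -4.0]
print("L1 penalty:", l1_penalty(weights, 1.0))
print("L2 penalty:", l2_penalty(weights, 1.0))
```

Either term is added to the training loss, so a larger `reg_param` pushes the optimizer towards smaller weights at the cost of fitting the training data less closely.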
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.0, 0.01, 0.1, 1.0, 5.0, 10.0, 20.0]\n",
"[1.463373357714063, 1.4627638795194882, 1.457389998406437, 1.414347928269498, 1.4006915016046428, 1.5458042588519074, 1.8520326400407603]\n"
]
}
],
"source": [
"params = [0.0, 0.01, 0.1, 1.0, 5.0, 10.0, 20.0]\n",
"\n",
"metrics = [evaluate(train, test, 10, 0.1, param, 'l2', False) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot(params, metrics)\n",
"\n",
"fig = matplotlib.pyplot.gcf()\n",
"pyplot.xlabel('Metrics for varying levels of L2 regularization')\n",
"pyplot.xscale('log')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# L1 regularization"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.0, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]\n",
"[1.463373357714063, 1.4633409680931317, 1.4630506454349392, 1.4603658739928238, 1.4355688529629576, 1.7677660966171576, 4.800777158151935]\n"
]
}
],
"source": [
"params = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]\n",
"\n",
"metrics = [evaluate(train, test, 10, 0.1, param, 'l1', False) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot(params, metrics)\n",
"\n",
"fig = matplotlib.pyplot.gcf()\n",
"pyplot.xlabel('Metrics for varying levels of L1 regularization')\n",
"pyplot.xscale('log')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using L1 regularization can encourage sparse weight vectors. Does this hold true in this case? We can find out by examining the number of entries in the weight vector that are zero, with increasing levels of regularization:"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"L1 (1.0) number of zero weights: 4\n",
"L1 (10.0) number of zeros weights: 33\n",
"L1 (100.0) number of zeros weights: 58\n"
]
}
],
"source": [
"model_l1 = LinearRegressionWithSGD.train(train, 10, 0.1, regParam=1.0, regType='l1', intercept=False)\n",
"\n",
"model_l1_10 = LinearRegressionWithSGD.train(train, 10, 0.1, regParam=10.0, regType='l1', intercept=False)\n",
"\n",
"model_l1_100 = LinearRegressionWithSGD.train(train, 10, 0.1, regParam=100.0, regType='l1', intercept=False)\n",
"\n",
"print (\"L1 (1.0) number of zero weights: \" + str(sum(model_l1.weights.array == 0)))\n",
"\n",
"print (\"L1 (10.0) number of zeros weights: \" + str(sum(model_l1_10.weights.array == 0)))\n",
"\n",
"print (\"L1 (100.0) number of zeros weights: \" + str(sum(model_l1_100.weights.array == 0)))"
]
},
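One way to see why an L1 penalty produces exact zeros is the soft-thresholding operation, which shrinks every weight towards zero and clips small ones to exactly zero. The sketch below uses made-up weights and hypothetical helper names; it illustrates the mechanism only and is not a reproduction of the model trained above.

```python
def soft_threshold(w, threshold):
    """Shrink w towards zero by threshold, returning exactly 0.0 for small weights."""
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0


def count_zero_weights(weights, threshold):
    """Number of weights that become exactly zero after soft-thresholding."""
    return sum(1 for w in weights if soft_threshold(w, threshold) == 0.0)


weights = [0.8, -0.05, 0.3, 0.02, -1.5]  # illustrative values only
for threshold in [0.01, 0.1, 1.0]:
    print(threshold, count_zero_weights(weights, threshold))
```

As the threshold grows, more weights are clipped to zero, mirroring the increase in zero weights reported for larger `regParam` values.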
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Intercept"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final parameter option for the linear model is whether to use an intercept or not. An intercept is a constant term that is added to the weight vector and effectively accounts for the mean value of the target variable. If the data is already centered or normalized, an intercept is not necessary; however, it often does not hurt to use one in any case."
]
},
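The role of the intercept can be illustrated with a closed-form least-squares fit on a tiny made-up dataset whose target is not centered (`y = x + 5`). This is a sketch with hypothetical function names, independent of the Spark model used in the notebook.

```python
def fit_without_intercept(xs, ys):
    """Least-squares slope for the model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_with_intercept(xs, ys):
    """Least-squares slope and intercept for the model y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


xs = [0.0, 1.0, 2.0, 3.0]
ys = [5.0, 6.0, 7.0, 8.0]  # y = x + 5, so the target mean is far from zero
print("no intercept:", fit_without_intercept(xs, ys))
print("with intercept:", fit_with_intercept(xs, ys))
```

Without an intercept the slope has to absorb the offset of the target, so it is pulled away from the true value of 1; with an intercept the constant term accounts for the offset instead.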
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[False, True]\n",
"[1.414347928269498, 1.4431958566566532]\n"
]
}
],
"source": [
"params = [False, True]\n",
"\n",
"metrics = [evaluate(train, test, 10, 0.1, 1.0, 'l2', param) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"bar(params, metrics, color='lightblue')\n",
"pyplot.xlabel('Metrics without and with an intercept')\n",
"fig = matplotlib.pyplot.gcf()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Decision Tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we have seen, decision tree models typically work on raw features (that is, it is not required to convert categorical features into a binary vector encoding; they can, instead, be used directly). Therefore, we will create a separate function to extract the decision tree feature vector, which simply converts all the values to floats and wraps them in a numpy array:"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"def extract_features_dt(record):\n",
" return np.array(list(map(float, record[2:14])))"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"def extract_label(record):\n",
" return float(record[-1])"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"data_dt = data_rec.map(lambda r: LabeledPoint(extract_label(r),extract_features_dt(r)))\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Decision Tree feature vector: [1.0,0.0,1.0,0.0,0.0,6.0,0.0,1.0,0.24,0.2879,0.81,0.0]\n",
"Decision Tree feature vector length: 12\n"
]
}
],
"source": [
"first_point_dt = data_dt.first()\n",
"print (\"Decision Tree feature vector: \" + str(first_point_dt.features))\n",
"print (\"Decision Tree feature vector length: \" + str(len(first_point_dt.features)))"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.mllib.tree import DecisionTree"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Decision Tree predictions: [(16.0, 54.913223140495866), (40.0, 54.913223140495866), (32.0, 53.171052631578945), (13.0, 14.284023668639053), (1.0, 14.284023668639053)]\n",
"Decision Tree depth: 5\n",
"Decision Tree number of nodes: 63\n"
]
}
],
"source": [
"dt_model = DecisionTree.trainRegressor(data_dt,{})\n",
"preds = dt_model.predict(data_dt.map(lambda p: p.features))\n",
"actual = data.map(lambda p: p.label)\n",
"true_vs_predicted_dt = actual.zip(preds)\n",
"print (\"Decision Tree predictions: \" + str(true_vs_predicted_dt.take(5)))\n",
"print (\"Decision Tree depth: \" + str(dt_model.depth()))\n",
"print (\"Decision Tree number of nodes: \" + str(dt_model.numNodes()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use the same approach for the decision tree model, using the true_vs_predicted_dt RDD:"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"17379\n",
"log - Mean Squared Error: 11611.4860\n",
"log - Mean Absolue Error: 71.1502\n",
"Root Mean Squared Log Error: 0.6251\n"
]
}
],
"source": [
"nn=[]\n",
"ab=[]\n",
"s_log=[]\n",
"for i in true_vs_predicted_dt.collect():\n",
" real,predict=i[0],i[1]\n",
" value=(predict - real)**2\n",
" value1=np.abs(predict - real)\n",
" value2=(np.log(predict + 1) - np.log(real + 1))**2\n",
" nn.append(value)\n",
" ab.append(value1)\n",
" s_log.append(value2)\n",
"value_len=len(nn)\n",
"print( value_len)\n",
"ss=sum(nn)\n",
"t=ss/value_len\n",
"ab_sum=sum(ab)\n",
"ab_mean=ab_sum/value_len\n",
"s_log_sum=sum(s_log)\n",
"s_log_mean=np.sqrt(s_log_sum/value_len)\n",
"print (\"log - Mean Squared Error: %2.4f\" % t)\n",
"print(\"log - Mean Absolue Error: %2.4f\" % ab_mean)\n",
"print(\"Root Mean Squared Log Error: %2.4f\" % s_log_mean)"
]
},
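The same three metrics are computed several times in this notebook, so they could be factored into one helper. The following is a sketch using only the standard library; the name `regression_metrics` is hypothetical, and it assumes the input is a list of `(actual, predicted)` pairs with values greater than -1, such as the result of calling `.collect()` on one of the zipped RDDs.

```python
import math


def regression_metrics(pairs):
    """Return (MSE, MAE, RMSLE) for a list of (actual, predicted) pairs."""
    n = len(pairs)
    mse = sum((pred - actual) ** 2 for actual, pred in pairs) / n
    mae = sum(abs(pred - actual) for actual, pred in pairs) / n
    # log1p(v) computes log(v + 1), matching the RMSLE definition used above
    rmsle = math.sqrt(
        sum((math.log1p(pred) - math.log1p(actual)) ** 2 for actual, pred in pairs) / n
    )
    return mse, mae, rmsle


mse, mae, rmsle = regression_metrics([(1.0, 1.0), (3.0, 1.0)])
print("MSE: %.4f  MAE: %.4f  RMSLE: %.4f" % (mse, mae, rmsle))
```

Under that assumption, a call such as `regression_metrics(true_vs_predicted_dt.collect())` would replace the explicit loop.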
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Impact of training on log-transformed targets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will perform the same analysis for the decision tree model:"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"data_dt_log = data_dt.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))\n",
"\n",
"dt_model_log = DecisionTree.trainRegressor(data_dt_log,{})\n",
"\n",
"preds_log = dt_model_log.predict(data_dt_log.map(lambda p: p.features))\n",
"\n",
"actual_log = data_dt_log.map(lambda p: p.label)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"new=actual_log.zip(preds_log)"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(2.772588722239781, 3.6251613906330347),\n",
" (3.6888794541139363, 3.6251613906330347),\n",
" (3.4657359027997265, 1.985090627799027),\n",
" (2.5649493574615367, 1.985090627799027),\n",
" (0.0, 1.985090627799027)]"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new.take(5)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"true_vs_predicted_dt_log=[]\n",
"for val in new.collect():\n",
" t,p=val[0],val[1]\n",
" x=np.exp(t),np.exp(p)\n",
" true_vs_predicted_dt_log.append(x)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"17379\n",
"log - Mean Squared Error: 14781.5760\n",
"log - Mean Absolue Error: 76.4131\n",
"Root Mean Squared Log Error: 0.6406\n",
"Non log-transformed predictions:\n",
"[(16.0, 54.913223140495866), (40.0, 54.913223140495866), (32.0, 53.171052631578945)]\n"
]
}
],
"source": [
"nn=[]\n",
"ab=[]\n",
"s_log=[]\n",
"for i in true_vs_predicted_dt_log:\n",
" real,predict=i[0],i[1]\n",
" value=(predict - real)**2\n",
" value1=np.abs(predict - real)\n",
" value2=(np.log(predict + 1) - np.log(real + 1))**2\n",
" nn.append(value)\n",
" ab.append(value1)\n",
" s_log.append(value2)\n",
"value_len=len(nn)\n",
"print( value_len)\n",
"ss=sum(nn)\n",
"t=ss/value_len\n",
"ab_sum=sum(ab)\n",
"ab_mean=ab_sum/value_len\n",
"s_log_sum=sum(s_log)\n",
"s_log_mean=np.sqrt(s_log_sum/value_len)\n",
"print (\"log - Mean Squared Error: %2.4f\" % t)\n",
"print(\"log - Mean Absolue Error: %2.4f\" % ab_mean)\n",
"print(\"Root Mean Squared Log Error: %2.4f\" % s_log_mean)\n",
"print (\"Non log-transformed predictions:\\n\" + str(true_vs_predicted_dt.take(3)))\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CROSS VALIDATION for the decision tree"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [],
"source": [
"train_dt, test_dt = data_dt.randomSplit([0.8, 0.2], seed=12345)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The impact of parameter settings for the decision tree"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [],
"source": [
"def evaluate_dt(train, test, maxDepth, maxBins):\n",
"\n",
" model = DecisionTree.trainRegressor(train, {}, impurity='variance', maxDepth=maxDepth, maxBins=maxBins)\n",
"\n",
" preds = model.predict(test.map(lambda p: p.features))\n",
"\n",
" actual = test.map(lambda p: p.label)\n",
"\n",
" tp = actual.zip(preds)\n",
" new_val=[]\n",
" for i in tp.collect():\n",
" actual=i[0]\n",
" pred=i[1]\n",
" va=(np.log(pred + 1) - np.log(actual + 1))**2\n",
" new_val.append(va)\n",
" lenth=len(new_val)\n",
" s_new_val=sum(new_val)\n",
" mean_new_val=s_new_val/lenth\n",
" rmsle=np.sqrt(mean_new_val)\n",
" return rmsle\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tree depth"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We would generally expect performance to increase with more complex trees (that is, trees of greater depth). Having a lower tree depth acts as a form of regularization, and it might be the case that as with L2 or L1 regularization in linear models, there is a tree depth that is optimal with respect to the test set performance.\n",
"\n",
"Here, we will try to increase the depths of trees to see what impact they have on test set RMSLE, keeping the number of bins at the default level of 32:"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1, 2, 3, 4, 5, 10, 20]\n",
"[1.0009455704281573, 0.9071380409401831, 0.8083991513814845, 0.7316093046671605, 0.6252775817287765, 0.43025139584509925, 0.4467589576168234]\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"params = [1, 2, 3, 4, 5, 10, 20]\n",
"\n",
"metrics = [evaluate_dt(train_dt, test_dt, param, 32) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)\n",
"\n",
"plot(params, metrics)\n",
"pyplot.xlabel('Metrics for different tree depths')\n",
"fig = matplotlib.pyplot.gcf()"
]
},
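Given the RMSLE values printed above, the best depth can be picked programmatically rather than read off the plot. This small sketch copies the `params` and `metrics` lists from the cell output and selects the depth with the lowest test-set RMSLE.

```python
params = [1, 2, 3, 4, 5, 10, 20]
metrics = [1.0009455704281573, 0.9071380409401831, 0.8083991513814845,
           0.7316093046671605, 0.6252775817287765, 0.43025139584509925,
           0.4467589576168234]

# Pair each RMSLE with its depth; min() compares the RMSLE first
best_rmsle, best_depth = min(zip(metrics, params))
print("Best depth: %d (RMSLE %.4f)" % (best_depth, best_rmsle))
```

The error falls steadily up to a depth of 10 and rises slightly at 20, which is consistent with deeper trees starting to over-fit the training split.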
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Maximum bins"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"we will perform our evaluation on the impact of setting the number of bins for the decision tree. As with the tree depth, a larger number of bins should allow the model to become more complex and might help performance with larger feature dimensions. After a certain point, it is unlikely that it will help any more and might, in fact, hinder performance on the test set due to over-fitting:"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2, 4, 8, 16, 32, 64, 100]\n",
"[1.2692079792473667, 0.8059355903824542, 0.7446332199349833, 0.5969914946964172, 0.6252775817287765, 0.6252775817287765, 0.6252775817287765]\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEKCAYAAADpfBXhAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAHyFJREFUeJzt3Xt8VPWd//HXJ5kk5IICJqASENAgxQuoqWvrjdqtgq1i1aps+9Pd2tLuT9uuv1qLu7+qxd31svbX1q2Xsq61+utqrdUWXZVa62W3amuQi6CCgBcCClFABcI1n/3jnCGTYSYzJDNMzsn7+XjkwcyZ75z5nBzyzjffc873mLsjIiLxUlbqAkREpPAU7iIiMaRwFxGJIYW7iEgMKdxFRGJI4S4iEkMKdxGRGMoZ7mZ2p5mtNbNFWV6famYLzWy+mbWY2QmFL1NERPaE5bqIycxOAjYCd7v74RlerwM2ubub2ZHA/e4+rijViohIXhK5Grj7s2Y2qpvXN6Y8rQXyuuS1vr7eR43KuloREclg7ty577l7Q652OcM9H2b2eeA6YCjw2W7aTQemA4wcOZKWlpZCfLyISL9hZm/l064gB1Td/aFwKOYs4Npu2s1y92Z3b25oyPmLR0REeqigZ8u4+7PAwWZWX8j1iojInul1uJvZIWZm4eOjgUrg/d6uV0REei7nmLuZ3QtMAurNrBW4GqgAcPfbgXOAC81sO9AOnO+aR1hEpKTyOVtmWo7XbwBuKFhFIiLSa7pCVUQkhhTuIiIxFLlwX/LuR9w0ZwnrNm0rdSkiIn1W5ML9jfc28pOnlrHmwy2lLkVEpM+KXLhXVwbHgDdv21niSkRE+q7IhXttZTkAm7ftKHElIiJ9V+TCvSbsuW/aqp67iEg2EQz3oOfevl09dxGRbKIX7lVBuKvnLiKSXeTCvXbXAVX13EVEsolcuFdXJA+oqucuIpJN5MK9rMyorihXuIuIdCNy4Q5QW1XOpq0alhERySaS4V5dWU67eu4iIllFMtxrKxNs0gFVEZGsIhnuNZUacxcR6U5Ewz2hcBcR6UZEw10HVEVEuhPJcK+tUs9dRKQ7kQz36spyXaEqItKNnOFuZnea2VozW5Tl9S+a2cLw6zkzm1D4Mruq1QFVEZFu5dNzvwuY3M3rbwAnu/uRwLXArALU1a3kAdWODi/2R4mIRFLOcHf3Z4F13bz+nLuvD5++ADQWqLasOqf9Ve9dRCSTQo+5Xww8VuB17qamSrfaExHpTqJQKzKzTxGE+wndtJkOTAcYOXJkjz+r6632qnq8HhGRuCpIz93MjgTuAKa6+/vZ2rn7LHdvdvfmhoaGHn9eclhGN+wQEcms1+FuZiOBB4H/5e5Le19Sbsn7qOpWeyIimeUcljGze4FJQL2ZtQJXAxUA7n47cBWwH3CrmQHscPfmYhUMwZS/oJ67iEg2OcPd3afleP0rwFcKVlEeqit0qz0Rke5E8grVZM9dZ8uIiGQWyXBPjrlvUriLiGQU0XAPe+6aGVJEJKNIhnt1hYZlRES6E8lwLyuz8G5M6rmLiGQSyXCH8IYd6rmLiGQU4XBP0K5wFxHJKMLhrlvtiYhkE+lw1wFVEZHMIhvuwX1U1XMXEckksuGunruISHYRDvcEm9RzFxHJKMLhXq6zZUREsohsuNdWJTTlr4hIFpEN9+qKctq376Sjw0tdiohInxPZcE9O+9u+Xb13EZF0kQ33zml/dVBVRCRdhMM9Oe2veu4iIukiHO7JW+0p3EVE0kU23DtvtadhGRGRdDnD3czuNLO1ZrYoy+vjzOx5M9tqZpcXvsTMksMymvZXRGR3+fTc7wImd/P6OuCbwE2FKChfyWGZdvXcRUR2kzPc3f1ZggDP9vpad38R2F7IwnKpTZ4towOqIiK7ieyYe3WlxtxFRLLZq+FuZtPNrMXMWtra2nq1rs4Dquq5i4ik26vh7u6z3L
3Z3ZsbGhp6ta4BiXLMdEBVRCSTyA7LlJUZ1RXlbNat9kREdpPI1cDM7gUmAfVm1gpcDVQAuPvtZrY/0ALsA3SY2d8B4939w6JVHaqpTLBZc8uIiOwmZ7i7+7Qcr78LNBasoj1QW6Weu4hIJpEdloFg2l+NuYuI7C7S4V5bldDdmEREMoh0uNdUlmvKXxGRDCIf7pryV0Rkd5EO99rKBJu3q+cuIpIu0uFeU6Weu4hIJtEO98qExtxFRDKIeLiXs2V7Bzs7vNSliIj0KZEO9+S0v+26SlVEpItIh/uuaX91laqISBeRDndN+ysiklmkwz15qz0dVBUR6Sri4a6eu4hIJhEP96DnrnAXEekq0uG+a8xdB1RFRLqIdLjXVCTH3NVzFxFJFe1wD3vu7TqgKiLSRaTDvbZSPXcRkUwiHe4DKsow05i7iEi6SIe7mVFTUa6zZURE0kQ63AFqqhIalhERSZMz3M3sTjNba2aLsrxuZnazmS0zs4VmdnThy8yuprKczTqgKiLSRT4997uAyd28PgVoCr+mA7f1vqz81VQmNCwjIpImZ7i7+7PAum6aTAXu9sALwCAzO6BQBeZSq567iMhuCjHmPhxYmfK8NVy2GzObbmYtZtbS1tZWgI8Opv3dpFvtiYh0UYhwtwzLMt4ayd1nuXuzuzc3NDQU4KODc93bNSwjItJFIcK9FRiR8rwRWF2A9ealpqpcU/6KiKQpRLjPBi4Mz5o5DvjA3d8pwHrzEpwto567iEiqRK4GZnYvMAmoN7NW4GqgAsDdbwceBU4HlgGbgb8pVrGZ1FYmdEBVRCRNznB392k5XnfgkoJVtIdqKhNs2d7Bzg6nvCzT8L+ISP8T/StUd92NSb13EZGk6If7rml/Ne4uIpIU+XDXtL8iIruLfLhXh8MymzTtr4jILpEP92TPvX27eu4iIkmRD/fkmLt67iIinaIf7rvOllHPXUQkKfLhnhyWUbiLiHSKfLjrPHcRkd3FINzDUyE17a+IyC6RD/cBFWWYQbt67iIiu0Q+3M2M2krdJFtEJFXkwx2CC5k05i4i0ikW4V6rOd1FRLqIRbjXVCZ0QFVEJEVMwl3DMiIiqeIR7lUJDcuIiKSIRbjXqucuItJFLMK9urJcY+4iIiliEe61lQlN+SsikiKvcDezyWa2xMyWmdmMDK8fZGZPmtlCM3vazBoLX2p2NVXlmvJXRCRFznA3s3LgFmAKMB6YZmbj05rdBNzt7kcCM4HrCl1od2oqEmzd0cHODt+bHysi0mfl03M/Fljm7ivcfRtwHzA1rc144Mnw8VMZXi+q2irNDCkikiqfcB8OrEx53houS7UAOCd8/HlgoJnt1/vy8lOjOd1FRLrIJ9wtw7L08Y/LgZPNbB5wMrAK2K0bbWbTzazFzFra2tr2uNhsanSTbBGRLvIJ91ZgRMrzRmB1agN3X+3uZ7v7UcA/hMs+SF+Ru89y92Z3b25oaOhF2V3pVnsiIl3lE+4vAk1mNtrMKoELgNmpDcys3syS67oSuLOwZXavtkrDMiIiqXKGu7vvAC4F5gCvAve7+2Izm2lmZ4bNJgFLzGwpMAz4pyLVm1F1clhGB1RFRABI5NPI3R8FHk1bdlXK4weABwpbWv6SN8luV89dRASIyRWqOqAqItJVrMJdY+4iIoFYhLsOqIqIdBWLcK9KlFFmukJVRCQpFuFuZrrVnohIiliEOwTj7u3b1XMXEYEYhXttlXruIiJJsQn36grdak9EJCk24V5bVa6zZUREQrEJ95rKBJsU7iIiQKzCvZzNukJVRASIVbgnNCwjIhKKTbgHY+7quYuIQIzCvbqyXGPuIiKh2IR7bWWCbTs62LGzo9SliIiUXGzCfdfMkNvVexcRiVG4hzND6ipVEZH4hHttlW61JyKSFJtwr9Gt9kREdolRuOtWeyIiSbELd13IJCKSZ7ib2WQzW2Jmy8xsRobXR5rZU2Y2z8wWmtnphS+1e7rVnohIp5zhbm
blwC3AFGA8MM3Mxqc1+7/A/e5+FHABcGuhC81lcE0lAH9+4/29/dEiIn1OPj33Y4Fl7r7C3bcB9wFT09o4sE/4eF9gdeFKzE/DwCou/MRB/Pz5t3hu+Xt7++NFRPqUfMJ9OLAy5XlruCzVNcCXzKwVeBT4RqYVmdl0M2sxs5a2trYelNu9K6d8jDH1tXznVwv5cMv2gq9fRCQq8gl3y7DM055PA+5y90bgdOAeM9tt3e4+y92b3b25oaFhz6vNobqynB+cN4F3P9zCzIdfKfj6RUSiIp9wbwVGpDxvZPdhl4uB+wHc/XlgAFBfiAL31FEjB3PJpIN5YG4rcxa/W4oSRERKLp9wfxFoMrPRZlZJcMB0dlqbt4FPA5jZxwjCvfDjLnm69JQmDh++D3//4Mu8t3FrqcoQESmZnOHu7juAS4E5wKsEZ8UsNrOZZnZm2OzbwFfNbAFwL/DX7p4+dLPXVCbK+OF5E/lo6w5m/PplSliKiEhJJPJp5O6PEhwoTV12VcrjV4DjC1ta7zQNG8gVpx3KP/7nq/xqbivnNY/I/SYRkZiIzRWqmXz5+NEcN2YIMx9+hZXrNpe6HBGRvSbW4V5WZtz0hQkAfPtXC+jo0PCMiPQPsQ53gMbBNVx9xnj+/MY67vzjG6UuR0Rkr4h9uAOce0wjnxk/jBvnLGHpmo9KXY6ISNH1i3A3M647+wgGViW47Jfz2bZD91kVkXjrF+EOUF9XxXVnH8Hi1R9y85Ovl7ocEZGi6jfhDnDqYfvzhWMaufXpZbz09vpSlyMiUjT9KtwBrjpjPAfsW83/+eV8Nut+qyISU/0u3AcOqOAH503grXWbue7R10pdjohIUfS7cAc4bsx+XHz8aO554S2eWVqyKXBERIqmX4Y7wOWnHUrT0Dqu/PVCtmzXrflEJF76bbgPqCjn2rMOZ/UHW5j17IpSlyMiUlD9NtwhGJ6Zcvj+3Pb0ct79YEupyxERKZh+He4Af3/6x9jZ4dz4uA6uikh89PtwHzGkhotPHM2D81Yxf+WGUpcjIlIQ/T7cAS751CHU11Ux8+HFurGHiMSCwh2oq0pwxWmH8tLbG5i9IP32sCIi0aNwD517TCOHHbgP1z/2Gu3bdGqkiESbwj1UVmZcfcZhvKNTI0UkBhTuKY4dPYTPHnEAtz+znHc+aC91OSIiPZZXuJvZZDNbYmbLzGxGhtd/aGbzw6+lZhbZ005mTBnHTndufHxJqUsREemxnOFuZuXALcAUYDwwzczGp7Zx98vcfaK7TwT+FXiwGMXuDSOG1PDVE0fz0LxVmhZYRCIrn577scAyd1/h7tuA+4Cp3bSfBtxbiOJK5W8nHULDwCpmPvyKTo0UkUjKJ9yHAytTnreGy3ZjZgcBo4E/9L600kmeGjl/5QZ+O1+nRopI9OQT7pZhWbbu7AXAA+6e8VxCM5tuZi1m1tLW1ren2j3n6EaOGL4v1z/2mm7qISKRk0+4twIjUp43Atm6sxfQzZCMu89y92Z3b25oaMi/yhIoKzOuOmM87364hZ8+o1MjRSRa8gn3F4EmMxttZpUEAT47vZGZHQoMBp4vbIml8/FRQ/jskQfw02eXs3qDTo0UkejIGe7uvgO4FJgDvArc7+6LzWymmZ2Z0nQacJ/H7AjklVPG0eFwg2aNFJEISeTTyN0fBR5NW3ZV2vNrCldW39E4uIbpJ47hJ08t48JPjOKYgwaXuiQRkZx0hWoe/nbSwQwdWMXMR16hoyNWf5iISEwp3PNQW5XgisnjWLByA79dsKrU5YiI5KRwz9PZRw3nyMZ9ueGxJTo1UkT6PIV7nsrKjKs+F5waebtOjRSRPk7hvgeaRw3hjAkH8tNnlrNKp0aKSB+mcN9DM6aMA+CGx3RqpIj0XQr3PTR8UDVfO2kMsxesZu5b60pdjohIRgr3HvjayQczbJ9g1kidGikifZHCvQdqqxJ8d/I4FrR+wEPzdGqkiPQ9CvceOmvicCaMGMSNc15j01
adGikifYvCvYeSp0au+XArtz+zvNTliIh0oXDvhWMOGsyZEw5k1rMraF2/udTliIjsonDvpRlTxmEG1+vUSBHpQxTuvXTgoGqmn3Qwjyx8h5Y3dWqkiPQNCvcC+PrJY9h/nwF8X6dGikgfkdd87tK9msoE351yKJf9cgEPzlvFucc0lrqkgtrZ4aze0M62nR2lLkUkFgZVV7BfXVVRP0PhXiBTJwzn58+9xY2Pv8aUw/entip639qdHc7b6zazdM1HvL7mI5au2cjSNR+x4r1NbNuhYBcplK+ffPCuqUyKJXoJ1Eclb6h99q3PcdvTy7n8tENLXVJWOzuclckQXxsE+NI1G1netrFLiA8fVM3YYXWcPLaBgxvqqKrQKJ5IIRwytK7on6FwL6CjRw7mrIkHMuu/VnD+x0cwYkhNSevp6HBWrt+8qwf+ekqIb00L8aZhdZzYVM8hQ+sYO2wghwytoy6Cf32ISEA/vQV2xeRxPL74Xa5//DVu+auj98pndnQ4revbgx742o94PQzz5W0b2bK9M8QP3HcATcMG8smD92PssIE0DaujadhAhbhIDOX1U21mk4EfA+XAHe5+fYY25wHXAA4scPe/KmCdkXHgoGq+fvLB/Oj3r3PRJ9Zx7OghBVt3R4ezakP7rmGU18NhlWVrN9K+feeudgfsO4BDhtZx3JiDGBsGeNPQOgYOqChYLSLSt5l796fumVk5sBT4DNAKvAhMc/dXUto0AfcDp7j7ejMb6u5ru1tvc3Ozt7S09Lb+Pql9205O+cHT7FdXyexLTqCszPbo/ckQf31t50HNZWGIb97WGeL77zMg6H0PHdgZ4sPq2EchLhJbZjbX3Ztztcun534ssMzdV4Qrvg+YCryS0uarwC3uvh4gV7DHXXVlOTOmjONb983ngZdaOa95RMZ27mGIJ8fE13b2xlNDfOjAKsYOG8j5Hx/B2GFBkB8ydCD7VivERSSzfMJ9OLAy5Xkr8BdpbcYCmNkfCYZurnH3xwtSYUSdOeFA7nruTf5lzhKmHL4/H27Zseug5utrNrJ07UaWrfmITSkh3jCwirHD6jivuTPEm4YOZN8ahbiI7Jl8wj3TmEL6WE4CaAImAY3Af5nZ4e6+ocuKzKYD0wFGjhy5x8VGiVkwa+Tnb32Oo699gu07O79l9XVBiH+heUSXYZVBNZUlrFhE4iSfcG8FUscVGoHVGdq84O7bgTfMbAlB2L+Y2sjdZwGzIBhz72nRUXHUyMF873Pjeev9TTQNG8jY8DTDwbUKcREprnzC/UWgycxGA6uAC4D0M2F+A0wD7jKzeoJhmhWFLDSqLj5hdKlLEJF+KOclh+6+A7gUmAO8Ctzv7ovNbKaZnRk2mwO8b2avAE8B33H394tVtIiIdC/nqZDFEudTIUVEiiXfUyE1WYiISAwp3EVEYkjhLiISQwp3EZEYUriLiMSQwl1EJIZKdiqkmbUBb+VoVg+8txfK6Wu03f1Pf912bfeeO8jdG3I1Klm458PMWvI5nzNutN39T3/ddm138WhYRkQkhhTuIiIx1NfDfVapCygRbXf/01+3XdtdJH16zF1ERHqmr/fcRUSkB/psuJvZZDNbYmbLzGxGqespFjMbYWZPmdmrZrbYzL4VLh9iZk+Y2evhv4NLXWsxmFm5mc0zs0fC56PN7E/hdv/SzGJ3ZxMzG2RmD5jZa+F+/0R/2N9mdln4f3yRmd1rZgPiuL/N7E4zW2tmi1KWZdy/Frg5zLmFZnZ0oerok+FuZuXALcAUYDwwzczGl7aqotkBfNvdPwYcB1wSbusM4El3bwKeDJ/H0bcI7hOQdAPww3C71wMXl6Sq4vox8Li7jwMmEGx/rPe3mQ0Hvgk0u/vhBPdavoB47u+7gMlpy7Lt3ykEd61rIrgF6W2FKqJPhjtwLLDM3Ve4+zbgPmBqiWsqCnd/x91fCh9/RPCDPpxge38eNvs5cFZpKiweM2sEPgvcET434BTggbBJ7LbbzP
YBTgL+HcDdt4X3Go79/ia481u1mSWAGuAdYri/3f1ZYF3a4mz7dypwtwdeAAaZ2QGFqKOvhvtwYGXK89ZwWayZ2SjgKOBPwDB3fweCXwDA0NJVVjQ/Aq4AOsLn+wEbwrt/QTz3+xigDfhZOBx1h5nVEvP97e6rgJuAtwlC/QNgLvHf30nZ9m/Rsq6vhrtlWBbr03rMrA74NfB37v5hqespNjP7HLDW3eemLs7QNG77PQEcDdzm7kcBm4jZEEwm4RjzVGA0cCBQSzAkkS5u+zuXov2f76vh3gqMSHneCKwuUS1FZ2YVBMH+C3d/MFy8JvnnWfjv2lLVVyTHA2ea2ZsEw26nEPTkB4V/tkM893sr0OrufwqfP0AQ9nHf338JvOHube6+HXgQ+CTx399J2fZv0bKur4b7i0BTeCS9kuDAy+wS11QU4TjzvwOvuvv/S3lpNnBR+Pgi4Ld7u7Zicvcr3b3R3UcR7N8/uPsXCW6wfm7YLI7b/S6w0swODRd9GniFmO9vguGY48ysJvw/n9zuWO/vFNn272zgwvCsmeOAD5LDN73m7n3yCzgdWAosB/6h1PUUcTtPIPgzbCEwP/w6nWD8+Ung9fDfIaWutYjfg0nAI+HjMcCfgWXAr4CqUtdXhO2dCLSE+/w3wOD+sL+B7wOvAYuAe4CqOO5v4F6C4wrbCXrmF2fbvwTDMreEOfcywdlEBalDV6iKiMRQXx2WERGRXlC4i4jEkMJdRCSGFO4iIjGkcBcRiSGFe0SZmZvZPSnPE2bWlpxdsZv3TTSz07t5vdnMbu5lbQ3hTH/zzOzE3qwrXN+o5Ax7qfWZWZWZ/d7M5pvZ+WZ2Yjjr4Hwzq+7t53ZTzyQz+2Sx1p/lM+8o1eR54fZm/H9lZo+a2aC9XZPklsjdRPqoTcDhZlbt7u3AZ4BVebxvItAMPJr+gpkl3L2F4Bzs3vg08Jq7X5SzZednl7v7zlzt0uo7Cqhw94nhOm4HbnL3n+X5mUZww5qOnI27mgRsBJ7bw/f1mLt/ZW991p5w96wdBSmxUp/wr68eXyixEfhn4Nzw+d3Ad+m8GKgWuJPgat95BPN6VBJcKdhGcLHU+cA1BLf8+h3wH3S9oKgO+BnBxRULgXMIpmq9i+BClJeBy9Lqmpj2GdXAtLDtIuCGtG2YSTBR2glp6zkGWAA8D/wLsChcPgl4hGDipWUEE1DNB75GMBPfGwTTOAB8J9z+hcD3w2WjCGbevDX8vhwEnBp+zksEF9LUhW3fJLjw5qWw/nHh+98l+EU6Hzgxre5rCGb9+134/rOBG8P3P07wywjgqrC2ReH33wg6Wy8Ck8I21wH/FD5+mvACl/D7dgPBxFu/J5hF9WlgBXBm2OavgZ+k1PVIynpzvj9tmyYBzwIPEVxVejtQlvI9qk/5vv4bsDjc/uqwzTfD9y0E7iv1z05/+Sp5Afrq4Y4LfkCPJJibZEAYNJPoDOZ/Br4UPh5EcLVvbYYf+mvCH/LkD2LqOm4AfpTSdjBB6D6RsmxQhtp2fQbBJFFvAw1heP0BOCt8zYHzsmzfQuDk8PFu4Z7+OHx+F52/7E6lMzTLwnA7KQyhDuC4sF19GFy14fPvAleFj98EvhE+/t/AHSnfs8uz1H0N8N9ABcFc7ZuBKeFrD6Vs+5CU99wDnBE+PiwMyc8Q/PKpDJc/TWe4e9o6f5fyefPT90H4PDXcc74/bZsmAVsIriYtB55I+T6/SWe47wAmhsvvp/P/32rCK0/J8P9FX8X50ph7hLn7QoIfqmnsPsxyKjDDzOYTBMMAYGSWVc32YGgn3V8SXBqd/Lz1BL27MWb2r2Y2Gcg1g+XHgac9mDBqB/ALgpAF2EkwYVoXZrYvQQg8Ey66J71NHk4Nv+YR9LzHEdwQAeAtD+bOhuAGKeOBP4bfq4sIevNJyYnc5hJ8r/PxmAeTY71MEIaPh8tfTlnHp8LjEi8TTJp2GIC7LybY3oeBL3twP4N029LW+UzK5+VTY0
/e/2cP7q+wk+Dy+hMytHnD3eeHj1O/XwuBX5jZlwh+AcheoDH36JtNME/2JIL5K5IMOMfdl6Q2NrO/yLCOTVnWbaRNP+ru681sAnAacAlwHvDlburLNKVp0hbPPM6+2+f2gAHXuftPuywM5szflNbuCXeflmU9W8N/d5L/z8tWAHfvMLPtHnZZCf5iSJjZAIJhoWZ3X2lm1xD88k06AtgADMuy/vR1pn5essYddD1hYsAevj9d+v7ItH+2pjzeSTAkB8ENWU4CzgS+Z2aHeecc7lIk6rlH353ATHd/OW35HOAb4UFDzOyocPlHwMA81/074NLkEzMbbGb1BOOtvwa+RzBdbXf+BJxsZvXh7ROnAc909wYP7kz0gZkle4dfzLPeVHOAL4fz5GNmw80s0w0wXgCON7NDwnY1ZjY2x7r35HuYSTJo3wvrS86KiJmdTfBL+iTg5l6cifImMNHMysxsBMG4em8cG87SWkZwrOa/83lT2H6Euz9FcGOWQQTHcqTIFO4R5+6t7v7jDC9dSzCOujA8jfDacPlTwPjk6YM5Vv+PwGALbmi8APgUwV1ing6HMO4CrsxR3zthm6cIDpC+5O75TOv6N8AtZvY8kGnIqFvunjxA/Hw49PEAGQLZ3dsIxqfvNbOFBGE/LsfqHwY+H34P9/hUz/CX178RDIP8huAgKuEvzuuBi919KfATgvut9sQfCQ4uv0zwl91LPVxP0vNhbYvC9T6U5/vKgf8f7oN5BPdL3dDLWiQPmhVSRCSG1HMXEYkhhbuISAwp3EVEYkjhLiISQwp3EZEYUriLiMSQwl1EJIYU7iIiMfQ/fdBbbb6ySS4AAAAASUVORK5CYII=\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"params = [2, 4, 8, 16, 32, 64, 100]\n",
"\n",
"metrics = [evaluate_dt(train_dt, test_dt, 5, param) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)\n",
"\n",
"plot(params, metrics)\n",
"pyplot.xlabel('Metrics for different maximum bins')\n",
"fig = matplotlib.pyplot.gcf()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gradient BOOSTED TREE"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel\n"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
"data_gbt = data_rec.map(lambda r: LabeledPoint(extract_label(r),extract_features_dt(r)))"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
"(trainingData, testData) = data_gbt.randomSplit([0.7, 0.3])"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Gradient BOOSTED predictions: [(40.0, 18.20990171759985), (2.0, 18.223666887903477), (8.0, 127.79752806968237), (106.0, 120.2269624548493), (37.0, 133.7865565239979)]\n"
]
}
],
"source": [
"model = GradientBoostedTrees.trainRegressor(trainingData,\n",
" categoricalFeaturesInfo={}, numIterations=3)\n",
"preds = model.predict(testData.map(lambda p: p.features))\n",
"actual = testData.map(lambda p: p.label)\n",
"true_vs_predicted_GBT = actual.zip(preds)\n",
"print (\"Gradient BOOSTED predictions: \" + str(true_vs_predicted_GBT.take(5)))\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5240\n",
"log - Mean Squared Error: 14147.5213\n",
"log - Mean Absolue Error: 82.3479\n",
"Root Mean Squared Log Error: 0.7971\n"
]
}
],
"source": [
"nn=[]\n",
"ab=[]\n",
"s_log=[]\n",
"for i in true_vs_predicted_GBT.collect():\n",
" real,predict=i[0],i[1]\n",
" value=(predict - real)**2\n",
" value1=np.abs(predict - real)\n",
" value2=(np.log(predict + 1) - np.log(real + 1))**2\n",
" nn.append(value)\n",
" ab.append(value1)\n",
" s_log.append(value2)\n",
"value_len=len(nn)\n",
"print( value_len)\n",
"ss=sum(nn)\n",
"t=ss/value_len\n",
"ab_sum=sum(ab)\n",
"ab_mean=ab_sum/value_len\n",
"s_log_sum=sum(s_log)\n",
"\n",
"s_log_mean=np.sqrt(s_log_sum/value_len)\n",
"print (\"log - Mean Squared Error: %2.4f\" % t)\n",
"print(\"log - Mean Absolue Error: %2.4f\" % ab_mean)\n",
"print(\"Root Mean Squared Log Error: %2.4f\" % s_log_mean)"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"def evaluate_dt(trainingData,categoricalFeaturesInfo, loss, numIterations, maxDepth, maxBins):\n",
"\n",
" model = GradientBoostedTrees.trainRegressor(trainingData,categoricalFeaturesInfo, loss,numIterations,maxDepth=maxDepth, maxBins=maxBins)\n",
"\n",
" preds = model.predict(testData.map(lambda p: p.features))\n",
"\n",
" actual = testData.map(lambda p: p.label)\n",
"\n",
" tp = actual.zip(preds)\n",
" new_val=[]\n",
" for i in tp.collect():\n",
" actual=i[0]\n",
" pred=i[1]\n",
" va=(np.log(pred + 1) - np.log(actual + 1))**2\n",
" new_val.append(va)\n",
" lenth=len(new_val)\n",
" s_new_val=sum(new_val)\n",
" mean_new_val=s_new_val/lenth\n",
" rmsle=np.sqrt(mean_new_val)\n",
" return rmsle"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gradient boost tree Iteration"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2, 4, 8, 16, 32, 64, 100]\n",
"[0.8199092268881405, 0.8191629830985493, 0.8176741701394266, 0.8147160855323975, 0.8089602867888349, 0.7979663301656902, 0.7864647907071756]\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"params = [2, 4, 8, 16, 32, 64, 100]\n",
"\n",
"metrics = [evaluate_dt(trainingData, {},'leastAbsoluteError', param,3, 32) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)\n",
"\n",
"plot(params, metrics)\n",
"\n",
"fig = matplotlib.pyplot.gcf()\n",
"pyplot.xlabel('Metrics for varying number of iterations')\n",
"pyplot.xscale('log')"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2, 4, 8, 16, 32, 64, 100]\n",
"[1.3129641027539716, 0.8623427672994901, 0.824921416579645, 0.7999850600613403, 0.8169310667181966, 0.8169291195629018, 0.8169291195629018]\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"params = [2, 4, 8, 16, 32, 64, 100]\n",
"\n",
"metrics = [evaluate_dt(trainingData, {},'leastAbsoluteError',10,3, param) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)\n",
"\n",
"plot(params, metrics)\n",
"pyplot.xlabel('Metrics for different maximum bins')\n",
"fig = matplotlib.pyplot.gcf()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
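The evaluation cells in the notebook above compute the root mean squared log error (RMSLE) by looping over collected (actual, predicted) pairs. A minimal sketch of the same metric in vectorised NumPy follows, assuming the labels and predictions have already been collected into plain Python lists. The helper name `rmsle` and the explicit range check are illustrative assumptions, not part of the submitted notebooks.

```python
import numpy as np


def rmsle(actual, predicted):
    """Root mean squared log error of predictions against actual values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # log(1 + x) is undefined for x <= -1; fail loudly instead of returning nan
    if np.any(actual <= -1) or np.any(predicted <= -1):
        raise ValueError("RMSLE needs all values to be greater than -1")
    # log1p(x) computes log(1 + x), matching np.log(x + 1) in the notebook
    squared_log_errors = (np.log1p(predicted) - np.log1p(actual)) ** 2
    return float(np.sqrt(np.mean(squared_log_errors)))
```

With a Spark RDD of (label, prediction) pairs, one possible use is `rmsle(*zip(*pairs.collect()))`. Raising on out-of-range values is a design choice for this sketch: the loop-based version silently returns nan when a model predicts a value below -1.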
assign2/.ipynb_checkpoints/house-checkpoint.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"SparkContext\n",
"Spark UI\n",
"Version: v2.2.0\n",
"Master: local[*]\n",
"AppName: PySparkShell"
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sc"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.mllib.regression import LabeledPoint,LinearRegressionWithSGD\n",
"from pyspark.mllib.tree import DecisionTree\n",
"import numpy as np\n",
"import operator\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"\n",
"house_df = sc.textFile(\"/Users/Priya/Desktop/house/trainnoheader.csv\")\n",
"data_count= house_df.count()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"records = house_df.map(lambda x: x.split(\",\"))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"type_columns=[2,5,7,8,9,10,11,12,13,14,15,16,21,22,23,24,27,28,29,39,40,41,53,55,65,78,79]\n",
"type_columns_with_NA=[6,25,30,31,32,33,35,42,57,58,60,63,64,72,73,74]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"number_columns=[1,4,17,18,19,20,34,36,37,38,43,44,45,46,47,48,49,50,51,52,54,56,61,62,66,67,68,69,70,71,75,76,77]\n",
"number_columns_with_NA=[3,26,59]\n",
"number_columns_with_many_zeros=[26,34,36,37,38,44,45,62,66,67,68,69,70,71,75]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"saleprice_column=80"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def getMapOfColumn(idx):\n",
" return records.map(lambda fields:fields[idx]).distinct().zipWithIndex().collectAsMap()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def get_type_maps():\n",
"    # Build a {column index: {category: code}} map for every categorical column\n",
"    type_maps = {}\n",
"    for i in type_columns + type_columns_with_NA:\n",
"        type_maps[i] = getMapOfColumn(i)\n",
"    return type_maps"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"type_maps=get_type_maps()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{2: {'RL': 0, 'RH': 1, 'RM': 2, 'C (all)': 3, 'FV': 4}, 5: {'Pave': 0, 'Grvl': 1}, 7: {'Reg': 0, 'IR1': 1, 'IR2': 2, 'IR3': 3}, 8: {'Bnk': 0, 'Low': 1, 'Lvl': 2, 'HLS': 3}, 9: {'NoSeWa': 0, 'AllPub': 1}, 10: {'FR2': 0, 'CulDSac': 1, 'Inside': 2, 'Corner': 3, 'FR3': 4}, 11: {'Gtl': 0, 'Mod': 1, 'Sev': 2}, 12: {'CollgCr': 0, 'Mitchel': 1, 'NWAmes': 2, 'NAmes': 3, 'MeadowV': 4, 'Edwards': 5, 'ClearCr': 6, 'NPkVill': 7, 'Blmngtn': 8, 'SWISU': 9, 'Veenker': 10, 'Crawfor': 11, 'NoRidge': 12, 'Somerst': 13, 'OldTown': 14, 'BrkSide': 15, 'Sawyer': 16, 'NridgHt': 17, 'SawyerW': 18, 'IDOTRR': 19, 'Timber': 20, 'Gilbert': 21, 'StoneBr': 22, 'BrDale': 23, 'Blueste': 24}, 13: {'Norm': 0, 'Feedr': 1, 'PosN': 2, 'Artery': 3, 'RRAe': 4, 'RRNn': 5, 'PosA': 6, 'RRAn': 7, 'RRNe': 8}, 14: {'Norm': 0, 'Artery': 1, 'RRNn': 2, 'Feedr': 3, 'PosN': 4, 'PosA': 5, 'RRAe': 6, 'RRAn': 7}, 15: {'1Fam': 0, 'Duplex': 1, 'TwnhsE': 2, '2fmCon': 3, 'Twnhs': 4}, 16: {'1.5Fin': 0, '1.5Unf': 1, 'SLvl': 2, '2.5Unf': 3, '2.5Fin': 4, '2Story': 5, '1Story': 6, 'SFoyer': 7}, 21: {'Hip': 0, 'Shed': 1, 'Gable': 2, 'Gambrel': 3, 'Mansard': 4, 'Flat': 5}, 22: {'Metal': 0, 'Membran': 1, 'Roll': 2, 'CompShg': 3, 'WdShngl': 4, 'WdShake': 5, 'Tar&Grv': 6, 'ClyTile': 7}, 23: {'VinylSd': 0, 'WdShing': 1, 'Plywood': 2, 'BrkComm': 3, 'AsphShn': 4, 'CBlock': 5, 'MetalSd': 6, 'Wd Sdng': 7, 'HdBoard': 8, 'BrkFace': 9, 'CemntBd': 10, 'AsbShng': 11, 'Stucco': 12, 'Stone': 13, 'ImStucc': 14}, 24: {'VinylSd': 0, 'Wd Shng': 1, 'Plywood': 2, 'CmentBd': 3, 'AsphShn': 4, 'CBlock': 5, 'MetalSd': 6, 'HdBoard': 7, 'Wd Sdng': 8, 'BrkFace': 9, 'Stucco': 10, 'AsbShng': 11, 'Brk Cmn': 12, 'ImStucc': 13, 'Stone': 14, 'Other': 15}, 27: {'Fa': 0, 'Gd': 1, 'TA': 2, 'Ex': 3}, 28: {'Fa': 0, 'Po': 1, 'TA': 2, 'Gd': 3, 'Ex': 4}, 29: {'PConc': 0, 'CBlock': 1, 'BrkTil': 2, 'Wood': 3, 'Slab': 4, 'Stone': 5}, 39: {'GasW': 0, 'GasA': 1, 'Grav': 2, 'Wall': 3, 'OthW': 4, 'Floor': 5}, 40: {'Fa': 0, 'Po': 1, 'Ex': 2, 'Gd': 3, 'TA': 4}, 41: {'N': 0, 
'Y': 1}, 53: {'Fa': 0, 'Gd': 1, 'TA': 2, 'Ex': 3}, 55: {'Typ': 0, 'Min2': 1, 'Maj2': 2, 'Min1': 3, 'Maj1': 4, 'Mod': 5, 'Sev': 6}, 65: {'N': 0, 'Y': 1, 'P': 2}, 78: {'WD': 0, 'New': 1, 'ConLw': 2, 'COD': 3, 'ConLD': 4, 'ConLI': 5, 'CWD': 6, 'Con': 7, 'Oth': 8}, 79: {'Normal': 0, 'AdjLand': 1, 'Family': 2, 'Abnorml': 3, 'Partial': 4, 'Alloca': 5}, 6: {'NA': 0, 'Pave': 1, 'Grvl': 2}, 25: {'None': 0, 'NA': 1, 'BrkFace': 2, 'Stone': 3, 'BrkCmn': 4}, 30: {'NA': 0, 'Fa': 1, 'Gd': 2, 'TA': 3, 'Ex': 4}, 31: {'NA': 0, 'Fa': 1, 'Po': 2, 'TA': 3, 'Gd': 4}, 32: {'Mn': 0, 'NA': 1, 'No': 2, 'Gd': 3, 'Av': 4}, 33: {'GLQ': 0, 'Rec': 1, 'NA': 2, 'ALQ': 3, 'Unf': 4, 'BLQ': 5, 'LwQ': 6}, 35: {'NA': 0, 'Rec': 1, 'GLQ': 2, 'Unf': 3, 'BLQ': 4, 'ALQ': 5, 'LwQ': 6}, 42: {'FuseF': 0, 'FuseA': 1, 'FuseP': 2, 'Mix': 3, 'NA': 4, 'SBrkr': 5}, 57: {'NA': 0, 'Fa': 1, 'Po': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}, 58: {'BuiltIn': 0, 'CarPort': 1, 'NA': 2, 'Basment': 3, '2Types': 4, 'Attchd': 5, 'Detchd': 6}, 60: {'Fin': 0, 'NA': 1, 'RFn': 2, 'Unf': 3}, 63: {'Fa': 0, 'NA': 1, 'Po': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}, 64: {'Fa': 0, 'NA': 1, 'Po': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}, 72: {'NA': 0, 'Fa': 1, 'Ex': 2, 'Gd': 3}, 73: {'NA': 0, 'MnPrv': 1, 'MnWw': 2, 'GdWo': 3, 'GdPrv': 4}, 74: {'NA': 0, 'Shed': 1, 'Othr': 2, 'Gar2': 3, 'TenC': 4}}\n"
]
}
],
"source": [
"print(type_maps)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def get_type_cnt(maps):\n",
" return sum([len(maps[i]) for i in maps])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Feature vector length for type features: 268\n",
"Feature vector length for numerical features: 33\n",
"Total feature vector length: 301\n",
"Total_dt feature vector length: 76\n"
]
}
],
"source": [
"type_cnt=get_type_cnt(type_maps)\n",
"number_cnt=len(number_columns)\n",
"total=type_cnt+number_cnt\n",
"\n",
"total_dt=len(type_columns)+len(type_columns_with_NA)+len(number_columns)\n",
"\n",
"print (\"Feature vector length for type features: %d\" % type_cnt)\n",
"print (\"Feature vector length for numerical features: %d\" % number_cnt)\n",
"print (\"Total feature vector length: %d\" % total)\n",
"print (\"Total_dt feature vector length: %d\" % total_dt)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def extract_features(fields):\n",
"    # One-hot encode the categorical columns, then append the numeric columns\n",
"    features = np.zeros(total)\n",
"    step = 0\n",
"    for i in type_columns + type_columns_with_NA:\n",
"        features[step + int(type_maps[i][fields[i]])] = 1.0\n",
"        step += len(type_maps[i])\n",
"    for i in number_columns:\n",
"        features[step] = float(fields[i])\n",
"        step += 1\n",
"    return features"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def extract_features_dt(fields):\n",
"    # Encode each categorical column as a single integer code, then append the numeric columns\n",
"    features = np.zeros(total_dt)\n",
"    step = 0\n",
"    for i in type_columns + type_columns_with_NA:\n",
"        features[step] = float(type_maps[i][fields[i]])\n",
"        step += 1\n",
"    for i in number_columns:\n",
"        features[step] = float(fields[i])\n",
"        step += 1\n",
"    return features"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"data=records.map(lambda fields: LabeledPoint(float(fields[saleprice_column]),extract_features(fields)))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Label: 208500.0\n",
"Linear Model feature vector:\n",
"[1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,60.0,8450.0,7.0,5.0,2003.0,2003.0,706.0,0.0,150.0,856.0,856.0,854.0,0.0,1710.0,1.0,0.0,2.0,1.0,3.0,1.0,8.0,0.0,2.0,548.0,0.0,61.0,0.0,0.0,0.0,0.0,0.0,2.0,2008.0]\n",
"Linear Model feature vector length: 301\n"
]
}
],
"source": [
"first_point = data.first()\n",
"#print (\"Raw data: \" + str(first_point[1:]))\n",
"print (\"Label: \" + str(first_point.label))\n",
"print (\"Linear Model feature vector:\\n\" + str(first_point.features))\n",
"print (\"Linear Model feature vector length: \" + str(len(first_point.features)))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.mllib.regression import LinearRegressionWithSGD"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n"
]
}
],
"source": [
"lrModel=LinearRegressionWithSGD.train(data, iterations=10, step=0.1, intercept=False)\n",
"true_vs_predicted=data.map(lambda p: (p.label, lrModel.predict(p.features)))"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Linear Model predictions: [(208500.0, -1.3111060925180484e+75), (181500.0, -1.4720767452081686e+75), (223500.0, -1.7050281430818638e+75), (140000.0, -1.4631365187530982e+75), (250000.0, -2.1369709269890862e+75)]\n"
]
}
],
"source": [
"print (\"Linear Model predictions: \" + str(true_vs_predicted.take(5)))"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Linear Model - Mean Squared Error: 4519283835876382689242228853370308019420839092654378420329959275965062173707428040253979807151535809132649831467864490762707227557576984928707873341440.0000\n"
]
}
],
"source": [
"# Mean squared error over all (true, predicted) pairs\n",
"squared_errors = [(pred - true)**2 for true, pred in true_vs_predicted.collect()]\n",
"mse = sum(squared_errors) / len(squared_errors)\n",
"print (\"Linear Model - Mean Squared Error: %2.4f\" % mse)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"targets = records.map(lambda r: float(r[-1])).collect()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"import pylab"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Populating the interactive namespace from numpy and matplotlib\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/anaconda3/lib/python3.6/site-packages/IPython/core/magics/pylab.py:160: UserWarning: pylab import has clobbered these variables: ['mean', 'pylab']\n",
"`%matplotlib` prevents importing * from pylab and numpy\n",
" \"\\n`%matplotlib` prevents importing * from pylab and numpy\"\n"
]
}
],
"source": [
"%pylab inline"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# density=True replaces the deprecated normed=True (matplotlib >= 2.1)\n",
"hist(targets, bins=40, color='lightblue', density=True)\n",
"\n",
"fig = matplotlib.pyplot.gcf()\n",
"\n",
"fig.set_size_inches(16, 10)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"log_targets = records.map(lambda r: np.log(float(r[-1]))).collect()\n",
"\n",
"# density=True replaces the deprecated normed=True (matplotlib >= 2.1)\n",
"hist(log_targets, bins=40, color='lightblue', density=True)\n",
"\n",
"fig = matplotlib.pyplot.gcf()\n",
"\n",
"fig.set_size_inches(16, 10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"data_log = data.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n"
]
}
],
"source": [
"model_log = LinearRegressionWithSGD.train(data_log, iterations=10, step=0.1)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"true_vs_predicted_log = data_log.map(lambda p: (np.exp(p.label), np.exp(model_log.predict(p.features))))"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1460\n",
"log - Mean Squared Error: 39039267707.7658\n",
"log - Mean Absolue Error: 180921.1959\n",
"Root Mean Squared Log Error: 12.0307\n"
]
}
],
"source": [
"nn=[]\n",
"ab=[]\n",
"s_log=[]\n",
"for i in true_vs_predicted_log.collect():\n",
"    real,predict=i[0],i[1]\n",
"    value=(predict - real)**2                             # squared error\n",
"    value1=np.abs(predict - real)                         # absolute error\n",
"    value2=(np.log(predict + 1) - np.log(real + 1))**2    # squared log error\n",
"    nn.append(value)\n",
"    ab.append(value1)\n",
"    s_log.append(value2)\n",
"value_len=len(nn)\n",
"print( value_len)\n",
"ss=sum(nn)\n",
"t=ss/value_len\n",
"ab_sum=sum(ab)\n",
"ab_mean=ab_sum/value_len\n",
"s_log_sum=sum(s_log)\n",
"s_log_mean=np.sqrt(s_log_sum/value_len)\n",
"print (\"log - Mean Squared Error: %2.4f\" % t)\n",
"print(\"log - Mean Absolute Error: %2.4f\" % ab_mean)\n",
"print(\"Root Mean Squared Log Error: %2.4f\" % s_log_mean)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Non log-transformed predictions:\n",
"[(208500.0, -1.3111060925180484e+75), (181500.0, -1.4720767452081686e+75), (223500.0, -1.7050281430818638e+75)]\n",
"Log-transformed predictions:\n",
"[(208500.00000000012, 0.0), (181499.99999999988, 0.0), (223500.0, 0.0)]\n"
]
}
],
"source": [
"print (\"Non log-transformed predictions:\\n\" + str(true_vs_predicted.take(3)))\n",
"\n",
"print (\"Log-transformed predictions:\\n\" + str(true_vs_predicted_log.take(3)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tuning model parameters"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"train, test = data.randomSplit([0.7, 0.3], seed=12345)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"train_size=train.count()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"test_size=test.count()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training data size: 1050\n"
]
}
],
"source": [
"print (\"Training data size: %d\" % train_size)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Test data size: 410\n"
]
}
],
"source": [
"print (\"Test data size: %d\" % test_size)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train + Test size : 1460\n"
]
}
],
"source": [
"print (\"Train + Test size : %d\" % (train_size + test_size))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The impact of parameter settings for linear models"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"def evaluate(train, test, iterations, step, regParam, regType, intercept):\n",
"    # Train a linear model with SGD and return its RMSLE on the test set.\n",
"    model = LinearRegressionWithSGD.train(train, iterations, step, regParam=regParam,\n",
"                                          regType=regType, intercept=intercept)\n",
"\n",
"    tp = test.map(lambda p: (p.label, model.predict(p.features)))\n",
"\n",
"    # np.log returns nan (with a RuntimeWarning) when pred + 1 is negative,\n",
"    # so a model that predicts values below -1 yields an RMSLE of nan.\n",
"    squared_log_errors = [(np.log(pred + 1) - np.log(actual + 1))**2\n",
"                          for actual, pred in tp.collect()]\n",
"    return np.sqrt(sum(squared_log_errors) / len(squared_log_errors))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Iterations"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n",
"/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:11: RuntimeWarning: invalid value encountered in log\n",
" # This is added back by InteractiveShellApp.init_path()\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1, 5, 11, 15, 20, 50]\n",
"[16.401492085322918, 81.34883033703413, 176.05369822746945, 238.23038626017032, nan, nan]\n"
]
}
],
"source": [
"params = [1, 5, 11, 15, 20, 50]\n",
"\n",
"metrics = [evaluate(train, test, param, 0.1, 0.0, 'l2', False) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot(params, metrics)\n",
"\n",
"fig = matplotlib.pyplot.gcf()\n",
"pyplot.xlabel('Metrics for varying number of iterations')\n",
"pyplot.xscale('log')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step size"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
"params = [0.1, 0.020, 0.25, 0.1, 1.0]"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n",
"/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:11: RuntimeWarning: invalid value encountered in log\n",
" # This is added back by InteractiveShellApp.init_path()\n"
]
}
],
"source": [
"metrics = [evaluate(train, test, 20, param, 0.0, 'l2', False) for param in params]"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.1, 0.02, 0.25, 0.1, 1.0]\n",
"[nan, nan, nan, nan, nan]\n"
]
}
],
"source": [
"print (params)\n",
"print (metrics)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot(params, metrics)\n",
"\n",
"fig = matplotlib.pyplot.gcf()\n",
"pyplot.xlabel('Metrics for varying values of step size')\n",
"pyplot.xscale('log')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# L2 regularization"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n",
"/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:11: RuntimeWarning: invalid value encountered in log\n",
" # This is added back by InteractiveShellApp.init_path()\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.0, 0.01, 0.1, 1.0, 5.0, 10.0, 20.0]\n",
"[nan, nan, nan, nan, nan, nan, nan]\n"
]
}
],
"source": [
"params = [0.0, 0.01, 0.1, 1.0, 5.0, 10.0, 20.0]\n",
"\n",
"metrics = [evaluate(train, test, 10, 0.1, param, 'l2', False) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYYAAAEOCAYAAACNY7BQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFJ1JREFUeJzt3X20ZXV93/H3R0bBaMLjYJBhHCws7ZgHrCeobawkPDjY6rACFWhihpRkVlJJ22S5Kq5oUDSJaLKwKpqMiE5owkOJ1okmQYKSrvqA3FEKjBEZEcsIjWOZ0KBGMuTbP/YeOL/ruQ9zzxnuvcz7tdZZdz/89m9/9z777s/Z+9xzbqoKSZL2eNJiFyBJWloMBklSw2CQJDUMBklSw2CQJDUMBklSw2CQJDUMBklSw2CQJDUMBklSY8ViF7AQRxxxRK1Zs2axy5CkZWXr1q3fqqqVc7VblsGwZs0apqamFrsMSVpWknx9Pu28lSRJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJakwkGJKsS3Jnku1JLhwx/8Ak1/Tzb06yZtr81UkeSvLaSdQjSVq4sYMhyQHAZcDpwFrg3CRrpzU7H9hVVccBlwKXTJt/KfDn49YiSRrfJK4YTgS2V9XdVfUwcDWwflqb9cDmfvg64OQkAUhyBnA3sG0CtUiSxjSJYDgauHdofEc/bWSbqtoNPAgcnuRpwOuAN0+gDknSBEwiGDJiWs2zzZuBS6vqoTlXkmxMMpVkaufOnQsoU5I0Hysm0McO4Jih8VXAfTO02ZFkBXAw8ADwQuCsJG8HDgH+McnfV9V7pq+kqjYBmwAGg8H04JEkTcgkguEW4PgkxwLfAM4B/u20NluADcBngbOAT1ZVAS/Z0yDJm4CHRoWCJOnxM3YwVNXuJBcA1wMHAFdU1bYkFwNTVbUF+ABwZZLtdFcK54y7XknSvpHuhfvyMhgMampqarHLkKRlJcnWqhrM1c5PPkuSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKlhMEiSGhMJhiTrktyZZHuSC0fMPzDJNf38m5Os6aefmmRrktv7nz89iXokSQs3djAkOQC4DDgdWAucm2TttGbnA7uq6jjgUuCSfvq3gFdU1Y8CG4Arx61HkjSeSVwxnAhsr6q7q+ph4Gpg/bQ264HN/fB1wMlJUlVfrKr7+unbgIOSHDiBmiRJCzSJYDgauHdofEc/bWSbqtoNPAgcPq3NmcAXq+p7E6hJkrRAKybQR0ZMq71pk+R5dLeXTptxJclGYCPA6tWr975KSdK8TOKKYQdwzND4KuC+mdokWQEcDDzQj68CPgL8fFV9daaVVNWmqhpU1WDlypUTKFuSNMokguEW4PgkxyZ5CnAOsGVamy10by4DnAV8sqoqySHAx4HXV9WnJ1CLJGlMYwdD/57BBcD1wF8D11bVtiQXJ3ll3+wDwOFJtgO/Duz5k9YLgOOANya5tX8cOW5NkqSFS9X0twOWvsFgUFNTU4tdhiQtK0m2VtVgrnZ+8lmS1DAYJEkNg0GS1DAYJEkNg0GS1DAYJEkNg0GS1DAYJEkNg0GS1DAYJEkNg0GS1DAYJEkNg0GS1DAYJEkNg0GS1DAYJEkNg0GS1DAYJEkNg0GS1DAYJEkNg0GS1DAYJEkNg0GS1DAYJEkNg0GS1DAYJEkNg0GS1DAYJEkNg0GS1DAYJEmNiQRDknVJ7kyyPcmFI+
YfmOSafv7NSdYMzXt9P/3OJC+bRD2SpIUbOxiSHABcBpwOrAXOTbJ2WrPzgV1VdRxwKXBJv+xa4BzgecA64L19f5KkRTKJK4YTge1VdXdVPQxcDayf1mY9sLkfvg44OUn66VdX1feq6mvA9r4/SdIimUQwHA3cOzS+o582sk1V7QYeBA6f57KSpMfRJIIhI6bVPNvMZ9mug2RjkqkkUzt37tzLEiVJ8zWJYNgBHDM0vgq4b6Y2SVYABwMPzHNZAKpqU1UNqmqwcuXKCZQtSRplEsFwC3B8kmOTPIXuzeQt09psATb0w2cBn6yq6qef0//V0rHA8cDnJ1CTJGmBVozbQVXtTnIBcD1wAHBFVW1LcjEwVVVbgA8AVybZTnelcE6/7LYk1wJfAnYDr6mqR8atSZK0cOleuC8vg8GgpqamFrsMSVpWkmytqsFc7fzksySpYTBIkhoGgySpYTBIkhoGgySpYTBIkhoGgySpYTBIkhoGgySpYTBIkhoGgySpYTBIkhoGgySpYTBIkhoGgySpYTBIkhoGgySpYTBIkhoGgySpYTBIkhoGgySpYTBIkhoGgySpYTBIkhoGgySpYTBIkhoGgySpYTBIkhoGgySpYTBIkhoGgySpMVYwJDksyQ1J7up/HjpDuw19m7uSbOin/UCSjyf5cpJtSd42Ti2SpMkY94rhQuDGqjoeuLEfbyQ5DLgIeCFwInDRUID8blU9F3g+8C+SnD5mPZKkMY0bDOuBzf3wZuCMEW1eBtxQVQ9U1S7gBmBdVX2nqj4FUFUPA18AVo1ZjyRpTOMGwzOq6n6A/ueRI9ocDdw7NL6jn/aoJIcAr6C76pAkLaIVczVI8pfAD4+Y9RvzXEdGTKuh/lcAVwHvqqq7Z6ljI7ARYPXq1fNctSRpb80ZDFV1ykzzkvxNkqOq6v4kRwHfHNFsB3DS0Pgq4Kah8U3AXVX1zjnq2NS3ZTAY1GxtJUkLN+6tpC3Ahn54A/DREW2uB05Lcmj/pvNp/TSSvBU4GPhPY9YhSZqQcYPhbcCpSe4CTu3HSTJIcjlAVT0AvAW4pX9cXFUPJFlFdztqLfCFJLcm+cUx65EkjSlVy++uzGAwqKmpqcUuQ5KWlSRbq2owVzs/+SxJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqTGWMGQ5LAkNyS5q/956AztNvRt7kqyYcT8LUnuGKcWSdJkjHvFcCFwY1UdD9zYjzeSHAZcBLwQOBG4aDhAkvwM8NCYdUiSJmTcYFgPbO6HNwNnjGjzMuCGqnqgqnYBNwDrAJI8Hfh14K1j1iFJmpBxg+EZVXU/QP/zyBFtjgbuHRrf0U8DeAvwe8B3xqxDkjQhK+ZqkOQvgR8eMes35rmOjJhWSU4AjquqX0uyZh51bAQ2AqxevXqeq5Yk7a05g6GqTplpXpK/SXJUVd2f5CjgmyOa7QBOGhpfBdwEvBh4QZJ7+jqOTHJTVZ3ECFW1CdgEMBgMaq66JUkLM+6tpC3Anr8y2gB8dESb64HTkhzav+l8GnB9Vb2vqp5ZVWuAnwS+MlMoSJIeP+MGw9uAU5PcBZzaj5NkkORygKp6gO69hFv6x8X9NEnSEpSq5XdXZjAY1NTU1GKXIUnLSpKtVTWYq52ffJYkNQwGSVLDYJAkNQwGSVLDYJAkNQwGSVLDYJAkNQwGSVLDYJAkNQwGSVLDYJAkNQwGSVLDYJAkNQwGSVLDYJAkNQwGSVLDYJAkNQwGSVLDYJAkNQwGSVLDYJAkNQwGSVLDYJAkNQwGSVLDYJAkNVJVi13DXkuyE/hb4MEFLH4E8K3JVqRZHMzCnqelbKlu02LVta/XO+n+J9XfOP0sdNlxz1
/PqqqVczValsEAkGRTVW1cwHJTVTXYFzXp+y30eVrKluo2LVZd+3q9k+5/Uv2N089SP38t51tJf7rYBWhenojP01LdpsWqa1+vd9L9T6q/cfpZqscQsIyvGBbKKwZJy5VXDPvOpsUuQJIW6HE5f+13VwySpNntj1cMkqRZGAySpIbBIElq7PfBkORpSTYneX+Sn13seiRpvpI8O8kHklw3yX6fkMGQ5Iok30xyx7Tp65LcmWR7kgv7yT8DXFdVvwS88nEvVpKG7M35q6rurqrzJ13DEzIYgA8B64YnJDkAuAw4HVgLnJtkLbAKuLdv9sjjWKMkjfIh5n/+2ieekMFQVf8DeGDa5BOB7X3CPgxcDawHdtCFAzxB94ek5WMvz1/7xP50Ijyax64MoAuEo4EPA2cmeR9L/GPqkvZbI89fSQ5P8vvA85O8flIrWzGpjpaBjJhWVfVt4Bce72IkaS/MdP76v8AvT3pl+9MVww7gmKHxVcB9i1SLJO2Nx/X8tT8Fwy3A8UmOTfIU4BxgyyLXJEnz8biev56QwZDkKuCzwHOS7EhyflXtBi4Argf+Gri2qrYtZp2SNN1SOH/5JXqSpMYT8opBkrRwBoMkqWEwSJIaBoMkqWEwSJIaBoMkqWEwLDFJKsmVQ+MrkuxM8rE5ljshyctnmT9I8q4xa1uZ5OYkX0zyknH6mrQkFyc5ZUJ93ZPkiEn0Nck+kzw3ya39/v8nc/Wf5GeT3NY/PpPkx8dZ/0IsZLuTXL6Qbw5Ncl6SZ47bj/av70paLr4N/EiSp1bVd4FTgW/MY7kTgAHwZ9NnJFlRVVPA1Ji1nQx8uao2zHeBJAdU1US+zrzfjt2j5lXVb05iHUvcGcBHq+qiebb/GvDSqtqV5HRgE/DC2RaY5PO1EP36f3GBi58H3EH/VRFj9KOq8rGEHsBDwG8DZ/Xjfwi8DvhYP/404Aq6j8h/ke6rd58C/G9gJ3ArcDbwJroTwSeAPwZOGurj6cAHgduB24AzgQPovgf+jn76r02r64Rp63gqcG7f9g7gkmnbcDFwM/CTQ9P/KfD5ofE1wG398G/223RHX/eeD1/e1O+PvwIuojvZPbmf90PAPcCT+9r37LN7gDcDX+jre24/fSVwQz/9D4CvA0eMeA7u2TMd+Dng8/02/0G/n34FePtQ+/OAd8/UfrjP/vn7OPC/+m09e8T6TwA+1z83HwEOBV4O/B+6Fwmfmq3mGY6rQ4FvzHLMPfp8AS/o9/dWuk/aHtW3+4m+ps8C7wDuGNr+9wz19zHgpBH78r/3fW4DNs6y/pvoXuS8st+PtwJ3Al+b6VgBzur7uZPHjs+bgEG/zGzH6m/1z8fngGcs9jlgKTwWvQAf056Q7kD9MeA64KD+ID+Jx07qvw38XD98CPCV/mQz/ZfzTf0v4VP78eE+LgHeOdT20P5kcMPQtENG1PboOoBn0gXFSrorz08CZ/TzCnjVDNt3K/Dsfvh1wBv64cOG2lwJvKIfvgl479C8Dw6tZyPwe/3wh2iD4Vf74X8PXN4Pvwd4fT+8rq9zxmCgC7I/5bEgei/w8/02bx9q/+d0J7SR7af1eSbw/qFlDx6x/tvoXulDd8J859Bz+toZ9us9o7ZlaP5r9+yHEfMefb7oQvYzwMp+/Gzgin74DuCf98NvY++D4bD+51P7vg4fdbwwdEIfmnYt8Jp5HCuD6f0w97G6Z/m30x+P+/vD9xiWoKq6je7V9Ll8/62h04ALk9xKd+AfBKyeoast1d2Omu4Uuv8GtWd9u4C7gWcneXeSdcD/m6PMnwBuqqqd1d3e+SPgX/bzHgH+ZIblrgVe1Q+fDVzTD/9U//7F7cBPA88bWuaaoeHLeexr0n+BLihG+XD/cyvdvoTu5H01QFX9BbBrpo3rnUwXmLf0+/tkulDbCdyd5EVJDgeeA3x6pvbT+rwdOCXJJUleUlUPDs9McjBdKP9VP2kzj+3XBUnyU8D5dEE8yvDz9RzgR4Ab+m
14A7AqySHAD1bVZ/p2f7yAUv5Dkj2vzI8Bjh+x/lH1/2fgu1W155id7VgZZbZj9WG6IIP2WNmv+R7D0rUF+F26V/qHD00PcGZV3TncOMmoe8ffnqHv0L1SelR196F/HHgZ8Bq6k/e/m6W+Ud8Pv8ff18z3qa8B/luSD3errbuSHET36npQVfcmeRNd4H3fdlTVp5OsSfJSuts0zf/FHfK9/ucjPHacz1bzKAE2V9Wof4ByDd0++jLwkaqqJLO131P/V5K8gO7W0O8k+URVXbyXdc1bkh+jC9PTq/vu/lGGn68A26rqxdP6OXSW1eym/UOWg6Y3SHIS3QuSF1fVd5LcNNRuxuMlycnAv6E/kc/jWBnZzSzz/qH6ywXaY2W/5hXD0nUFcHFV3T5t+vXAr/YnIZI8v5/+d8APzrPvT9B9UyN9H4f2fznypKr6E+CNwD+bo4+bgZcmOaL/f7Tn0t2XnlVVfZXuF/CNPHYlsOcX+1tJnk53v3g2fwhcxcxXCzP5n/RXK0lOo7uFNpsbgbOSHNkvc1iSZ/XzPkz3ZvC5PLYds7Wnn/ZM4DtV9V/pgr/Zz/0VxK6hv/p6NfPYr6MkWd3X+eqq+so8F7sTWJnkxX0fT07yvP6q8u+SvKhvd87QMvcAJyR5UpJj6P4N5XQHA7v6UHgu8KIRbabX/yy6EHjV0JXvbMfKTL8DCzpW92em4xJVVTuA/zJi1luAdwK39eFwD/CvgU/x2C2m35mj+7cClyW5g+4k/Wbgq8AHk+x5sTDrvwmsqvvT/SvBT9G9IvuzqvrofLaN7kT6DuDYvq+/TfJ+utss99C9sTibP+q34ap5rm+PNwNXJTmb7sRwP93JZKSq+lKSNwCf6PfLP9BdTX29v8L6ErC2qj4/V/uhbn8UeEeSf+zn/8qIVW8Afj/JD9Dd4pvvfxi8re8Xult2P0R3tfne/nXE7qoazNZBVT2c5CzgXf1trRV0x9s2uttR70/ybbrbmHtug32a7o8C9ry5+4URXf8F8MtJbqMLn8/NY3vO6+v/SF//fVX18lmOlQ/R7bfvAo9e8Yx5rO6X/NptLTv9iWt9Vb16L5c7EHikqnb3r4jfV1Un7JMin4CSPL2qHuqHL6T7a6X/uMhlaR/wikHLSpJ3A6fT3aPfW6uBa/tX8w8DvzTJ2vYD/6p/5b2C7irovMUtR/uKVwySpIZvPkuSGgaDJKlhMEiSGgaDJKlhMEiSGgaDJKnx/wE1BT111UnhvQAAAABJRU5ErkJggg==\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plot(params, metrics)\n",
"\n",
"fig = matplotlib.pyplot.gcf()\n",
"pyplot.xlabel('Metrics for varying levels of L2 regularization')\n",
"pyplot.xscale('log')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# L1 regularization"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n",
"/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:11: RuntimeWarning: invalid value encountered in log\n",
" # This is added back by InteractiveShellApp.init_path()\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.0, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]\n",
"[nan, nan, nan, nan, nan, nan, nan]\n"
]
}
],
"source": [
"params = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]\n",
"\n",
"metrics = [evaluate(train, test, 10, 0.1, param, 'l1', False) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"L1 (1.0) number of zero weights: 6\n",
"L1 (10.0) number of zeros weights: 6\n",
"L1 (100.0) number of zeros weights: 6\n"
]
}
],
"source": [
"model_l1 = LinearRegressionWithSGD.train(train, 10, 0.1, regParam=1.0, regType='l1', intercept=False)\n",
"\n",
"model_l1_10 = LinearRegressionWithSGD.train(train, 10, 0.1, regParam=10.0, regType='l1', intercept=False)\n",
"\n",
"model_l1_100 = LinearRegressionWithSGD.train(train, 10, 0.1, regParam=100.0, regType='l1', intercept=False)\n",
"\n",
"print (\"L1 (1.0) number of zero weights: \" + str(sum(model_l1.weights.array == 0)))\n",
"\n",
"print (\"L1 (10.0) number of zeros weights: \" + str(sum(model_l1_10.weights.array == 0)))\n",
"\n",
"print (\"L1 (100.0) number of zeros weights: \" + str(sum(model_l1_100.weights.array == 0)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Intercept"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/akashsoni/spark/python/pyspark/mllib/regression.py:281: UserWarning: Deprecated in 2.0.0. Use ml.regression.LinearRegression.\n",
" warnings.warn(\"Deprecated in 2.0.0. Use ml.regression.LinearRegression.\")\n",
"/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:11: RuntimeWarning: invalid value encountered in log\n",
" # This is added back by InteractiveShellApp.init_path()\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[False, True]\n",
"[nan, nan]\n"
]
}
],
"source": [
"params = [False, True]\n",
"\n",
"metrics = [evaluate(train, test, 10, 0.1, 1.0, 'l2', param) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYYAAAEKCAYAAAAW8vJGAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFX1JREFUeJzt3X+0ZWV93/H3xxkFfyQwwIDIOBkUEjtEi/UuXNrowh8gmCgspRGT1DHVktVoarR2BWsTAV2JmFisv5JM1EhtEjBaV6aSFBBFiU2VO4A/RkXGEcMIVSyUVbRi0W//2M+V89ycy71zz7lzufB+rXXW2fvZz977+5w7cz9n733PPqkqJEma86DVLkCSdN9iMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKmzfrULWI7DDjustmzZstplSNKasnPnzu9U1cbF+q3JYNiyZQuzs7OrXYYkrSlJvrGUfp5KkiR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUmcqwZDklCTXJ9md5Owxyw9IcnFb/pkkW+Yt35zkziSvnUY9kqTlmzgYkqwD3gWcCmwFXpxk67xuLwNur6pjgAuA8+ctvwD4m0lrkSRNbhpHDCcAu6tqT1X9ALgIOG1en9OAC9v0h4BnJQlAktOBPcCuKdQiSZrQNILhKOCmkfm9rW1sn6q6G7gDODTJw4HfAs6dQh2SpCmYRjBkTFstsc+5wAVVdeeiO0nOSjKbZPbWW29dRpmSpKVYP4Vt7AUePTK/Cbh5gT57k6wHDgJuA54MnJHkLcDBwI+SfL+q3jl/J1W1HdgOMDMzMz94JElTMo1guBo4NsnRwDeBM4FfmtdnB7AN+DvgDODjVVXA0+Y6JDkHuHNcKEiS9p+Jg6Gq7k7ySuBSYB3wvqraleQ8YLaqdgDvBT6QZDfDkcKZk+5XkrQyMrxxX1tmZmZqdnZ2tcuQpDUlyc6qmlmsn598liR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1phIMSU5Jcn2S3UnOHrP8gCQXt+WfSbKltZ+UZGeSL7TnZ06jHknS8k0cDEnWAe8CTgW2Ai9OsnVet5cBt1fVMcAFwPmt/TvA86rq8cA24AOT1iNJmsw0jhhOAHZX1Z6q+gFwEXDavD6nARe26Q8Bz0qSqrq2qm5u7buAA5McMIWaJEnLNI1gOAq4aWR+b2sb26eq7gbuAA6d1+eFwLVVddcUapIkLdP6KWwjY9pqX/okOY7h9NLJC+4kOQs4C2Dz5s37XqUkaUmmccSwF3j0yPwm4OaF+iRZDxwE3NbmNwEfAV5SVV9baCdVtb2qZqpqZuPGjVMoW5I0zjSC4Wrg2CRHJ3kIcCawY16fHQwXlwHOAD5eVZXkYOAS4HVV9ekp1CJJmtDEwdCuGbwSuBT4MvDBqtqV5Lwkz2/d3gscmmQ38Bpg7k9aXwkcA/x2kuva4/BJa5IkLV+q5l8OuO+bmZmp2dnZ1S5DktaUJDuramaxfn7yWZLUMRgkSR2DQZLUMRgkSR2DQZLUMRgkSR2DQZLUMRgkSR2DQZLUMRgkSR2DQZLUMRgkSR2DQZLUMRgkSR2DQZLUMRgkSR2DQZLUMRgkSR2DQZLUMRgkSR2DQZLUMRgkSR2DQZLUMRgkSR2DQZLUMRgkSR2DQZLUMRgkSR2DQZLUMRgkSZ2pBEOSU5Jcn2R3kr
PHLD8gycVt+WeSbBlZ9rrWfn2S50yjHknS8k0cDEnWAe8CTgW2Ai9OsnVet5cBt1fVMcAFwPlt3a3AmcBxwCnAu9v2JEmrZBpHDCcAu6tqT1X9ALgIOG1en9OAC9v0h4BnJUlrv6iq7qqqrwO72/YkSatkGsFwFHDTyPze1ja2T1XdDdwBHLrEdSVJ+9E0giFj2mqJfZay7rCB5Kwks0lmb7311n0sUZK0VNMIhr3Ao0fmNwE3L9QnyXrgIOC2Ja4LQFVtr6qZqprZuHHjFMqWJI0zjWC4Gjg2ydFJHsJwMXnHvD47gG1t+gzg41VVrf3M9ldLRwPHAp+dQk2SpGVaP+kGquruJK8ELgXWAe+rql1JzgNmq2oH8F7gA0l2MxwpnNnW3ZXkg8CXgLuBV1TVDyetSZK0fBneuK8tMzMzNTs7u9plSNKakmRnVc0s1s9PPkuSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOgaDJKljMEiSOhMFQ5JDklye5Ib2vGGBfttanxuSbGttD0tySZKvJNmV5M2T1CJJmo5JjxjOBq6oqmOBK9p8J8khwBuAJwMnAG8YCZA/qKrHAU8E/mmSUyesR5I0oUmD4TTgwjZ9IXD6mD7PAS6vqtuq6nbgcuCUqvpeVX0CoKp+AFwDbJqwHknShCYNhiOq6haA9nz4mD5HATeNzO9tbT+W5GDgeQxHHZKkVbR+sQ5JPgY8csyi1y9xHxnTViPbXw/8BfD2qtpzL3WcBZwFsHnz5iXuWpK0rxYNhqp69kLLknwryZFVdUuSI4Fvj+m2FzhxZH4TcOXI/Hbghqp62yJ1bG99mZmZqXvrK0lavklPJe0AtrXpbcBfjelzKXBykg3tovPJrY0kbwIOAn5zwjokSVMyaTC8GTgpyQ3ASW2eJDNJ3gNQVbcBbwSubo/zquq2JJsYTkdtBa5Jcl2Sl09YjyRpQqlae2dlZmZmanZ2drXLkKQ1JcnOqppZrJ+ffJYkdQwGSVLHYJAkdQwGSVLHYJAkdQwGSVLHYJAkdQwGSVLHYJAkdQwGSVLHYJAkdQwGSVLHYJAkdQwGSVLHYJAkdQwGSVLHYJAkdQwGSVLHYJAkdQwGSVLHYJAkdQwGSVLHYJAkdQwGSVLHYJAkdQwGSVLHYJAkdQwGSVLHYJAkdQwGSVJnomBIckiSy5Pc0J43LNBvW+tzQ5JtY5bvSPLFSWqRJE3HpEcMZwNXVNWxwBVtvpPkEOANwJOBE4A3jAZIkhcAd05YhyRpSiYNhtOAC9v0hcDpY/o8B7i8qm6rqtuBy4FTAJI8AngN8KYJ65AkTcmkwXBEVd0C0J4PH9PnKOCmkfm9rQ3gjcBbge9NWIckaUrWL9YhyceAR45Z9Pol7iNj2irJ8cAxVfXqJFuWUMdZwFkAmzdvXuKuJUn7atFgqKpnL7QsybeSHFlVtyQ5Evj2mG57gRNH5jcBVwJPAZ6U5MZWx+FJrqyqExmjqrYD2wFmZmZqsbolScsz6amkHcDcXxltA/5qTJ9LgZOTbGgXnU8GLq2qP6yqR1XVFuDngK8uFAqSpP1n0mB4M3BSkhuAk9o8SWaSvAegqm5juJZwdXuc19okSfdBqVp7Z2VmZmZqdnZ2tcuQpDUlyc6qmlmsn598liR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUsdgkCR1UlWrXcM+S3Ir8I3VrmMfHQZ8Z7WL2M8c8wODY147fqqqNi7WaU0Gw1qUZLaqZla7jv
3JMT8wOOb7H08lSZI6BoMkqWMw7D/bV7uAVeCYHxgc8/2M1xgkSR2PGCRJHYNhipIckuTyJDe05w0L9NvW+tyQZNuY5TuSfHHlK57cJGNO8rAklyT5SpJdSd68f6vfN0lOSXJ9kt1Jzh6z/IAkF7fln0myZWTZ61r79Umesz/rnsRyx5zkpCQ7k3yhPT9zf9e+HJP8jNvyzUnuTPLa/VXziqgqH1N6AG8Bzm7TZwPnj+lzCLCnPW9o0xtGlr8A+HPgi6s9npUeM/Aw4Bmtz0OAq4BTV3tMC4xzHfA14DGt1s8BW+f1+XXgj9r0mcDFbXpr638AcHTbzrrVHtMKj/mJwKPa9M8C31zt8azkeEeWfxj4S+C1qz2eSR4eMUzXacCFbfpC4PQxfZ4DXF5Vt1XV7cDlwCkASR4BvAZ4036odVqWPeaq+l5VfQKgqn4AXANs2g81L8cJwO6q2tNqvYhh7KNGX4sPAc9KktZ+UVXdVVVfB3a37d3XLXvMVXVtVd3c2ncBByY5YL9UvXyT/IxJcjrDm55d+6neFWMwTNcRVXULQHs+fEyfo4CbRub3tjaANwJvBb63kkVO2aRjBiDJwcDzgCtWqM5JLTqG0T5VdTdwB3DoEte9L5pkzKNeCFxbVXetUJ3TsuzxJnk48FvAufuhzhW3frULWGuSfAx45JhFr1/qJsa0VZLjgWOq6tXzz1uutpUa88j21wN/Aby9qvbse4X7xb2OYZE+S1n3vmiSMQ8Lk+OA84GTp1jXSplkvOcCF1TVne0AYk0zGPZRVT17oWVJvpXkyKq6JcmRwLfHdNsLnDgyvwm4EngK8KQkNzL8XA5PcmVVncgqW8Exz9kO3FBVb5tCuStlL/DokflNwM0L9Nnbwu4g4LYlrntfNMmYSbIJ+Ajwkqr62sqXO7FJxvtk4IwkbwEOBn6U5PtV9c6VL3sFrPZFjvvTA/h9+guxbxnT5xDg6wwXXze06UPm9dnC2rn4PNGYGa6nfBh40GqPZZFxrmc4f3w091yYPG5en1fQX5j8YJs+jv7i8x7WxsXnScZ8cOv/wtUex/4Y77w+57DGLz6vegH3pwfDudUrgBva89wvvxngPSP9/gXDBcjdwK+O2c5aCoZlj5nhHVkBXwaua4+Xr/aY7mWszwW+yvCXK69vbecBz2/TBzL8Rcpu4LPAY0bWfX1b73ruo395Nc0xA/8e+O7Iz/U64PDVHs9K/oxHtrHmg8FPPkuSOv5VkiSpYzBIkjoGgySpYzBIkjoGgySpYzDcjyWpJB8YmV+f5NYkH11kveOTPPdels8kefs0ax2zj+fP3d0yyelJto4suzLJVL5vN8m/m8Z2Ftj2jUkOW8Z675kb72h9Sbbsj7vuJvnrdouSe+vz0iSPWula7qv7v78zGO7fvgv8bJKHtvmTgG8uYb3jGf6e+x9Isr6qZqvqX0+pxrGqakdVzd2G+3SGO5SuhBULhuWqqpdX1Zfa7H6vr6qeW1X/e5FuLwX26Rdz+6TwtOzz/rV0BsP9398AP9+mX8xwTyIAkjw8yfuSXJ3k2iSnJXkIwwd6XpTkuiQvSnJOku1JLgP+U5IT5446kjwiyZ+2++5/PskLk6xL8v4kX2ztrx4tqC3fk8HBSX6U5Olt2VVJjmnvCN+Z5KnA84Hfb/U8tm3mnyX5bJKvJnlaW/fAkVquTfKM1v7SJO8c2f9H2xjeDDy0bffP5r9wSf4wyWyG74o4d6T9xiTnJrmm7etxrf3QJJe1ff8xY+6rk+QXk/yHNv2qJHva9GOT/G2bvrIdlY2rb12SP2k1XTYS+qP7eF6G7wq4NsnHkhzR2s9pP+8r2+s/NtznjnTaEcqX5+8vyRkMH2D8s1bbQ5M8KcknM3z3wqUZbo8yN5bfTfJJ4FVJjkjykSSfa4+ntn6/0n6e1yX54yTrWvudSd7aXusrkmwct/9x49AEVvsTdj5W7gHcCTyB4fbABzJ8+vRE4KNt+e8Cv9KmD2b4xO
fDGd6NvXNkO+cAO4GHtvnRbZwPvG2k7wbgSQy32Z5rO3hMbf+N4VYRvwBczfDJ4AOAr7flP64BeD9wxsi6VwJvbdPPBT7Wpv8N8Kdt+nHA37dxzx/PR4ET516je3n95j7Fva7t8wlt/kbgN9r0r9M+4Q28HfidNv3zDJ/qPmzeNh8JXN2mP9TGfhSwDfi9kfHNzK+P4RPxdwPHt/kPzv385u1jA/d8be/LR16rc4D/3l7nw4D/BTx4zPo3tuUL7m9ejQ9u293Y5l8EvG+k37tHtn0x8Jsjr+tBwD8C/utcLcC7Ge6vRHsNf7lN/87Iv4kf79/H9B/eRO9+rqo+n+FurS8G/nre4pOB5+eeb5s6ENi8wKZ2VNX/HdP+bIZ7xszt7/b2LvgxSd4BXAJcNma9q4CnM9yX5veAfwl8kuEX5VL8l/a8k+EXGMDPAe9odXwlyTeAn17i9sb5xSRnMdxD50iG01mfH7P/F7Tpp89NV9UlSW6fv8Gq+p/tKOsnGG7G9udtvaeNbPPefL2qrhvZ95YxfTYBF7d37Q9huDfVnEtquP31XUm+DRzBcGO4Sfb3MwxfxnN5hjuLrgNuGVl+8cj0M4GXAFTVD4E7kvxzhjcTV7f1H8o9N2P80cj6/5mlvUaakKeSHhh2AH/AyGmkJgw3OTu+PTZX1ZcX2MZ3F2gP825NXMOX8fxjhnd1rwDeM2a9qxh+GZ7AEFgHMxyJfGqxwTRz9/b/IffcJXih+x3fTf9v/cDFNp7kaOC1wLOq6gkMATe63rj9w9Jup/13wK8y3Ddp7nV4CvDpJaw7+p0G8/c95x0M76wfD/zaAnXf2/r7ur8Au0b+HT2+qkZvs73Qv53R9S8cWf9nquqcBfp6D5/9wGB4YHgfcF5VfWFe+6XAbyQ//gaqJ7b2/wP8xBK3fRnwyrmZJBsy/CXOg6rqw8BvA/9kzHqfAZ4K/Kiqvs9wmuvXGH5RzrfUej4F/HKr46cZjn6uZzg1cnySByV5NP23p/2/JA8es62fZPiFdkc7R3/qPu7/VIZTOgv1e217vhZ4BnBXVd0xpu9C9d2bg7jnjwy27eO6SzX6M7ke2JjkKQBJHpzhexjGuQL4V63fuiQ/2drOSHJ4az8kyU+1/g8CzmjTvwT87Zj9a8oMhgeAqtpbVf9xzKI3Mpwf/nyGP4N8Y2v/BLC1Xdh70SKbfxOwIcOF5s8x/JI7CrgyyXUM1wdeN6amuxi+Cet/tKarGP6jzw8vGL5i8d+2i6mPHbN8zrsZLs5+geH0w0vbfj7NcDrlCwxHTteMrLO9jb+7+FxVn2P4pb2LIViX8m7+XODpSa5hOE339wv0u4rhNNKn2umUm7jnF958Y+tbxDnAXya5CvjOPqy3L94P/FH7Ga9j+OV9fvs3cB1D6I/zKuAZ7We0k+G21l9iuBvrZUk+z/DVr0e2/t8Fjkuyk+E01Hnz9+/F5+nz7qqS7rOS3FlVj1jtOh5oPGKQJHU8YpAkdTxikCR1DAZJUsdgkCR1DAZJUsdgkCR1DAZJUuf/A6flJmuHzJr7AAAAAElFTkSuQmCC\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"bar(params, metrics, color='lightblue')\n",
"pyplot.xlabel('Metrics without and with an intercept')\n",
"fig = matplotlib.pyplot.gcf()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Decision Tree"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [],
"source": [
"def extract_features_dt(fields):\n",
" features=np.zeros(total_dt)\n",
" step=0\n",
" for i in type_columns:\n",
" features[step]=float(type_maps[i][fields[i]])\n",
" step=step+1\n",
" \n",
" for i in type_columns_with_NA:\n",
" features[step]=float(type_maps[i][fields[i]])\n",
" step=step+1\n",
" for i in number_columns:\n",
" features[step]=float(fields[i])\n",
" step=step+1\n",
" return features"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [],
"source": [
"data_dt=records.map(lambda fields: LabeledPoint(float(fields[saleprice_column]),extract_features_dt(fields)))"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[LabeledPoint(208500.0, [0.0,0.0,0.0,2.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,5.0,2.0,3.0,0.0,0.0,1.0,2.0,0.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,3.0,2.0,0.0,3.0,5.0,0.0,5.0,2.0,3.0,3.0,0.0,0.0,0.0,60.0,8450.0,7.0,5.0,2003.0,2003.0,706.0,0.0,150.0,856.0,856.0,854.0,0.0,1710.0,1.0,0.0,2.0,1.0,3.0,1.0,8.0,0.0,2.0,548.0,0.0,61.0,0.0,0.0,0.0,0.0,0.0,2.0,2008.0]), LabeledPoint(181500.0, [0.0,0.0,0.0,2.0,1.0,0.0,0.0,10.0,1.0,0.0,0.0,6.0,2.0,3.0,6.0,6.0,2.0,2.0,1.0,1.0,2.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,3.0,3.0,3.0,3.0,5.0,3.0,5.0,2.0,3.0,3.0,0.0,0.0,0.0,20.0,9600.0,6.0,8.0,1976.0,1976.0,978.0,0.0,284.0,1262.0,1262.0,0.0,0.0,1262.0,0.0,1.0,2.0,0.0,3.0,1.0,6.0,1.0,2.0,460.0,298.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0]), LabeledPoint(223500.0, [0.0,0.0,1.0,2.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,5.0,2.0,3.0,0.0,0.0,1.0,2.0,0.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,3.0,0.0,0.0,3.0,5.0,3.0,5.0,2.0,3.0,3.0,0.0,0.0,0.0,60.0,11250.0,7.0,5.0,2001.0,2002.0,486.0,0.0,434.0,920.0,920.0,866.0,0.0,1786.0,1.0,0.0,2.0,1.0,3.0,1.0,6.0,1.0,2.0,608.0,0.0,42.0,0.0,0.0,0.0,0.0,0.0,9.0,2008.0]), LabeledPoint(140000.0, [0.0,0.0,1.0,2.0,1.0,3.0,0.0,11.0,0.0,0.0,0.0,5.0,2.0,3.0,7.0,1.0,2.0,2.0,2.0,1.0,3.0,1.0,1.0,0.0,1.0,0.0,3.0,0.0,0.0,3.0,4.0,2.0,3.0,3.0,5.0,4.0,6.0,3.0,3.0,3.0,0.0,0.0,0.0,70.0,9550.0,7.0,5.0,1915.0,1970.0,216.0,0.0,540.0,756.0,961.0,756.0,0.0,1717.0,1.0,0.0,1.0,0.0,3.0,1.0,7.0,1.0,3.0,642.0,0.0,35.0,272.0,0.0,0.0,0.0,0.0,2.0,2006.0]), LabeledPoint(250000.0, [0.0,0.0,1.0,2.0,1.0,0.0,0.0,12.0,0.0,0.0,0.0,5.0,2.0,3.0,0.0,0.0,1.0,2.0,0.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,3.0,4.0,0.0,3.0,5.0,3.0,5.0,2.0,3.0,3.0,0.0,0.0,0.0,60.0,14260.0,8.0,5.0,2000.0,2000.0,655.0,0.0,490.0,1145.0,1145.0,1053.0,0.0,2198.0,1.0,0.0,2.0,1.0,4.0,1.0,9.0,1.0,3.0,836.0,192.0,84.0,0.0,0.0,0.0,0.0,0.0,12.0,2008.0]), LabeledPoint(143000.0, 
[0.0,0.0,1.0,2.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,3.0,0.0,0.0,2.0,2.0,3.0,1.0,2.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,3.0,2.0,0.0,3.0,5.0,0.0,5.0,3.0,3.0,3.0,0.0,1.0,1.0,50.0,14115.0,5.0,5.0,1993.0,1995.0,732.0,0.0,64.0,796.0,796.0,566.0,0.0,1362.0,1.0,0.0,1.0,1.0,1.0,1.0,5.0,0.0,2.0,480.0,40.0,30.0,0.0,320.0,0.0,0.0,700.0,10.0,2009.0]), LabeledPoint(307000.0, [0.0,0.0,0.0,2.0,1.0,2.0,0.0,13.0,0.0,0.0,0.0,6.0,2.0,3.0,0.0,0.0,1.0,2.0,0.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,3.0,4.0,3.0,4.0,0.0,3.0,5.0,4.0,5.0,2.0,3.0,3.0,0.0,0.0,0.0,20.0,10084.0,8.0,5.0,2004.0,2005.0,1369.0,0.0,317.0,1686.0,1694.0,0.0,0.0,1694.0,1.0,0.0,2.0,0.0,3.0,1.0,7.0,1.0,2.0,636.0,255.0,57.0,0.0,0.0,0.0,0.0,0.0,8.0,2007.0]), LabeledPoint(200000.0, [0.0,0.0,1.0,2.0,1.0,3.0,0.0,2.0,2.0,0.0,0.0,5.0,2.0,3.0,8.0,7.0,2.0,2.0,1.0,1.0,2.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,3.0,2.0,3.0,0.0,3.0,4.0,5.0,3.0,5.0,2.0,3.0,3.0,0.0,0.0,1.0,60.0,10382.0,7.0,6.0,1973.0,1973.0,859.0,32.0,216.0,1107.0,1107.0,983.0,0.0,2090.0,1.0,0.0,2.0,1.0,3.0,1.0,7.0,2.0,2.0,484.0,235.0,204.0,228.0,0.0,0.0,0.0,350.0,11.0,2009.0]), LabeledPoint(129900.0, [2.0,0.0,0.0,2.0,1.0,2.0,0.0,14.0,3.0,0.0,0.0,0.0,2.0,3.0,9.0,1.0,2.0,2.0,2.0,1.0,3.0,1.0,2.0,3.0,1.0,0.0,3.0,0.0,0.0,3.0,3.0,2.0,4.0,3.0,0.0,3.0,6.0,3.0,0.0,3.0,0.0,0.0,0.0,50.0,6120.0,7.0,5.0,1931.0,1950.0,0.0,0.0,952.0,952.0,1022.0,752.0,0.0,1774.0,0.0,0.0,2.0,0.0,2.0,2.0,8.0,2.0,2.0,468.0,90.0,0.0,205.0,0.0,0.0,0.0,0.0,4.0,2008.0]), LabeledPoint(118000.0, [0.0,0.0,0.0,2.0,1.0,3.0,0.0,15.0,3.0,1.0,3.0,1.0,2.0,3.0,6.0,6.0,2.0,2.0,2.0,1.0,2.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,3.0,3.0,2.0,0.0,3.0,5.0,3.0,5.0,2.0,4.0,3.0,0.0,0.0,0.0,190.0,7420.0,5.0,6.0,1939.0,1950.0,851.0,0.0,140.0,991.0,1077.0,0.0,0.0,1077.0,1.0,0.0,1.0,0.0,2.0,2.0,5.0,2.0,1.0,205.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,2008.0])]\n"
]
}
],
"source": [
"print(data_dt.take(10))"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Decision Tree feature vector: [0.0,0.0,0.0,2.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,5.0,2.0,3.0,0.0,0.0,1.0,2.0,0.0,1.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,3.0,2.0,0.0,3.0,5.0,0.0,5.0,2.0,3.0,3.0,0.0,0.0,0.0,60.0,8450.0,7.0,5.0,2003.0,2003.0,706.0,0.0,150.0,856.0,856.0,854.0,0.0,1710.0,1.0,0.0,2.0,1.0,3.0,1.0,8.0,0.0,2.0,548.0,0.0,61.0,0.0,0.0,0.0,0.0,0.0,2.0,2008.0]\n",
"Decision Tree feature vector length: 76\n"
]
}
],
"source": [
"first_point_dt = data_dt.first()\n",
"print (\"Decision Tree feature vector: \" + str(first_point_dt.features))\n",
"print (\"Decision Tree feature vector length: \" + str(len(first_point_dt.features)))"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.mllib.tree import DecisionTree"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Decision Tree predictions: [(208500.0, 190334.33561643836), (181500.0, 147907.61375661375), (223500.0, 190334.33561643836), (140000.0, 156058.38888888888), (250000.0, 307760.1111111111)]\n",
"Decision Tree depth: 5\n",
"Decision Tree number of nodes: 63\n"
]
}
],
"source": [
"dt_model = DecisionTree.trainRegressor(data_dt,{})\n",
"preds = dt_model.predict(data_dt.map(lambda p: p.features))\n",
"actual = data.map(lambda p: p.label)\n",
"true_vs_predicted_dt = actual.zip(preds)\n",
"print (\"Decision Tree predictions: \" + str(true_vs_predicted_dt.take(5)))\n",
"print (\"Decision Tree depth: \" + str(dt_model.depth()))\n",
"print (\"Decision Tree number of nodes: \" + str(dt_model.numNodes()))"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1460\n",
"log - Mean Squared Error: 875573280.8278\n",
"log - Mean Absolue Error: 21582.1548\n",
"Root Mean Squared Log Error: 0.1736\n"
]
}
],
"source": [
"nn=[]\n",
"ab=[]\n",
"s_log=[]\n",
"for i in true_vs_predicted_dt.collect():\n",
" real,predict=i[0],i[1]\n",
" value=(predict - real)**2\n",
" value1=np.abs(predict - real)\n",
" value2=(np.log(predict + 1) - np.log(real + 1))**2\n",
" nn.append(value)\n",
" ab.append(value1)\n",
" s_log.append(value2)\n",
"value_len=len(nn)\n",
"print( value_len)\n",
"ss=sum(nn)\n",
"t=ss/value_len\n",
"ab_sum=sum(ab)\n",
"ab_mean=ab_sum/value_len\n",
"s_log_sum=sum(s_log)\n",
"s_log_mean=np.sqrt(s_log_sum/value_len)\n",
"print (\"log - Mean Squared Error: %2.4f\" % t)\n",
"print(\"log - Mean Absolue Error: %2.4f\" % ab_mean)\n",
"print(\"Root Mean Squared Log Error: %2.4f\" % s_log_mean)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Impact of training on log-transformed targets"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [],
"source": [
"data_dt_log = data_dt.map(lambda lp: LabeledPoint(np.log(lp.label), lp.features))\n",
"\n",
"dt_model_log = DecisionTree.trainRegressor(data_dt_log,{})\n",
"\n",
"preds_log = dt_model_log.predict(data_dt_log.map(lambda p: p.features))\n",
"\n",
"actual_log = data_dt_log.map(lambda p: p.label)"
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [],
"source": [
"new=actual_log.zip(preds_log)"
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(12.247694320220994, 12.147159998151047),\n",
" (12.109010932687042, 11.890912291269839),\n",
" (12.31716669303576, 12.147159998151047),\n",
" (11.84939770159144, 11.949554245993713),\n",
" (12.429216196844383, 12.515673640608348)]"
]
},
"execution_count": 136,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new.take(5)"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [],
"source": [
"true_vs_predicted_dt_log=[]\n",
"for val in new.collect():\n",
" t,p=val[0],val[1]\n",
" x=np.exp(t),np.exp(p)\n",
" true_vs_predicted_dt_log.append(x)"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1460\n",
"log - Mean Squared Error: 1022580494.4448\n",
"log - Mean Absolue Error: 21569.5794\n",
"Root Mean Squared Log Error: 0.1610\n",
"Non log-transformed predictions:\n",
"[(208500.0, 190334.33561643836), (181500.0, 147907.61375661375), (223500.0, 190334.33561643836)]\n"
]
}
],
"source": [
"nn=[]\n",
"ab=[]\n",
"s_log=[]\n",
"for i in true_vs_predicted_dt_log:\n",
" real,predict=i[0],i[1]\n",
" value=(predict - real)**2\n",
" value1=np.abs(predict - real)\n",
" value2=(np.log(predict + 1) - np.log(real + 1))**2\n",
" nn.append(value)\n",
" ab.append(value1)\n",
" s_log.append(value2)\n",
"value_len=len(nn)\n",
"print( value_len)\n",
"ss=sum(nn)\n",
"t=ss/value_len\n",
"ab_sum=sum(ab)\n",
"ab_mean=ab_sum/value_len\n",
"s_log_sum=sum(s_log)\n",
"s_log_mean=np.sqrt(s_log_sum/value_len)\n",
"print (\"log - Mean Squared Error: %2.4f\" % t)\n",
"print(\"log - Mean Absolue Error: %2.4f\" % ab_mean)\n",
"print(\"Root Mean Squared Log Error: %2.4f\" % s_log_mean)\n",
"print (\"Non log-transformed predictions:\\n\" + str(true_vs_predicted_dt.take(3)))\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CROSS VALIDATION for the decision tree"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [],
"source": [
"train_dt, test_dt = data_dt.randomSplit([0.8, 0.2], seed=12345)"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [],
"source": [
"def evaluate_dt(train, test, maxDepth, maxBins):\n",
"\n",
" model = DecisionTree.trainRegressor(train, {}, impurity='variance', maxDepth=maxDepth, maxBins=maxBins)\n",
"\n",
" preds = model.predict(test.map(lambda p: p.features))\n",
"\n",
" actual = test.map(lambda p: p.label)\n",
"\n",
" tp = actual.zip(preds)\n",
" new_val=[]\n",
" for i in tp.collect():\n",
" actual=i[0]\n",
" pred=i[1]\n",
" va=(np.log(pred + 1) - np.log(actual + 1))**2\n",
" new_val.append(va)\n",
" lenth=len(new_val)\n",
" s_new_val=sum(new_val)\n",
" mean_new_val=s_new_val/lenth\n",
" rmsle=np.sqrt(mean_new_val)\n",
" return rmsle"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tree depth"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1, 2, 3, 4, 5, 10, 20]\n",
"[0.332943090421251, 0.2770328548990305, 0.25563569006835973, 0.24091676957589137, 0.212163652773227, 0.22209830650755852, 0.23462193469250922]\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"params = [1, 2, 3, 4, 5, 10, 20]\n",
"\n",
"metrics = [evaluate_dt(train_dt, test_dt, param, 32) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)\n",
"\n",
"plot(params, metrics)\n",
"pyplot.xlabel('Metrics for different tree depths')\n",
"fig = matplotlib.pyplot.gcf()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Maximum bins"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2, 4, 8, 16, 32, 64, 100]\n",
"[0.22578199542260993, 0.22626606160811255, 0.20380255431723798, 0.2076920210675261, 0.212163652773227, 0.21000218813883056, 0.2228581552832826]\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"params = [2, 4, 8, 16, 32, 64, 100]\n",
"\n",
"metrics = [evaluate_dt(train_dt, test_dt, 5, param) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)\n",
"\n",
"plot(params, metrics)\n",
"pyplot.xlabel('Metrics for different maximum bins')\n",
"fig = matplotlib.pyplot.gcf()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gradient BOOSTED TREE"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel\n"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [],
"source": [
"def extract_label(record):\n",
" return float(record[-1])"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {},
"outputs": [],
"source": [
"data_gbt = records.map(lambda r: LabeledPoint(extract_label(r),extract_features_dt(r)))"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [],
"source": [
"(traindata, testData) = data_gbt.randomSplit([0.7, 0.3])"
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Gradient BOOSTED predictions: [(307000.0, 265005.81171157816), (118000.0, 133611.14615160643), (279500.0, 188655.2935917073), (149000.0, 126758.74131085054), (139000.0, 129323.12870531864)]\n"
]
}
],
"source": [
"model = GradientBoostedTrees.trainRegressor(traindata,\n",
" categoricalFeaturesInfo={}, numIterations=3)\n",
"preds = model.predict(testData.map(lambda p: p.features))\n",
"actual = testData.map(lambda p: p.label)\n",
"true_vs_predicted_GBT = actual.zip(preds)\n",
"print (\"Gradient BOOSTED predictions: \" + str(true_vs_predicted_GBT.take(5)))\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 148,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"413\n",
"log - Mean Squared Error: 1852793539.9857\n",
"log - Mean Absolue Error: 29263.6311\n",
"Root Mean Squared Log Error: 0.2392\n"
]
}
],
"source": [
"nn=[]\n",
"ab=[]\n",
"s_log=[]\n",
"for i in true_vs_predicted_GBT.collect():\n",
" real,predict=i[0],i[1]\n",
" value=(predict - real)**2\n",
" value1=np.abs(predict - real)\n",
" value2=(np.log(predict + 1) - np.log(real + 1))**2\n",
" nn.append(value)\n",
" ab.append(value1)\n",
" s_log.append(value2)\n",
"value_len=len(nn)\n",
"print( value_len)\n",
"ss=sum(nn)\n",
"t=ss/value_len\n",
"ab_sum=sum(ab)\n",
"ab_mean=ab_sum/value_len\n",
"s_log_sum=sum(s_log)\n",
"\n",
"s_log_mean=np.sqrt(s_log_sum/value_len)\n",
"print (\"log - Mean Squared Error: %2.4f\" % t)\n",
"print(\"log - Mean Absolue Error: %2.4f\" % ab_mean)\n",
"print(\"Root Mean Squared Log Error: %2.4f\" % s_log_mean)"
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {},
"outputs": [],
"source": [
"def evaluate_dt(traindata,categoricalFeaturesInfo, loss, numIterations, maxDepth, maxBins):\n",
"    # Train a gradient-boosted tree regressor on traindata and return its RMSLE on the global testData.\n",
"    model = GradientBoostedTrees.trainRegressor(traindata,categoricalFeaturesInfo, loss,numIterations,maxDepth=maxDepth, maxBins=maxBins)\n",
"\n",
"    preds = model.predict(testData.map(lambda p: p.features))\n",
"\n",
"    actual = testData.map(lambda p: p.label)\n",
"\n",
"    tp = actual.zip(preds)\n",
"    new_val=[]\n",
"    for i in tp.collect():\n",
"        real=i[0]\n",
"        pred=i[1]\n",
"        va=(np.log(pred + 1) - np.log(real + 1))**2\n",
"        new_val.append(va)\n",
"    lenth=len(new_val)\n",
"    s_new_val=sum(new_val)\n",
"    mean_new_val=s_new_val/lenth\n",
"    rmsle=np.sqrt(mean_new_val)\n",
"    return rmsle"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gradient boost tree iterations"
]
},
{
"cell_type": "code",
"execution_count": 150,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2, 4, 8, 16, 32, 64, 100]\n",
"[0.25905666523741905, 0.2590563768733536, 0.25905580014870655, 0.25905464671334816, 0.259052339898376, 0.2590477264914201, 0.25904253676400585]\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"params = [2, 4, 8, 16, 32, 64, 100]\n",
"\n",
"metrics = [evaluate_dt(traindata, {},'leastAbsoluteError', param,3, 32) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)\n",
"\n",
"plot(params, metrics)\n",
"\n",
"fig = matplotlib.pyplot.gcf()\n",
"pyplot.xlabel('Number of iterations')\n",
"pyplot.ylabel('RMSLE')\n",
"pyplot.xscale('log')"
]
},
{
"cell_type": "code",
"execution_count": 151,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2, 4, 8, 16, 32, 64, 100]\n",
"[0.24489669490739654, 0.26140602081099523, 0.2619618739499482, 0.25816082247564837, 0.25905551178812486, 0.25776353461608653, 0.25866605672527904]\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"params = [2, 4, 8, 16, 32, 64, 100]\n",
"\n",
"metrics = [evaluate_dt(traindata, {},'leastAbsoluteError',10,3, param) for param in params]\n",
"\n",
"print (params)\n",
"\n",
"print (metrics)\n",
"\n",
"plot(params, metrics)\n",
"pyplot.xlabel('Maximum bins')\n",
"pyplot.ylabel('RMSLE')\n",
"fig = matplotlib.pyplot.gcf()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor":...
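The notebook cells above compute three error metrics for the gradient boosted tree predictions: mean squared error, mean absolute error, and root mean squared log error (RMSLE). As a minimal sketch, the same three quantities can be computed outside Spark on a plain list of (actual, predicted) pairs. The helper name `regression_metrics` is hypothetical and not part of the notebook, and the sketch uses the standard-library `math` module in place of NumPy. The sample pairs are the five values printed by the "Gradient BOOSTED predictions" cell.

```python
import math


def regression_metrics(pairs):
    """Return (MSE, MAE, RMSLE) for an iterable of (actual, predicted) pairs.

    Hypothetical helper for illustration; assumes every value is greater
    than -1 so that log1p is defined.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (actual, predicted) pair")
    n = len(pairs)
    # Mean of squared differences.
    mse = sum((pred - real) ** 2 for real, pred in pairs) / n
    # Mean of absolute differences.
    mae = sum(abs(pred - real) for real, pred in pairs) / n
    # log1p(x) equals log(x + 1), matching the notebook's np.log(x + 1).
    msle = sum((math.log1p(pred) - math.log1p(real)) ** 2 for real, pred in pairs) / n
    return mse, mae, math.sqrt(msle)


# First five (actual, predicted) pairs shown in the notebook output.
sample = [
    (307000.0, 265005.81171157816),
    (118000.0, 133611.14615160643),
    (279500.0, 188655.2935917073),
    (149000.0, 126758.74131085054),
    (139000.0, 129323.12870531864),
]

mse, mae, rmsle = regression_metrics(sample)
print("Mean Squared Error: %2.4f" % mse)
print("Mean Absolute Error: %2.4f" % mae)
print("Root Mean Squared Log Error: %2.4f" % rmsle)
```

In the notebook itself the pairs would come from `true_vs_predicted_GBT.collect()`. Because this sample contains only five of the 413 test rows, its values will not match the figures the notebook reports.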