Apache Spark in Databricks ... I am having some trouble understanding a couple of items and will not have time to hand this in on time on my own, as I am working long hours.


Assignment 4 - Spark ML

Learning Outcomes

In this assignment you will:
· Use ML pipelines
· Improve a Random Forest model
· Perform hyperparameter tuning

Your assignment must be submitted in the following file type:
· Databricks notebook file (DBC)

**Question 1:** (5 marks)

In module 5 we used an ML pipeline to build linear regression and random forest models. However, the results are still a bit disappointing. We are working with very few features, and we have likely made some assumptions that are not quite valid (like zip code shortening). Also, just because a rich zip code exists doesn't mean that the farmer's market would be held in that zip code too. In fact, we might want to start looking at neighboring zip codes, or use some sort of distance measure to predict whether a farmer's market exists within a certain mile radius of a wealthy zip code.

We have a lot of other potential features and plenty of other parameters to tune on our random forest, so experiment with the above pipeline and see what results you get. You may use the same classifier we built in the notebook in this module and build upon it. You may also carry out hyperparameter tuning.

Read from the following notebook, command cells 65 to 82:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5915990090493625/2324680169011732/6085673883631125/latest.html

**Question 2:** (7 marks)

Using the Apache Spark ML pipeline, build a model to predict the price of a diamond based on the available features. Read from the following notebook for details about the dataset:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5915990090493625/4396972618536508/6085673883631125/latest.html
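The distance measure suggested in Question 1 could be sketched roughly as follows. This is only an illustration in plain Python, not code from the course notebook: the function names `haversine_miles` and `has_market_within` are hypothetical, and it assumes that each zip code and each farmer's market has been reduced to a (latitude, longitude) pair in degrees. The great-circle (haversine) formula is one common choice of distance; others would work too.

```python
import math

# Mean Earth radius in miles (an approximation; the Earth is not a perfect sphere).
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def has_market_within(zip_centroid, market_coords, radius_miles):
    """Return True if any market lies within radius_miles of the zip centroid.

    zip_centroid  -- hypothetical (lat, lon) pair for a zip code
    market_coords -- iterable of (lat, lon) pairs, one per farmer's market
    """
    lat, lon = zip_centroid
    return any(
        haversine_miles(lat, lon, m_lat, m_lon) <= radius_miles
        for m_lat, m_lon in market_coords
    )
```

A boolean like this could serve as the label (market within N miles of a wealthy zip code) in place of the exact zip code match. In Spark the same formula would have to be applied per row, for example through a UDF or column expressions, which is left out here.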
Answered Same Day Mar 04, 2021

Abr Writing answered on Mar 08 2021
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ykBV4nGNkZcQ"
},
"source": [
"Importing the necessary libraries"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "ywM9xDqWkZM1"
},
"outputs": [],
"source": [
"from pyspark import SparkConf, SparkContext\n",
"from pyspark.sql import SQLContext\n",
"\n",
"from pyspark.ml.feature import *\n",
"from pyspark.ml import Pipeline\n",
"from pyspark.ml.regression import *\n",
"from pyspark.ml.evaluation import *\n",
"from pyspark.ml.tuning import ParamGridBuilder, CrossValidator\n",
"\n",
"# Reuse the running SparkContext if one exists (as on Databricks); otherwise create one\n",
"sc = SparkContext.getOrCreate(SparkConf().setAppName('Diamonds'))\n",
"sqlContext = SQLContext(sc)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "sy6yx7k0jhL_"
},
"source": [
"# Question 2"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "4z1R7c0ek9Ao"
},
"source": [
"Loading the diamonds dataset as a Spark DataFrame and displaying the first five rows"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 124
},
"colab_type": "code",
"id": "QUK06JwhjgsA",
"outputId": "dc37c924-6641-490f-8217-32ded68a6654"
},
"outputs": [
{
"data": {
"text/plain": [
"[Row(carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55.0, price=326, x=3.95, y=3.98, z=2.43),\n",
" Row(carat=0.21, cut='Premium', color='E', clarity='SI1', depth=59.8, table=61.0, price=326, x=3.89, y=3.84, z=2.31),\n",
" Row(carat=0.23, cut='Good', color='E', clarity='VS1', depth=56.9, table=65.0, price=327, x=4.05, y=4.07, z=2.31),\n",
" Row(carat=0.29, cut='Premium', color='I', clarity='VS2', depth=62.4, table=58.0, price=334, x=4.2, y=4.23, z=2.63),\n",
" Row(carat=0.31, cut='Good', color='J', clarity='SI2', depth=63.3, table=58.0, price=335, x=4.34, y=4.35, z=2.75)]"
]
},
"execution_count": 5,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"diamonds = (sqlContext.read.format(\"com.databricks.spark.csv\")\n",
"    .option(\"header\", \"true\")\n",
"    .option(\"inferSchema\", \"true\")\n",
"    .load('diamonds.csv'))\n",
"\n",
"diamonds.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "hhd41ewHzR7L"
},
"source": [
"Printing the data schema"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 225
},
"colab_type": "code",
"id": "QyhE3MPJzSbP",
"outputId": "985a0884-63c4-434d-80c5-1d57d4e53623"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"root\n",
" |-- carat: double (nullable = true)\n",
" |-- cut: string (nullable = true)\n",
" |-- color: string (nullable = true)\n",
" |-- clarity: string (nullable = true)\n",
" |-- depth: double (nullable = true)\n",
" |-- table: double (nullable = true)\n",
" |-- price: integer (nullable = true)\n",
" |-- x: double (nullable = true)\n",
" |-- y: double (nullable = true)\n",
" |-- z: double (nullable = true)\n",
"\n"
]
}
],
"source": [
"diamonds.printSchema()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "VWjwvYxl6mp0"
},
"source": [
"## Random Forest Regressor"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
},
"colab_type": "code",
"id": "BKpCpFzE7yaH",
"outputId": "75ce2836-10b9-48be-99ed-0d5652da7a79"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3 categorical features\n",
"6 numerical features\n"
]
}
],
"source": [
"cat_cols = [item[0] for item in diamonds.dtypes if item[1].startswith('string')] \n",
"print(str(len(cat_cols)) + ' categorical features')\n",
"num_cols = [item[0] for item in diamonds.dtypes if...