

Apache Spark: Fit a Binary Logistic Regression Model to a Dataset



Dataset: Dropbox link for baby-names



Output: Jupyter Notebook (please display the output)






Requirements for the exercise:


1. Build a Classification Model: In this exercise, you will fit a binary logistic regression model to the baby-names dataset you used in the previous exercise. The model will predict a person's sex based on their age, name, and the state they were born in. To train the model, use the data found in baby-names/names-classifier.



2. Prepare the Input Features: First, prepare each of the input features. While age is a numeric feature, state and name are not; these need to be converted into numeric vectors before you can train the model. Use a StringIndexer along with the OneHotEncoderEstimator to convert the name, state, and sex columns into numeric vectors, and use the VectorAssembler to combine the name, state, and age vectors into a single features vector. Your final dataset should contain a column called features holding the prepared vector and a column called label holding the sex of the person.



3. Fit and Evaluate the Model: Fit a logistic regression model with the following parameters: LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8). Report the area under the ROC curve for the model.
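The indexing and encoding required in step 2 can be illustrated without Spark. The sketch below mimics what StringIndexer, a one-hot encoder, and VectorAssembler produce for a toy state/age table; it is plain Python rather than the Spark API, and it indexes categories by first appearance, whereas Spark's StringIndexer actually orders labels by descending frequency and OneHotEncoderEstimator drops the last category by default.

```python
def string_index(values):
    # Mimics StringIndexer: map each distinct string to an integer index.
    # (Toy version: index by first appearance; Spark orders by frequency.)
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

def one_hot(index, size):
    # Mimics a one-hot encoder: a vector with a single 1.0 at `index`.
    # (Spark's encoder drops the last category by default; this keeps all.)
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

# Toy rows: state of birth and age (name would be indexed the same way).
states = ["NY", "CA", "NY", "TX"]
ages = [30.0, 25.0, 41.0, 19.0]

idx, mapping = string_index(states)
# Mimics VectorAssembler: concatenate the encoded state with the numeric age.
features = [one_hot(i, len(mapping)) + [a] for i, a in zip(idx, ages)]
print(features[0])  # [1.0, 0.0, 0.0, 30.0]
```

The resulting per-row lists play the role of the features column the exercise asks for; in Spark these would be sparse ML vectors rather than Python lists.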


Provide insights and a summary of the model output (500 words).
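The area under the ROC curve requested in step 3 has a simple rank interpretation: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. As a hedged aside, here is a minimal pure-Python computation of that statistic, independent of Spark's BinaryClassificationEvaluator, which a full solution would normally use:

```python
def roc_auc(labels, scores):
    # AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
    # pairs in which the positive example gets the higher score (ties = 0.5).
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 3 of the 4 positive/negative pairs are ranked correctly -> AUC = 0.75
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale against which the fitted model's score should be read.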


Answered Same Day, Oct 31, 2021


Ximi answered on Nov 03 2021
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "spark-grey.ipynb",
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "code",
"metadata": {
"id": "YdjxUFOKgbPX",
"colab_type": "code",
"colab": {}
},
"source": [
"import pandas\n",
"!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
"!wget -q http://www-eu.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz\n",
"!tar xvf spark-2.4.4-bin-hadoop2.7.tgz\n",
"!pip install -q findspark"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "TDbXbmKege-K",
"colab_type": "code",
"colab": {}
},
"source": [
"import os\n",
"os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
"os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.4-bin-hadoop2.7\""
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "g9YMz0J4g0fS",
"colab_type": "code",
"colab": {}
},
"source": [
"import findspark\n",
"findspark.init()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "SXFqUHmxoK0e",
"colab_type": "code",
"colab": {}
},
"source": [
"from pyspark.sql import SparkSession\n",
"spark = SparkSession.builder.master(\"local[*]\").getOrCreate()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Ed_uGjoWoZ58",
"colab_type": "code",
"colab": {}
},
"source": [
"sc = spark.sparkContext\n",
"from pyspark.sql import SQLContext\n",
"sqlContext = SQLContext(sc)\n"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "qMQhqdhHobWW",
"colab_type": "code",
"colab": {}
},
"source": [
"import glob\n",
"# List all *.parquet files\n",
"files = glob.glob('*.parquet')"
],
"execution_count": 0,
"outputs": []
},
{
...