Module 5 Coding Assignment: Linear Regression with Python

Question

Harsimran · Accepted Answer

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "deletable": false,
    "editable": false,
    "id": "viK_dHiXBEo2",
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "source": [
    "# Module 5: Linear Regression - Interactions and Transformations
",
    "
",
    "**_Authors: Favio Vázquez and Jessica Cervi_**
",
    "
",
    " In this assignment, we will first perform some feature transformations on a dataset describing airplane accidents, and then we will study a complete example of linear regression.
",
    " 
",
    " 
",
    " 
",
    "### Index:
",
    "
",
    "
",
    "- [Question 1](#q01)
",
    "- [Question 2](#q02)
",
    "- [Question 3](#q03)
",
    "- [Question 4](#q04)
",
    "- [Question 5](#q05)
",
    "- [Question 6](#q06)
",
    "- [Question 7](#q07)
",
    "- [Question 8](#q08)
",
    "- [Question 9](#q09)
",
    "- [Question 10](#q10)
",
    "- [Question 11](#q11)
",
    "- [Question 12](#q12)
"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "deletable": false,
    "editable": false,
    "id": "qY13QYshBEo3",
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "source": [
    "## Import the necessary libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "## DON'T CHANGE THIS CODE
",
    "import warnings
",
    "warnings.filterwarnings('ignore')
",
    "import pandas as pd
",
    "import numpy as np
",
    "import matplotlib.pyplot as plt
",
    "import seaborn as sns
",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "source": [
    "
",
    "[Return to top](#questions)
",
    "
",
    "## Question 1
",
    "
",
    "Read the CSV file named `airplane_crash.csv` in the `data` folder and assign it to a dataframe called `accident`. Next, drop the column `Summary` using the `pandas` method `drop`."
   ]
  },
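  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the column-dropping step, the example below uses a small made-up dataframe named `toy` instead of the real CSV file. The column values are assumptions chosen only for illustration; the point is that `drop` returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Toy stand-in for the accident data; the values are made up for illustration.
toy = pd.DataFrame({
    'Date': ['01/01/2000', '01/02/2000'],
    'Aboard': [10.0, 20.0],
    'Summary': ['first text', 'second text'],
})

# drop returns a new dataframe; toy itself still has the Summary column.
trimmed = toy.drop(columns=['Summary'])

print(list(trimmed.columns))
print('Summary' in toy.columns)
```

Passing `columns=[...]` is equivalent to passing the label together with `axis=1`, and it avoids relying on the position of the `axis` argument."
   ]
  },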
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['Date', 'Time', 'Location', 'Operator', 'Flight #', 'Route', 'Type',
",
      "       'Registration', 'cn/In', 'Aboard', 'Fatalities', 'Ground', 'Summary'],
",
      "      dtype='object')
",
      "Index(['Date', 'Time', 'Location', 'Operator', 'Flight #', 'Route', 'Type',
",
      "       'Registration', 'cn/In', 'Aboard', 'Fatalities', 'Ground'],
",
      "      dtype='object')
"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Index(['Date', 'Time', 'Location', 'Operator', 'Flight #', 'Route', 'Type',
",
       "       'Registration', 'cn/In', 'Aboard', 'Fatalities', 'Ground'],
",
       "      dtype='object')"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### GRADED
",
    "
",
    "### YOUR SOLUTION HERE
",
    "accident = pd.read_csv("data/airplane_crash.csv")
",
    "#print(accident.columns)
",
    "
",
    "###
",
    "### YOUR CODE HERE
",
    "# Pass axis by keyword; newer pandas versions reject the positional form drop('Summary', 1)
",
    "accident = accident.drop('Summary', axis=1)
",
    "
",
    "###
",
    "
",
    "### Answer check
",
    "accident.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": true,
     "grade_id": "Question 01",
     "locked": true,
     "points": "10",
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "###
",
    "### AUTOGRADER TEST - DO NOT REMOVE
",
    "###
"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "source": [
    "Now we extract the info and visualize the first 5 rows of our dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "
",
      "RangeIndex: 5268 entries, 0 to 5267
",
      "Data columns (total 12 columns):
",
      "Date            5268 non-null object
",
      "Time            3049 non-null object
",
      "Location        5248 non-null object
",
      "Operator        5250 non-null object
",
      "Flight #        1069 non-null object
",
      "Route           3562 non-null object
",
      "Type            5241 non-null object
",
      "Registration    4933 non-null object
",
      "cn/In           4040 non-null object
",
      "Aboard          5161 non-null float64
",
      "Fatalities      5256 non-null float64
",
      "Ground          5246 non-null float64
",
      "dtypes: float64(3), object(9)
",
      "memory usage: 494.0+ KB
"
     ]
    }
   ],
   "source": [
    "## DON'T CHANGE THIS CODE
",
    "accident.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "         Date   Time                            Location  \
",
       "0  09/17/1908  17:18                 Fort Myer, Virginia   
",
       "1  07/12/1912  06:30             AtlantiCity, New Jersey   
",
       "2  08/06/1913    NaN  Victoria, British Columbia, Canada   
",
       "3  09/09/1913  18:30                  Over the North Sea   
",
       "4  10/17/1913  10:30          Near Johannisthal, Germany   
",
       "
",
       "                 Operator Flight #          Route                    Type  \
",
       "0    Military - U.S. Army      NaN  Demonstration        Wright Flyer III   
",
       "1    Military - U.S. Navy      NaN    Test flight               Dirigible   
",
       "2                 Private        -            NaN        Curtiss seaplane   
",
       "3  Military - German Navy      NaN            NaN  Zeppelin L-1 (airship)   
",
       "4  Military - German Navy      NaN            NaN  Zeppelin L-2 (airship)   
",
       "
",
       "  Registration cn/In  Aboard  Fatalities  Ground  
",
       "0          NaN     1     2.0         1.0     0.0  
",
       "1          NaN   NaN     5.0         5.0     0.0  
",
       "2          NaN   NaN     1.0         1.0     0.0  
",
       "3          NaN   NaN    20.0        14.0     0.0  
",
       "4          NaN   NaN    30.0        30.0     0.0  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## DON'T CHANGE THIS CODE
",
    "accident.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "source": [
    "
",
    "[Return to top](#questions)
",
    "
",
    "
",
    "## Question 2
",
    "
",
    "This dataset does not have duplicate rows; however, it is always good practice to verify that you aren't aggregating duplicate rows.
",
    "
",
    "Double up the `accident` dataframe by appending it to itself. Assign the resulting dataframe to a new dataframe called `temp_accident`. Finally, drop the `last` of the duplicate rows and assign the result to a dataframe called `orig_accident`."
   ]
  },
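  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The doubling and de-duplication steps can be sketched on a tiny made-up dataframe. The names `toy`, `doubled` and `deduped` are hypothetical and are not part of the assignment; the sketch only shows how `pd.concat` and `drop_duplicates` behave together.

```python
import pandas as pd

# Three made-up rows standing in for the accident data.
toy = pd.DataFrame({'Type': ['A', 'B', 'C'], 'Aboard': [1.0, 2.0, 3.0]})

# Stack the frame on top of itself: every row now appears twice.
doubled = pd.concat([toy, toy])

# keep='first' (the pandas default) retains the first copy of each row
# and drops the later copies.
deduped = doubled.drop_duplicates(keep='first')

print(doubled.shape[0], deduped.shape[0])
```

Because the first copy of each row is kept, `deduped` has the same rows and index labels as `toy`."
   ]
  },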
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(5268, 12)
",
      "Rows in duplicate dataframe: 10536
",
      "Rows in duplicate-free dataframe: 5268
"
     ]
    }
   ],
   "source": [
    "### GRADED
",
    "
",
    "### YOUR SOLUTION HERE
",
    "frames = [accident , accident]
",
    "temp_accident = pd.concat(frames)
",
    "
",
    "###
",
    "### YOUR CODE HERE
",
    "orig_accident = temp_accident.drop_duplicates()
",
    "###
",
    "
",
    "### Answer check
",
    "print("Rows in duplicate dataframe: {}".format(temp_accident.shape[0]))
",
    "print("Rows in duplicate-free dataframe: {}".format(orig_accident.shape[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": true,
     "grade_id": "Question 02",
     "locked": true,
     "points": "10",
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "###
",
    "### AUTOGRADER TEST - DO NOT REMOVE
",
    "###
"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "source": [
    "
",
    "[Return to top](#questions)
",
    "
",
    "## Question 3
",
    "
",
    "Imputation is a feature engineering technique used to keep valuable data that have null values by replacing the missing values with an estimate.
",
    "
",
    "In our dataframe `orig_accident`, the column `Aboard` has some missing values. Follow these steps to impute the missing values:
",
    "- Extract this column as a `pandas.Series` and assign it to a variable called `aboard_missing`.
",
    "- Compute the mean of `aboard_missing` and store the result in `aboard_average`.
",
    "- Finally, create a new variable `aboard_people` where the missing values in `aboard_missing` have been imputed with the value in `aboard_average`.
",
    "
",
    "**Hint:** you can use the methods `.fillna()` or `.isnull()`."
   ]
  },
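  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The three imputation steps above can be sketched on a short made-up series. The values and the names `aboard`, `average` and `imputed` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up series with two missing entries.
aboard = pd.Series([2.0, np.nan, 4.0, np.nan, 6.0], name='Aboard')

# Series.mean skips NaN by default, so this is the mean of 2, 4 and 6.
average = aboard.mean()

# fillna returns a new series in which every NaN is replaced by the mean.
imputed = aboard.fillna(average)

print(average)
print(int(imputed.isnull().sum()))
```

Mean imputation leaves the mean of the column unchanged, but it reduces the variance, because every imputed entry sits exactly at the mean."
   ]
  },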
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0       False
",
      "1       False
",
      "2       False
",
      "3       False
",
      "4       False
",
      "5       False
",
      "6       False
",
      "7       False
",
      "8        True
",
      "9       False
",
      "10      False
",
      "11      False
",
      "12      False
",
      "13      False
",
      "14       True
",
      "15      False
",
      "16      False
",
      "17      False
",
      "18      False
",
      "19      False
",
      "20       True
",
      "21      False
",
      "22      False
",
      "23      False
",
      "24      False
",
      "25      False
",
      "26       True
",
      "27      False
",
      "28      False
",
      "29      False
",
      "        ...  
",
      "5238    False
",
      "5239    False
",
      "5240    False
",
      "5241    False
",
      "5242    False
",
      "5243    False
",
      "5244    False
",
      "5245    False
",
      "5246    False
",
      "5247    False
",
      "5248    False
",
      "5249    False
",
      "5250    False
",
      "5251    False
",
      "5252    False
",
      "5253    False
",
      "5254    False
",
      "5255    False
",
      "5256    False
",
      "5257    False
",
      "5258    False
",
      "5259    False
",
      "5260    False
",
      "5261    False
",
      "5262    False
",
      "5263    False
",
      "5264    False
",
      "5265    False
",
      "5266    False
",
      "5267    False
",
      "Name: Aboard, Length: 5268, dtype: bool
",
      "27.595427242782407
",
      "0       False
",
      "1       False
",
      "2       False
",
      "3       False
",
      "4       False
",
      "5       False
",
      "6       False
",
      "7       False
",
      "8        True
",
      "9       False
",
      "10      False
",
      "11      False
",
      "12      False
",
      "13      False
",
      "14       True
",
      "15      False
",
      "16      False
",
      "17      False
",
      "18      False
",
      "19      False
",
      "20       True
",
      "21      False
",
      "22      False
",
      "23      False
",
      "24      False
",
      "25      False
",
      "26       True
",
      "27      False
",
      "28      False
",
      "29      False
",
      "        ...  
",
      "5238    False
",
      "5239    False
",
      "5240    False
",
      "5241    False
",
      "5242    False
",
      "5243    False
",
      "5244    False
",
      "5245    False
",
      "5246    False
",
      "5247    False
",
      "5248    False
",
      "5249    False
",
      "5250    False
",
      "5251    False
",
      "5252    False
",
      "5253    False
",
      "5254    False
",
      "5255    False
",
      "5256    False
",
      "5257    False
",
      "5258    False
",
      "5259    False
",
      "5260    False
",
      "5261    False
",
      "5262    False
",
      "5263    False
",
      "5264    False
",
      "5265    False
",
      "5266    False
",
      "5267    False
",
      "Name: Aboard, Length: 5268, dtype: bool
",
      "Average aboard people: 27.595427242782407
"
     ]
    }
   ],
   "source": [
    "### GRADED
",
    "
",
    "### YOUR SOLUTION HERE
",
    "aboard_missing = orig_accident['Aboard']
",
    "aboard_average = aboard_missing.mean()
",
    "
",
    "###
",
    "### YOUR CODE HERE
",
    "aboard_people = aboard_missing.fillna(value = aboard_average)
",
    "
",
    "###
",
    "
",
    "### Answer check
",
    "print("Average aboard people: {}".format(aboard_average))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": true,
     "grade_id": "Question 03",
     "locked": true,
     "points": "10",
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "###
",
    "### AUTOGRADER TEST - DO NOT REMOVE
",
    "###
"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "source": [
    "## Regression Evaluation Metrics
",
    "
",
    "
",
    "In general, we can use three different error evaluation metrics:
",
    "
",
    "**Mean Absolute Error** (MAE): the mean of the absolute value of the errors
",
    "
",
    "$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$
",
    "
",
    "**Mean Squared Error** (MSE): the mean of the squared errors
",
    "
",
    "$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
",
    "
",
    "**Root Mean Squared Error** (RMSE): the square root of the mean of the squared errors
",
    "
",
    "$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
",
    "
",
    "These evaluation metrics compare to each other in the following way:
",
    "
",
    "- The **MAE** is the easiest to understand because it is simply the average absolute error.
",
    "- The **MSE** squares each error before averaging, so it "punishes" larger errors more heavily than the MAE does. For this reason, MSE is often preferred when large errors are especially costly.
",
    "- The **RMSE** weights errors in the same way as the MSE but is often preferred because it is interpretable in the "y" units.
",
    "
",
    "Because our goal is to minimize the error, we can also refer to these metrics as **loss functions**."
   ]
  },
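  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the three formulas concrete, here is a small sketch that computes them directly with NumPy. The arrays `y_true` and `y_pred` are made-up values, not data from this assignment.

```python
import numpy as np

# Made-up targets and predictions, chosen so the arithmetic is easy to follow.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 3.0])

errors = y_true - y_pred          # [1, 0, -2, 4]

mae = np.mean(np.abs(errors))     # (1 + 0 + 2 + 4) / 4 = 1.75
mse = np.mean(errors ** 2)        # (1 + 0 + 4 + 16) / 4 = 5.25
rmse = np.sqrt(mse)               # square root of 5.25, about 2.29

print(mae, mse, rmse)
```

In this toy example the single error of size 4 accounts for 4 of the 7 units of absolute error, but for 16 of the 21 units of squared error, which is what it means for the MSE to punish larger errors."
   ]
  },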
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "source": [
    "
",
    "[Return to top](#questions)
",
    "
",
    "## Question 4
",
    "
",
    "Next, we will measure the error between the number of people aboard and the number of fatalities in our dataset.
",
    "- Fill the missing values in the column `Fatalities` from `orig_accident` with the column's average value. Store the result as a `pandas.Series` in `fatal_count`.
",
    "- Compute the MAE, MSE and RMSE between `fatal_count` and `aboard_people`. Save the result of each metric comparison into variables called `crash_mae`, `crash_mse`,  and `crash_rmse`, respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7.852438601750174
",
      "908.9028198094563
",
      "30.14801518855688
",
      "Missing values in fatal_count: 0
",
      "MAE: 7.852438601750174
",
      "MSE: 908.9028198094563
",
      "RMSE: 30.14801518855688
"
     ]
    }
   ],
   "source": [
    "from math import sqrt
",
    "
",
    "from sklearn.metrics import mean_absolute_error, mean_squared_error
",
    "
",
    "### GRADED
",
    "
",
    "### YOUR SOLUTION HERE
",
    "# Impute the missing fatality counts with the column mean
",
    "fatalities = orig_accident['Fatalities']
",
    "fatal_count = fatalities.fillna(value=fatalities.mean())
",
    "
",
    "# Compare the two imputed series; MAE and MSE do not depend on the argument order
",
    "crash_mae = mean_absolute_error(aboard_people, fatal_count)
",
    "crash_mse = mean_squared_error(aboard_people, fatal_count)
",
    "crash_rmse = sqrt(crash_mse)
",
    "###
",
    "### YOUR CODE HERE
",
    "
",
    "###
",
    "
",
    "### Answer check
",
    "print("Missing values in fatal_count: {}".format(fatal_count.isnull().sum()))
",
    "print("MAE: {}".format(crash_mae))
",
    "print("MSE: {}".format(crash_mse))
",
    "print("RMSE: {}".format(crash_rmse))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": true,
     "grade_id": "Question 04",
     "locked": true,
     "points": "15",
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "###
",
    "### AUTOGRADER TEST - DO NOT REMOVE
",
    "###
"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "source": [
    "
",
    "[Return to top](#questions)
",
    "
",
    "## Question 5
",
    "
",
    "Sometimes it is useful to extract rows or columns from a dataframe by setting a condition based on some specific feature we are interested in. For example, we can extract only the entries from the dataframe that have a desired value. Alternatively, we can select entries by applying a boolean condition to the DataFrame.
",
    "
",
    "From the dataframe `orig_accident`, extract only the rows that have the value `"Zeppelin L-1 (airship)"` in the column `Type`. Assign the resulting dataframe to the variable `zeppelin_flights`."
   ]
  },
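  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting rows with a boolean condition can be sketched as follows. The dataframe `toy` and its values are made up for illustration; only the `Type` label is borrowed from the question.

```python
import pandas as pd

# Made-up rows; two of them share the Type value we want to select.
toy = pd.DataFrame({
    'Type': ['Zeppelin L-1 (airship)', 'Dirigible', 'Zeppelin L-1 (airship)'],
    'Aboard': [20.0, 5.0, 7.0],
})

# Comparing a column to a value yields a boolean series, one entry per row.
mask = toy['Type'] == 'Zeppelin L-1 (airship)'

# .loc keeps only the rows where the mask is True; the result is a dataframe.
matches = toy.loc[mask]

print(mask.tolist())
print(matches.index.tolist())
```

The selected rows keep their original index labels, which is why the single matching row in the real data is displayed with index 3."
   ]
  },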
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "         Date   Time            Location                Operator Flight #  \
",
       "3  09/09/1913  18:30  Over the North Sea  Military - German Navy      NaN   
",
       "
",
       "  Route                    Type Registration cn/In  Aboard  Fatalities  Ground  
",
       "3   NaN  Zeppelin L-1 (airship)          NaN   NaN    20.0        14.0     0.0  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### GRADED
",
    "
",
    "### YOUR SOLUTION HERE
",
    "
",
    "zeppelin_flights = orig_accident.loc[orig_accident['Type'] == "Zeppelin L-1 (airship)" ]
",
    "
",
    "###
",
    "### YOUR CODE HERE
",
    "###
",
    "
",
    "### Answer check
",
    "zeppelin_flights"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": true,
     "grade_id": "Question 05",
     "locked": true,
     "points": "10",
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "###
",
    "### AUTOGRADER TEST - DO NOT REMOVE
",
    "###
"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "source": [
    "## Training a model and the Shapiro-Wilk test
",
    "
",
    "In the second part of the assignment, we will use linear regression and the _Shapiro-Wilk_ test on a dataframe containing information about houses in different regions of the United States.
",
    "
",
    "Imagine your friend is a real estate agent and wants some help predicting housing prices for different regions in the USA. It would be helpful if you could somehow create a model that takes a few features of a house and returns an estimate of what the house would sell for.
",
    "
",
    "He has asked you if you could help him out with your new data science skills. You say yes and decide that Linear Regression might be a good path to solve this problem!
",
    "
",
    "The dataset with historical real estate information is stored in the file `USA_Housing.csv` and contains the following columns:
",
    "
",
    "* 'Avg. Area Income': Avg. Income of residents of the city the house is located in.
",
    "* 'Avg. Area House Age': Avg Age of Houses in the same city.
",
    "* 'Avg. Area Number of Rooms': Avg Number of Rooms for Houses in the same city.
",
    "* 'Avg. Area Number of Bedrooms': Avg Number of Bedrooms for Houses in the same city.
",
    "* 'Area Population': Population of the city the house is located in.
",
    "* 'Price': Final Sale Price for the house.
",
    "* 'Address': Address for the house."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "deletable": false,
    "editable": false,
    "id": "f7Yr6WqnBEo8",
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "source": [
    "## Read and extract information about the data
",
    "
"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "colab": [],
    "colab_type": "code",
    "collapsed": true,
    "deletable": false,
    "editable": false,
    "id": "qd5jmv6YBEo9",
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "## DON'T CHANGE THIS CODE
",
    "USAhousing = pd.read_csv('data/USA_Housing.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  \
",
       "0      79545.458574             5.682861                   7.009188   
",
       "1      79248.642455             6.002900                   6.730821   
",
       "2      61287.067179             5.865890                   8.512727   
",
       "3      63345.240046             7.188236                   5.586729   
",
       "4      59982.197226             5.040555                   7.839388   
",
       "
",
       "   Avg. Area Number of Bedrooms  Area Population         Price  \
",
       "0                          4.09     23086.800503  1.059034e+06   
",
       "1                          3.09     40173.072174  1.505891e+06   
",
       "2                          5.13     36882.159400  1.058988e+06   
",
       "3                          3.26     34310.242831  1.260617e+06   
",
       "4                          4.23     26354.109472  6.309435e+05   
",
       "
",
       "                                             Address  
",
       "0  208 Michael Ferry Apt. 674\nLaurabury, NE 3701...  
",
       "1  188 Johnson Views Suite 079\nLake Kathleen, CA...  
",
       "2  9127 Elizabeth Stravenue\nDanieltown, WI 06482...  
",
       "3                          USS Barnett\nFPO AP 44820  
",
       "4                         USNS Raymond\nFPO AE 09386  "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "USAhousing.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "colab": [],
    "colab_type": "code",
    "deletable": false,
    "editable": false,
    "id": "U8IkEWefBEpF",
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    },
    "outputId": "5d8de68b-0199-4e53-9b32-4cf862b9ca81"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>
",
      "RangeIndex: 5000 entries, 0 to 4999
",
      "Data columns (total 7 columns):
",
      "Avg. Area Income                5000 non-null float64
",
      "Avg. Area House Age             5000 non-null float64
",
      "Avg. Area Number of Rooms       5000 non-null float64
",
      "Avg. Area Number of Bedrooms    5000 non-null float64
",
      "Area Population                 5000 non-null float64
",
      "Price                           5000 non-null float64
",
      "Address                         5000 non-null object
",
      "dtypes: float64(6), object(1)
",
      "memory usage: 273.5+ KB
"
     ]
    }
   ],
   "source": [
    "USAhousing.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "colab": [],
    "colab_type": "code",
    "deletable": false,
    "editable": false,
    "id": "ZlmgNJriBEpI",
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    },
    "outputId": "2baf4687-162b-4165-a462-c666509b9eea"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       ""
      ],
      "text/plain": [
       "       Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  \
",
       "count       5000.000000          5000.000000                5000.000000   
",
       "mean       68583.108984             5.977222                   6.987792   
",
       "std        10657.991214             0.991456                   1.005833   
",
       "min        17796.631190             2.644304                   3.236194   
",
       "25%        61480.562388             5.322283                   6.299250   
",
       "50%        68804.286404             5.970429                   7.002902   
",
       "75%        75783.338666             6.650808                   7.665871   
",
       "max       107701.748378             9.519088                  10.759588   
",
       "
",
       "       Avg. Area Number of Bedrooms  Area Population         Price  
",
       "count                   5000.000000      5000.000000  5.000000e+03  
",
       "mean                       3.981330     36163.516039  1.232073e+06  
",
       "std                        1.234137      9925.650114  3.531176e+05  
",
       "min                        2.000000       172.610686  1.593866e+04  
",
       "25%                        3.140000     29403.928702  9.975771e+05  
",
       "50%                        4.050000     36199.406689  1.232669e+06  
",
       "75%                        4.490000     42861.290769  1.471210e+06  
",
       "max                        6.500000     69621.713378  2.469066e+06  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "USAhousing.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "source": [
    "
",
    "[Return to top](#questions)
",
    "
",
    "## Question 6
",
    "
",
    "Extract the first 10 rows of the `USAhousing` dataset and store them in a dataframe called `df`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "colab": [],
    "colab_type": "code",
    "id": "hs1MkVCqBEpA",
    "outputId": "7d4b61c6-924b-4fae-818d-2d471c3c56e6"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       ""
      ],
      "text/plain": [
       "   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  \
",
       "0      79545.458574             5.682861                   7.009188   
",
       "1      79248.642455             6.002900                   6.730821   
",
       "2      61287.067179             5.865890                   8.512727   
",
       "3      63345.240046             7.188236                   5.586729   
",
       "4      59982.197226             5.040555                   7.839388   
",
       "5      80175.754159             4.988408                   6.104512   
",
       "6      64698.463428             6.025336                   8.147760   
",
       "7      78394.339278             6.989780                   6.620478   
",
       "8      59927.660813             5.362126                   6.393121   
",
       "9      81885.927184             4.423672                   8.167688   
",
       "
",
       "   Avg. Area Number of Bedrooms  Area Population         Price  \
",
       "0                          4.09     23086.800503  1.059034e+06   
",
       "1                          3.09     40173.072174  1.505891e+06   
",
       "2                          5.13     36882.159400  1.058988e+06   
",
       "3                          3.26     34310.242831  1.260617e+06   
",
       "4                          4.23     26354.109472  6.309435e+05   
",
       "5                          4.04     26748.428425  1.068138e+06   
",
       "6                          3.41     60828.249085  1.502056e+06   
",
       "7                          2.42     36516.358972  1.573937e+06   
",
       "8                          2.30     29387.396003  7.988695e+05   
",
       "9                          6.10     40149.965749  1.545155e+06   
",
       "
",
       "                                             Address  
",
       "0  208 Michael Ferry Apt. 674\nLaurabury, NE 3701...  
",
       "1  188 Johnson Views Suite 079\nLake Kathleen, CA...  
",
       "2  9127 Elizabeth Stravenue\nDanieltown, WI 06482...  
",
       "3                          USS Barnett\nFPO AP 44820  
",
       "4                         USNS Raymond\nFPO AE 09386  
",
       "5  06039 Jennifer Islands Apt. 443\nTracyport, KS...  
",
       "6  4759 Daniel Shoals Suite 442\nNguyenburgh, CO ...  
",
       "7     972 Joyce Viaduct\nLake William, TN 17778-6483  
",
       "8                          USS Gilbert\nFPO AA 20957  
",
       "9                   Unit 9446 Box 0958\nDPO AE 97025  "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### GRADED
",
    "
",
    "### YOUR SOLUTION HERE
",
    "df = USAhousing.head(10)
",
    "
",
    "###
",
    "### YOUR CODE HERE
",
    "###
",
    "
",
    "### Answer check
",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "grade": true,
     "grade_id": "Question 06",
     "locked": true,
     "points": "10",
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "###
",
    "### AUTOGRADER TEST - DO NOT REMOVE
",
    "###
"
   ]
  },
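The row extraction asked for in Question 6 can be sketched on a small stand-in frame, since the real `USAhousing` CSV is not reproduced here. The `toy` frame and its random values are hypothetical; only `head` and `iloc` are actual pandas calls. The sketch shows that `head(10)` and positional slicing with `iloc[:10]` select the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for USAhousing: 25 rows, reusing two of its column names.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Avg. Area Income": rng.normal(68000, 10000, size=25),
    "Price": rng.normal(1.2e6, 3.5e5, size=25),
})

# head(10) returns the first 10 rows by position.
df = toy.head(10)

# Positional slicing with iloc selects the same rows.
df_iloc = toy.iloc[:10]

print(df.shape)            # (10, 2)
print(df.equals(df_iloc))  # True
```

Calling `.copy()` on the result gives an independent frame if `df` will be modified later.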
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "deletable": false,
    "editable": false,
    "id": "CyDo8jZwBEpS",
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    }
   },
   "source": [
    "
",
    "## Exploratory data analysis (EDA)
",
    "
",
    "We will start our EDA by plotting every numeric feature against every other in a grid of scatterplots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "colab": [],
    "colab_type": "code",
    "deletable": false,
    "editable": false,
    "id": "sDz7mXIDBEpT",
    "nbgrader": {
     "grade": false,
     "locked": true,
     "solution": false
    },
    "outputId": "246cc53e-0721-46f4-a150-6a61c024b783"
   },
   "outputs": [
    {
     "data": {
      "image/png":


