{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"deletable": false,
"editable": false,
"id": "viK_dHiXBEo2",
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"# Module 5: Linear Regression - Interactions and Transformations\n",
"\n",
"**_Author:Favio Vázquez and Jessica Cervi_**\n",
"\n",
" In this assignment, we will perform some feature transformation on a database describing airplane accidents and next, we will study a complete example of linear regression.\n",
" \n",
" \n",
" \n",
"### Index:\n",
"\n",
"\n",
"- [Question 1](#q01)\n",
"- [Question 2](#q02)\n",
"- [Question 3](#q03)\n",
"- [Question 4](#q04)\n",
"- [Question 5](#q05)\n",
"- [Question 6](#q06)\n",
"- [Question 7](#q07)\n",
"- [Question 8](#q08)\n",
"- [Question 9](#q09)\n",
"- [Question 10](#q10)\n",
"- [Question 11](#q11)\n",
"- [Question 12](#q12)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"deletable": false,
"editable": false,
"id": "qY13QYshBEo3",
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"## Import the necessary libraries"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [],
"source": [
"## DON'T CHANGE THIS CODE\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"\n",
"[Return to top](#questions)\n",
"\n",
"## Question 1\n",
"\n",
"Read the CSV file named \"airplane_crash.csv\" in the `data` folder and assign it to a dataframe called `accident`. Next, drop the column `Summary` using the `pandas` command `drop` ."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Date', 'Time', 'Location', 'Operator', 'Flight #', 'Route', 'Type',\n",
" 'Registration', 'cn/In', 'Aboard', 'Fatalities', 'Ground', 'Summary'],\n",
" dtype='object')\n",
"Index(['Date', 'Time', 'Location', 'Operator', 'Flight #', 'Route', 'Type',\n",
" 'Registration', 'cn/In', 'Aboard', 'Fatalities', 'Ground'],\n",
" dtype='object')\n"
]
},
{
"data": {
"text/plain": [
"Index(['Date', 'Time', 'Location', 'Operator', 'Flight #', 'Route', 'Type',\n",
" 'Registration', 'cn/In', 'Aboard', 'Fatalities', 'Ground'],\n",
" dtype='object')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### GRADED\n",
"\n",
"### YOUR SOLUTION HERE\n",
"accident = pd.read_csv(\"data/airplane_crash.csv\")\n",
"#print(accident.columns)\n",
"\n",
"###\n",
"### YOUR CODE HERE\n",
"accident = accident.drop('Summary', 1)\n",
"\n",
"###\n",
"\n",
"### Answer check\n",
"accident.columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": true,
"grade_id": "Question 01",
"locked": true,
"points": "10",
"solution": false
}
},
"outputs": [],
"source": [
"###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"Now we extract the info and visuaize the first 10 rows of our dataframe"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 5268 entries, 0 to 5267\n",
"Data columns (total 12 columns):\n",
"Date 5268 non-null object\n",
"Time 3049 non-null object\n",
"Location 5248 non-null object\n",
"Operator 5250 non-null object\n",
"Flight # 1069 non-null object\n",
"Route 3562 non-null object\n",
"Type 5241 non-null object\n",
"Registration 4933 non-null object\n",
"cn/In 4040 non-null object\n",
"Aboard 5161 non-null float64\n",
"Fatalities 5256 non-null float64\n",
"Ground 5246 non-null float64\n",
"dtypes: float64(3), object(9)\n",
"memory usage: 494.0+ KB\n"
]
}
],
"source": [
"## DON'T CHANGE THIS CODE\n",
"accident.info()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"Date | \n",
"Time | \n",
"Location | \n",
"Operator | \n",
"Flight # | \n",
"Route | \n",
"Type | \n",
"Registration | \n",
"cn/In | \n",
"Aboard | \n",
"Fatalities | \n",
"Ground | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"09/17/1908 | \n",
"17:18 | \n",
"Fort Myer, Virginia | \n",
"Military - U.S. Army | \n",
"NaN | \n",
"Demonstration | \n",
"Wright Flyer III | \n",
"NaN | \n",
"1 | \n",
"2.0 | \n",
"1.0 | \n",
"0.0 | \n",
"
\n",
"\n",
"1 | \n",
"07/12/1912 | \n",
"06:30 | \n",
"AtlantiCity, New Jersey | \n",
"Military - U.S. Navy | \n",
"NaN | \n",
"Test flight | \n",
"Dirigible | \n",
"NaN | \n",
"NaN | \n",
"5.0 | \n",
"5.0 | \n",
"0.0 | \n",
"
\n",
"\n",
"2 | \n",
"08/06/1913 | \n",
"NaN | \n",
"Victoria, British Columbia, Canada | \n",
"Private | \n",
"- | \n",
"NaN | \n",
"Curtiss seaplane | \n",
"NaN | \n",
"NaN | \n",
"1.0 | \n",
"1.0 | \n",
"0.0 | \n",
"
\n",
"\n",
"3 | \n",
"09/09/1913 | \n",
"18:30 | \n",
"Over the North Sea | \n",
"Military - German Navy | \n",
"NaN | \n",
"NaN | \n",
"Zeppelin L-1 (airship) | \n",
"NaN | \n",
"NaN | \n",
"20.0 | \n",
"14.0 | \n",
"0.0 | \n",
"
\n",
"\n",
"4 | \n",
"10/17/1913 | \n",
"10:30 | \n",
"Near Johannisthal, Germany | \n",
"Military - German Navy | \n",
"NaN | \n",
"NaN | \n",
"Zeppelin L-2 (airship) | \n",
"NaN | \n",
"NaN | \n",
"30.0 | \n",
"30.0 | \n",
"0.0 | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" Date Time Location \\\n",
"0 09/17/1908 17:18 Fort Myer, Virginia \n",
"1 07/12/1912 06:30 AtlantiCity, New Jersey \n",
"2 08/06/1913 NaN Victoria, British Columbia, Canada \n",
"3 09/09/1913 18:30 Over the North Sea \n",
"4 10/17/1913 10:30 Near Johannisthal, Germany \n",
"\n",
" Operator Flight # Route Type \\\n",
"0 Military - U.S. Army NaN Demonstration Wright Flyer III \n",
"1 Military - U.S. Navy NaN Test flight Dirigible \n",
"2 Private - NaN Curtiss seaplane \n",
"3 Military - German Navy NaN NaN Zeppelin L-1 (airship) \n",
"4 Military - German Navy NaN NaN Zeppelin L-2 (airship) \n",
"\n",
" Registration cn/In Aboard Fatalities Ground \n",
"0 NaN 1 2.0 1.0 0.0 \n",
"1 NaN NaN 5.0 5.0 0.0 \n",
"2 NaN NaN 1.0 1.0 0.0 \n",
"3 NaN NaN 20.0 14.0 0.0 \n",
"4 NaN NaN 30.0 30.0 0.0 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## DON'T CHANGE THIS CODE\n",
"accident.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"
\n",
"[Return to top](#questions)\n",
"\n",
"\n",
"## Question 2\n",
"\n",
"This dataset does not have duplicate rows, however it is always good practice to verify that you aren't aggregating duplicate rows. \n",
"\n",
"Double up the `accident` dataframe by appending it to itself. Assign the resulting dataframe to a new dataframe called `temp_accident`. Finally, drop the `last` of the duplicate rows and assign the result to a dataframe called `orig_accident`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(5268, 12)\n",
"Rows in duplicate dataframe: 10536\n",
"Rows in duplicate-free dataframe: 5268\n"
]
}
],
"source": [
"### GRADED\n",
"\n",
"### YOUR SOLUTION HERE\n",
"frames = [accident , accident]\n",
"temp_accident = pd.concat(frames)\n",
"\n",
"###\n",
"### YOUR CODE HERE\n",
"orig_accident = temp_accident.drop_duplicates()\n",
"###\n",
"\n",
"### Answer check\n",
"print(\"Rows in duplicate dataframe: {}\".format(temp_accident.shape[0]))\n",
"print(\"Rows in duplicate-free dataframe: {}\".format(orig_accident.shape[0]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": true,
"grade_id": "Question 02",
"locked": true,
"points": "10",
"solution": false
}
},
"outputs": [],
"source": [
"###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"
\n",
"[Return to top](#questions)\n",
"\n",
"## Question 3\n",
"\n",
"Imputation is a feature engineering technique used to keep valuable data that have null values by replacing the missing values with an estimate.\n",
"\n",
"In our dataframe `orig_accident`, the column `Aboard` has some missing values. Follow these steps to impute the missing values:\n",
"- Extract this column and as a `pandas.Series` and assign it to a variable called `aboard_missing`.\n",
"- Compute the mean of `aboard_missing` and store the result in `aboard_average`.\n",
"- Finally, create a new variable `aboard_people` where the missing values in `aboard_missing` have been imputed with the value in `aboard_average`.\n",
"\n",
"**Hint:** you can use the methods `.fillna()` or `.isnull()`."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
"5 False\n",
"6 False\n",
"7 False\n",
"8 True\n",
"9 False\n",
"10 False\n",
"11 False\n",
"12 False\n",
"13 False\n",
"14 True\n",
"15 False\n",
"16 False\n",
"17 False\n",
"18 False\n",
"19 False\n",
"20 True\n",
"21 False\n",
"22 False\n",
"23 False\n",
"24 False\n",
"25 False\n",
"26 True\n",
"27 False\n",
"28 False\n",
"29 False\n",
" ... \n",
"5238 False\n",
"5239 False\n",
"5240 False\n",
"5241 False\n",
"5242 False\n",
"5243 False\n",
"5244 False\n",
"5245 False\n",
"5246 False\n",
"5247 False\n",
"5248 False\n",
"5249 False\n",
"5250 False\n",
"5251 False\n",
"5252 False\n",
"5253 False\n",
"5254 False\n",
"5255 False\n",
"5256 False\n",
"5257 False\n",
"5258 False\n",
"5259 False\n",
"5260 False\n",
"5261 False\n",
"5262 False\n",
"5263 False\n",
"5264 False\n",
"5265 False\n",
"5266 False\n",
"5267 False\n",
"Name: Aboard, Length: 5268, dtype: bool\n",
"27.595427242782407\n",
"0 False\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
"5 False\n",
"6 False\n",
"7 False\n",
"8 True\n",
"9 False\n",
"10 False\n",
"11 False\n",
"12 False\n",
"13 False\n",
"14 True\n",
"15 False\n",
"16 False\n",
"17 False\n",
"18 False\n",
"19 False\n",
"20 True\n",
"21 False\n",
"22 False\n",
"23 False\n",
"24 False\n",
"25 False\n",
"26 True\n",
"27 False\n",
"28 False\n",
"29 False\n",
" ... \n",
"5238 False\n",
"5239 False\n",
"5240 False\n",
"5241 False\n",
"5242 False\n",
"5243 False\n",
"5244 False\n",
"5245 False\n",
"5246 False\n",
"5247 False\n",
"5248 False\n",
"5249 False\n",
"5250 False\n",
"5251 False\n",
"5252 False\n",
"5253 False\n",
"5254 False\n",
"5255 False\n",
"5256 False\n",
"5257 False\n",
"5258 False\n",
"5259 False\n",
"5260 False\n",
"5261 False\n",
"5262 False\n",
"5263 False\n",
"5264 False\n",
"5265 False\n",
"5266 False\n",
"5267 False\n",
"Name: Aboard, Length: 5268, dtype: bool\n",
"Average aboard people: 27.595427242782407\n"
]
}
],
"source": [
"### GRADED\n",
"\n",
"### YOUR SOLUTION HERE\n",
"aboard_missing = orig_accident['Aboard']\n",
"aboard_average = aboard_missing.mean()\n",
"\n",
"###\n",
"### YOUR CODE HERE\n",
"aboard_people = aboard_missing.fillna(value = aboard_average)\n",
"\n",
"###\n",
"\n",
"### Answer check\n",
"print(\"Average aboard people: {}\".format(aboard_average))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": true,
"grade_id": "Question 03",
"locked": true,
"points": "10",
"solution": false
}
},
"outputs": [],
"source": [
"###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"## Regression Evaluation Metrics\n",
"\n",
"\n",
"In general, we can use three different error evaluation metrics:\n",
"\n",
"**Mean Absolute Error** (MAE): the mean of the absolute value of the errors\n",
"\n",
"$$\\frac 1n\\sum_{i=1}^n|y_i-\\hat{y}_i|$$\n",
"\n",
"**Mean Squared Error** (MSE): the mean of the squared errors\n",
"\n",
"$$\\frac 1n\\sum_{i=1}^n(y_i-\\hat{y}_i)^2$$\n",
"\n",
"**Root Mean Squared Error** (RMSE): the square root of the mean of the squared errors\n",
"\n",
"$$\\sqrt{\\frac 1n\\sum_{i=1}^n(y_i-\\hat{y}_i)^2}$$\n",
"\n",
"These evaluation metrics compare to each other in the following way:\n",
"\n",
"- The **MAE** is the easiest to understand because it's just the average error.\n",
"- The **MSE** is more popular than MAE because MSE \"punishes\" larger errors. For this reason MSE tends to be more useful in real world problems.\n",
"- The **RMSE** is even more popular than MSE because is interpretable in the \"y\" units.\n",
"\n",
"Because our goal is to minimize the error, we can also refer to these metrics as **loss functions**."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"
\n",
"[Return to top](#questions)\n",
"\n",
"## Question 4\n",
"\n",
"Next, we will find the error between the aboard people and fatalities in our dataset. \n",
"- Fill the missing values in the column `Fatalities` from `orig_accident` with the average value. Store the result as a `Pandas.Series` to `fatal_count`.\n",
"- Compute the MAE, MSE and RMSE between `fatal_count` and `aboard_people`. Save the result of each metric comparison into variables called `crash_mae`, `crash_mse`, and `crash_rmse`, respectively."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7.852438601750174\n",
"908.9028198094563\n",
"30.14801518855688\n",
"Missing values in fatal_count: 0\n",
"MAE: 7.852438601750174\n",
"MSE: 908.9028198094563\n",
"RMSE: 30.14801518855688\n"
]
}
],
"source": [
"from sklearn import metrics\n",
"from sklearn.metrics import mean_squared_error , mean_absolute_error\n",
"from math import sqrt\n",
"### GRADED\n",
"\n",
"fatalities = orig_accident['Fatalities']\n",
"fatalities_mean = fatalities.mean()\n",
"fatalities =fatalities.fillna(value = fatalities_mean)\n",
"\n",
"### YOUR SOLUTION HERE\n",
"fatal_count = fatalities\n",
"\n",
"crash_mae = None\n",
"crash_mse = None\n",
"crash_rmse = None\n",
"\n",
"\n",
"crash_mae = mean_absolute_error(aboard_people, fatal_count)\n",
"crash_mse = mean_squared_error(aboard_people, fatal_count)\n",
"crash_rmse = sqrt(mean_squared_error(aboard_people, fatal_count))\n",
"###\n",
"### YOUR CODE HERE\n",
"\n",
"###\n",
"\n",
"### Answer check\n",
"print(\"Missing values in fatal_count: {}\".format(fatal_count.isnull().sum()))\n",
"print(\"MAE: {}\".format(crash_mae))\n",
"print(\"MSE: {}\".format(crash_mse))\n",
"print(\"RMSE: {}\".format(crash_rmse))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": true,
"grade_id": "Question 04",
"locked": true,
"points": "15",
"solution": false
}
},
"outputs": [],
"source": [
"###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"
\n",
"[Return to top](#questions)\n",
"\n",
"## Question 5\n",
"\n",
"Sometimes it is useful to extract rows or columns from a dataframe by setting a condition based on some specific feature we are interested in. For example, we can extract only the entries from the dataframe that have a desired value. Alternatively, we can select entries by applying a boolean condition to the DataFrame. \n",
"\n",
"From the dataframe `orig_accident`, extract only the rows that have the value `\"Zeppelin L-1 (airship)\"` in the column `Type`. Assign the resulting array to the variable `zeppelin_flights`."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"Date | \n",
"Time | \n",
"Location | \n",
"Operator | \n",
"Flight # | \n",
"Route | \n",
"Type | \n",
"Registration | \n",
"cn/In | \n",
"Aboard | \n",
"Fatalities | \n",
"Ground | \n",
"
\n",
"\n",
"\n",
"\n",
"3 | \n",
"09/09/1913 | \n",
"18:30 | \n",
"Over the North Sea | \n",
"Military - German Navy | \n",
"NaN | \n",
"NaN | \n",
"Zeppelin L-1 (airship) | \n",
"NaN | \n",
"NaN | \n",
"20.0 | \n",
"14.0 | \n",
"0.0 | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" Date Time Location Operator Flight # \\\n",
"3 09/09/1913 18:30 Over the North Sea Military - German Navy NaN \n",
"\n",
" Route Type Registration cn/In Aboard Fatalities Ground \n",
"3 NaN Zeppelin L-1 (airship) NaN NaN 20.0 14.0 0.0 "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### GRADED\n",
"\n",
"### YOUR SOLUTION HERE\n",
"\n",
"zeppelin_flights = orig_accident.loc[orig_accident['Type'] == \"Zeppelin L-1 (airship)\" ]\n",
"\n",
"###\n",
"### YOUR CODE HERE\n",
"###\n",
"\n",
"### Answer check\n",
"zeppelin_flights"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": true,
"grade_id": "Question 05",
"locked": true,
"points": "10",
"solution": false
}
},
"outputs": [],
"source": [
"###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"## Training a model and the Shapiro-Wilk test\n",
"\n",
"In the second part of the assignment, we will work using the linear regression and the _Shapiro-Wilk_ test on a dataframe containing information about houses in different regions of the United States.\n",
"\n",
"Imagine your friend is a real estate agent and wants some help predicting housing prices for different regions in the USA. It would be helpful if you could somehow create a model that takes a few features of a house and returns the estimate of what the house would sell for.\n",
"\n",
"He has asked you if you could help him out with your new data science skills. You say yes and decide that Linear Regression might be a good path to solve this problem!\n",
"\n",
"The dataset with historical real state information is stored in the file `USA_Housing.csv` and contains the following columns:\n",
"\n",
"* 'Avg. Area Income': Avg. Income of residents of the city the house is located in.\n",
"* 'Avg. Area House Age': Avg Age of Houses in the same city.\n",
"* 'Avg. Area Number of Rooms': Avg Number of Rooms for Houses in the same city.\n",
"* 'Avg. Area Number of Bedrooms': Avg Number of Bedrooms for Houses in the same city.\n",
"* 'Area Population': Population of the city the house is located in.\n",
"* 'Price': Final Sale Price for the house.\n",
"* 'Address': Address for the house."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"deletable": false,
"editable": false,
"id": "f7Yr6WqnBEo8",
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"## Read and extract information about the data\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": [],
"colab_type": "code",
"collapsed": true,
"deletable": false,
"editable": false,
"id": "qd5jmv6YBEo9",
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [],
"source": [
"## DON'T CHANGE THIS CODE\n",
"USAhousing = pd.read_csv('data/USA_Housing.csv')"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"Avg. Area Income | \n",
"Avg. Area House Age | \n",
"Avg. Area Number of Rooms | \n",
"Avg. Area Number of Bedrooms | \n",
"Area Population | \n",
"Price | \n",
"Address | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"79545.458574 | \n",
"5.682861 | \n",
"7.009188 | \n",
"4.09 | \n",
"23086.800503 | \n",
"1.059034e+06 | \n",
"208 Michael Ferry Apt. 674\\nLaurabury, NE 3701... | \n",
"
\n",
"\n",
"1 | \n",
"79248.642455 | \n",
"6.002900 | \n",
"6.730821 | \n",
"3.09 | \n",
"40173.072174 | \n",
"1.505891e+06 | \n",
"188 Johnson Views Suite 079\\nLake Kathleen, CA... | \n",
"
\n",
"\n",
"2 | \n",
"61287.067179 | \n",
"5.865890 | \n",
"8.512727 | \n",
"5.13 | \n",
"36882.159400 | \n",
"1.058988e+06 | \n",
"9127 Elizabeth Stravenue\\nDanieltown, WI 06482... | \n",
"
\n",
"\n",
"3 | \n",
"63345.240046 | \n",
"7.188236 | \n",
"5.586729 | \n",
"3.26 | \n",
"34310.242831 | \n",
"1.260617e+06 | \n",
"USS Barnett\\nFPO AP 44820 | \n",
"
\n",
"\n",
"4 | \n",
"59982.197226 | \n",
"5.040555 | \n",
"7.839388 | \n",
"4.23 | \n",
"26354.109472 | \n",
"6.309435e+05 | \n",
"USNS Raymond\\nFPO AE 09386 | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms \\\n",
"0 79545.458574 5.682861 7.009188 \n",
"1 79248.642455 6.002900 6.730821 \n",
"2 61287.067179 5.865890 8.512727 \n",
"3 63345.240046 7.188236 5.586729 \n",
"4 59982.197226 5.040555 7.839388 \n",
"\n",
" Avg. Area Number of Bedrooms Area Population Price \\\n",
"0 4.09 23086.800503 1.059034e+06 \n",
"1 3.09 40173.072174 1.505891e+06 \n",
"2 5.13 36882.159400 1.058988e+06 \n",
"3 3.26 34310.242831 1.260617e+06 \n",
"4 4.23 26354.109472 6.309435e+05 \n",
"\n",
" Address \n",
"0 208 Michael Ferry Apt. 674\\nLaurabury, NE 3701... \n",
"1 188 Johnson Views Suite 079\\nLake Kathleen, CA... \n",
"2 9127 Elizabeth Stravenue\\nDanieltown, WI 06482... \n",
"3 USS Barnett\\nFPO AP 44820 \n",
"4 USNS Raymond\\nFPO AE 09386 "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"USAhousing.head()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": [],
"colab_type": "code",
"deletable": false,
"editable": false,
"id": "U8IkEWefBEpF",
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"outputId": "5d8de68b-0199-4e53-9b32-4cf862b9ca81"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"
\n",
"RangeIndex: 5000 entries, 0 to 4999\n",
"Data columns (total 7 columns):\n",
"Avg. Area Income 5000 non-null float64\n",
"Avg. Area House Age 5000 non-null float64\n",
"Avg. Area Number of Rooms 5000 non-null float64\n",
"Avg. Area Number of Bedrooms 5000 non-null float64\n",
"Area Population 5000 non-null float64\n",
"Price 5000 non-null float64\n",
"Address 5000 non-null object\n",
"dtypes: float64(6), object(1)\n",
"memory usage: 273.5+ KB\n"
]
}
],
"source": [
"USAhousing.info()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"colab": [],
"colab_type": "code",
"deletable": false,
"editable": false,
"id": "ZlmgNJriBEpI",
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"outputId": "2baf4687-162b-4165-a462-c666509b9eea"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"Avg. Area Income | \n",
"Avg. Area House Age | \n",
"Avg. Area Number of Rooms | \n",
"Avg. Area Number of Bedrooms | \n",
"Area Population | \n",
"Price | \n",
"
\n",
"\n",
"\n",
"\n",
"count | \n",
"5000.000000 | \n",
"5000.000000 | \n",
"5000.000000 | \n",
"5000.000000 | \n",
"5000.000000 | \n",
"5.000000e+03 | \n",
"
\n",
"\n",
"mean | \n",
"68583.108984 | \n",
"5.977222 | \n",
"6.987792 | \n",
"3.981330 | \n",
"36163.516039 | \n",
"1.232073e+06 | \n",
"
\n",
"\n",
"std | \n",
"10657.991214 | \n",
"0.991456 | \n",
"1.005833 | \n",
"1.234137 | \n",
"9925.650114 | \n",
"3.531176e+05 | \n",
"
\n",
"\n",
"min | \n",
"17796.631190 | \n",
"2.644304 | \n",
"3.236194 | \n",
"2.000000 | \n",
"172.610686 | \n",
"1.593866e+04 | \n",
"
\n",
"\n",
"25% | \n",
"61480.562388 | \n",
"5.322283 | \n",
"6.299250 | \n",
"3.140000 | \n",
"29403.928702 | \n",
"9.975771e+05 | \n",
"
\n",
"\n",
"50% | \n",
"68804.286404 | \n",
"5.970429 | \n",
"7.002902 | \n",
"4.050000 | \n",
"36199.406689 | \n",
"1.232669e+06 | \n",
"
\n",
"\n",
"75% | \n",
"75783.338666 | \n",
"6.650808 | \n",
"7.665871 | \n",
"4.490000 | \n",
"42861.290769 | \n",
"1.471210e+06 | \n",
"
\n",
"\n",
"max | \n",
"107701.748378 | \n",
"9.519088 | \n",
"10.759588 | \n",
"6.500000 | \n",
"69621.713378 | \n",
"2.469066e+06 | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms \\\n",
"count 5000.000000 5000.000000 5000.000000 \n",
"mean 68583.108984 5.977222 6.987792 \n",
"std 10657.991214 0.991456 1.005833 \n",
"min 17796.631190 2.644304 3.236194 \n",
"25% 61480.562388 5.322283 6.299250 \n",
"50% 68804.286404 5.970429 7.002902 \n",
"75% 75783.338666 6.650808 7.665871 \n",
"max 107701.748378 9.519088 10.759588 \n",
"\n",
" Avg. Area Number of Bedrooms Area Population Price \n",
"count 5000.000000 5000.000000 5.000000e+03 \n",
"mean 3.981330 36163.516039 1.232073e+06 \n",
"std 1.234137 9925.650114 3.531176e+05 \n",
"min 2.000000 172.610686 1.593866e+04 \n",
"25% 3.140000 29403.928702 9.975771e+05 \n",
"50% 4.050000 36199.406689 1.232669e+06 \n",
"75% 4.490000 42861.290769 1.471210e+06 \n",
"max 6.500000 69621.713378 2.469066e+06 "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"USAhousing.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"\n",
"[Return to top](#questions)\n",
"\n",
"## Question 6\n",
"\n",
"Extract the first 10 rows of the the `USAhousing` dataset and store them in a dataframe called `df`:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"colab": [],
"colab_type": "code",
"id": "hs1MkVCqBEpA",
"outputId": "7d4b61c6-924b-4fae-818d-2d471c3c56e6"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"Avg. Area Income | \n",
"Avg. Area House Age | \n",
"Avg. Area Number of Rooms | \n",
"Avg. Area Number of Bedrooms | \n",
"Area Population | \n",
"Price | \n",
"Address | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"79545.458574 | \n",
"5.682861 | \n",
"7.009188 | \n",
"4.09 | \n",
"23086.800503 | \n",
"1.059034e+06 | \n",
"208 Michael Ferry Apt. 674\\nLaurabury, NE 3701... | \n",
"
\n",
"\n",
"1 | \n",
"79248.642455 | \n",
"6.002900 | \n",
"6.730821 | \n",
"3.09 | \n",
"40173.072174 | \n",
"1.505891e+06 | \n",
"188 Johnson Views Suite 079\\nLake Kathleen, CA... | \n",
"
\n",
"\n",
"2 | \n",
"61287.067179 | \n",
"5.865890 | \n",
"8.512727 | \n",
"5.13 | \n",
"36882.159400 | \n",
"1.058988e+06 | \n",
"9127 Elizabeth Stravenue\\nDanieltown, WI 06482... | \n",
"
\n",
"\n",
"3 | \n",
"63345.240046 | \n",
"7.188236 | \n",
"5.586729 | \n",
"3.26 | \n",
"34310.242831 | \n",
"1.260617e+06 | \n",
"USS Barnett\\nFPO AP 44820 | \n",
"
\n",
"\n",
"4 | \n",
"59982.197226 | \n",
"5.040555 | \n",
"7.839388 | \n",
"4.23 | \n",
"26354.109472 | \n",
"6.309435e+05 | \n",
"USNS Raymond\\nFPO AE 09386 | \n",
"
\n",
"\n",
"5 | \n",
"80175.754159 | \n",
"4.988408 | \n",
"6.104512 | \n",
"4.04 | \n",
"26748.428425 | \n",
"1.068138e+06 | \n",
"06039 Jennifer Islands Apt. 443\\nTracyport, KS... | \n",
"
\n",
"\n",
"6 | \n",
"64698.463428 | \n",
"6.025336 | \n",
"8.147760 | \n",
"3.41 | \n",
"60828.249085 | \n",
"1.502056e+06 | \n",
"4759 Daniel Shoals Suite 442\\nNguyenburgh, CO ... | \n",
"
\n",
"\n",
"7 | \n",
"78394.339278 | \n",
"6.989780 | \n",
"6.620478 | \n",
"2.42 | \n",
"36516.358972 | \n",
"1.573937e+06 | \n",
"972 Joyce Viaduct\\nLake William, TN 17778-6483 | \n",
"
\n",
"\n",
"8 | \n",
"59927.660813 | \n",
"5.362126 | \n",
"6.393121 | \n",
"2.30 | \n",
"29387.396003 | \n",
"7.988695e+05 | \n",
"USS Gilbert\\nFPO AA 20957 | \n",
"
\n",
"\n",
"9 | \n",
"81885.927184 | \n",
"4.423672 | \n",
"8.167688 | \n",
"6.10 | \n",
"40149.965749 | \n",
"1.545155e+06 | \n",
"Unit 9446 Box 0958\\nDPO AE 97025 | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms \\\n",
"0 79545.458574 5.682861 7.009188 \n",
"1 79248.642455 6.002900 6.730821 \n",
"2 61287.067179 5.865890 8.512727 \n",
"3 63345.240046 7.188236 5.586729 \n",
"4 59982.197226 5.040555 7.839388 \n",
"5 80175.754159 4.988408 6.104512 \n",
"6 64698.463428 6.025336 8.147760 \n",
"7 78394.339278 6.989780 6.620478 \n",
"8 59927.660813 5.362126 6.393121 \n",
"9 81885.927184 4.423672 8.167688 \n",
"\n",
" Avg. Area Number of Bedrooms Area Population Price \\\n",
"0 4.09 23086.800503 1.059034e+06 \n",
"1 3.09 40173.072174 1.505891e+06 \n",
"2 5.13 36882.159400 1.058988e+06 \n",
"3 3.26 34310.242831 1.260617e+06 \n",
"4 4.23 26354.109472 6.309435e+05 \n",
"5 4.04 26748.428425 1.068138e+06 \n",
"6 3.41 60828.249085 1.502056e+06 \n",
"7 2.42 36516.358972 1.573937e+06 \n",
"8 2.30 29387.396003 7.988695e+05 \n",
"9 6.10 40149.965749 1.545155e+06 \n",
"\n",
" Address \n",
"0 208 Michael Ferry Apt. 674\\nLaurabury, NE 3701... \n",
"1 188 Johnson Views Suite 079\\nLake Kathleen, CA... \n",
"2 9127 Elizabeth Stravenue\\nDanieltown, WI 06482... \n",
"3 USS Barnett\\nFPO AP 44820 \n",
"4 USNS Raymond\\nFPO AE 09386 \n",
"5 06039 Jennifer Islands Apt. 443\\nTracyport, KS... \n",
"6 4759 Daniel Shoals Suite 442\\nNguyenburgh, CO ... \n",
"7 972 Joyce Viaduct\\nLake William, TN 17778-6483 \n",
"8 USS Gilbert\\nFPO AA 20957 \n",
"9 Unit 9446 Box 0958\\nDPO AE 97025 "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### GRADED\n",
"\n",
"### YOUR SOLUTION HERE\n",
"df = USAhousing.head(10)\n",
"\n",
"###\n",
"### YOUR CODE HERE\n",
"###\n",
"\n",
"### Answer check\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": true,
"grade_id": "Question 06",
"locked": true,
"points": "10",
"solution": false
}
},
"outputs": [],
"source": [
"###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"deletable": false,
"editable": false,
"id": "CyDo8jZwBEpS",
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"\n",
"## Exploratory data analysis (EDA)\n",
"\n",
"We will start our EDA by creating simple plots to visualize scatterplots of all numeric features against each other."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"colab": [],
"colab_type": "code",
"deletable": false,
"editable": false,
"id": "sDz7mXIDBEpT",
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"outputId": "246cc53e-0721-46f4-a150-6a61c024b783"
},
"outputs": [
{
"data": {
"image/png":...