The dataset for this assignment contains house prices and 19 other features for each property. These features are described below and cover the house (number of bedrooms, bathrooms, …), the lot (square footage, …), and the sale conditions (time of year, …). The goal of the assignment is to predict the sale price of a house using linear regression. The training set is in the file "house_prices_train.csv" and the test set is in the file "house_prices_test.csv".
Here is a brief description of each feature in the dataset:
SalePrice: the property's sale price in dollars. This is the target variable that you're trying to predict.
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
YearBuilt: Original construction date
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
BedroomAbvGr: Number of bedrooms above basement level
KitchenAbvGr: Number of kitchens
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
PoolArea: Pool area in square feet
MoSold: Month Sold
YrSold: Year Sold
Question 1: Data Cleaning
- Open the training dataset and remove all rows that contain at least one missing value (NA)
- Return the new clean dataset and the number of rows in that dataset
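The cleaning step above could be sketched as follows. This is only one possible approach, assuming pandas is used; the function name `clean_training_data` and the small demo frame are hypothetical and not part of the assignment.

```python
import pandas as pd


def clean_training_data(df):
    """Drop every row with at least one missing value.

    Returns the cleaned DataFrame and its number of rows.
    """
    clean = df.dropna(axis=0, how="any").reset_index(drop=True)
    return clean, len(clean)


# Tiny made-up frame for illustration (not taken from the real data).
demo = pd.DataFrame(
    {
        "LotArea": [8450, None, 11250],
        "SalePrice": [208500, 181500, None],
    }
)
clean_demo, n_rows = clean_training_data(demo)
print(n_rows)
```

With the real data, the input frame would presumably come from `pd.read_csv("house_prices_train.csv")`.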
Question 2:
For the training dataset, print a summary of the variables “LotArea”, “YearBuilt”, “GarageArea”, “BedroomAbvGr”, and “SalePrice”. Return the whole summary and a list containing (in that order):
- The maximum sale price
- The minimum garage area
- The first quartile of lot area
- The second most common year built
- The mean of BedroomAbvGr
Hint: Use the built-in method describe() for a pandas.DataFrame
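A minimal sketch of how the summary and the answer list might be assembled is shown here. It assumes that `describe()` supplies the max, min, quartile, and mean, and that the second most common year has to come from `value_counts()`, since `describe()` does not report frequencies for numeric columns. The function name `summarize` and the demo values are hypothetical.

```python
import pandas as pd

SUMMARY_COLUMNS = ["LotArea", "YearBuilt", "GarageArea", "BedroomAbvGr", "SalePrice"]


def summarize(df):
    """Return the describe() table and the five requested statistics."""
    summary = df[SUMMARY_COLUMNS].describe()
    # value_counts() sorts years from most to least frequent;
    # if two years are tied, their relative order is not guaranteed.
    year_counts = df["YearBuilt"].value_counts()
    answers = [
        summary.loc["max", "SalePrice"],
        summary.loc["min", "GarageArea"],
        summary.loc["25%", "LotArea"],
        year_counts.index[1],
        summary.loc["mean", "BedroomAbvGr"],
    ]
    return summary, answers


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "LotArea": [8000, 9000, 10000, 11000, 12000, 13000],
        "YearBuilt": [2000, 2000, 2000, 1990, 1990, 1980],
        "GarageArea": [400, 500, 0, 600, 300, 450],
        "BedroomAbvGr": [3, 3, 2, 4, 3, 3],
        "SalePrice": [200000, 210000, 150000, 250000, 180000, 190000],
    }
)
summary, answers = summarize(demo)
print(summary)
print(answers)
```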
Question 3:
Run a linear regression on "SalePrice" using the variables “LotArea”, “YearBuilt”, “GarageArea”, and “BedroomAbvGr”. Return the regression coefficient of each variable in a dictionary similar to this: {"LotArea": 1.888, "YearBuilt": -0.06, ...} (this is only an example, not the right answer).
Compute the Root Mean Squared Error (RMSE) using the file "house_prices_test.csv" to measure the out-of-sample performance of the model.
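The fit and the out-of-sample RMSE, i.e. the square root of the mean squared difference between actual and predicted prices, could be sketched as below. The assignment does not name a library, so this sketch assumes ordinary least squares with an intercept via `numpy.linalg.lstsq`; the function names and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd

FEATURES = ["LotArea", "YearBuilt", "GarageArea", "BedroomAbvGr"]


def fit_linear_regression(train, features, target="SalePrice"):
    """Ordinary least squares with an intercept; returns (intercept, coefficient dict)."""
    X = np.column_stack([np.ones(len(train)), train[features].to_numpy(dtype=float)])
    y = train[target].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    coefs = {name: float(b) for name, b in zip(features, beta[1:])}
    return float(beta[0]), coefs


def rmse(df, intercept, coefs, target="SalePrice"):
    """Root mean squared error of the fitted model on df."""
    pred = intercept + sum(df[name].to_numpy(dtype=float) * b for name, b in coefs.items())
    err = df[target].to_numpy(dtype=float) - pred
    return float(np.sqrt(np.mean(err ** 2)))


def make_synthetic(n, rng):
    """Noise-free synthetic rows (illustration only, not the assignment data)."""
    df = pd.DataFrame(
        {
            "LotArea": rng.uniform(5000, 15000, n),
            "YearBuilt": rng.integers(1950, 2011, n),
            "GarageArea": rng.uniform(0, 800, n),
            "BedroomAbvGr": rng.integers(1, 6, n),
        }
    )
    df["SalePrice"] = (
        10000 + 2 * df["LotArea"] + 50 * df["YearBuilt"]
        + 30 * df["GarageArea"] + 1000 * df["BedroomAbvGr"]
    )
    return df


rng = np.random.default_rng(0)
train_demo, test_demo = make_synthetic(40, rng), make_synthetic(20, rng)
intercept, coefs = fit_linear_regression(train_demo, FEATURES)
test_error = rmse(test_demo, intercept, coefs)
print(coefs, test_error)
```

Because the synthetic prices are an exact linear function of the features, the recovered coefficients should match the ones used to generate them and the RMSE should be close to zero. On the real files the error will be much larger.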
Question 4:
Refit the model on the training set using all the variables and return the RMSE on the test set.
(The first column, "unnamed: 0", is not a variable.)
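One way to refit on every variable while skipping the index column is sketched below. It assumes the extra column's name starts with "unnamed" (matched case-insensitively) and again uses `numpy.linalg.lstsq`, which still returns a solution if some features are linearly dependent. The function names and demo values are hypothetical.

```python
import numpy as np
import pandas as pd


def feature_columns(df, target="SalePrice"):
    """All columns except the target and any saved-index column such as 'Unnamed: 0'."""
    return [
        c for c in df.columns
        if c != target and not str(c).lower().startswith("unnamed")
    ]


def design_matrix(df, features):
    """Feature matrix with a leading column of ones for the intercept."""
    return np.column_stack([np.ones(len(df)), df[features].to_numpy(dtype=float)])


def all_features_rmse(train, test, target="SalePrice"):
    """Fit on train with every feature, then return (features, RMSE on test)."""
    features = feature_columns(train, target)
    y_train = train[target].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(design_matrix(train, features), y_train, rcond=None)
    pred = design_matrix(test, features) @ beta
    err = test[target].to_numpy(dtype=float) - pred
    return features, float(np.sqrt(np.mean(err ** 2)))


# Made-up rows where SalePrice = 1000 + 2 * LotArea + 30 * GarageArea exactly.
train_demo = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2, 3, 4],
        "LotArea": [8000, 9000, 10000, 11000, 12000],
        "GarageArea": [400, 0, 600, 300, 500],
        "SalePrice": [29000, 19000, 39000, 32000, 40000],
    }
)
test_demo = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2],
        "LotArea": [8500, 9500, 10500],
        "GarageArea": [200, 450, 100],
        "SalePrice": [24000, 33500, 25000],
    }
)
used_features, test_error = all_features_rmse(train_demo, test_demo)
print(used_features, test_error)
```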