a numpy assignment


Dataset: City

The original data from Chicago's data portal contains detailed information for each crime and call to 311. We have split the city up into regions using a simple grid and have aggregated this data by region. Each city data file contains data for different types of complaints (that is, calls to 311) and the total number of crimes on a per-region basis. The first row in the file contains column labels, for example, GRAFFITI or POT_HOLES. Subsequent rows contain data for different regions of the city. A column contains data for a given variable across all the rows. For example, the column with index 1 (the second column) contains the number of calls about pot holes for each region. In addition to information about specific types of complaints, the file also has one column that contains the total number of crimes in each region.

File paths: data/city

Parameters: {"name": "City", "predictor_vars": [0, 1, 2, 3, 4, 5, 6], "dependent_var": 7, "training_fraction": 0.55, "seed": 22992}

City Task 1a:
CRIME_TOTALS ~ 575.687669 + 0.678349 * GRAFFITI
R2: 0.14027491610313492
CRIME_TOTALS ~ -22.208880 + 5.375417 * POT_HOLES
R2: 0.6229070858532731
CRIME_TOTALS ~ 227.414583 + 7.711958 * RODENTS
R2: 0.5575360783921093
CRIME_TOTALS ~ 11.553128 + 18.892669 * GARBAGE
R2: 0.7831498392992615
CRIME_TOTALS ~ -65.954319 + 13.447459 * STREET_LIGHTS
R2: 0.7198560514392484
CRIME_TOTALS ~ 297.222082 + 10.324616 * TREE_DEBRIS
R2: 0.32659079486818354
CRIME_TOTALS ~ 308.489056 + 10.338500 * ABANDONED_BUILDINGS
R2: 0.6897288976957778

City Task 1b:
CRIME_TOTALS ~ -35.784745 + -0.347343 * GRAFFITI + 3.596555 * POT_HOLES + -0.143517 * RODENTS + 4.214673 * GARBAGE + 2.446765 * STREET_LIGHTS + -4.148366 * TREE_DEBRIS + 5.724136 * ABANDONED_BUILDINGS
R2: 0.8909173620789893

City Task 2:
CRIME_TOTALS ~ -36.151629 + 3.300180 * POT_HOLES + 7.129337 * ABANDONED_BUILDINGS
R2: 0.8580580940940485

City Task 3:
CRIME_TOTALS ~ 308.489056 + 10.338500 * ABANDONED_BUILDINGS
R2: 0.6897288976957778
CRIME_TOTALS ~ -36.151629 + 3.300180 * POT_HOLES + 7.129337 * ABANDONED_BUILDINGS
R2: 0.8580580940940485
CRIME_TOTALS ~ -53.303574 + -0.213704 * GRAFFITI + 3.948901 * POT_HOLES + 6.769038 * ABANDONED_BUILDINGS
R2: 0.8650034618337505
CRIME_TOTALS ~ -29.057833 + -0.386986 * GRAFFITI + 5.057974 * POT_HOLES + -3.424232 * TREE_DEBRIS + 7.226820 * ABANDONED_BUILDINGS
R2: 0.8799155180187794
CRIME_TOTALS ~ -22.991702 + -0.337971 * GRAFFITI + 3.900442 * POT_HOLES + 5.033985 * GARBAGE + -3.433079 * TREE_DEBRIS + 6.078619 * ABANDONED_BUILDINGS
R2: 0.8877056719368024
CRIME_TOTALS ~ -35.457501 + -0.348926 * GRAFFITI + 3.532662 * POT_HOLES + 4.058232 * GARBAGE + 2.554864 * STREET_LIGHTS + -4.135113 * TREE_DEBRIS + 5.688046 * ABANDONED_BUILDINGS
R2: 0.8908748485824084
CRIME_TOTALS ~ -35.784745 + -0.347343 * GRAFFITI + 3.596555 * POT_HOLES + -0.143517 * RODENTS + 4.214673 * GARBAGE + 2.446765 * STREET_LIGHTS + -4.148366 * TREE_DEBRIS + 5.724136 * ABANDONED_BUILDINGS
R2: 0.8909173620789893

City Task 4:
CRIME_TOTALS ~ -35.457501 + -0.348926 * GRAFFITI + 3.532662 * POT_HOLES + 4.058232 * GARBAGE + 2.554864 * STREET_LIGHTS + -4.135113 * TREE_DEBRIS + 5.688046 * ABANDONED_BUILDINGS
R2: 0.8908748485824084
Adjusted R2: 0.8875512399097913

City Task 5:
CRIME_TOTALS ~ -35.457501 + -0.348926 * GRAFFITI + 3.532662 * POT_HOLES + 4.058232 * GARBAGE + 2.554864 * STREET_LIGHTS + -4.135113 * TREE_DEBRIS + 5.688046 * ABANDONED_BUILDINGS
Training R2: 0.8908748485824084
Testing R2: 0.8084939761877112
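As a quick arithmetic check on the Task 4 numbers above, the reported adjusted R2 follows from the training R2 via the standard adjusted-R2 formula, adj_R2 = 1 - (1 - R2) * (N - 1) / (N - K - 1), with K = 6 predictors. The training-set size N = 204 used below is inferred from the two reported values rather than stated in the output, so treat it as an assumption.

R2 = 0.8908748485824084   # City Task 4 training R^2
N, K = 204, 6             # assumed number of training rows (inferred), 6 predictors in the model
adj_R2 = 1 - (1 - R2) * (N - 1) / (N - K - 1)
print(adj_R2)             # ~0.88755, matching the reported adjusted R2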
Linear Regression

Due: Wednesday, Nov 13 at 9pm

In this assignment, you will fit linear regression models and implement a few simple variable selection algorithms. The assignment will give you experience with NumPy and more practice with using classes and functions to support code reuse. You must work alone on this assignment.

Introduction

At the heart of the assignment is a table, where each column is a variable and each row is a sample unit. As an example, in a health study, each sample unit might be a person, with variables like height, weight, sex, etc. In your analysis, you will build models that, with varying levels of accuracy, can predict the value of one of the variables as a function of the others.

Predictions are only possible if variables are related somehow. As an example, look at this plot of recorded crimes against logged complaint calls about garbage to 311. Each point describes a sample unit, which in this example represents a geographical region of Chicago. Each region is associated with variables, such as the number of crimes or complaint calls during a fixed time frame. Given this plot, if you were asked how many crimes you think were recorded for a region that had 150 complaint calls about garbage, you would follow the general trend and probably say something like 3000 recorded crimes.

To formalize this prediction, we need a model for the data that relates a dependent variable (e.g., crimes) to a set of predictor variables (e.g., complaint calls). Our model will assume a linear dependence. To make this precise, we will use the following notation:

- $N$: the total number of sample units.
- $K$: the total number of predictor variables. In the example above, $K = 1$.
- $n$: the sample unit that we are currently considering (an integer from $0$ to $N - 1$).
- $x_{nk}$: an observation of predictor variable $k$ for sample unit $n$, e.g., the number of complaint calls about garbage.
- $y_n$: an observation of the dependent variable for sample unit $n$, e.g., the total number of crimes.
- $\hat{y}_n$: our prediction for the dependent variable for sample unit $n$, based on our observation of the predictor variables. This value corresponds to a point on the red line.
- $\varepsilon_n = y_n - \hat{y}_n$: the residual or observed error, that is, the difference between the actual observed value of the dependent variable and our prediction for it.

Ideally, our predictions would match the observations, so that $\varepsilon_n$ would always be zero. In practice, there will be some discrepancy, for two reasons. For one, when we make predictions on new data, we will not have access to the observations of the dependent variable. But also, our model will assume a linear dependence between the predictor variables and the dependent variable, while in reality the relationship will not be quite linear. So, even when we do have direct access to the observations of the dependent variable, we will not have $\varepsilon_n$ equal to zero.

Our prediction for the dependent variable will be given by a linear equation:

$\hat{y}_n = \beta_0 + \beta_1 x_{n1} + \cdots + \beta_K x_{nK}$    (1)

where the coefficients $\beta_0, \beta_1, \ldots, \beta_K$ are real numbers. We would like to select values for these coefficients that result in small residuals $\varepsilon_n$.
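To see equation (1) in action, the single-predictor GARBAGE model reported under City Task 1a above (intercept 11.553128, slope 18.892669) can be evaluated at the 150 complaint calls from the example. The short sketch below does exactly that arithmetic, writing the predictors with a leading 1 so the prediction is a single dot product, matching the vector form introduced just below.

import numpy as np

beta = np.array([11.553128, 18.892669])  # (intercept, slope) for GARBAGE from City Task 1a
x_n = np.array([1.0, 150.0])             # predictor vector for 150 garbage calls, with a 1 prepended
y_hat = x_n @ beta                       # equation (1): y_hat_n = beta_0 + beta_1 * GARBAGE
print(y_hat)                             # about 2845 crimes, close to the ~3000 eyeballed above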
We can rewrite this equation more concisely using vector notation. We define:

- $\beta = (\beta_0 \; \beta_1 \; \beta_2 \; \cdots \; \beta_K)^T$: a column vector of the regression coefficients, where $\beta_0$ is the intercept and $\beta_k$ (for $1 \le k \le K$) is the coefficient associated with the $k$th predictor. This vector describes the red line in the figure above. Note that a positive value of a coefficient suggests a positive correlation with the dependent variable; the same is true for a negative value and a negative correlation.
- $x_n = (1 \; x_{n1} \; x_{n2} \; \cdots \; x_{nK})^T$: a column vector representation of all the predictors for a given sample unit. Note that a 1 has been prepended to the vector. This will allow us to rewrite equation (1) in vector notation without having to treat $\beta_0$ separately from the other coefficients $\beta_k$.

We can then rewrite equation (1) as:

$\hat{y}_n = x_n^T \beta$    (2)

This equation can be written for all sample units at the same time using matrix notation. We define:

- $y = (y_0 \; y_1 \; \cdots \; y_{N-1})^T$: a column vector of observations of the dependent variable.
- $\hat{y} = (\hat{y}_0 \; \hat{y}_1 \; \cdots \; \hat{y}_{N-1})^T$: a column vector of predictions for the dependent variable.
- $\varepsilon = (\varepsilon_0 \; \varepsilon_1 \; \cdots \; \varepsilon_{N-1})^T$: a column vector of the residuals (observed errors).
- $X$: an $N \times (K + 1)$ matrix where each row is one sample unit. The first column of this matrix is all ones.

We can then write equations (1) and (2) for all sample units at once as

$\hat{y} = X\beta$    (3)

and we can express the residuals as

$\varepsilon = y - \hat{y}$    (4)

Matrix multiplication: Equations (2) and (3) above involve matrix multiplication. If you are unfamiliar with matrix multiplication, you will still be able to do this assignment. Just keep in mind that, to make the calculations less messy, the matrix $X$ contains not just the observations of the predictor variables, but also an initial column of all ones. The data we provide does not yet have this column of ones, so you will need to prepend it.

Model fitting

There are many possible candidates for $\beta$; some fit the data better than others. Finding the best value of $\beta$ is referred to as fitting the model. For our purposes, the "best" value of $\beta$ is the one that minimizes the residuals $\varepsilon_n$ in the least-squared sense. That is, we want the value for $\beta$ such that the predicted values $\hat{y}_n$ are as close to the observed values $y_n$ as possible (in a statistically-motivated way using maximum likelihood). We will provide a function that computes this value of $\beta$; see "The linear_regression function" below.

Getting started

We have seeded your repository with a directory for this assignment. To pick it up, change to your capp30121-aut-19-username directory (where the string username should be replaced with your username) and then run the command git pull upstream master. You should also run git pull to make sure your local copy of your repository is in sync with the server.

The pa5 directory contains the following files:

- regression.py: Python file where you will write your code.
- util.py: Python file with several helper functions, some of which you will need to use in your code.
- output.py: This file is described in detail in the "Testing your code" section below.
- test_regression.py: Python file with the automated tests for this assignment.

The pa5 directory also contains a data directory which, in turn, contains two sub-directories: city and stock.

Data

In this assignment you will write code that can be used ...
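To make the "Model fitting" section above concrete, here is a rough sketch of what a least-squares fit of equation (3) computes: prepend the column of ones to the predictors, solve for beta, then form predictions and residuals as in equations (3) and (4). The assignment provides this functionality in util.linear_regression; the code below is only an independent illustration using np.linalg.lstsq, not necessarily the provided implementation, and the small arrays at the end are made-up numbers.

import numpy as np

def fit_least_squares(X, y):
    '''Return beta minimizing ||y - X_ones @ beta||^2 (illustration only).'''
    X_ones = np.column_stack([np.ones(X.shape[0]), X])      # N x (K + 1), first column all ones
    beta, _, _, _ = np.linalg.lstsq(X_ones, y, rcond=None)   # least-squares solve
    return beta

def predict(beta, X):
    '''Compute y_hat = X_ones @ beta, i.e. equation (3).'''
    X_ones = np.column_stack([np.ones(X.shape[0]), X])
    return X_ones @ beta

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one predictor, four sample units
y = np.array([2.1, 3.9, 6.2, 7.8])
beta = fit_least_squares(X, y)
residuals = y - predict(beta, X)             # equation (4)
print(beta, residuals)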
Answered by Abr Writing, Nov 13, 2021:
'''
Linear regression
YOUR NAME HERE
Main file for linear regression and model selection.
'''
import numpy as np
from sklearn.model_selection import train_test_split
import util
class DataSet(object):
    '''
    Class for representing a data set.
    '''

    def __init__(self, dir_path):
        '''
        Constructor

        Inputs:
            dir_path: (string) path to the directory that contains the
                files
        '''
        # Load the parameters (name, variable indices, split fraction, seed)
        # and the data array for this dataset.
        params = util.load_json_file(dir_path, "parameters.json")
        data = util.load_numpy_array(dir_path, "data.csv")

        self.name = params['name']
        self.dependent_var = params['dependent_var']
        self.pred_vars = params['predictor_vars']
        self.seed = params['seed']
        self.split_fraction = params['training_fraction']
        # load_numpy_array returns the column labels and the data itself.
        self.col_names = data[0]
        self.data = data[1]
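
# For reference, the DataSet constructor above expects the dataset directory to
# contain a parameters.json and a data.csv (those are the file names it loads).
# For the City dataset, parameters.json holds the parameters block quoted in the
# task statement:
#
#   {"name": "City", "predictor_vars": [0, 1, 2, 3, 4, 5, 6],
#    "dependent_var": 7, "training_fraction": 0.55, "seed": 22992}
#
# and data.csv is expected to start with a header row of column labels
# (GRAFFITI, POT_HOLES, ..., CRIME_TOTALS) followed by one row of counts per region.
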
class Model(object):
    '''
    Class for representing a model.
    '''

    def __init__(self, dataset, pred_vars):
        '''
        Construct a data structure to hold the model.

        Inputs:
            dataset: a DataSet instance
            pred_vars: a list of the indices for the columns (of the
                original data array) used in the model.
        '''
        self.col_names = dataset.col_names
        self.dep_var = dataset.dependent_var
        # Split the data into training and testing sets using the fraction
        # and seed specified in the dataset's parameters.
        self.train, self.test = train_test_split(
            dataset.data, test_size=None,
            train_size=dataset.split_fraction, random_state=dataset.seed)
        self.pred_vars = pred_vars
        self.pred_obs = self.train[:, self.pred_vars]
        self.dependent_obs = self.train[:, self.dep_var]
        # Fit the regression coefficients (beta) on the training data.
        self.beta = util.linear_regression(self.pred_obs, self.dependent_obs)
        self.R2 = self.rsquared()
        self.adj_R2 = None
    def __repr__(self):
        '''
        Format the model as a string.
        '''
        n = "{} ~ {}".format(self.col_names[self.dep_var], self.beta[0])
        if type(self.pred_vars) == list:
            ...
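A minimal usage sketch of the two classes above, assuming the data/city directory from the task statement and the util helpers used in the constructors (load_json_file, load_numpy_array and linear_regression); this is illustrative only and not part of the graded solution:

# Hypothetical usage, run from the directory containing regression.py and data/:
city = DataSet("data/city")
model = Model(city, city.pred_vars)   # all-predictor model, as in City Task 1b
print(model)                          # uses the __repr__ defined above
print(model.R2)                       # training R^2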