Attached is my computational biology project, which consists of an R part and a Python part. The R section references a previous project, in which I used the RailTrail dataset from the mosaicData package in R to measure the effect of average temperature and precipitation on rider volume via linear regression. The R and Python templates are attached below, as is the PDF containing the project instructions. Thank you for your time!
Project 3: CART, Random Forests, and Basic Python

Instructions

You will submit four documents associated with this project:
• A .pdf file containing the results from knitting a .rmd file for the R portion of this project.
• The raw .rmd file used to obtain the .pdf file.
• A .pdf file containing the results from a Jupyter notebook for the Python portion of this project.
• The .ipynb file used to obtain the .pdf for the Python part of the project.

Included with these instructions are .rmd and .ipynb files to use as templates. All results presented must have corresponding code. Any answers/results/plots/etc. given without the corresponding code that generated the result will not be graded. Furthermore, all code contained in your final project document should work properly. Please do not include any extraneous code or code which produces error messages. (Code which produces warnings is acceptable, as long as you understand what the warnings mean.)

Problems in R: CART and Random Forests

This portion of the project will concern the CART and Random Forests algorithms and will be done in R. We will apply these methods to the dataset you chose to analyze in Project 2.

1. (5 pts) Briefly reintroduce your dataset, and state the response and predictor variables associated with one of your conceptual questions. For this problem, use the continuous response you used for the linear regression part.

2. (10 pts) Use the rpart function to fit a CART to your response variable. Then plot the tree (using, e.g., the rpart.plot package) and interpret the result.

3. (35 pts) Determine which of randomForest, rpart, and the model you fit in Project 2 does the best at predicting new data by doing the following (a sketch of one way to organize this appears after this problem list):

   a. (5 pts) Divide the dataset into 20 folds using the createFolds function in the caret package. This should produce a list folds such that folds[[k]] gives the indices to use as the testing set for the kth fold.

   b. (15 pts) For each fold k in 1:20, fit the data using rpart, randomForest, and your model from Project 2, using my_data[folds[[k]],] as the testing set and my_data[-folds[[k]],] as the training set. Then compute the mean-squared error

   MSE = \frac{1}{N_{\text{test}}} \sum_i \left( \hat{Y}^{\text{test}}_i - Y^{\text{test}}_i \right)^2

   for each method on each fold (here \hat{Y}^{\text{test}}_i denotes the ith prediction on the testing set, while Y^{\text{test}}_i is the ith response in the testing set). Print your results for the mean-squared error across all methods and folds as a matrix with 20 rows (one for each fold) and 3 columns (one for each method).

   c. (15 pts) For each method, average together the misclassification rate (for binary responses) or mean-squared error (for continuous responses) to obtain an overall measure of how well each method did. Which approach performed the best? Justify your answer.
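The following is a minimal R sketch of how problems 3a-3c might be organized; it is not part of the assignment. The data frame my_data, the response column y, and the formula y ~ . are placeholders for your own dataset, response, and Project 2 model, and the code assumes the caret, rpart, and randomForest packages are installed.

library(caret)          # createFolds
library(rpart)          # CART
library(randomForest)   # random forests

set.seed(1)
folds <- createFolds(my_data$y, k = 20)   # folds[[k]] = testing indices for fold k

mse <- matrix(NA, nrow = 20, ncol = 3,
              dimnames = list(NULL, c("rpart", "randomForest", "lm")))

for (k in 1:20) {
  test  <- my_data[folds[[k]], ]
  train <- my_data[-folds[[k]], ]

  fit_cart <- rpart(y ~ ., data = train)
  fit_rf   <- randomForest(y ~ ., data = train)
  fit_lm   <- lm(y ~ ., data = train)   # substitute your Project 2 formula here

  mse[k, "rpart"]        <- mean((predict(fit_cart, test) - test$y)^2)
  mse[k, "randomForest"] <- mean((predict(fit_rf,   test) - test$y)^2)
  mse[k, "lm"]           <- mean((predict(fit_lm,   test) - test$y)^2)
}

print(mse)        # 20 x 3 matrix of fold-level MSEs (problem 3b)
colMeans(mse)     # average MSE for each method (problem 3c)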
Problems in Python: Designing a Linear Model Class

For this half of the project, you will work on adding functionality similar to the lm function to Python. The following code defines a LinearModel class which can be used to fit linear models of the form

Y_i = \alpha + \sum_{j=1}^{J} \beta_j X_{ij} + \varepsilon_i

where X_{i1}, ..., X_{iJ} are a collection of predictor variables for the response Y_i and the error ε_i has zero mean and constant variance σ².

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
%matplotlib inline

class LinearModel:
    def __init__(self, X, y):
        N = len(y)
        X2 = np.c_[np.ones(N), X]                      # prepend a column of ones for the intercept
        XtXi = np.linalg.inv(np.matmul(X2.T, X2))
        XtY = np.matmul(X2.T, y)
        beta_hat = np.matmul(XtXi, XtY)                # least-squares estimates (intercept first)
        self.X = X                                     # save the original predictors and response
        self.y = y
        self.alpha_beta = beta_hat
        self.alpha_hat = beta_hat[0]
        self.beta_hat = beta_hat[1:]
        self.fitted = np.matmul(X2, beta_hat)
        self.residuals = y - self.fitted
        self.sigma_hat = np.sqrt(np.sum(self.residuals * self.residuals) / (N - 1.0))
        vcov = self.sigma_hat * self.sigma_hat * XtXi  # covariance matrix of the estimates
        self.se_alpha = np.sqrt(vcov[0, 0])
        self.se_beta = np.sqrt(np.diag(vcov[1:, 1:]))

    def __str__(self):
        out_str = "Coefficient Estimates\n"
        out_str += "alpha = " + str(self.alpha_hat) + "\n"
        for i in range(0, len(self.beta_hat)):
            out_str += "beta[" + str(i) + "] = " + str(self.beta_hat[i]) + "\n"
        out_str += "sigma = " + str(self.sigma_hat) + "\n"
        return out_str

    # Below here add the functions residual_plot(self), significance_test(self),
    # confidence_interval(self), and backwards_select(self)
    def residual_plot(self):
        return None

    def significance_test(self):
        return None

    def confidence_interval(self):
        return None

    def backwards_select(self):
        ## The functions np.argmax and np.delete will be useful!
        return None

Given a numpy matrix X and vector y, running my_lm = LinearModel(X, y) creates a linear model object with the following attributes:
• alpha_hat: an estimate of the intercept α.
• beta_hat: an estimate of the βj's, with beta_hat[0] being an estimate of β1, for example.
• sigma_hat: an estimate of the standard deviation of the error ε_i.
• fitted: the fitted values \hat{\alpha} + \sum_{j=1}^{J} X_{ij} \hat{\beta}_j.
• residuals: the residuals Y_i - \hat{\alpha} - \sum_{j=1}^{J} X_{ij} \hat{\beta}_j.
• se_alpha: the standard error of alpha_hat.
• se_beta: a vector such that se_beta[i] is the standard error of beta_hat[i].
• y and X: the original response and predictors, saved as part of the object.

To test that your code works, we will use the boston dataset:

from sklearn import datasets
boston = datasets.load_boston()
X = boston.data    # Store the predictors in X
y = boston.target  # Store medv in y
boston_lm = LinearModel(X, y)

1. (10 points) Write a method .residual_plot() which uses the matplotlib library to plot the fitted values on the x-axis and the residuals on the y-axis, labeling the x-axis with "Fitted Values" and the y-axis with "Residuals". Use this function to make a residual plot for the boston dataset and comment on your results. (A sketch of one possible implementation of the methods in problems 1-4 appears after problem 4.)

2. (10 points) Write a method .significance_test() which computes, for each βj, a p-value for whether this coefficient is equal to 0 or not. You do not need to remember any statistics to do this. You just need to know that the p-value is given by 2 × norm.cdf(−|β̂j / se_j|), where se_j is given by se_beta[j-1]. Then, run boston_lm.significance_test() and print the resulting p-values. (Note: the absolute value can be taken using numpy.abs.)

3. (10 points) Write a method .confidence_interval() which computes, for each βj, a 95% confidence interval for the coefficients. Again, you don't need to remember any statistics to do this. You just need to know that the interval is given by β̂j ± 1.96 × se_j. Any format for the output is fine, but it is probably most natural to output the result as a two-dimensional numpy array with the first column giving the lower endpoint and the second column giving the upper endpoint. Then, run boston_lm.confidence_interval() and print the resulting confidence intervals.
4. (20 points) The backwards elimination algorithm is a (questionable?) approach for choosing which variables to include in a linear model. We will consider the following algorithm:

   1. Fit a linear model predicting y from X.
   2. Compute the p-values using .significance_test().
   3. If the largest p-value is greater than 0.1, remove the associated column from X and return to step 1. For example, if we have p-values p_value[0] = 0.05, p_value[1] = 0.2, and p_value[2] = 0.3, then we would remove the third column of X (given by X[:,2]) and return to step 1.
   4. If all p-values are less than 0.1, or if there is only one column of X remaining, stop the procedure and return the last model you fit as the selected model.

Add a method .backwards_select() which implements the above backward selection algorithm. The method should return another object of class LinearModel. Hint: it's probably easiest to implement this algorithm recursively, i.e., the function .backwards_select() can create another linear model which then calls .backwards_select(). Run the backwards selection algorithm and then state how many variables are included in the final model. One more hint: you can check your answers here by comparing your results to what you get using the lm function in R to fit the Boston housing dataset. The answers won't be exactly the same due to minor details in how I implemented the linear regression, but they should be very close.
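The following is a minimal sketch, not a required implementation, of how the four methods in problems 1-4 might be filled in. It assumes the LinearModel template exactly as given above (the attributes beta_hat, se_beta, fitted, residuals, X, and y, and the imports np, plt, and norm); the defs belong inside the class body in place of the return None stubs.

# Sketch only: these methods go inside the LinearModel class body above.

def residual_plot(self):
    # Problem 1: fitted values on the x-axis, residuals on the y-axis.
    plt.scatter(self.fitted, self.residuals)
    plt.xlabel("Fitted Values")
    plt.ylabel("Residuals")
    plt.show()

def significance_test(self):
    # Problem 2: two-sided p-value 2 * norm.cdf(-|beta_hat_j / se_j|) for each coefficient.
    return 2 * norm.cdf(-np.abs(self.beta_hat / self.se_beta))

def confidence_interval(self):
    # Problem 3: 95% interval beta_hat_j +/- 1.96 * se_j, one row per coefficient
    # (first column = lower endpoint, second column = upper endpoint).
    return np.c_[self.beta_hat - 1.96 * self.se_beta,
                 self.beta_hat + 1.96 * self.se_beta]

def backwards_select(self):
    # Problem 4: recursive backwards elimination. While the largest p-value
    # exceeds 0.1 and more than one column remains, drop that column and refit.
    p_values = self.significance_test()
    if np.max(p_values) > 0.1 and self.X.shape[1] > 1:
        worst = np.argmax(p_values)                    # column with the largest p-value
        X_reduced = np.delete(self.X, worst, axis=1)   # remove it, as the hint suggests
        return LinearModel(X_reduced, self.y).backwards_select()
    return self                                        # stopping rule from step 4

For example, boston_select = boston_lm.backwards_select() followed by print(len(boston_select.beta_hat)) would report how many predictors remain in the selected model.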