This assessment has two parts, but so far I only have Part A; Part B has not been published yet. Both parts are due on the same date.
37357 Spring 2021

37357 Advanced Statistical Modelling
SAS Assignment 1 Part 1 – Linear Regression

This assignment is worth 40% of your final mark, split into two parts, the linear regression component and the logistic regression. Both parts are due in Week 11 – Friday at 5pm via CANVAS upload. Late submissions will attract a penalty of 25% for every 24 hours or part thereof.

The assignment uses a dataset called housing, which can be found in the SAS course library. Run the following code to access the dataset and load it into your temporary library.

    LIBNAME mydata "~/my_shared_file_links/james.brown/ASM2021" access=readonly;
    run;
    proc print data=mydata.housing;
    run;

Submission Guidelines
• Assignments are to be typed (not handwritten), and uploaded in CANVAS.
• Assignments must include a signed cover sheet (can be found on CANVAS).
• Questions are to be answered in order.
• Each question must be clearly distinguishable – do not answer multiple questions in one paragraph.
• Miscellaneous plots tacked onto the end of the assignment will not be considered. When you are asked for a specific plot, you should provide it within the relevant question.
• You are welcome to copy relevant bits of output into your answers if you think it supports your argument, but you will be marked down if you simply copy and paste everything.
• Written answers must be no more than a few lines for each question.

SAS coding hints are provided at the end of the document.

Linear Regression of the Boston Housing Dataset [15 marks]

The dataset housing in the SAS course library contains information regarding the housing values in the suburbs of Boston.
The data frame has 506 rows and 14 variables (in order):

crim      per capita crime rate by town;
zn        proportion of residential land zoned for lots over 25,000 sq ft;
indus     proportion of non-retail business acres per town;
chas      Charles River dummy variable (= 1 if tract bounds river, 0 otherwise);
nox       nitrogen oxides concentration (parts per 10 million);
rm        categorized average number of rooms per dwelling (1 = 6 or less rooms, 2 = 7 rooms, 3 = 8 or more rooms);
age       proportion of owner-occupied units built prior to 1940;
dis       weighted mean of distances to five Boston employment centres;
rad       index of accessibility to radial highways;
tax       full-value property-tax rate per USD 10,000;
ptratio   pupil-teacher ratio by town;
black     1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town;
lstat     percentage of lower status of the population;
medv      median value of owner-occupied homes in USD 1000's.

(1) [1 mark] Produce a scatterplot of 'medv' against 'nox' and add a regression line. Describe the relationship.

(2) [1 mark] Perform a simple regression of 'medv' against 'nox'. Using the diagnostic plots, comment on the model.

(3) [2 marks] Fit a multiple regression of 'medv' on the rest of the predictors, using only main linear effects (that is, no interactions or polynomial terms). Briefly explain the process by which you select the model. Produce the diagnostic plots and regression summary and comment on the model.

(4) [1 mark] Produce a scatterplot of 'medv' against 'lstat', and include linear, quadratic and cubic regression lines. Describe the presence of potential non-linear relationships.

(5) [1 mark] Add a suitable polynomial term to your final model from question (3). Produce the diagnostic plots and compare these against the previous model. Comment on the suitability of the polynomial term.

(6) [1 mark] Produce a scatterplot of 'medv' against 'nox' and include the regression lines grouped by the dummy variable 'chas'. Describe the relationship between 'chas', 'nox' and 'medv'. Comment on whether the interaction between 'chas' and 'nox' needs to be included in the regression model.

(7) [1 mark] Fit a multiple regression with the variables from question (5), adding an interaction term between 'rm' and 'lstat' and, if you believe it necessary, the potential interaction from question (6). Produce the diagnostic plots, compare these against the previous model, and comment on the suitability of the interaction between 'rm' and 'lstat'.

(8) [2 marks] Using the model from (7), remove any unnecessary terms and include any other factors you think might be appropriate, such as interaction and polynomial terms. Comment on the model assumptions. Explain why this model is the most appropriate in your opinion.

(9) [2 marks] Write down the regression equation for the final model identified in question (8). Explain whether the overall regression is significant and provide the coefficient of determination.

Recall from Lab Week 2 that we used proc score to predict the response of new observations, which unfortunately does not provide confidence limits. To obtain the relevant confidence limits, first remove variables that are redundant in the dataset, so that the variables in the dataset correspond to those of your final model. Then create a new dataset for prediction with y as missing values. Make sure the variables in both datasets are in the same order. Append the new dataset to the existing one and re-run a regression using the combined dataset, specifying the relevant keyword in the output statement of proc reg (see code below).

(10) [3 marks] Explain what factors appear to be influential in determining the median value of owner-occupied homes, based on the available data and your model. In order to explain this, you should interpret the coefficients of your final model and also compare predictions from the model with relevant intervals.
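For the scatterplot questions (1), (4) and (6), one possible approach is proc sgplot with its reg statement. The following is a minimal sketch rather than the method the assignment prescribes: it assumes the mydata libref from the LIBNAME statement has already been assigned, and the legend labels are illustrative choices.

```sas
/* Sketch only: assumes the mydata libref from the LIBNAME statement is assigned. */

/* Question (1) style: scatterplot with a fitted straight line. */
proc sgplot data=mydata.housing;
  reg x=nox y=medv;
run;

/* Question (4) style: linear, quadratic and cubic fits on one plot.    */
/* nomarkers on the 2nd and 3rd reg statements avoids redrawing points. */
proc sgplot data=mydata.housing;
  reg x=lstat y=medv / degree=1 legendlabel="Linear";
  reg x=lstat y=medv / degree=2 nomarkers legendlabel="Quadratic";
  reg x=lstat y=medv / degree=3 nomarkers legendlabel="Cubic";
run;

/* Question (6) style: a separate fitted line for each level of chas. */
proc sgplot data=mydata.housing;
  reg x=nox y=medv / group=chas;
run;
```

The degree= option sets the polynomial degree of the fitted curve, and group= fits one line per level of the grouping variable, so clearly different slopes between the two chas groups would hint at an interaction.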
SAS Coding Hints

Deleting Observations
The following code deletes the 7th observation in the dataset data.

    data data;
      set data;
      if _n_ = 7 then delete;
    run;

Deleting Variables
The following code tells SAS to drop three variables – bla1, bla2 and bla3 – when reading from the dataset data in order to create the data2 dataset.

    data data2;
      set data (drop = bla1 bla2 bla3);
    run;

Creating Dummy Variables
The following code creates a new variable called dummy in the dataset data. This new variable will be equal to 1 when the variable x is equal to bla, and 0 otherwise.

    data data;
      set data;
      dummy = 0;
      if x = 'bla' then dummy = 1;
    run;

Representing Numeric Missing Values
Missing values for numeric variables are represented by a single decimal point. The following code creates a new dataset data with a single observation and two variables: x, which takes a value of 8, and y, which is missing.

    data data;
      input x y;
      cards;
    8 .
    ;
    run;

Appending Datasets
The following code appends the dataset new_data to the existing_data dataset.

    proc append base = existing_data data = new_data;
    run;

Confidence Intervals for the Parameter Estimates
Recall from Lab Week 2 that we can print out the confidence intervals for the parameter estimates using the clb option in the model statement. The default setting for alpha is 0.05, but this can be changed with the alpha option. For example, the following code will produce 100×(1-0.1) = 90% confidence intervals for the parameter estimates of the model fitted to the dataset data.

    proc reg data=data;
      model y = x / alpha = 0.1 clb;
    run;

Confidence and Prediction Intervals for New Observations
The following code fits a simple linear regression to the dataset data and creates an output dataset predData. In addition to the variables in data, predData contains the following variables (you could have named these variables anything you like):
• ypredicted: predicted values of the dependent variable y, including the ones with missing y.
• ylowerin: lower bound of a 100(1-0.05)% confidence interval for an individual prediction.
• yupperin: upper bound of a 100(1-0.05)% confidence interval for an individual prediction.
• ylowermean: lower bound of a 100(1-0.05)% confidence interval for the expected value of the dependent variable.
• yuppermean: upper bound of a 100(1-0.05)% confidence interval for the expected value of the dependent variable.

    proc reg data=data;
      model y = x / alpha = 0.05;
      output out = predData
             p = ypredicted
             LCL = ylowerin LCLM = ylowermean
             UCL = yupperin UCLM = yuppermean;
    run;

Saving/Copying Output
The easiest way is to take screenshots or use the Snipping Tool (Windows) to copy graphs.
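Putting the hints on missing values, appending, and the output statement together, the prediction workflow described before question (10) might look like the sketch below. It is an illustration under stated assumptions, not the required solution: the dataset names fit_data, new_obs and combined are hypothetical, the model medv = nox lstat is only a placeholder for whatever final model you choose, and the new-observation values 0.5 and 10 are made up.

```sas
/* Sketch only: dataset names and the two-predictor model are hypothetical. */

/* Keep only the response and the predictors of the placeholder model.      */
/* keep= preserves the original variable order: nox, lstat, medv.           */
data fit_data;
  set mydata.housing (keep = nox lstat medv);
run;

/* New observation to predict: medv is missing (.), predictor values are made up. */
data new_obs;
  input nox lstat medv;
  cards;
0.5 10 .
;
run;

/* Work on a copy so the fitting dataset itself is left unchanged. */
data combined;
  set fit_data;
run;

proc append base = combined data = new_obs;
run;

/* Rows with missing medv are not used in the fit but still get predictions. */
proc reg data = combined;
  model medv = nox lstat / alpha = 0.05;
  output out = predData
         p = ypredicted
         LCL = ylowerin   UCL = yupperin
         LCLM = ylowermean UCLM = yuppermean;
run;
quit;

/* Show only the appended observation with its prediction and intervals. */
proc print data = predData;
  where medv = .;
run;
```

The interval from LCL/UCL (individual prediction) will be wider than the one from LCLM/UCLM (expected value), which is the comparison question (10) asks you to discuss.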