On file
STAT54385 - Homework 8 Due: May 4th before 11:59pm • 100 points • Submit your solution in OneNote • You MUST attach all your R codes used to produce outputs. The instructor will test your R code. • Using the R Markdown for this homework is highly encouraged. Answer the following questions: 1. Consider the logistic model for the south African heart disease data. 1. Based on the full model, we want to use the LRT to see whether or not two coefficients for tobacco and alcohol are zero. What is the test statistic, its degree of freedom and p-value? 2. At α = 0.05, what is your conclusion? 2. (Continued) The south African heart disease data. 1. Let p = Pr(chd = 1|X). We consider two models: M1 : logit(p) = β0 + β1ldl + +β2famhist+ β3age+ β4tobacco; M2 : logit(p) = β0 + β1ldl + +β2famhist+ β3age+ β4alcohol. Are they nested models? 2. We want to choose one of these two models. What procedure can we use for the purpose? 3. Perfrom the procedure you described in the last question, and select a model. 1 STAT54385 - Homework 8 Due: May 4th before 11:59pm Answer the following questions: Logistic Regression In this session, we will see how to implement logistic regression using R. South African Heart Disease Descriptions of variables in this data set can be found here. install: knitr and rmarkdown Load a data set (csv format) The data set follows the comma separated value file format. We loaded the data, and drop a few variables that you will not use. saheart = read.table("SAheart.data",sep=",",head=T,row.names=1) names(saheart) # variable names in the advertising data ## [1] "sbp" "tobacco" "ldl" "adiposity" "famhist" "typea" ## [7] "obesity" "alcohol" "age" "chd" saheart$chd = as.factor(saheart$chd) drops = c("typea","adiposity") saheart = saheart[,!(names(saheart) %in% drops)] attach(saheart) We use ggpairs to do some exploratory data analysis (EDA)! plotsaheart = ggpairs(saheart, mapping = aes(color = chd, alpha = 0.4), axisLabels="none", lower = list( continuous = "smooth", combo = "facetdensity")) + theme_bw() + theme(axis.line = element_line(colour = "black"), #panel.grid.major = element_blank(), panel.grid.minor = element_blank(), #panel.border = element_blank() panel.background = element_blank() ) Save a plot to a pdf file. plotsaheart http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.info.txt Alternatively, you can save the plot directly to a pdf file. This is very usefully technique when generating plots takes quite long or when the machine does not have the GUI. pdf("plotsaheart.pdf", width=12, height=9) print(plotsaheart) dev.off() Fit Logistic regression glm() is the one that people use most frequently to fit the logistic regression with binary outcomes. One needs to specify family = "binomial" in glm() as the observation follows the binomial distribution. Once we fit the model. mylogit = glm(chd~.,data = saheart, family = "binomial") summary(mylogit) ## ## Call: ## glm(formula = chd ~ ., family = "binomial", data = saheart) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.7517 -0.8378 -0.4552 0.9292 2.4434 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -4.1295997 0.9641558 -4.283 1.84e-05 *** ## sbp 0.0057607 0.0056326 1.023 0.30643 ## tobacco 0.0795256 0.0262150 3.034 0.00242 ** ## ldl 0.1847793 0.0574115 3.219 0.00129 ** ## famhistPresent 0.9391855 0.2248691 4.177 2.96e-05 *** ## obesity -0.0345434 0.0291053 -1.187 0.23529 ## alcohol 0.0006065 0.0044550 0.136 0.89171 ## age 0.0425412 0.0101749 4.181 2.90e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 596.11 on 461 degrees of freedom ## Residual deviance: 483.17 on 454 degrees of freedom ## AIC: 499.17 ## ## Number of Fisher Scoring iterations: 4 likelihood ratio test (LRT) One can perform the significant test for the coefficients using the LRT. mylogit1 = update(mylogit,~.-sbp-obesity-alcohol) anova(mylogit1,mylogit,test="Chisq") ## Analysis of Deviance Table ## ## Model 1: chd ~ tobacco + ldl + famhist + age ## Model 2: chd ~ sbp + tobacco + ldl + famhist + obesity + alcohol + age ## Resid. Df Resid. Dev Df Deviance Pr(>Chi) ## 1 457 485.44 ## 2 454 483.17 3 2.2698 0.5183 Model selection The AIC has ? = 2, and the BIC has ? = log(?). We apply the forward stepwise selection using BIC and the backward stepwise selection using AIC. Final selected models do not agree, but their AICs and BICs are quite close. N = log(length(chd)) # number of observations fmod = glm(chd~.,data = saheart, family = "binomial") # full model nmod = glm(chd~1,data = saheart, family = "binomial") # null model # forward stepwise (BIC) fstep = step(nmod, scope=list(lower=nmod, upper=fmod), direction="forward", k=log(N), trace=FALSE) fstep$anova # enumerated varialbes that are added and droped ## Step Df Deviance Resid. Df Resid. Dev AIC ## 1 NA NA 461 596.1084 597.9225 ## 2 + age -1 70.546083 460 525.5623 529.1905 ## 3 + famhist -1 18.904183 459 506.6582 512.1005 ## 4 + tobacco -1 11.272755 458 495.3854 502.6418 ## 5 + ldl -1 9.941538 457 485.4439 494.5144 BIC(fstep) ## [1] 516.1217 AIC(fstep) ## [1] 495.4439 # backard stepwise (AIC) bstep = step(fmod, nmod, direction="backward", k=2, trace=FALSE) bstep$anova ## Step Df Deviance Resid. Df Resid. Dev AIC ## 1 NA NA 454 483.1740 499.1740 ## 2 - alcohol 1 0.01850382 455 483.1925 497.1925 ## 3 - sbp 1 1.10421166 456 484.2967 496.2967 ## 4 - obesity 1 1.14711316 457 485.4439 495.4439 BIC(bstep) ## [1] 516.1217 AIC(bstep) ## [1] 495.4439 We can also consider two-way interaction terms with the full model, and perform the variable selection. We improve the model fit slightly by including interaction terms. # backward stepwise + 2 way interactions f2istep = step(fmod, ~.^2, k=log(N), trace=FALSE) f2istep$anova ## Step Df Deviance Resid. Df Resid. Dev AIC ## 1 NA NA 454 483.1740 497.6868 ## 2 + ldl:famhist -1 9.324987 453 473.8490 490.1760 ## 3 + obesity:alcohol -1 5.612713 452 468.2363 486.3774 ## 4 - sbp 1 1.300862 453 469.5372 485.8641 BIC(f2istep) ## [1] 524.7573 AIC(f2istep) ## [1] 487.5372 Logistic Regression South African Heart Disease Load a data set (csv format) Save a plot to a pdf file. Fit Logistic regression likelihood ratio test (LRT) Model selection