BIST 512 : Categorical Data Analysis
I need help with my practice exam. Thank you
Name :

BIST 512 : Categorical Data Analysis
Final Exam
May 9th, 2pm to 4:30pm

There are seven problems. Read all problems carefully before starting. Put your name on every sheet of paper before you turn in your test. No credit will be given for answers without showing work. Partial credit will be given for answers when work is shown. You have 2 hours and 30 minutes to complete the exam.

Critical values for various $\chi^2$ distributions:
$\chi^2_1(0.05) = 3.841$
$\chi^2_2(0.05) = 5.991$
$\chi^2_3(0.05) = 7.815$
$\chi^2_4(0.05) = 9.488$
$\chi^2_5(0.05) = 11.070$
$\chi^2_6(0.05) = 12.592$
$\chi^2_7(0.05) = 14.067$
$\chi^2_8(0.05) = 15.507$
$\chi^2_9(0.05) = 16.919$
$\chi^2_{10}(0.05) = 18.307$

Critical values for the N(0, 1) distribution:
$z_{0.005} = 2.57$
$z_{0.025} = 1.96$
$z_{0.05} = 1.64$
$z_{0.10} = 1.28$

1. (20 pts) In the survey, gender was cross-classified with party identification. The table below shows some results (Reschi denotes the Pearson residual and StReschi denotes the standardized residual).
(a) Use $X^2$ and $G^2$ to test the hypothesis of independence between party identification and gender.
(b) Use residuals to describe the evidence of association.
(c) Summarize the association by constructing a 95% confidence interval for the odds ratio between gender and party identification (Democrat or Republican).
(d) Construct a log-linear model for testing independence between party identification and gender. Which parameter(s) need(s) to be tested for this purpose? Is this test equivalent to the test in (a)?

2. (15 pts) The table below describes survival for 539 males diagnosed with lung cancer. The prognostic factors are histology (H) and stage of disease (S). The time scale (T) was divided into two-month intervals, and the rate was allowed to vary by time interval. Let $\mu_{ijk}$ denote the expected number of deaths and $t_{ijk}$ the total time at risk for histology i and stage of disease j in follow-up time interval k.
(a) (bonus 5 pts) Is the assumption of a constant rate over time used?
(b) The main effects model $\log(\mu_{ijk}/t_{ijk}) = \alpha + \beta^H_i + \beta^S_j + \beta^T_k$ has deviance 43.9. Explain why df = 52. Does the model seem to fit adequately?
(c) For this model, interpret the estimated effects of S: $\hat\beta^S_2 - \hat\beta^S_1 = 0.470$ (SE = 0.174) and $\hat\beta^S_3 - \hat\beta^S_1 = 1.324$ (SE = 0.152). Note that $\hat\beta^S_1 = 0$.
(d) The model that adds an S×H interaction term has deviance 41.5 with df = 48. Test whether allowing this interaction significantly improves the fit.

3. (20 pts) A study of mental health for a random sample of adult residents of Florida has the primary goal of assessing the association between mental impairment and two explanatory variables. Mental impairment is an ordinal response (Y), with categories well (1), mild symptom formation (2), moderate symptom formation (3), and impaired (4). The life events index $x_1$ is a composite measure of the number and severity of important life events, such as birth of a child, new job, divorce, or death in the family, that occurred to the subject within the past 3 years. Socioeconomic status ($x_2$ = SES) is measured here as binary (1 = high, 0 = low). The main effects model is $\text{logit}[P(Y \le j \mid x)] = \alpha_j + \beta_1 x_1 + \beta_2 x_2$, and the results are shown in the output at the end of this problem.
(a) What assumption does the model above make? Can you test the assumption? If yes, state the null hypothesis and test it.
(b) Compute $\hat P(Y = 1)$ and $\hat P(Y = 2)$ when $x_1 = 2$ and $x_2 = 1$.
(c) Interpret $\hat\beta_1$ and $\hat\beta_2$ in terms of the log odds ratio.
(d) Show that $\text{logit}[P(Y \le j \mid X_1 = i + 1, X_2 = x_2)] - \text{logit}[P(Y \le j \mid X_1 = i, X_2 = x_2)] = \beta_1$, i.e., the uniform association model holds.
(e) (bonus 5 pts) Explain why the cumulative logit model of proportional odds form is not a special case of a baseline-category logit model.

3. Output for Fitting Cumulative Logit Model

4.
(15 pts) The study based on the MBTI national sample uses the four scales of the Myers-Briggs personality test: Extroversion/Introversion (E/I), Sensing/iNtuitive (S/N), Thinking/Feeling (T/F), and Judging/Perceiving (J/P). The 16 cells in the table below correspond to the personality types. The log-linear model (EI*SN, SN*TF, SN*JP, TF*JP) was fitted, and the fit is reported in the table.
(a) Calculate the df of this model and test the goodness of fit.
(b) Compare this to the fit of the homogeneous association model that contains all the pairwise associations, which has deviance 10.16 with df = 5. What do you conclude?
(c) The reported maximized log-likelihood values are 3475.19 for the mutual independence model, 3538.05 for the homogeneous association model, and 3539.58 for the model containing all the three-factor interaction terms. Write the log-linear model for each case, and give the number of model parameters and the residual degrees of freedom.
(d) (bonus 5 pts) Compute the AICs. Which model in (c) seems best? Why?

5. (10 pts) Answer the following questions.
(a) Consider the log-linear model (WX, XY, YZ). Explain why W and Z are independent given X alone or given Y alone. When are W and Y conditionally independent?
(b) For a four-way table, is the WX conditional association the same as the WX marginal association for the log-linear model (i) (WX, XYZ)? (ii) (WX, WZ, XY, YZ)? Why?

6. (20 pts) In a 2011 article in the North Carolina Law Review, M. Radelet and G. Pierce reported a logistic prediction equation for death penalty verdicts in North Carolina. Let Y denote whether a subject convicted of murder received the death penalty (1 = yes), for defendant's race h (h = 1, black; h = 2, white), victim's race i (i = 1, black; i = 2, white), and number of additional factors j (j = 0, 1, 2). For the model $\text{logit}[P(Y = 1)] = \alpha + \beta^D_h + \beta^V_i + \beta^F_j$, they reported $\hat\alpha = -5.09$, $\hat\beta^D_1 = 0.00$, $\hat\beta^D_2 = 0.17$, $\hat\beta^V_1 = 0.00$, $\hat\beta^V_2 = 0.81$, $\hat\beta^F_0 = 0.00$, $\hat\beta^F_1 = 2.02$, $\hat\beta^F_2 = 3.46$.
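As a minimal sketch, the reported prediction equation can be evaluated for any combination of defendant's race, victim's race, and number of additional factors. The function name `death_penalty_probability` and the dictionary layout are illustrative choices, not part of the article or the exam; the coefficients are the ones quoted in the problem statement.

```python
from math import exp

# Estimates quoted in the problem statement (baseline categories are 0.00).
ALPHA = -5.09
BETA_D = {1: 0.00, 2: 0.17}           # defendant's race h
BETA_V = {1: 0.00, 2: 0.81}           # victim's race i
BETA_F = {0: 0.00, 1: 2.02, 2: 3.46}  # number of additional factors j


def death_penalty_probability(h, i, j):
    """Inverse logit of alpha + beta_D[h] + beta_V[i] + beta_F[j]."""
    logit = ALPHA + BETA_D[h] + BETA_V[i] + BETA_F[j]
    return 1.0 / (1.0 + exp(-logit))


# Estimated probability for every (h, i, j) combination.
probabilities = {
    (h, i, j): death_penalty_probability(h, i, j)
    for h in BETA_D
    for i in BETA_V
    for j in BETA_F
}

# Group with the largest estimated probability.
most_likely_group = max(probabilities, key=probabilities.get)
print(most_likely_group, round(probabilities[most_likely_group], 3))
```

Because every non-baseline estimate is positive, the largest logit occurs at h = 2, i = 2, j = 2: -5.09 + 0.17 + 0.81 + 3.46 = -0.65, which corresponds to an estimated probability of about 0.34.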
(a) Estimate the probability of receiving the death penalty for the group most likely to receive it.
(b) Give the symbol for the log-linear model that is equivalent to this logistic model.
(c) Which logistic model corresponds to the log-linear model (YD, YV, DVF)? Why?
(d) State the equivalent log-linear and logit models for which (i) Y is jointly independent of D, V, and F; (ii) there are main effects of F on Y, but Y is conditionally independent of D and V, given F; and (iii) there is interaction between D and V in their effects on Y, and F has main effects.

7. (10 pts) A case-control study has 8 pairs of subjects. The cases have colon cancer, and the controls are matched with the cases on gender and age. A possible explanatory variable is the extent of red meat in a subject's diet, measured as "1 = high" or "0 = low". The (case, control) observations on this were (1, 1) for 3 pairs, (0, 0) for 1 pair, (1, 0) for 3 pairs, and (0, 1) for 1 pair.
(a) Cross-classify the 8 pairs in terms of diet (1 or 0) for the case against diet (1 or 0) for the control. Call this Table A. Display the 2 × 2 × 8 table with 8 partial tables relating diet (1 or 0) to response (case or control) for the 8 pairs. Call this Table B.
(b) Calculate the McNemar $z^2$ for Table A and the CMH statistic for Table B.
(c) (bonus 5 pts) This sample size is small for large-sample tests. Use the binomial distribution with Table A to find the exact two-sided P-value (just provide the formula, not the P-value).
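The quantities in problem 7 can be checked numerically. The sketch below uses only the pair counts given in the problem to compute the McNemar statistic from Table A, the CMH statistic (without continuity correction) from the eight partial tables of Table B, and an exact two-sided binomial P-value. The names `pairs`, `cmh_statistic`, and `exact_two_sided_p` are illustrative, and doubling the upper tail is an assumed convention for the two-sided exact test.

```python
from math import comb

# (case diet, control diet) -> number of matched pairs, from the problem.
pairs = {(1, 1): 3, (0, 0): 1, (1, 0): 3, (0, 1): 1}

n10 = pairs[(1, 0)]  # case high, control low
n01 = pairs[(0, 1)]  # case low, control high

# McNemar statistic: depends only on the discordant pairs of Table A.
mcnemar_z2 = (n10 - n01) ** 2 / (n10 + n01)


def cmh_statistic(pairs):
    """CMH statistic over Table B, one 2x2 partial table per matched pair."""
    total_diff = 0.0
    total_var = 0.0
    for (case_diet, control_diet), count in pairs.items():
        # Partial table: rows = diet (high, low), columns = (case, control).
        n11 = case_diet                    # cell (high diet, case)
        row_high = case_diet + control_diet
        row_low = 2 - row_high
        n = 2                              # one case and one control
        expected = row_high * 1 / n
        variance = row_high * row_low * 1 * 1 / (n ** 2 * (n - 1))
        total_diff += count * (n11 - expected)
        total_var += count * variance
    return total_diff ** 2 / total_var


def exact_two_sided_p(n10, n01):
    """Exact binomial P-value: discordant count ~ Bin(n10 + n01, 1/2)."""
    n = n10 + n01
    k = max(n10, n01)
    upper_tail = sum(comb(n, x) for x in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


cmh = cmh_statistic(pairs)
p_exact = exact_two_sided_p(n10, n01)
print(mcnemar_z2, cmh, p_exact)
```

For these data both statistics equal 1, since concordant pairs contribute nothing to the CMH sums and the statistic reduces to McNemar's. The exact P-value is 2 × [C(4,3) + C(4,4)] / 2^4 = 0.625.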