Problem 1. The file SpeedTrap.RData is an R data set that contains a data frame called SpeedTrap. This data frame consists of 184 observations (rows) and 7 variables (columns). Each row corresponds to a town in the Chicago area. The variables are as follows:
In each community we want to compare the rate at which outsiders stopped for a traffic violation are ticketed with the rate at which stopped residents are ticketed. To do this we will use the odds ratio, which is defined as follows:

$$\text{OR} = \frac{\pi_{\text{out}} / (1 - \pi_{\text{out}})}{\pi_{\text{res}} / (1 - \pi_{\text{res}})}$$

where $\pi_{\text{out}}$ and $\pi_{\text{res}}$ are the probabilities of being ticketed for outsiders and residents, respectively. The odds ratio is used to compare probabilities between two populations. In statistical modeling it is often preferred to the straight difference $\pi_{\text{out}} - \pi_{\text{res}}$. An odds ratio of 1.0 implies that the two probabilities are equal. An odds ratio greater than 1.0 implies that $\pi_{\text{out}}$ is greater than $\pi_{\text{res}}$. The odds ratio is estimated from the counts of successes and failures in each community by replacing $\pi_{\text{out}}$ and $\pi_{\text{res}}$ with sample estimates.
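As a small illustration of the plug-in estimate described above, the sketch below computes an estimated odds ratio from counts. The data frame `stops` and its column names are hypothetical and are not the variables in SpeedTrap; the counts are made up for the example.

```r
# Hypothetical stop counts for two towns (illustrative names and values only).
stops <- data.frame(
  out_ticketed = c(60, 45),   # outsiders stopped and ticketed
  out_stopped  = c(100, 90),  # all outsiders stopped
  res_ticketed = c(50, 30),   # residents stopped and ticketed
  res_stopped  = c(100, 90)   # all residents stopped
)

# Sample proportions stand in for the unknown probabilities.
p_out <- with(stops, out_ticketed / out_stopped)
p_res <- with(stops, res_ticketed / res_stopped)

# Odds of a probability p, then the ratio of the two odds.
odds <- function(p) p / (1 - p)
odds_ratio <- odds(p_out) / odds(p_res)
print(odds_ratio)
```

With these counts the first town has ticketing proportions 0.6 and 0.5, so its odds are 1.5 and 1 and its estimated odds ratio is 1.5.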
(a) Begin by calculating the estimated odds ratio for each community. Append this variable to the data frame (call it OddsRatio). The first three values should match the following:
> SpeedTrap[1:3,"OddsRatio"]
[1] 1.146857 1.201661 1.264754
(b) Fit a regression model using OddsRatio as the outcome variable and Pop, PPSQMI, PPHU, and PCI as predictor variables. Using diagnostic plots, describe how well the regression conforms to the assumptions of the normal, linear regression model.
(c) Identify those communities for which the leverage exceeds three times the average leverage. Re-run the regression with these communities removed from the data set. Describe how their removal affects the fitted model.
(d) Re-run the regression in (b), replacing each of the predictors by its logarithm. How does this change affect the presence of observations with high leverage?
(e) Using log-transformed predictors, find a Box-Cox transformation of the outcome variable that maximizes the likelihood. Re-fit the model with the transformed outcome variable. Does it better conform to the assumptions of the normal, linear regression model than the model that you fit originally? In what respects are the diagnostics still troublesome?
(f) Produce and interpret a set of partial residual plots for the model that you fit in (e). Do the predictor variables appear to be treated appropriately in the model?
(g) Assuming that all necessary assumptions are met with the model that you fit in (e):
i. Test the null hypothesis that the coefficients on log(PPHU) and log(PPSQMI) are both zero.
ii. Give a 95% confidence interval for the coefficient on log(PCI).
iii. Give a 95% prediction interval for the estimated odds ratio in a community that has a population of 25,000; 4,000 persons per square mile; 2.8 persons per housing unit; and a per capita income of $26,000.
(h) Conduct an outlier analysis on residuals from the regression in (e). Use a family-wide Type I error probability of α = .01. Which communities should be considered for removal from the regression?
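One common way to carry out an outlier analysis with a family-wide error rate is to compare externally studentized residuals with a Bonferroni-adjusted t critical value. The sketch below shows the mechanics on R's built-in mtcars data, used here only as a stand-in; the model formula is an assumption for illustration and is not the regression from (e).

```r
# Stand-in regression on a built-in data set (not the SpeedTrap model).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)        # number of observations
p <- length(coef(fit))   # number of coefficients, including the intercept
alpha <- 0.01            # family-wide Type I error probability

# Externally studentized residuals: each follows a t distribution with
# n - p - 1 degrees of freedom under the model assumptions.
t_ext <- rstudent(fit)

# Bonferroni adjustment: split alpha over n two-sided tests.
crit <- qt(1 - alpha / (2 * n), df = n - p - 1)

# Observations whose studentized residual exceeds the adjusted cutoff.
flagged <- names(t_ext)[abs(t_ext) > crit]
print(crit)
print(flagged)
```

Because the adjustment divides α by the number of observations, the cutoff is considerably larger than the usual single-test t quantile, so only very extreme residuals are flagged.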
Problem 2. The file Ozone.RData contains a vector named ozone which has length n = 111. This vector was obtained from a regression of air quality measurements (ozone) taken on 111 consecutive days in New York City in 1973. Each entry of ozone is either –1 if the residual is negative or +1 if the residual is positive. We are interested in testing the null hypothesis that the residuals are not serially correlated versus the alternative hypothesis that the residuals are serially correlated. Using the Runs Test, report a p-value and state your conclusion at the .05 test level.
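For reference, the sketch below implements the large-sample (normal approximation) form of the Wald-Wolfowitz runs test for a vector of signs. The function name `runs_test` and the short vector `signs` are hypothetical and serve only to illustrate the calculation; the toy vector is far too short for the approximation to be reliable.

```r
# Runs test for a vector of -1/+1 signs, using the normal approximation.
runs_test <- function(s) {
  n_pos <- sum(s > 0)              # number of positive signs
  n_neg <- sum(s < 0)              # number of negative signs
  n <- n_pos + n_neg
  runs <- length(rle(s)$lengths)   # number of maximal runs of equal signs

  # Mean and variance of the number of runs under randomness.
  mu <- 2 * n_pos * n_neg / n + 1
  sigma2 <- 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n^2 * (n - 1))

  z <- (runs - mu) / sqrt(sigma2)
  list(runs = runs, expected = mu, z = z, p_value = 2 * pnorm(-abs(z)))
}

# Toy example: four runs of length two.
signs <- c(1, 1, -1, -1, 1, 1, -1, -1)
res <- runs_test(signs)
print(res)
```

Too few runs suggests positive serial correlation and too many suggests negative serial correlation, which is why the p-value here is two-sided.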
Problem 3. For this problem you will use the prostate data set that is available in the faraway package. The outcome variable is lcavol; all other variables are predictors. We want to determine whether a regression model behaves differently for younger (under age 65) subjects than for older (age 65 and over) subjects.
(a) To do this, add a new variable called Young to the data set as a factor that distinguishes younger from older men. Introduce it into the model in such a way that separate intercepts and slopes are applied to the two groups of men. Show a summary of your regression. Note: we will accept the validity of all regression assumptions in this exercise.
(b) Using the model in (a), conduct an F-test of the null hypothesis that the coefficients associated with Young are all equal to zero. Explain in practical terms what your results mean.
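The general pattern for a model with group-specific intercepts and slopes, and for the F-test that compares it with a common-coefficient model, can be sketched as follows. The example uses R's built-in mtcars data with the transmission indicator `am` as the grouping factor; this is a stand-in chosen so the code runs without extra packages, and it is not the prostate analysis.

```r
# Stand-in data: treat transmission type as the two-level grouping factor.
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Reduced model: one intercept and one set of slopes for everyone.
reduced <- lm(mpg ~ wt + hp, data = cars)

# Full model: the factor enters as a main effect (separate intercepts)
# and as an interaction with every predictor (separate slopes).
full <- lm(mpg ~ (wt + hp) * am, data = cars)

# F-test that all coefficients involving the factor are zero.
cmp <- anova(reduced, full)
print(summary(full))
print(cmp)
```

The numerator degrees of freedom in the comparison equal the number of coefficients involving the factor: one intercept shift plus one slope shift per predictor, which is three in this stand-in example.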