Variable selection with randomly generated “noise” (adapted from Freedman, 1983):
(a) Sampling from the standard normal distribution, independently generate 500 observations for 101 variables. Call the first of these variables the response variable Y and the other variables the predictors X1, X2, ..., X100. Perform a linear least-squares regression of Y on X1, X2, ..., X100. Are any of the individual regression coefficients “statistically significant”? Is the omnibus F-statistic for the regression “statistically significant”? Is this what you expected to observe? (Hint: What are the “true” values of the regression coefficients β1, β2, ..., β100?)
(b) Retain the three predictors in part (a) that have the largest absolute t-values, regressing Y only on these variables. Are the individual coefficients “statistically significant”? What about the omnibus F? What happens to the p-values compared to part (a)?
(c) Using any method of variable selection (stepwise regression or subset regression with any criterion), find the “best” model with three explanatory variables. Obtain the individual t-statistics and omnibus F for this model. How do these tests compare to those in part (a)?
(d) Using the methods of model selection discussed in this chapter, find the “best” model for these data. How does that model compare to the true model that generated the data?
(e) Validation: Generate a new set of 500 observations as in part (a), and use that new data set to validate the models that you selected in parts (b), (c), and (d). What do you conclude?
(f) Repeat the entire experiment several times.
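One possible way to set up parts (a), (b), and (e) is sketched below in Python with NumPy. This is a starting point under stated assumptions, not a prescribed solution: the helper name ols_t_and_f, the seed, and the rough cutoff |t| > 2 (a normal approximation to the 5% two-sided critical value) are all choices made here for illustration. Part (e) is read in its simplest form, as refitting the three selected predictors on fresh data; other readings of "validation", such as out-of-sample prediction error, are equally reasonable.

```python
import numpy as np

def ols_t_and_f(X, y):
    """Least-squares fit of y on X plus an intercept (hypothetical helper).

    Returns the t-statistics of the slope coefficients and the omnibus F.
    """
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])            # prepend intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    rss = resid @ resid                              # residual sum of squares
    df_resid = n - k - 1
    s2 = rss / df_resid                              # error-variance estimate
    cov = s2 * np.linalg.inv(Xd.T @ Xd)              # coefficient covariance
    t = beta[1:] / np.sqrt(np.diag(cov)[1:])         # skip the intercept
    tss = np.sum((y - y.mean()) ** 2)                # total sum of squares
    F = ((tss - rss) / k) / (rss / df_resid)
    return t, F

rng = np.random.default_rng(1983)                    # arbitrary seed (assumption)

# Part (a): 500 independent N(0, 1) observations on 101 variables.
data = rng.standard_normal((500, 101))
y, X = data[:, 0], data[:, 1:]
t_all, F_all = ols_t_and_f(X, y)
n_flagged = int(np.sum(np.abs(t_all) > 2))           # rough 5% cutoff (assumption)

# Part (b): keep the three predictors with the largest |t| and refit.
top3 = np.argsort(np.abs(t_all))[-3:]
t_top3, F_top3 = ols_t_and_f(X[:, top3], y)

# Part (e): refit the same three predictors on an independent data set.
new = rng.standard_normal((500, 101))
t_new, F_new = ols_t_and_f(new[:, 1:][:, top3], new[:, 0])

print("(a) coefficients with |t| > 2:", n_flagged, " omnibus F:", round(F_all, 2))
print("(b) t:", np.round(t_top3, 2), " omnibus F:", round(F_top3, 2))
print("(e) t:", np.round(t_new, 2), " omnibus F:", round(F_new, 2))
```

For part (f), wrapping the three steps in a loop over different seeds shows how the results vary from one replication to the next. Exact p-values require the t and F distributions (for example from a statistics library), which this sketch deliberately leaves out.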