---
title: "Assignment 3"
author: "Your name and ID here"
date: "01/02/2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# packages
library(AER)
library(tidyverse)
library(stargazer)
library(wooldridge)
library(zoo) # for as.yearqtr(), used below
```

## Instructions

Please read each question and answer where appropriate. The assignment is graded on a scale from 1 to 5. I grade effort as well as content. That means that to obtain a 5, every question must be attempted, and I am a kind grader if the effort was high but the result was not quite right.

After you answer the questions, `knit` the document to HTML and submit it on Moodle. I will **only grade** HTML. If you submit the `rmd` file instead, you will receive a zero. You have been warned, so there will be no exceptions. Groups of up to four are allowed, but every student must submit their own assignment.

**If an interpretation of output is asked for, but only output or code is given, the question will get zero points.**

## Question 1

The Project Student-Teacher Achievement Ratio (STAR) was a large randomized controlled experiment with the aim of assessing whether a class size reduction is effective in improving education outcomes. It was conducted in 80 Tennessee elementary schools over a period of four years during the 1980s by the State Department of Education. In the first year, about 6,400 students were randomly assigned to one of three interventions: small class (13 to 17 students per teacher), regular class (22 to 25 students per teacher), and regular-with-aide class (22 to 25 students with a full-time teacher's aide). Teachers were also randomly assigned to the classes they taught. The interventions were initiated as the students entered school in kindergarten and continued through to third grade. For this question, we will focus on the results in kindergarten. These variables are suffixed with the letter "k".
Below, I perform some minor data transformations, which include creating dummies for girl, access to free lunch (a proxy for the socioeconomic background of the student), and white ethnicity. The variable `experiencek` is the years of experience of the teacher assigned to the class. A variable for the child's age is also constructed.

```{r}
data('STAR')

# focus on results from kindergarten -- i.e., variables that end with "k"
star <- STAR %>%
  select(ends_with("k"), gender, ethnicity, birth) %>%
  mutate(score = (readk + mathk) / 2,
         z_score = (score - mean(score, na.rm = TRUE)) / 17.6, # standardize score
         girl = ifelse(gender == "female", 1, 0),
         age = as.yearqtr('1985 Q3') - birth,
         free.lunch = ifelse(lunchk == "free", 1, 0),
         white = ifelse(ethnicity == "cauc", 1, 0),
         small = ifelse(stark == "small", 1, 0),
         regular.aid = ifelse(stark == "regular+aide", 1, 0)) %>%
  filter(!is.na(score))
```

To assess the effect of class size on student performance, we can estimate the following regression:

$$ z\_score_i = \beta_0 + \beta_1 small_i + \beta_2 Aide_i + \eta_i $$

where $small_i$ is a dummy variable for small class size (the main treatment group) and $Aide_i$ is a dummy variable for a regular class plus a teaching aide. There are three categories (regular, small, regular + aide), so we can only include two dummies. The third category, the $regular_i$ class size, becomes the reference group. Thus, the coefficient $\beta_1$ reflects the average performance of students in small classes relative to the regular class size, and so on. The $\eta_i$ is the error term, which contains all the other factors that affect student performance other than class size.

In this regression, the dependent variable is the standardized test score. This is common in this literature. It means the coefficients can be interpreted in units of standard deviations.

### part (a)

Due to randomization, on average, the students in the regular, small, and regular + teaching aide groups should have similar characteristics.
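As a quick illustration of what such balance looks like, here is a minimal base-R sketch using simulated data (made-up numbers, not the STAR sample):

```r
set.seed(1)
n <- 6000
girl <- rbinom(n, 1, 0.5)  # a pre-treatment characteristic
# purely random assignment to one of the three class types
group <- sample(c("small", "regular", "regular+aide"), n, replace = TRUE)
# the share of girls should be roughly equal (near 0.5) across the three groups
tapply(girl, group, mean)
```

Because assignment is independent of `girl`, each group mean sits close to 0.5; a sizeable gap across groups would be evidence against successful randomization.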
However, in practice, randomization might not work out perfectly -- parents who find out their child was put into a regular class might try to switch to small classes, the principal might put better teachers in the small class group, or students could leave the school district. The first step in analyzing this program is to assess whether randomization was actually done correctly.

To assess this, use a regression to predict treatment. For example, run a regression of `small` on `girl`, `free.lunch`, `white`, `age`, and `experiencek`:

$$ small_i = \gamma_0 + \gamma_1 girl_i + \gamma_2 white_i + \gamma_3 age_i + \gamma_4 experiencek_i + \gamma_5 free.lunch_i + \eta_i $$

and another regression using `regular.aid` as the dependent variable (so, two regressions: one with `small` as the dependent variable and another with `regular.aid` as the dependent variable). Put your two regressions in a regression table (either `stargazer` or `modelsummary`). Make sure your tables contain the `sandwich` standard errors.

### part (b)

For each of your regressions in part (a), test the hypothesis that your regression has no predictive power. That is, test that $\gamma_1, \cdots, \gamma_5$ are jointly zero using the command `linearHypothesis()`. Be sure to use the `vcov. = sandwich` option. Interpret your estimates. For the regression with `small` as the dependent variable, at what level of statistical significance would you just fail to reject the hypothesis?

### part (c)

Estimate the impact of class size on student performance using the following regression:

$$ z\_score_i = \beta_0 + \beta_1 small_i + \beta_2 Aide_i + \eta_i $$

along with an extended regression that controls for `girl`, `free.lunch`, `white`, `age`, and `experiencek`. What is the effect of class size on student performance? Put your regression results in a table. Interpret the results. Does the coefficient $\hat\beta_1$ change much between the regressions with and without controls -- why or why not? Is this expected?
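To build intuition for that last question, here is a hedged base-R sketch (simulated data with made-up parameters, not the STAR sample) of why, under true randomization, adding controls should barely move the treatment coefficient:

```r
set.seed(42)
n <- 10000
x <- rnorm(n)                # a student characteristic, e.g., family background
small <- rbinom(n, 1, 0.3)   # randomized treatment: independent of x by design
z <- 0.2 * small + 0.5 * x + rnorm(n)  # true treatment effect is 0.2 here
b_short <- coef(lm(z ~ small))["small"]      # no controls
b_long  <- coef(lm(z ~ small + x))["small"]  # with the control
c(b_short, b_long)  # both estimates land close to 0.2
```

Because `small` is uncorrelated with `x`, omitting `x` costs precision but does not bias $\hat\beta_1$. In observational data, where treatment and the controls are correlated, the two estimates would typically diverge.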
### part (d)

For your extended regression above, interpret the coefficient on `free.lunch`. Do you think providing students free lunches is a good idea?

## Question 2

In class we discussed the conditional independence assumption (CIA). When we do not have a randomized controlled trial to work with, we can try to use a control approach -- that is, running a regression with additional control variables in the hope that the CIA is satisfied.

Here we can take a look at a new data set: `MASchools`. This data set has test scores from students across districts, along with other information about the school districts. We have already looked at `CASchools` in previous work, and we have seen that student-teacher ratios are correlated with characteristics of the districts -- in particular, variables that correspond to wealth in each district. Here, we analyze the data set for Massachusetts schools.

Below, I perform some minor data cleaning to make the data set comparable with the Project STAR data. In particular, I standardize test scores and rename the variable for student-teacher ratios.

```{r}
data("MASchools") # data on Massachusetts schools

# rename data for convenience and perform a few manipulations
ma <- MASchools %>%
  mutate(str = stratio, # rename stratio to str (student-teacher ratio)
         score = score4,
         z_score = (score - mean(score)) / sd(score)) # standardize test score
```

### part (a)

Create a table with 2 regressions from the Massachusetts data. For each regression, we look at the following causal question: does the number of students per teacher affect student outcomes?

$$ z\_score_i = \beta_0 + \beta_1 str_i + \eta_i \hspace{2in} (1) $$

In this regression, the dependent variable is the standardized test score. This is common in this literature. It puts the test scores into units of standard deviations -- this makes them comparable across data sources.
Thus, the coefficient $\beta_1$ is interpreted as the association of one more student per teacher with test scores, measured in standard deviations. For example, if $\beta_1$ is -0.05, then one more student per teacher is associated with a reduction in test scores of 0.05 standard deviations.

In this regression, $\eta_i$ contains many things that affect test scores aside from class sizes. We worry that $Cov(\eta_i, str_i) \neq 0$, i.e., that student-teacher ratios are correlated with the error term. If this is due to differences in wealth across districts, we can try to control for these differences with a set of proxy variables. In particular, the data set contains information on the fraction of students eligible for the free lunch program, average district income, and the fraction of English learners. Consider the set of controls $[free.lunch_i, log(income)_i, english_i]$. We might make the CIA assumption:

$$
\begin{aligned}
E(\eta_i | str_i, free.lunch_i, log(income)_i, english_i) &= E(\eta_i | free.lunch_i, log(income)_i, english_i) \\
&= \gamma_0 + \gamma_1 free.lunch_i + \gamma_2 log(income)_i + \gamma_3 english_i
\end{aligned}
$$

which says that once the variables $[free.lunch_i, log(income)_i, english_i]$ are held fixed, $str_i$ is no longer correlated with the error term.

In the table you construct below, estimate the bivariate relationship given in equation (1), along with a specification that employs the CIA strategy outlined above. Make sure to construct the appropriate standard errors.

```{r}
# Estimate your regressions and construct a table here
```

### part (b)

Does the coefficient $\hat \beta_1$ change much between the specifications -- why or why not? Is this expected?

### part (c)

In the first question analyzing the Project STAR data, the coefficient on the small class dummy variable estimates the difference in outcomes for students in the small (13-17 students) class relative to the regular (22-25 students) class size.
It turns out that the average reduction in the number of students between regular and small classes was about 7.5. In the Massachusetts data, predict the change in test scores for a 7.5-student reduction in $str_i$. Is the answer comparable with the Project STAR result? In your opinion, does this give you more confidence that your estimate of the effect of $str$ in the observational data is _causal_?

## Question 3

Since post-secondary education in the US is very expensive, there has been a movement to improve access. One policy from the Obama administration has been to increase the number of two-year colleges. These are less expensive and offer students the opportunity to transfer to a four-year college. One obvious question, though, is whether time