This is a statistics homework assignment that has to be solved in Stata. I attached the 2 documents and a picture that has the instructions. Thank you.
/*=============================================================================
** NAME:_______________
**
** HOMEWORK:2
**
** DUE: August 18, 2021
**
** PRE-REQ: Download insurance.csv from Canvas.
=============================================================================*/

** Housekeeping
clear all
set more off

/* The code above clears the Stata session so that when you run your .do file
your memory is cleared. You do not need to keep this here, but I personally
start all of my .do files with a heading and these lines to clear the session */

** Import data here:

****** REPLACE WITH YOUR CODE ******

/******************************************************************************
The insurance.csv file includes 1,338 examples of beneficiaries currently
enrolled in the insurance plan, with features indicating characteristics of
the patient as well as the total medical expenses charged to the plan for the
calendar year. The features are:

• age: This is an integer indicating the age of the primary beneficiary
  (excluding those above 64 years, since they are generally covered by the
  government).
• sex: This is the policy holder's gender, either male or female.
• bmi: This is the body mass index (BMI), which provides a sense of how over-
  or underweight a person is relative to their height. BMI is equal to weight
  (in kilograms) divided by height (in meters) squared. An ideal BMI is within
  the range of 18.5 to 24.9.
• children: This is an integer indicating the number of children / dependents
  covered by the insurance plan.
• smoker: This is yes or no depending on whether the insured regularly smokes
  tobacco.
• region: This is the beneficiary's place of residence in the U.S., divided
  into four geographic regions: northeast, southeast, southwest, or northwest.

It is important to give some thought to how these variables may be related to
billed medical expenses. For instance, we might expect that older people and
smokers are at higher risk of large medical expenses.
Thinking about this before we run our regression may help us better interpret
our model.
******************************************************************************/

/******************************************************************************
SECTION 1: EXPLORING AND PRESENTING DATA
******************************************************************************/

/* Explore the variable `charges' and describe the distribution (mean, median,
skewness). Sometimes it is helpful to visualize the distribution, too! Consider
what graphical displays may be useful here. One of my personal favorite
displays is the kernel density plot, but feel free to do whatever! */

****** REPLACE WITH YOUR CODE ******

/* Take a look at the type of your data. Remember that our regression models
require that every feature is numeric, yet we have three string (categorical)
variables in our data. We will come back to fix that later.

Before fitting a regression model to data, it can be useful to determine how
the independent variables are related to the dependent variable and each
other. A correlation matrix provides a quick overview of these relationships.
Given a set of variables, it provides a correlation for each pairwise
relationship.

Create a correlation matrix for the four numeric variables in the insurance
dataset using the correlate (corr) command. Comment on any discoveries. */

****** REPLACE WITH YOUR CODE ******

/* Let's create a scatterplot matrix for the four numeric features: age, bmi,
children, and charges. The "graph matrix" command may be useful here. */

****** REPLACE WITH YOUR CODE ******

/* As with the correlation matrix, the intersection of each row and column
holds the scatterplot of the variables indicated by the row and column pair.
The diagrams above and below the diagonal are transpositions since the x-axis
and y-axis have been swapped. Do you notice any patterns in these plots?
Although some look like random clouds of points, a few seem to display some
trends. */

****** REPLACE WITH YOUR ANSWER ******

/******************************************************************************
SECTION 3: RUNNING A REGRESSION
******************************************************************************/

/* To fit a linear regression model to data in Stata, the syntax is simply:

    regress y x

where y is your dependent variable and x can be one or more independent
variables. Try running a regression using your numeric variables. What do you
notice about your coefficients and R^2 as you add more features to your
regression model? */

****** REPLACE WITH YOUR ANSWER ******

/* Interpret R^2 for the model with 4 regressors. How would you explain what
that means to someone who has never learned about regression? */

****** REPLACE WITH YOUR ANSWER ******

/* OK, now let's handle those non-numeric variables so we can include them in
our regression. Convert our non-numeric variables into dummy variables.
Hint: For region this won't make sense. Think about how we handled this in the
Stata Tutorial (i.e. factor variable notation). */

****** REPLACE WITH YOUR CODE ******

/* For your regression model using all features as inputs, what is the effect
of each variable (positive or negative) and its significance level? */

****** REPLACE WITH YOUR ANSWER ******

/******************************************************************************
SECTION 4: IMPROVING MODEL PERFORMANCE
******************************************************************************/

/* In linear regression, the relationship between an independent variable and
the dependent variable is assumed to be linear, yet this may not necessarily
be true.
For example, the effect of age on medical expenditures may not be constant
throughout all age values; the treatment may become disproportionately
expensive for the oldest populations. Let's add a non-linear age term to the
model by creating a variable called age2, which is just the squared values of
age. Later, when we run our improved regression, we will incorporate this
term. */

****** REPLACE WITH YOUR CODE ******

/* Suppose we have a hunch that the effect of a feature is not cumulative, but
rather it has an effect only once a specific threshold has been reached. For
instance, BMI may have zero impact on medical expenditures for individuals in
the normal weight range, but it may be strongly related to higher costs for
the obese (that is, BMI of 30 or above). We can model this relationship by
creating a binary indicator variable `bmi30' that is 1 if the BMI is at least
30 and 0 otherwise. The estimated beta for this binary feature would then
indicate the average net impact on medical expenses for individuals with BMI
of 30 or above, relative to those with BMI less than 30.

Create this binary variable for BMI 30 or above and use it in the multivariate
regression you ran in the previous step. */

****** REPLACE WITH YOUR CODE ******

/* So far, we have only considered each feature's individual contribution to
the outcome. What if certain features have a combined impact on the dependent
variable? For instance, smoking and obesity may have harmful effects
separately, but it is reasonable to assume that their combined effect may be
worse than the sum of each one alone. When two features have a combined
effect, this is known as an interaction. If we suspect that two variables
interact, we can test this hypothesis by adding their interaction to the
model. Interaction effects can be specified using the proper Stata syntax. To
interact the obesity indicator (bmi30) with the smoking indicator (smoker), we
would add bmi30##smoker as a feature to our regression.
Run a regression on charges using bmi30, smoker, and the interaction term. */

****** REPLACE WITH YOUR CODE ******

/******************************************************************************
SECTION 5: PUTTING IT ALL TOGETHER -- AN IMPROVED REGRESSION MODEL
******************************************************************************/

/* Based on a bit of subject matter knowledge of how medical costs may be
related to patient characteristics, we developed what we think is a more
accurately specified regression formula. To summarize the improvements, we:

• Added a non-linear term for age
• Created an indicator for obesity
• Specified an interaction between obesity and smoking

Create a regression model that includes all of these improvements. We'll train
the model using the regress command as before, but this time we'll also use
the newly constructed variables and the interaction term. Compare your results
with the multivariate regression run before our model improvement. What
conclusions can you draw from your regression results? */

age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
31,female,25.74,0,no,southeast,3756.6216
46,female,33.44,1,no,southeast,8240.5896
37,female,27.74,3,no,northwest,7281.5056
37,male,29.83,2,no,northeast,6406.4107
60,female,25.84,0,no,northwest,28923.13692
25,male,26.22,0,no,northeast,2721.3208
62,female,26.29,0,yes,southeast,27808.7251
23,male,34.4,0,no,southwest,1826.843
56,female,39.82,0,no,southeast,11090.7178
27,male,42.13,0,yes,southeast,39611.7577
19,male,24.6,1,no,southwest,1837.237
52,female,30.78,1,no,northeast,10797.3362
23,male,23.845,0,no,northeast,2395.17155
56,male,40.3