Economics 104: Project 1, Fall 2022, UCLA. Due Date: Oct 12, 2022 by 11:59 PM (PST)


https://www.kaggle.com/datasets/sudhanshu2198/analyzing-exam-scores?resource=download





library(AER)

data("TeachingRatings")




For this project, you will work with any dataset you like; however, it must contain at least 5 different predictors and one response variable, which you will aim to predict. Your task is to find a reasonable model by following the 11 steps outlined below. As an illustration of a good dataset (you cannot use this dataset), the file diamonds.csv contains the prices and other attributes of almost 54,000 diamonds. The data description and file can be accessed directly from Kaggle, and the goal is to predict diamond prices. There are many datasets publicly available on Kaggle, but you can also get data from AER, FRED, BLS, and so on.

1. Provide a descriptive analysis of your variables. This should include histograms and fitted distributions, correlation plot, boxplots, scatterplots, and statistical summaries (e.g., the five-number summary). All figures must include comments.
2. Estimate a multiple linear regression model that includes all the main effects only (i.e., no interactions or higher-order terms). We will use this model as a baseline. Comment on the statistical and economic significance of your estimates. Also, make sure to provide an interpretation of your estimates.
3. Identify whether there are any outliers, high-leverage, and/or influential observations worth removing. If so, remove them, justify your reason for doing so, and re-estimate your model.
4. Use Mallows' Cp to identify which terms you will keep in the model (based on part 3), and also use the Boruta algorithm for variable selection. Based on the two results, determine which subset of predictors you will keep.
5. Test for multicollinearity using VIF on the model from (4). Based on the test, remove any appropriate variables, and estimate a new regression model based on these findings.
6. For your model in part (5), plot the respective residuals vs. ŷ and comment on your results.
7. For your model in part (5), perform a RESET test and comment on your results.
8. For your model in part (5), test for heteroskedasticity and comment on your results. If you identify heteroskedasticity, make sure to account for it before moving on to (9).
9. Estimate a model based on all your findings that also includes interaction terms (if appropriate) and, if needed, any higher-power terms. Comment on the performance of this model compared to your other models. Make sure to use AIC and BIC for model comparison.
10. Evaluate your model performance (from 9) using cross-validation, and also by dividing your data into the traditional 2/3 training and 1/3 testing samples, to evaluate your out-of-sample performance. Comment on your results.
11. Provide a short (1 paragraph) summary of your overall conclusions/findings.

https://www.kaggle.com/shivam2503/diamonds
https://www.rdocumentation.org/packages/AER/versions/1.2-10
Answered 1 day after Oct 10, 2022


Mohd answered on Oct 12 2022
2022-10-12
1. Provide a descriptive analysis of your variables. This should include histograms and fitted distributions, correlation plot, boxplots, scatterplots, and statistical summaries (e.g., the five-number summary). All figures must include comments.
library(readr)
exams <- read_csv("exams.csv")
## Rows: 1000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): gender, race/ethnicity, parent_education_level, lunch, test_prep_co...
## dbl (1): math
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(exams)
First look at the data
# descriptive measures
skimr::skim(exams)
Data summary
Name                      exams
Number of rows            1000
Number of columns         6
Column type frequency:
  character               5
  numeric                 1
Group variables           None

Variable type: character
skim_variable           n_missing  complete_rate  min  max  empty  n_unique  whitespace
gender                          0              1    4    6      0         2           0
race/ethnicity                  0              1    7    7      0         5           0
parent_education_level          0              1   11   18      0         6           0
lunch                           0              1    8   12      0         2           0
test_prep_course                0              1    4    9      0         2           0

Variable type: numeric
skim_variable  n_missing  complete_rate   mean     sd  p0  p25  p50  p75  p100  hist
math                   0              1  66.09  15.16   0   57   66   77   100  ▁▁▅▇▃
#histogram of math score
hist(exams$math, main="Histogram of math score")
#Boxplot of math score
boxplot(exams$math, main="Boxplot of Math score")
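Step 1 also asks for fitted distributions. A minimal base R sketch of overlaying a fitted normal density on the histogram follows. The `scores` vector is simulated stand-in data (an assumption, so that the block runs without exams.csv); with the real data it would be `exams$math`.

```r
# Simulated stand-in for exams$math (assumption: not the real data)
set.seed(104)
scores <- round(pmin(pmax(rnorm(1000, mean = 66, sd = 15), 0), 100))

# Estimate the normal parameters by the sample mean and standard deviation
mu_hat <- mean(scores)
sigma_hat <- sd(scores)

# Draw the histogram on the density scale so the fitted curve is comparable
hist(scores, freq = FALSE, breaks = 20,
     main = "Histogram of math score with fitted normal density",
     xlab = "Math score")
curve(dnorm(x, mean = mu_hat, sd = sigma_hat), add = TRUE, lwd = 2)
```

If the MASS package is available, `MASS::fitdistr(scores, "normal")` would give maximum-likelihood estimates instead; the sample moments are used here only to keep the sketch dependency-free.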
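# Removing outliers
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
# The 1.5*IQR rule (IQR = Q3 - Q1) flags values below Q1 - 1.5*IQR.
# With Q1 = 57 and Q3 = 77 from the summary above, that fence is 27;
# the filter below uses a slightly stricter fixed cutoff of 30.
exams <- exams %>%
  filter(math >= 30)
# After removing outliers
boxplot(exams$math, main = "Boxplot of math score (outliers removed)")
hist(exams$math, main = "Histogram of math score (outliers removed)")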
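The cutoff can also be derived from the data rather than hard-coded. A small sketch of the 1.5*IQR rule in base R is shown here; `iqr_fences` and the `toy` vector are hypothetical names introduced for illustration, and `toy` is constructed only so that its quartiles match the reported p25 = 57 and p75 = 77.

```r
# Lower and upper k*IQR fences for a numeric vector
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Toy vector (assumption) whose quartiles are 57 and 77
toy <- c(57, 57, 77, 77)
fences <- iqr_fences(toy)
print(fences)

# Keep only the observations that fall inside the fences
kept <- toy[toy >= fences["lower"] & toy <= fences["upper"]]
```

With the real data, the same function could be applied to `exams$math` before filtering, so that the threshold and the comment describing it cannot drift apart.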
Categorical variables distribution
library(ggplot2)
# Gender distribution
ggplot(data = exams, aes(x = gender)) +
  geom_bar() +
  labs(title = "Gender Distribution", x = "Gender", y = "Total Count") +
  theme_classic()
# Race/ethnicity distribution
# (refer to the column by name inside aes() rather than exams$...; backticks are
#  needed because the name contains a slash)
ggplot(data = exams, aes(x = `race/ethnicity`)) +
  geom_bar() +
  labs(title = "race/ethnicity Distribution", x = "race/ethnicity", y = "Total Count") +
  theme_classic()
# Parent education level distribution
ggplot(data = exams, aes(x = parent_education_level)) +
  geom_bar() +
  labs(title = "parent_education_level Distribution", x = "parent_education_level", y = "Total Count") +
  theme_classic()
# Lunch distribution
ggplot(data = exams, aes(x = lunch)) +
  geom_bar() +
  labs(title = "lunch Distribution", x = "lunch", y = "Total Count") +
  theme_classic()
# Test preparation course distribution
ggplot(data = exams, aes(x = test_prep_course)) +
  geom_bar() +
  labs(title = "test_prep_course Distribution", x = "test_prep_course", y = "Total Count") +
  theme_classic()
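Because every predictor here is categorical, the bar charts show only group sizes. A grouped boxplot of the response is one way to relate each predictor to math score, as step 1 requests. The sketch below uses simulated data; the `toy` data frame, its level names, and the effect sizes are all assumptions chosen for illustration, not values from exams.csv.

```r
# Simulated stand-in data (assumption: level names and effects are illustrative)
set.seed(104)
n <- 300
toy <- data.frame(
  test_prep_course = sample(c("none", "completed"), n, replace = TRUE),
  lunch = sample(c("standard", "free/reduced"), n, replace = TRUE)
)
toy$math <- round(60 + 5 * (toy$test_prep_course == "completed") +
                    8 * (toy$lunch == "standard") + rnorm(n, sd = 12))

# Response distribution within each level of one predictor
boxplot(math ~ test_prep_course, data = toy,
        main = "Math score by test preparation course",
        xlab = "test_prep_course", ylab = "Math score")

# Five-number summary (min, lower hinge, median, upper hinge, max) per group
group_summary <- tapply(toy$math, toy$test_prep_course, fivenum)
print(group_summary)
```

The same two calls could be repeated for each of the five predictors in place of `test_prep_course`.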
2. Estimate a multiple linear regression model that includes all the main effects only (i.e., no interactions nor higher order terms). We will use this model as a baseline. Comment on the statistical and economic significance of your estimates. Also, make sure to provide an interpretation of your estimates.
baseline_mod<-lm(math~.,data=exams)
stargazer::stargazer(baseline_mod,type = "text")
##
## ===================================================================
## Dependent variable:
## ---------------------------
## math
## -------------------------------------------------------------------
## gendermale 4.322***
## (0.808)
##
## `race/ethnicity`group B 2.685
## (1.639)
##
## `race/ethnicity`group C 2.746*
## (1.532)
##
## `race/ethnicity`group D 5.413***
## (1.563)
##
## `race/ethnicity`group E 10.033***
## (1.729)
##
## parent_education_levelbachelor's degree 2.074
## ...
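On reading the table: each row is a dummy-variable coefficient, so `gendermale 4.322` is the estimated difference in mean math score between male students and the omitted reference level, holding the other predictors fixed. By default R takes the alphabetically first level of a character predictor as the reference. The sketch below illustrates this on simulated data with a single two-level predictor, where the coefficient equals the raw difference in group means; the `toy` data and its effect size are assumptions, not the exam data.

```r
# Simulated stand-in data (assumption: one two-level predictor, illustrative effect)
set.seed(104)
n <- 400
toy <- data.frame(gender = sample(c("female", "male"), n, replace = TRUE))
toy$math <- 64 + 4 * (toy$gender == "male") + rnorm(n, sd = 14)

# "female" sorts first, so it becomes the reference level
fit <- lm(math ~ gender, data = toy)

# Estimates, standard errors, t and p values, plus 95% confidence intervals
coef_table <- cbind(coef(summary(fit)), confint(fit))
print(round(coef_table, 3))

# With one dummy, the coefficient is exactly the male minus female mean gap
gap <- diff(tapply(toy$math, toy$gender, mean))
```

In the multi-predictor baseline model the coefficient is no longer the raw gap but the gap after adjusting for the other predictors; the confidence interval is a convenient way to discuss statistical and practical significance together.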