Answer To: R code
Aakarsh answered on Feb 12 2021
Inquiries 2015 Analysis for Enrollment Management
The dataset contains inquiries from more than 90,000 students; the university enrolled between 2,400 and 2,800 new freshmen each Fall semester. The data will be used to build a predictive model for better enrollment management.
1. Structure of the dataset, including variable names and data types
## col_types
## ETHNICITY factor
## TERRITORY factor
## ACADEMIC_INTEREST_1 factor
## ACADEMIC_INTEREST_2 factor
## Enroll integer
## CONTACT_DATE factor
## TOTAL_CONTACTS integer
## SELF_INIT_CNTCTS integer
## TRAVEL_INIT_CNTCTS integer
## SOLICITED_CNTCTS integer
## REFERRAL_CNTCTS integer
## CAMPUS_VISIT integer
## CONTACT_CODE1 factor
## LEVEL_YEAR factor
## IRSCHOOL factor
## satscore integer
## sex integer
## mailq integer
## telecq integer
## premiere integer
## interest integer
## stucell integer
## init_span integer
## int1rat numeric
## int2rat numeric
## hscrat numeric
## avg_income integer
## distance numeric
## Instate factor
Factor is used for non-numeric (i.e. string) data.
Integer is used for all whole-number variables.
Numeric is used for all decimal (floating-point) values.
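A table like the one above can be produced in base R. The following is a minimal sketch on a made-up three-column data frame; the name `toy` and its values are assumptions for illustration, not the actual inquiries data.

```r
# Toy data frame standing in for the inquiries data (values are made up).
toy <- data.frame(
  ETHNICITY = factor(c("A", "B", "A")),
  Enroll    = c(0L, 1L, 0L),
  hscrat    = c(0.05, 0.12, 0.03)
)

# One row per column, giving the class R assigned to it.
col_types <- data.frame(col_types = sapply(toy, class))
print(col_types)

# str() gives a similar overview, including the first few values.
str(toy)
```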
a. The nominal variables ACADEMIC_INTEREST_1, ACADEMIC_INTEREST_2, and IRSCHOOL were rejected because they were replaced by the interval variables INT1RAT, INT2RAT, and HSCRAT, respectively. For example, academic interest codes 1 and 2 were replaced by the percentage of inquirers over the past five years who indicated those interest codes and then enrolled. The variable IRSCHOOL is the high school code of the student, and it was replaced by the percentage of inquirers from that high school over the last five years who enrolled.
b. CONTACT_CODE1 and CONTACT_DATE are also rejected because Enrollment Management indicated that they are irrelevant.
c & d. Should your model reject any other variables for your analyses? If so, please explain reasons for each additionally rejected variable. What’s the target variable?
The target variable is Enroll, since the predictive model is to be built for enrollment management.
A correlation matrix is computed to check how the variables vary with respect to each other and to Enroll.
## ETHNICITY TERRITORY Enroll TOTAL_CONTACTS
## ETHNICITY 1.0000000000 0.012607065 0.008704672 0.14237574
## TERRITORY 0.0126070650 1.000000000 -0.008432170 -0.01584598
## Enroll 0.0087046723 -0.008432170 1.000000000 0.45178948
## TOTAL_CONTACTS 0.1423757428 -0.015845975 0.451789484 1.00000000
## SELF_INIT_CNTCTS 0.1249024060 -0.002158310 0.471775416 0.92375298
## TRAVEL_INIT_CNTCTS -0.1141321482 -0.017709316 0.052103175 0.22949147
## SOLICITED_CNTCTS 0.2158050449 -0.017885753 0.016668154 0.24508518
## REFERRAL_CNTCTS -0.0928461700 -0.018663301 0.048135332 0.11508998
## CAMPUS_VISIT 0.0004859854 0.016455601 0.235263591 0.29884717
## LEVEL_YEAR NA NA NA NA
## satscore -0.0003616180 0.041809616 0.190603812 0.33003615
## sex 0.0359722960 -0.023936771 -0.009335419 0.04124317
## mailq 0.0992853310 0.037409858 -0.050649759 -0.20429355
## telecq -0.0081718464 0.011542674 -0.278762589 -0.35556031
## premiere 0.0133729945 -0.023993955 0.399841753 0.49095358
## interest -0.0058959170 -0.022689121 0.181966425 0.28736380
## stucell 0.2159124085 0.027391821 0.183566594 0.36150757
## init_span 0.0754131797 -0.022244683 -0.024458718 0.08531583
## int1rat 0.0172225127 -0.150953431 0.125830110 0.19577708
## int2rat 0.0067889085 -0.165003202 0.123739993 0.18228442
## hscrat 0.0240512013 0.006454665 0.329380570 0.19548723
## avg_income 0.0571974089 0.026341530 0.107851731 0.14313147
## distance 0.0218457825 0.319958672 -0.055795275 -0.12813556
## Instate 0.0094691507 -0.250313134 0.070777436 0.15090125
From the correlation matrix above, variables that have very low correlation with Enroll can be removed from the model: ETHNICITY, TERRITORY, TRAVEL_INIT_CNTCTS, SOLICITED_CNTCTS, REFERRAL_CNTCTS, sex, mailq, init_span, Instate, avg_income, and distance. LEVEL_YEAR has only one value (FRI14), so its correlations are NA and it will not have any effect on the model.
TOTAL_CONTACTS and SELF_INIT_CNTCTS are also highly correlated with each other (0.92), so TOTAL_CONTACTS is removed to reduce dimensionality.
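The two screening steps (weak correlation with the target, and strong correlation between predictors) can be sketched in base R as follows. The toy data, the 0.1 cut-off for "low" and the 0.9 cut-off for "high" are illustrative assumptions; the report does not state which thresholds were actually used.

```r
# Toy data (made-up values); only the column names come from the report.
set.seed(1)
n <- 200
toy <- data.frame(
  SELF_INIT_CNTCTS = rpois(n, 2),
  sex              = rbinom(n, 1, 0.5),
  LEVEL_YEAR       = factor(rep("FRI14", n))  # single level, as in the report
)
toy$TOTAL_CONTACTS <- toy$SELF_INIT_CNTCTS + rbinom(n, 1, 0.3)
toy$Enroll         <- as.integer(toy$SELF_INIT_CNTCTS + rnorm(n) > 3)

# cor() needs numeric input, so factors are replaced by their integer codes.
num <- data.frame(lapply(toy, function(x) if (is.factor(x)) as.integer(x) else x))

# A constant column has zero variance, so its correlations come back as NA.
cm <- suppressWarnings(cor(num))

# Step 1: flag predictors weakly correlated with the target (or NA).
cutoff     <- 0.1  # illustrative assumption
target_cor <- cm[setdiff(colnames(cm), "Enroll"), "Enroll"]
weak       <- names(target_cor)[is.na(target_cor) | abs(target_cor) < cutoff]

# Step 2: flag predictor pairs that are highly correlated with each other.
pred_cm <- cm[names(target_cor), names(target_cor)]
pred_cm[upper.tri(pred_cm, diag = TRUE)] <- NA  # look at each pair once
high  <- which(abs(pred_cm) > 0.9, arr.ind = TRUE)
pairs <- data.frame(a = rownames(pred_cm)[high[, "row"]],
                    b = colnames(pred_cm)[high[, "col"]])

print(weak)
print(pairs)
```

Only one variable of each highly correlated pair needs to be dropped; which one to keep is a judgment call.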
e. Do you need to change any measurement levels of your existing variables? Why?
Yes. The data should be scaled (normalised) into the range [0, 1]; putting all variables on the same scale should result in better model performance.
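Scaling to [0, 1] is commonly done with min-max normalisation, x' = (x - min) / (max - min). The helper `min_max` and the toy values below are assumptions for illustration; this is a sketch, not necessarily the scaling code used for the report.

```r
# Min-max normalisation: maps the smallest value to 0 and the largest to 1.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant column: avoid 0/0
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy columns with very different scales (made-up values).
toy <- data.frame(
  satscore         = c(900, 1100, 1300),
  SELF_INIT_CNTCTS = c(0, 2, 8)
)

# Apply the helper to every column.
scaled <- as.data.frame(lapply(toy, min_max))
print(scaled)
```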
2. Explain whether variable imputation and transformation are needed. If so, please explain which variables have been imputed, transformed and how.
Imputation is needed to reduce bias in the data. For example, the telecq field has many NA values, so they are replaced by the field's mean so that this variable can be used for all rows.
Transforming or normalising the data as appropriate for its use produces better outputs.
Checking for NA values in the data frame:
## na_count
## Enroll 0
## SELF_INIT_CNTCTS 0
## CAMPUS_VISIT 0
## satscore 64479
## telecq 70880
## premiere 0
## interest 0
## stucell 0
## int1rat 0
## int2rat 0
## hscrat 0
The satscore and telecq fields have 64,479 and 70,880 NA values, respectively.
There is only one row where the SAT score is NA and Enroll is 1. It is treated as an outlier, because students cannot be enrolled without a SAT score.
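A per-column NA count of this shape can be obtained with `colSums(is.na(...))`. The sketch below uses a made-up four-row data frame named `toy`, which is an assumption for illustration only.

```r
# Toy data with missing values (made-up).
toy <- data.frame(
  Enroll   = c(1L, 0L, 0L, 1L),
  satscore = c(NA, 1100, NA, 1250),
  telecq   = c(NA, 2, NA, NA)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_count <- data.frame(na_count = colSums(is.na(toy)))
print(na_count)
```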
## Enroll SELF_INIT_CNTCTS CAMPUS_VISIT satscore telecq premiere interest
## 1 1 1 0 NA NA 0 0
## stucell int1rat int2rat hscrat
## 1 1 0.04926967 0.05666969 0.06451613
Therefore, all entries with an NA SAT score are dropped.
NA values of telecq are replaced with its mean.
Now there are no NA fields in the data frame.
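The two cleaning steps can be sketched in base R as follows. The toy data is made up, and computing the telecq mean after dropping rows (rather than before) is an assumption based on the order in which the steps are described.

```r
# Toy data with missing values (made-up).
toy <- data.frame(
  Enroll   = c(1L, 0L, 0L, 1L, 0L),
  satscore = c(NA, 1100, 980, 1250, NA),
  telecq   = c(NA, 2, NA, 4, 1)
)

# Step 1: drop every row whose SAT score is missing.
clean <- toy[!is.na(toy$satscore), ]

# Step 2: replace remaining NA values of telecq with the column mean.
telecq_mean <- mean(clean$telecq, na.rm = TRUE)
clean$telecq[is.na(clean$telecq)] <- telecq_mean

# Confirm that no NA values remain.
print(colSums(is.na(clean)))
```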
3. Please provide the following results for each model:
a. Regression model summary
##
## Call:
## lm(formula = Enroll ~ SELF_INIT_CNTCTS + CAMPUS_VISIT + satscore +
## telecq + premiere + hscrat, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.55745 -0.09830 -0.02401 0.02742 1.05392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.674e-02 1.144e-02 2.337 0.0194 *
## SELF_INIT_CNTCTS 2.866e-02 9.631e-04 29.761 < 2e-16 ***
## CAMPUS_VISIT 1.200e-01 6.018e-03 19.934 < 2e-16 ***
## satscore 5.228e-05 9.057e-06 5.772 7.9e-09 ***
## telecq -5.597e-02 2.756e-03 -20.310 < 2e-16 ***
## premiere 2.374e-01 6.661e-03 35.643 < 2e-16 ***
## hscrat 1.000e+00 1.882e-02 53.151 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2525 on 26996 degrees of freedom
## Multiple R-squared: 0.3282, Adjusted R-squared: 0.328
## F-statistic: 2198 on 6 and 26996 DF, p-value: < 2.2e-16
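To see how the fitted coefficients combine into a prediction, the estimates from the summary above can be applied by hand. The coefficients are copied from the output; the student profile is hypothetical and chosen only for illustration.

```r
# Coefficient estimates copied from the lm() summary above.
coefs <- c(
  "(Intercept)"    = 2.674e-02,
  SELF_INIT_CNTCTS = 2.866e-02,
  CAMPUS_VISIT     = 1.200e-01,
  satscore         = 5.228e-05,
  telecq           = -5.597e-02,
  premiere         = 2.374e-01,
  hscrat           = 1.000e+00
)

# Hypothetical student (made-up values); the intercept term is always 1.
student <- c(
  "(Intercept)"    = 1,
  SELF_INIT_CNTCTS = 3,
  CAMPUS_VISIT     = 1,
  satscore         = 1100,
  telecq           = 2,
  premiere         = 0,
  hscrat           = 0.10
)

# Linear prediction: sum of coefficient * value over all terms.
score <- sum(coefs * student[names(coefs)])
print(round(score, 3))  # about 0.278
```

Because this is a linear model fitted to a 0/1 target, the score is not guaranteed to stay within [0, 1].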
b. Decision tree
c. Neural network
4. Which model will you choose? Why? Please provide support for your answer.
For this case, regression works best. Most of the features in the provided data are numeric, and regression is a suitable choice when the task is to predict a future value of a process that is currently running.
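One way to support a model choice is to score each candidate on held-out data. The sketch below does this for a linear regression only, on synthetic data; the data-generating process, the 70/30 split and the 0.5 classification threshold are all assumptions for illustration. The same metrics would have to be computed for the decision tree and the neural network before the models could be compared.

```r
# Synthetic data (made-up); only the column names come from the report.
set.seed(7)
n <- 1000
toy <- data.frame(
  SELF_INIT_CNTCTS = rpois(n, 2),
  CAMPUS_VISIT     = rbinom(n, 1, 0.2),
  hscrat           = runif(n, 0, 0.3)
)
p <- with(toy, pmin(0.02 + 0.05 * SELF_INIT_CNTCTS + 0.15 * CAMPUS_VISIT + hscrat, 1))
toy$Enroll <- rbinom(n, 1, p)

# 70/30 train/test split.
train_idx <- sample(n, size = 0.7 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows, predict on the held-out rows.
fit  <- lm(Enroll ~ SELF_INIT_CNTCTS + CAMPUS_VISIT + hscrat, data = train)
pred <- predict(fit, newdata = test)

# Hold-out metrics: root mean squared error and accuracy at a 0.5 threshold.
rmse     <- sqrt(mean((test$Enroll - pred)^2))
accuracy <- mean((pred > 0.5) == (test$Enroll == 1))
cat("RMSE:", rmse, " accuracy:", accuracy, "\n")
```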
5. Please explain and summarize your major findings to the director of the Office of Enrollment Management.
A student has a high chance of enrolling if the number of self-initiated contacts is high, with a good SAT score, has...