1. Clean the Excel data set using RStudio.
- Use the forcats package to reduce categories.
- Use the mice package to impute missing values.
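For the imputation bullet, a minimal sketch of how mice might be applied is shown here. It runs on a small made-up data frame rather than the real Excel file. The `age` column, the sample size, the number of imputations (`m = 5`), and the seeds are all assumptions for illustration; only the Branch and unvac_pop column names come from the task.

```r
library(mice)

set.seed(1)
n <- 100

# Made-up stand-in for the real data set; `age` is a hypothetical column.
toy <- data.frame(
  age       = round(rnorm(n, mean = 30, sd = 5)),
  unvac_pop = factor(sample(c("0", "1"), n, replace = TRUE)),
  Branch    = factor(sample(c("Unrestricted", "Restricted"), n, replace = TRUE))
)

# Knock out some values so there is something to impute.
toy$age[sample(n, 10)]    <- NA
toy$Branch[sample(n, 10)] <- NA

# Multiple imputation by chained equations; mice picks a default method
# per column based on its type (numeric or factor).
imp <- mice(toy, m = 5, seed = 123, printFlag = FALSE)

# Extract the first completed data set for downstream modelling.
cleaned <- mice::complete(imp, 1)
colSums(is.na(cleaned))
```

Calling `mice::complete()` with the package prefix avoids a clash with `tidyr::complete()` if the tidyverse is also loaded.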
The state values in the HOR_state and UnitST columns need to be reduced to four regions: West, Midwest, South, and Northeast. The Branch categories need to be reduced to two groups, Unrestricted and Restricted:
Unrestricted - Armor, Air Defense Artillery, Ammunition, Aviation, Field Artillery, Infantry, Logistics, Mechanical Maintenance, Military Police, Special Forces.
Restricted - Adjacent General, Army Medical Specialist Corps, Army Nurse Corps, Behavioral Sciences, CBRN, Chaplain, Civil Affairs, CMF Immaterial, Corps of Engineers, Cyber, Dental Corps, Electronic Maintenance, Financial Management, Force Management, Health Services, Information Operations, Information Systems Engineer, Judge Advocate Generals Corps, Laboratory Sciences, Medical Corps, Military Intelligence, Nuclear & Counterproliferation, Operations Research/Systems Analysis, Personnel Special Reporting Codes, Preventative Medical Sciences, Psychological Operations, Public Affairs, Quartermaster Corps, Recruitment & Reenlistment, Research/Development/Acquisition, Signal Corps, Simulations Operations, Space Operations, Strategist Intelligence, Strategist, Systems Automation Officer, Telecommunications Systems Engineers, Transportation Corps, Veterinary Corps.
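The category reductions described above can be sketched with `forcats::fct_collapse()`. This is a minimal example on made-up rows, under several assumptions: that states are stored as two-letter abbreviations, that the four regions are the U.S. Census regions (taken from base R's built-in `state.abb` and `state.region`, where "North Central" corresponds to Midwest), and that DC belongs to the South. The helper names `to_region` and `to_branch_group` are hypothetical.

```r
library(forcats)

# Census regions from base R's built-in state data (assumed mapping).
region_map <- split(state.abb, state.region)
names(region_map)[names(region_map) == "North Central"] <- "Midwest"
region_map$South <- c(region_map$South, "DC")  # DC is not in state.abb

# Branches listed as Unrestricted in the task description.
unrestricted <- c(
  "Armor", "Air Defense Artillery", "Ammunition", "Aviation",
  "Field Artillery", "Infantry", "Logistics", "Mechanical Maintenance",
  "Military Police", "Special Forces"
)

# Collapse state abbreviations into regions. Only states present in the
# data are passed on, which avoids "unknown levels" warnings.
to_region <- function(x) {
  f <- factor(x)
  present <- lapply(region_map, intersect, levels(f))
  present <- present[lengths(present) > 0]
  fct_collapse(f, !!!present)
}

# Collapse branches: listed ones become Unrestricted, everything else
# (including any misspelled label) falls into Restricted.
to_branch_group <- function(x) {
  f <- factor(x)
  fct_collapse(
    f,
    Unrestricted = intersect(unrestricted, levels(f)),
    other_level = "Restricted"
  )
}

# Made-up rows for illustration only.
toy <- data.frame(
  HOR_state = c("TX", "CA", "NY", "OH", "DC"),
  UnitST    = c("GA", "WA", "PA", "IL", "VA"),
  Branch    = c("Infantry", "Cyber", "Armor", "Chaplain", "Signal Corps")
)

toy$HOR_state <- to_region(toy$HOR_state)
toy$UnitST    <- to_region(toy$UnitST)
toy$Branch    <- to_branch_group(toy$Branch)
toy
```

Values that match no state abbreviation (for example territories or overseas codes) are left as their own levels by `to_region`, so it is worth checking `levels()` on the result before modelling.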
2. After the data has been cleaned, fit a random forest model with the unvac_pop column as the response variable and the Branch and UnitST columns as predictors, using the supporting material Word document as guidance.
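One way the model fit might look is sketched here with the randomForest package. The supporting document may prescribe a different package or settings, so this is an assumption, as are the simulated data, the treatment of unvac_pop as a binary 0/1 factor, and `ntree = 500`.

```r
library(randomForest)

set.seed(42)
n <- 200

# Simulated stand-in for the cleaned data set (not real data).
toy <- data.frame(
  Branch = factor(sample(c("Unrestricted", "Restricted"), n, replace = TRUE)),
  UnitST = factor(sample(c("West", "Midwest", "South", "Northeast"), n, replace = TRUE))
)

# Hypothetical outcome: the probabilities below are invented for illustration.
p <- ifelse(toy$Branch == "Unrestricted", 0.6, 0.3)
toy$unvac_pop <- factor(rbinom(n, size = 1, prob = p), levels = c(0, 1))

# A factor response makes randomForest fit a classification forest.
rf_fit <- randomForest(
  unvac_pop ~ Branch + UnitST,
  data = toy,
  ntree = 500,
  importance = TRUE
)

print(rf_fit)       # out-of-bag error estimate and confusion matrix
importance(rf_fit)  # variable importance for Branch and UnitST
```

If unvac_pop is stored as a number in the real data, it needs to be converted to a factor first; otherwise randomForest fits a regression forest instead.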
3. Then estimate the AUC value of the random forest model, using the supporting material Word document as guidance.
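A possible sketch of the AUC step uses the pROC package on out-of-bag class probabilities from the forest. The choice of pROC, the use of out-of-bag predictions instead of a held-out test set, and the simulated data are all assumptions; the supporting document may call for a different approach.

```r
library(randomForest)
library(pROC)

set.seed(42)
n <- 200

# Simulated stand-in for the cleaned data set (not real data).
toy <- data.frame(
  Branch = factor(sample(c("Unrestricted", "Restricted"), n, replace = TRUE)),
  UnitST = factor(sample(c("West", "Midwest", "South", "Northeast"), n, replace = TRUE))
)
p <- ifelse(toy$Branch == "Unrestricted", 0.6, 0.3)
toy$unvac_pop <- factor(rbinom(n, size = 1, prob = p), levels = c(0, 1))

rf_fit <- randomForest(unvac_pop ~ Branch + UnitST, data = toy, ntree = 500)

# With no newdata, predict() returns out-of-bag predictions; type = "prob"
# gives the class vote fractions, and column "1" is the positive class.
oob_prob <- predict(rf_fit, type = "prob")[, "1"]

# ROC curve with class "0" as control and "1" as case.
roc_obj <- roc(
  response  = toy$unvac_pop,
  predictor = oob_prob,
  levels    = c("0", "1"),
  direction = "<",
  quiet     = TRUE
)

auc_val <- as.numeric(auc(roc_obj))
auc_val
```

Out-of-bag probabilities give an AUC estimate without setting aside a test set. If the guidance document uses a train/test split, the same `roc()` call applies with `predict(rf_fit, newdata = test, type = "prob")` in place of the out-of-bag values.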