University at Buffalo, Industrial and Systems Engineering
IE322 Analytics and Computing for Industrial Engineers
Lab#3 Fall 2022
Machine Learning Practices
(This is an individual lab)
Due 23:59 November 13th, 2022

Description: The dataset for this lab is tuition.csv, and it is available on UBlearns. The dataset has information about school tuition. The description of each variable is displayed in Table 1.

Requirements: Draft a report to document your R code and results (or partial results if there are too many) in each step. Note that your report will be graded on both technical content (70%) and report quality (30%). Submit two files to UBLearns: 1) your report, and 2) your R script.

Table 1

VARIABLE         DESCRIPTION                                                      DATA TYPE
tuition          College tuition ("out-of-state" rate)                            continuous
pcttop25         Percent of new students from the top 25% of high school class    continuous
sf_ratio         Student to faculty ratio                                         continuous
fac_comp         Average faculty compensation                                     continuous
accrate          Fraction of applicants accepted for admission                    continuous
graduat          Percent of students who graduate                                 continuous
pct_phd          Percent of faculty with Ph.D.'s                                  continuous
fulltime         Percent of undergraduates who are full time students             continuous
alumni           Percent of alumni who donate                                     continuous
num_enrl         Number of new students enrolled                                  continuous
public.private   Is the college a public or private institution?                  discrete
                 (public=0, private=1)

1. Basic plotting (20 pts)
Read the tuition.csv data into the R console as D0. Use D0 for the following questions.
a) Change the data type of “public.private” into a factor.
b) Use ggplot to draw a scatter plot, where the x-axis is “num_enrl” and the y-axis is “fac_comp”, and each data point is distinguished by “public.private”.
c) Based on b), add linear regression lines for public institutions and private institutions. Copy and paste the final plot into your report.

2. Feature selection (30 pts)
Use D0 for the following questions.
a) Build a full linear regression model named full_model, where “tuition” is the dependent variable and the remaining variables are the independent variables. Report the summary of this full model in your report.
b) Based on the full model, perform forward feature selection to select the top 3 key features. This selection is based on the p-value of inclusion (i.e., penter). Report the results in your report.
c) Based on the full model, perform backward feature selection to select the top 3 key features. This selection is based on the p-value of exclusion (i.e., prem). Report the results in your report.

3. KNN (50 pts)
Use D0 to create a subset named D1 that includes only three features: “accrate”, “graduat”, and “public.private”. Then delete all missing values from D1 and overwrite D1. Hint: D1 <- na.omit(D1). Among the three features in D1, the independent variables are “accrate” and “graduat”, and the target variable is “public.private”. Use D1 for the following questions.
a) Use min-max normalization to normalize the two independent variables “accrate” and “graduat”. This step eliminates the effect of different value ranges on the model.
b) Set the seed number to 123456. Hint: set.seed(123456). This step ensures that you get the same model results every time you run the code.
c) Split D1 into a training set with 70% of the data and a test set with the remaining 30% of the data.
d) Build a KNN model using the training set, and test the model performance using the test set. Report the confusion matrix in your report.
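
The KNN steps in part 3 can be sketched roughly as follows. This is a minimal illustration on synthetic data, not a reference solution: the data values are made up (only the column names follow the handout), min_max is a hypothetical helper name, k = 5 is an assumed choice because the handout does not specify k, and knn() is taken from the class package. The order of the steps follows the handout (normalize, set the seed, split, fit).

```r
library(class)  # provides knn(); assumed to be installed (it ships with standard R)

# Synthetic stand-in for D1. The values are invented for illustration only;
# in the lab, D1 would come from tuition.csv after na.omit().
set.seed(123456)
n <- 300
label <- rbinom(n, 1, 0.5)
D1 <- data.frame(
  accrate = ifelse(label == 1, rnorm(n, 0.6, 0.15), rnorm(n, 0.8, 0.10)),
  graduat = ifelse(label == 1, rnorm(n, 70, 12), rnorm(n, 50, 12)),
  public.private = factor(label)
)

# a) Min-max normalization: rescale each predictor to the range [0, 1].
min_max <- function(x) (x - min(x)) / (max(x) - min(x))  # hypothetical helper
D1$accrate <- min_max(D1$accrate)
D1$graduat <- min_max(D1$graduat)

# b) Fix the seed so the random split is reproducible.
set.seed(123456)

# c) 70% of the rows for training, the remaining 30% for testing.
train_idx <- sample(seq_len(nrow(D1)), size = round(0.7 * nrow(D1)))
train <- D1[train_idx, ]
test <- D1[-train_idx, ]

# d) Fit and predict in one call; k = 5 is an assumption, not given in the handout.
features <- c("accrate", "graduat")
k <- 5
pred <- knn(train = train[, features], test = test[, features],
            cl = train$public.private, k = k)

# Confusion matrix: rows are predicted classes, columns are actual classes.
conf_mat <- table(Predicted = pred, Actual = test$public.private)
print(conf_mat)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
```

The diagonal of conf_mat counts the correctly classified test rows, so accuracy is the diagonal sum divided by the total. Trying a few values of k and comparing the resulting matrices is one way to justify the choice in a report.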
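
For the forward selection in part 2 b), the base-R sketch below shows one way a p-value entry threshold can drive the selection. It is an illustration under stated assumptions rather than the method the handout requires: the data are synthetic (x1 to x4 and y are hypothetical names, not tuition.csv columns), forward_by_pvalue is a hypothetical helper, the default penter = 0.05 is an assumed value, and the entry test is the partial F-test reported by add1().

```r
# Synthetic stand-in data: x1..x4 and y are hypothetical names, not tuition.csv columns.
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 3 * toy$x1 - 2 * toy$x2 + 0.5 * toy$x3 + rnorm(n)

# Hypothetical helper: greedy forward selection driven by the F-test p-value.
# At each step the candidate with the smallest p-value enters the model,
# provided that p-value does not exceed penter (assumed default: 0.05).
forward_by_pvalue <- function(data, response, penter = 0.05, max_features = 3) {
  candidates <- setdiff(names(data), response)
  selected <- character(0)
  # Start from the intercept-only model.
  current <- lm(as.formula(paste(response, "~ 1")), data = data)
  full_scope <- as.formula(paste("~", paste(candidates, collapse = " + ")))

  while (length(selected) < max_features && length(candidates) > 0) {
    # F-test for adding each term in the scope that is not yet in the model.
    tab <- add1(current, scope = full_scope, test = "F")
    # The first row is "<none>" (the current model), so drop it.
    pvals <- setNames(tab[["Pr(>F)"]][-1], rownames(tab)[-1])
    best <- names(which.min(pvals))
    if (pvals[[best]] > penter) break  # nothing left that passes the threshold
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
    # Refit with the newly selected feature included.
    current <- lm(as.formula(paste(response, "~", paste(selected, collapse = " + "))),
                  data = data)
  }
  selected
}

top3 <- forward_by_pvalue(toy, response = "y", penter = 0.05, max_features = 3)
print(top3)
```

Backward selection for part 2 c) mirrors this idea: start from the full model and repeatedly remove the feature with the largest p-value while it exceeds the exclusion threshold, stopping when three features remain.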