Datasets: There are two datasets for this exercise: 1) “FlightInfo_scheduled.csv” and 2) “FlightInfo_actual.csv”. The former provides information about the flight schedules that were planned to arrive at three New York based airports (JFK, LGA, and EWR) in January 2022. The latter dataset provides information about the departure times of these flights, the weather status during departure (1 indicates bad weather), and whether the flight departed on time or not. FlightID is the unique identifier of flights, Carrier is the airline code, and origin is the airport of origin. DayOfWeek is 1 = Monday, 2 = Tuesday, etc. Notice that the times are entered in HHMM format. Our goal is to predict whether a flight will be delayed or not.
STEP 1: Data Pre-processing
1.1.
Reading the data: Read both datasets into the software and make sure that the attribute measurement types are set.
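A minimal sketch of this step, assuming Python with pandas stands in for "the software". The inline sample replaces FlightInfo_scheduled.csv so the snippet runs on its own; its rows are made up for illustration, and the helper name read_flights is hypothetical. HHMM times are read as text so that leading zeros survive.

```python
import io
import pandas as pd

# Made-up stand-in for FlightInfo_scheduled.csv (illustrative rows only).
SAMPLE_SCHEDULED = """FlightID,Carrier,DayOfWeek,ScheduledDeptTime
1001,DL,1,0630
1002,US,6,1455
"""

def read_flights(source):
    """Read a flight CSV with explicit column types (hypothetical helper)."""
    return pd.read_csv(
        source,
        dtype={
            "FlightID": "string",           # identifier, not a quantity
            "Carrier": "category",          # nominal airline code
            "DayOfWeek": "int64",           # 1 = Monday ... 7 = Sunday
            "ScheduledDeptTime": "string",  # HHMM text, keeps "0630" intact
        },
    )

scheduled = read_flights(io.StringIO(SAMPLE_SCHEDULED))
print(scheduled.dtypes)
```

With the real files, a path such as "FlightInfo_scheduled.csv" can be passed in place of the in-memory sample.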
1.2.
Merging datasets: Merge these datasets into a single file, as shown in the attached screenshot.
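One way to sketch the merge in pandas, assuming FlightID is the join key (the text calls it the unique identifier) and that an inner join is wanted. The rows are made up, and the column name Weather is an assumption.

```python
import pandas as pd

# Made-up rows standing in for the two CSV files.
scheduled = pd.DataFrame({
    "FlightID": [1001, 1002, 1003],
    "Carrier": ["DL", "US", "AA"],
    "ScheduledDeptTime": ["0630", "1455", "1800"],
})
actual = pd.DataFrame({
    "FlightID": [1001, 1002, 1003],
    "ActualDeptTime": ["0635", "@NULL", "1830"],
    "Weather": [0, 0, 1],  # assumed column name; 1 = bad weather
})

# Inner join on the shared key: keeps flights present in both files.
merged = scheduled.merge(actual, on="FlightID", how="inner")
print(merged)
```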
1.3.
Data Quality Issues: Two notable problems are:
1.3.1.
Duplicates: Remove duplicates, while keeping the first record in each group in the data.
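A short pandas sketch of de-duplication, assuming a "group" means rows that share the same FlightID. The sample rows are made up.

```python
import pandas as pd

flights = pd.DataFrame({
    "FlightID": [1001, 1001, 1002],
    "Carrier": ["DL", "DL", "AA"],
    "ActualDeptTime": ["0635", "0640", "1830"],
})

# keep="first" retains the first record of each FlightID group.
deduped = (
    flights.drop_duplicates(subset="FlightID", keep="first")
    .reset_index(drop=True)
)
print(deduped)
```

If duplicates are instead defined as fully identical rows, dropping the subset argument compares all columns.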
1.3.2.
Missing values: ActualDeptTime has missing values (in @NULL form). Replace these missing values with the ScheduledDeptTime, if the flight departed on time.
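The conditional fill can be sketched as follows. The on-time flag is called OnTime here, which is an assumed column name, and 1 is assumed to mean "departed on time". Rows that are missing but not on time are left missing, since the text gives no rule for them.

```python
import pandas as pd

flights = pd.DataFrame({
    "FlightID": [1001, 1002, 1003],
    "ScheduledDeptTime": ["0630", "1455", "1800"],
    "ActualDeptTime": ["0635", "@NULL", "@NULL"],
    "OnTime": [1, 1, 0],  # assumed column name; 1 = departed on time
})

# Turn the "@NULL" marker into a real missing value.
actual = flights["ActualDeptTime"].where(flights["ActualDeptTime"] != "@NULL")

# Fill only the rows that are both missing and on time.
fill = actual.isna() & (flights["OnTime"] == 1)
flights["ActualDeptTime"] = actual.mask(fill, flights["ScheduledDeptTime"])
print(flights)
```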
1.3.
Selecting sub-sample: Since US Airways does not exist anymore, remove/discard its flights from the dataset. Note that the US Airways carrier code is “US”.
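In pandas, this selection is a boolean filter on Carrier. A minimal sketch with made-up rows:

```python
import pandas as pd

flights = pd.DataFrame({
    "FlightID": [1001, 1002, 1003, 1004],
    "Carrier": ["DL", "US", "AA", "US"],
})

# Keep every flight whose carrier code is not "US".
kept = flights[flights["Carrier"] != "US"].reset_index(drop=True)
print(kept)
```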
1.4.
Re-classify an attribute: Instead of using individual days of the week as a feature, assume that we are only interested in knowing whether a flight was scheduled on a weekday or a weekend. Re-classify the DayOfWeek attribute into a new attribute accordingly (the new attribute should have two values: “weekday” and “weekend”). Notice that original DayOfWeek values 6 and 7 indicate Saturday and Sunday. Also note that you may need to change the measurement type of DayOfWeek first, before doing the re-classification.
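A possible pandas version of this re-classification. The new column name DayType is an assumption, and DayOfWeek is deliberately stored as text here to show the type conversion the step warns about.

```python
import numpy as np
import pandas as pd

# DayOfWeek held as text, as it might be after a careless import.
flights = pd.DataFrame({"DayOfWeek": ["1", "5", "6", "7"]})

# Convert to integers first, then map 6 and 7 to "weekend".
day = flights["DayOfWeek"].astype(int)
flights["DayType"] = np.where(day.isin([6, 7]), "weekend", "weekday")
print(flights)
```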
1.5.
Derive a new attribute: Similar to the previous step, assume that instead of using individual days of the month as a feature, we are only interested in knowing whether a flight was operated early in the month or late in the month. If a flight was scheduled on Jan 15th or later, then it is a “LateMonth” flight. Otherwise, it is an “EarlyMonth” flight.
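The rule can be sketched in pandas as below. The date column name FlightDate, its YYYY-MM-DD format, and the new column name MonthPart are all assumptions, since the text does not name them.

```python
import numpy as np
import pandas as pd

flights = pd.DataFrame({
    # Assumed column name and date format.
    "FlightDate": ["2022-01-03", "2022-01-14", "2022-01-15", "2022-01-31"],
})

dates = pd.to_datetime(flights["FlightDate"], format="%Y-%m-%d")

# Day 15 or later counts as late month; everything else is early month.
flights["MonthPart"] = np.where(dates.dt.day >= 15, "LateMonth", "EarlyMonth")
print(flights)
```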
Derive another attribute: Finally, assume that you would like to group flights into three categories based on their scheduled departure times: Morning, Afternoon, and Evening. Any flight that is before 12 PM is a “Morning” flight; any flight between 12 PM and 5 PM is an “Afternoon” flight; and any flight that is later than 5 PM is an “Evening” flight.
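A sketch of the binning with pandas.cut on the HHMM value. It assumes a departure at exactly 1700 counts as "Afternoon" (the text says "later than 5 PM" for Evening), and the new column name TimeOfDay is hypothetical.

```python
import pandas as pd

flights = pd.DataFrame({
    "ScheduledDeptTime": ["0630", "1159", "1200", "1700", "1701", "2359"],
})

# HHMM text such as "0630" becomes the integer 630, which preserves ordering.
hhmm = flights["ScheduledDeptTime"].astype(int)

# Right-inclusive bins: (-1, 1159], (1159, 1700], (1700, 2359].
flights["TimeOfDay"] = pd.cut(
    hhmm,
    bins=[-1, 1159, 1700, 2359],
    labels=["Morning", "Afternoon", "Evening"],
)
print(flights)
```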
QUESTION 1: Submit the merged and updated CSV file.
1.6.
Filter out attributes: Remove all unnecessary attributes from the dataset before running any classification models. These include unique identifier attributes, time-related attributes (HHMM), the date attribute, and attributes that were re-coded into new forms.
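Dropping the listed attributes might look like this in pandas. FlightID, ScheduledDeptTime, ActualDeptTime, and DayOfWeek come from the text; FlightDate, DayType, TimeOfDay, and Delayed are assumed names used only for illustration.

```python
import pandas as pd

flights = pd.DataFrame({
    "FlightID": [1001, 1003],
    "Carrier": ["DL", "AA"],
    "ScheduledDeptTime": ["0630", "1800"],
    "ActualDeptTime": ["0635", "1830"],
    "FlightDate": ["2022-01-03", "2022-01-15"],  # assumed name
    "DayOfWeek": [1, 6],
    "DayType": ["weekday", "weekend"],           # assumed name
    "TimeOfDay": ["Morning", "Evening"],         # assumed name
    "Delayed": [0, 1],                           # assumed target name
})

# Identifier, HHMM times, the date, and the re-coded original attribute.
to_drop = ["FlightID", "ScheduledDeptTime", "ActualDeptTime",
           "FlightDate", "DayOfWeek"]
features = flights.drop(columns=to_drop)
print(features.columns.tolist())
```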
STEP 2: Pattern Identification: In this step, you are expected to build a Decision Tree model using the algorithm and manually interpret its decision rules.
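To illustrate reading decision rules, here is a sketch using scikit-learn's DecisionTreeClassifier, which implements CART rather than C5.0, so its splits may differ from those of the assignment's tool. The data is made up so that bad weather alone determines the outcome; the column names Weather, DayType, and Delayed are assumptions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data in which bad weather alone determines the delay.
flights = pd.DataFrame({
    "Weather": [0, 0, 0, 0, 1, 1, 1, 1],
    "DayType": ["weekday", "weekend"] * 4,
    "Delayed": [0, 0, 0, 0, 1, 1, 1, 1],
})

# One-hot encode the categorical feature; the tree needs numeric inputs.
X = pd.get_dummies(flights[["Weather", "DayType"]], columns=["DayType"], dtype=int)
y = flights["Delayed"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Text rendering of the fitted tree: one line per split or leaf.
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```

Each root-to-leaf path in the printed output reads as one if-then rule, and features near the root are the ones the tree found most informative.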
QUESTION 2: What are the main contributors to a flight being delayed? If you were planning air travel, when would you prefer to book your flight (assuming that you are flexible with your schedule)?
STEP 3: Model Evaluation: In this step, you will analyze the performance of different classification models.
3.1.
Partitioning: Split the dataset into training and test sets. Keep 70% of the observations in the training set and 30% of the observations in the test set. Separately, set up 10-fold cross-validation.
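A sketch of the partitioning with scikit-learn, on random stand-in data. It assumes the 10 folds are drawn from the training partition and that stratifying on the class label is acceptable; neither is stated in the text.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Random stand-in for the prepared feature matrix and delay label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.tile([0, 1], 100)

# 70% training, 30% test, with class proportions preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# 10 folds over the training partition (an assumption, see above).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
folds = list(cv.split(X_train, y_train))
print(len(X_train), len(X_test), len(folds))
```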
3.2.
C5.0 Performance: Build a decision tree model from the training dataset and test its performance using the testing dataset.
QUESTION 3: What is the overall accuracy of the decision tree model under 10-fold cross-validation and on the testing dataset?
3.3.
Random Forest & SVM: Now, use Random Forest and SVM to conduct the same analysis as in the previous step.
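The comparison could be sketched with scikit-learn as below. The data is synthetic (make_classification), so the printed accuracies say nothing about the flight data. The hyperparameters are library defaults or arbitrary choices, and scaling the features before the SVM is an added assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared flight features and delay label.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

results = {}
for name, model in models.items():
    # Mean accuracy over 10 cross-validation folds of the training set.
    cv_acc = cross_val_score(model, X_train, y_train, cv=10,
                             scoring="accuracy").mean()
    # Accuracy on the held-out 30% after refitting on all training rows.
    test_acc = accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
    results[name] = {"cv_accuracy": cv_acc, "test_accuracy": test_acc}
    print(f"{name}: CV accuracy {cv_acc:.3f}, test accuracy {test_acc:.3f}")
```

Adding a decision tree to the models dictionary would put all three classifiers through the same evaluation.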
QUESTION 4: What is the overall accuracy of the Random Forest and SVM models on the testing dataset and under 10-fold cross-validation? Which would you pick, based on accuracy performance?