For this assignment, you will use the following three data sets: US_airlines.csv, US_airports.csv and US_airrecord.csv.
Using R, you will prepare and explore the data sets using data cleaning and analysis techniques, and discuss the trends and points of interest you discover.
Steps you will complete should include:
Inspect and summarise your data.
Clean and combine datasets where appropriate
· Check for and handle missing values
· Remove any unnecessary variables
· Transform any variables that you would like to use in a different form
Plot data and identify trends and/or points of interest
Perform data analysis to investigate:
· The airlines which experience the most delays
· The busiest routes
· The relationship between distance between airports and flying time
Predict flying time based on distance
Discuss your findings, comment your code and prepare explanatory visualisations.
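
The two steps on distance and flying time could be approached with a simple linear regression. The block below is only a minimal sketch on made-up numbers, not the assignment's expected solution: the `flights` data frame is a toy stand-in for the flight records, and `DISTANCE` is an assumed column name (only `AIR_TIME` appears in the notes below).

```r
# Toy stand-in for the flight records (values are invented for illustration).
# DISTANCE is an assumed column name; AIR_TIME is in minutes.
flights <- data.frame(
  DISTANCE = c(200, 500, 1000, 1500, 2500),
  AIR_TIME = c(40, 75, 140, 200, 320)
)

# Simple linear regression: AIR_TIME = intercept + slope * DISTANCE
fit <- lm(AIR_TIME ~ DISTANCE, data = flights)

# Strength of the linear relationship between the two variables
r <- cor(flights$DISTANCE, flights$AIR_TIME)

# Predict flying time for distances that are not in the data
new_routes <- data.frame(DISTANCE = c(300, 1200))
predicted <- predict(fit, newdata = new_routes)

print(coef(fit))
print(predicted)
```

On the real data, `summary(fit)` would also report the fit quality, and a scatter plot with the fitted line added via `abline(fit)` would serve as one of the explanatory visualisations.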
Some observations I have made:
· Times use the 24-hour clock and are stored without leading zeros, so 10 is 00:10 and 1542 is 15:42.
· Any time differences are in minutes.
· There are 19 flights that have a wheels-off time but also have a cancellation reason. What should we do with them? All of the reasons are weather or airline related.
· TAIL_NUMBER: one value, 7819A, needs an N put in front of it.
· NA values in elapsed time need to be calculated as AIR_TIME + TAXI_IN + TAXI_OUT. First, however, the NA values in AIR_TIME need to be replaced with the duration between WHEELS_OFF and WHEELS_ON.
· Need to add relevant data from US_airlines.csv and US_airports.csv using the IATA_COD
· For the variables below, NA means there was no delay, and 0 means there was a delay but not for that reason. Any other value is how long the delay was for that reason. A delay can have more than one reason, e.g. both an air system and an airline delay. These need some transformation.
o AIR_SYSTEM_DELAY
o SECURITY_DELAY
o AIRLINE_DELAY
o LATE_AIRCRAFT_DELAY
o WEATHER_DELAY
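
The clock-time and elapsed-time observations above could be handled along these lines. This is a sketch under assumptions rather than a definitive fix: the data frame is invented, `ELAPSED_TIME` is an assumed column name for elapsed time, `hhmm_to_minutes` is a hypothetical helper, and WHEELS_OFF and WHEELS_ON are assumed to be in the same time zone (the notes do not say; if they are local times at different airports, the difference would need a time zone correction).

```r
# Hypothetical helper: convert a 24-hour clock value stored as a number
# (10 -> 00:10, 1542 -> 15:42) to minutes after midnight.
hhmm_to_minutes <- function(x) {
  (x %/% 100) * 60 + (x %% 100)
}

# Invented example rows; ELAPSED_TIME is an assumed column name.
flights <- data.frame(
  WHEELS_OFF   = c(10, 1542, 2350),
  WHEELS_ON    = c(125, 1710, 45),   # third flight lands after midnight
  TAXI_OUT     = c(12, 15, 20),
  TAXI_IN      = c(5, 7, 6),
  AIR_TIME     = c(75, NA, NA),
  ELAPSED_TIME = c(92, NA, NA)
)

# Minutes between wheels off and wheels on.
# %% 1440 wraps negative differences for landings after midnight.
airborne <- (hhmm_to_minutes(flights$WHEELS_ON) -
             hhmm_to_minutes(flights$WHEELS_OFF)) %% 1440

# Step 1: fill missing AIR_TIME from the wheels-off / wheels-on duration.
flights$AIR_TIME <- ifelse(is.na(flights$AIR_TIME), airborne, flights$AIR_TIME)

# Step 2: fill missing elapsed time as AIR_TIME + TAXI_IN + TAXI_OUT.
flights$ELAPSED_TIME <- ifelse(
  is.na(flights$ELAPSED_TIME),
  flights$AIR_TIME + flights$TAXI_IN + flights$TAXI_OUT,
  flights$ELAPSED_TIME
)

print(flights)
```

The order matters: AIR_TIME has to be filled first, otherwise the elapsed-time sum would still be NA for those rows.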
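
One possible transformation of the five delay-reason variables is sketched below, based on the NA / 0 / positive reading described above. The rows are invented, and `DELAYED`, `N_DELAY_REASONS` and `TOTAL_REASON_DELAY` are hypothetical new columns, not part of the data sets.

```r
delay_cols <- c("AIR_SYSTEM_DELAY", "SECURITY_DELAY", "AIRLINE_DELAY",
                "LATE_AIRCRAFT_DELAY", "WEATHER_DELAY")

# Invented example rows: no delay, delay with two reasons, delay with one reason.
flights <- data.frame(
  AIR_SYSTEM_DELAY    = c(NA, 20, 0),
  SECURITY_DELAY      = c(NA, 0, 0),
  AIRLINE_DELAY       = c(NA, 15, 0),
  LATE_AIRCRAFT_DELAY = c(NA, 0, 45),
  WEATHER_DELAY       = c(NA, 0, 0)
)

# Keep the "no delay" information before overwriting the NAs:
# a flight counts as delayed if any reason column is not NA.
flights$DELAYED <- rowSums(!is.na(flights[delay_cols])) > 0

# Replace NA with 0 so the reason columns can be summed and averaged.
flights[delay_cols][is.na(flights[delay_cols])] <- 0

# Number of reasons contributing to each delay, and total minutes across reasons.
flights$N_DELAY_REASONS <- rowSums(flights[delay_cols] > 0)
flights$TOTAL_REASON_DELAY <- rowSums(flights[delay_cols])

print(flights)
```

Creating the `DELAYED` flag first keeps the distinction between "not delayed" (NA) and "delayed, but not for this reason" (0), which would otherwise be lost once the NAs become 0.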