Overview
This assignment requires you to find some open data and use the knowledge and skills gained during the course to preprocess the data. You will create a report using R Markdown to explain the steps you took to perform the data preprocessing tasks. You will also publish this report online (in RPubs), which will give you the opportunity to build your data analytics portfolio. This is a great way of showing potential employers what you are capable of. The more clearly you demonstrate your skills, the more marks you will be awarded.
Assignment Data Sources
Assignment 2 is open-ended; however, you are required to find suitable datasets that fulfill the minimum requirements given below. All of the datasets that you use in this assignment must be open and ideally have a Creative Commons Licence. This will ensure you can share your work with anyone provided you make proper attribution. If you’re not sure if data is Open, contact the provider, read the documentation or post on the discussion board and I will investigate. Some open data sources are provided below, but I encourage you to find others:
o https://www.kaggle.com
o UCI Machine Learning Repository
o data.gov
o world bank
o amazon web services
o google data sets
o youtube video data sets
o analytics vidhya
o quandl
o driven data
o http://www.abs.gov.au/
o https://www.data.vic.gov.au/
o http://www.bom.gov.au/
o https://relational.fit.cvut.cz
Minimum Requirements for the Data sets
Considering this is a data preprocessing class, I do expect your data set to have certain requirements so that you can demonstrate your knowledge of data preprocessing. The following are the minimum requirements for the data sets that I will look for:
1. At least two data sets should be merged to create your assignment data (for example you can take crime statistics for the cities/states in Australia and merge this data set with cities/states’ per capita income data).
2. Your data set should include multiple data types (numerics, characters, factors, etc).
3. Your data set should include variables suitable for data type conversions so that you should be able to apply the required data type conversions (e.g., character -> factor, character -> date, numeric -> factor, etc. conversions).
4. Your data set should include at least one factor variable that needs to be labelled and/or ordered.
5. At least one of the data sets that you use should be Untidy. You need to explain why the data set or data sets you used is/are Untidy. Then you need to apply the required steps to reshape your data into a tidy format.
6. At least one variable needs to be created/mutated from the existing ones (e.g. the data may contain income and expense variables and you may create a savings variable out of the income and expense variables).
7. You are expected to scan all variables for missing values, special values and obvious errors (i.e. inconsistencies). If there are missing values, use any of the suitable techniques outlined in Module 5 to deal with them, reason and document your approach properly. If there are no missing values in the data, then scan all variables for any special values and obvious errors, use any of the suitable techniques outlined in Module 5 to deal with them, reason and document your approach properly.
8. You are expected to scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques outlined in Module 6 to deal with them, reason and document your approach properly.
9. You are expected to apply data transformations on at least one of the variables. The purpose of this transformation should be one of the following reasons: i) to change the scale for better understanding of the variable, ii) to convert a non-linear relation into linear one, or iii) to decrease the skewness and convert the distribution into a normal distribution.
10. You are expected to use only readr, xlsx, readxl, foreign, gdata, rvest, dplyr, tidyr, deductive, deducorrect, editrules, validate, Hmisc, forecast, stringr, lubridate, car, outliers, MVN, infotheo, MASS, caret, MLR, ggplot2, knitr and base R functions for this section. You can also use your own functions. This will show your accumulated knowledge that you gained throughout the semester in this course.
Optional things that you can do to preprocess data:
● You can subset your data by selecting variables and/or filtering in (or out) cases. Please don’t forget to put an explanation in your report if you do so.
● Your data set can include date or string information or both. If this is the case, I expect you to apply required date conversions for dates and string manipulations for strings as required.
● Depending on your level of knowledge gained in other courses (i.e. Applied Analytics and/or Machine Learning, etc) you may apply data normalisation, feature selection and feature extraction. Note that this is an optional task and you don’t have to apply any of these techniques if you don’t know the theory and the fundamentals.
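As a rough illustration of requirements 5 and 6, the sketch below reshapes a small untidy table into tidy form and then derives a new variable. The data frame `income_wide`, its column names and all of its values are invented for illustration and do not come from any of the listed sources; only tidyr and dplyr, both on the permitted package list, are assumed.

```r
library(tidyr)
library(dplyr)

# Hypothetical untidy data: the years are spread across column headers,
# so the column names hold values of a "year" variable.
income_wide <- data.frame(
  state = c("VIC", "NSW"),
  `2019` = c(52000, 55000),
  `2020` = c(53000, 54000),
  check.names = FALSE
)

# Tidy it: one row per state-year observation.
income_long <- income_wide %>%
  pivot_longer(cols = -state, names_to = "year", values_to = "income") %>%
  mutate(year = as.integer(year))

# Requirement 6: derive a new variable from existing ones
# (the expense values are also made up).
income_long <- income_long %>%
  mutate(expense = c(40000, 41000, 45000, 44500),
         savings = income - expense)

head(income_long)
```

The wide table is untidy because a variable (year) is stored in the column headers; after reshaping, each variable has its own column and each observation its own row.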
Report Section Details
1. Report title and student details [Plain text]: You can add the title of your report and student details by updating the “title” and “author” entries at the top of the R Markdown Template.
2. Required packages [R code]: Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.
3. Executive Summary [Plain text]: In your own words, provide a brief summary of the preprocessing. Explain the steps that you have taken to preprocess your data. Write this section last after you have performed all data preprocessing. (Word count Max: 300 words).
4. Data [Plain text & R code & Output]: A clear description of data sets, their sources, and variable descriptions should be provided. In this section, you must also provide the R codes with outputs (e.g. head of data sets) that you used to import/read/scrape the data set. You need to fulfil the minimum requirement #1 and merge at least two data sets to create the one you are going to work on. In addition to the R codes and outputs, you need to explain the steps that you have taken.
5. Understand [Plain text & R code & Output]: Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.
6. Tidy & Manipulate Data I [Plain text & R code & Output]: Explain why your data (or one of the data sets) doesn’t conform to the tidy data principles (minimum requirement #5). Apply the required steps to reshape the data into a tidy format. In addition to the R codes and outputs, explain everything that you do in this step.
7. Tidy & Manipulate Data II [Plain text & R code & Output]: Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R codes and outputs, explain everything that you do in this step.
8. Scan I [Plain text & R code & Output]: Scan the data for missing values, special values and obvious errors (i.e. inconsistencies). In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
9. Scan II [Plain text & R code & Output]: Scan the numeric data for outliers. In this step, you should fulfil the minimum requirement #8. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
10. Transform [Plain text & R code & Output]: Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.
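For the Transform step, one common option is a log transformation of a right-skewed variable. The sketch below is only an illustration under stated assumptions: the vector `pax` and its values are invented, and the helper `skewness()` is a hand-written moment-based estimate defined here so the example stays within base R.

```r
# Hypothetical right-skewed variable (values invented for illustration)
pax <- c(120, 150, 180, 200, 260, 400, 900, 2500, 12000)

# Simple moment-based skewness, written out to stay within base R
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Base-10 log transformation to pull in the long right tail
log_pax <- log10(pax)

skew_before <- skewness(pax)
skew_after <- skewness(log_pax)
cat("skewness before:", round(skew_before, 2),
    " after:", round(skew_after, 2), "\n")

# Histograms before and after give a quick visual check
# hist(pax); hist(log_pax)
```

A skewness closer to zero after the transformation suggests the distribution has become more symmetric, which is purpose iii) in requirement 9. A log transformation only applies to strictly positive values, so zeros would need separate handling.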
NOTE:
Note that sometimes the order of the tasks may be different than the order given here. For example, you may need to tidy the data sets first to be able to create the common key to merge. Therefore, for such cases you may have a different ordering of the sections.
Any further or optional pre-processing tasks can be added to the template using an additional section in the R Markdown file. Make sure your code is visible (within the margin of the page). Do not use View() to show your data, instead give headers (using head()).
Submission Steps:
1. Create the report using R Markdown
The Assignment 2 report must be completed using the R Markdown template provided here:
R Markdown Template - Assignment 2
In the report, all R chunks and outputs need to be visible. Failure to do so will result in a loss of marks.
2. Submit your Report in Canvas
Upload the report as a PDF file via the File Upload tab under the Assignment 2 page in CANVAS (see below):
The easiest way to produce a PDF file from the RMarkdown is to Run all R chunks, then Preview your notebook in HTML (by clicking Preview) → Open in Browser (Chrome) → Right click on the report in Chrome → Click Print and Select the Destination Option to Save as PDF.
3. Publish your Report to RPubs and submit the link via Canvas
Publish your report to RPubs (see here) and enter your report’s RPubs URL into the Website URL tab under Assignment 2 RPubs Link Submission page in Canvas (see below) and submit this too.
This online version of the report will be used for marking. Failure to submit your link will delay your feedback and risk late penalties.
Referencing guidelines
You must acknowledge all the sources of information you have used in your assessments. Refer to the RMIT Easy Cite referencing tool to see examples and tips on how to reference in the appropriate style. You can also refer to the library referencing page for more tools such as EndNote, referencing tutorials and referencing guides for printing. Use the RMIT Harvard referencing method for this assessment.
Assessment Declaration
When you submit work electronically, you agree to the Assessment Declaration.
Collaboration
You are permitted to discuss and collaborate on the assignment with your classmates. However, the write-up of the report must be an individual effort. Assignments will be submitted through Turnitin, so if you’ve copied from a fellow classmate, it will be detected. It is your responsibility to ensure you do not copy or do not allow another classmate to copy your work. If plagiarism is detected, both the copier and the student copied from will be responsible. It is good practice to never share assignment files with other students. You should ensure you understand your responsibilities by reading the RMIT University website on academic integrity. Ignorance is no excuse.
Academic integrity and plagiarism
Academic integrity is about honest presentation of your academic work. It means acknowledging the work of others while developing your own insights, knowledge and ideas. You should take extreme care that you have:
· acknowledged words, data, diagrams, models, frameworks and/or ideas of others you have quoted (i.e. directly copied), summarised, paraphrased, discussed or mentioned in your assessment through the appropriate referencing methods
· provided a reference list of the publication details so your reader can locate the source if necessary. This includes material taken from internet sites.
If you do not acknowledge the sources of your material, you may be accused of plagiarism because you have passed off the work and ideas of another person, without appropriate referencing, as if they were your own.
RMIT University treats plagiarism as a very serious offence constituting misconduct. Plagiarism covers a variety of inappropriate behaviours, including:
· failure to properly document a source
· copyright material from the internet or databases
· collusion between students
For further information on our policies and procedures, please refer to the University website.
Extensions and Special Consideration
This course follows the RMIT University Assessment policy for extensions and special consideration. Information is available here. Ensure you understand these guidelines before applying.
Extensions will only be granted in accordance with the RMIT University Extension and Special Consideration Policy. No exceptions. Assignments submitted late will be penalised (see below for further details).
Late Submission of Assessment
Late submissions, without an approved extension or special consideration, will incur a late penalty for up to 5 business days late (so the maximum late penalty is 50%). Submissions more than 5 days late are not accepted.
Overdue              Penalty
≤ 1 business day     -10%
≤ 2 business days    -20%
≤ 3 business days    -30%
≤ 4 business days    -40%
≤ 5 business days    -50%
Assignment 2 Marking Rubric
Each criterion is listed with its marks and its Not acceptable, Needs Improvement and Excellent descriptors.

Executive Summary (5)
Not acceptable: No executive summary was provided.
Needs Improvement: The executive summary was provided but there was room for improvement.
Excellent: A complete summary of the data preprocessing tasks was provided.

Data (10)
Not acceptable: No data source was given or the data didn’t meet the minimum requirement #1, or the attempt to read/import/merge data sets was unsuccessful.
Needs Improvement: The data source was given but it was described poorly, or variable descriptions were missing, or the attempt to read/import/merge data sets was successful but there was room for improvement.
Excellent:
1. Complete & clear description of
● data sets,
● their sources,
● variable descriptions
was provided.
2. Data met the minimum requirement #1.
3. Merging of data sets was correct.
4. R codes with outputs (head of data) were provided.
5. Brief explanations of steps were given.

Understand (15)
Not acceptable: There was no attempt to inspect the data and the variables in the data set and unable to meet the minimum requirements #2-4.
Needs Improvement: There was an attempt to inspect the data and variables but it didn’t meet the minimum requirements #2-4, or there was room for improvement.
Excellent:
1. Complete inspection of
● data structure,
● variable types,
was done.
2. Attributes were checked & proper data type conversions were applied.
3. Inspection met the minimum requirements #2-4.
4. R codes with outputs were provided.
5. Brief explanations of steps were given.

Tidy & Manipulate I (15)
Not acceptable: Unable to reflect on tidy data principles (minimum requirement #5). The data / data sets chosen were tidy and no or wrong explanation was provided.
Needs Improvement: The data / data sets chosen were untidy but a clear explanation was not provided on why they are untidy, OR there was an attempt to tidy/manipulate the data but it wasn’t aligned with the tidy data principles.
Excellent:
1. Able to reflect on the tidy data principles.
2. Clear explanation was provided.
3. Complete set of tasks was provided to tidy and manipulate the data properly.
4. R codes with outputs were provided.
5. Brief explanations of steps were given.

Tidy & Manipulate II (5)
Not acceptable: Unable to create/mutate at least one variable from the existing variables (minimum requirement #6).
Needs Improvement: Able to create/mutate at least one variable from the existing variables but there was room for improvement or it was poorly described.
Excellent:
1. Able to create/mutate at least one variable from the existing variables, fulfilling the minimum requirement #6.
2. R codes with outputs were provided.
3. Brief explanations of steps were given.

Scan I (20)
Not acceptable: Unable to scan for and deal with missing values, special values and obvious errors (minimum requirement #7). Some scripts were provided in an attempt to scan the data, however no methodology/actions were taken to handle those values.
Needs Improvement: Able to scan the data for missing values, special values and obvious errors (minimum requirement #7), but the task needed improvements in the methodology. For example:
- A methodology was applied to scan missing values, special values and obvious errors (minimum requirement #7), however there was no attempt to check whether this approach can be safely applied, OR
- A methodology was applied to scan missing values, special values and obvious errors (minimum requirement #7), however the approach taken was not suitable/safe to apply, OR
- The methodology was not explained enough, the results and outputs weren't presented in a clearer way.
Excellent:
1. Complete set of tasks was provided to scan the data for missing values, special values and obvious errors (minimum requirement #7).
2. Safe and suitable methodology was followed to scan and deal with missing values, special values and obvious errors.
3. Methodology taken was explained thoroughly.
4. R codes were provided.
5. Results and outputs were presented clearly.

Scan II (20)
Not acceptable: Unable to scan the data for outliers (minimum requirement #8). Some scripts were provided in an attempt to scan the data, however no methodology/actions were taken to handle those values.
Needs Improvement: Able to scan the data for outliers, but the task needed improvements in the methodology. For example:
- A methodology was applied to scan and deal with outliers however there was no attempt to check whether this approach can be safely applied, OR
- A methodology was applied to scan and deal with outliers however the approach taken was not suitable/safe to apply.
- The methodology was not explained enough, the results and outputs weren't presented in a clearer way.
Excellent:
1. Complete set of tasks was provided to scan the data for outliers.
2. Safe and suitable methodology was followed to scan and deal with outliers.
3. Methodology taken was explained thoroughly.
4. R codes were provided.
5. Results and outputs were presented clearly.

Transform (5)
Not acceptable: Unable to apply an appropriate transformation for at least one of the variables (minimum requirement #9).
Needs Improvement: There was an attempt to apply a transformation to the data but it was poorly described OR there was room for improvement.
Excellent:
1. Complete set of tasks was provided to apply the transformation properly, fulfilling requirement #9.
2. R codes with outputs were provided.
3. Brief explanations of steps were given.

Succinct (5)
Not acceptable: The report was too long and/or lacked clarity.
Needs Improvement: The report could be written more succinctly. There was unnecessary detail that distracted from the main findings (like outputs were too long or
Answered Same Day, Oct 16, 2021, MATH2349

Naveen answered on Oct 19 2021
MATH2349 Data Wrangling Assignment 2
Marco lo
2 Required Packages
# This is the R chunk for the required packages
# Installing packages
# install.packages("readr")
# install.packages("magrittr")
# install.packages("tidyr")
# install.packages("Hmisc")
# install.packages("dplyr")
# install.packages("outliers")
# install.packages("lubridate")
# install.packages("forecast")
# Loading required packages
library(readr)
library(magrittr)
library(tidyr)
##
## Attaching package: ’tidyr’
## The following object is masked from ’package:magrittr’:
##
## extract
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: ’Hmisc’
## The following objects are masked from ’package:base’:
##
## format.pval, units
library(dplyr)
##
## Attaching package: ’dplyr’
## The following objects are masked from ’package:Hmisc’:
##
## src, summarize
## The following objects are masked from ’package:stats’:
##
## filter, lag
## The following objects are masked from ’package:base’:
##
## intersect, setdiff, setequal, union
library(outliers)
## Warning: package ’outliers’ was built under R version 4.0.3
library(lubridate)
##
## Attaching package: ’lubridate’
## The following objects are masked from ’package:base’:
##
## date, intersect, setdiff, union
library(forecast)
## Registered S3 method overwritten by ’quantmod’:
## method from
## as.zoo.data.frame zoo
3 Executive Summary
Preprocessing is one of the most important steps before analysis. On the given data I carried out the following preprocessing steps:
• Two data sets were provided, so I merged them into a single data set.
• I then inspected each variable to judge whether the data was ready for analysis or needed tidying or other steps to get it into a convenient form.
• Next I searched for missing values and special values. Missing values were replaced with the mean of the variable, and special values were replaced with zero for convenience.
• I then looked for outliers using a boxplot and replaced them with their nearest values.
• Finally, I checked whether the data follows a normal distribution. It does not, so I transformed the data.
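The missing-value, special-value and outlier steps summarised above could look roughly like the base R sketch below. It is an illustration only: the vector `x` and its values are invented, and "replaced outliers with their nearest values" is interpreted here as capping at the boxplot whisker ends returned by `boxplot.stats()`, which is an assumption about what the report did.

```r
# Hypothetical numeric column with a missing value, a special value (Inf)
# and one extreme value (values invented for illustration)
x <- c(10, 12, NA, 11, Inf, 13, 95, 12)

# 1. Replace missing values with the mean of the finite observations
#    (mean() over Inf would itself be Inf, so restrict to finite values)
x[is.na(x) & !is.nan(x)] <- mean(x[is.finite(x)])

# 2. Replace special values (Inf, -Inf, NaN) with zero, as in the summary
x[is.infinite(x) | is.nan(x)] <- 0

# 3. Cap outliers at the boxplot whisker ends, i.e. the most extreme
#    observations that are not flagged by the 1.5 * IQR rule
whiskers <- boxplot.stats(x)$stats[c(1, 5)]
x_capped <- pmin(pmax(x, whiskers[1]), whiskers[2])

x_capped
```

Mean imputation and replacing special values with zero both change the distribution of the variable, so it is worth comparing summaries before and after to confirm the approach is reasonable for the data at hand.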
4 Data
The data was taken from the following site:
(https://data.gov.au/dataset/ds-dga-cc5d888f-5850-47f3-815d-08289b22f5a8/details).
The two data sets of interest are Airport Passenger Movements by Month - 20 major airports (CSV), named “mon_pax”, and Airport Aircraft Movements by Month - 20 airports (CSV), named “mon_acm”. Each of the two data sets contains 8967 records with 12 variables.
In mon_pax data variables are as follows:
• AIRPORT: Australian airport
• Year: Year of the data
• Month: Month of the data
• Dom_Pax_In: Domestic flight Passengers Arrival
• Dom_Pax_Out: Domestic flight Passengers Departure
• Dom_Pax_Total: Total Domestic flight Passengers
• Int_Pax_In: International flight Passengers Arrival
• Int_Pax_Out: International flight Passengers Departure
• Int_Pax_Total: Total International flight Passengers
• Pax_In: Passengers Arrival
• Pax_Out: Passengers Departure
• Pax_Total: Total Passengers
In mon_acm data variables are as follows:
• AIRPORT: Australian airport
• Year: Year of the data
• Month: Month of the data
• Dom_Acm_In: Domestic Arrival Aircraft
• Dom_Acm_Out: Domestic Departure Aircraft
• Dom_Acm_Total: Total Domestic Aircraft
• Int_Acm_In: International Arrival Aircraft
• Int_Acm_Out: International Departure Aircraft
• Int_Acm_Total: Total International Aircraft
• Acm_In: Inbound or Arrival Aircraft
• Acm_Out: Outbound or Departure Aircraft
• Acm_Total: Total Aircraft
# Reading datasets
mon_pax <- read.csv("mon_pax_web.csv")
mon_acm <- read.csv("mon_acm_web.csv")
# Showing six records from each data set, starting at row 7
head(mon_pax[7:nrow(mon_pax),])
## AIRPORT Year Month Dom_Pax_In Dom_Pax_Out Dom_Pax_Total Int_Pax_In
## 7 CANBERRA 1985 1 33809 30739 64548 0
## 8 DARWIN 1985 1 17450 14328 31778 2942
## 9 GOLD COAST 1985 1 35352 44203 79555 0
## 10 HAMILTON ISLAND 1985 1 2106 2398 4504 0
## 11 HOBART 1985 1 25517 32140 57657 903
## 12 KARRATHA 1985 1 8030 7402 15432 0
## Int_Pax_Out Int_Pax_Total Pax_In Pax_Out Pax_Total
## 7 0 0 33809 30739 64548
## 8 1837 4779 20392 16165 36557
## 9 0 0 35352 44203 79555
## 10 0 0 2106 2398 4504
## 11 886 1789 26420 33026 59446
## 12 0 0 8030 7402 15432
head(mon_acm[7:nrow(mon_acm),])
## AIRPORT Year Month Dom_Acm_In Dom_Acm_Out Dom_Acm_Total Int_Acm_In
## 7 CANBERRA 1985 1 690 684 1374 0
## 8 DARWIN 1985 1 474 470 944 30
## 9 GOLD COAST 1985 1 710 718 1428 0
## 10 HAMILTON ISLAND 1985 1 53 53 106 0
## 11 HOBART 1985 1 467 473 940 8
## 12 KARRATHA 1985 1 258 286 544 0
## Int_Acm_Out Int_Acm_Total Acm_In Acm_Out Acm_Total
## 7 0 0 690 684 1374
## 8 29 59 504 499 1003
## 9 0 0 710 718 1428
## 10 0 0 53 53 106
## 11 8 16 475 481 956
## 12 0 0 258 286 544
# Merge the two data sets into one flights table
flightss <- left_join(mon_pax, mon_acm,
                      by = c("AIRPORT" = "AIRPORT",
                             "Year" = "Year",
                             "Month" = "Month"))
# Showing six records from the merged data, starting at row 7
head(flightss[7:nrow(flightss),])
## AIRPORT Year Month Dom_Pax_In Dom_Pax_Out Dom_Pax_Total Int_Pax_In
## 7 CANBERRA 1985 1 33809 30739 64548 0
## 8 DARWIN 1985 1 17450 14328 31778 2942
## 9 GOLD COAST 1985 1 35352 44203 79555 0
## 10 HAMILTON ISLAND 1985 1 2106 2398 4504 0
## 11 HOBART 1985 1 25517 32140 57657 903
## 12 KARRATHA 1985 1 8030 7402 15432 0
## Int_Pax_Out Int_Pax_Total Pax_In Pax_Out Pax_Total Dom_Acm_In Dom_Acm_Out
## 7 0 0 33809 30739 64548 690 684
## 8 1837 4779 20392 16165 36557 474 470
## 9 0 0 35352 44203 79555 710 718
## 10 0 0 2106 2398 4504 53 53
## 11 886 1789 26420 33026 59446 467 473
## 12 0 0 8030 7402 15432 258 286
## Dom_Acm_Total Int_Acm_In Int_Acm_Out Int_Acm_Total Acm_In Acm_Out Acm_Total
## 7 1374 0 0 0 690 684 1374
## 8 944 30 29 59 504 499 1003
## 9 1428 0 0 0 710 718 1428
## 10 106 0 0 0 53 53 106
## 11 940 8 8 16 475 481 956
## 12 544 0 0 0 258 286 544
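A left join keeps every row of the left table and fills in NA where the right table has no match, so it can be useful to check for unmatched keys after merging. The sketch below is a hedged illustration on toy stand-ins: the data frames `pax` and `acm` reuse a few totals from the output above, and the HOBART row is deliberately left out of `acm` to show what an unmatched key looks like. It is not a claim about the real files.

```r
library(dplyr)

# Toy stand-ins for mon_pax and mon_acm (HOBART deliberately missing from acm)
pax <- data.frame(AIRPORT = c("CANBERRA", "DARWIN", "HOBART"),
                  Year = 1985L, Month = 1L,
                  Pax_Total = c(64548, 36557, 59446))
acm <- data.frame(AIRPORT = c("CANBERRA", "DARWIN"),
                  Year = 1985L, Month = 1L,
                  Acm_Total = c(1374, 1003))

keys <- c("AIRPORT", "Year", "Month")

# Same kind of merge as in the report, on the shared key columns
joined <- left_join(pax, acm, by = keys)

# Rows of pax with no partner in acm; these get NA in the joined columns
unmatched <- anti_join(pax, acm, by = keys)

# With unique keys in the right table, the row count must not change
stopifnot(nrow(joined) == nrow(pax))

unmatched
```

If `unmatched` has zero rows for the real data, every airport-month in the passenger table found its aircraft-movement record.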
5 Understand
• Here I checked the data structure and the data type of each variable.
• The Month variable is replaced with the month name.
• The Month variable is then of character type, so I converted it to a factor.
• After the conversion I checked that it had been applied correctly, including its levels.
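The month conversion described in these bullets might be sketched as follows in base R. The vector `month_num` is a hypothetical stand-in for the Month column, and using an ordered factor with calendar-order levels is an assumption about the intended result rather than the report's confirmed code.

```r
# Hypothetical month numbers standing in for the Month column
month_num <- c(1, 1, 2, 12, 7)

# month.name is a base R constant holding the twelve English month names
month_chr <- month.name[month_num]

# Convert character to factor; explicit levels keep calendar order
# instead of the default alphabetical order
month_fct <- factor(month_chr, levels = month.name, ordered = TRUE)

# Check that the conversion worked and inspect the levels
class(month_fct)
levels(month_fct)
```

Passing `levels = month.name` matters because a plain `factor(month_chr)` would sort the levels alphabetically (April, August, ...), which is rarely what a month variable needs.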
# Getting attributes of the dataset
Att_flightss <- attributes(flightss)
# Print column(Features) names
Att_flightss$names
## [1] "AIRPORT" "Year" "Month" "Dom_Pax_In"
## [5] "Dom_Pax_Out" "Dom_Pax_Total" "Int_Pax_In" "Int_Pax_Out"
## [9] "Int_Pax_Total" "Pax_In" "Pax_Out" "Pax_Total"
## [13] "Dom_Acm_In" "Dom_Acm_Out" "Dom_Acm_Total" "Int_Acm_In"
## [17] "Int_Acm_Out" "Int_Acm_Total" "Acm_In" "Acm_Out"
## [21] "Acm_Total"
# Print class of the dataset
Att_flightss$class
## [1] "data.frame"
# Printing structure of the dataset
str(flightss[20:nrow(flightss),])
## ’data.frame’: 8948 obs. of 21 variables:
## $ AIRPORT : chr "TOWNSVILLE" "NEWCASTLE" "ADELAIDE" "ALICE SPRINGS" ...
## $ Year : int 1985 1985 1985 1985 1985 1985 1985 1985 1985 1985 ...
## $ Month : int 1 1 2 2 2 2 2 2 2 2 ...
## $ Dom_Pax_In : int 39736 6739 67452 14526 957067 0 81160 12977 38132 9983 ...
## $ Dom_Pax_Out : int 38836 6742 64219 14432 957067 0 83347 12761 37396 10060 ...
## $ Dom_Pax_Total: int 78572 13481 131671 28958 1914134 0 164507 25738 75528 20043 ...
## $ Int_Pax_In : int 1358 0 4260 0 204953 0 15480 1023 0 1646 ...
## $ Int_Pax_Out : int 990 0 3555 0 179222 0 15964 1238 0 1391 ...
## $ Int_Pax_Total: int 2348 0 7815 0 384175 0 31444 2261 0 3037 ...
## $ Pax_In : int 41094 6739 71712 14526 1162020 0 96640 14000 38132 11629 ...
## $ Pax_Out : int 39826 6742 67774 14432 1136289 0 99311 13999 37396 11451 ...
## $ Pax_Total : int 80920 13481 139486 28958 2298309 0...