Overview
This assignment requires you to find some open data and use the knowledge and skills gained during the course to preprocess the data. You will create a report using R Markdown to explain the steps you took to perform the data preprocessing tasks. You will also publish this report online (in RPubs), which will give you the opportunity to build your data analytics portfolio. This is a great way of showing potential employers what you are capable of. The more clearly you demonstrate your skills, the more marks you will be awarded.
Assignment Data Sources
Assignment 2 is open-ended; however, you are required to find suitable datasets that fulfil the minimum requirements given below. All of the datasets that you use in this assignment must be open and ideally have a Creative Commons Licence. This will ensure you can share your work with anyone provided you make proper attribution. If you’re not sure if data is Open, contact the provider, read the documentation or post on the discussion board and I will investigate. Some open data sources are provided below, but I encourage you to find others:
o https://www.kaggle.com
o UCI Machine Learning Repository
o data.gov
o World Bank
o Amazon Web Services
o Google data sets
o YouTube video data sets
o Analytics Vidhya
o Quandl
o DrivenData
o http://www.abs.gov.au/
o https://www.data.vic.gov.au/
o http://www.bom.gov.au/
o https://relational.fit.cvut.cz
Minimum Requirements for the Data sets
Considering this is a data preprocessing class, I do expect your data set to have certain requirements so that you can demonstrate your knowledge of data preprocessing. The following are the minimum requirements for the data sets that I will look for:
1. At least two data sets should be merged to create your assignment data (for example you can take crime statistics for the cities/states in Australia and merge this data set with cities/states’ per capita income data).
2. Your data set should include multiple data types (numerics, characters, factors, etc).
3. Your data set should include variables suitable for data type conversions so that you should be able to apply the required data type conversions (e.g., character -> factor, character -> date, numeric -> factor, etc. conversions).
4. Your data set should include at least one factor variable that needs to be labelled and/or ordered.
5. At least one of the data sets that you use should be Untidy. You need to explain why the data set or data sets you used is/are Untidy. Then you need to apply the required steps to reshape your data into a tidy format.
6. At least one variable needs to be created/mutated from the existing ones (e.g. the data may contain income and expense variables and you may create a savings variable out of the income and expense variables).
7. You are expected to scan all variables for missing values, special values and obvious errors (i.e. inconsistencies). If there are missing values, use any of the suitable techniques outlined in Module 5 to deal with them, reason and document your approach properly. If there are no missing values in the data, then scan all variables for any special values and obvious errors, use any of the suitable techniques outlined in Module 5 to deal with them, reason and document your approach properly.
8. You are expected to scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques outlined in Module 6 to deal with them, reason and document your approach properly.
9. You are expected to apply data transformations on at least one of the variables. The purpose of this transformation should be one of the following reasons: i) to change the scale for better understanding of the variable, ii) to convert a non-linear relation into linear one, or iii) to decrease the skewness and convert the distribution into a normal distribution.
10. You are expected to use only readr, xlsx, readxl, foreign, gdata, rvest, dplyr, tidyr, deductive, deducorrect, editrules, validate, Hmisc, forecast, stringr, lubridate, car, outliers, MVN, infotheo, MASS, caret, MLR, ggplot2, knitr and base R functions for this section. You can also use your own functions. This will show your accumulated knowledge that you gained throughout the semester in this course.
Optional things that you can do to preprocess data:
● You can subset your data by selecting variables and/or filtering in (or out) cases. Please don’t forget to put an explanation in your report if you do so.
● Your data set can include date or string information or both. If this is the case, I expect you to apply required date conversions for dates and string manipulations for strings as required.
● Depending on your level of knowledge gained in other courses (i.e. Applied Analytics and/or Machine Learning, etc) you may apply data normalisation, feature selection and feature extraction. Note that, this is an optional task and you don’t have to apply any of these techniques if you don’t know the theory and the fundamentals.
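As a rough illustration of minimum requirements 1 and 3 to 6, the following is a minimal sketch using dplyr and tidyr. The data frames `crime_wide` and `income`, their column names and all values are made up for illustration only; your own data sets, merge keys and conversions will differ.

```r
library(dplyr)
library(tidyr)

# Hypothetical untidy data set: the year values appear as column names
crime_wide <- data.frame(
  state  = c("VIC", "NSW", "QLD"),
  `2019` = c(120, 150, 90),
  `2020` = c(110, 160, 95),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical second data set, sharing the key "state"
income <- data.frame(
  state       = c("VIC", "NSW", "QLD"),
  income_band = c("medium", "high", "low"),
  income      = c(52000, 58000, 48000),
  expense     = c(41000, 47000, 40000),
  stringsAsFactors = FALSE
)

# Requirement 5: reshape so that each variable has its own column
crime_tidy <- crime_wide %>%
  pivot_longer(cols = c(`2019`, `2020`),
               names_to = "year", values_to = "offences")

# Requirement 1: merge the two data sets on the common key
merged <- crime_tidy %>%
  left_join(income, by = "state")

# Requirements 3, 4 and 6: type conversions, an ordered factor, a new variable
merged <- merged %>%
  mutate(
    year        = as.integer(year),                    # character -> integer
    state       = factor(state),                       # character -> factor
    income_band = factor(income_band,
                         levels = c("low", "medium", "high"),
                         ordered = TRUE),              # ordered factor
    savings     = income - expense                     # created variable
  )

str(merged)
head(merged)
```

In a report, each of these steps would sit in its own section with a short explanation of why it was needed.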
Report Section Details
1. Report title and student details [Plain text]: You can add the title of your report and student details by updating the “title” and “author” entries at the top of the R Markdown Template.
2. Required packages [R code]: Provide the packages required to reproduce the report. Make sure you fulfil minimum requirement #10.
3. Executive Summary [Plain text]: In your own words, provide a brief summary of the preprocessing. Explain the steps that you have taken to preprocess your data. Write this section last, after you have performed all data preprocessing. (Word count Max: 300 words).
4. Data [Plain text & R code & Output]: A clear description of data sets, their sources, and variable descriptions should be provided. In this section, you must also provide the R code with outputs (e.g. head of data sets) that you used to import/read/scrape the data set. You need to fulfil minimum requirement #1 and merge at least two data sets to create the one you are going to work on. In addition to the R code and outputs, you need to explain the steps that you have taken.
5. Understand [Plain text & R code & Output]: Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R code and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.
6. Tidy & Manipulate Data I [Plain text & R code & Output]: Explain why your data (or one of the data sets) doesn’t conform to the tidy data principles (minimum requirement #5). Apply the required steps to reshape the data into a tidy format. In addition to the R code and outputs, explain everything that you do in this step.
7. Tidy & Manipulate Data II [Plain text & R code & Output]: Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R code and outputs, explain everything that you do in this step.
8. Scan I [Plain text & R code & Output]: Scan the data for missing values, special values and obvious errors (i.e. inconsistencies). In this step, you should fulfil minimum requirement #7. In addition to the R code and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
9. Scan II [Plain text & R code & Output]: Scan the numeric data for outliers. In this step, you should fulfil minimum requirement #8. In addition to the R code and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
10. Transform [Plain text & R code & Output]: Apply an appropriate transformation for at least one of the variables. In addition to the R code and outputs, explain everything that you do in this step. In this step, you should fulfil minimum requirement #9.
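The Scan I, Scan II and Transform sections could look roughly like the following base R sketch. The data frame `df` and its values are hypothetical, and median imputation, Tukey's 1.5 × IQR fences, capping and a log transformation are assumptions chosen only for illustration. They are not necessarily the techniques from Modules 5 and 6 that suit your data, so justify whichever approach you actually use.

```r
# Hypothetical data with one missing value, one special value (Inf)
# and one extreme value
df <- data.frame(
  id     = 1:8,
  income = c(48, 52, NA, 50, 51, 49, 300, 53),
  ratio  = c(0.5, 0.4, 0.6, Inf, 0.5, 0.45, 0.55, 0.5)
)

# Scan I: count missing and special values in every column
colSums(is.na(df))
sapply(df, function(x) sum(is.infinite(x) | is.nan(x)))

# One possible treatment: recode special values as NA,
# then impute each numeric variable with its median
df$ratio[is.infinite(df$ratio)] <- NA
df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)
df$ratio[is.na(df$ratio)]   <- median(df$ratio, na.rm = TRUE)

# Scan II: flag univariate outliers with Tukey's fences
q     <- unname(quantile(df$income, c(0.25, 0.75)))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
is_outlier <- df$income < lower | df$income > upper
df[is_outlier, ]

# One possible treatment: cap values at the fences
df$income_capped <- pmin(pmax(df$income, lower), upper)

# Transform: a log transformation to reduce right skew
df$log_income <- log(df$income)

head(df)
```

Whether imputing, capping or removing values is safe depends on the data, which is why the rubric asks you to check and explain the suitability of the method.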
NOTE:
Sometimes the order of the tasks may differ from the order given here. For example, you may need to tidy the data sets first to be able to create the common key to merge. In such cases, you may order the sections differently.
Any further or optional pre-processing tasks can be added to the template using an additional section in the R Markdown file. Make sure your code is visible (within the margin of the page). Do not use View() to show your data; instead, show the first few rows (using head()).
Submission Steps:
1. Create the report using R Markdown
The Assignment 2 report must be completed using the R Markdown template provided here:
R Markdown Template - Assignment 2
In the report, all R chunks and outputs need to be visible. Failure to do so will result in a loss of marks.
2. Submit your Report in Canvas
Upload the report as a PDF file via the File Upload tab under the Assignment 2 page in Canvas (see below):
The easiest way to produce a PDF file from the R Markdown is to Run all R chunks, then Preview your notebook in HTML (by clicking Preview) → Open in Browser (Chrome) → Right click on the report in Chrome → Click Print and Select the Destination Option to Save as PDF.
3. Publish your Report to RPubs and submit the link via Canvas
Publish your report to RPubs (see here) and enter your report’s RPubs URL into the Website URL tab under Assignment 2 RPubs Link Submission page in Canvas (see below) and submit this too.
This online version of the report will be used for marking. Failure to submit your link willdelay your feedback and risk late penalties.
Referencing guidelines
You must acknowledge all the sources of information you have used in your assessments. Refer to the RMIT Easy Cite referencing tool to see examples and tips on how to reference in the appropriate style. You can also refer to the library referencing page for more tools such as EndNote, referencing tutorials and referencing guides for printing. Use the RMIT Harvard referencing method for this assessment.
Assessment Declaration
When you submit work electronically, you agree to the Assessment Declaration.
Collaboration
You are permitted to discuss and collaborate on the assignment with your classmates. However, the write-up of the report must be an individual effort. Assignments will be submitted through Turnitin, so if you’ve copied from a fellow classmate, it will be detected. It is your responsibility to ensure you do not copy or do not allow another classmate to copy your work. If plagiarism is detected, both the copier and the student copied from will be responsible. It is good practice to never share assignment files with other students. You should ensure you understand your responsibilities by reading the RMIT University website on academic integrity. Ignorance is no excuse.
Academic integrity and plagiarism
Academic integrity is about honest presentation of your academic work. It means acknowledging the work of others while developing your own insights, knowledge and ideas.
You should take extreme care that you have:
· acknowledged words, data, diagrams, models, frameworks and/or ideas of others you have quoted (i.e. directly copied), summarised, paraphrased, discussed or mentioned in your assessment through the appropriate referencing methods
· provided a reference list of the publication details so your reader can locate the source if necessary. This includes material taken from internet sites.
If you do not acknowledge the sources of your material, you may be accused of plagiarism because you have passed off the work and ideas of another person, without appropriate referencing, as if they were your own.
RMIT University treats plagiarism as a very serious offence constituting misconduct. Plagiarism covers a variety of inappropriate behaviours, including:
· failure to properly document a source
· copyright material from the internet or databases
· collusion between students
For further information on our policies and procedures, please refer to the University website.
Extensions and Special Consideration
This course follows the RMIT University Assessment policy for extensions and special consideration. Information is available here. Ensure you understand these guidelines before applying.
Extensions will only be granted in accordance with the RMIT University Extension and Special Consideration Policy. No exceptions. Assignments submitted late will be penalised (see below for further details).
Late Submission of Assessment
Late submissions, without an approved extension or special consideration, will incur a late penalty for up to 5 business days late (so the maximum late penalty is 50%). Submissions more than 5 days late are not accepted.
Overdue    Penalty
≤ 1 business day    -10%
≤ 2 business days    -20%
≤ 3 business days    -30%
≤ 4 business days    -40%
≤ 5 business days    -50%
Assignment 2 Marking Rubric
Criteria    Not acceptable    Needs Improvement    Excellent
Executive Summary (5)
Not acceptable: No executive summary was provided.
Needs Improvement: The executive summary was provided but there was room for improvement.
Excellent: A complete summary of the data preprocessing tasks was provided.
Data (10)
Not acceptable: No data source was given, or the data didn’t meet minimum requirement #1, or the attempt to read/import/merge data sets was unsuccessful.
Needs Improvement: The data source was given but it was described poorly, or variable descriptions were missing, or the attempt to read/import/merge data sets was successful but there was room for improvement.
Excellent:
1. Complete & clear description of
● data sets,
● their sources,
● variable descriptions
was provided.
2. Data met minimum requirement #1.
3. Merging of data sets was correct.
4. R code with outputs (head of data) was provided.
5. Brief explanations of steps were given.
Understand (15)
Not acceptable: There was no attempt to inspect the data and the variables in the data set, and minimum requirements #2-4 were not met.
Needs Improvement: There was an attempt to inspect the data and variables but it didn’t meet minimum requirements #2-4, or there was room for improvement.
Excellent:
1. Complete inspection of
● data structure,
● variable types
was done.
2. Attributes were checked & proper data type conversions were applied.
3. Inspection met minimum requirements #2-4.
4. R code with outputs was provided.
5. Brief explanations of steps were given.
Tidy & Manipulate I (15)
Not acceptable: Unable to reflect on tidy data principles (minimum requirement #5). The data / data sets chosen were tidy and no or wrong explanation was provided.
Needs Improvement: The data / data sets chosen were untidy but a clear explanation was not provided on why they are untidy, OR there was an attempt to tidy/manipulate the data but it wasn’t aligned with the tidy data principles.
Excellent:
1. Able to reflect on the tidy data principles.
2. Clear explanation was provided.
3. A complete set of tasks was provided to tidy and manipulate the data properly.
4. R code with outputs was provided.
5. Brief explanations of steps were given.
Tidy & Manipulate II (5)
Not acceptable: Unable to create/mutate at least one variable from the existing variables (minimum requirement #6).
Needs Improvement: Able to create/mutate at least one variable from the existing variables but there was room for improvement or it was poorly described.
Excellent:
1. Able to create/mutate at least one variable from the existing variables, fulfilling minimum requirement #6.
2. R code with outputs was provided.
3. Brief explanations of steps were given.
Scan I (20)
Not acceptable: Unable to scan for and deal with missing values, special values and obvious errors (minimum requirement #7). Some scripts were provided in an attempt to scan the data, however no methodology/actions were taken to handle those values.
Needs Improvement: Able to scan the data for missing values, special values and obvious errors (minimum requirement #7), but the task needed improvements in the methodology. For example:
- A methodology was applied to scan missing values, special values and obvious errors (minimum requirement #7), however there was no attempt to check whether this approach can be safely applied, OR
- A methodology was applied to scan missing values, special values and obvious errors (minimum requirement #7), however the approach taken was not suitable/safe to apply, OR
- The methodology was not explained enough, and the results and outputs weren't presented clearly.
Excellent:
1. A complete set of tasks was provided to scan the data for missing values, special values and obvious errors (minimum requirement #7).
2. A safe and suitable methodology was followed to scan and deal with missing values, special values and obvious errors.
3. The methodology taken was explained thoroughly.
4. R code was provided.
5. Results and outputs were presented clearly.
Scan II (20)
Not acceptable: Unable to scan the data for outliers (minimum requirement #8). Some scripts were provided in an attempt to scan the data, however no methodology/actions were taken to handle those values.
Needs Improvement: Able to scan the data for outliers, but the task needed improvements in the methodology. For example:
- A methodology was applied to scan and deal with outliers, however there was no attempt to check whether this approach can be safely applied, OR
- A methodology was applied to scan and deal with outliers, however the approach taken was not suitable/safe to apply, OR
- The methodology was not explained enough, and the results and outputs weren't presented clearly.
Excellent:
1. A complete set of tasks was provided to scan the data for outliers.
2. A safe and suitable methodology was followed to scan and deal with outliers.
3. The methodology taken was explained thoroughly.
4. R code was provided.
5. Results and outputs were presented clearly.
Transform (5)
Not acceptable: Unable to apply an appropriate transformation for at least one of the variables (minimum requirement #9).
Needs Improvement: There was an attempt to apply a transformation to the data but it was poorly described OR there was room for improvement.
Excellent:
1. A complete set of tasks was provided to apply the transformation properly, fulfilling requirement #9.
2. R code with outputs was provided.
3. Brief explanations of steps were given.
Succinct (5)
Not acceptable: The report was too long and/or lacked clarity.
Needs Improvement: The report could be written more succinctly. There was unnecessary detail that distracted from the main findings (like outputs were too long or