you will generate a report presenting a data set and analysis. While you can use code to ask questions of data, the answers to the questions are meaningless if you can't share them with others!
By completing this assignment you will practice and master the following skills:
- Declaring document rendering using Markdown syntax
- Rendering R Markdown files using knitr
- Synthesizing skills, tools, and concepts from across the course
Completing the Assignment
Open RStudio in a new tab, then start the assignment called "M6_assignment_data_reporting". You will put your code in the provided analysis.R file, but you will also need to create your own index.Rmd R Markdown file using the built-in wizard. Place this file in the "root" of the project directory.
PICKING A DATA SET
For this assignment you can work with a data set of your choosing. It's perfectly acceptable to work with a data set from one of the previous assignments (though not one from the exercises); in fact, that is the "default" option.
However, if you wish to practice your data programming skills on data from a different domain, you can do that as well. You will need to locate your own data set in this case.
By "data set", we mean a .csv file similar to ones you've worked with previously. It's acceptable to use an alternate data format (like a relational database or web API), but that goes beyond the scope of what we've already covered, so it will require extensive additional outside study. Note that a "web site" (e.g., for web crawling) is not appropriate for this assignment, nor are image databases. This is not a machine learning course; we're just doing the basics! Stick with what you know :)
Open Data Government sites (e.g., for Seattle), Kaggle, and the FiveThirtyEight blog are also generally good places to find easy-to-work-with data.
Your data set need not be "Big Data", but should be of sufficient size to do some interesting analysis. Having at least 100 observations across 3-4 unique features is a good size. Make sure that your .csv is less than 50 MB so that there are no problems sharing it (if it's larger than that, take a subset!).
Upload your data file to your project (using the "Upload" button in the file pane) and save it in the provided data/ folder.
MAKING THE REPORT
You will present your data analysis in a single report created with R Markdown and knitr.
The report will be written in a file called index.Rmd. You will need to create this file (you can do this through the RStudio Wizard, as described in Chapter 18). This file will contain your report, including both text in Markdown and instructions to execute R code that dynamically produces the data shown in the report.
- Be sure to specify appropriate metadata, including the title, your name as the author, and the date the report was generated. These should automatically be set up through the RStudio wizard.
Your report will include an R code chunk called setup (with include=FALSE as a specified option), as described and shown in the textbook. In this code chunk, use the source() function to run your analysis.R script. Because the chunk has include=FALSE, any printed output from your script will not be shown, but the variables containing your plots will be defined so you can use them later.
- Your R Markdown should use a relative path to the analysis.R file, assuming that both files are in the same folder.
Remember that you can't have any calls to View() in any code run by R Markdown! Be sure to remove or comment out any of those calls in your analysis.R file.
Because plots may take some time to create, it may take a minute for your R Markdown file to knit. Be patient!
All of your "data wrangling" and analysis work must go in your analysis.R script! Only code related to the "presentation" should go in the R Markdown file. Any code generating data frames or plots should also go in the analysis.R file; save those plots to variables, which you can then reference from the R Markdown. Debug your code in the analysis.R file, not in the Markdown!
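As a rough illustration of this split, here is a minimal sketch of what analysis.R might contain. It assumes the ggplot2 package is installed and uses R's built-in airquality data set as a stand-in for your own file; the variable names (air_data, monthly_temp, monthly_temp_plot) are made up for this example.

```r
# analysis.R (sketch): all wrangling and plotting lives here, not in index.Rmd.
library(ggplot2)

# Stand-in data so the sketch runs anywhere: R's built-in `airquality` data set.
# In the assignment this would be your own file loaded from the data/ folder
# with read.csv().
air_data <- airquality

# View(air_data)  # left commented out: View() cannot be called while knitting

# Wrangling: one row per month with the mean of the daily temperature readings.
monthly_temp <- aggregate(Temp ~ Month, data = air_data, FUN = mean)

# Save the plot to a variable instead of printing it here.
monthly_temp_plot <- ggplot(monthly_temp, aes(x = Month, y = Temp)) +
  geom_col() +
  labs(
    title = "Mean temperature by month",
    x = "Month",
    y = "Mean temperature (degrees F)"
  )
```

With a script like this, the setup chunk only needs to call source("analysis.R"), and a later chunk in index.Rmd can display the plot simply by naming monthly_temp_plot.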
You can use the built-in Knit button in RStudio to render your .Rmd file into a .html file, which you can open with a web browser. Simply click the Knit button at the top of RStudio, and your index.html file will be saved in the same directory as your .Rmd file. You will "re-knit" repeatedly as you work through the assignment to make sure everything works!
REPORT CONTENT
Your report will include a few different sections. Give each section an appropriate heading, and a sentence or so introducing it.
1. Data Description
The first part of your report will be a brief "introduction" to the data set and your analysis. This section will include a paragraph presenting the following information:
A non-technical description of the data sets you will be using (what is the data?). This only needs to be a sentence or two.
An explanation of where the data comes from, who originally collected the data, and any other information we may need to know about how this data set came to be. You must include a hyperlink to the source: we must be able to follow the links and find your data set ourselves. Again, this only needs to be a sentence or two.
For example, if you're working with the A3 World Bank data, you'd include a link to their website.
A sample of the data set, so that we can see what raw data you'll be working with ("the data set looks like this"). This means that you'll need to load the data set into R (e.g., with read.csv()) and present it as a table (or multiple tables) in your report. Do not include the entire table; just a few rows is sufficient. Think about the "user experience" of reading the report!
Use the kable() function to render a readable data table.
You don't need to include all columns of your data frames; only including the most important/relevant ones is acceptable. You are not required to do substantive data cleaning (or even rename columns), though it wouldn't hurt to do some of that wrangling now instead of later.
Remember to do your data wrangling in the .R script file, not in the R Markdown file!
If any of the column names are not intuitive, also include a brief explanation. For example, if you're working with the A3 World Bank data, you'd include an explanation of which indicators you're working with.
This section must include some text formatting using Markdown (such as making text either bold or italic).
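The sample-table step might look roughly like the sketch below. It assumes the knitr package is installed; because your actual file isn't known here, it first writes R's built-in airquality data to a temporary .csv as a stand-in. The names csv_path, air_data, and sample_rows are hypothetical.

```r
library(knitr)

# Stand-in for a real data file: write R's built-in `airquality` data to a
# temporary .csv so that the read.csv() call below has something to load.
# In the assignment you would skip this step and read your own file from data/.
csv_path <- tempfile(fileext = ".csv")
write.csv(airquality, csv_path, row.names = FALSE)

# In analysis.R: load the data, keep the relevant columns, and take a few rows.
air_data <- read.csv(csv_path, stringsAsFactors = FALSE)
sample_rows <- head(air_data[, c("Month", "Day", "Temp", "Wind")], 5)

# In index.Rmd (inside an R chunk): render the sample as a readable table.
kable(sample_rows, caption = "First five rows of the data set")
```

Keeping the head() call in the script and only the kable() call in the report keeps the wrangling out of the R Markdown file, as required above.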
2. Data Analysis
The second part of your report will be the analysis of your data. Your report will include two (2) different "questions" and the analysis that explores those questions.
For example, if you're working with the A3 World Bank data, you could pick any two visualizations as your questions: "How are CO2 emissions distributed globally?" and "How has the share of wealth in the USA changed between groups over time?"
Look back at some of the reflections and analyses that you did in previous assignments to get a sense for what kinds of questions you might ask!
Each question should be presented in its own section (with a second-level heading). For each question, include the following:
A sentence or so presenting the question.
A graphical data representation (a plot, created with ggplot) that explores that question.
A brief evaluation of your exploration stating your conclusions (the "answer" to the questions you asked).
Your evaluation cannot rely purely on visual or anecdotal analysis (no "the line goes up!" or "the measure for one state looks large!"). Instead, it must use some descriptive statistics (e.g., mean/median) or measures of effect strength (e.g., correlations or predictive statistics) to definitively state relationships among your data. You do not need to perform advanced statistical analysis (this is not a stats class!), but your conclusions need to be grounded in the data, not in the representation.
It's quite likely that the results may not provide the answer you expected, and that's okay! In your evaluation, you can mention that, and offer a guess as to why your assumptions didn't hold up.
The descriptive statistics you use to answer your questions must be included in your report as inline R expressions. For example, you might have a sentence "The USA is the largest polluter in the world", where "USA" is an inline value drawn directly from the data.
Again, remember to do your data analysis in the analysis.R file! Save whatever values you want to include in your report inside of specific variables (or lists of values).
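To make the statistics requirement concrete, here is a minimal sketch of how analysis.R could store values for later inline use. It relies only on base R and again uses the built-in airquality data set as a stand-in for your own data; all variable names are hypothetical.

```r
# In analysis.R: compute the statistics that back up each conclusion and store
# them in named variables. `airquality` is a built-in stand-in for your own data.
air_data <- airquality

# Descriptive statistics.
mean_temp <- round(mean(air_data$Temp), 1)
median_temp <- median(air_data$Temp)

# Effect strength: Pearson correlation between wind speed and temperature.
wind_temp_cor <- round(cor(air_data$Wind, air_data$Temp), 2)

# A value drawn from the data rather than typed by hand: the name of the month
# with the highest mean temperature.
monthly_temp <- aggregate(Temp ~ Month, data = air_data, FUN = mean)
hottest_month <- month.name[monthly_temp$Month[which.max(monthly_temp$Temp)]]
```

Once the setup chunk has sourced the script, a sentence in index.Rmd can pull these values in with inline R expressions, for example: The hottest month was `` `r hottest_month` ``, and the correlation between wind and temperature was `` `r wind_temp_cor` ``. If the data changes, the sentence updates the next time you knit.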
SUBMITTING YOUR WORK
We will grade your assignment by looking at your work in RStudio Cloud. You can also download the completed .R and .Rmd files from RStudio Cloud and upload them here. Then return to this page and click Next.
GRADING RUBRIC
Each item in the grading rubric below will be scored roughly as follows:
100%: Meets all requirements for grade item. Report is effectively coded, written, and presented.
80%: Meets most requirements for grade item. Report may have a few errors or be missing minor aspects/components.
60%: Meets many (but not all) requirements for grade item. Report may be missing significant aspects/components or have multiple errors.
40%: Meets only a few requirements. Report may be started but incomplete.
0%: Missing or meets no requirements. Report demonstrates no understanding of course material.