Data files
# Read in the Building_Permits.csv file in the data/ folder and store it in a data frame named
# `permits`. This file contains building permits from the City of Seattle. See the link below
# for more info. Be sure set the working directory to the appropriate folder location.
# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr
permits <->
# How many unique values are in the `OriginalCity` column of the dataset? Store the unique values
# in a vector called `unique_cities`. As you can see - there is only one unique city but there are
# multiple variations on how it is stored in the data. (hint: you may want to look into the
# `unique()` function)
# Suppose we only want information on closed permits that have a valid type description. Filter
# the permits data frame to only include rows where the `StatusCurrent` column is "Issued" and
# the `PermitTypeMapped` is "Building". Store the filtered data frame as `filtered_permits`. It is
# fine to do the filtering across successive steps
# How many unique values for the `PermitClass` column are in `filtered_permits`? Store the unique
# values in a vector called `unique_permit_classes`.
# We'll work with this filtered dataset in future exercises and look at the values in the
# `PermitClass` column. Store the entire `filtered_permits` data frame as a file named
# 'filtered_building_permits.csv'. Set `row.names` to FALSE when saving.
# Read the filtered_building_permits.csv file that you just saved back in and store the data in a
# data frame named `test_read_write`. Ensure that test_read_write has the same dimensions as your
# `filtered_permits`. If it doesn't, you should check that `row.names` was appropriately set
# in the previous step and re-save.
Basic grammar functions
## Note: Use `dplyr` and `tidyr` functions for all exercises in this module. Do not use dollar
## sign ($) or bracket ([]) notation!
library(tidyr)
library(dplyr)
# Read in the data/Building_Permits.csv file and store it in a data frame named `permits`. This
# file contains building permits from the City of Seattle. See the link below for more info. Be sure
# set the working directory to the appropriate folder location.
# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr
permits <->
permits
# Also load in the data/filtered_building_permits.csv file (what you created as part of the last
# exercise in Module 3). Store the data in a data frame called `filtered_permits`.
filtered_permits <->
# We're going to use dplyr to filter the permits data using the same steps as we did previously.
# Specifically, we're going to keep only entries that have a status of 'Issued' in the
# `StatusCurrent` column and have a 'Building' value for the `PermitTypeMapped` column. Use only
# functions from the `dplyr` library to perform the filtering. Store the filtered data in a data
# frame called `filtered_permits_dplyr`. Store the dimensions of the resulting data frame in a
# variable named `dimensions`. Then, print the value of the `dimensions` variable. Also ensure
# that `filtered_building_permits` and `filtered_permits_dplyr` are identical
# Now, let's create a `Year` column that contains the year that the permit was issued. To do this,
# use the `mutate` dplyr function and extract the first 4 characters of the `AppliedDate` column.
# Add this `year` column to the `filtered_permits_dplyr` data frame.
# Keep only the rows of the `filtered_permits_dplyr` that have a valid (non-blank) year. To do so,
# filter out all Year values that are `NaN` values (hint: see the `drop_na()` function from tidyverse)
# and then drop all entries that have "" as the Year. Then, select only the following columns to keep
# in the data frame: PermitClass, PermitClassMapped, and Year. Store the resulting data frame as
# `filtered_permits_dplyr`
# Finally, arrange the `filtered_permits_dplyr` data frame such that the most recent years appear
# first and the earliest years appear last
# Save the `filtered_permits_dplyr` data frame as "filtered_permits_dplyr.csv". Set `row.names` to
# FALSE. We will be using this data frame in the next exercise.
Pipes
# Now, we're going to filter/mutate/arrange the permits data frame in the same manner that we did
# in the previous exercise. However, instead of performing subsequent calls using dplyr functions,
# we are going to chain together dplyr calls using pipes. In a single pipe, create a
# `filtered_permits_piped` data frame that is identical to the `filtered_permits_dplyr` data frame.
# The steps in the process are:
# 1. FILTER permits such that the `StatusCurrent` column is 'Issued'
# 2. FILTER permits such that the `PermitTypeMapped` column is 'Building'
# 3. MUTATE the data frame to add a `Year` column that contains the first 4 characters of the
# `AppliedDate` column
# 4. FILTER permits such that they have a valid (non-blank) year (removing NaNs and "" entries)
# 5. SELECT only the following columns from the data frame: `PermitClass`, `PermitClassMapped`, and
# `Year`
# 6. ARRANGE the data frame so the most recent years are first and the earliest years are last
# Ensure that your `filtered_permits_dplyr` and `filtered_permits_piped` data frames are identical.
# You will first need to MUTATE the `Year` column in your `filtered_permits_piped` data frame to
# be an an integer value (hint: use the `as.integer` function). This is because, when saving/loading
# data frames, R will default to storing the data using types that it sees most fit. As an example,
# though when you saved the `filtered_permits_dplyr` data frame, `Year` was a string (we literally
# took a substring to form the column), you can see that it is now a numeric/integer value because
# we filtered, saved, and then re-loaded.
# Now, you can use the `identical()` function to compare You do not need to save the
# `filtered_permits_piped` data frame
Groups
## Note: Use `dplyr` and `tidyr` functions for all exercises in this module. Do not use dollar
## sign ($) or bracket ([]) notation!
# Read in the data/Building_Permits.csv file and store it in a data frame named `permits`. This
# file contains building permits from the City of Seattle. See the link below for more info. Be sure
# set the working directory to the appropriate folder location.
# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr
# Filter the `permits` to only include instances where the estimated project cost value
# (`EstProjectCost`) is greater than 0. Also filter to only include new permits for building
# (Hint: use the `PermitTypeMapped` and `PermitTypeDesc` columns). Also, as you did in the
# previous exercise, add a `Year` column by taking the first four characters from the `date.
# `AppliedDate`` column. However, do not filter on the `Year` column as you did previously.
# Store the result in a data frame named `filtered_permits`
# Group the permits by Year to get the mean estimated project cost for each year. Name the
# mean estimated cost column `MeanEstCost`. Store the results in a data frame named
# `grouped_by_year`
# Which year after 2010 had the highest mean estimated project cost? Use `dplyr` functions
# and pipes to find the value of interest (Hint: use filter operations). Store the value in
# a data frame with a single row that has `Year` and `MeanEstCost` columns. Name the data
# frame `highest_cost_by_year`.
# Note: if you wanted to extract the `Year` value, you could use dplyr's `pull()` function.
# However, you are not asked to do that here. Keep the result as a data frame with a single
# row
# Now, create a second grouping to get the mean estimated project cost for each year and
# `PermitClass` as well as the number of records for each year and `PermitClass`
# combination (i.e. group by both simultaneously and summarize by both simultaneously).
# Store the results in a data frame named 'grouped_by_year_and_class`. The data frame
# should have four columns: `PermitClass`, `Year`, `N` (for the number of records) and
# `MeanEstCost`. (Hint: use the `n()` function in your summarize to get counts)
# For year and permit type combinations that had at least 1000 permits, what was the
# minimum estimated cost? Store the result in a data frame called `min_cost_high_count`
# that has the same four columns as your `grouped_by_year_and_class` data frame.
Joins
## Note: Use `dplyr` and `tidyr` functions for all exercises in this module. Do not use dollar
## sign ($) or bracket ([]) notation!
# Read in the data/Building_Permits.csv file and store it in a data frame named `permits`. This
# file contains building permits from the City of Seattle. See the link below for more info. Be sure
# set the working directory to the appropriate folder location.
# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr
# Read in the data/zip_code_demos_2000.csv file and store it in a data frame named `zip_codes` This
# file contains demographic information for many of Washington State's zip codes from the year
# 2000.
# In the `zip_code` data frame, you can see that both the `population` and `population_density`
# columns are stored as character columns while they contain numeric values. This is because of
# the commas present in the data source. Let's clean both columns and convert them to numeric
# values by first replacing the commas with "" characters (you can use functions from the
# `stringr` package to do this or use R's `gsub()` function) and then convert them to numeric
# values using the `as.numeric()` function. Instead of storing the results in a new variable,
# simply update the values of the existing `zip_codes` data frame.
# Filter the `permits` data frame to only include instances where the `OriginalZip` variable is
# not NA (use tidyverse's `drop_na()` function). Store the filtered data frame as a data frame
# named `filtered_permits`
# Which zip codes in the `filtered_permits` data frame do we not have information for in the
# `zip_codes` data frame? To answer this question, create a `not_found` vector that contains
# all the zip codes from `filtered_permits` that are not in `zip_codes`. One approach
# to create this variable is to first isolate all unique zip codes in `filtered_permits` and
# then use the `%in%` function to find which of the unique zip codes are NOT in the `zip_codes`
# data frame.
# Without changing the names of any columns, perform a left join of the `filtered_permits` data
# frame and the `zip_codes` data frame. Have the `filtered_permits` data frame be the "left"
# side of the join. Store the resulting data frame in a variable named `left_joined_data`
# Now, find which of the rows in the `filtered_permits` data frame were not matched to rows in
# the `zip_codes` data frame. To do so, filter to rows of `left_joined_data` where one of the
# columns from the `zip_codes` data frame is NA (you do not need to check for all columns, just
# pick one and use that). Store the result in a data frame named `not_matched_left_join`
# Do an inner join of the `filtered_permits` data frame and the `zip_codes` data frame. Store
# the result in a data frame named `inner_joined_data`.
# Now, we can see if there is any relationship between zip code demographics and the number of
# permits for a given area. Create a data frame named `grouped_by_zip` that is a grouping of
# the `inner_joined_data` data frame by the `OriginalZip` column and then summarized to give
# a column of counts (i.e. the number of permits in each zip code) and the mean population
# density of each zip code
# Is there any relationship that you can see between the number of permits in a zip code and
# the population density of that zip code? You do not need to do any formal tests of this -
# simply doing an "eyeball" test from looking at the data is enough.
#
# From here, we can see how joining multiple data sets would allow us to answer really
# interesting questions about the relationship between building permits and population density.
# Of course, in this instance, it comes with the caveat that the population demographic data is
# a bit older than the permit data. We can also expand this to see the relationship between
# population demographics (age, race, ethnicity, education) and the number of residential
# building permits in an area. We can also look at income and commercial building permits in
# an area. Being able to join data opens doors so we're no longer limited to a single data set!
Tidyr
filter the `permits` to only include instances where the
# estimated project cost value (`EstProjectCost`) is greater than 0. Also filter to only include
# new permits for building (Hint: use the `PermitTypeMapped` and `PermitTypeDesc` columns). Add
# a `Year` column by taking the first four characters from the date. Store the result in a data
# frame named `filtered_permits`
# Similar to what you did in a previous exercise: create a grouping to get the number of records
# for each year and `PermitClass` combination (i.e. group by both simultaneously and summarize
# by both simultaneously). Store the results in a data frame named 'grouped_by_year_and_class`.
# The data frame should have three columns: `PermitClass`, `Year`, and `N` (for the number of
# records). (Hint: use the `n()` function in your summarize to get counts)
# As you can see, the `grouped_by_year_and_class` data frame is a "long" data frame. Create a
# data frame named `grouped_by_year_and_class_wide` that is a wide version of the same data. Use
# tidyr's `pivote_wider()` function to do this. Your "wide" data frame should have a column for the
# Year and then a column for every category of `PermitClass` (8 columns in total). The values of
# the data frame should be counts (i.e. your values for `N`)
# Now, use the `grouped_by_year_and_class_wide` data frame to recreate the
# `grouped_by_year_and_class` data frame using tidyr's `pivot_longer()` function. The cols to pivot
# should be everything except the Year column (written as "-Year" , note the "-" symbol). The
# column names should go to a column called "PermitClass", and the values should go to a column
# called "N" (to match the orignal data frame). After creating the data frame, filter out rows with
# NA values. Store the results in a data frame named `grouped_by_year_and_class_long`. Check that
# the dimensions of the resulting data frame are the same as the `grouped_by_year_and_class` data frame
# Lastly, arrange both the `grouped_by_year_and_class` and `grouped_by_year_and_class_long` data
# frames by year and `PermitClass` (the order doesn't matter as long as it is consistent across)
# the two data frames). Ensure that the two data frames are identical by using the `identical()`
# function.
ggplot basics
# Read in the data/filtered_permits_dplyr.csv file (what was created during the previous module)
# as a data frame named `filtered_permits`. This data should be a filtered version of the permits
# data that has three columns: PermitClass, PermitClassMapped, and Year. The data should have
# 8,238 observations (rows).
# Group the data by year to get a count of the number of permits for each year. Store the result
# in a variable named `permits_by_year`. You should use dplyr to do this grouping (hint: use
# `group_by()` followed by a `summarize()` call using the `n()` function).
# Using ggplot, create a bar chart of the number of permits by year. Store the plot as `plot1`.
# After storing, you can show the plot using the `show` function (i.e. `show(plot1)`).
# Filter the `permits_by_year` data frame such that it only contains Years including and after
# 2010. Store the results in a data frame named `permits_by_year_filtered`. (hint: You can
# simply filter the `permits_by_year` data frame - you do not need to filter and perform the
# group by aggregation on the `permits` data frame, though that is an option.)
# Using ggplot, create a bar chart of the number of permits by year. Store the plot as `plot1`.
# After storing, you can show the plot using the `show` function (i.e. `show(plot1)`).
# Now, create a new plot (`plot2`) with axis labels. Name the bottom axis "Year" and the left
# (vertical) axis "Number of Permits". Change the title of the plot to
# "Number of Active Permits by Year".
# You can do make these changes by altering `plot1` (i.e. plot2 <- plot1="" +="" arguments)="">
# than recreating the plot from scratch Be sure the resulting plot is stored as `plot3`. You
# can use the `show` function to view the plot.
ggplot application
# Read in the data/filtered_permits_dplyr.csv file (what was created during the previous module)
# as a data frame named `filtered_permits`. This data should be a filtered version of the permits
# data that has three columns: PermitClass, PermitClassMapped, and Year. The data should have
# 8,238 observations (rows).
# As you did in the previous exercise, filter out permits prior to 2010. This time, however,
# instead of grouping by year, group by both year and the permit class to get a count of the
# number of permits for each class for each year. Store the result in a variable named
# `permits_by_year_by_class`. You should use dplyr pipes to do this filtering and grouping.
# Using ggplot, create a stacked bar chart that shows the number of permits each year with
# colors to indicate the `PermitClass` (see Figure 16.7 in the textbook for an example).
# Name the bottom axis "Year" and the left (vertical) axis "Number of Permits". Save the
# figure in a variable called `plot1`.
# Change the scale of the graph so it is colored using a ColorBrewer palette. You can see
# the available palettes here:
# https://www.r-graph-gallery.com/38-rcolorbrewers-palettes_files/figure-html/thecode-1.png
# Save the result as `plot1` (replacing the previous variable).
# Using ggplot, create a facet of the permits by year using `PermitClass` as your facet. This
# should result in a bar chart of the number of permits per year for each unique value of
# `PermitClass`. The facets should all be in a single plot that you should store as `plot2`.
->->->->