Data files # Read in the Building_Permits.csv file in the data/ folder and store it in a data...

Question

Data files

# Read in the Building_Permits.csv file in the data/ folder and store it in a data frame named

# `permits`. This file contains building permits from the City of Seattle. See the link below

# for more info. Be sure set the working directory to the appropriate folder location.

# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr

permits <->

# How many unique values are in the `OriginalCity` column of the dataset? Store the unique values

# in a vector called `unique_cities`. As you can see - there is only one unique city but there are

# multiple variations on how it is stored in the data. (hint: you may want to look into the

# `unique()` function)

# Suppose we only want information on closed permits that have a valid type description. Filter

# the permits data frame to only include rows where the `StatusCurrent` column is "Issued" and

# the `PermitTypeMapped` is "Building". Store the filtered data frame as `filtered_permits`. It is

# fine to do the filtering across successive steps

# How many unique values for the `PermitClass` column are in `filtered_permits`? Store the unique

# values in a vector called `unique_permit_classes`.

# We'll work with this filtered dataset in future exercises and look at the values in the

# `PermitClass` column. Store the entire `filtered_permits` data frame as a file named

# 'filtered_building_permits.csv'. Set `row.names` to FALSE when saving.

# Read the filtered_building_permits.csv file that you just saved back in and store the data in a

# data frame named `test_read_write`. Ensure that test_read_write has the same dimensions as your

# `filtered_permits`. If it doesn't, you should check that `row.names` was appropriately set

# in the previous step and re-save.

Basic grammar functions

## Note: Use `dplyr` and `tidyr` functions for all exercises in this module. Do not use dollar

## sign ($) or bracket ([]) notation!

library(tidyr)

library(dplyr)

# Read in the data/Building_Permits.csv file and store it in a data frame named `permits`. This

# file contains building permits from the City of Seattle. See the link below for more info. Be sure

# set the working directory to the appropriate folder location.

# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr

permits <->

permits

# Also load in the data/filtered_building_permits.csv file (what you created as part of the last

# exercise in Module 3). Store the data in a data frame called `filtered_permits`.

filtered_permits <->

# We're going to use dplyr to filter the permits data using the same steps as we did previously.

# Specifically, we're going to keep only entries that have a status of 'Issued' in the

# `StatusCurrent` column and have a 'Building' value for the `PermitTypeMapped` column. Use only

# functions from the `dplyr` library to perform the filtering. Store the filtered data in a data

# frame called `filtered_permits_dplyr`. Store the dimensions of the resulting data frame in a

# variable named `dimensions`. Then, print the value of the `dimensions` variable. Also ensure

# that `filtered_building_permits` and `filtered_permits_dplyr` are identical

# Now, let's create a `Year` column that contains the year that the permit was issued. To do this,

# use the `mutate` dplyr function and extract the first 4 characters of the `AppliedDate` column.

# Add this `year` column to the `filtered_permits_dplyr` data frame.

# Keep only the rows of the `filtered_permits_dplyr` that have a valid (non-blank) year. To do so,

# filter out all Year values that are `NaN` values (hint: see the `drop_na()` function from tidyverse)

# and then drop all entries that have "" as the Year. Then, select only the following columns to keep

# in the data frame: PermitClass, PermitClassMapped, and Year. Store the resulting data frame as

# `filtered_permits_dplyr`

# Finally, arrange the `filtered_permits_dplyr` data frame such that the most recent years appear

# first and the earliest years appear last

# Save the `filtered_permits_dplyr` data frame as "filtered_permits_dplyr.csv". Set `row.names` to

# FALSE. We will be using this data frame in the next exercise.

Pipes

# Now, we're going to filter/mutate/arrange the permits data frame in the same manner that we did

# in the previous exercise. However, instead of performing subsequent calls using dplyr functions,

# we are going to chain together dplyr calls using pipes. In a single pipe, create a

# `filtered_permits_piped` data frame that is identical to the `filtered_permits_dplyr` data frame.

# The steps in the process are:

# 1. FILTER permits such that the `StatusCurrent` column is 'Issued'

# 2. FILTER permits such that the `PermitTypeMapped` column is 'Building'

# 3. MUTATE the data frame to add a `Year` column that contains the first 4 characters of the

# `AppliedDate` column

# 4. FILTER permits such that they have a valid (non-blank) year (removing NaNs and "" entries)

# 5. SELECT only the following columns from the data frame: `PermitClass`, `PermitClassMapped`, and

# `Year`

# 6. ARRANGE the data frame so the most recent years are first and the earliest years are last

# Ensure that your `filtered_permits_dplyr` and `filtered_permits_piped` data frames are identical.

# You will first need to MUTATE the `Year` column in your `filtered_permits_piped` data frame to

# be an an integer value (hint: use the `as.integer` function). This is because, when saving/loading

# data frames, R will default to storing the data using types that it sees most fit. As an example,

# though when you saved the `filtered_permits_dplyr` data frame, `Year` was a string (we literally

# took a substring to form the column), you can see that it is now a numeric/integer value because

# we filtered, saved, and then re-loaded.

# Now, you can use the `identical()` function to compare You do not need to save the

# `filtered_permits_piped` data frame

Groups

## Note: Use `dplyr` and `tidyr` functions for all exercises in this module. Do not use dollar

## sign ($) or bracket ([]) notation!

# Read in the data/Building_Permits.csv file and store it in a data frame named `permits`. This

# file contains building permits from the City of Seattle. See the link below for more info. Be sure

# set the working directory to the appropriate folder location.

# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr

# Filter the `permits` to only include instances where the estimated project cost value

# (`EstProjectCost`) is greater than 0. Also filter to only include new permits for building

# (Hint: use the `PermitTypeMapped` and `PermitTypeDesc` columns). Also, as you did in the

# previous exercise, add a `Year` column by taking the first four characters from the `date.

# `AppliedDate`` column. However, do not filter on the `Year` column as you did previously.

# Store the result in a data frame named `filtered_permits`

# Group the permits by Year to get the mean estimated project cost for each year. Name the

# mean estimated cost column `MeanEstCost`. Store the results in a data frame named

# `grouped_by_year`

# Which year after 2010 had the highest mean estimated project cost? Use `dplyr` functions

# and pipes to find the value of interest (Hint: use filter operations). Store the value in

# a data frame with a single row that has `Year` and `MeanEstCost` columns. Name the data

# frame `highest_cost_by_year`.

# Note: if you wanted to extract the `Year` value, you could use dplyr's `pull()` function.

# However, you are not asked to do that here. Keep the result as a data frame with a single

# row

# Now, create a second grouping to get the mean estimated project cost for each year and

# `PermitClass` as well as the number of records for each year and `PermitClass`

# combination (i.e. group by both simultaneously and summarize by both simultaneously).

# Store the results in a data frame named 'grouped_by_year_and_class`. The data frame

# should have four columns: `PermitClass`, `Year`, `N` (for the number of records) and

# `MeanEstCost`. (Hint: use the `n()` function in your summarize to get counts)

# For year and permit type combinations that had at least 1000 permits, what was the

# minimum estimated cost? Store the result in a data frame called `min_cost_high_count`

# that has the same four columns as your `grouped_by_year_and_class` data frame.

Joins

## Note: Use `dplyr` and `tidyr` functions for all exercises in this module. Do not use dollar

## sign ($) or bracket ([]) notation!

# Read in the data/Building_Permits.csv file and store it in a data frame named `permits`. This

# file contains building permits from the City of Seattle. See the link below for more info. Be sure

# set the working directory to the appropriate folder location.

# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr

# Read in the data/zip_code_demos_2000.csv file and store it in a data frame named `zip_codes` This

# file contains demographic information for many of Washington State's zip codes from the year

# 2000.

# In the `zip_code` data frame, you can see that both the `population` and `population_density`

# columns are stored as character columns while they contain numeric values. This is because of

# the commas present in the data source. Let's clean both columns and convert them to numeric

# values by first replacing the commas with "" characters (you can use functions from the

# `stringr` package to do this or use R's `gsub()` function) and then convert them to numeric

# values using the `as.numeric()` function. Instead of storing the results in a new variable,

# simply update the values of the existing `zip_codes` data frame.

# Filter the `permits` data frame to only include instances where the `OriginalZip` variable is

# not NA (use tidyverse's `drop_na()` function). Store the filtered data frame as a data frame

# named `filtered_permits`

# Which zip codes in the `filtered_permits` data frame do we not have information for in the

# `zip_codes` data frame? To answer this question, create a `not_found` vector that contains

# all the zip codes from `filtered_permits` that are not in `zip_codes`. One approach

# to create this variable is to first isolate all unique zip codes in `filtered_permits` and

# then use the `%in%` function to find which of the unique zip codes are NOT in the `zip_codes`

# data frame.

# Without changing the names of any columns, perform a left join of the `filtered_permits` data

# frame and the `zip_codes` data frame. Have the `filtered_permits` data frame be the "left"

# side of the join. Store the resulting data frame in a variable named `left_joined_data`

# Now, find which of the rows in the `filtered_permits` data frame were not matched to rows in

# the `zip_codes` data frame. To do so, filter to rows of `left_joined_data` where one of the

# columns from the `zip_codes` data frame is NA (you do not need to check for all columns, just

# pick one and use that). Store the result in a data frame named `not_matched_left_join`

# Do an inner join of the `filtered_permits` data frame and the `zip_codes` data frame. Store

# the result in a data frame named `inner_joined_data`.

# Now, we can see if there is any relationship between zip code demographics and the number of

# permits for a given area. Create a data frame named `grouped_by_zip` that is a grouping of

# the `inner_joined_data` data frame by the `OriginalZip` column and then summarized to give

# a column of counts (i.e. the number of permits in each zip code) and the mean population

# density of each zip code

# Is there any relationship that you can see between the number of permits in a zip code and

# the population density of that zip code? You do not need to do any formal tests of this -

# simply doing an "eyeball" test from looking at the data is enough.

#

# From here, we can see how joining multiple data sets would allow us to answer really

# interesting questions about the relationship between building permits and population density.

# Of course, in this instance, it comes with the caveat that the population demographic data is

# a bit older than the permit data. We can also expand this to see the relationship between

# population demographics (age, race, ethnicity, education) and the number of residential

# building permits in an area. We can also look at income and commercial building permits in

# an area. Being able to join data opens doors so we're no longer limited to a single data set!

Tidyr

filter the `permits` to only include instances where the

# estimated project cost value (`EstProjectCost`) is greater than 0. Also filter to only include

# new permits for building (Hint: use the `PermitTypeMapped` and `PermitTypeDesc` columns). Add

# a `Year` column by taking the first four characters from the date. Store the result in a data

# frame named `filtered_permits`

# Similar to what you did in a previous exercise: create a grouping to get the number of records

# for each year and `PermitClass` combination (i.e. group by both simultaneously and summarize

# by both simultaneously). Store the results in a data frame named 'grouped_by_year_and_class`.

# The data frame should have three columns: `PermitClass`, `Year`, and `N` (for the number of

# records). (Hint: use the `n()` function in your summarize to get counts)

# As you can see, the `grouped_by_year_and_class` data frame is a "long" data frame. Create a

# data frame named `grouped_by_year_and_class_wide` that is a wide version of the same data. Use

# tidyr's `pivote_wider()` function to do this. Your "wide" data frame should have a column for the

# Year and then a column for every category of `PermitClass` (8 columns in total). The values of

# the data frame should be counts (i.e. your values for `N`)

# Now, use the `grouped_by_year_and_class_wide` data frame to recreate the

# `grouped_by_year_and_class` data frame using tidyr's `pivot_longer()` function. The cols to pivot

# should be everything except the Year column (written as "-Year" , note the "-" symbol). The

# column names should go to a column called "PermitClass", and the values should go to a column

# called "N" (to match the orignal data frame). After creating the data frame, filter out rows with

# NA values. Store the results in a data frame named `grouped_by_year_and_class_long`. Check that

# the dimensions of the resulting data frame are the same as the `grouped_by_year_and_class` data frame

# Lastly, arrange both the `grouped_by_year_and_class` and `grouped_by_year_and_class_long` data

# frames by year and `PermitClass` (the order doesn't matter as long as it is consistent across)

# the two data frames). Ensure that the two data frames are identical by using the `identical()`

# function.

ggplot basics

# Read in the data/filtered_permits_dplyr.csv file (what was created during the previous module)

# as a data frame named `filtered_permits`. This data should be a filtered version of the permits

# data that has three columns: PermitClass, PermitClassMapped, and Year. The data should have

# 8,238 observations (rows).

# Group the data by year to get a count of the number of permits for each year. Store the result

# in a variable named `permits_by_year`. You should use dplyr to do this grouping (hint: use

# `group_by()` followed by a `summarize()` call using the `n()` function).

# Using ggplot, create a bar chart of the number of permits by year. Store the plot as `plot1`.

# After storing, you can show the plot using the `show` function (i.e. `show(plot1)`).

# Filter the `permits_by_year` data frame such that it only contains Years including and after

# 2010. Store the results in a data frame named `permits_by_year_filtered`. (hint: You can

# simply filter the `permits_by_year` data frame - you do not need to filter and perform the

# group by aggregation on the `permits` data frame, though that is an option.)

# Using ggplot, create a bar chart of the number of permits by year. Store the plot as `plot1`.

# After storing, you can show the plot using the `show` function (i.e. `show(plot1)`).

# Now, create a new plot (`plot2`) with axis labels. Name the bottom axis "Year" and the left

# (vertical) axis "Number of Permits". Change the title of the plot to

# "Number of Active Permits by Year".

# You can do make these changes by altering `plot1` (i.e. plot2 <- plot1="" +="" arguments)="">

# than recreating the plot from scratch Be sure the resulting plot is stored as `plot3`. You

# can use the `show` function to view the plot.

ggplot application

# Read in the data/filtered_permits_dplyr.csv file (what was created during the previous module)

# as a data frame named `filtered_permits`. This data should be a filtered version of the permits

# data that has three columns: PermitClass, PermitClassMapped, and Year. The data should have

# 8,238 observations (rows).

# As you did in the previous exercise, filter out permits prior to 2010. This time, however,

# instead of grouping by year, group by both year and the permit class to get a count of the

# number of permits for each class for each year. Store the result in a variable named

# `permits_by_year_by_class`. You should use dplyr pipes to do this filtering and grouping.

# Using ggplot, create a stacked bar chart that shows the number of permits each year with

# colors to indicate the `PermitClass` (see Figure 16.7 in the textbook for an example).

# Name the bottom axis "Year" and the left (vertical) axis "Number of Permits". Save the

# figure in a variable called `plot1`.

# Change the scale of the graph so it is colored using a ColorBrewer palette. You can see

# the available palettes here:

# https://www.r-graph-gallery.com/38-rcolorbrewers-palettes_files/figure-html/thecode-1.png

# Save the result as `plot1` (replacing the previous variable).

# Using ggplot, create a facet of the permits by year using `PermitClass` as your facet. This

# should result in a bar chart of the number of permits per year for each unique value of

# `PermitClass`. The facets should all be in a single plot that you should store as `plot2`.

Mohd · Accepted Answer

Answer Attached Below:

Data files# Read in the Building_Permits.csv file in the data/ folder and store it in a data frame named# `permits`. This file contains building permits from the City of Seattle. See the link below#...

Answer To: Data files# Read in the Building_Permits.csv file in the data/ folder and store it in a data frame...

Answer To This Question Is Available To Download

Related Questions & Answers

Submit New Assignment