Data files# Read in the Building_Permits.csv file in the data/ folder and store it in a data frame named# `permits`. This file contains building permits from the City of Seattle. See the link below#...

1 answer below ยป



Data files














# Read in the Building_Permits.csv file in the data/ folder and store it in a data frame named




# `permits`. This file contains building permits from the City of Seattle. See the link below




# for more info. Be sure set the working directory to the appropriate folder location.




# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr














permits <->














# How many unique values are in the `OriginalCity` column of the dataset? Store the unique values




# in a vector called `unique_cities`. As you can see - there is only one unique city but there are




# multiple variations on how it is stored in the data. (hint: you may want to look into the




# `unique()` function)
























# Suppose we only want information on closed permits that have a valid type description. Filter




# the permits data frame to only include rows where the `StatusCurrent` column is "Issued" and




# the `PermitTypeMapped` is "Building". Store the filtered data frame as `filtered_permits`. It is




# fine to do the filtering across successive steps
























# How many unique values for the `PermitClass` column are in `filtered_permits`? Store the unique




# values in a vector called `unique_permit_classes`.
























# We'll work with this filtered dataset in future exercises and look at the values in the




# `PermitClass` column. Store the entire `filtered_permits` data frame as a file named




# 'filtered_building_permits.csv'. Set `row.names` to FALSE when saving.
























# Read the filtered_building_permits.csv file that you just saved back in and store the data in a




# data frame named `test_read_write`. Ensure that test_read_write has the same dimensions as your




# `filtered_permits`. If it doesn't, you should check that `row.names` was appropriately set




# in the previous step and re-save.














Basic grammar functions














## Note: Use `dplyr` and `tidyr` functions for all exercises in this module. Do not use dollar




## sign ($) or bracket ([]) notation!




library(tidyr)




library(dplyr)




# Read in the data/Building_Permits.csv file and store it in a data frame named `permits`. This




# file contains building permits from the City of Seattle. See the link below for more info. Be sure




# set the working directory to the appropriate folder location.




# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr




permits <->




permits




# Also load in the data/filtered_building_permits.csv file (what you created as part of the last




# exercise in Module 3). Store the data in a data frame called `filtered_permits`.














filtered_permits <->














# We're going to use dplyr to filter the permits data using the same steps as we did previously.




# Specifically, we're going to keep only entries that have a status of 'Issued' in the




# `StatusCurrent` column and have a 'Building' value for the `PermitTypeMapped` column. Use only




# functions from the `dplyr` library to perform the filtering. Store the filtered data in a data




# frame called `filtered_permits_dplyr`. Store the dimensions of the resulting data frame in a




# variable named `dimensions`. Then, print the value of the `dimensions` variable. Also ensure




# that `filtered_building_permits` and `filtered_permits_dplyr` are identical
























# Now, let's create a `Year` column that contains the year that the permit was issued. To do this,




# use the `mutate` dplyr function and extract the first 4 characters of the `AppliedDate` column.




# Add this `year` column to the `filtered_permits_dplyr` data frame.
























# Keep only the rows of the `filtered_permits_dplyr` that have a valid (non-blank) year. To do so,




# filter out all Year values that are `NaN` values (hint: see the `drop_na()` function from tidyverse)




# and then drop all entries that have "" as the Year. Then, select only the following columns to keep




# in the data frame: PermitClass, PermitClassMapped, and Year. Store the resulting data frame as




# `filtered_permits_dplyr`
























# Finally, arrange the `filtered_permits_dplyr` data frame such that the most recent years appear




# first and the earliest years appear last
























# Save the `filtered_permits_dplyr` data frame as "filtered_permits_dplyr.csv". Set `row.names` to




# FALSE. We will be using this data frame in the next exercise.














Pipes














# Now, we're going to filter/mutate/arrange the permits data frame in the same manner that we did




# in the previous exercise. However, instead of performing subsequent calls using dplyr functions,




# we are going to chain together dplyr calls using pipes. In a single pipe, create a




# `filtered_permits_piped` data frame that is identical to the `filtered_permits_dplyr` data frame.




# The steps in the process are:




# 1. FILTER permits such that the `StatusCurrent` column is 'Issued'




# 2. FILTER permits such that the `PermitTypeMapped` column is 'Building'




# 3. MUTATE the data frame to add a `Year` column that contains the first 4 characters of the




# `AppliedDate` column




# 4. FILTER permits such that they have a valid (non-blank) year (removing NaNs and "" entries)




# 5. SELECT only the following columns from the data frame: `PermitClass`, `PermitClassMapped`, and




# `Year`




# 6. ARRANGE the data frame so the most recent years are first and the earliest years are last
























# Ensure that your `filtered_permits_dplyr` and `filtered_permits_piped` data frames are identical.




# You will first need to MUTATE the `Year` column in your `filtered_permits_piped` data frame to




# be an an integer value (hint: use the `as.integer` function). This is because, when saving/loading




# data frames, R will default to storing the data using types that it sees most fit. As an example,




# though when you saved the `filtered_permits_dplyr` data frame, `Year` was a string (we literally




# took a substring to form the column), you can see that it is now a numeric/integer value because




# we filtered, saved, and then re-loaded.
























# Now, you can use the `identical()` function to compare You do not need to save the




# `filtered_permits_piped` data frame














Groups














## Note: Use `dplyr` and `tidyr` functions for all exercises in this module. Do not use dollar




## sign ($) or bracket ([]) notation!














# Read in the data/Building_Permits.csv file and store it in a data frame named `permits`. This




# file contains building permits from the City of Seattle. See the link below for more info. Be sure




# set the working directory to the appropriate folder location.




# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr
























# Filter the `permits` to only include instances where the estimated project cost value




# (`EstProjectCost`) is greater than 0. Also filter to only include new permits for building




# (Hint: use the `PermitTypeMapped` and `PermitTypeDesc` columns). Also, as you did in the




# previous exercise, add a `Year` column by taking the first four characters from the `date.




# `AppliedDate`` column. However, do not filter on the `Year` column as you did previously.




# Store the result in a data frame named `filtered_permits`


































# Group the permits by Year to get the mean estimated project cost for each year. Name the




# mean estimated cost column `MeanEstCost`. Store the results in a data frame named




# `grouped_by_year`
























# Which year after 2010 had the highest mean estimated project cost? Use `dplyr` functions




# and pipes to find the value of interest (Hint: use filter operations). Store the value in




# a data frame with a single row that has `Year` and `MeanEstCost` columns. Name the data




# frame `highest_cost_by_year`.




# Note: if you wanted to extract the `Year` value, you could use dplyr's `pull()` function.




# However, you are not asked to do that here. Keep the result as a data frame with a single




# row
























# Now, create a second grouping to get the mean estimated project cost for each year and




# `PermitClass` as well as the number of records for each year and `PermitClass`




# combination (i.e. group by both simultaneously and summarize by both simultaneously).




# Store the results in a data frame named 'grouped_by_year_and_class`. The data frame




# should have four columns: `PermitClass`, `Year`, `N` (for the number of records) and




# `MeanEstCost`. (Hint: use the `n()` function in your summarize to get counts)
























# For year and permit type combinations that had at least 1000 permits, what was the




# minimum estimated cost? Store the result in a data frame called `min_cost_high_count`




# that has the same four columns as your `grouped_by_year_and_class` data frame.














Joins














## Note: Use `dplyr` and `tidyr` functions for all exercises in this module. Do not use dollar




## sign ($) or bracket ([]) notation!














# Read in the data/Building_Permits.csv file and store it in a data frame named `permits`. This




# file contains building permits from the City of Seattle. See the link below for more info. Be sure




# set the working directory to the appropriate folder location.




# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr
























# Read in the data/zip_code_demos_2000.csv file and store it in a data frame named `zip_codes` This




# file contains demographic information for many of Washington State's zip codes from the year




# 2000.
























# In the `zip_code` data frame, you can see that both the `population` and `population_density`




# columns are stored as character columns while they contain numeric values. This is because of




# the commas present in the data source. Let's clean both columns and convert them to numeric




# values by first replacing the commas with "" characters (you can use functions from the




# `stringr` package to do this or use R's `gsub()` function) and then convert them to numeric




# values using the `as.numeric()` function. Instead of storing the results in a new variable,




# simply update the values of the existing `zip_codes` data frame.
























# Filter the `permits` data frame to only include instances where the `OriginalZip` variable is




# not NA (use tidyverse's `drop_na()` function). Store the filtered data frame as a data frame




# named `filtered_permits`
























# Which zip codes in the `filtered_permits` data frame do we not have information for in the




# `zip_codes` data frame? To answer this question, create a `not_found` vector that contains




# all the zip codes from `filtered_permits` that are not in `zip_codes`. One approach




# to create this variable is to first isolate all unique zip codes in `filtered_permits` and




# then use the `%in%` function to find which of the unique zip codes are NOT in the `zip_codes`




# data frame.
























# Without changing the names of any columns, perform a left join of the `filtered_permits` data




# frame and the `zip_codes` data frame. Have the `filtered_permits` data frame be the "left"




# side of the join. Store the resulting data frame in a variable named `left_joined_data`
























# Now, find which of the rows in the `filtered_permits` data frame were not matched to rows in




# the `zip_codes` data frame. To do so, filter to rows of `left_joined_data` where one of the




# columns from the `zip_codes` data frame is NA (you do not need to check for all columns, just




# pick one and use that). Store the result in a data frame named `not_matched_left_join`
























# Do an inner join of the `filtered_permits` data frame and the `zip_codes` data frame. Store




# the result in a data frame named `inner_joined_data`.
























# Now, we can see if there is any relationship between zip code demographics and the number of




# permits for a given area. Create a data frame named `grouped_by_zip` that is a grouping of




# the `inner_joined_data` data frame by the `OriginalZip` column and then summarized to give




# a column of counts (i.e. the number of permits in each zip code) and the mean population




# density of each zip code














# Is there any relationship that you can see between the number of permits in a zip code and




# the population density of that zip code? You do not need to do any formal tests of this -




# simply doing an "eyeball" test from looking at the data is enough.




#




# From here, we can see how joining multiple data sets would allow us to answer really




# interesting questions about the relationship between building permits and population density.




# Of course, in this instance, it comes with the caveat that the population demographic data is




# a bit older than the permit data. We can also expand this to see the relationship between




# population demographics (age, race, ethnicity, education) and the number of residential




# building permits in an area. We can also look at income and commercial building permits in




# an area. Being able to join data opens doors so we're no longer limited to a single data set!
























Tidyr














filter the `permits` to only include instances where the




# estimated project cost value (`EstProjectCost`) is greater than 0. Also filter to only include




# new permits for building (Hint: use the `PermitTypeMapped` and `PermitTypeDesc` columns). Add




# a `Year` column by taking the first four characters from the date. Store the result in a data




# frame named `filtered_permits`
























# Similar to what you did in a previous exercise: create a grouping to get the number of records




# for each year and `PermitClass` combination (i.e. group by both simultaneously and summarize




# by both simultaneously). Store the results in a data frame named 'grouped_by_year_and_class`.




# The data frame should have three columns: `PermitClass`, `Year`, and `N` (for the number of




# records). (Hint: use the `n()` function in your summarize to get counts)
























# As you can see, the `grouped_by_year_and_class` data frame is a "long" data frame. Create a




# data frame named `grouped_by_year_and_class_wide` that is a wide version of the same data. Use




# tidyr's `pivote_wider()` function to do this. Your "wide" data frame should have a column for the




# Year and then a column for every category of `PermitClass` (8 columns in total). The values of




# the data frame should be counts (i.e. your values for `N`)
























# Now, use the `grouped_by_year_and_class_wide` data frame to recreate the




# `grouped_by_year_and_class` data frame using tidyr's `pivot_longer()` function. The cols to pivot




# should be everything except the Year column (written as "-Year" , note the "-" symbol). The




# column names should go to a column called "PermitClass", and the values should go to a column




# called "N" (to match the orignal data frame). After creating the data frame, filter out rows with




# NA values. Store the results in a data frame named `grouped_by_year_and_class_long`. Check that




# the dimensions of the resulting data frame are the same as the `grouped_by_year_and_class` data frame
























# Lastly, arrange both the `grouped_by_year_and_class` and `grouped_by_year_and_class_long` data




# frames by year and `PermitClass` (the order doesn't matter as long as it is consistent across)




# the two data frames). Ensure that the two data frames are identical by using the `identical()`




# function.
























ggplot basics














# Read in the data/filtered_permits_dplyr.csv file (what was created during the previous module)




# as a data frame named `filtered_permits`. This data should be a filtered version of the permits




# data that has three columns: PermitClass, PermitClassMapped, and Year. The data should have




# 8,238 observations (rows).


































# Group the data by year to get a count of the number of permits for each year. Store the result




# in a variable named `permits_by_year`. You should use dplyr to do this grouping (hint: use




# `group_by()` followed by a `summarize()` call using the `n()` function).


































# Using ggplot, create a bar chart of the number of permits by year. Store the plot as `plot1`.




# After storing, you can show the plot using the `show` function (i.e. `show(plot1)`).


































# Filter the `permits_by_year` data frame such that it only contains Years including and after




# 2010. Store the results in a data frame named `permits_by_year_filtered`. (hint: You can




# simply filter the `permits_by_year` data frame - you do not need to filter and perform the




# group by aggregation on the `permits` data frame, though that is an option.)


































# Using ggplot, create a bar chart of the number of permits by year. Store the plot as `plot1`.




# After storing, you can show the plot using the `show` function (i.e. `show(plot1)`).


































# Now, create a new plot (`plot2`) with axis labels. Name the bottom axis "Year" and the left




# (vertical) axis "Number of Permits". Change the title of the plot to




# "Number of Active Permits by Year".




# You can do make these changes by altering `plot1` (i.e. plot2 <- plot1="" +="" arguments)="">




# than recreating the plot from scratch Be sure the resulting plot is stored as `plot3`. You




# can use the `show` function to view the plot.
























ggplot application














# Read in the data/filtered_permits_dplyr.csv file (what was created during the previous module)




# as a data frame named `filtered_permits`. This data should be a filtered version of the permits




# data that has three columns: PermitClass, PermitClassMapped, and Year. The data should have




# 8,238 observations (rows).


































# As you did in the previous exercise, filter out permits prior to 2010. This time, however,




# instead of grouping by year, group by both year and the permit class to get a count of the




# number of permits for each class for each year. Store the result in a variable named




# `permits_by_year_by_class`. You should use dplyr pipes to do this filtering and grouping.


































# Using ggplot, create a stacked bar chart that shows the number of permits each year with




# colors to indicate the `PermitClass` (see Figure 16.7 in the textbook for an example).




# Name the bottom axis "Year" and the left (vertical) axis "Number of Permits". Save the




# figure in a variable called `plot1`.


































# Change the scale of the graph so it is colored using a ColorBrewer palette. You can see




# the available palettes here:




# https://www.r-graph-gallery.com/38-rcolorbrewers-palettes_files/figure-html/thecode-1.png




# Save the result as `plot1` (replacing the previous variable).


































# Using ggplot, create a facet of the permits by year using `PermitClass` as your facet. This




# should result in a bar chart of the number of permits per year for each unique value of




# `PermitClass`. The facets should all be in a single plot that you should store as `plot2`.





























































Answered Same DayMar 24, 2023

Answer To: Data files# Read in the Building_Permits.csv file in the data/ folder and store it in a data frame...

Mohd answered on Mar 25 2023
36 Votes
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions ยป

Submit New Assignment

Copy and Paste Your Assignment Here