Villanova University: STAT4380 Section 2, Fall 2020
Data Science Project One
Instructor: J. Gou
October 1, 2020
Due: October 11, 2020

There are 67 counties in the Commonwealth of Pennsylvania. In November 1682, the first three Pennsylvania counties were formed: Bucks, Chester, and Philadelphia. The Pennsylvania Colony lasted until December 12, 1787, when the State of Pennsylvania was created as the 2nd state. The states bordering Pennsylvania are New York, New Jersey, Delaware, Maryland, Ohio, and West Virginia. Pennsylvania has no counties that are lost, defunct, or extinct.

In this project, we will dig into the history and look for some fun facts about Pennsylvania counties.

Please use R Notebook to write your HTML report. An R Notebook is an R Markdown document with chunks that can be executed independently and interactively. Dr. Yihui Xie wrote a great book on R Markdown, where R Notebook is introduced in Section 3.2: https://bookdown.org/yihui/rmarkdown/notebook.html. You may refer to this section for more details about R Notebook.

1. Visit the Wikipedia page “List of counties in Pennsylvania” and find the table called “County list” at https://en.wikipedia.org/wiki/List_of_counties_in_Pennsylvania. Import this Wikipedia HTML table (except the last image variable “Map”), which shows the county list in Pennsylvania.
• Hint: Consider using R packages XML, htmltab, rvest, etc.

2. Form a data table that includes four variables: County name, Year established, Population, Area. You may either form a new data frame with these four variables from the Wikipedia table, or drop the unnecessary variables from the original table. Name this data frame “PAcounty”. Save County name as text, and save Year established, Population, and Area (in square miles) as numbers. You may need to convert text to numeric. Export the resulting data table in either CSV or Excel format.
• Hint: Consider using R packages readr, readxl, writexl, tibble, dplyr, stringr, etc.
• Hint: When converting text to numeric, the base R function as.numeric() doesn’t always work. You may consider using the function parse_number() from the package readr, or other choices.

    # Parse numbers flexibly
    tmp_text <- "522 sq mi(1 ,352 km)"
    readr::parse_number(tmp_text)

3. Create a new variable “density” (per square mile), where
    density = Population / Area
Add this variable as the fifth variable in your data frame “PAcounty”.
• Hint: You may do it directly, or consider using the base R function transform() or the dplyr function mutate().

4. Create a categorical variable by splitting the variable “Year established” into
(a) early: from 1682 to 1776
(b) mid: from 1777 to 1800
(c) late: from 1801 to 1878
You may call this new variable “est3”, as the sixth variable in your data frame “PAcounty”. In 1776, the Continental Congress ratified the declaration by the United States of its independence from the Kingdom of Great Britain. In 1800, Philly lost its place as the nation’s capital to Washington.
Create a frequency table of this categorical variable. Compute the mean population for each category (early, mid and late).
• Hint: You may consider using the base R function cut() to convert numeric to factor.
• Hint: You may consider using another base R function, tapply(), to handle the split-apply-combine paradigm.

5. Visit the Wikipedia page “List of Pennsylvania counties by per capita income” and find the table in the section “Pennsylvania counties ranked by per capita income” at https://en.wikipedia.org/wiki/List_of_Pennsylvania_counties_by_per_capita_income. Import this Wikipedia HTML table, pick the variable “Per capita income”, and add it to your data frame “PAcounty”. You may call this new variable “income”, as the seventh variable in your data frame “PAcounty”. Print out the data frame you create.
• Hint: When combining these two tables, one possible way is to sort both tables by county name first and then put them together. You may consider other, more general methods; for example, you may use SQL via the R package sqldf.

6. Use base R graphics or the R package ggplot2 to generate parallel boxplots that compare the distributions of “per capita income”, with the variable “est3” determining the grouping (early, mid and late).
• Hint: You may consider using graphics::boxplot() or ggplot2::geom_boxplot().

7. Use base R graphics or the R package ggplot2 to generate two frequency histograms. One histogram summarizes the distribution of area; the other summarizes the distribution of population. Create another two relative frequency histograms, and add empirical density functions to them.
• Hint: You may consider using graphics::hist() and graphics::lines(). Another option is ggplot2::geom_histogram() and ggplot2::geom_line().

8. There are 21 counties in New Jersey and 3 counties in Delaware. Visit the Wikipedia pages “List of counties in New Jersey” and “List of counties in Delaware” and find the tables at https://en.wikipedia.org/wiki/List_of_counties_in_New_Jersey and https://en.wikipedia.org/wiki/List_of_counties_in_Delaware. Import these two Wikipedia HTML tables, which show the county lists. Create a data table of counties in Pennsylvania, New Jersey and Delaware. This data table includes five variables: County name, Year established, Population, Area (in mi²), and State. The “State” variable is a categorical variable with three levels: PA, NJ and DE. Name this data frame “panjdecounty”. Print out this data frame and calculate the mean county area for each state (PA, NJ and DE).
• Hint: The variables in the three HTML tables are not exactly the same.

9. Use the R package ggplot2 to create a scatter plot describing the relationship between population and year established. Use three different colors for the three states (PA, NJ and DE).
• Hint: You may consider using ggplot() along with geom_point(). The function qplot() is another option.

10. Pick a state in the United States (except PA, NJ and DE). It can be the state you are from, one you like, one you plan to visit, one where you met a person you respect, or one chosen for any other reason. Find a data set or data table (not necessarily from Wikipedia, and not necessarily about the list of counties). Raise a question you are interested in and make a graph or multiple graphs to present your idea. Briefly describe the graph you make.
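The text-to-number conversion, the density variable, and the three-way split of the establishment year (tasks 2 to 4) can be tried on toy data before touching the real table. The sketch below is only an illustration under stated assumptions, not a reference solution: the four rows in `raw` are invented, the column names are assumed rather than taken from the Wikipedia headers, and `first_number()` is a hypothetical base-R stand-in for readr::parse_number().

```r
# Toy rows standing in for the scraped table (all values are made up).
raw <- data.frame(
  County      = c("Alpha", "Bravo", "Charlie", "Delta"),
  Established = c("1682", "1785", "1800", "1850"),
  Population  = c("1,500,000", "120,000", "45,000", "80,000"),
  Area        = c("134 sq mi (347 km2)", "522 sq mi (1,352 km2)",
                  "800 sq mi (2,072 km2)", "400 sq mi (1,036 km2)"),
  stringsAsFactors = FALSE
)

# as.numeric("1,500,000") returns NA with a warning, so clean the text first.
# Hypothetical helper: drop thousands separators, keep the first number found.
first_number <- function(x) {
  x <- gsub(",", "", x, fixed = TRUE)
  as.numeric(sub("^[^0-9]*([0-9.]+).*$", "\\1", x))
}

PAcounty <- data.frame(
  County      = raw$County,
  Established = as.numeric(raw$Established),
  Population  = first_number(raw$Population),
  Area        = first_number(raw$Area),
  stringsAsFactors = FALSE
)

# Task 3: population density per square mile.
PAcounty <- transform(PAcounty, density = Population / Area)

# Task 4: cut() uses right-closed intervals by default:
# (1681, 1776], (1776, 1800], (1800, 1878].
PAcounty$est3 <- cut(PAcounty$Established,
                     breaks = c(1681, 1776, 1800, 1878),
                     labels = c("early", "mid", "late"))

freq     <- table(PAcounty$est3)                            # frequency table
mean_pop <- tapply(PAcounty$Population, PAcounty$est3, mean) # mean per group
print(freq)
print(mean_pop)
```

Because the intervals are right-closed, a county established in exactly 1776 lands in “early” and one established in 1800 lands in “mid”, which matches the ranges listed in task 4.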
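For combining tables (tasks 5 and 8), a key-based join is one general alternative to sorting both tables and binding them side by side. The following is a minimal sketch with invented data: the county names, areas, and incomes are placeholders, the column names are assumptions, and base R merge() is used here in place of the sqldf approach mentioned in the hint.

```r
# Hypothetical stand-ins for the scraped tables (all values are made up).
pa <- data.frame(County = c("Alpha", "Bravo", "Charlie"),
                 Area   = c(500, 600, 700),
                 stringsAsFactors = FALSE)
income <- data.frame(County = c("Charlie", "Alpha", "Bravo"),
                     income = c(41000, 26000, 36000),
                     stringsAsFactors = FALSE)

# Task 5: merge() matches rows on the shared key, so the row order of the
# two input tables does not matter.
pa <- merge(pa, income, by = "County")

nj <- data.frame(County = c("Delta", "Echo"), Area = c(200, 300),
                 stringsAsFactors = FALSE)
de <- data.frame(County = "Foxtrot", Area = 500,
                 stringsAsFactors = FALSE)

# Task 8: keep only the columns the tables share, tag each with its state,
# then stack the rows.
keep <- c("County", "Area")
stacked <- rbind(
  cbind(pa[keep], State = "PA"),
  cbind(nj[keep], State = "NJ"),
  cbind(de[keep], State = "DE")
)
stacked$State <- factor(stacked$State, levels = c("PA", "NJ", "DE"))

# Mean county area per state.
mean_area <- tapply(stacked$Area, stacked$State, mean)
print(mean_area)
```

With the real tables, the key columns may need cleaning first (for example, one table may write “Chester County” where another writes “Chester”), which is worth checking before the join.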