Programming language R ###############################################################################
### Assignment 3: Visualization ###
### Visual analysis of World Bank Indicators
###
### Below each prompt in the file, write the code necessary/indicated to generate
### the required plots. See the assignment page on Canvas for details.
###############################################################################
##### PART 1: Loading and Understanding the Data #####
### In this section you will load the data and necessary packages
# For this assignment, you'll be working with the database of World Bank Development
# Indicators. You can explore also explore all of this data online at
# https://data.worldbank.org/products/wdi
#
# For this assignment you will be using a separate R package called `wbstats` that
# will let you use basic functions to "access" the indicator database. (Note that
# you can look at the package documentation at
# https://cran.r-project.org/web/packages/wbstats/wbstats.pdf to see all of the
# different functions and options it offers as well as how to use them. This
# documentation is in the standard R package doc format, so is worth viewing at least
# once to have a sense for what these look like!
install.packages("ggplot2")
install.packages("dplyr")
# [1pt] Start by installing and loading (with library()) the `wbstats` package.
# DO NOT include the install.packages() command in your script file.
library(wbstats)
library(ggplot2)
# Also load in other required packages (`ggplot`, `dplyr`, etc) here.
# You can alternatively just load the whole `tidyverse` package
# Do not include any `install.package()` calls in this file
library(tidyverse)
library(dplyr)
# [2pt] The World Bank organizes data about countries into a different indicators
# (measures). For example: "Total Population" is an indicator, as is "Individuals
# using the Internet (% of population)". You can view a complete list of the
# indicators on the World Bank's website at https://data.worldbank.org/indicator?tab=all
# (you will be using this website repeatedly in this assignment).
#
# Get a listing of the available indicators by calling the package-provided
# `wb_indicators()` function, which returns a data frame of information about them.
# Print out the number of rows in this data frame to see how many indicators there
# are (and this listing is missing a few!) Also inspect the data frame (such as
# using View()) to see what information is about about each indicator.
#
# IMPORTANT: notice that each indicator contains an "Indicator ID", a special code
# used to refer to that indicator. This is because the names are so long and
# complex, so the World Bank uses codes to refer to each piece of data. Instead of
# "Individuals using the Internet...", you'd refer to indicator IT.NET.USER.ZS.
# In general, you will be using these IDs as identifiers, rather than the full text
# of the indicator's title.
list_assignment <->->
# [2pt] You can find the codes for different indicator on the World Bank's website.
# You can visit the full list of indicators at https://data.worldbank.org/indicator?tab=all
# and click on each one to get more information on that indicator (including seeing
# a sample visualization). You can find the indicator iD by clicking the "Details"
# button, or by looking in the URL (it's part of the path).
# See the `examples/indicators.png` file in this project for an example.
#
# Using this website, find the indicator ID for the "CO2 emissions (kt)" indicator.
# In a comment below, state the ID for this indicator to show that you looked it up.
#
# (It's also possible to look up indicator codes by using the provided `wb_search()`
# function, but it can be a bit less reliable than checking the website (and requires
# regular expressions to use well).
#EN.ATM.CO2E.KT
# [3pt] Once you've identified an indicator of interest and its ID, you can use
# use the `wbstats` package to access the data for that indicator. You get data
# from the World Bank by using the `wb_data()` function. This function expects at
# least two (named) arguments: `country` which should be a character vector of
# countries to get data on (with a few special options), and `indicator` which
# should be a character vector of indicator IDs to access. For example, you can
# get % Internet Users data for all countries with the following:
#
# wb_data(country = "all", indicator = c("IT.NET.USER.ZS"), mrv = 1, lang = "en")
#
# The `mrv = 1` argument ("most recent value") says to get just a single year's
# worth of data (the most recent year). It's also possible to give a specific
# range of years; see the `wbstats` documentation for details.
#
# Using the `wb_data()` function, get a data frame of the "CO2 emissions (kt)"
# for all countries for the 1 most recent year (use "countries_only" as the `country`
# argument to just get countries and not aggregations). Save (all) the data for
# the top 10 countries with highest carbon emissions in a data frame called
# `top_10_co2_countries`. You will need to do some light data wrangling to choose
# only these 10 rows.
#
# Note that for all data wrangling in this assignment, you can either use `dplyr`
# functions, base R syntax (dollar signs and brackets), or a mix of both. I
# strongly recommend you use `dplyr` primarily, but do what seems simplest and
# makes sense to you.
top_10_co2_countries <- wb_data(country="countries_only" ,="" indicator="c(" en.atm.co2e.kt"),"="" mrv="1," lang="en" )="" %="">%->
filter(!is.na("EN.ATM.CO2E.KT")) %>%
arrange(desc("EN.ATM.CO2E.KT")) %>%
head(10)
###############################################################################
##### PART 2: CO2 Emissions by Country #####
### In this section you will generate a bar chart of the total CO2 emissions of
### the top-10 countries with the highest emission levels
###
### You can see an example of this plot in `examples/top_10_co2_plot.png`
###
### The instructions below have multiple steps as a single comment; it is up to
### you to organize your code below that.
###
### Throughout this assignment, you are welcome to adjust the styling of the plots
### (e.g., make text different sizes, use different colors, etc), so long as you
### maintain the *effectiveness* and *expressiveness* of the plots.
# [2pt] Use the `ggplot2()` function to create the plot. The data will be your
# `top_10_co2_countries` from the previous section.
#
# [4pt] You will need to use column geometry (https://ggplot2.tidyverse.org/reference/geom_bar.html)
# to create the chart. The country's ISO3 code (the three-letter code used to
# refer to that country, such as "USA" or "IND") will go on the x-axis, and the
# emission amount will go on the y-axis.
# You can use the `reorder()` function to "sort" the country ISO3 codes (a factor,
# the first argument) by the indicator value column (the second argument), and
# then use that sorted list as the aesthetic mapping. See
# https://www.r-graph-gallery.com/267-reorder-a-variable-in-ggplot2.html#reorder
# for an example
ggplot(data = top_10_co2_countries, mapping = aes(x = reorder(iso3c, EN.ATM.CO2E.KT), EN.ATM.CO2E.KT)) +
geom_col()
#
# [2pt] Use the `labs()` function to specify the title and axis labels for your
# chart. The title should be "Top 10 Countries by CO2 Emissions", the x-axis
# should be labeled "Country (iso3)", and the y-axis should be labeled with the
# complete indicator name.
# Optionally, you can effectively adjust the formatting of the numbers on the
# y-axis using the scales package (https://scales.r-lib.org/). This will let you
# put e.g., commas in the large numbers. Note that this will also involve
# specifying a scale for youur plot!
#
# [1pt] Once completed, save your plot in a variable called `top_10_c02_plot.`
# Note that you can print() out this variable in order to see the plot generated
# when you run your script.
###############################################################################
##### PART 3: US Income Equality over Time #####
### In this section you will generate a line chart showing the change in income
### inequality (in the USA) over time.
###
### You can see an example of this plot in `examples/us_inequality_plot.png`
###
### The instructions below have multiple steps as a single comment; it is up to
### you to organize your code below that.
# You'll first need to access and wrangle the data in order to plot it. Save the
# wrangled data in a data frame called `us_income_years`.
#
# [2pt] Use the `wb_data()` function to access data for the following 3 indicators
# for the country "USA" (you'll need to look up their IDs):
# "Income share held by highest 10%", "Income share held by lowest 20%", and
# "Income share held by second 20%". Get the 20 most recent years worth of data.
us_income_years <- wb_data(country="USA" ,="" indicator="">->
"SI.DST.FRST.20",
"SI.DST.02ND.20"),
mrv = 20,
lang = "en")
# [1pt] You'll need to mutate the data frame and convert the `date` column into a
# numeric value (using `as.numeric()`) so that you can plot it easier.
#
us_income_years <- us_income_years="" %="">%->
mutate('date_number' = as.integer(date))
# [2pt] Also mutate the data frame and create columns for the "wealth of the top 10%"
# (e.g., `wealth_top_10`), and for the "wealth of the bottom 40%" (e.g., `wealth_bottom_40`)
# (which is the lowest and second lowest 20% combined).
us_income_years <- us_income_years="" %="">%->
mutate('wealth_top_10' = SI.DST.10TH.10, 'wealth_bottom_40'= SI.DST.02ND.20 + SI.DST.FRST.20)
#
# [3pt] You'll need to pivot this data into *long* format, gathering the values
# from the two columns ("top 10%" and "bottom 40%") into a single column. This
# will allow you to plot them as two separate lines using a single geometry.
# Optionally, so you can order your legend correctly, you should mutate the long
# data frame to convert the "category" column into a factor (using the `factor()`
# function), with the `wealth_top_10` as the first level.
#
us_income_years_long <- us_income_years="" %="">%->
pivot_longer(cols = c('wealth_top_10', 'wealth_bottom_40')
, names_to = 'Category'
, values_to = 'US_Income') %>%
arrange(desc(Category))
# In the end, your data frame should have 40 rows; 1 for each year-and-category
# (top 10% or bottom 40%).
# You can then create your line plot:
#
# [1pt] The plot will use your us_income_years data frame as a data source.
#
# [5pt] The plot should include both point geometry and smooth line geometry:
# https://ggplot2.tidyverse.org/reference/geom_point.html
# https://ggplot2.tidyverse.org/reference/geom_path.html
# (so you can see the points and the trend—the smoothed trend looks better).
# Each should have the date mapped to the x-axis, the value mapped to the y-axis,
# and the category (top 10% or 40%) mapped to the color.
ggplot_income <- ggplot_income=""><- ggplot(data="us_income_years_long," aes(x="Date_num" ,="" y="us_income_years" ,="" color="Category))">->->
geom_point(size = 1) +
geom_smooth() +
xlab('years') +
ylab('Percentage of income')
#
# [2pt] Specify appropriately detailed title and axis labels for your chart.
# Also use an appropriate *scale function* (for colors that are *discrete*) to
# customize the labels of the color mapping legend and making them readable
# (e.g., "Top 10% of Pop." and "Bottom 40% of Pop.")
#
# [1pt] Once completed, save your plot in a variable called `us_wealth_plot`.
# Note that you can print() out this variable in order to see the plot generated
# when you run your script.
###############################################################################
##### PART 4: Health Expenditures by Country #####
### In this section you will generate a plot showing the amount spent on
### healthcare across "high-income" countries.
###
### You can see an example of this plot in `examples/health_costs_plot.png`
###
### The instructions below have multiple steps as a single comment; it is up to
### you to organize your code below that.
# You will again need to access and wrangle the data in order to plot it. Note
# that this wrangling is more complex than the previous plots. Save your fully-
# wrangled data in a data frame called `health_costs` (though there will be some
# steps before you get there!)
countries <- wb_countries()="" %="">%->
filter(income_level_iso3c == "HIC") %>%
pull(iso3c)
#
# [2pt] You'll first need to a list of "high income" countries to get the data on.
# You can get access to general information about countries by calling the
# `wb_countries()` function. Filter this data frame for countries that are
# "High Income", and then extract (pull) a vector of the ISO3 codes for these
# countries (there should be around 80 of them).
countries <- wb_countries()="" %="">%->
filter(high_income) %>%
pull(iso3c)
#
# [2pt] Use the `wb_data()` function to access data on the following 4 indicators
# for the high-income countries (you will need to look up the indicator IDs):
# - "Current health expenditure per capita (current US$)"
# - "Domestic general government health expenditure per capita (current US$)"
# - "Domestic private health expenditure per capita (current US$)"
# - "Out-of-pocket expenditure per capita (current US$)"
# You should get the 1 most recent year.
#
health_costs <->->
country = countries,
indicator = c(
"SH.XPD.CHEX.PC.CD",
"SH.XPD.GHED.PC.CD",
"SH.XPD.PVTD.PC.CD",
"SH.XPD.OOPC.PC.CD"
),
mrv = 1,
lang = "en"
)
# [3pt] You will need to pivot this data into a *longer* format. You want a *names*
# column (e.g., `indicatorID`) of indicator names, and a *values* column of their values.
# After you pivot, you filter out any countries with `NA` values. The `drop_na()`
# function works great for this.
countries <->->
drop_na()
#
# [2pt] In order to make sure that your chart legend is readable, replace the ID
# codes with understandable text (e.g., "Total Spending", "Government Spending",
# "Private Spending", and "Out of Pocket Costs"). Note that I find it easier to
# do this replacement using base R syntax (bracket notation) than dplyr, as a
# separate set of 4 statements.
#
# [2pt] Additionally, you'll need a separate data frame (e.g., `total_health_costs`)
# of just the "Total Spending" data.
# Once you have your data ready, you can create your plot:
#
# [1pt] Your plot will use the `health_costs` data frame as the primary data source.
#
# [1pt] Your plot will include multiple geometries that will share aesthetics.
# Because of this, you will define your "default" aesthetic mapping as an argument
# to the ggplot() function. You should map the country's `iso3c` code to the x-axis
# (`reorder()` it by value, as you did in the first plot); the indicator value to
# the y-axis; and the indicator name/ID to the color.
#
# [3pt] Your plot's primary geometry will be point geometry. Specify that the
# `shape` of each point will be based on the indicator.
#
# [4pt] Your plot will also need to include lines from the bottom axis to the
# total cost point (for readability). You can do this by adding in a `linerange`
# geometry https://ggplot2.tidyverse.org/reference/geom_linerange.html. This
# geometry should use the `total_health_costs` as its data source, have a minimum-y
# aesthetic of 0 and a maximum-y aesthetic of the value column (which will come
# from the `total_health_costs`).
# Add the `linerange` geom *before* the point geom to have the points appear "on top".
#
# [2pt] Use a scale function to give different colors to the points. I used
# Colorbrewer's "Dark2" palette, but you can choose a different palette (or define
# your own set of colors).
#
# [2pt] Specify appropriately detailed title and axis labels for your chart.
# Remember to also provide an identical label for the color & shape aesthetics
# to style the legend.
#
# [2pt] Finally, use the `theme()` function (https://ggplot2.tidyverse.org/reference/theme.html)
# to specify a "theme" and styling of your plot. In particular, you can set the
# `axis.text.x` to be an `element_text()` value (https://ggplot2.tidyverse.org/reference/element.html)
# with a smaller size and an angle--this will make the labels not overlap each other.
# You can also specify the `legend.position` in order to place the legend somewhere
# else (like in the otherwise blank space. I used `c(.2,.8)` as a position--the
# numbers are the "ratio" of how far along the axis to place the plot).
# Search the documentation and other resources for examples of these (common) adjustments.
#
# [1pt] When completed, save your plot in a variable called `health_costs_plot`.
# Note that you can print() out this variable in order to see the plot generated
# when you run your script.
###############################################################################
##### PART 5: Map: Changes in Forestation around the World #####
### In this section you will generate a choropleth map of the forestry changes
### for each country country plotted on a global map.
###
### You can see an example of this plot in `examples/forested_map_plot.png`
###
### The instructions below have multiple steps as a single comment; it is up to
### you to organize your code below that.
###
### You may create this map using ggplot2 or using the `leaflet` package; use of
### other external mapping packages is not allowed. The below instructions cover
### how to do this using ggplot2.
# As before, you'll first need to wrangle the indicator data you need. Save your
# fully-wrangled data in a data frame called `forest_area`.
#
forest_area <>
wb_data("AG.LND.FRST.ZS", country = "countries_only", mrv = 20)
# [2pt] Use the `wb_data()` function to access data for the "Forest area (% of land area)"
# indicator (you'll need to look up its ID). Get data for "countries_only", and
# data for the most recent 20 years.
#
# [4pt] You'll need to calculate the change in forest area between the earliest
# and most recent years (1999 and 2018). To do this, first spread out (pivot_wider)
# the values--the ISO3 number will be the primary id, the names will come from the
# `date`, and the values will come from the indicator column. Then add a new column
# (e.g., `forest_change`) that is the difference between the 2018 value and the
# 1999 value (`2018 - 1999`).
# Because the column names will be strings that look like numbers (e.g., "1997"),
# it's easier to access the column values using double-bracket notation than using
# dplyr. Alternatively, you can rename the columns for easier access.
#
# [5pt] To make your choropleth map be readable and effective, you won't want to
# try and assign a different color to each of 260 different values (that will be
# a lot of colors and hard to distinguish!) Instead, you should break up (factor)
# the data in a small number of groups (called "bins"). Each "bin" will represent
# a range of values--for example, one bin might represent values from 0-5%, one bin
# values from 5%-10%, and so forth. You will then be able to give each "bin" a
# color, so that your map will only have 5 or 6 different colors representing
# different "levels" or "tiers" of forestry loss, rather than 260 colors.
# In short: you will color by a categorical value, rather than a continuous value!
#
# Use the `cut()` function to create a different column (i.e., `change_labels`) of
# "labels" representing each "bin" of data. This function takes as arguments a
# vector to divide (e.g., `forest_area$change`), and as a vector of breaks--the
# values that should act as dividing lines or "cut-offs" for each bin level.
# For example, you'd use "15%" as a break point to divide the data into bins of
# 0%-15%" and 15%-30%. Also specify a labels argument that is a vector of
# appropriate labels to use for naming each factor level (e.g., `c("0%-5%", "5%-10%")`).
# See https://rpubs.com/pierrelafortune/cutdocumentation (among others tutorials)
# for a more detailed example of using this function.
# You can look at the example image for a good set of "break points"
# You can assign this new factor (the result of the `cut()` function) to an
# additional column in your data frame (e.g., `forest_area$change_as_factor`)
# Once you have the indicator data, you'll need to prepare the map. For ggplot2,
# you'll need a set of polygons which represent each country in the world and
# can form the basis for a geometric object layer.
#
# [1pt] Get a data frame of these polygons by calling the `map_data("world")`
# function provided by ggplot2.
#
# [3pt] However, this data frame only lists countries by name, and country names
# are not standardized across data sets (e.g., different data sets may have
# "United States", "USA", "US", etc). Thus you need to provide the map data a
# three-letter country code (called an ISO3 code) based on their country name.
#
# You can find the ISO3 codes by using the `iso.alpha()` function from the `maps`
# package (https://cran.r-project.org/web/packages/maps/maps.pdf)
# (which you will need to install and load separately, at the top of your file).
# Pass the `iso.alpha()` function the `region` vector of the map data (the country
# names), and a `n = 3` argument to get the three-letter country codes. Mutate the
# map data frame to add a column of ISO3 codes for each country (the value returned
# from the `iso.alpha()` function).
# With the map data in hand, you can combine that with your indicator data to
# create a plottable data frame.
#
# [1pt] *left join* the world map data frame (on the left) to the `forest_area`.
# Join by ISO3 country code. This will create a giant data frame with a copy of
# the indicator value in each point of the polygon.
# Finally, the data is ready so you can create the choropleth map:
#
# [1pt] The data for your plot should be the joined map/forest-area data frame.
#
# [4pt] Use polygon geometry to create the plot. Map the `long` value to the
# x-coodinate, the `lat` value to the y-coordinate, and `group` points together.
#
# [3pt] Map the *fill* (not the color!) to the change value--this will color the
# inside of the polygons, not just the outlines. You must also specify a colorbrewer
# scale for the map fill; I used the "RdYlGn" palette (in reverse order).
#
# [2pt] Your plot should use a map coordinate system, such as `coord_quickmap()`
# https://ggplot2.tidyverse.org/reference/coord_map.html
#
# [2pt] You can easily get rid of the x and y axis labels by including a void theme
# https://ggplot2.tidyverse.org/reference/ggtheme.html
# You'll still need to add a title to the plot.
#
# [1pt] When completed, save your plot in a variable called `world_forest_plot`.
# Note that you can print() out this variable in order to see the plot generated
# when you run your script.
#
# This map will show a lot of data and geometry, so may take a couple seconds to
# generate. Be patient!
# Some countries may be missing values for some indicators or years if the data
# was unavailable. It's okay if these countries are left "blank" in your map).
###############################################################################
##### PART 6: Your Own Plot #####
### In this section you will create a visualization of something that is
### important to _you_ based on the World Bank data.
# For this visualization, you can choose to visualize any information from the
# World Bank data set that you wish. For example, you could visualize differences
# in Internet usage, economic development, or anything else. Look through the
# available indicators for topics that seem like they might be interesting.
# https://data.worldbank.org/indicator?tab=all
#
# Your visualization will need to use at least three "data features" (think: columns).
# This means you'll need to use either multiple indicators, use multiple years
# from a single indicator, or use multiple years from multiple indicators.
# Pro tip: you can often easily produce an "interesting" analysis by taking two
# seemingly unreleated topics and then comparing them to show that the are actually related!
#
# You will almost certainly need to do some light data wrangling to get the
# information you want. Think about what question your visualization will be able
# to answer, and then what data you'll need for that question.
# (But don't overthink this; the goal here is to practice making visualizations,
# not to be a time sink!)
#
# Your visualization must be created with the ggplot2 package. It will need to
# meet the following requirements:
# - At a minimum, it will need to include either 2 simple geometries (points,
# lines, columns) _or_ 1 "complex" geometry (e.g., polygons, hex bins, etc.).
# A visual element such as facets count as a simple geometry (so having a single
# point geometry with facets would be sufficient).
# - It will need to encode three (3) or more features (columns) to different aesthetics
# (e.g., x, y, and color).
# - It needs to include an adjusted scale for at least one of the aesthetics.
# Picking a color palette is sufficient.
# - (You are not required to specify position adjustments or coordinate systems,
# though you are welcome to if you wish)
# - It must include appropriate titles and labels. In particular, make sure that
# any "legend" labels are clear and understandable.
# When your designing your visualization, think about how you can make it both
# effective and expressive.
#
# When completed, save your plot in a variable with a descriptive name (and not
# just `my_plot`). Note that you can print() out this variable in order to see
# the plot generated when you run your script.