Naveen answered on Nov 05 2021
---
title: "Problem Set 5"
author: 'Naveen Kumar, M.Sc.'
# output: html_notebook
output: html_document
---
```{r setup,message = FALSE,cache=FALSE, echo=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
library("tidyverse")
library("nycflights13")
```
## 1 General Social Survey
We will work on the `gss_cat` dataset.
a. Make a bar chart for the `rincome` (reported income) variable. What makes the default bar chart hard to understand?
My first attempt is to use `geom_bar()` with the default settings.
```{r echo=FALSE}
# Default bar chart of reported income
rincome_plot <-
  gss_cat %>%
  ggplot(aes(x = rincome)) +
  geom_bar()
rincome_plot
```
The problem with the default bar chart settings is that the x-axis labels overlap and are impossible to read.
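To see why the labels collide, it can help to list the categories behind the axis. The chunk below is a minimal sketch, assuming only that `gss_cat` is available through the tidyverse; the name `rincome_counts` is introduced here for illustration.

```{r}
library(tidyverse)

# One row per income category, with the number of respondents in each
rincome_counts <- gss_cat %>%
  count(rincome)
rincome_counts

# The factor levels are the strings that ggplot2 prints as axis labels
levels(gss_cat$rincome)
```

There are many levels and several of them are long strings, so they do not fit side by side on a horizontal axis.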
b. Let’s make this bar chart better:
1. remove the rows with value “Not applicable”
2. rename “Lt $1000” to “Less than $1000”
3. (optional) use color to distinguish non-response categories (“Refused”, “Don’t know”, and “No answer”) from income levels (“Lt $1000”, …)
4. add meaningful y- and x-axis titles
5. flip the coordinates
```{r echo=FALSE}
gss_cat %>%
  # 1. remove the "Not applicable" rows
  filter(!rincome %in% c("Not applicable")) %>%
  # 2. give the lowest income level a readable label
  mutate(rincome = fct_recode(rincome,
    "Less than $1000" = "Lt $1000"
  )) %>%
  # 3. flag the non-response categories so they can be colored differently
  mutate(rincome_na = rincome %in% c("Refused", "Don't know", "No answer")) %>%
  ggplot(aes(x = rincome, fill = rincome_na)) +
  geom_bar() +
  coord_flip() +
  scale_y_continuous("Number of Respondents", labels = scales::comma) +
  scale_x_discrete("Respondent's Income") +
  scale_fill_manual(values = c("FALSE" = "black", "TRUE" = "gray")) +
  theme(legend.position = "none")
```
If I were only interested in non-missing responses, then I could drop all respondents who answered "Not applicable", "Refused", "Don't know", or "No answer".
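That alternative could look like the following minimal sketch. The names `non_response` and `rincome_reported` are introduced here for illustration, and the axis titles simply reuse the ones from the chart above.

```{r}
library(tidyverse)

# Categories that do not report an income level
non_response <- c("Not applicable", "Refused", "Don't know", "No answer")

rincome_reported <- gss_cat %>%
  filter(!rincome %in% non_response) %>%
  # drop the now-unused levels so levels() lists only reported incomes
  mutate(rincome = fct_drop(rincome))

rincome_reported %>%
  ggplot(aes(x = rincome)) +
  geom_bar() +
  coord_flip() +
  labs(x = "Respondent's Income", y = "Number of Respondents")
```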
## 2 who data
We will work on the `who` dataset (use the help page to get more details about the data). First, let’s transform the data based on what we did in class to get the `who2` data.
```{r}
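# Make the "newrel" columns follow the same new_<type>_<sexage> pattern as the rest
names(who) <- str_replace(names(who), "newrel", "new_rel")
who2 <- who %>%
  # columns 5 to 60 hold the case counts; stack them into key/value pairs
  gather("codes", "case", 5:60) %>%
  select(-iso2, -iso3) %>%
  separate(codes, c("new", "type", "sexage"), sep = "_") %>%
  select(-new) %>%
  # the first character is the sex, the remainder is the age group
  separate(sexage, into = c("sex", "age"), sep = 1, convert = TRUE)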
```
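The last `separate()` call splits by position rather than by a delimiter. The toy example below is a sketch on made-up input (the tibble `toy` is not part of `who`) that shows what `sep = 1` together with `convert = TRUE` does to codes of this shape.

```{r}
library(tidyverse)

# Made-up codes in the same sex + age-group shape as the `sexage` column
toy <- tibble(sexage = c("m014", "f1524", "m65"))

toy_split <- toy %>%
  # split after the first character; convert = TRUE re-types the new columns
  separate(sexage, into = c("sex", "age"), sep = 1, convert = TRUE)
toy_split
```

Because of `convert = TRUE`, `age` becomes an integer code, so the group "014" is stored as 14 and "1524" as 1524. These are labels for age ranges, not ages in years.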
a. There are many missing values in the `case` variable. We need to think about how missing values are represented in this dataset. The main concern is whether a missing value means that there were no cases of Tuberculosis (TB) or whether it means that the WHO does not have data on the number of TB cases. Check the presence of zeroes in the `case` variable.
```{r echo=FALSE}
# Number of rows where zero cases were explicitly recorded
who2 %>%
  filter(case == 0) %>%
  nrow()
```
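It can also help to put the number of explicit zeros next to the number of missing values. The chunk below is a minimal sketch that starts again from `tidyr::who` so it does not depend on earlier chunks; the names `case_values` and `case_summary` are introduced here for illustration.

```{r}
library(tidyverse)

# Stack every count column of the raw data into one `case` column
case_values <- tidyr::who %>%
  pivot_longer(
    cols = -c(country, iso2, iso3, year),
    names_to = "codes",
    values_to = "case"
  )

case_summary <- case_values %>%
  summarise(
    n_zero    = sum(case == 0, na.rm = TRUE),  # explicit zeros
    n_missing = sum(is.na(case)),              # NA entries
    n_total   = n()
  )
case_summary
```

If `n_zero` is greater than zero, then zero cases are recorded explicitly, which suggests that an `NA` means the WHO has no data rather than that there were no cases.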
b. For each year and sex, compute the total number of cases of TB (remove the missing values), and make the following time series plot of the total number of cases (Note the changes in the legend labels and the scientific...