Instructions & Rules
Use this Markdown document as your working copy of the exam, and edit it. Please use output option `pdf_document` or `html_document`.
This test has six questions. Attempt them all. The maximum number of points is 110 (10 extra points).
Some standard writing considerations:
- Replace comments that instruct you to put code with your own code.
- Ensure your plots and output are visible and readable.
- Ensure you’ve typed up an explanation of your answers wherever required.
Format: delete comments and replace with your answers and code. Do not just place code, execute it, and expect the reader to be able to interpret the answer for themselves. Type a sentence saying what you just computed and what the reader should understand.
Name: do not forget to put your name on the exam under the ‘author’ heading.
Submission: your submission must consist of your copy of this Markdown document and a knitted `pdf` file (or save a knitted `html` as a `pdf`). Any other type of submission will receive no credit and no opportunity for a re-submission. Late submissions are not accepted.
Honor code
I will not give or receive information to or from any other persons during this midterm. This document was edited and PDF knitted by me alone.
[TYPE YOUR NAME HERE IN PLACE OF YOUR SIGNATURE]
Getting started
Load the packages you will need for your code to run. Probably you need at least these two, but add others if needed.
(These were used on previous homework assignments, so you should not have to run the command `install.packages("....")`, but do run that first if `library` does not load.)
# library("ggplot2")
library("tidyverse") # includes tibble, ggplot2, dplyr, and more.
In addition, I’d like to ask `R` to print decimal numbers in fixed rather than scientific notation where it can (`scipen` sets the penalty against scientific notation):
options(scipen=2)
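As a rough illustration of what this option does, here is a small sketch on made-up numbers (not the exam data). It assumes only base R: `scipen` is a penalty applied when R chooses between fixed and scientific notation, while the number of significant digits printed is governed by the separate `digits` option.

```r
# Illustrative values only; not taken from the exam data.
x <- 100000

# Save the current settings and start from R's default values.
old <- options(scipen = 0, digits = 7)
default_fmt <- format(x)      # scientific notation is shorter, so R picks it

options(scipen = 2)           # bias the choice toward fixed notation
fixed_fmt <- format(x)

options(digits = 2)           # 'digits' controls significant digits printed
rounded_fmt <- format(3.14159)

options(old)                  # restore the settings saved above
print(c(default_fmt, fixed_fmt, rounded_fmt))
```

The block restores the previous settings at the end, so it does not change how the rest of the document prints.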
Obtaining and Understanding the Data
For this exam, we will be using the cybersecurity breach report data downloaded 2015-02-26 from the U.S. Department of Health and Human Services.
To understand what the data represents, here is some information from the Office for Civil Rights of the U.S. Department of Health and Human Services:
- “As required by section 13402(e)(4) of the HITECH Act, the Secretary must post a list of breaches of unsecured protected health information affecting 500 or more individuals.”
- “Since October 2009 organizations in the U.S. that store data on human health are required to report any incident that compromises the confidentiality of 500 or more patients / human subjects (45 C.F.R. 164.408). These reports are publicly available. Our data set was downloaded from the Office for Civil Rights of the U.S. Department of Health and Human Services, 2015-02-26.”
Load this data set and save it as `cyberData`, using the following code:
cyberData <- read.csv(url("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/HHSCyberSecurityBreaches.csv"))
Data Exploration
Question 1. (10 points)
Check the structure of the data using the `str` command. What type of object is `cyberData`? How many observations are recorded? How many variables are recorded? List all of the types of random variables that are recorded based on the output (e.g., int, float, etc.).
[Your code here. Make this code chunk an actual `r` chunk; or `python` chunk if you're using Python! ]
[Your text answer here.] The data set is stored as a …. with …. rows and …. columns. There are …. observations of …. variables. All of the types of random variables are ….
Question 2. (15 points)
Extract the column `Individuals.Affected` from `cyberData` and obtain the numerical summary of this column.
[Your answer here.]
Note that the data source says:
`Individuals.Affected` is an integer giving the number of humans whose records were compromised in the breach. This is 500 or greater; U.S. law requires reports of breaches involving 500 or more records but not of breaches involving fewer.
Which of the following are TRUE/FALSE?
A. 50% of security breaches affected less than 35779 individuals.
B. The middle 50% of the security breaches affected a range of individuals between approximately 1000 and 7350.
C. The smallest breach affected 500 individuals and the largest 4900000.
[Your discussion here.]
Standardize `Individuals.Affected` and indicate whether there are any outliers.
[Your code here. Make this code chunk an actual `r` chunk; or `python` chunk if you're using Python! ]
[Your text answer here.]
- Make a histogram of `Individuals.Affected`. What conclusion can you draw from the histogram?
[Your code here. Make this code chunk an actual `r` chunk; or `python` chunk if you're using Python! ]
[Your text answer here.]
Question 3. (25 points)
Let us compare the number of affected individuals across some states.
- Extract the subset of the data for Kansas and Arkansas; in other words, the subset of the data for which the `State` column equals `"KS"` or `"AR"`. Name the new dataframe `twoStates`.
Your code here.
[Your answer here]
- Make a boxplot for the column `Individuals.Affected` in `twoStates`, separated by state.
[Your code here.]
What do the boxplots indicate?
[Your discussion here.]
- Add a third state to the dataframe, say, Illinois (i.e., where `State == "IL"`). Name the new dataframe `threeStates`. What happens to the boxplot of `Individuals.Affected` split across the three states?
[Your code here.]
[Your discussion here.]
The last plot should leave you wondering if Illinois is special, in that it contains some really large data breaches. Let’s investigate:
- How many observations in `cyberData` represent a cyber security breach that affected 100,000 individuals or more?
[Your code here.]
[Your discussion here.]
- How many of those are in Illinois?
[Your code here.]
[Your discussion here.]
- Remove the rows corresponding to breaches affecting 100,000 individuals or more from `threeStates`, and re-do the boxplot from above.
[Your code here.]
[Your discussion here.]
Small analyses across time
Let us now compare attacks before and after 2013. The goal is to see if there is a significant difference in the mean number of affected individuals.
Question 4. (10 points)
Check the type of the `Breach.Submission.Date` column: is it numeric? What type is it?
[Your code here.]
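Let us change it to a numeric and extract the year only. The code that does this is `as.numeric(format(as.Date(.....),"%Y"))`. Let us use this code to break up the data into before and after 2013, like this:
before2013 <- subset(cyberData, as.numeric(format(as.Date(Breach.Submission.Date), "%Y")) <= 2013)
after2013 <- subset(cyberData, as.numeric(format(as.Date(Breach.Submission.Date), "%Y")) > 2013)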
How many observations are in each subset of the population?
Type your answer here, and use an in-line r code chunk to write down the number of observations. (Hint: if you are working with the dataframe structure, the command you want to use is `nrow`.)
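A minimal sketch of the year-extraction idiom and of `nrow`, run on a small made-up data frame. The name `toy`, its columns, and its dates are hypothetical and are not the exam data.

```r
# Hypothetical toy data, used only to illustrate the idiom.
toy <- data.frame(
  id = 1:4,
  date = c("2011-03-15", "2013-12-31", "2014-01-02", "2015-02-01"),
  stringsAsFactors = FALSE
)

# Convert the date strings to Date objects, format them as a 4-digit year,
# then convert that year string to a number.
toy$year <- as.numeric(format(as.Date(toy$date), "%Y"))

# Keep the rows whose year is after 2013, then count the rows kept.
late <- subset(toy, year > 2013)
n_late <- nrow(late)
print(n_late)
```

In the text of an R Markdown document, the same count can be reported in-line by writing `` `r nrow(late)` `` inside a sentence.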
Question 5. (30 points)
We’d like to see if there is a difference in the values of `Individuals.Affected` between the two time periods, on average. Obtain a sample of 100 incidents from `before2013` and 50 from `after2013` using `datasetname[sample(1:nrow(datasetname), size = samplesize, replace = FALSE), ]`.
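The row-sampling idiom can be sketched on a small made-up data frame. The names `toy`, `idx`, and `toy_sample` are hypothetical, and the call to `set.seed` is an optional addition that is not part of the exam instructions.

```r
# Hypothetical toy data frame with 20 rows; not the exam data.
toy <- data.frame(id = 1:20, value = seq(10, 200, by = 10))

set.seed(1)  # optional: fixes the random draw so reruns give the same sample

# Draw 5 distinct row indices, then keep those rows and all columns.
idx <- sample(1:nrow(toy), size = 5, replace = FALSE)
toy_sample <- toy[idx, ]

print(nrow(toy_sample))
```

Because `replace = FALSE`, no row can appear twice in the sample. The trailing comma inside the square brackets is what selects whole rows rather than columns.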
- Generate two sample data sets and state the sample size(s). Note: You will get different samples when you rerun your rmarkdown file.
Your code here.
[Your answer here]
- State the population parameters (or function thereof) that you need to estimate in order to determine whether there is a difference.
[Your answer here]
- Using the two samples, construct a 95% confidence interval for determining if there is a difference between the mean number of affected individuals in the two time periods.
Your code here.
Your answer here.
- One statement is “There is a difference in variances of `Individuals.Affected` in the two time periods.” Can you use what you learned in class to reject or not reject the statement? Use the two samples to justify your answer. Here $\alpha = 0.05$.
Your code here.
[Your answer here]
- Based on the above result, construct a hypothesis test for the means of `Individuals.Affected` in the two time periods. Indicate the null and alternative hypotheses. Show your conclusion.
Your code here.
[Your answer here]
Specific type of security breaches
Question 6. (20 points)
- What proportion of data entries in `cyberData` have `Type.of.Breach == "Hacking/IT Incident"`?
Your code here.
Your answer here.
- What proportion of data entries in `before2013` have `Type.of.Breach == "Hacking/IT Incident"`?
Your code here.
Your answer here.
- What proportion of data entries in `after2013` have `Type.of.Breach == "Hacking/IT Incident"`?
Your code here.
Your answer here.
- Can we make the statement “The proportions of data entries in `before2013` and `after2013` with `Type.of.Breach == "Hacking/IT Incident"` are not significantly different”? Justify your answer. Use the significance level $\alpha = 0.05$.
Your code here.
Your answer here.