







Instructions & Rules


Use this Markdown document as your working copy of the exam, and edit it. Please use output option `pdf_document` or `html_document`.




  • Questions and Points


This test has six questions. Attempt them all. The maximum number of points is 110 (10 extra points).




  • Some standard writing considerations:

    • Replace comments that instruct you to put code with your own code.

    • Ensure your plots and output are visible and readable.

    • Ensure you’ve typed up an explanation of your answers wherever required.




  • Format: delete comments and replace with your answers and code. Do not just place code, execute it, and expect the reader to be able to interpret the answer for themselves. Type a sentence saying what you just computed and what the reader should understand.


  • Name: do not forget to put your name on the exam under the ‘author’ heading.


  • Submission: your submission must consist of your copy of this Markdown document and a knitted `pdf` file (or save a knitted `html` as a `pdf`). Any other type of submission will receive no credit and no opportunity for a re-submission. Late submissions are not accepted.


Honor code



I will not give or receive information to or from any other persons during this midterm. This document was edited and PDF knitted by me alone.




[TYPE YOUR NAME HERE IN PLACE OF YOUR SIGNATURE]




Getting started


Load the packages you will need for your code to run. Probably you need at least these two, but add others if needed.



(These were used on previous homework assignments, so you should not have to run the command `install.packages("....")`, but do run that first if `library` does not load.)



# library("ggplot2")
library("tidyverse") # includes tibbles, ggplot2, dplyr, and more.
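The "install only if `library` does not load" advice above can be sketched as a small helper. The function name `load_or_install` is hypothetical and not part of the exam template; it relies only on the base R functions `requireNamespace`, `install.packages`, and `library`.

```r
# Hypothetical helper (not part of the exam template): attach a package,
# installing it first only when it is not already available.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # runs only when the package is missing
  }
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call never triggers an installation.
load_or_install("stats")
```

In the exam itself a plain `library("tidyverse")` call is enough if the package is already installed.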


In addition, I’d like to ask R to print decimal numbers with 2 digits:



options(scipen=2)
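For readers unsure what this setting changes: in base R, `scipen` is a penalty that makes fixed notation more likely than scientific notation, while the number of significant digits is controlled separately by `digits`, and `round` controls decimal places. The following is a minimal sketch on made-up numbers; the variable names are illustrative only.

```r
old <- options(scipen = 2)  # remember the previous settings

# With a positive scipen, fixed notation is preferred for this value.
fixed_notation <- format(1e5)

# Significant digits and decimal places are controlled separately.
two_sig <- format(3.14159, digits = 2)  # 2 significant digits
two_dec <- round(3.14159, 2)            # 2 decimal places

options(old)  # restore the previous settings
```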

Obtaining and Understanding the Data


For this exam, we will be using the cybersecurity breach report data downloaded on 2015-02-26 from the U.S. Department of Health and Human Services.


To understand what the data represents, here is some information from the Office for Civil Rights of the U.S. Department of Health and Human Services:



  • “As required by section 13402(e)(4) of the HITECH Act, the Secretary must post a list of breaches of unsecured protected health information affecting 500 or more individuals.”

  • “Since October 2009 organizations in the U.S. that store data on human health are required to report any incident that compromises the confidentiality of 500 or more patients / human subjects (45 C.F.R. 164.408). These reports are publicly available. Our data set was downloaded from the Office for Civil Rights of the U.S. Department of Health and Human Services, 2015-02-26.”


Load this data set and save it as `cyberData`, using the following code:



cyberData <- read.csv(url("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/HHSCyberSecurityBreaches.csv"))

Data Exploration


Question 1. (10 points)


Check the structure of the data using the `str` command. What type of object is `cyberData`? How many observations are recorded? How many variables are recorded? List all of the types of random variables that are recorded based on the output (e.g., int, float, etc.).



[Your code here. Make this code chunk an actual `r` chunk; or `python` chunk if you're using Python! ]



[Your text answer here.] The data set is stored as a ….. with …. rows and …. columns. There are …. observations of …. variables. All of the types of random variables are ….



Question 2. (15 points)


Extract the column `Individuals.Affected` from `cyberData` and obtain the numerical summary of this column.



[Your answer here.]

Note that the data source says:
`Individuals.Affected` is an integer giving the number of humans whose records were compromised in the breach. This is 500 or greater; U.S. law requires reports of breaches involving 500 or more records but not of breaches involving fewer.




  • Which of the following are TRUE/FALSE?


    A. 50% of security breaches affected less than 35779 individuals.


    B. The middle 50% of the security breaches affected a range of individuals between approximately 1000 and 7350.


    C. The smallest breach affected 500 individuals and the largest 4900000.




    [Your discussion here.]





  • Standardize `Individuals.Affected` and indicate whether there are any outliers.





[Your code here. Make this code chunk an actual `r` chunk; or `python` chunk if you're using Python! ]



[Your text answer here.]




  • Make a histogram of `Individuals.Affected`. What conclusion can you draw from the histogram?



[Your code here. Make this code chunk an actual `r` chunk; or `python` chunk if you're using Python! ]



[Your text answer here.]
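As a reminder of what standardizing means in Question 2, here is a minimal sketch on made-up numbers (not the exam data). The cutoff of 3 standard deviations is an assumption for illustration; use whichever outlier rule was given in class.

```r
# Toy data, not the exam data: fifteen made-up counts with one very large value.
x <- c(500, 650, 700, 820, 900, 1000, 1100, 1250, 1400,
       1600, 1800, 2000, 2500, 3000, 60000)

# z-score: subtract the mean, then divide by the sample standard deviation.
z <- (x - mean(x)) / sd(x)

# scale() performs the same computation and returns a one-column matrix.
z_scale <- as.numeric(scale(x))

# Flag values whose z-score exceeds the chosen cutoff in absolute value.
cutoff <- 3  # assumed rule of thumb, not prescribed by the exam
outliers <- x[abs(z) > cutoff]
```

Note that with very few observations a z-score can never exceed a small bound, so a fixed cutoff only makes sense for reasonably large samples.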



Question 3. (25 points)


Let us compare the number of affected individuals across some states.



  • Extract the subset of the data for Kansas and Arkansas; in other words, the subset of the data for which the `State` column equals `"KS"` or `"AR"`. Name the new dataframe `twoStates`.



Your code here.


[Your answer here]




  • Make a boxplot for the column `Individuals.Affected` in `twoStates`, separated by state.



[Your code here.]

What do the boxplots indicate?




[Your discussion here.]




  • Add a third state to the dataframe, say, Illinois (i.e., where `State == "IL"`). Name the new dataframe `threeStates`. What happens to the boxplot of `Individuals.Affected` split across the three states?



[Your code here.]



[Your discussion here.]



The last plot should leave you wondering if Illinois is special, in that it contains some really large data breaches. Let’s investigate:



  • How many observations in `cyberData` represent a cyber security breach that affected 100,000 individuals or more?



[Your code here.]



[Your discussion here.]




  • How many of those are in Illinois?



[Your code here.]



[Your discussion here.]




  • Remove the rows corresponding to breaches affecting 100,000 individuals or more from `threeStates`, and re-do the boxplot from above.



[Your code here.]



[Your discussion here.]
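The row operations used throughout Question 3 (keeping rows whose `State` matches a set of codes, counting rows above a threshold, and dropping those rows) can be sketched on a toy data frame. All values and the names `toy`, `toyStates`, `n_large`, and `toySmall` are made up for illustration; this is one possible base R approach, not the required one.

```r
# Toy data frame standing in for cyberData (values are made up).
toy <- data.frame(
  State = c("KS", "AR", "IL", "KS", "IL"),
  Individuals.Affected = c(600, 1500, 250000, 5000, 900)
)

# Keep only the rows whose State is one of the listed codes.
toyStates <- toy[toy$State %in% c("KS", "AR"), ]

# Count rows at or above a threshold: TRUE counts as 1 when summed.
n_large <- sum(toy$Individuals.Affected >= 100000)

# Drop the rows at or above the threshold by keeping the complement.
toySmall <- toy[toy$Individuals.Affected < 100000, ]
```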



Small analyses across time


Let us now compare attacks before and after 2013. The goal is to see if there is a significant difference in the mean number of affected individuals.


Question 4. (10 points)


Check the type of the `Breach.Submission.Date` column: is it numeric? What type is it?



[Your code here.]

Let us change it to a numeric and extract the year only. The code that does this is `as.numeric(format(as.Date(.....),"%Y"))`. Let us use this code to break up the data into before and after 2013, like this:



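before2013 <- subset(cyberData, as.numeric(format(as.Date(Breach.Submission.Date), "%Y")) <= 2013)
after2013 <- subset(cyberData, as.numeric(format(as.Date(Breach.Submission.Date), "%Y")) > 2013)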
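To see what the year-extraction expression does step by step, it can be tried on a few made-up date strings first. The vector `dates` below is illustrative and assumes the dates are stored as text in `YYYY-MM-DD` form, which `as.Date` parses by default.

```r
# Made-up date strings in the ISO format that as.Date() parses by default.
dates <- c("2012-03-15", "2013-12-31", "2014-01-02")

# as.Date() parses the text, format(..., "%Y") keeps the year as text,
# and as.numeric() turns that text into a number.
years <- as.numeric(format(as.Date(dates), "%Y"))

# Logical vector that can be used to select rows of a data frame.
is_after_2013 <- years > 2013
```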

How many observations are in each subset of the population?




Type your answer here, and use an in-line r code chunk to write down the number of observations. (Hint: if you are working with the dataframe structure, the command you want to use is `nrow`.)



Question 5. (30 points)


We’d like to see if there is a difference in the values of `Individuals.Affected` between the two time periods, on average. Obtain a sample of 100 incidents from `before2013` and 50 from `after2013` using `datasetname[sample(1:nrow(datasetname), size = sample size, replace = FALSE), ]`.
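The row-sampling pattern above can be tried on a toy data frame before applying it to the real subsets. The data, the draw size of 5, and the call to `set.seed` are assumptions made only so this sketch is small and reproducible.

```r
set.seed(1)  # only so this sketch is reproducible; the seed is arbitrary

# Toy data frame standing in for before2013 (values are made up).
toy <- data.frame(id = 1:20,
                  Individuals.Affected = seq(500, 10000, by = 500))

# Draw 5 distinct row indices, then keep those rows and all columns.
n_draw <- 5
rows <- sample(1:nrow(toy), size = n_draw, replace = FALSE)
toySample <- toy[rows, ]
```

Because `replace = FALSE`, no row can appear twice in `toySample`; without a fixed seed, each run returns a different set of rows.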



  • Generate two sample data sets and state the sample size(s). Note: You will get different samples when you rerun your rmarkdown file.



Your code here.


[Your answer here]




  • State the population parameters (or function thereof) that you need to estimate in order to determine whether there is a difference.



[Your answer here]




  • Using the two samples, construct a 95% confidence interval for determining if there is a difference between mean number of affected individuals in the two time periods.



Your code here.



Your answer here.




  • One statement is “There is a difference in variances of `Individuals.Affected` in the two time periods.” Can you use what you learned in class to reject or not reject the statement? Use the two samples to justify your answer. Here $\alpha = 0.05$.



Your code here.


[Your answer here]




  • Based on the above result, construct a hypothesis test for the means of `Individuals.Affected` in the two time periods. Indicate the null and alternative hypotheses. Show your conclusion.



Your code here.


[Your answer here]



Specific type of security breaches


Question 6. (20 points)



  • What proportion of data entries in `cyberData` have `Type.of.Breach == "Hacking/IT Incident"`?



Your code here.



Your answer here.




  • What proportion of data entries in `before2013` have `Type.of.Breach == "Hacking/IT Incident"`?



Your code here.



Your answer here.




  • What proportion of data entries in `after2013` have `Type.of.Breach == "Hacking/IT Incident"`?



Your code here.



Your answer here.




  • Can we make the statement “The proportions of data entries in `before2013` and `after2013` with `Type.of.Breach == "Hacking/IT Incident"` are not significantly different”? Justify your answer. Use the significance level $\alpha = 0.05$.



Your code here.



Your answer here.






Nov 08, 2021