DATA 303 Assignment 4
Due: 5:00 PM Friday 26 June 2020

Instructions
• Prepare your assignment using Rmarkdown.
• Submit your solutions in two files: an Rmarkdown file named assignment4.Rmd and the PDF file named assignment4.pdf that results from knitting the Rmd file.
• The YAML header of your Rmarkdown file must contain your name and ID number in the author field, and should have the output format set to pdf_document. For example:

---
title: "DATA 303 Assignment 4"
author: "Ryan Admiraal, 12345678"
date: "26 June 2020"
output: pdf_document
---

• While you are developing your code you may find it easiest to have the output set to html_document, but change it to pdf_document when you submit.
• In your submission, embed any executable R code in code chunks, and make sure both the R code and the output are displayed correctly when you knit the document.
• If there are any R code errors, then the Rmarkdown file will not knit, and no output will be created at all. If you cannot get your code to work but want to show your attempted code, then put error = TRUE in the header of the R code chunk that is failing.

```{r, error = TRUE}
your imperfect R code
```

• Where appropriate, make sure you include your comments on the output within the Rmarkdown document.
• You will receive an email confirming your submission. Check the email to be sure it shows that both the Rmd file and the PDF file have been submitted.

1 Background and Data
Heart disease is the leading cause of death worldwide each year, accounting for more than 25% of deaths in 2016 (World Health Organization 2018). It is also a significant economic burden on the healthcare system: in a study in the United States covering 2000-2005, Nichols et al. (2010) estimated that heart disease and other cardiovascular diseases cost an average of roughly USD 19,000 per patient.
Early detection of heart disease (along with many other diseases) is important for reducing both mortality and costs to the healthcare system. We will examine data on 4,240 participants in the Framingham Heart Study (Boston University and the National Heart, Lung, & Blood Institute 2020), an ongoing study that began in 1948 and has been instrumental in identifying a number of risk factors for heart disease and other cardiovascular diseases. The data are available in the file Heart Disease.xlsx, which can be read into R using the code below, with the path changed to point to the location of the file on your computer. A full list of the variables in the dataset, with descriptions, is provided both here and in the Excel file.

    # Load the "readxl" package to read in data from an Excel file.
    library(readxl)
    # Read in the heart disease dataset.
    hd <- read_xlsx("~/documents/dropbox/courses/data303/data/Heart Disease.xlsx", sheet = "Data", na = "NA")

Table 1: Variables and their descriptions for data contained in the file Heart Disease.xlsx.

Variable   Description
sex        Sex of the individual (0 = “Female”, 1 = “Male”).
age        Age (in years) of the individual at the time of the health exam.
educ       Highest level of education of the individual (1 = “Some high school”, 2 = “High school or graduate equivalency diploma”, 3 = “Some university or vocational school”, 4 = “University”).
smoker     Indicator of whether or not the individual is a current smoker (0 = “No”, 1 = “Yes”).
cig        Average number of cigarettes that the individual smokes each day.
bp_med     Indicator of whether or not the individual is on blood pressure medication (0 = “No”, 1 = “Yes”).
stroke     Indicator of whether or not the individual previously had a stroke (0 = “No”, 1 = “Yes”).
hyper      Indicator of whether or not the individual was hypertensive (0 = “No”, 1 = “Yes”).
diab       Indicator of whether or not the individual is diabetic (0 = “No”, 1 = “Yes”).
chol       Total cholesterol level (in mg/dL).
sbp        Systolic blood pressure (in mmHg).
dbp        Diastolic blood pressure (in mmHg).
bmi        Body mass index.
hr         Resting heart rate (in beats per minute).
gluc       Glucose level (in mg/dL).
hd_risk    Indicator of whether the individual has 10-year risk of future coronary heart disease (0 = “No”, 1 = “Yes”).

Our focus will be on 10-year risk of coronary heart disease (CHD). Ten-year risk of CHD is a predicted risk (i.e., a probability ranging between 0 and 1) of developing CHD within the next 10 years. Although this is not an observed outcome but rather an estimated value, 10-year risk of CHD is a well-established measure in the medical community. We will consider a binary version of this variable, which indicates whether or not a person would be considered at risk of developing CHD within the next 10 years.

2 Assignment Questions

1. Missing data and variable recode: (10 marks)
Although our objective will be to consider inferential and predictive models for 10-year risk of CHD, we will first ensure that we understand aspects of the underlying data as well
as create a new variable that may prove useful in producing comparisons of 10-year risk of CHD for medically meaningful blood pressure ranges. (In practice, we would want to examine each relevant variable to identify extreme observations and be sure that there are not any erroneous values. As this dataset has already been cleaned, we will not do so for this assignment.)

a. (2 marks) First, perform an analysis of the level of missing data for each variable. For only those variables for which there are missing data, produce a table of the form shown below, where variable_i is the name of the variable with missing data, n_i is the count of missing observations for that variable, and p_i is the proportion (to 5dp) of missing observations for that variable. Which variable has the highest level of missing data?

Table 2: Frequency and proportion of missing values for variables with missing data.

Variable         variable_1   variable_2   ...   variable_k
Frequency (n)    n_1          n_2          ...   n_k
Proportion (p)   p_1          p_2          ...   p_k

b. (3 marks) Create a new data frame called hd.complete, which only keeps people/observations that have no missing data. In total, what proportion (to 5dp) of people have been removed from the original dataset to produce this final data frame?

c. (3 marks) Add a variable to the data frame
hd.complete called sbp_cat, which converts systolic blood pressure (sbp) from a numeric variable to a categorical variable according to the blood pressure ranges specified by Madell and Cherney (2018). (See references listed at the end of the assignment.) For the purposes of coding sbp_cat, you can assume that the values for each blood pressure category go to just below that of the next category, as our dataset does not consist of blood pressures that are rounded to the nearest whole number. This means that, for instance, the systolic blood pressure range of 120 – 129 should in fact be interpreted as 120 – < 130. This should produce five levels (i.e., blood pressure ranges) for sbp_cat. (Note that the final level corresponds to systolic blood pressure above 180 mmHg.) Produce a table for sbp_cat which shows how many observations fall into each blood pressure range.

d. (2 marks) Explain when we would expect that using the categorical variable sbp_cat rather than the numeric variable sbp would lead to a better fit for a regression model (whether logistic regression, linear regression, or Poisson regression).

2. Inferential analysis: (25 marks)
Now we will focus on 10-year risk of CHD and look at the role that blood pressure may play in whether or not someone is considered to be at risk of developing CHD within
the next 10 years.

a. (3 marks) We will first consider a logistic regression model of 10-year risk of CHD (hd_risk) on systolic blood pressure (sbp) and diastolic blood pressure (dbp). Previous research suggests that the following variables are potential confounders for the true relationship between blood pressure and 10-year risk of CHD and should also be included in the logistic regression model:
• Sex of the individual (sex)
• Age of the individual (age)
• Highest level of education of the individual (educ)
• Average number of cigarettes smoked per day (cig)
• Total cholesterol level (chol)
• Body mass index (bmi)
• Glucose level (gluc)
For this logistic regression model, calculate the variance inflation factors for the predictors (to 3dp) to determine whether or not there is evidence of significant multicollinearity among the predictors in the model. If so, comment on which predictor(s) should be removed, and use the resulting model for subsequent parts of this question.

b. (3 marks) Using your model from part (a), produce a table of logistic regression model output and write out the estimated logistic regression equation using the form

$$\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k,$$

where you clearly define the variables $x_1, x_2, \ldots, x_k$ and replace $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ with their estimated values
(to 4dp).

c. (6 marks) Carry out Wald tests for the coefficients for
• systolic blood pressure and
• diastolic blood pressure.
For each coefficient, clearly state
i. the hypotheses you are testing,
ii. the value of the test statistic,
iii. the p-value, and
iv. your conclusion in terms of whether the “effect” of the predictor on the response is statistically significant.

d. (3 marks) For any significant Wald tests in part (c), provide a precise interpretation of what the estimated coefficient suggests about the “effect” of the predictor on the response, and calculate a corresponding 95% confidence interval (to 3dp) for the estimated “effect”.

e. (4 marks) A study by Wu et al. (2015) found that “cardiovascular and expanded-cardiovascular mortality risks were lowest when systolic blood pressures were 120 to 129 mm Hg, and increased significantly when systolic blood pressures (SBPs) were ≥ 160 mm Hg. . . .” Although Wu et al. (2015) considered different ranges of systolic blood pressures (< 120, 120–129, 130–139, 140–149, 150–159, ≥ 160 mmHg) than Madell and Cherney (2018), we will use those specified by Madell and Cherney (2018) in investigating whether ranges of blood pressures may differ in terms of associated 10-year risk of CHD. Fit the same model as before, but replace sbp with sbp_cat.
i. Produce a table of logistic regression model output for this model.
ii. Based strictly on p-values, comment on what conclusions you would make for Wald tests based on coefficients for sbp_cat.
(Note that you do not need to state hypotheses or values for test statistics. You simply need to use the p-values to explain what these results mean about comparisons of systolic blood pressure ranges.)
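
For Question 1(a) and 1(b), the missing-data summary and the complete-case data frame can be sketched on a small toy data frame. This is only an illustration under stated assumptions: the data frame toy and its values are hypothetical, the object names (n_missing, p_missing, missing_summary, toy_complete, prop_removed) are invented for this sketch, and it is not presented as the expected solution.

```r
# Toy data frame with hypothetical values (NOT the assignment data).
toy <- data.frame(
  age  = c(39, 46, NA, 61, 46),
  chol = c(195, NA, 245, NA, 285),
  sbp  = c(106, 121, 127.5, 150, 130)
)

# Count and proportion (to 5dp) of missing values for every column.
n_missing <- colSums(is.na(toy))
p_missing <- round(colMeans(is.na(toy)), 5)

# Keep only the variables that actually have missing values,
# laid out with one column per variable as in Table 2.
has_missing <- n_missing > 0
missing_summary <- rbind(
  "Frequency (n)"  = n_missing[has_missing],
  "Proportion (p)" = p_missing[has_missing]
)
print(missing_summary)

# Keep rows with no missing values and compute the proportion of rows dropped.
toy_complete <- toy[complete.cases(toy), ]
prop_removed <- round(1 - nrow(toy_complete) / nrow(toy), 5)
print(prop_removed)
```

The same pattern should carry over to the real data by swapping toy for the data frame read from the Excel file, assuming missing values were read in as NA.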
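
The interval convention described in Question 1(c), where a range such as 120 – 129 is read as 120 up to but not including 130, corresponds to left-closed, right-open intervals. A minimal sketch using base R's cut() on toy values follows. The breakpoints 130 and 140 are an assumption made for illustration and should be checked against Madell and Cherney (2018); only the 120 – 129 range and the final level near 180 mmHg are stated in the assignment text. The toy values are hypothetical.

```r
# Toy systolic blood pressures (hypothetical values, NOT the assignment data).
sbp <- c(118.5, 120, 129.5, 130, 139.9, 140, 179.9, 185)

# ASSUMED breakpoints for illustration; verify against Madell and Cherney (2018).
breaks <- c(-Inf, 120, 130, 140, 180, Inf)

# right = FALSE makes each interval closed on the left and open on the right,
# so 129.5 falls in [120, 130) and 130 falls in [130, 140).
sbp_cat <- cut(sbp, breaks = breaks, right = FALSE)

# Number of observations in each blood pressure range.
print(table(sbp_cat))
```

Note that with right = FALSE a value of exactly 180 lands in the last interval, which may or may not match the intended "above 180" definition. By default cut() returns an unordered factor, so in a regression R would treat the first level as the reference category and each sbp_cat coefficient would compare one range with that reference range.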
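
For Questions 2(c) and 2(d), the standard quantities behind a Wald test and the interpretation of a logistic regression coefficient can be written out symbolically. The notation below (the index $j$, $\mathrm{SE}$ for the estimated standard error, and $\widehat{\mathrm{OR}}_j$ for the estimated odds ratio) is introduced here for illustration and does not appear in the assignment. It is a sketch of the usual large-sample approach, not necessarily the exact form the markers expect.

```latex
% Wald test for a single coefficient \beta_j in the logistic regression model
H_0 : \beta_j = 0 \quad \text{versus} \quad H_1 : \beta_j \neq 0,
\qquad
z = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}

% Estimated odds ratio for a one-unit increase in x_j,
% with the other predictors held fixed
\widehat{\mathrm{OR}}_j = e^{\hat{\beta}_j}

% Approximate (Wald-type) 95% confidence interval for the odds ratio
\left( e^{\hat{\beta}_j - 1.96\,\mathrm{SE}(\hat{\beta}_j)},\;
       e^{\hat{\beta}_j + 1.96\,\mathrm{SE}(\hat{\beta}_j)} \right)
```

Under the null hypothesis, $z$ is compared with a standard normal distribution, and 1.96 is the approximate 0.975 quantile of that distribution.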