Lab 11. Logistic Regression

Introduction

Logistic regression uses one or more numeric variables to predict the probability of a binomial y variable - e.g. does the price of a shirt predict whether people will (1) or won't (0) buy it, does the amount of water you give a parrot every day determine whether it will (1) or will not (0) curse at you, etc. In this lab you will walk through interpreting a logistic regression with a fake data set, and then you will use a real dataset to solve the mystery of why so many people in Florida let their pet reptiles go loose in the wild. Well, okay, the United States as a whole, but like... it's mostly a Florida problem TBH.

Learning Outcomes

By the end of this lab you should be able to:

· Use glm() to make a binomial regression
· Use predict() and round() to predict on new data
· Use the confusionMatrix() function from the caret package to determine the accuracy of your predictions
· Determine if a continuous variable can predict a binomial variable
· Interpret the meaning of a positive or negative slope in a logistic regression
· Find good statistical talking points to yell at the pet trade industry?

Part 1: Fake Data

Part 1.1. Make The Data

For a logistic regression we need to make two groups - one that is a "positive" result (1) and one that is a "negative" result (0). We also need some sort of predictor x variable.

positive <- data.frame(y = 1, x = rnorm(n = 50, mean = 50, sd = 3))
negative <- data.frame(y = 0, x = rnorm(n = 50, mean = 42, sd = 3))
together <- rbind(positive, negative)

Let's take a look at them first, using a density diagram. Use the as.character() function to remind R that 0/1 is a category, and the alpha = .7 argument to make things see-through.

library(ggplot2)
ggplot(together, aes(x = x, fill = as.character(y))) +
  geom_density(alpha = .7)

This data could be anything, but you can see pretty clearly that the 0 and 1 categories are different. Some examples of what this data could be:

· People who get paid more are more likely to be happy (1) than unhappy (0).
· Reptiles that are bigger are more likely to be released to the wild (1) than kept forever (0).
· People with longer hair are more likely to be hippies (1) than not hippies (0).
· Greater amounts of vitamin D intake during the winter are more likely to make you happy (1) than unhappy (0).

In this sense, a logistic regression is very much like a t-test, but instead of saying "these are different" you're asking "can I use x to predict these categories?"

Part 1.2 Plot The Data As A Bivariate Distribution

You can also use ggplot to view this data as a scatterplot just like you would otherwise, but to use the geom_smooth() function you will need to do a little bit of manipulation. Specifically, you are using the glm() function to build this model. This is a function that is similar to lm() but more "general" - hence the name generalized linear model. Because glm() can take more arguments, you have to specify that this is a binomial model, where y only has two options.

ggplot(together, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = "binomial"))

Part 1.3 Model Building

Now you can see that there is a relationship! Let's build a model to test that. You can use the glm() function for real here, making sure to specify that this is a binomial model. Just like before, you can use the summary() function to get more information.

model <- glm(y ~ x, data = together, family = binomial)
summary(model)

## 
## Call:
## glm(formula = y ~ x, family = binomial, data = together)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.10460  -0.05077   0.00001   0.07536   1.79695  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -66.5614    17.4818  -3.807 0.000140 ***
## x             1.4596     0.3832   3.809 0.000139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.629  on 99  degrees of freedom
## Residual deviance:  23.478  on 98  degrees of freedom
## AIC: 27.478
## 
## Number of Fisher Scoring iterations: 8

Now we can see that there is a statistically significant relationship - the p value for x is very small. Additionally, the slope for x is positive - that means that as x increased, so did the probability of y being a 1.
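To make that slope interpretation concrete, here is an optional sketch (not part of the lab's required code) that turns the printed coefficients into predicted probabilities by hand. The x values 46 and 50 are arbitrary picks for illustration, and your exact numbers will differ because the data is random:

b <- coef(model)          # b[1] is the intercept, b[2] is the slope for x
plogis(b[1] + b[2] * 46)  # plogis() is the inverse logit: probability of y = 1 at x = 46
plogis(b[1] + b[2] * 50)  # a bigger x pushes the probability toward 1

Because the slope is positive, the probability of a 1 climbs as x climbs; a negative slope would push it toward 0 instead.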
lm()="" but="" more="" “general”="" hence="" the="" name="" general="" linear="" model.="" because="" glm()="" can="" take="" more="" arguments,="" you="" have="" to="" specify="" that="" this="" is="" a="" binomial="" function,="" where="" you="" only="" have="" two="" options.="" ggplot(together,="" aes(x="x," y="y))+" geom_point()+="" stat_smooth(method="glm" ,="" method.args="list(family="binomial"))" part="" 1.3="" model="" building="" now="" you="" can="" see="" that="" there="" is="" a="" relationship!="" let’s="" build="" a="" model="" to="" test="" that.you="" can="" use="" the="" glm()="" function="" for="" real="" here,="" making="" sure="" to="" specify="" that="" this="" is="" a="" binomial="" model.="" just="" like="" before,="" you="" can="" use="" the="" summary()="" function="" to="" get="" more="" information.="" model="">-><- glm(y~x,="" data="together," family="binomial)" summary(model)="" ##="" ##="" call:="" ##="" glm(formula="y" ~="" x,="" family="binomial," data="together)" ##="" ##="" deviance="" residuals:="" ##="" min="" 1q="" median="" 3q="" max="" ##="" -2.10460="" -0.05077="" 0.00001="" 0.07536="" 1.79695="" ##="" ##="" coefficients:="" ##="" estimate="" std.="" error="" z="" value="" pr(="">|z|) ## (Intercept) -66.5614 17.4818 -3.807 0.000140 *** ## x 1.4596 0.3832 3.809 0.000139 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 138.629 on 99 degrees of freedom ## Residual deviance: 23.478 on 98 degrees of freedom ## AIC: 27.478 ## ## Number of Fisher Scoring iterations: 8 Now we can see that there is a statistically significant relationship - the p value for x is very small. Additionally, the slope for x is positive - that means that as x increased, so did y. Part 1.4 How Accurate? You’ll notice that R doesn’t give you an R2 value for logistic regressions. Instead, it is helpful to look at the actual accuracy of the model. To start with, use the predict() function to create a new column of what the model would predict for each x value. You have to add type = "response" here to get a logistic regression result. 
together$predict <- predict.glm(model, newdata = together, type = "response")

You'll notice it hasn't rounded the values. That's fine - it is essentially telling you how certain it is that each row is a 1. You can use the round() function to make a column of rounded values.

together$predict2 <- round(together$predict)

Now, we're going to make a confusion matrix - that's a fancy term for tallying up how many of these predictions were wrong in either direction. To do this, you will need to use the install.packages() function to install caret and e1071. Once installed, load them using library():

library(caret)
library(e1071)

For the confusion matrix, we have to do a little bit of wiggling. The function gets fussy when given what it thinks is the wrong kind of data - it wants factors, not numbers. Use the as.factor() function to get it to stop being silly.

confusionMatrix(data = as.factor(together$predict2), reference = as.factor(together$y))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 47  2
##          1  3 48
## 
##                Accuracy : 0.95
##                  95% CI : (0.8872, 0.9836)
##     No Information Rate : 0.5
##     P-Value [Acc > NIR] : <2e-16
## 
##                   Kappa : 0.9
## 
##  Mcnemar's Test P-Value : 1
## 
##             Sensitivity : 0.9400
##             Specificity : 0.9600
##          Pos Pred Value : 0.9592
##          Neg Pred Value : 0.9412
##              Prevalence : 0.5000
##          Detection Rate : 0.4700
##    Detection Prevalence : 0.4900
##       Balanced Accuracy : 0.9500
## 
##        'Positive' Class : 0
## 

First off, overall accuracy here is pretty high! It has calculated the confidence intervals for you, and the accuracy is between 0.887 and 0.984 (your results will differ slightly). As far as which groups were wrong, look at the little square of values at the top: 2 values were predicted as 0 when they were actually 1, and 3 values were predicted as 1 when they were actually 0.

If you want to sanity-check the confusionMatrix() output, here is a small optional sketch using only base R (nothing here is required by the lab): overall accuracy is just the proportion of rows where the rounded prediction matches the real y, and table() rebuilds the little square of counts by hand.

mean(together$predict2 == together$y)  # should match the Accuracy line above

table(Prediction = together$predict2, Reference = together$y)  # the counts, by hand

Part 1.5 Summarize

If you were asked to summarize this model, you might say something like: The x value was a strong predictor of y, and was statistically significant (p < 0.001). The overall model accuracy was high (0.95), and it seems that this continuous variable has a strong ability to predict the binomial category.
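Before moving on to the real reptile data, note that predict() works the same way on x values the model never saw - that's the "predict on new data" learning outcome in action. A minimal sketch; the data frame name new_data and its x values are invented for illustration:

new_data <- data.frame(x = c(40, 46, 52))                     # three made-up x values
predict(model, newdata = new_data, type = "response")         # probabilities of being a 1
round(predict(model, newdata = new_data, type = "response"))  # rounded to hard 0/1 calls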