Homework 2

Due: before 12:00 pm (noon) on Tuesday, March 30.

Please do not include your name on your write-up, since these documents will be reviewed by anonymous peer graders. For probability derivations, show your work and/or explain your reasoning. Do not include your raw R code in your write-up unless we explicitly ask for it. You will submit your R script as a separate document from the write-up itself. On Canvas, you will see two assignment pages corresponding to Homework 2: (1) to upload your write-up PDF file and (2) to upload the R script that you used to generate your write-up. Your write-up is what will be peer graded. The R script will not be graded, but you must submit it to receive credit on the write-up.

If you use tables or figures, make sure they are formatted professionally. Figures and tables should have informative captions. Numbers should be rounded to a sensible number of digits (you're at UT and therefore a smart cookie; use your judgment for what's sensible depending on the level of precision that is appropriate for the problem context).

Problem 1 - NHANES

The American National Health and Nutrition Examination Surveys (NHANES) are collected by the US National Center for Health Statistics, which has conducted a series of health and nutrition surveys since the early 1960s. Since 1999, approximately 5,000 individuals of all ages are interviewed each year. For this problem you will need to install the NHANES package in RStudio, which contains a built-in data frame called NHANES.

library(NHANES)
library(mosaic)
data(NHANES)

Part A: Use the bootstrap, with at least 10,000 iterations, to simulate the distribution of mean SleepHrsNight for individuals aged 18-22 (inclusive). Include a histogram of this distribution and report the mean sleep hours for this age group. Optional: how does your sleep compare? (A starter sketch for this part appears below.)

Part B: Now we want to build a confidence interval for the proportion of women we think are pregnant at any given time. Bootstrap a confidence interval with 10,000 iterations. Include in your write-up a histogram of your simulation results, along with a 95% confidence interval for the proportion. (A starter sketch for this part also appears below.) To speed things up, you can use this code to subset the NHANES data frame to one with only women, getting rid of the NA values for our variable of interest (PregnantNow) along the way:

NHANES_women <- NHANES %>%
  filter(Gender == "female", !is.na(PregnantNow))

Problem 2 - Iron Bank

The Securities and Exchange Commission (SEC) is investigating the Iron Bank, where a cluster of employees has recently been identified in various suspicious patterns of securities trading. Of the last 2021 trades, 70 were flagged by the SEC's detection algorithm. Trades are flagged from time to time even when no illicit market activity has taken place. For that reason, the SEC often monitors individual and institutional trading but does not investigate detected incidents that may be consistent with random variability in trading patterns. SEC data suggest that the overall baseline rate of suspicious securities trades is 2.4%.

Are the observed data (70 flagged trades out of 2021) consistent with the SEC's null hypothesis that, over the long run, securities trades from the Iron Bank are flagged at the same baseline rate as that of other traders? Use Monte Carlo simulation (with at least 100,000 simulations) to calculate a p-value under this null hypothesis.
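For Problem 1, Part A, here is a minimal sketch of one possible bootstrap approach, using resample() and do() from the mosaic package (object names like nhanes_young and boot_sleep are just placeholders; this is one way to get started, not the required approach):

# Subset to individuals aged 18-22, dropping missing sleep values
nhanes_young <- subset(NHANES, Age >= 18 & Age <= 22 & !is.na(SleepHrsNight))

# Bootstrap: resample the rows 10,000 times, recomputing the mean each time
boot_sleep <- do(10000) * mean(~SleepHrsNight, data = resample(nhanes_young))

# Histogram of the bootstrap distribution of the mean
gf_histogram(~mean, data = boot_sleep)

# Sample mean sleep hours for this age group
mean(~SleepHrsNight, data = nhanes_young)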
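For Problem 1, Part B, a similar sketch (again assuming mosaic is loaded and using the NHANES_women data frame created above; boot_preg is a placeholder name):

# Bootstrap the proportion of women pregnant at a given time
boot_preg <- do(10000) * prop(~PregnantNow == "Yes", data = resample(NHANES_women))

# Histogram of the bootstrap distribution (do() names this column prop_TRUE)
gf_histogram(~prop_TRUE, data = boot_preg)

# 95% confidence interval from the percentiles of the bootstrap distribution
quantile(boot_preg$prop_TRUE, c(0.025, 0.975))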
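For the Iron Bank problem, a minimal sketch of how the Monte Carlo simulation might be set up (the items you must include in your write-up are listed below):

# Simulate the number of flagged trades out of 2021 under the null rate of 2.4%
set.seed(1)  # for reproducibility
sim_flags <- rbinom(100000, size = 2021, prob = 0.024)

# Probability distribution of the test statistic under the null hypothesis
gf_histogram(~sim_flags)

# One-sided p-value: chance of seeing 70 or more flagged trades under the null
mean(sim_flags >= 70)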
Include the following items in your write-up:
• the null hypothesis that you are testing;
• the test statistic you used to measure evidence against the null hypothesis;
• a plot of the probability distribution of the test statistic, assuming that the null hypothesis is true;
• the p-value itself;
• and a one-sentence conclusion about the extent to which you think the null hypothesis looks plausible in light of the data. This one is open to interpretation! Make sure to defend your conclusion.

Problem 3 - Armfold

A professor at an Australian university ran the following experiment with her students in a data science class. Everyone in the class stood up, and the professor asked everyone to fold their arms across their chest. Students then filled out an online survey with two pieces of information:
1) Did they fold their arms with the left arm on top of the right, or with the right arm on top of the left?
2) Did they identify as male or female?

The professor then asked her students to assess whether, in light of the data from the survey, there was support for the idea that males and females differed in how often they folded their arms with the left arm on top of the right. The survey data indicated that males folded their arms with their left arms on top more frequently. But how much more frequently? And was this just a "small-sample" difference? Or did it accurately reflect a population-level trend?

The data from this experiment are in armfold.csv. There are two relevant variables:
• LonR_fold: a binary (0/1) indicator, where 1 indicates left arm on top, and 0 indicates right arm on top.
• Sex: a categorical variable with levels male and female.
(There's also a third variable indicating which hand the student writes with, but we're not using that here.)

Your task (quite similar to what we did with the recidivism R walkthrough) is to assess support for any male/female differences in the population-wide rate of "left arm on top" folding. Make sure to quantify your uncertainty about how much more often males fold their left arms on top. (That is, it's not enough to just report the estimate for this sample; you have to provide a confidence interval that tells us how we can expect this number to generalize to the wider population. In doing so, you can treat this sample as if it were a random sample from the relevant population, in this case university students.) A starter sketch appears at the end of this problem.

Your write-up should include four sections:
1) Question: What question are you trying to answer?
2) Approach: What modeling approach did you use to answer the question?
3) Results: What evidence/results did your modeling approach provide to answer the question? This might include numbers, figures, and/or tables as appropriate, depending on your approach.
4) Conclusion: What is your conclusion about your question? You will want to provide a short written interpretation of your confidence interval.

Note: for a relatively simple problem like this, each of these four sections will likely be quite short. Nonetheless, these sections reflect a good general organization for a data-science write-up. So we'll start practicing with this organization on a simple problem, even if it seems a bit overkill at first. (It is certainly possible in this case for each of them to be only 1 or 2 sentences long. Although you might feel you need more, and although nobody on our end is breaking out a word counter, it shouldn't be too much longer than that.)
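As a starting point (one of several reasonable approaches; the object names are placeholders), here is a sketch that bootstraps the male-minus-female difference in proportions. Since LonR_fold is a 0/1 indicator, the mean of LonR_fold within each sex is exactly the proportion folding with the left arm on top:

library(mosaic)
armfold <- read.csv("armfold.csv")

# Sample proportion of left-arm-on-top folding, by sex
mean(LonR_fold ~ Sex, data = armfold)

# Bootstrap the difference in proportions (male minus female, since
# diffmean() subtracts in alphabetical order of the group labels)
boot_fold <- do(10000) * diffmean(LonR_fold ~ Sex, data = resample(armfold))

# 95% confidence interval for the population-level difference
quantile(boot_fold$diffmean, c(0.025, 0.975))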
Problem 4 - Ebay

In this problem, you'll analyze data from an experiment run by EBay in order to assess whether the company's paid advertising on Google's search platform was improving EBay's revenue. (It was certainly improving Google's revenue!)

Google Ads, also known as Google AdWords, is Google's advertising search system, and it's the primary way the company made its $162 billion in revenue in fiscal year 2019. The AdWords system has advertisers bid on certain keywords (e.g., "iPhone" or "toddler shoes") in order for their clickable ads to appear at the top of the page in Google's search results. These links are marked as an "Ad" by Google, and they're distinct from the so-called "organic" search results that appear lower down the page. Nobody pays for the organic search results; pages get featured there if Google's algorithms determine that they're among the most relevant pages for a given search query. But if a customer clicks on one of the sponsored "Ad" search results, Google makes money. Suppose, for example, that EBay bids $0.10 on the term "vintage dining table" and wins the bid for that term. If a Google user searches for "vintage dining table" and ends up clicking on the EBay link from the page of search results, EBay pays Google $0.10 (the amount of their bid).

For a small company, there's often little choice but to bid on relevant Google search terms; otherwise their search results would be buried. But a big site like EBay doesn't necessarily have to pay in order for its search results to show up prominently on Google. It always has the option of "going organic," i.e. not bidding on any search terms and hoping that its links are nonetheless shown high enough up in the organic search results to garner a lot of clicks from Google users. So the question for a business like EBay is, roughly, the following: does the extra traffic brought to our site from paid search results—above and beyond what we'd see if we "went organic"—justify the cost of the ads themselves?

To try to answer this question, EBay ran an experiment in May of 2013. For one month, they turned off paid search in a random subset of 70 of the 210 designated market areas (DMAs) in the United States. A designated market area, according to Wikipedia, is "a region where the population can receive the same or similar television and radio station offerings, and may also include other types of media including newspapers and Internet content." Google allows advertisers to bid on search terms at the DMA level, and it infers the DMA of a visitor on the basis of that visitor's browser cookies and IP address. Examples of DMAs include "New York," "Miami-Ft. Lauderdale," and "Beaumont-Port Arthur."

In the experiment, EBay randomly assigned each of the 210 DMAs to one of two groups:
• the treatment group, where advertising on Google AdWords for the whole DMA was paused for a month, starting on May 22.
• the control group, where advertising on Google AdWords continued as before.

In ebay.csv you have the results of the experiment. The columns in this data set are:
• DMA: the name of the designated market area, e.g. New York
• rank: the rank of that DMA by population
• tv_homes: the number of homes in that DMA with a television, as measured by the market research firm Nielsen (who defined the DMAs in the first place)
• adwords_pause: a 0/1 indicator, where 1 means that DMA was in the treatment group, and 0 means that DMA was in the control group.
• rev_before: EBay's revenue in dollars from that DMA in the 30 days before May 22, before the experiment started.
• rev_after: EBay's revenue in dollars from that DMA in the 30 days beginning on May 22, after the experiment started.

The outcome of interest is the revenue ratio at the DMA level, i.e. the ratio of revenue after to revenue before for each DMA. If EBay's paid search advertising on Google was driving extra revenue, we would expect this revenue ratio to be systematically lower in the treatment-group DMAs than in the control-group DMAs. On the other hand, if paid search was not driving extra revenue, we would expect the revenue ratio to be roughly the same in the two groups.
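To get started (a sketch only, with assumed object names; the analysis choices are yours to make and defend), you might compute the revenue ratio for each DMA and bootstrap the treatment-minus-control difference in mean ratios:

library(mosaic)
ebay <- read.csv("ebay.csv")

# Outcome of interest: ratio of revenue after to revenue before, per DMA
ebay$rev_ratio <- ebay$rev_after / ebay$rev_before

# Label the groups so the direction of the difference is explicit
ebay$group <- ifelse(ebay$adwords_pause == 1, "treatment", "control")

# Observed difference in mean revenue ratio (treatment minus control)
diffmean(rev_ratio ~ group, data = ebay)

# Bootstrap a 95% confidence interval for that difference; a clearly
# negative interval would suggest paid search was driving extra revenue
boot_ebay <- do(10000) * diffmean(rev_ratio ~ group, data = resample(ebay))
quantile(boot_ebay$diffmean, c(0.025, 0.975))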