Lab #3: Distributions, Expected Value, Variance
Jennifer Townsend
April 21, 2017
Introduction:
Read the intro to PS3 and the blurb below, which adds info about using the random sampling methods for distributions:
Suppose you want to sample 4000 times from the binomial distribution (to simulate 4000 repeated samples, each sample recording the number of successes from taking 80 attempts with a probability 1/4 of success on each attempt). We can use therbinom(N,n,p)
function for this.
samples=rbinom(4000,80,1/4) #I'm going to plot the simulated distribution as a histogram, and then overlay the theoretical distribution (pmf). hist(samples,freq=FALSE,xlim=c(0,80),xlab="Number of successes",ylab="Frequency") X=0:80 #(Blue) theoretical distribution points(X,dbinom(X,80,1/4),col="blue")
Task #1: Use appropriate distributions and functions in R
Choose the correct distribution (based on context of the problem) to answer each question– use the built-in R functions of
- Binomial:
dbinom,pbinom,qbinom,rbinom
- Geometric:
dgeom,pbinom,qbinom,rbinom
- Poisson:
dpois,ppois,qpois,rpois
- Negative binomial:
dnbinom,pnbinom,qnbinom,rnbinom
- Hypergeometric:
dhypr,phyper,qhyper,rhyper
In order to learn about the parameters of each of these functions (when needed) temember to use the question mark info command: e.g?qhyper
.
- Create the plot of the cumulative distribution function for theX=#" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">X=#X=#of emails that arrive in a given day, when you expect an average of 11 emails to arrive in your inbox each day.
Also, compute the probability that at most 8 emails arrive in your inbox in a given day.
- You want to simulate an experiment where researchers go up to strangers until they find 5 willing to give them directions, and the number of people asked up to this first “success” is recorded. This is repeated 1000 times. Suppose the probability that each person will give the researchers directions is 0.6. Go through the following steps:
Create 1000 samples that simulate the result of the experiment 1000 times. This will user------
(choose the proper distribution).
Plot the histogram of the simulated experiment values. Then overlay the theoretical distribution, as done for Binomial in the preface.
What is the average of your simulated experiment for the number of people needed to ask?NOTE THIS IS SLIGHTLY DIFFERENT THAN ASKING THE NUMBER OF FAILURES BEFORE 5 SUCCESSES.What is the theoretical expected value (use our table from class!).
What is the probability that someone needs to askmore than12 people to successfully get directions from 5? Your answer should be 3.21%. Show the code that leads to this result, and explain why.
- ConsiderX=#" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">X=#X=#College Grads in a random sample of 100 people from a pool where 300 are college grads and 150 are not.
Plot the pmf, cdf, and quantile function forX" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">XX. Change the titles for each as appropriate.
Answer the following question: theoretically, 90% of samples of 100 people would have at least _______ college grads.
Use simulation by taking a random sample from the appropriate distribution (your resulting vector should represent the # of college grads in different samples of 100 people) and computing the 90th percentile of this vector by using thequantile
function. Your result should be close to b.
The middle 90% of samples would have between _____ and _____ college grads.
Task #2 Finding the best parameter from data.
The built-in data setwarpbreaks
counts the number of yarn breaks that occur in a fixed length of yarn while using a loom.
First, for your own purposes,View(warpbreaks)
. Note that there’s two types of yarn in the data set and three levels of tension. We’re going to focus on one case: wool type A, and medium (M) tension.
To restrict the data set to this case, I’m using thesubset
function on the data frame:
AMYarn=subset.data.frame(warpbreaks, wool=="A" & tension=="M") View(AMYarn)
- Decide what type of discrete distribution might be reasonable as a model for
AMYarn
.
- Use the sample data set to estimate any parameter(s) you need in order to define the distribution, based on the definition of those parameters.
Once you have decided what distribution is appropriate, we’ll plot the cmf from your theoretical distribution on the same plot as an “empirical” cmf– which is just found by findingP(X≤k)" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">P(X≤k)P(X≤k)in the sample data alone. If your distribution is a reasonable choice, the two should have a similar shape, and if your parameter is correct they should be well aligned (centered at the same point).
To create this plot, modify the code below by replacing the discrete uniform distribution (which is wrong) with your choice of a model.
plot(ecdf(AMYarn$breaks),main="CMF of Yarn Break Data",ylab="Cumulative Probabilities", xlab="Number of breaks") #ecdf creates an "empirical" (from a sample) cumulative distribution function. Xvals=0:40 lines(Xvals,punif(Xvals,min=0,max=48),"b",col="blue") # you'll replace "punif( )" in this line
Once you’re happy with your distribution model, you can use the following to check that you chose well (or at least, chose the same as Jen did– no guarantee that those are always equivalent!).
According to Jen’s model, the probability that another trial is run and the yarn breaks 15 times is 1.46%, and the probability that there are at most 30 breaks is 90.42%.
Task 3: Expected Value and Variance in R
- Create
X
: a vector of values from 0:1000 representing the number of trials before the first success when the probability of success for each trial is 0.2. Compute the vectorpX
giving the corresponding pmf values for the correct distribution.
Use basic vector operations and R to compute the expected valueE(X)" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">E(X)E(X). Ensure that this corresponds to our theoretical expected value (use the table).
Use basic vector operations and R to compute the varianceV(X)" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">V(X)V(X). Ensure that this corresponds to our theoretical variance (use the table).
Sample 12000 times from the appropriate distribution. Compute the sample mean viamean()
and the sample variance viavar()
. Compare these to your answers in (i) and (ii) to ensure they are “close”.
Imagine a situation where you are paid$200x+1" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">$200x+1$200x+1if you fail exactlyx" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">xxtimes before succeeding. What is the expected value of your earnings from this situation?
Same situation as above – compute the variance.
Take the sample from (iii) and convert each value into a dollar amount you’d be paid. Then compute the sample mean viamean()
and the sample variance viavar()
. Show these to check they are “close” to your answers in (iv) and (v).
Task 4: Some basic Continuous Distribution stuff
Suppose we have a continuous random variableX" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">XXrepresenting the weight of a watermelon in kilograms.X" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">XXtakes on values in[9,16]" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">[9,16][9,16]with probability density functionf(x)=374x" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">f(x)=374x−−√f(x)=374x.
Show that the total probability is 1. Use the equations in LaTeX to set up your integrals and show your math.
Compute the expected value, using integrals. Show your work in LaTeX equations.
Suppose you decide to estimate the continous random distribution with a discrete one measured every hundredth. You takeX = seq(9,16,0.01)
andfX = 3/74 * sqrt(X)
What issum(fX)
– and why does this throw a red flag?
Define a new vectorpX
to correspond to the values ofX
so thatsum(pX)
is approximately correct.
- You try to computeE(X)" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">E(X)E(X)by taking
sum(fX * X)
orfX%*% X
. What is the result?
Fix the computation above using yourpX
from 3. You should end close to the value from your computation in 2.
Task 5 – More Realistic Bayes Applicaiton – optional, extra credit for 15 lab points.
Suppose that a pond contains a population of trout (population size unknown). You catch and tag 100 trout. A day later, you catch 100 trout again – there are 60 tagged trout in your sample.
Which seems more likely (no computation necessary – just explain your reasoning): that there are more than 200 trout in the pond, or that there are less than 200 trout in the pond?
Compute the probabilityP(60 tagged fish in the sample∣there are 200 total fish in the pond)" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">P(60 tagged fish in the sample∣there are 200 total fish in the pond)P(60 tagged fish in the sample∣there are 200 total fish in the pond).
Now Suppose you know from earlier study that populations of similar ponds take on total population valuesH
With probabilitiespH
defining the probability mass function.
H = 150:300 pH = (H-200)^2/sum((H-200)^2) pFgivenH = dhyper(60,100,H-100,100) pFandH = pFgivenH * pH pHgivenF = pFandH/sum(pFandH) which.max(pHgivenF)
## [1] 14
which.max(pFgivenH)
## [1] 17
Consider as your alternative hypotheses:Hi:" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">Hi:Hi:there are exactlyi" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">iifish in the pond. I want you to try to use Bayes’ Theorem to decide which of these (151) hypotheses are most likely, given that you caught exactly 60 tagged fish in the sample of 100. So you will be asked to computeP(there are i fish in the pond∣60 of the 100 sampled fish were tagged)" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">P(there areifish in the pond∣60 of the 100 sampled fish were tagged)P(there areifish in the pond∣60 of the 100 sampled fish were tagged)fori" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">iibetween 151 and 300.**
The Bayesian priorsP(Hi)" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">P(Hi)P(Hi)are located in in the vectorpH
. What is the index forP(H400)" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">P(H400)P(H400)(The chance of there being exactly 400 fish total in the pond)?
Use thedhyper(x,m,n,k)
command– where x,m,k are known from the problem, and n is determined usingX
, to compute a vector ofP(60 tagged fish in the sample∣there are i fish in the pond)" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">P(60 tagged fish in the sample∣there are i fish in the pond)P(60 tagged fish in the sample∣there are i fish in the pond).
Also usewhich.max()
to identify the index inH
that produces the highest of these values. What number of total fish does this correspond to?
ComputeP(60 tagged fish in the sample∩there are i fish in the pond)" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">P(60 tagged fish in the sample∩there are i fish in the pond)P(60 tagged fish in the sample∩there are i fish in the pond)for eachHi" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">HiHiby using the fact that this equalsP(there are i fish in the pond)⋅P(60 tagged fish in the sample∣there are i fish in the pond)" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">P(there are i fish in the pond)⋅P(60 tagged fish in the sample∣there are i fish in the pond)P(there are i fish in the pond)⋅P(60 tagged fish in the sample∣there are i fish in the pond).
ComputeP(60 tagged fish in the sample)" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">P(60 tagged fish in the sample)P(60 tagged fish in the sample)by taking the sum of all the probabilities in (iv).
ComputeP(there are i fish in the pond∣60 tagged fish in the sample)" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">P(there are i fish in the pond∣60 tagged fish in the sample)P(there are i fish in the pond∣60 tagged fish in the sample)using Bayes’ Theorem and your results from (iv) and (v).
Find the index in the array above which produces the maximal probability. Usewhich.max()
to accomplish this. Then convert this into thei" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">iiso thatHi" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">HiHiis the most likely hypothesis: that is,i" role="presentation" style="display: inline; line-height: normal; word-spacing: normal; overflow-wrap: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0px; min-height: 0px; border: 0px; padding: 0px; margin: 0px; position: relative;">iiis equal to the number of fish that is most likely, given the data that 60 of the sampled fish are tagged.
This will be close to, but not exactly, the value from 4.
[This is called the “maximum likelihood estimate” for the number of fish in the pond. Congrats– you’ve done a very realistic application of Bayes’ Theorem].