Stat and R Programming
Penn State STAT 440 Final Exam

Assessment Guideline

Please read the following instructions carefully.

This assessment is take-home and open-book. You must complete this assessment independently, and you cannot seek help from anyone else including, but not limited to, the course staff, classmates, relatives, colleagues, teachers and internet forums. Prior to submission, students can ask for clarifications about exam problems on the Canvas Discussion Page.

To receive credit, you must show your work and/or explain your reasoning clearly, legibly and concisely. Problems marked [A] require you to derive analytic mathematical expressions for the solution, whereas problems marked [C] require you to show computer code and the corresponding output.

There are 3 problems, and the total is 100 points. Problems are not ordered or weighted by difficulty.

Please follow our online submission guidelines in the syllabus. Please submit your work to Canvas by 8 AM EST on 2022-12-14. Late submissions are not graded.

Integrity Agreement

Please complete and add the following agreement at the beginning of your submitted solution. Submissions without the complete agreement are not graded.

I, [Your Printed Name and Penn State User ID], agree to complete this take-home, open-book assessment independently, and I agree not to seek help from anyone else including, but not limited to, the course staff, classmates, relatives, colleagues, teachers and internet forums. I agree not to share any copy of solutions with any person or organization. I agree not to distribute any copy of solutions in any public or private domain. I understand that if I am found to have violated any agreement listed above, I will be subject to disciplinary action including the possibility of failing STAT 440.

Problem 1 (40 points)

Let $X_1, \dots, X_n \mid \theta \overset{\text{iid}}{\sim} \text{Bernoulli}(\theta)$ and let $\theta \sim \text{Uniform}(0, 1)$. Let $S_n \triangleq \sum_{i=1}^{n} X_i$. Suppose we observe that $n = 12$ and $S_n = 4$.

1. [A] Find the maximum likelihood estimator (MLE) $\hat{\theta}_{\text{MLE}}$ by analytically maximizing the likelihood of $\theta$.
2. [A] Find the conditional probability distribution of $\theta \mid S_n$ and compute the conditional expectation $E(\theta \mid S_n)$.
3. [A] Let $Q$ be the quantile function for the conditional probability distribution of $\theta \mid S_n$. Find the probability distribution of $Q(\theta)$.
4. [C] Estimate $E(\theta \mid S_n)$ using importance sampling with the proposal distribution being $\text{Beta}(2, 2)$. Use 100,000 proposal samples and set.seed(440) in your simulation.
5. [A] & [C] Find the shortest possible 95% credible interval for $\theta$.
6. [A] Suppose you want to use Metropolis-Hastings to sample from $\theta \mid S_n$. Suppose you use a symmetric transition kernel, your current position is $\theta_0 = 1/2$, and your proposed position is $\theta^* = 3/5$. What is the probability of accepting this proposal?
7. [A] Suppose you want to use rejection sampling to sample from $\theta \mid S_n$, your proposal distribution is $\text{Unif}(0, 1)$, and you again propose $\theta^* = 3/5$. What is the probability of accepting this proposal?
8. [A] & [C] Find the maximum a posteriori probability estimator $\hat{\theta}_{\text{MAP}}$ by analytically maximizing the density of the conditional probability distribution of $\theta \mid S_n$. Derive a Newton's method algorithm to find $\hat{\theta}_{\text{MAP}}$. Then write your own R code to find $\hat{\theta}_{\text{MAP}}$ with a convergence tolerance of $\epsilon = 0.001$.

Problem 2 (20 points)

Consider a simple random sample $X_1, X_2, \dots, X_n$ from the following two-class mixture model:

$$f(x) = \pi \cdot N(x; \mu_1, \sigma^2) + (1 - \pi) \cdot N(x; \mu_2, \sigma^2),$$

where $N(x; \mu, \sigma^2)$ denotes the density value of the normal distribution with mean $\mu$ and variance $\sigma^2$ at $x$. We assume that $\sigma^2$ is known throughout this problem.

1. [A] We assume that both $\mu_1$ and $\mu_2$ are known and $\pi$ is unknown in this part. Mimic the arguments in our lecture notes and derive an EM algorithm to find the MLE of the unknown parameter $\pi$.
2. [A] We assume that both $\mu_1$ and $\mu_2$ are unknown and $\pi$ is known in this part. We further assume a normal prior $N(0, \sigma^2/\tau)$ on both $\mu_1$ and $\mu_2$, where $\tau > 0$ is also known. Derive a Gibbs sampling algorithm to find the posterior means of $\mu_1$ and $\mu_2$.

Problem 3 (40 points)

The brushtail possum is a marsupial that lives in Australia and New Guinea. Researchers (Lindenmayer et al., Australian Journal of Zoology, 1995 (https://doi.org/10.1071/ZO9950449)) captured 104 of these animals and took body measurements before releasing the animals back into the wild. In this problem we consider two of these measurements: the total length of each possum, from head to tail, and the length of each possum's head. You can download this dataset and see the data format here (https://www.openintro.org/data/index.php?data=possum).

Each possum $i$ provides two measurements $(X_i, Y_i)$, where $X_i$ is the total length (cm) of this possum and $Y_i$ is the head length (mm) of this possum. Now consider the following model:

$$Y_i = \beta_0 + X_i \beta_1 + \epsilon_i,$$

where $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$ and $\sigma^2$ is an unknown scalar. Let $\beta = (\beta_0, \beta_1)^\top$, which is an unknown two-dimensional vector.

1. [C] Compute the sample correlation between the total length (cm) and the head length (mm) across the 104 possums. Create a scatter plot of total length (cm) and head length (mm), and then discuss whether this plot is consistent with the sample correlation.
2. [A] & [C] Derive the least squares estimator of $\beta$, which is denoted as $\hat{\beta}$. Based on the same mathematical operations as you use in your derivations, write your own code to compute $\hat{\beta}$ on this dataset.
3. [C] Use the R built-in function lm to compute $\hat{\beta}$ and the standard errors.
4. [C] Use QR decomposition to compute $\hat{\beta}$ and the standard errors.
5. [A] & [C] Use singular value decomposition to compute $\hat{\beta}$ and the standard errors.
6. [C] Use the R built-in function optim with method BFGS to compute $\hat{\beta}$.
7. [C] Estimate the bias of $\hat{\beta}$.
8. [C] Use non-parametric bootstrap (10,000 replications) to estimate $\beta_1$ and find the corresponding 95% confidence interval. Use set.seed(440) in your simulation.
9. [C] Use parametric bootstrap (10,000 replications) based on the multivariate normal distribution to estimate $\beta_1$ and find the corresponding 95% confidence interval. Use set.seed(440) in your simulation.
10. [C] Use a permutation test to test whether $\beta_1 = 0$ or not.
11. [C] Use leave-one-out and 3-fold cross validations to compare the following two models:

$$\text{Model 1: } Y_i = \beta_0 + X_i \beta_1 + \epsilon_i \quad \text{versus} \quad \text{Model 2: } Y_i = \beta_0 + X_i \beta_1 + X_i^2 \beta_2 + \epsilon_i.$$

Session information