Question 1 (5 points) If we would like to conduct the standard bootstrap approach in the lecture notes for a dataset with $n$ observations, and generate $B$ bootstrap datasets, please prove the statement "On average, each bootstrap dataset makes use of around two-thirds of the $n$ observations as $n$ is large". Please GIVE the necessary mathematical details about how to show this result in the answer sheet.

Hint: $(1 - 1/n)^n \to 1/e$ as $n \to \infty$, and we say $(1 - 1/n)^n$ is around $1/e$ when $n$ is large.
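Before proving the claim, it can be checked numerically. The key quantity is that a fixed observation is missed by all $n$ draws with probability $(1 - 1/n)^n$, so the expected fraction of distinct observations appearing in a bootstrap dataset is $1 - (1 - 1/n)^n \to 1 - 1/e \approx 0.632$, i.e. roughly two-thirds. Below is a minimal simulation sketch (assuming NumPy; the choices of $n$ and $B$ are arbitrary illustrations, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 1000, 2000  # arbitrary illustrative sizes

# For each bootstrap dataset, draw n indices with replacement and
# record the fraction of the n original observations that appear.
fractions = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)        # one bootstrap resample (indices)
    fractions[b] = np.unique(idx).size / n  # fraction of distinct observations

print(f"average fraction used:  {fractions.mean():.4f}")
print(f"exact 1 - (1 - 1/n)^n:  {1 - (1 - 1/n)**n:.4f}")
print(f"limit 1 - 1/e:          {1 - np.exp(-1):.4f}")
```

All three printed values should agree to about two decimal places, matching the "around two-thirds" statement.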


Question 2 (5 points) In a classification problem, suppose we are interested in a pair $(X, Y)$, where $X$ takes values in $\mathbb{R}^d$ and $Y$ takes values in $\{1, 0\}$, with $Y$ the class label of $X$. The conditional distribution of $X$, given that the label $Y$ takes the value $k$, is
$$X \mid (Y = k) \sim f_k \quad \text{for } k = 1, 0,$$
where "$\sim$" means "is distributed as", and where $f_k$ denotes a probability density function. A classifier is a rule that assigns to an observation $X = x$ a guess or estimate of what the unobserved label $Y = k$ actually was. In theoretical terms, a classifier is a function $C : \mathbb{R}^d \to \{1, 0\}$, with the interpretation that $C$ classifies the point $x$ to the class $C(x)$. The probability of misclassification of a classifier $C$ is defined as
$$\mathrm{Err}(C) = \Pr\{C(X) \neq Y\}.$$
In the lectures, we have introduced the Bayes optimal classifier, denoted by $C^*(x)$. The Bayes optimal classifier is a useful benchmark in statistical classification, due to the statement "The Bayes optimal classifier has the smallest probability of misclassification, namely $\mathrm{Err}(C^*) \leq \mathrm{Err}(C)$ for all classifiers $C$". In fact, showing the above statement is not easy. Instead, please show the following inequality
$$\Pr\{C^*(X) \neq Y\} \leq \Pr\{C^*(X) = Y\}$$
in the answer sheet, whose interpretation is that the Bayes optimal classifier always has a smaller probability of misclassification than of correct classification. Please GIVE the necessary mathematical details about how to show this result in the answer sheet.

Hint: The probability equality
$$\Pr(A_1 \cup \cdots \cup A_K) = \sum_{k=1}^{K} \Pr(A_k)$$
for disjoint sets $A_k$, $k = 1, \cdots, K$, may be useful.
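One way to organize the argument (a sketch only; the posterior notation $\eta(x) = \Pr(Y = 1 \mid X = x)$ is introduced here for convenience and is not part of the question) is to condition on $X$ and then use the hinted disjoint-union identity:

```latex
% Sketch: let \eta(x) = \Pr(Y = 1 \mid X = x), so that C^*(x) = 1 iff \eta(x) > 1/2.
% The hint applies to the disjoint decomposition
%   \{C^*(X) \neq Y\} = \{C^*(X) = 1, Y = 0\} \cup \{C^*(X) = 0, Y = 1\}.
% Conditionally on X = x, the classifier C^*(x) is deterministic, so
\begin{align*}
\Pr\{C^*(X) \neq Y \mid X = x\}
  &= \min\{\eta(x),\, 1 - \eta(x)\} \\
  &\leq \tfrac{1}{2}
   \leq \max\{\eta(x),\, 1 - \eta(x)\}
   = \Pr\{C^*(X) = Y \mid X = x\}.
\end{align*}
% Averaging both sides over the distribution of X preserves the inequality.
```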
Question 3 (14 points) Following Question 2, as a special case, we suppose that the two classes of $X$ are distributed as $d$-dimensional multivariate Gaussians, namely
$$X \mid (Y = k) \sim N(\mu_k, \Sigma) \quad \text{for } k = 1, 0, \qquad (1)$$
respectively, differing only in their mean vectors. In addition, let us assume that the two classes are equally likely a priori. Let
$$\Psi(x) = \left\langle \mu_1 - \mu_0,\; \Sigma^{-1}\left(x - \frac{\mu_1 + \mu_0}{2}\right) \right\rangle, \qquad (2)$$
where $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product in $\mathbb{R}^d$, namely
$$\Psi(x) = (\mu_1 - \mu_0)^\top \Sigma^{-1}\left(x - \frac{\mu_1 + \mu_0}{2}\right).$$

(a) Please show that the Bayes optimal classifier in this case is
$$C^*(x) = \begin{cases} 1, & \text{if } \Psi(x) > 0; \\ 0, & \text{if } \Psi(x) \leq 0. \end{cases}$$

(b) As mentioned in Question 2, we can evaluate the quality of the above decision rule $C^*$ by computing the probability of misclassification. Please give the expression of $\mathrm{Err}(C^*)$ by using the Gaussian cumulative distribution function $\Phi$, $\mu_1$, $\mu_0$ and $\Sigma$ in the answer sheet, where
$$\Phi(u) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{u} e^{-t^2/2}\, dt.$$
Please GIVE the necessary mathematical details about how to obtain this result in the answer sheet. Hint: See "STAT3008 Review of Matrix Algebra.pdf" and "STAT8002 Multivariate Normal Distribution.pdf" on the Wattle.

(c) In practice, the class conditional distributions are not known, but instead one observes a collection of labeled samples, say $\{X_1, \ldots, X_{n_1}\}$ drawn independently from $N(\mu_1, \Sigma)$, and $\{X_{n_1+1}, \ldots, X_{n_1+n_0}\}$ drawn independently from $N(\mu_0, \Sigma)$. A natural approach is to use these samples in order to estimate the class conditional distributions, and then "plug in" these estimates to estimate the Bayes optimal classifier. In the Gaussian case, estimating the distributions is equivalent to estimating the mean vectors $\mu_1$ and $\mu_0$, as well as the covariance matrix $\Sigma$, and standard estimates are the sample means
$$\hat{\mu}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} X_i \quad \text{and} \quad \hat{\mu}_0 = \frac{1}{n_0} \sum_{i=n_1+1}^{n_1+n_0} X_i,$$
as well as the pooled sample covariance matrix
$$\hat{\Sigma} = \frac{1}{n_1 + n_0 - 2} \left\{ \sum_{i=1}^{n_1} (X_i - \hat{\mu}_1)(X_i - \hat{\mu}_1)^\top + \sum_{i=n_1+1}^{n_1+n_0} (X_i - \hat{\mu}_0)(X_i - \hat{\mu}_0)^\top \right\}.$$
Substituting these estimates into (2) yields the function
$$\hat{\Psi}(x) = \left\langle \hat{\mu}_1 - \hat{\mu}_0,\; \hat{\Sigma}^{-1}\left(x - \frac{\hat{\mu}_1 + \hat{\mu}_0}{2}\right) \right\rangle.$$
Here we have assumed that the sample covariance is invertible, and hence are assuming implicitly that $n_1 + n_0 > d$. Subsequently, the Bayes optimal classifier is estimated by
$$\hat{C}^*(x) = \begin{cases} 1, & \text{if } \hat{\Psi}(x) > 0; \\ 0, & \text{if } \hat{\Psi}(x) \leq 0. \end{cases}$$
Please use the result in (b) to give the expression of $\mathrm{Err}(\hat{C}^*)$ by using the Gaussian cumulative distribution function $\Phi$, $\hat{\mu}_1$, $\hat{\mu}_0$ and $\hat{\Sigma}$ in the answer sheet. Note that this error probability is itself a random variable, since $\hat{\Psi}$ is a function of the samples $\{X_i\}_{i=1}^{n_1+n_0}$.

(d) In the 1960s, Kolmogorov analyzed a simpler version of the classifier, in which the linear function (2) simplifies to
$$\Psi_{\mathrm{id}}(x) = \left\langle \mu_1 - \mu_0,\; x - \frac{\mu_1 + \mu_0}{2} \right\rangle.$$
Working under the Gaussian assumption (1) of this question, he analyzed the behavior of the classifier
$$C_{\mathrm{id}}(x) = \begin{cases} 1, & \text{if } \Psi_{\mathrm{id}}(x) > 0; \\ 0, & \text{if } \Psi_{\mathrm{id}}(x) \leq 0. \end{cases}$$
Please give the expression of $\mathrm{Err}(C_{\mathrm{id}})$ by using the Gaussian cumulative distribution function $\Phi$, $\mu_1$, $\mu_0$ and $\Sigma$ in the answer sheet. Please GIVE the necessary mathematical details about how to obtain this result in the answer sheet. Hint: See "STAT3008 Review of Matrix Algebra.pdf" and "STAT8002 Multivariate Normal Distribution.pdf" on the Wattle.

(e) Working under the Gaussian assumption (1) of this question, we next consider the Naive Bayes classifier $C_{\mathrm{Naive}}(x)$ introduced in the lectures. Please give the expression of $\mathrm{Err}(C_{\mathrm{Naive}})$ by using the Gaussian cumulative distribution function $\Phi$, $\mu_1$, $\mu_0$ and
$$\Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1d} \\ \vdots & \ddots & \vdots \\ \sigma_{d1} & \cdots & \sigma_{dd} \end{pmatrix}$$
in the answer sheet. Please GIVE the necessary mathematical details about how to obtain this result in the answer sheet. Hint: (i) you may need to use some elements $\sigma_{j_1 j_2}$ of the matrix $\Sigma$ in the expression, where $j_1, j_2 \in \{1, \cdots, d\}$; (ii) see "STAT3008 Review of Matrix Algebra.pdf" and "STAT8002 Multivariate Normal Distribution.pdf" on the Wattle; (iii) due to assumption (1) with equal covariance matrix $\Sigma$, the Naive Bayes classifier $C_{\mathrm{Naive}}(x)$ is slightly different from the one introduced in the lecture notes.
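For part (b), a standard route (sketched here under the stated assumptions; the shorthand $\gamma$ is introduced only for this sketch) is to note that $\Psi(X)$ is a linear function of a Gaussian vector, hence itself univariate Gaussian:

```latex
% Sketch for (b): write \gamma^2 = (\mu_1 - \mu_0)^\top \Sigma^{-1} (\mu_1 - \mu_0).
% Conditional on Y = 1, X \sim N(\mu_1, \Sigma), and a direct computation gives
%   \Psi(X) \sim N(\gamma^2/2, \gamma^2);
% conditional on Y = 0, symmetrically, \Psi(X) \sim N(-\gamma^2/2, \gamma^2).
% With equal priors,
\[
\mathrm{Err}(C^*)
  = \tfrac{1}{2}\Pr\{\Psi(X) \le 0 \mid Y = 1\}
  + \tfrac{1}{2}\Pr\{\Psi(X) > 0 \mid Y = 0\}
  = \Phi\!\left(-\tfrac{\gamma}{2}\right).
\]
```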
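The plug-in classifier in (c) is also easy to simulate. The sketch below (assuming NumPy and SciPy; the dimension, mean vectors, covariance, and sample sizes are arbitrary illustrative choices) fits $\hat{\mu}_1$, $\hat{\mu}_0$ and the pooled $\hat{\Sigma}$ from training data and compares the Monte Carlo test error of $\hat{C}^*$ with the closed form from the sketch above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
d, n1, n0 = 3, 200, 200                     # illustrative sizes
mu1, mu0 = np.ones(d), np.zeros(d)          # illustrative mean vectors
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)             # an arbitrary SPD covariance

# Training samples from the two classes.
X1 = rng.multivariate_normal(mu1, Sigma, size=n1)
X0 = rng.multivariate_normal(mu0, Sigma, size=n0)

# Plug-in estimates: sample means and pooled sample covariance.
m1, m0 = X1.mean(axis=0), X0.mean(axis=0)
S = ((X1 - m1).T @ (X1 - m1) + (X0 - m0).T @ (X0 - m0)) / (n1 + n0 - 2)

def psi_hat(x):
    """Estimated discriminant  <m1 - m0, S^{-1}(x - (m1 + m0)/2)>."""
    return (m1 - m0) @ np.linalg.solve(S, (x - (m1 + m0) / 2).T)

# Monte Carlo test error of the plug-in classifier (equal priors).
N = 100_000
Xt1 = rng.multivariate_normal(mu1, Sigma, size=N)   # true label 1
Xt0 = rng.multivariate_normal(mu0, Sigma, size=N)   # true label 0
err = 0.5 * np.mean(psi_hat(Xt1) <= 0) + 0.5 * np.mean(psi_hat(Xt0) > 0)

# Closed form Phi(-gamma/2) with gamma^2 = (mu1-mu0)' Sigma^{-1} (mu1-mu0).
gamma = np.sqrt((mu1 - mu0) @ np.linalg.solve(Sigma, mu1 - mu0))
print(f"plug-in Monte Carlo error: {err:.4f}")
print(f"closed form Phi(-gamma/2): {norm.cdf(-gamma / 2):.4f}")
```

With training samples this large, the plug-in error sits only slightly above the Bayes error, illustrating the "note that this error probability is itself a random variable" remark in (c).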
Question 4 (5 points) Following Question 3, as a special case, we suppose that the dimension of $X$ is $d = 1$. In practice, suppose one observes a collection of labeled samples, say $\{X_1, \ldots, X_{n_1}\}$ drawn independently from $N(\mu_1, \Sigma)$, and $\{X_{n_1+1}, \ldots, X_{n_1+n_0}\}$ drawn independently from $N(\mu_0, \Sigma)$. As a consequence, the observed class labels are $\{Y_1 = 1, \ldots, Y_{n_1} = 1\}$ and $\{Y_{n_1+1} = 0, \ldots, Y_{n_1+n_0} = 0\}$, correspondingly. We additionally assume that $n_1 = n_0 = n$. In the lecture notes, we have considered a linear regression classifier
$$\hat{C}_{\mathrm{lm}}(x) = \begin{cases} 1, & \text{if } \hat{y} > 0.5; \\ 0, & \text{if } \hat{y} \leq 0.5, \end{cases}$$
where $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, and $\hat{\beta}_0$ and $\hat{\beta}_1$ are the ordinary least squares (OLS) estimates obtained by regressing $Y_i$ on $X_i$ for $i = 1, 2, \cdots, (n_1 + n_0)$. Please show that the classifier $\hat{C}_{\mathrm{lm}}(x)$ is approximately the same as the classifier $\hat{C}^*(x)$ in (c) of Question 3 when $n$ is large. (A numerical check of this equivalence is sketched after Question 5 below.)

Hint: using $\hat{\mu}_1$ and $\hat{\mu}_0$ defined in Question 3, and the law of large numbers (LLN), we obtain that $\hat{\mu}_1 \to \mu_1$ and $\hat{\mu}_0 \to \mu_0$ in probability as $n \to \infty$. Thus, we say $\hat{\mu}_1$ and $\hat{\mu}_0$ are approximately the same as $\mu_1$ and $\mu_0$, respectively, when $n$ is large.

Question 5 (8 points) Consider the linear regression model
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i,$$
where the $\varepsilon_i$ are independently and identically distributed normal random variables for $i = 1, \cdots, n$. Let
$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
x_i = \begin{pmatrix} 1 \\ x_{i1} \\ \vdots \\ x_{ip} \end{pmatrix}$$
for $i = 1, \cdots, n$, the $p \times p$ identity matrix
$$I_p = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix},$$
whose definition can be found in "STAT3008 Review of Matrix Algebra.pdf", and
$$\Pi_{p+1} = \begin{pmatrix} 0 & 0_{1 \times p} \\ 0_{p \times 1} & I_p \end{pmatrix},$$
where $0_{k_1 \times k_2}$ is a $k_1 \times k_2$ matrix of zeros. Please answer the following questions in the answer sheet.

(a) With ordinary least squares (OLS) regression, an amazing shortcut makes the cost of leave-one-out cross-validation (LOOCV) the same as that of a single model fit! Specifically, consider the LOOCV error rate $\mathrm{CV}_{(n)}$, which is defined based on the $K$-fold cross-validation error rate
$$\mathrm{CV}_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} \mathrm{MSE}_k \quad \text{for } 2 \leq K \leq n$$
in the lecture notes. Let $h_i$ be the leverage, i.e. the $i$-th diagonal element of the "hat" matrix $X(X^\top X)^{-1} X^\top$ for $1 \leq i \leq n$ (or equivalently, $h_i = x_i^\top (X^\top X)^{-1} x_i$), let $\hat{\beta}_{\mathrm{OLS}} = (\hat{\beta}_{\mathrm{OLS},0}, \cdots, \hat{\beta}_{\mathrm{OLS},p})^\top$ minimize the objective function
$$Q(\beta) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2,$$
and let $\hat{y}_i = \hat{\beta}_{\mathrm{OLS},0} + \hat{\beta}_{\mathrm{OLS},1} x_{i1} + \cdots + \hat{\beta}_{\mathrm{OLS},p} x_{ip}$. Please show that
$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2.$$
Please GIVE the necessary mathematical details about how to show this result in the answer sheet. Hint: the Woodbury matrix identity
$$(A + UCV)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1}$$
may be useful.
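For Question 4 above, here is a minimal numerical check (NumPy assumed; the one-dimensional parameters and sample size are arbitrary illustrative choices). With $n_1 = n_0 = n$, both rules reduce to thresholding $x$ at a single boundary point, and the two boundaries essentially coincide:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500                                   # n1 = n0 = n, as in Question 4
mu1, mu0, sigma = 2.0, 0.0, 1.5           # illustrative 1-d parameters

x = np.concatenate([rng.normal(mu1, sigma, n), rng.normal(mu0, sigma, n)])
y = np.concatenate([np.ones(n), np.zeros(n)])

# OLS fit of Y on X: yhat = b0 + b1 * x, classify 1 iff yhat > 0.5.
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()

# Plug-in classifier from Question 3(c) with d = 1: classify 1 iff psi_hat(x) > 0.
m1, m0 = x[:n].mean(), x[n:].mean()
s2 = ((x[:n] - m1) @ (x[:n] - m1) + (x[n:] - m0) @ (x[n:] - m0)) / (2 * n - 2)

grid = np.linspace(x.min(), x.max(), 10_001)
c_lm = (b0 + b1 * grid > 0.5).astype(int)
c_star = ((m1 - m0) / s2 * (grid - (m1 + m0) / 2) > 0).astype(int)

print(f"OLS boundary:         {(0.5 - b0) / b1:.4f}")
print(f"plug-in boundary:     {(m1 + m0) / 2:.4f}")
print(f"disagreement on grid: {np.mean(c_lm != c_star):.4f}")
```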
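The shortcut in Question 5(a) can likewise be verified numerically before proving it. A minimal sketch (NumPy assumed; the synthetic data and sizes are arbitrary) compares the brute-force LOOCV average against the leverage-based formula:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 4                                      # illustrative sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = rng.normal(size=p + 1)
y = X @ beta + rng.normal(size=n)

# Single full fit: residuals and leverages h_i from the hat matrix.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
H = X @ np.linalg.solve(X.T @ X, X.T)             # hat matrix X(X'X)^{-1}X'
h = np.diag(H)

# Shortcut: CV_(n) = (1/n) * sum_i ((y_i - yhat_i) / (1 - h_i))^2.
cv_shortcut = np.mean((resid / (1 - h)) ** 2)

# Brute force: refit n times, each time leaving out one observation.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    errs.append((y[i] - X[i] @ b_i) ** 2)
cv_brute = np.mean(errs)

print(f"shortcut:    {cv_shortcut:.6f}")
print(f"brute force: {cv_brute:.6f}")             # agree up to rounding
```

The two printed values agree to machine precision, which is exactly what the identity in 5(a) asserts: one full fit suffices to compute $\mathrm{CV}_{(n)}$.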