Homework 1
MSA 8150 - Machine Learning for Analytics (Spring 2021)
Instructor: Alireza Aghasi
Due date: see iCollege (deadline: January 16, 2021)

Please review the homework guidelines discussed in the first lecture. Specifically, note that:

• Start working on the homework early.
• Late homework is not accepted and will receive zero credit.
• Each student must write up and turn in their own solutions.
• (IMPORTANT) If you solve a question together with a colleague, each of you must still write up your own solution, and you must list the names of the people you discussed the problem with on the first page of the material you turn in.
• The homework is a blend of theory and programming. Please submit all answers as a single Word document, as explained in class.

Q1. The goal is to find the optimal values of β1 and β2 which fit the model

\[
y = \beta_1 x_1 + \beta_2 x_2 \tag{1}
\]

to some data points. The data points are in the form (x1,1, x2,1, y1), (x1,2, x2,2, y2), ..., (x1,n, x2,n, yn), where x1,i and x2,i are our input features for sample i, and yi is the response variable for sample i.

(a) Write the RSS formulation for this problem.

(b) Let's do a quick review of basic linear algebra. Consider the following system of two equations, where β1 and β2 are the unknowns:

\[
\begin{cases}
a\beta_1 + b\beta_2 = c \\
d\beta_1 + e\beta_2 = f
\end{cases} \tag{2}
\]

Show that if ae − bd ≠ 0, then simultaneously solving the system above for β1 and β2 yields

\[
\beta_1 = \frac{ce - bf}{ae - bd}, \qquad \beta_2 = \frac{af - cd}{ae - bd}.
\]

(c) Minimize the RSS you obtained in part (a) and conclude that the optimal values of β1 and β2 are

\[
\hat{\beta}_1 = \frac{\left(\sum_{i=1}^{n} y_i x_{1,i}\right)\left(\sum_{i=1}^{n} x_{2,i}^2\right) - \left(\sum_{i=1}^{n} y_i x_{2,i}\right)\left(\sum_{i=1}^{n} x_{1,i} x_{2,i}\right)}{\left(\sum_{i=1}^{n} x_{1,i}^2\right)\left(\sum_{i=1}^{n} x_{2,i}^2\right) - \left(\sum_{i=1}^{n} x_{2,i} x_{1,i}\right)^2}, \tag{3}
\]

\[
\hat{\beta}_2 = \frac{\left(\sum_{i=1}^{n} y_i x_{2,i}\right)\left(\sum_{i=1}^{n} x_{1,i}^2\right) - \left(\sum_{i=1}^{n} y_i x_{1,i}\right)\left(\sum_{i=1}^{n} x_{1,i} x_{2,i}\right)}{\left(\sum_{i=1}^{n} x_{1,i}^2\right)\left(\sum_{i=1}^{n} x_{2,i}^2\right) - \left(\sum_{i=1}^{n} x_{2,i} x_{1,i}\right)^2}. \tag{4}
\]

Hint: At some point you will find the result in part (b) useful.

(d) Assume that the data follow the model yi = β∗1 x1,i + εi, where ε is a zero-mean noise. In other words, the original God's model (regression function) is β∗1 x1 and does not use x2, but for whatever reason we have also included x2 in our model. Show that, despite this wrong inclusion, β̂1 is still an unbiased estimate of β∗1, i.e., E(β̂1) = β∗1.

(e) Suppose that we want to minimize the RSS, but at the same time want to enforce that β1 and β2 are close to each other. So we consider the following modified RSS:

\[
\widetilde{\mathrm{RSS}} = (\beta_1 - \beta_2)^2 + \sum_{i=1}^{n} (y_i - \beta_1 x_{1,i} - \beta_2 x_{2,i})^2.
\]

Show that minimizing the modified RSS yields the following optimal estimates for β1 and β2:

\[
\tilde{\beta}_1 = \frac{\left(\sum_{i=1}^{n} y_i x_{1,i}\right)\left(1 + \sum_{i=1}^{n} x_{2,i}^2\right) - \left(\sum_{i=1}^{n} y_i x_{2,i}\right)\left(-1 + \sum_{i=1}^{n} x_{1,i} x_{2,i}\right)}{\left(1 + \sum_{i=1}^{n} x_{1,i}^2\right)\left(1 + \sum_{i=1}^{n} x_{2,i}^2\right) - \left(-1 + \sum_{i=1}^{n} x_{2,i} x_{1,i}\right)^2},
\]

\[
\tilde{\beta}_2 = \frac{\left(\sum_{i=1}^{n} y_i x_{2,i}\right)\left(1 + \sum_{i=1}^{n} x_{1,i}^2\right) - \left(\sum_{i=1}^{n} y_i x_{1,i}\right)\left(-1 + \sum_{i=1}^{n} x_{1,i} x_{2,i}\right)}{\left(1 + \sum_{i=1}^{n} x_{1,i}^2\right)\left(1 + \sum_{i=1}^{n} x_{2,i}^2\right) - \left(-1 + \sum_{i=1}^{n} x_{2,i} x_{1,i}\right)^2}.
\]

Hint: It is totally fine to take a strategy similar to that of part (c), but if you think a little outside the box, there might be an easier way of getting to these results based on the results of part (c).

(f) Let's use an example and see if the result you got in part (c) can also be obtained in R via the lm function (feel free to use Python for linear regression if you are more comfortable). For this purpose, assume that n = 10 and our data are as follows:

    x1        x2        y
    89.0900   78.4800   113.2700
    84.2400   70.5600   109.7700
    98.7700   93.5200   130.0800
    95.4400   86.7200   120.4500
    90.9800   79.2000   115.0900
    97.3900   91.3600   125.3700
    89.2700   80.0000   116.2200
    88.5100   76.9600   112.0800
    97.0600   92.5600   127.8500
    84.4500   66.4000   107.6100

Write a program that takes the data as indicated in the table and calculates the values of β̂1 and β̂2 as suggested in part (c). Attach the code and results.

(g) Write a program that takes the data as indicated in the table and calculates the values of β̂1 and β̂2 using the linear model function in R (or Python). If you write your program correctly, you should get results identical to those of part (f). Attach the code and results.

Hint: When you use the lm command in R, you need to force the intercept to be zero.
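For parts (f) and (g), a minimal R sketch along the following lines would do the job (the variable names are illustrative; in the lm formula, the "0 +" term is what forces the intercept to zero):

    # Data from the table above
    x1 <- c(89.09, 84.24, 98.77, 95.44, 90.98, 97.39, 89.27, 88.51, 97.06, 84.45)
    x2 <- c(78.48, 70.56, 93.52, 86.72, 79.20, 91.36, 80.00, 76.96, 92.56, 66.40)
    y  <- c(113.27, 109.77, 130.08, 120.45, 115.09, 125.37, 116.22, 112.08, 127.85, 107.61)

    # Part (f): closed-form estimates from equations (3) and (4)
    den       <- sum(x1^2) * sum(x2^2) - sum(x1 * x2)^2
    beta1.hat <- (sum(y * x1) * sum(x2^2) - sum(y * x2) * sum(x1 * x2)) / den
    beta2.hat <- (sum(y * x2) * sum(x1^2) - sum(y * x1) * sum(x1 * x2)) / den

    # Part (g): lm with the intercept suppressed via "0 +" in the formula
    fit <- lm(y ~ 0 + x1 + x2)

    c(beta1.hat, beta2.hat)  # closed-form values from part (f)
    coef(fit)                # lm values; should match part (f)

If the two sets of numbers disagree, the most likely culprit is a bracketing mistake in the shared denominator of (3) and (4).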
Q2. The goal of this question is to mathematically show that for linear models the expected mean squared error (MSE) on the training data is always less than the expected MSE on the test data. While the idea of the proof is in general similar to what we will do here, to avoid complications, let's work with a simple model.

(a) Suppose that the God's model is y = β∗x and our observations are in the form y = β∗x + ε, where ε is a zero-mean noise with variance σ². Consider a training set of size n, as (x1, y1), (x2, y2), ..., (xn, yn). We define the training MSE function as

\[
M(\alpha) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \alpha x_i)^2.
\]

Mathematically show that

\[
E(M(\beta^*)) = \sigma^2. \tag{5}
\]

(b) Suppose that β̂ is obtained by minimizing the MSE associated with the training data. Mathematically, or in simple words, discuss why we should have

\[
M(\hat{\beta}) \le M(\beta^*). \tag{6}
\]

(-) I will do this part for you. We know that if for two random variables u and v we always have u ≤ v, then we also have E(u) ≤ E(v). Combined with the result of part (b), this fact implies that

\[
E(M(\hat{\beta})) \le E(M(\beta^*)). \tag{7}
\]

(c) Consider a test set of size n, as (x̃1, ỹ1), (x̃2, ỹ2), ..., (x̃n, ỹn). We define the test MSE function as

\[
\tilde{M}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} (\tilde{y}_i - \alpha \tilde{x}_i)^2.
\]

Mathematically show that

\[
E(\tilde{M}(\hat{\beta})) = \sigma^2 + \frac{1}{n} \sum_{i=1}^{n} (\hat{\beta} \tilde{x}_i - \beta^* \tilde{x}_i)^2. \tag{8}
\]

(d) Now, by comparing (5), (7), and (8), explain why you can immediately conclude that

\[
E(M(\hat{\beta})) \le E(\tilde{M}(\hat{\beta})).
\]

Q3. To discover a physical law, we have collected 240 data samples, where p1, p2, and d are the input parameters, and F is the response variable. You can access the data in the homework folder, in a file named PhysicalLaw.csv.

– Read the data file and split it into two sets: set 1 includes the first 200 rows of the data (do not count the header row with the feature/response names), and set 2 includes the last 40 rows. Name the first set train and the second set test.

(a) Using the training data, fit a linear regression model as

\[
F = \beta_0 + \beta_1 p_1 + \beta_2 p_2 + \beta_3 d, \tag{9}
\]

and report the fitted parameters, the 95% confidence interval for each estimated parameter, the p-values, and the R² statistic. Explain what the R² statistic tells you.

(b) Based on the p-values and α = 0.05, do you see a significance problem with any of the features?

(c) Use the fitted model and pass the features in your test file to generate the corresponding predicted response F^pred (a vector of length 40). Now compare this quantity with the original responses in your test file, F^test, using the test root mean squared error (RMSE), defined as

\[
\mathrm{RMSE} = \sqrt{\frac{1}{40} \sum_{i=1}^{40} \left(F^{\mathrm{test}}_i - F^{\mathrm{pred}}_i\right)^2}.
\]
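As a rough outline of the Q3 workflow in R (assuming, per the problem statement, that PhysicalLaw.csv has columns named p1, p2, d, and F):

    # Read and split the data: first 200 rows train, last 40 rows test
    dat   <- read.csv("PhysicalLaw.csv")
    train <- dat[1:200, ]
    test  <- dat[201:240, ]

    # Part (a): fit the model in (9) and inspect the estimates
    fit <- lm(F ~ p1 + p2 + d, data = train)
    summary(fit)                # coefficients, p-values, R^2
    confint(fit, level = 0.95)  # 95% confidence intervals

    # Part (c): predictions on the test set and the test RMSE
    F.pred <- predict(fit, newdata = test)
    sqrt(mean((test$F - F.pred)^2))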
Q4. Read the data file ModelSelection.csv, which contains 1500 pairs of (x, y). These data are acquired from a model of the form

\[
y = \beta_0 + \beta_1 x^{n_1} + \beta_2 x^{n_2} + \beta_3 x^{n_3}. \tag{10}
\]

We know neither the quantities β0, β1, β2, β3 nor the exponents n1, n2, n3. All we know is the following:

• β0, β1, β2, β3 are real numbers;
• n1, n2, n3 are integers not less than 1 and not greater than 10; in other words, ni ∈ {1, 2, 3, ..., 10} for i = 1, 2, 3.

Read the data and split it into two sets: set 1 includes the first 1000 rows of the data (do not count the header row with the x, y names), and set 2 includes the last 500 rows. Name the first set train and the second set test.

Using the linear model function in R (or the counterpart in Python), write a program that explores all the models of the form (10), trains them on train, and tests them on test. The output of your program should be the values of β0, β1, β2, β3 and n1, n2, n3 that correspond to the model with the best (i.e., minimum) MSE on the test set. Please provide your code and the values of β0, β1, β2, β3 and n1, n2, n3 that your code returns.
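One way to organize the Q4 search in R, assuming the columns in ModelSelection.csv are named x and y, is a triple loop over the candidate exponents; this is a sketch, not the only valid design:

    # Read and split the data: first 1000 rows train, last 500 rows test
    dat   <- read.csv("ModelSelection.csv")
    train <- dat[1:1000, ]
    test  <- dat[1001:1500, ]

    best <- list(mse = Inf)
    for (n1 in 1:10) for (n2 in 1:10) for (n3 in 1:10) {
      # I(x^n) builds each power term inside the lm formula
      fit  <- lm(y ~ I(x^n1) + I(x^n2) + I(x^n3), data = train)
      pred <- predict(fit, newdata = test)
      mse  <- mean((test$y - pred)^2)
      if (mse < best$mse)
        best <- list(mse = mse, n = c(n1, n2, n3), beta = coef(fit))
    }
    best  # exponents, coefficients (including the intercept beta0), and test MSE

Since the three power terms enter the model symmetrically, restricting the search to n1 < n2 < n3 skips equivalent fits; repeated exponents produce aliased columns, which lm reports as NA coefficients.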