Machine learning project. I have attached the file below.
Assignment 1
ML Class: CS 6375
September 18, 2021

1 Assignment Policies for CS 6375

The following are the policies regarding this assignment.

1. This assignment needs to be done individually by everyone.
2. You are expected to work on the assignments on your own. If I find the assignments of a group of (two or more) students very similar, the group will get zero points towards this assignment. You may possibly also be reported to the judiciary committee.
3. Please use Python for writing code. You can submit the code as a Jupyter notebook.
4. For the theory questions, please use LaTeX.
5. This assignment is for 25 points.
6. This will be due on September 24th.

2 Questions

1. Loss Functions for Linear Regression (6 points): Assume that the hypothesis function is $f(w, b, x) = w^T x + b$. In the standard linear regression case, given an instance $(x_i, y_i)$ in the training set, the loss function is defined as $L_i(w, b) = [f(w, b, x_i) - y_i]^2$. Imagine that we are still interested in posing the optimization problem (over the dataset) as:

$$\min_{w, b} \sum_{i=1}^{n} L_i(w, b) \qquad (1)$$

What if we were to use some slightly different loss functions? Can you answer whether the following loss functions make sense?

(a) $L_i(w, b) = [f(w, b, x_i) - y_i]^3$
(b) $L_i(w, b) = [f(w, b, x_i) - y_i]^4$
(c) $L_i(w, b) = \exp[f(w, b, x_i) - y_i]$
(d) $L_i(w, b) = \max(0, -y_i f(w, b, x_i))$

Part 1: Please explain exactly why these loss functions may or may not be a good choice for regression.
Part 2: Also, compute the gradients of the loss function in each of the cases above.
(Bonus Question): Wherever the loss function makes sense, can you explain how the resulting solution will differ from the standard squared loss? What are the pros and cons of these loss functions compared to the squared loss?
Part 1 is for four points and Part 2 is for two points. The bonus question is for two additional points.

2. Loss Functions for Classification (6 points): Again, assume that the function is $f(w, b, x) = w^T x + b$. In the case of SVMs and Perceptrons, we saw the following two loss functions: $L_i(w, b) = \max(0, -y_i f(w, b, x_i))$ for the Perceptron and $L_i(w, b) = \max(0, 1 - y_i f(w, b, x_i))$ for the Hinge Loss (SVM). Similar to Question 1, let us see if the following loss functions are good choices:

(a) $L_i(w, b) = [\max(0, 1 - y_i f(w, b, x_i))]^2$
(b) $L_i(w, b) = [y_i - f(w, b, x_i)]^4$
(c) $L_i(w, b) = \exp[f(w, b, x_i) - y_i]$
(d) $L_i(w, b) = \exp[-y_i f(w, b, x_i)]$

Part 1: Please explain exactly why these loss functions may or may not be a good choice for classification.
Part 2: Also, compute the gradients of the final loss function in each of the cases above.

3. Polynomial and Higher-Order Features (3 points): Let us use polynomial features with the Perceptron. Consider the dataset shown in Figure 1. [Hint: The dataset is not separable.] Note that this dataset consists of 2-dimensional points $x = [x_1, x_2]$.

Figure 1: Two Dimensional Data

Part 1: Write down the perceptron loss function with quadratic features. First write down what the features will be, the dimensionality of the expanded (quadratic) feature set, and the loss function (a code sketch of one possible expansion follows this question). Is the dataset linearly separable with quadratic features?
Part 2: Draw approximately the output of the perceptron algorithm on this dataset.
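To make the notation concrete, here is a minimal Python sketch of the perceptron loss stated in Question 2 together with one possible quadratic feature expansion of 2-dimensional points. The particular feature map, the toy data, and the function names are illustrative assumptions of mine, not something specified by the assignment.

import numpy as np

def quadratic_features(X):
    # One possible quadratic expansion of 2-D points [x1, x2]; the assignment
    # may intend a slightly different set of quadratic features.
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

def perceptron_loss(w, b, X, y):
    # Sum over the data of L_i = max(0, -y_i * f(w, b, x_i)),
    # with f(w, b, x) = w^T x + b as defined in the assignment.
    margins = y * (X @ w + b)
    return np.sum(np.maximum(0.0, -margins))

# Toy usage on made-up points with labels in {-1, +1}.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.5]])
y = np.array([1.0, -1.0, 1.0])
Phi = quadratic_features(X)           # the expanded features are 5-dimensional here
w, b = np.zeros(Phi.shape[1]), 0.0
print(perceptron_loss(w, b, Phi, y))  # 0.0 at w = 0, b = 0, since every margin is 0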
4. Lagrangian and Dual (5 points):
Part 1: Let us start with linear regression. Recall that the loss function of linear regression is

$$\min_{w} \sum_{i=1}^{M} [w^T x_i + b - y_i]^2.$$

Suppose we add a constraint that the L2-distance between $w$ and a prior weight vector $w_0$ is less than $k$, i.e. suppose we have some reason to believe that the weight vector should be close to $w_0$. How can you pose this as an optimization problem? What is the Lagrangian formulation? Can you put down the steps involved in getting the dual? It may be a little difficult to write down the dual exactly, but even if you can only write down the steps involved, you will get the points. Getting the dual exactly is worth 2 bonus points.

Part 2: Let us do this for the SVM. For simplicity, let us assume linear separability. Recall that the SVM formulation is:

$$\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{subject to: } y_i(w^T x_i + b) \geq 1, \ \forall i = 1, \cdots, M \qquad (2)$$

Similar to Part 1, what if we add the constraint that the L2-distance between $w$ and a prior weight vector $w_0$ is less than $k$? How can you pose this as an optimization problem? What is the Lagrangian formulation? Obtain the dual solution from the Lagrangian.

5. Implement Linear Regression from Scratch (5 points): Implement linear regression from scratch. Implement the squared loss and a very simple gradient descent algorithm in Python. Test your implementation on the house price prediction dataset from the demo on Linear Regression. Finally, compare your solution to the solution obtained by sklearn (from the demo) and comment on it briefly. You do not need to consider the more complicated quadratic case; simple linear regression on the features is enough. You may use sklearn to load the dataset and do the splits, but nothing more: you should form the loss function and optimize it yourself rather than calling linear regression from sklearn directly, and use sklearn's implementation only as the reference against which you compare your own.
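As a rough illustration only, here is a minimal sketch of the kind of gradient-descent implementation Question 5 describes. The California housing loader is purely a stand-in because the demo's house-price dataset is not named here, and the feature standardization, learning rate, and iteration count are assumptions of mine rather than requirements of the assignment.

import numpy as np
from sklearn.datasets import fetch_california_housing  # stand-in for the demo's house-price data
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Loading and splitting are allowed via sklearn; everything else is done by hand.
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize features so plain gradient descent converges reasonably
# (an extra choice of mine, not something the question requires).
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train, X_test = (X_train - mu) / sigma, (X_test - mu) / sigma

def squared_loss(w, b, X, y):
    # Mean of [f(w, b, x_i) - y_i]^2 with f(w, b, x) = w^T x + b.
    residual = X @ w + b - y
    return np.mean(residual ** 2)

def gradient_descent(X, y, lr=0.05, n_iters=2000):
    # Very simple batch gradient descent on the squared loss.
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        residual = X @ w + b - y
        w -= lr * (2.0 / n) * (X.T @ residual)
        b -= lr * (2.0 / n) * residual.sum()
    return w, b

w, b = gradient_descent(X_train, y_train)
print("from-scratch test MSE:", squared_loss(w, b, X_test, y_test))

# sklearn's solution, used only for the comparison the question asks for.
ref = LinearRegression().fit(X_train, y_train)
print("sklearn test MSE:     ", np.mean((ref.predict(X_test) - y_test) ** 2))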