Introduction to Deep Learning: Sheet 1 — COMP 433 Fall 2022

Assignment 1
Due Date: November 9

Submission Instructions: Follow the submission instructions carefully or your assignment may not be fully graded.

1) Submit a SINGLE pdf of your written answers, code, and output from the programming portions. Make sure to include all three in the pdf; do not assume we will re-run your code, do not submit jpeg files, do not submit multiple pdfs.
2) It is suggested (but not required) that you also submit a runnable ipynb (single file) including all answers to the programming portions of the questions. This will be used if there are doubts, but assume the ipynb might not be evaluated, so all your answers should be readable from the single pdf.
3) Do not zip your files; submit the pdf and the ipynb directly to Moodle.
4) Clearly delineate each question in your answer.

Note 1: The assignments are to be done individually. You may discuss high-level ideas regarding questions with others but should not share answers. List the names of anyone you have extensively discussed a question with at the top of your final submission.
Note 2: Do not post questions about the assignment on MS Teams.
Note 3: For clarifications, or if you suspect there is a mistake/typo, contact the Instructor or TAs.

Department of Computer Science and Software Engineering, Concordia University — Eugene Belilovsky
1. (a) (10 points) Consider the 1-hidden-layer neural network
$$\hat{y}^i = w_2^T \sigma(W_1 x^i + b_1) + b_2$$
where $W_1$ is $20 \times 10$, $w_2$ is a vector of size 20, and $x$ is a vector of size 10. The absolute loss is given by $\ell(\hat{y}, y) = |\hat{y} - y|$. Consider the cost function
$$J = \frac{1}{N} \sum_{i=1}^{N} \ell(\hat{y}^i, y^i).$$
Derive an expression for $\frac{\partial J}{\partial W_1}$, $\frac{\partial J}{\partial w_2}$, $\frac{\partial J}{\partial b_1}$, $\frac{\partial J}{\partial b_2}$.

   (b) (10 points)
   • Implement this network and loss in PyTorch (using torch tensors and a functional style, but not nn.Module).
   • Validate with test cases that the gradients computed by PyTorch match those of your answer in (a). To do this, generate random {x, y} samples with components drawn from either a uniform or Gaussian distribution.
   • Show the difference between the corresponding matrices (gradient calculated by hand, and gradient from torch.autograd) using torch.linalg.norm. A minimal sketch of such a check is given after the links below.

   (c) (10 points) Train this model on the sklearn California Housing dataset.
   • For this you may use the optimizer and learning rates of your choice and train for 20–50 epochs.
   • Take half the data for training and half for testing.
   • Create a validation set from the training set and use it to select a good learning rate.
   • You might want to use the convenient Xavier initialization.
   • You are free to use the torch.optim package for this part.
   • To speed things up, run the training loop in batches (e.g. 4, 8, 32, 64, etc.). PyTorch's DataLoader is a useful tool to easily fetch a predefined set of batches per training iteration.
   • Report the mean squared error on the train and test set after each epoch.
   • You will need to adjust the size of $W_1$ to fit the size of this data.

   Useful links:
   • torch.autograd: https://pytorch.org/docs/stable/autograd.html
   • torch.linalg.norm: https://pytorch.org/docs/stable/generated/torch.linalg.norm.html
   • California Housing dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html
   • Xavier initialization: https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_normal_
   • torch.optim: https://pytorch.org/docs/stable/optim.html
   • DataLoader tutorial: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
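The following is a minimal illustrative sketch for part 1(b), assuming $\sigma$ is the logistic sigmoid and using a single random sample; it is not the required solution. The hand-derived gradient lines are placeholders for your own answer from (a) (the sigmoid case is written out only for concreteness), and the final norms show one way to use torch.linalg.norm for the comparison.

```python
import torch

torch.manual_seed(0)

# Sketch for 1(b): functional forward pass and gradient check against autograd.
# Assumption: sigma is the logistic sigmoid; adapt if you use a different activation.
D, K = 10, 20
W1 = torch.randn(K, D, requires_grad=True)
b1 = torch.randn(K, requires_grad=True)
w2 = torch.randn(K, requires_grad=True)
b2 = torch.randn(1, requires_grad=True)

x = torch.randn(D)          # one random sample (Gaussian components)
y = torch.randn(1)

# Forward pass (functional style, no nn.Module)
h = torch.sigmoid(W1 @ x + b1)
y_hat = w2 @ h + b2
loss = torch.abs(y_hat - y)

# Autograd gradients
loss.backward()

# Hand-derived gradients for the same sample (replace with your answer from (a)).
s = torch.sign(y_hat - y).detach()
grad_w2 = s * h.detach()
grad_b2 = s
grad_b1 = s * w2.detach() * h.detach() * (1 - h.detach())
grad_W1 = torch.outer(grad_b1, x)

# Compare: these norms should be numerically close to zero.
print(torch.linalg.norm(grad_w2 - w2.grad))
print(torch.linalg.norm(grad_b2 - b2.grad))
print(torch.linalg.norm(grad_b1 - b1.grad))
print(torch.linalg.norm(grad_W1 - W1.grad))
```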
2. (a) (15 points) Consider the neural network
$$f(x) = W_F\, \rho \circ W_L \cdots \rho \circ W_i \cdots \rho \circ W_2\, \rho \circ W_1 x$$
where $W_1$ is $K \times D$, $W_i$ is $K \times K$ for $i > 1$, and $W_F$ is $P \times K$. Note $f : \mathbb{R}^D \to \mathbb{R}^P$. Take $\rho(x) = \tanh(x)$.
Consider the case $D = 2$, $K = 30$, $P = 10$. Use torch tensors to write a function which computes the Jacobian $\frac{\partial f(x)}{\partial x}$ using backward-mode automatic differentiation for a given value of $x$ and $W_1, \ldots, W_i, \ldots, W_F$, where the given matrices are specified by a dictionary of torch tensors. Specifically, your function should be able to take different values of $L$. Implement and test this for $L = 3$. Your function should only make use of basic matrix operations (e.g. torch.matmul(), torch.tanh(), etc.). A minimal sketch is given after this question.

   (b) (Extra Credit, +5 points) Implement a function using torch tensors and forward-mode automatic differentiation to compute $\frac{\partial f(x)}{\partial x}$. Validate (with assert statements) for several test cases that your answer matches the function from (a) for $L = 3$. Hint: you must calculate the derivatives and the network's output in the same forward pass (unlike backward differentiation, where you need two loops: one for the forward pass and one for calculating the gradient).

   (c) (5 points) Benchmark the Jacobian computation of (a) for $L = 3, 5, 10$. Report the speed of these answers on test cases using both GPU and CPU.

   (d) (5 points) Assume matrix multiply operations between sizes $M_1 \times M_2$ and $M_2 \times M_3$ result in $M_1 M_2 M_3$ ops. Briefly discuss the theoretical speed comparison of (a) backward-mode and (b) forward-mode differentiation. Note the actual implementation of (b) is extra credit.
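A minimal sketch for part 2(a), under the assumption that the weight dictionary uses keys "W1", ..., f"W{L}" and "WF" (the key naming is an illustrative choice, not a required interface). The backward pass itself uses only basic matrix operations; torch.autograd.functional.jacobian appears only as a reference to check the result.

```python
import torch

def jacobian_backward(x, weights, L):
    """Backward-mode Jacobian of f(x) = W_F rho(W_L ... rho(W_1 x)), rho = tanh.

    weights: dict with keys 'W1', ..., f'W{L}' and 'WF' (assumed naming).
    Returns a P x D tensor using only basic matrix operations."""
    # Forward pass: store the pre-activations z_i = W_i a_{i-1}.
    a = x
    z = []
    for i in range(1, L + 1):
        z_i = weights[f"W{i}"] @ a
        z.append(z_i)
        a = torch.tanh(z_i)
    # Backward pass: accumulate the Jacobian from the output side.
    J = weights["WF"].clone()                      # P x K
    for i in range(L, 0, -1):
        J = J * (1 - torch.tanh(z[i - 1]) ** 2)    # right-multiply by diag(rho'(z_i))
        J = J @ weights[f"W{i}"]                   # chain through W_i
    return J                                       # P x D

# Quick test for L = 3, D = 2, K = 30, P = 10, checked against autograd.
D, K, P, L = 2, 30, 10, 3
weights = {f"W{i}": torch.randn(K, K if i > 1 else D) for i in range(1, L + 1)}
weights["WF"] = torch.randn(P, K)

x = torch.randn(D)
J = jacobian_backward(x, weights, L)

def f(x):
    a = x
    for i in range(1, L + 1):
        a = torch.tanh(weights[f"W{i}"] @ a)
    return weights["WF"] @ a

J_ref = torch.autograd.functional.jacobian(f, x)
assert torch.allclose(J, J_ref, atol=1e-4)
```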
3. For the following functions, find, by hand, the parameters of a neural network that can fit these functions. You should use either a 1- or 2-hidden-layer network and may use either sigmoid or ReLU non-linearities. In each case justify your answer and how you arrived at it (without using numerical tools/software packages).

   1. (5 points)
$$f(x) = \begin{cases} 3x & 0 \le x \le 1/2, \\ 3(1 - x) & 1/2 \le x \le 1, \\ 0 & \text{otherwise}. \end{cases}$$

   2. (5 points) $f(x) = \max(|x| - s, 0) \cdot \operatorname{sign}(x)$, where $s$ is a constant greater than 0.

   Hint: Consider a simpler example, $f(x) = |x|$. A valid ReLU neural network that gives this function is
$$NN(x) = \mathrm{ReLU}(w_1 x + b_1) + \mathrm{ReLU}(w_2 x + b_2)$$
where $w_1 = 1$, $w_2 = -1$, $b_1 = b_2 = 0$.
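To make the hint concrete, here is a quick check that the example network reproduces $|x|$; this only elaborates the given hint and is not a solution to either part.

```latex
% Checking the hint: NN(x) = ReLU(x) + ReLU(-x), i.e. w_1 = 1, w_2 = -1, b_1 = b_2 = 0.
\[
NN(x) = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x) =
\begin{cases}
x + 0 = |x|, & x \ge 0,\\[2pt]
0 + (-x) = |x|, & x < 0,
\end{cases}
\]
% Each ReLU contributes one linear piece, and their sum recovers the target function.
```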
4. (a) (5 points) We will study different ways to initialize models. First, create an nn.Module with a variable that defines the number of feedforward layers, taking MNIST digits as inputs. The module should allow constructing a network as follows: net = my_model(depth). It may be helpful to use the nn.ModuleList construction. For the rest of the exercise we will use a width of 50 hidden units. The input layer of your network should take minibatches of data sized $B \times 784$, where $B$ is the batch size, and the output should have 10 values. For this network use a tanh non-linearity. A minimal sketch of such a module is given after this question.

   (b) (2 points) Write a function to initialize your model weights $w \sim U(-d, d)$ and biases to zero. We will study the 4 cases $d = 0.01,\; 0.1,\; 2.0,\; \sqrt{\tfrac{6}{n_i + n_o}}$. Note the final value corresponds to Xavier initialization, with $n_i, n_o$ the number of input and output connections to a unit. You will perform parts (c) and (d) for all 4 values of $d$ and depth 8. This function can be a member of your model class. You may make use of the built-in torch.nn.init functions to achieve this.

   (c) (8 points) Using a cross-entropy loss at the end of the network, forward and backward a minibatch of 256 MNIST digits through the network with depth 8. Compute and visualize the gradient norm at each layer. Specifically, this refers to $\left\|\frac{\partial L}{\partial a}\right\|$, where $a$ are the post-activation outputs. Your plots should have layer on the x-axis and gradient norm on the y-axis. Note that to get the gradient norms at each layer you can use retain_grad on the layer outputs in the forward pass to keep the gradient buffer from being cleared at each layer on the backward pass. Perform this for each of the 4 initializations to obtain 4 curves. Note: in this question you do not need to train or update the models.

   (d) (8 points) For each of the initialization settings, train the model for 5 epochs on MNIST using the cross-entropy loss. You may use SGD with a learning rate of 0.01 and minibatch sizes of 128. Record the training accuracy and testing accuracy after each epoch and plot them versus epochs.

   (e) (2 points) Briefly (max 1 paragraph) discuss your findings: how do depth and initialization affect the activations, gradients, and convergence?
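A minimal sketch for parts 4(a)/4(b). The class and method names (MyModel, init_uniform) are illustrative choices, not a required interface, and the width/dimensions follow the values stated in the question.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    """Sketch: a variable-depth feedforward network for MNIST (tanh, width 50)."""

    def __init__(self, depth, width=50, in_dim=784, out_dim=10):
        super().__init__()
        dims = [in_dim] + [width] * depth
        self.hidden = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(depth)]
        )
        self.out = nn.Linear(width, out_dim)

    def init_uniform(self, d):
        # 4(b): weights ~ U(-d, d), biases set to zero.
        for layer in list(self.hidden) + [self.out]:
            nn.init.uniform_(layer.weight, -d, d)
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        x = x.view(x.size(0), -1)          # flatten to B x 784
        for layer in self.hidden:
            x = torch.tanh(layer(x))       # tanh non-linearity at each hidden layer
        return self.out(x)                 # B x 10 logits

net = MyModel(depth=8)
net.init_uniform(0.01)
logits = net(torch.randn(256, 1, 28, 28))  # a minibatch of MNIST-shaped inputs
print(logits.shape)                        # torch.Size([256, 10])
```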
5. (a) (5 points) Consider the squared loss $L(X, w, y) = \frac{1}{2}\|Xw - y\|^2$ for a data matrix $X$ of size $N \times D$ and parameters $w$; $y$ is a vector of labels of size $N$. Find the expression for the gradient $\nabla_w L(X, w, y)$ and the minimizer of this loss, $\arg\min_w L(X, w, y)$.

   (b) (5 points) Take $w_0$ as the initialization for gradient descent with step size $\alpha$ and show an expression for the first and second iterates $w_1$ and $w_2$ only in terms of $\alpha, w_0, X, y$. (A numerical sanity-check sketch appears after this question.)

   (c) (+1 point Extra Credit) Generalize this to show an expression for $w_k$ in terms of $\alpha, w_0, X, y, k$.
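A small numerical sanity check, offered as a sketch rather than part of the required written derivation: it runs gradient descent on the squared loss numerically (using autograd, so it does not reveal the closed-form gradient), letting you compare the resulting iterates against your symbolic expressions for $w_1$, $w_2$ (and $w_k$). The helper name numerical_grad is illustrative.

```python
import torch

torch.manual_seed(0)

# Run gradient descent on L(X, w, y) = 0.5 * ||Xw - y||^2 numerically so the
# iterates can be checked against closed-form expressions on small random data.
N, D = 8, 3
X = torch.randn(N, D)
y = torch.randn(N)
w0 = torch.zeros(D)
alpha = 0.01

def numerical_grad(w):
    # Gradient via autograd; derive the symbolic expression yourself in (a).
    w = w.clone().requires_grad_(True)
    loss = 0.5 * torch.sum((X @ w - y) ** 2)
    loss.backward()
    return w.grad

w = w0.clone()
iterates = [w.clone()]
for _ in range(5):
    w = w - alpha * numerical_grad(w)   # one gradient-descent step
    iterates.append(w.clone())

# iterates[1] and iterates[2] can be compared (e.g. with torch.allclose) against
# your symbolic expressions for w_1 and w_2 evaluated at the same alpha, w0, X, y.
print(iterates[1])
print(iterates[2])
```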