CAP 5625: Programming Assignment 3
Due on Canvas by Friday, December 2, 2022 at 11:59pm

Preliminary instructions

You may consult with other students currently taking CAP 5625 in your section (section 001 for in-person, section 003 for fully online) at FAU on this programming assignment. If you do consult with others, then you must indicate this by providing their names with your submitted assignment. However, all analyses must be performed independently, all source code must be written independently, and all students must turn in their own independent assignment.

Note that for this assignment, you may choose to pair up with one other student in your section of CAP 5625 and submit a joint assignment. If you choose to do this, then both of your names must be associated with the assignment and you will each receive the same grade.

Though it should be unnecessary to state in a graduate class, I am reminding you that you may not turn in code (partial or complete) that is written or inspired by others, including code from other students, websites, past code that I release from prior assignments in this class or from past semesters in other classes I teach, or any other source that would constitute an academic integrity violation. All instances of academic integrity violations will receive a zero on the assignment and will be referred to the Department Chair and College Dean for further administrative action.

You may choose to use whatever programming language you want. However, you must provide clear instructions on how to compile and/or run your source code. I recommend using a modern language, such as Python, R, or Matlab, as learning these languages can help you if you enter the machine learning or artificial intelligence field in the future.

All analyses performed and algorithms run must be written from scratch. That is, you may not use a library that can perform gradient descent, cross validation, ridge regression, logistic (multinomial) regression, optimization, etc. to successfully complete this assignment (though you may reuse your relevant code from Programming Assignments 1 and 2). The goal of this assignment is not to learn how to use particular libraries of a language, but instead to understand how key methods in statistical machine learning are implemented. With that stated, I will provide 10% extra credit if you additionally implement the assignment using built-in statistical or machine learning libraries (see Deliverable 7 at the end of the document).

Note that credit for deliverables that request graphs, discussion of results, or specific values will not be given if the instructor must run your code to obtain these graphs, results, or specific values.

Brief overview of assignment

In this assignment you will still be analyzing human genetic data from $N = 183$ training observations (individuals) sampled across the world. The goal is to fit a model that can predict (classify) an individual's ancestry from their genetic data that has been projected along the $p = 10$ top principal components (proportion of variance explained is 0.2416), which we use as features rather than the raw genetic data, as training would take too long to complete for this assignment with the raw data. I recognize that we have not covered precisely what principal components are, but we will get to this in Module 11; for now you should just treat them as highly informative features that serve as input to your classifier.
Specifically, you will perform a penalized (regularized) logistic (multinomial) regression fit using ridge regression, with the model parameters obtained by batch gradient descent. Your predictions will be based on $K = 5$ continental ancestries (African, European, East Asian, Oceanian, or Native American). Ridge regression will permit you to apply parameter shrinkage (tuning parameter $\lambda \ge 0$) to mitigate overfitting. The tuning parameter $\lambda$ will be chosen using five-fold cross validation, and the best-fit model parameters will be inferred on the training dataset conditional on an optimal tuning parameter. This trained model will be used to make predictions on new test data points.

Training data

Training data for these observations are given in the attached TrainingData_N183_p10.csv comma-separated file, with individuals labeled on each row (rows 2 through 184), and input features (PC1, PC2, ..., PC10) and ancestry label given on the columns (with the first row representing a header for each column).

Test data

Test data are given in the attached TestData_N111_p10.csv comma-separated file, with individuals labeled on each row (rows 2 through 112), and input features (PC1, PC2, ..., PC10) and ancestry label given on the columns (with the first row representing a header for each column). There are five individuals with Unknown ancestry, 54 individuals with Mexican ancestry, and 52 individuals with African American ancestry. Each of the five Unknown individuals belongs to one of the five ancestries represented in the training set, and each ancestry is represented once among the five Unknown individuals. The Mexican and African American individuals have a range of ancestry proportions based on historical mixing of ancestors of diverse ancestry. You will use the class probabilities from logistic (multinomial) regression to predict the ancestry proportion of each of these putatively mixed samples.

Detailed description of the task

Recall that the task of performing a ridge-penalized logistic regression fit to training data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ is to minimize the cost function

$$J(\mathbf{B}, \lambda) = -\log \mathcal{L}(\mathbf{B}) + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \beta_{jk}^2$$

where

$$\mathbf{B} = \begin{bmatrix} \beta_{01} & \beta_{02} & \cdots & \beta_{0K} \\ \beta_{11} & \beta_{12} & \cdots & \beta_{1K} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{p1} & \beta_{p2} & \cdots & \beta_{pK} \end{bmatrix}$$

is the $(p + 1) \times K$ matrix of model parameters, where

$$\log \mathcal{L}(\mathbf{B}) = \sum_{i=1}^{N} \left[ \sum_{k=1}^{K} y_{ik} \left( \beta_{0k} + \sum_{j=1}^{p} x_{ij} \beta_{jk} \right) - \log\left( \sum_{\ell=1}^{K} \exp\left( \beta_{0\ell} + \sum_{j=1}^{p} x_{ij} \beta_{j\ell} \right) \right) \right]$$

is the log-likelihood function, where $y_{ik}$, $i = 1, 2, \ldots, N$ and $k = 1, 2, \ldots, K$, is an indicator variable defined as

$$y_{ik} = \begin{cases} 1 & \text{if } y_i = k \\ 0 & \text{if } y_i \ne k \end{cases}$$

and where the $p$ input features are standardized (i.e., centered and divided by their standard deviation).

Moreover, recall that batch gradient descent updates each parameter $\beta_{jk}$, $j = 0, 1, \ldots, p$ and $k = 1, 2, \ldots, K$, as follows:

$$\beta_{jk} := \beta_{jk} - \alpha \frac{\partial}{\partial \beta_{jk}} J(\mathbf{B}, \lambda)$$

where $\alpha$ is the learning rate and where the partial derivative of the cost function with respect to parameter $\beta_{jk}$ is

$$\frac{\partial}{\partial \beta_{jk}} J(\mathbf{B}, \lambda) = \begin{cases} -\displaystyle\sum_{i=1}^{N} \left[ y_{ik} - p_k(x_i; \mathbf{B}) \right] & \text{if } j = 0 \\ -\displaystyle\sum_{i=1}^{N} x_{ij} \left[ y_{ik} - p_k(x_i; \mathbf{B}) \right] + 2\lambda\beta_{jk} & \text{if } j > 0 \end{cases}$$

where

$$p_k(x_i; \mathbf{B}) = \frac{\exp\left( \beta_{0k} + \sum_{j=1}^{p} x_{ij} \beta_{jk} \right)}{\sum_{\ell=1}^{K} \exp\left( \beta_{0\ell} + \sum_{j=1}^{p} x_{ij} \beta_{j\ell} \right)}$$

To implement this algorithm, depending on whether your chosen language can quickly compute vectorized operations, you may implement batch gradient descent using either Algorithm 1 or Algorithm 2 below (choose whichever you are more comfortable implementing). Note that in languages like R, Python, or Matlab, Algorithm 2 (which would be implemented by several nested loops) may be much slower than Algorithm 1.
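Before turning to the algorithms, it may help to see that the class probabilities and the gradient above reduce to a handful of vectorized operations. The following Python sketch is purely illustrative (the function names and the use of numpy are my choices, not part of the assignment specification); it assumes the augmented design matrix $\mathbf{X}$ and indicator matrix $\mathbf{Y}$ defined in Algorithm 1 below.

```python
import numpy as np

def class_probabilities(X, B):
    """Softmax probabilities p_k(x_i; B) for all observations at once.

    X: N x (p+1) augmented design matrix (ones column, then standardized features).
    B: (p+1) x K parameter matrix.
    Returns the N x K matrix whose (i, k) entry is p_k(x_i; B).
    """
    U = np.exp(X @ B)                      # element-wise exp of XB
    return U / U.sum(axis=1, keepdims=True)

def cost_gradient(X, Y, B, lam):
    """Partial derivatives of J(B, lambda) with respect to every beta_jk.

    Y: N x K indicator (one-hot) response matrix.
    Returns a (p+1) x K gradient matrix.
    """
    P = class_probabilities(X, B)
    grad = -X.T @ (Y - P)                  # -sum_i x_ij [y_ik - p_k(x_i; B)]
    penalty = 2.0 * lam * B
    penalty[0, :] = 0.0                    # intercepts (j = 0) are not penalized
    return grad + penalty
```

Because the first column of $\mathbf{X}$ is all ones, row $j = 0$ of $-\mathbf{X}^\top(\mathbf{Y} - \mathbf{P})$ is exactly $-\sum_{i=1}^{N} [y_{ik} - p_k(x_i; \mathbf{B})]$, matching the intercept case of the partial derivative above. If the exponentials overflow for poorly scaled inputs, subtracting the row-wise maximum of $\mathbf{XB}$ before exponentiating leaves the normalized probabilities unchanged.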
Also note that if you are implementing Algorithm 1 using Python, use numpy arrays instead of Pandas data frames for computational speed.

You may need to experiment with different learning rates to identify one that is neither too large nor too small, so that the algorithm is likely to converge in a reasonable period of time. I would consider values of the learning rate $\alpha$ such that $2\alpha\lambda < 1$ for your maximum tuning parameter $\lambda$, and see if any of those gets you close to convergence (for example, with maximum $\lambda = 10^4$, this bound gives $\alpha < 5 \times 10^{-5}$). I have achieved good results with a learning rate on the order of $10^{-5}$ for this assignment, where the maximum $\lambda = 10^4$. For this assignment, assume that we will reach the minimum of the cost function within a fixed number of steps, with the number of iterations being 10,000 (a large number, as we have many parameters). Keep in mind that due to this large number of iterations, it could take a long time to train your classifier.

Algorithm 1 (vectorized):

Step 1. Choose learning rate $\alpha$ and fix tuning parameter $\lambda$.

Step 2. Generate the $N \times (p + 1)$ augmented design matrix

$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix}$$

where column $j + 1$ has been centered and standardized such that feature $j$, $j = 1, 2, \ldots, p$, has mean zero and standard deviation one, and generate the $N \times K$ indicator response matrix

$$\mathbf{Y} = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1K} \\ y_{21} & y_{22} & \cdots & y_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N1} & y_{N2} & \cdots & y_{NK} \end{bmatrix}$$

where $y_{ik} = 1$ if observation $i$ is from class $k$, and 0 otherwise.

Step 3. Initialize the $(p + 1) \times K$-dimensional parameter matrix

$$\mathbf{B} = \begin{bmatrix} \beta_{01} & \beta_{02} & \cdots & \beta_{0K} \\ \beta_{11} & \beta_{12} & \cdots & \beta_{1K} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{p1} & \beta_{p2} & \cdots & \beta_{pK} \end{bmatrix}$$

to all zeros, so that initially each class has the same probability.

Step 4. Compute the $N \times K$ unnormalized class probability matrix

$$\mathbf{U} = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1K} \\ u_{21} & u_{22} & \cdots & u_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ u_{N1} & u_{N2} & \cdots & u_{NK} \end{bmatrix} = \exp(\mathbf{XB})$$

where $\exp(\mathbf{XB})$ indicates exponentiation of each element of the $\mathbf{XB}$ matrix, and not the matrix exponential of $\mathbf{XB}$.

Step 5. Compute the $N \times K$ normalized class probability matrix

$$\mathbf{P} = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1K} \\ p_{21} & p_{22} & \cdots & p_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ p_{N1} & p_{N2} & \cdots & p_{NK} \end{bmatrix}$$

where

$$p_k(x_i; \mathbf{B}) = p_{ik} = \frac{u_{ik}}{\sum_{\ell=1}^{K} u_{i\ell}}$$

Step 6. For ease of vectorization, generate a $(p + 1) \times$
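To tie Steps 1 through 5 together, here is one possible, again purely illustrative, end-to-end sketch in Python of the batch gradient descent fit. The default values mirror the settings suggested above; everything else (the function and variable names, the standardize helper, and leaving CSV parsing to the caller) is an assumption of this sketch rather than required structure.

```python
import numpy as np

def standardize(X_raw):
    """Center each feature and divide by its standard deviation (Step 2)."""
    return (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def fit_ridge_multinomial(X_raw, labels, classes, alpha=1e-5, lam=1e4, iters=10_000):
    """Algorithm 1: batch gradient descent for ridge-penalized multinomial regression.

    X_raw: N x p matrix of raw feature values.
    labels: length-N array of class labels.
    classes: length-K sequence of class names in a fixed order.
    Returns the fitted (p+1) x K parameter matrix B.
    """
    N, p = X_raw.shape
    K = len(classes)

    # Step 2: augmented design matrix (intercept column plus standardized
    # features) and the N x K one-hot indicator response matrix Y.
    X = np.hstack([np.ones((N, 1)), standardize(X_raw)])
    Y = np.zeros((N, K))
    for k, c in enumerate(classes):
        Y[labels == c, k] = 1.0

    # Step 3: initialize B to zeros (equal initial class probabilities).
    B = np.zeros((p + 1, K))

    for _ in range(iters):
        # Steps 4-5: unnormalized, then normalized, class probabilities.
        U = np.exp(X @ B)
        P = U / U.sum(axis=1, keepdims=True)

        # Gradient of the cost; the intercept row (j = 0) is not penalized.
        penalty = 2.0 * lam * B
        penalty[0, :] = 0.0
        grad = -X.T @ (Y - P) + penalty

        # Batch gradient descent update for all beta_jk simultaneously.
        B -= alpha * grad
    return B
```

Five-fold cross validation over candidate values of $\lambda$, and the final fit on the full training set with the chosen $\lambda$, would wrap around repeated calls to a routine like this; only the inner fit is sketched here.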