1. Linearly separable datasets. Consider two sets of points in $\mathbb{R}^p$, $\mathcal{X}_0 = \{x^0_1, \dots, x^0_{N_0}\} \subseteq \mathbb{R}^p$ and $\mathcal{X}_1 = \{x^1_1, \dots, x^1_{N_1}\} \subseteq \mathbb{R}^p$. If $\mathcal{X}_0$ and $\mathcal{X}_1$ are linearly separable, then it is possible to draw a hyperplane $H \subseteq \mathbb{R}^p$ so that all the points in $\mathcal{X}_0$ lie on one side of $H$ and all the points in $\mathcal{X}_1$ lie on the other side of $H$. When $p = 2$, the hyperplane $H$ is just a line, and two linearly separable sets (black and red) are depicted in the figure below.

Linear separability is an important concept in classification because if the training data from two classes are linearly separable, then it is always possible to construct a linear classifier with zero training errors. More formally, we have the following definition: the sets $\mathcal{X}_0$ and $\mathcal{X}_1$ are linearly separable if there exist a vector $w \in \mathbb{R}^p$ and a real number $a \in \mathbb{R}$ such that $w^\top x > a$ for all $x \in \mathcal{X}_0$ and $w^\top x < a$ for all $x \in \mathcal{X}_1$.

Another important concept in machine learning and statistics is the convex hull of a set. If $\mathcal{X} \subseteq \mathbb{R}^p$, then the convex hull of $\mathcal{X}$ is the smallest convex set containing $\mathcal{X}$; more precisely, the convex hull of $\mathcal{X}$ is the set of all finite convex combinations of points of $\mathcal{X}$,

$$\mathrm{co}(\mathcal{X}) = \left\{ \sum_{k=1}^{n} \alpha_k x_k \;:\; n \ge 1,\ x_1, \dots, x_n \in \mathcal{X},\ \alpha_1, \dots, \alpha_n \ge 0,\ \sum_{k=1}^{n} \alpha_k = 1 \right\}.$$

Prove that $\mathcal{X}_0$ and $\mathcal{X}_1$ are linearly separable if and only if $\mathrm{co}(\mathcal{X}_0) \cap \mathrm{co}(\mathcal{X}_1) = \emptyset$.

2. Predicting the Nikkei 225. In this problem the goal is to predict the direction of change in the Nikkei 225 index, which is composed of 225 highly capitalized stocks trading on the Tokyo Stock Exchange, representing a broad cross-section of Japanese industries. Two inputs will be used to make our predictions. Economic growth in Japan has a close relationship with Japanese exports. The largest export target for Japan is the USA, so we use the return of the S&P 500 index as one input. The other input is the change in the exchange rate of US dollars against Japanese yen. The relevant datasets are available on Canvas: nikkei_daily.txt, sp_daily.txt, and ex_daily.txt.
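The two inputs described above are both day-over-day changes measured on a log scale. The sketch below shows that arithmetic on made-up numbers; the helper name `log_returns` and the price values are illustrative assumptions, not taken from the course datasets, and the file layout of the Canvas data is not assumed here.

```python
import numpy as np

def log_returns(values):
    """Day-over-day log changes: r[t] = log(values[t+1]) - log(values[t])."""
    values = np.asarray(values, dtype=float)
    return np.diff(np.log(values))

# Made-up series for illustration only (hypothetical values).
sp500_close = [1200.0, 1212.0, 1205.0, 1220.0]
usd_jpy = [104.0, 103.5, 104.2, 105.0]

sp_ret = log_returns(sp500_close)      # log return of the index
fx_change = log_returns(usd_jpy)       # change in the log exchange rate

print(sp_ret)
print(fx_change)
```

Log changes add up across days, so the sum of the daily values equals the log change over the whole period, which is one reason they are convenient inputs.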
(a) Using the nikkei_daily.txt data, for each day $i$ from Jan 03 2005 to Dec 31 2007 for which the data are available, create an output variable $g_i$ that is labeled $+1$ if the Nikkei index value increased from day $i-1$ to day $i$, and $-1$ otherwise. Also create the two input variables: the first input $x_{i1}$ is the log return of the S&P 500 index for day $i$ (in the sp_daily.txt data, the return is already logged); the second input $x_{i2}$ is the difference between the log exchange rates of day $i-1$ and day $i-2$ (use the ex_daily.txt data to create this variable). Reserve the last 60 days in the dataset you have created for testing; the remaining days will be used for training.

(b) Fit (i) logistic regression, (ii) linear discriminant analysis (LDA), (iii) a classification tree, (iv) an SVM with linear kernel, and (v) an SVM with radial kernel. For (iii), (iv), and (v), write your own program to carry out 10-fold cross-validation to choose the tuning parameters. More precisely, for the classification tree, use 10-fold CV to choose the tree size. For the linear SVM, use 10-fold CV to choose $C$ (the bound on the sum of the slack variables). For the radial SVM, use 10-fold CV to choose $C$ and $\sigma$ (recall the radial kernel $K(x, x') = \exp(-\|x - x'\|^2 / \sigma^2)$). For each classifier, report the training and testing error rates.
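The hand-written 10-fold cross-validation asked for in (b) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the data are synthetic, the function name `ten_fold_cv_error` and the grid of values are hypothetical, and scikit-learn's `SVC` is used as the linear SVM. Note that the `C` in `SVC` is a penalty weight on the slack variables, so it is inversely related to the slack budget described in the problem; a larger `SVC` value corresponds to a smaller budget.

```python
import numpy as np
from sklearn.svm import SVC

def ten_fold_cv_error(make_model, X, y, n_folds=10, seed=0):
    """Average misclassification rate over n_folds held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = make_model().fit(X[train], y[train])
        errors.append(np.mean(model.predict(X[val]) != y[val]))
    return float(np.mean(errors))

# Synthetic two-input data with labels in {-1, +1} (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=200) > 0, 1, -1)

# Hypothetical grid of penalty values for the linear SVM.
grid = [0.01, 0.1, 1.0, 10.0]
cv_err = {
    C: ten_fold_cv_error(lambda C=C: SVC(kernel="linear", C=C), X, y)
    for C in grid
}
best_C = min(cv_err, key=cv_err.get)
print(cv_err, best_C)
```

The same function can score any model with `fit` and `predict` methods, so a radial SVM could be handled by looping over pairs of values. In `SVC(kernel="rbf")` the kernel is written as $\exp(-\gamma \|x - x'\|^2)$, so the kernel in (b) corresponds to $\gamma = 1/\sigma^2$.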