Data Preprocessing (30 points): Preprocess the data you are given to your liking. This may include dropping some columns you won't use, addressing noisy or missing data, etc. Use Pandas as a dataframe...

Please help with this assignment. See attachment for questions, resources, and data to be used.




The dataset that should be used is from this link:

https://www.kaggle.com/competitions/avazu-ctr-prediction/data
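
As a starting point for the Data Preprocessing part, here is a minimal pandas sketch, not a definitive pipeline. It assumes the competition's training file has been downloaded locally as train.gz, that only a sample of rows is read (the full file is several gigabytes), and that the columns match those documented on the competition page (click as the 0/1 label, hour encoded as YYMMDDHH, plus anonymized categorical features). The choice of which columns to drop and the use of pd.factorize to turn categories into integer codes are illustrative, not requirements.

import pandas as pd

# Read a manageable sample of the (large) training file.
# "train.gz" is a placeholder path; point it at the downloaded Kaggle file.
df = pd.read_csv("train.gz", compression="gzip", nrows=1_000_000,
                 dtype={"id": str})

# The published file has no missing values, but drop any just in case.
df = df.dropna()

# 'hour' is encoded as YYMMDDHH; derive simpler time features from it.
ts = pd.to_datetime(df["hour"].astype(str), format="%y%m%d%H")
df["hour_of_day"] = ts.dt.hour
df["day_of_week"] = ts.dt.dayofweek

# Drop the row id, the raw timestamp, and a few very high-cardinality
# identifiers we won't use (an illustrative choice).
df = df.drop(columns=["id", "hour", "device_id", "device_ip"])

# The remaining non-label columns are categorical; map each one to integer
# codes so the frame stays numeric. (pd.get_dummies is an alternative for
# the low-cardinality columns.)
for col in df.columns:
    if col != "click":
        df[col], _ = pd.factorize(df[col])

print(df.shape)
print(df["click"].mean())   # overall click-through rate

The resulting numeric dataframe is what the from-scratch logistic regression would then be trained and tested on.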








Data Preprocessing (30 points)

Preprocess the data you are given to your liking. This may include dropping some columns you won't use, addressing noisy or missing data, etc. Use Pandas as a dataframe abstraction for this task. You can learn about Pandas here: (embedded video: "Pandas Tutorial 2021" on YouTube). Ultimately this task has to result in a dataframe that you will use for training and testing the classifier.

Logistic Regression (40 points)

Write from scratch the logistic regression solution to the prediction problem that can work with Stochastic Gradient Descent (SGD). You can only use pandas and numpy to do so. Show clearly all equations of the gradient and include comments in either markdown or Python (inline to code) explaining every stage of processing. Also, highlight any enhancements you may have made to improve performance, such as regularization.

Performance Results (20 points)

Plot the precision vs. recall curve of your classifier. Clearly explain the tradeoff between the two quantities and the shape of the curve.

Chapter 16. Logistic Regression

A lot of people say there's a fine line between genius and insanity. I don't think there's a fine line, I actually think there's a yawning gulf.
Bill Bailey

In Chapter 1, we briefly looked at the problem of trying to predict which DataSciencester users paid for premium accounts. Here we'll revisit that problem.

The Problem

We have an anonymized dataset of about 200 users, containing each user's salary, her years of experience as a data scientist, and whether she paid for a premium account (Figure 16-1). As is usual with categorical variables, we represent the dependent variable as either 0 (no premium account) or 1 (premium account).

As usual, our data is in a matrix where each row is a list [experience, salary, paid_account]. Let's turn it into the format we need:

x = [[1] + row[:2] for row in data]   # each element is [1, experience, salary]
y = [row[2] for row in data]          # each element is paid_account

An obvious first attempt is to use linear regression and find the best model:

paid account = β0 + β1 · experience + β2 · salary + ε

Figure 16-1. Paid and unpaid users

And certainly, there's nothing preventing us from modeling the problem this way. The results are shown in Figure 16-2:

# rescale, estimate_beta, and predict are helpers from earlier chapters of the book
rescaled_x = rescale(x)
beta = estimate_beta(rescaled_x, y)   # [0.26, 0.43, -0.43]
predictions = [predict(x_i, beta) for x_i in rescaled_x]

plt.scatter(predictions, y)
plt.xlabel("predicted")
plt.ylabel("actual")
plt.show()

Figure 16-2. Using linear regression to predict premium accounts

But this approach leads to a couple of immediate problems:

- We'd like for our predicted outputs to be 0 or 1, to indicate class membership. It's fine if they're between 0 and 1, since we can interpret these as probabilities: an output of 0.25 could mean a 25% chance of being a paid member. But the outputs of the linear model can be huge positive numbers or even negative numbers, which it's not clear how to interpret. Indeed, here a lot of our predictions were negative.

- The linear regression model assumed that the errors were uncorrelated with the columns of x. But here, the regression coefficient for experience is 0.43, indicating that more experience leads to a greater likelihood of a premium account. This means that our model outputs very large values for people with lots of experience. But we know that the actual values must be at most 1, which means that necessarily very large outputs (and therefore very large values of experience) correspond to very large negative values of the error term. Because this is the case, our estimate of beta is biased.

What we'd like instead is for large positive values of dot(x_i, beta) to correspond to probabilities close to 1, and for large negative values to correspond to probabilities close to 0. We can accomplish this by applying another function to the result.
The Logistic Function

In the case of logistic regression, we use the logistic function, pictured in Figure 16-3:

def logistic(x):
    return 1.0 / (1 + math.exp(-x))

Figure 16-3. The logistic function

As its input gets large and positive, it gets closer and closer to 1. As its input gets large and negative, it gets closer and closer to 0. Additionally, it has the convenient property that its derivative is given by:

def logistic_prime(x):
    return logistic(x) * (1 - logistic(x))

which we'll make use of in a bit. We'll use this to fit a model:

y_i = f(x_i · β) + ε_i

where f is the logistic function.

Recall that for linear regression we fit the model by minimizing the sum of squared errors, which ended up choosing the β that maximized the likelihood of the data.

Here the two aren't equivalent, so we'll use gradient descent to maximize the likelihood directly. This means we need to calculate the likelihood function and its gradient.

Given some β, our model says that each y_i should equal 1 with probability f(x_i · β) and 0 with probability 1 − f(x_i · β).

In particular, the pdf for y_i can be written as:

p(y_i | x_i, β) = f(x_i · β)^(y_i) · (1 − f(x_i · β))^(1 − y_i)

since if y_i is 0, this equals:

1 − f(x_i · β)

and if y_i is 1, it equals:

f(x_i · β)

It turns out that it's actually simpler to maximize the log likelihood:

log L(β | x_i, y_i) = y_i · log f(x_i · β) + (1 − y_i) · log(1 − f(x_i · β))

Because log is a strictly increasing function, any beta that maximizes the log likelihood also maximizes the likelihood, and vice versa.

def logistic_log_likelihood_i(x_i, y_i, beta):
    if y_i == 1:
        return math.log(logistic(dot(x_i, beta)))
    else:
        return math.log(1 - logistic(dot(x_i, beta)))

If we assume different data points are independent from one another, the overall likelihood is just the product of the individual likelihoods, which means the overall log likelihood is the sum of the individual log likelihoods:

def logistic_log_likelihood(x, y, beta):
    return sum(logistic_log_likelihood_i(x_i, y_i, beta)
               for x_i, y_i in zip(x, y))

A little bit of calculus gives us the gradient:

def logistic_log_partial_ij(x_i, y_i, beta, j):
    """here i is the index of the data point,
    j the index of the derivative"""
    return (y_i - logistic(dot(x_i, beta))) * x_i[j]

def logistic_log_gradient_i(x_i, y_i, beta):
    """the gradient of the log likelihood
    corresponding to the ith data point"""
    return [logistic_log_partial_ij(x_i, y_i, beta, j)
            for j, _ in enumerate(beta)]

def logistic_log_gradient(x, y, beta):
    return reduce(vector_add,
                  [logistic_log_gradient_i(x_i, y_i, beta)
                   for x_i, y_i in zip(x, y)])

at which point we have all the pieces we need. (In Python 3, reduce and partial come from functools; dot, vector_add, rescale, train_test_split, maximize_batch, and maximize_stochastic are the helper functions built up in earlier chapters of the book.)

Applying the Model

We'll want to split our data into a training set and a test set:

random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(rescaled_x, y, 0.33)

# want to maximize log likelihood on the training data
fn = partial(logistic_log_likelihood, x_train, y_train)
gradient_fn = partial(logistic_log_gradient, x_train, y_train)

# pick a random starting point
beta_0 = [random.random() for _ in range(3)]

# and maximize using gradient descent
beta_hat = maximize_batch(fn, gradient_fn, beta_0)

Alternatively, you could use stochastic gradient descent:

beta_hat = maximize_stochastic(logistic_log_likelihood_i,
                               logistic_log_gradient_i,
                               x_train, y_train, beta_0)

Either way we find approximately:

beta_hat = [-1.90, 4.05, -3.87]

These are coefficients for the rescaled data, but we can transform them back to the original data as well:

beta_hat_unscaled = [7.61, 1.42, -0.000249]

Unfortunately, these are not as easy to interpret as linear regression coefficients. All else being equal, an extra year of experience adds 1.42 to the input of logistic. All else being equal, an extra $10,000 of salary subtracts 2.49 from the input of logistic.

The impact on the output, however, depends on the other inputs as well. If dot(beta, x_i) is already large (corresponding to a probability close to 1), increasing it even by a lot cannot affect the probability very much. If it's close to 0, increasing it just a little might increase the probability quite a bit.
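
The excerpt trains with the book's maximize_batch / maximize_stochastic helpers; the assignment instead asks for the same model written with only numpy and pandas and trained by SGD, optionally with regularization. For example i, the gradient of the L2-penalized log likelihood with respect to β_j is (y_i − f(x_i · β)) · x_ij − λ · β_j, which is exactly logistic_log_gradient_i above plus the penalty term. A minimal sketch under those assumptions follows; X is an (n, d) numpy array whose first column is all ones, y is a 0/1 array, and the learning rate, epoch count, and penalty strength are illustrative values, not tuned ones.

import numpy as np

def logistic(z):
    # clip to avoid overflow in exp for very large negative inputs
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def sgd_logistic_regression(X, y, lr=0.1, epochs=5, l2=1e-4, seed=0):
    """Fit beta by stochastic gradient ascent on the L2-penalized
    log likelihood, visiting one shuffled example at a time."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            p = logistic(X[i] @ beta)          # predicted P(y_i = 1)
            grad = (y[i] - p) * X[i]           # gradient of the log likelihood
            grad[1:] -= l2 * beta[1:]          # L2 penalty; leave the intercept alone
            beta += lr * grad                  # ascend the gradient
    return beta

Usage would look like beta_hat = sgd_logistic_regression(X_train, y_train), after which logistic(X_test @ beta_hat) gives predicted probabilities for the held-out rows.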
What we can say is that, all else being equal, people with more experience are more likely to pay for accounts. And that, all else being equal, people with higher salaries are less likely to pay for accounts. (This was also somewhat apparent when we plotted the data.)

Goodness of Fit

We haven't yet used the test data that we held out. Let's see what happens if we predict paid account whenever the probability exceeds 0.5:

true_positives = false_positives = true_negatives = false_negatives = 0

for x_i, y_i in zip(x_test, y_test):
    predict = logistic(dot(beta_hat, x_i))

    if y_i == 1 and predict >= 0.5:   # TP: paid and we predict paid
        true_positives += 1
    elif y_i == 1:                    # FN: paid and we predict unpaid
        false_negatives += 1
    elif predict >= 0.5:              # FP: unpaid and we predict paid
        false_positives += 1
    else:                             # TN: unpaid and we predict unpaid
        true_negatives += 1

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

This gives a precision of 93% ("when we predict paid account we're right 93% of the time") and a recall of 82% ("when a user has a paid account we predict paid account 82% of the time"), both of which are pretty respectable numbers. (The assignment asks for the full precision vs. recall curve rather than a single operating point; a sketch that sweeps this 0.5 threshold to trace that curve appears after this excerpt.)

We can also plot the predictions versus the actuals (Figure 16-4), which also shows that the model performs well:

predictions = [logistic(dot(beta_hat, x_i)) for x_i in x_test]
plt.scatter(predictions, y_test)
plt.xlabel("predicted probability")
plt.ylabel("actual outcome")
plt.title("Logistic Regression Predicted vs. Actual")
plt.show()

Figure 16-4. Logistic regression predicted versus actual

Support Vector Machines

The set of points where dot(beta_hat, x_i) equals 0 is the boundary between our classes. We can plot this to see exactly what our model is doing (Figure 16-5).

This boundary is a hyperplane that splits the parameter space into two half-spaces corresponding to predict paid and predict unpaid. We found it as a side effect of finding the most likely logistic model.

An alternative approach to classification is to just look for the hyperplane that "best" separates the classes in the training data. This is the idea behind the support vector machine, which finds the hyperplane that maximizes the distance to the nearest point in each class (Figure 16-6).

Figure 16-5. Paid and unpaid users with decision boundary

Finding such a hyperplane is an optimization problem that involves techniques that are too advanced for us. A different problem is that a separating hyperplane might not exist at all. In our "who pays?" dataset there simply is no line that perfectly separates the paid users from the unpaid users.

We can (sometimes) get around this by transforming the data into a higher-dimensional space. For example, consider the simple one-dimensional dataset shown in Figure 16-7.

Figure 16-6. A separating hyperplane

It's clear that there's no hyperplane that separates the positive examples from the negative ones. However, look at what happens when we map this dataset to two dimensions by sending the point x to (x, x**2). Suddenly it's possible to find a hyperplane that splits the data (Figure 16-8).

This is usually called the kernel trick because rather than actually mapping the points into the higher-dimensional space (which could be expensive if there are a lot of points and the mapping is complicated), we can use a "kernel" function to compute dot products in the higher-dimensional space and use those to find a hyperplane. (A tiny numeric illustration of this mapping appears after this excerpt.)

Figure 16-7. A nonseparable one-dimensional dataset

It's hard (and probably not a good idea) to use support vector machines without relying on specialized optimization software written by people with the appropriate expertise, so we'll have to leave our treatment here.

Figure 16-8. Dataset becomes separable in higher dimensions

For Further Investigation

scikit-learn has modules for both Logistic Regression and Support Vector Machines.

libsvm is the support vector machine implementation that scikit-learn is using behind the scenes. Its website has a variety of useful documentation about support vector machines.
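
As a tiny numeric illustration of the mapping described in the support vector machine section (this is only the x to (x, x**2) toy example from the excerpt, not a real kernelized SVM):

import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = np.array([ 1,  1,  0, 0, 0, 1, 1])   # no single threshold on x separates the 0s from the 1s

phi = np.column_stack([x, x ** 2])        # map each point x to (x, x**2)

# In the mapped space the horizontal line x**2 = 2.5 separates the classes:
# every positive example has x**2 >= 4, every negative one has x**2 <= 1.
print(phi[y == 1].min(axis=0))   # [-3  4]
print(phi[y == 0].max(axis=0))   # [1 1]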
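
For the Performance Results part, the single precision/recall pair computed in the Goodness of Fit section generalizes to a curve by sweeping the decision threshold instead of fixing it at 0.5. A minimal numpy/matplotlib sketch follows; y_test is assumed to be a 0/1 numpy array and probs the classifier's predicted probabilities for the test set (for example logistic(X_test @ beta_hat) from the SGD sketch above), names introduced here for illustration rather than defined in the excerpt.

import numpy as np
import matplotlib.pyplot as plt

def precision_recall_points(y_true, probs):
    """Precision and recall obtained by treating the top-k scored examples
    as positive, for k = 1..n, i.e. at every possible threshold."""
    order = np.argsort(-probs)           # indices sorted by descending score
    y_sorted = y_true[order]
    tp = np.cumsum(y_sorted)             # true positives among the top k
    fp = np.cumsum(1 - y_sorted)         # false positives among the top k
    precision = tp / (tp + fp)
    recall = tp / y_true.sum()
    return precision, recall

precision, recall = precision_recall_points(y_test, probs)
plt.plot(recall, precision)
plt.xlabel("recall")
plt.ylabel("precision")
plt.title("Precision vs. recall")
plt.show()

Moving right along the curve corresponds to lowering the threshold: recall can only rise because more of the true positives are captured, while precision tends to fall because more negatives get flagged as well, which is the tradeoff the assignment asks you to discuss.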