All the instructions are mentioned in the attached file.
COMP 5070 Exam SP5 2018
COMP 5070 Statistical Programming for Data Science
Take Home Exam
DUE: by 11:55 PM (CST), Friday 23rd November

• The take-home exam is worth 30% of your overall grade. The exam is out of 100 marks.
• The exam is to be submitted online as a compressed file (e.g. .zip, .tar.gz, .gz). This compressed file should include ALL code needed to run your program and any other files you created yourself. You do NOT need to include any data files provided to you, as it will be assumed I too have them.
• To obtain the maximum available marks you should aim to:
  1. Code all requested components (30%).
  2. Use a clear style of code presentation (10%). Code clarity is an important part of your submission. Thus you should choose meaningful variable names and adopt the use of comments. You don't need to comment every single line, as this will affect readability; however, you should aim to comment at least each section of code.
  3. Have the code run successfully (5%).
  4. Output the information in a presentable manner and present your written analysis of the output (55%).
• Plagiarism is a specific form of academic misconduct. Although the University encourages discussing work with others and the Social Forum will support this, ultimately this submission is to represent your individual work. If plagiarism is found, all parties will be penalised. You should retain copies of all assignment computer files used during development. These files must remain unchanged after submission, for the purpose of checking if required.
• For the purpose of this exam, a "paragraph" is considered to consist of approximately 6-8 lines. You are welcome to exceed this amount.
• This exam appears longer than it actually is: explanations are given to help you understand the requested analyses and I have also provided hints.
• You do not need to write specialised code as you did for the assignments. You should be able to find nearly all the code you need from the R files provided throughout the course, via case studies and other examples. If you copy/paste code from the R code I have provided, this should give you nearly 100% of the code needed for this exam, with a few alterations on your behalf (e.g. file names, variable names etc).
Question 1 (60 Marks)
It’s All in the Taste: Experts vs Amateurs

Who is better at discerning the tastes of supermarket chocolate? Do you really need training to know if you like it? Or does it all just taste really good? The Experts battle it out against a group of dedicated chocolate-eating Amateurs! I would really like to have that job.

The data for this question are the responses to the sensometric qualities of chocolate that can be purchased in supermarkets. Two groups were asked to rate the qualities of the chocolates: the first group contained a panel of sensometric experts with responses recorded over 9 different tasting sessions. The accompanying data is in chocolate_experts.csv.

The second group contained a panel of volunteers chosen to represent 'regular shoppers' who underwent a three-hour sensometric training session before rating the qualities of the chocolate over 2 different tasting sessions. The accompanying data is in chocolate_amateurs.csv.

The responses were recorded over a continuous scale from 0 to 10, with 0 indicating the absence of the sensometric quality and 10 indicating fully present. It is of interest to determine if experts perceive supermarket chocolate differently to non-experts (the amateurs) using 14 sensometric variables (Chocolate Aroma through to Granular Texture in the data files).

For this question you need to randomly obtain two session ids for the expert responses only by making a call to sample as shown below. The two numbers that are returned are your session ids that you need to extract for your analysis.

    sample(9,2)

For the expert data you will only need to analyse the responses corresponding to the two randomly selected session ids. Amateur data needs to be used in full.

You are asked to compare the responses between the two groups as requested in each part below. A partially written R script is available as part of the exam package. You must use this script for your analysis and follow the instructions therein. Any lines marked with

    ####!!!EXAMTIP!!!

require you to change that line of code to suit your purposes. Further details are provided in the code comments around that line.
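The session selection described above can be sketched as follows. This is only an illustration under stated assumptions: the data frame is simulated rather than read from chocolate_experts.csv, and the column names `session` and `ChocolateAroma` are hypothetical stand-ins for whatever the real file uses. The call to `set.seed()` is also an addition, included so the draw can be reproduced; the exam itself only asks for `sample(9,2)`.

```r
# Simulated stand-in for read.csv("chocolate_experts.csv").
# Column names here are hypothetical, not taken from the real file.
set.seed(2018)  # assumption: fixed seed so the same sessions are drawn on every run

experts <- data.frame(
  session        = rep(1:9, each = 3),          # 9 tasting sessions, 3 rows each
  ChocolateAroma = round(runif(27, 0, 10), 1)   # ratings on the 0-10 scale
)

# Draw two distinct session ids from 1..9, as the exam instructs
session_ids <- sample(9, 2)

# Keep only the rows that belong to the two selected sessions
experts_subset <- experts[experts$session %in% session_ids, ]

print(session_ids)
print(table(experts_subset$session))
```

Recording the two ids that were drawn matters because the write-up has to report them, and without a fixed seed a second run of the script would select different sessions.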
For the purposes of this exam a paragraph is 8-12 lines of text. Specifically, your analysis should include:

i) Initial Data Discussion: Write a short explanation (approximately 1 paragraph) of the analysis to be performed and an explanation of the data. Include your session IDs for the expert responses, and any data manipulation performed prior to analysis, should you do so.

ii) Exploratory Factor Analysis: conduct two separate exploratory factor analyses: the first for your selected id sessions for the expert responses, the other for the full set of amateur responses. You may present the analyses side-by-side or in sequence; however you believe is best. For each Exploratory Factor Analysis you need to include the following:
   • If appropriate, Cronbach Alpha output and a short discussion (2-3 lines) of whether the data is trustworthy and why.
   • Correlation output of your choosing (graphical and/or numerical) with an accompanying discussion (3-4 lines). If numerical, round the correlations to 2 digits.
   • A single paragraph explaining the outcome of the determinant test, Bartlett’s test of sphericity and the KMO statistic for both data sets. Do not include R output.
   • Your decision regarding the number of factors to estimate (scree plot may be shown, do not show the R console output).
   • The FINAL factor solution. You do not need to discuss results of any of the other solutions, however you should justify your final factor solution, including loadings, and name the factors in each analysis. You should also include up to two sentences indicating whether the test of residuals was passed and whether the factors are correlated.
   • All factors should be named and an explanation as to how you came up with these names should be included.
   • Based on the factor analysis results and your chosen factor names, discuss the factors that have emerged from the study. What types of differences (if any) exist between the expert and amateur sensometric ratings?

iii) Conclusions: write 2 paragraphs of conclusions based on your analysis.

Hints:
   • To make the correlation matrix more readable, use the round() command in R, e.g.
         round(cor(df), 2)
     will compute the correlation matrix of the data in the matrix df, to two decimal places. You can use this tip for any other matrices too.
   • The best solution may or may not be the rotated solution, based on your randomly selected sessions. Choose your solution based on the principles of a good Exploratory Factor Analysis (EFA).
   • If items are not loading onto a factor, one reason could be that you have not extracted enough factors from the data. Reconsider your analysis if necessary; however, this may not solve the problem. Use the principles of EFA to make your final decision.
   • While no split loadings are desirable in EFA, a small number may be unavoidable. Again you should ultimately choose your final solution based on the principles of what constitutes a good Exploratory Factor Analysis.
   • If the correlations between factors suggest an oblique rotation is required, simply note this in your discussion. Do not re-run the analysis.

Question 2 (40 Marks)
Are We There Yet? Clustering Cities Around the World

The data for this question are distances between cities in different regions of the world. You will need to use the data set individually assigned to you. The file cities.xlsx on the Assignments page indicates the continent assigned to each student. Each data set contains a distance matrix and can be found on the assignments page, in a file of the form RegionCitiesClustering.dat. For example, for the European data the file will be called EuropeanCitiesClustering.dat.

For this question, you are asked to conduct clustering analysis using both hierarchical and partitional clustering techniques.

For the purposes of this exam a paragraph is 8-12 lines of text. Specifically, your analysis should include:

i) Initial Data Discussion: Write a short explanation (approximately 1 paragraph) of the analysis to be performed and an explanation of the data, including any data manipulation performed prior to clustering.
ii) Hierarchical clustering: conduct hierarchical clustering on the data, choosing an appropriate AGNES-based method based on either single, complete, average-linkage or Ward’s method. Ensure you justify your choice in your write-up and include the resulting dendrogram, as well as a discussion of the outcomes of hierarchical clustering on your data.

iii) Partitional clustering: conduct a partitional clustering of your data using K-means. Ensure you explain and include any relevant R output (including graphics) supporting your choice of k, the number of clusters.

iv) Discussion: (1-2 paragraphs) of your results.

v) Validation: as a form of cluster validation, consider the following:
   If there are obvious outliers or distances that should be removed, identify these in your write-up and re-run your chosen Partitional Clustering algorithm, adjusting k if necessary. Include justification of your choice of the new value for k.
   If there are no obvious outliers/distances that should be removed, then explain this conclusion with justification. In this case re-run your chosen Partitional Clustering algorithm for a different value of k to that used in Step 3 above. Include justification of your choice for the new value for k.

vi) Conclusions: write 2 paragraphs of conclusions based on your analysis, including a statement regarding which clustering solution is the better one and why.

Hint:
   • For hierarchical clustering, ensure you define the height of the dendrogram according to the size of the values in the output.
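
As a rough illustration of the two clustering steps requested in Question 2, the sketch below runs AGNES and K-means on a small made-up distance matrix. Everything about the data is an assumption: the five city labels and their distances are invented, and the real analysis would start from the assigned RegionCitiesClustering.dat file. Average linkage is picked only as an example of one of the permitted methods. Because `kmeans()` works on coordinates rather than distances, the sketch first embeds the cities in two dimensions with classical multidimensional scaling (`cmdscale()`); that step is one possible approach and may differ from what the course scripts do.

```r
library(cluster)  # provides agnes(); a recommended package in standard R installs

# Toy symmetric distance matrix with invented values and hypothetical labels
cities <- c("A", "B", "C", "D", "E")
dmat <- matrix(c(
    0,  10,  12, 100, 105,
   10,   0,   8,  95, 100,
   12,   8,   0,  98, 102,
  100,  95,  98,   0,   9,
  105, 100, 102,   9,   0),
  nrow = 5, byrow = TRUE, dimnames = list(cities, cities))
d <- as.dist(dmat)

# Agglomerative (AGNES) clustering; "single", "complete" and "ward"
# are the other linkage options named in the question
ag <- agnes(d, method = "average")
hier_groups <- cutree(as.hclust(ag), k = 2)  # cut the tree into 2 groups

# K-means needs point coordinates, so embed the distances in 2-D first
# (assumption: classical MDS is an acceptable preprocessing step here)
coords <- cmdscale(d, k = 2)
set.seed(1)  # k-means starts from random centres
km <- kmeans(coords, centers = 2, nstart = 10)

print(hier_groups)
print(km$cluster)
```

With the real data, `pltree(ag)` would draw the dendrogram, and comparing `km$tot.withinss` across several values of k is one common way to support the choice of k.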