https://rpubs.com/Jayblake322/629133
He copied everything from this site, and he is still claiming there is no plagiarism; those plagiarism-checker sites are not official. I will attach the plagiarism report as well. I asked for a revised version and he gave me my friend's assignment: earlier I had uploaded my friend's assignment to be checked, and he returned that same solution to me as the "revised" version. Please give me a refund.
Assessment 1: Naive Bayes classifier and Discriminant Analysis
Issued: Monday of Week 1
Due: 11:59 PM AEST Sunday of Week 3
Weight: 20%
Length: 300 to 1000 words
Maximum score: 25 Marks

Overview
During this assessment you will insert R code and written discussions with justifications into this template file. This assessment implements and explores techniques mainly covered in weeks one, two and three. The assessment is segmented into two tasks: (1) Application of a classifier and (2) Comparison of classifiers.

The purpose of the assignment is to enable you to:
- Code and comment R scripts
- Implement sub-setting, Bayes classifiers and Linear Discriminant Analysis in RStudio
- Compare classification algorithms
- Visually present predictions of classifiers in RStudio

Learning outcomes
Related subject learning outcomes:
- Evaluate, synthesise and apply classic supervised data mining methods for pattern classification.
- Effectively integrate, execute and apply the studied concepts, algorithms, and techniques to real datasets using the computer language R and the software environment RStudio.
- Communicate data concepts and methodologies of data science.

Background
Real-world application of classifiers may require that the predictors used for classification be physically measured; hence, the inclusion of unnecessary predictors may incur additional costs associated with sensors, instruments and computing. Some variables may even require human intervention and/or expensive laboratory analyses in order to be measured. It is important that analysts try to use as few predictors as possible, that is, the smallest set of predictors that are relevant for the classification task at hand and yet sufficient to provide satisfactory classification performance. Selecting predictors is an important task called feature selection in data mining.

Assessment submission
Complete this templated document in Word and submit it as a PDF document.
The question responses are to be entered in the box immediately below the question. Response boxes can be enlarged to accommodate your answer.

Include the following in your submission:
- A PDF file that clearly shows the assignment question, the associated answers, any relevant R outputs, analyses and discussions.
- Your R code in a script file, in case this is needed to check your work.

Upload all submission files in one go. You can upload as many times as you want, but only the last submission is graded.

Name:

Assessment Task 1: Application of a classifier __/11
The Mushroom dataset, available from the UCI Data Repository https://archive.ics.uci.edu/ml/datasets/Mushroom, contains 8124 observations describing mushrooms belonging to the classes edible or potentially poisonous (first variable encoded 'e' or 'p', respectively). There are 22 categorical predictors (variables 2 to 23), one of them with missing values ('?'). Complete the following instructions to develop and justify a supervised classifier using RStudio:

Question 1 Import the Mushroom Data into RStudio. Provide and comment your R-code.
[1 Mark]
# Read the data; '?' marks missing values
df <- read.csv('agaricus-lepiota.data', header = FALSE, na.strings = '?')
# Column names: the class variable followed by the 22 predictors
names <- c('family', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
           'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
           'stalk-shape', 'stalk-root', 'stalk-surface-above',
           'stalk-surface-below-ring', 'stalk-color-above-ring',
           'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
           'ring-type', 'spore-print-color', 'population', 'habitat')
# Set the column names
colnames(df) <- names

Question 2 Randomly split the dataset into a training subset and a test subset containing 80% and 20% of the data. Provide and comment your R-code.
[2 Marks]
# Set the seed for reproducibility
set.seed(1000)
# Sample the training indices (80% of the rows)
train_index <- sample(seq(nrow(df)), round(0.8 * nrow(df)))
# veil-type has a single level; convert it to numeric to avoid factor issues
df$`veil-type` <- as.numeric(df$`veil-type`)
# Training subset
train_data <- df[train_index, ]
# Test subset (note the comma: we subset rows, not columns)
test_data <- df[-train_index, ]

Question 3 Define and justify a classifier to classify the mushroom population into edible or poisonous. The classifier needs to use all 22 predictors (variables V2 to V23) to model the dependent variable (V1).
[2 Marks]
Decision trees, generalized boosted models, logistic regression, neural networks, random forests and support vector machines are all strong methods, but they are examples of discriminative models. Here we instead need Naive Bayes, which is a generative model. Generative models aim for a complete probabilistic description of the data.
With these models, the goal is to construct the joint probability distribution P(X, Y), either directly or by first computing P(X | Y) and P(Y), and then to infer the conditional probabilities required to classify new data.

Question 4 Implement the proposed classifier from step 3 using the training data subset from step 1. Provide and comment your R-code.
[2 Marks]
# View the class distribution
table(df$family)
# Load the required libraries
library(mlbench)
library(e1071)
# Fit the Naive Bayes model on the training data
model0 <- naiveBayes(family ~ ., train_data)

Question 5 Display the summary of the fitted model implemented in Question 4.
[1 Mark]
# Print the fitted model, which shows the priors and conditional probability tables
model0

Question 6 Interpret the relationships between the predictors and features of the fitted model using the role of a Naïve Bayes classifier, i.e. explain the relationships of the model using Bayes' theorem.
[3 Marks]
Naive Bayes is a classifier built on Bayes' theorem. It predicts membership probabilities for each class, i.e. the probability that a given record belongs to a particular class. The class with the highest posterior probability is taken as the prediction; this decision rule is known as maximum a posteriori (MAP). The summary tables give the conditional probability of each feature value within each class. When these probabilities are nearly equal for both classes, the variable is a poor predictor, since it gives no basis for a classification decision.
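To make the MAP rule concrete, here is a minimal illustrative sketch (in Python rather than the assignment's R, and using invented counts for a single hypothetical "odor" feature, not the actual mushroom data) that computes class posteriors via Bayes' theorem and picks the MAP class:

```python
from collections import Counter

# Toy training data: (odor, class) pairs; counts are invented for illustration
data = [('foul', 'p'), ('foul', 'p'), ('foul', 'p'),
        ('none', 'e'), ('none', 'e'), ('none', 'p'),
        ('almond', 'e'), ('almond', 'e')]

# Frequency counts give the prior P(y) and the likelihood P(x | y)
class_counts = Counter(y for _, y in data)
joint_counts = Counter(data)
n = len(data)

def posterior(x):
    """Normalized P(y | x) proportional to P(x | y) * P(y) for each class."""
    scores = {}
    for y, cy in class_counts.items():
        prior = cy / n
        likelihood = joint_counts[(x, y)] / cy
        scores[y] = likelihood * prior
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

def map_class(x):
    """MAP decision: the class with the highest posterior probability."""
    post = posterior(x)
    return max(post, key=post.get)

print(map_class('foul'))    # 'p': foul odour only ever occurs with the poisonous class
print(posterior('none'))    # roughly 2/3 'e' vs 1/3 'p', so a much weaker signal
```

This mirrors the interpretation above: a feature value whose conditional probabilities are similar across classes (like 'none' here) shifts the posterior only slightly, while a value concentrated in one class is a strong predictor.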
Based on the model summary, odor and spore-print-color can be identified as good predictors.

Task 2 on next page

Assessment Task 2: Comparison of classifiers __/6
In this task compare the performance of the supervised learning algorithms Linear Discriminant Analysis, Quadratic Discriminant Analysis and the Naïve Bayes classifier using publicly available blood pressure data. The data to be used for this task is provided in the HbBlood.csv file in the Assessment 1 folder. The HbBlood.csv dataset contains values of the percent HbA1c (a measure of the amount of glucose and haemoglobin joined together in blood) and systolic blood pressure (SBP) (in mm/Hg) for 1,200 clinically healthy female patients within the ages 60 to 70 years. Additionally, the ethnicity, Ethno, for each patient was recorded and categorised into three groups, A, B or C, for analysis.

Question 7 Discuss the properties of using the supervised learning algorithms Linear Discriminant Analysis, Quadratic Discriminant Analysis and the Naïve Bayes classifier to predict Ethno using HbA1c and SBP as the feature variables. Provide any plots/images needed to support your discussion. (100-200 words)
[6 Marks]
# Load the data
blood <- read.csv("HbBlood.csv")
# Remove wrongly coded ethnicity values
blood <- blood[blood$Ethno %in% c('A', 'B', 'C'), ]
blood <- droplevels(blood)
# Plot layout: one row, three panels
par(mfrow = c(1, 3))
boxplot(SBP ~ Ethno,
        blood, main = 'SBP vs Ethno')
boxplot(HbA1c ~ Ethno, blood, main = 'HbA1c vs Ethno')
plot(blood$HbA1c, blood$SBP, col = blood$Ethno,
     main = 'Blood data \nvisualization')
# Load caret
require(caret)
# LDA
blood1 <- train(Ethno ~ ., data = blood, method = 'lda',
                metric = 'Accuracy', na.action = na.omit)
# LDA in-sample predictions and accuracy
prediction_lda <- predict(blood1, blood)
tab_lda <- table(blood$Ethno, prediction_lda)
sum(diag(tab_lda)) / sum(tab_lda)
# QDA
blood2 <- train(Ethno ~ ., data = blood, method = 'qda',
                metric = 'Accuracy', na.action = na.omit)
# QDA in-sample predictions and accuracy
prediction_qda <- predict(blood2, blood)
tab_qda <- table(blood$Ethno, prediction_qda)
sum(diag(tab_qda)) / sum(tab_qda)
# Naive Bayes
blood3 <- train(Ethno ~ ., data = blood, method = 'nb',
                metric = 'Accuracy', na.action = na.omit)
# Naive Bayes in-sample predictions and accuracy (predict from blood3, not blood1)
prediction_nb <- predict(blood3, blood)
tab_nb <- table(blood$Ethno, prediction_nb)
sum(diag(tab_nb)) / sum(tab_nb)

LDA and QDA are similar in that both are classification techniques with Gaussian assumptions. They differ only in the separation rule: for LDA it is a straight line, while for QDA it is a quadratic boundary. Naive Bayes is a rather different classifier that does not rely on a Gaussian assumption; instead, it explicitly models the probability of a class using Bayes' rule and assumes that the features are independent. One important difference is that LDA operates on continuous-valued features, whereas (categorical) Naive Bayes operates on categorical features. This makes them quite different
creatures, and other differences follow from that. Naive Bayes is closer to LDA in that both produce essentially linear boundaries, while QDA's boundary is quadratic. For this task the data are not linearly separable, which is why we obtained in-sample prediction accuracies of about 41% for LDA, 60% for QDA and 41% for NB. For this dataset, QDA is the preferable approach.

Task 3 on next page

Assessment Task 3: Implementation of classifiers __/8
In this task, compare the performance of the supervised learning algorithms Linear Discriminant Analysis and the Naïve Bayes classifier using publicly available heart disease data. The heart.txt data contains average systolic blood pressures (SBP) for men and women from 41 different countries. It also gives the 95% confidence interval for each estimated blood pressure measure.

Question 8 Implement both the LDA and Naïve Bayes classifiers using the heart.txt data to classify gender (women and men) using the variable SBP only. Display the R-code, code comments and model summaries for both models.
[2 Marks]
# Read the data (space-separated text file)
heart <- read.csv("heart (1).txt", quote = "", sep = ' ',
                  row.names = NULL, stringsAsFactors = FALSE)
# Keep complete cases only
heart <- heart[complete.cases(heart), ]
# Remove unused columns
heart[, c("row.names", "X.Country.")] <- NULL
# Fix the column names
colnames(heart) <- c('Sex', 'SBP', '95.LCL.SBP', '95.UCL.SBP')
# Fit the LDA model (note the comma after the formula)
model1 <- train(Sex ~ SBP, data = heart, method = 'lda')
# Fit the Naive Bayes model
model2 <- train(Sex ~ SBP, data = heart, method = 'naive_bayes')
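As a side note on what LDA does with a single predictor: with one Gaussian feature per class and a shared (pooled) variance, the LDA decision boundary reduces to a simple threshold. A minimal sketch (in Python rather than R, with invented class means and variance, not the real heart data):

```python
import math

# Invented per-class summaries for a single SBP-like predictor (illustration only)
mean_w, mean_m = 120.0, 140.0   # class means for 'Women' / 'Men'
var = 25.0                      # shared (pooled) variance assumed by LDA
prior_w = prior_m = 0.5         # equal class priors

def lda_score(x, mean, prior):
    """Linear discriminant function: x*mu/var - mu^2/(2*var) + log(prior)."""
    return x * mean / var - mean**2 / (2 * var) + math.log(prior)

def classify(x):
    """Assign the class whose discriminant score is larger."""
    if lda_score(x, mean_w, prior_w) > lda_score(x, mean_m, prior_m):
        return 'Women'
    return 'Men'

# With equal priors the boundary is the midpoint of the class means (here 130)
print(classify(125))  # 'Women'
print(classify(135))  # 'Men'
```

With equal priors, setting the two linear scores equal gives the threshold x = (mu_w + mu_m) / 2, which is why a single predictor with well-separated class means can classify perfectly.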
# View the model summaries
model1
model2

Question 9 Compare the model error of the LDA and Naïve Bayes classifiers derived in Question 8. (30-80 words)
[2 Marks]
# Contingency tables of actual versus predicted class
tab_lda <- table(heart$Sex, predict(model1, heart))
tab_nb <- table(heart$Sex, predict(model2, heart))
# Estimate the error rates (1 - accuracy)
1 - sum(diag(tab_lda)) / sum(tab_lda)
1 - sum(diag(tab_nb)) / sum(tab_nb)
The in-sample errors of both models are equal to zero, and the models have the same kappa. Both models show excellent performance on the training data.

Question 10 Discuss your findings from Question 8 and Question 9 using the algorithm assumptions of both LDA and Naïve Bayes classifiers as the basis of your discussion. Provide any plots/images or analysis needed to support your discussion. (80-200 words)
[4 Marks]
# Plot layout: two rows, three panels
par(mfrow = c(2, 3))
plot(heart$SBP, heart$`95.LCL.SBP`, col = heart$Sex, main = 'Scatterplot')
plot(heart$SBP, heart$`95.UCL.SBP`, col = heart$Sex, main = 'Scatterplot')
plot(heart$`95.LCL.SBP`, heart$`95.UCL.SBP`, col = heart$Sex, main = 'Scatterplot')
boxplot(SBP ~ Sex, heart, main = 'Boxplot')
boxplot(`95.UCL.SBP` ~ Sex, heart, main = 'Boxplot')
boxplot(`95.LCL.SBP` ~ Sex, heart, main = 'Boxplot')
table(heart$Sex)
For this specific task, all variables have non-overlapping ranges across the two levels of the categorical variable Sex, so even a single predictor yields excellent classification results.
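The error-rate calculation used in Question 9 (one minus the sum of the confusion-table diagonal over the table total) can be sketched generically. An illustrative Python version with a made-up 2x2 confusion matrix (not the actual heart-data results, which were error-free):

```python
# Rows = actual classes, columns = predicted classes (invented counts)
confusion = [[18, 2],
             [1, 19]]

def error_rate(tab):
    """Misclassification rate: 1 - sum of diagonal / sum of all cells."""
    correct = sum(tab[i][i] for i in range(len(tab)))
    total = sum(sum(row) for row in tab)
    return 1 - correct / total

print(error_rate(confusion))  # 3 of 40 misclassified, about 0.075
```

A perfectly classified sample puts all counts on the diagonal, which is exactly the zero-error situation reported for both the LDA and Naive Bayes models above.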