Assessment 1: Naive Bayes Classifier and Discriminant Analysis

Issued: Sunday of Week 1
Due: 11:59 pm AEST, Sunday of Week 3
Weight: 30%
Maximum score: 50 marks

Overview

In this assessment you will insert R code and written discussions, with justifications, into this template file. The assessment implements and explores techniques mainly covered in Weeks 1 and 2. It is segmented into three tasks: (1) comparison of classifiers; (2) application of a classifier; and (3) implementation of classifiers.

The purpose of the assignment is to enable you to:
1. Code and comment R scripts
2. Implement sub-setting, Bayes classifiers and discriminant analysis in RStudio
3. Compare classification algorithms
4. Visually present predictions of classifiers in RStudio

Learning outcomes

Related subject learning outcomes:
1. Evaluate, synthesise and apply classic supervised data mining methods for pattern classification.
2. Effectively integrate, execute and apply the studied concepts, algorithms and techniques to real datasets using the computer language R and the software environment RStudio.
3. Communicate data science concepts and methodologies.

Background

Real-world application of classifiers may require that the predictors used for classification be physically measured; hence, the inclusion of unnecessary predictors may incur additional costs associated with sensors, instruments and computing. Some variables may even require human intervention and/or expensive laboratory analyses in order to be measured. It is therefore important that analysts use as few predictors as possible: the smallest set of predictors that is relevant to the classification task at hand, yet sufficient to provide satisfactory classification performance. Selecting predictors is an important task in data mining called feature selection.

Assessment submission

Your submission should include:
1. A PDF/HTML output file that clearly shows each assignment question, the associated answers, and any relevant R outputs, analyses and discussions.
2. The R script (code) file as evidence.
3. The task cover sheet.

The assignment should not exceed eight A4 pages; appendices do not form part of the page limit. The assignment must be presented in 12-point font on A4 pages using single line spacing. Note that RMarkdown is not required for this assessment but is highly recommended. Upload all submission files in one go. You can upload the assessment up to three times; however, only the last submission is graded.

A word on plagiarism

Plagiarism is the act of using another's words, works or ideas from any source as one's own. Plagiarism has no place in a university. Student work containing plagiarised material will be subject to formal university processes, in line with the procedure described in the subject outline.

Assessment Task 1: Comparison of classifiers (10 marks)

In this task, compare the performance of the supervised learning algorithms Linear Discriminant Analysis, Quadratic Discriminant Analysis and the Naive Bayes classifier using publicly available blood pressure data. The data for this task are provided in the HBblood.csv file in the Assessment 1 folder. The HBblood.csv dataset contains values of percent HbA1c (a measure of the amount of glucose and haemoglobin joined together in the blood) and systolic blood pressure (SBP, in mmHg) for 1,200 clinically healthy female patients aged 60 to 70 years. Additionally, the ethnicity (Ethno) of each patient was recorded and categorized into three groups, A, B or C, for analysis.

1. Discuss and justify which of the supervised learning algorithms (i.e. Linear Discriminant Analysis, Quadratic Discriminant Analysis or the Naive Bayes classifier) you would choose for predicting the response Ethno using HbA1c and SBP as the feature variables. Provide any plots/images needed to support your discussion. (10 marks)
Hint: Base your answer on the empirical statistical properties of the data in relation to the model assumptions.
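Although the choice of algorithm is yours to justify, the empirical checks the hint points to can be scripted along the following lines. This is a minimal sketch, not a prescribed solution; it assumes HBblood.csv sits in the working directory and that its columns are named HbA1c, SBP and Ethno.

    # Minimal sketch of empirical checks for Task 1 (column names assumed)
    blood <- read.csv("HBblood.csv")
    blood$Ethno <- factor(blood$Ethno)

    # Scatter plot of the two features, coloured by ethnicity group
    plot(blood$HbA1c, blood$SBP, col = blood$Ethno,
         xlab = "HbA1c (%)", ylab = "SBP (mmHg)")

    # Class-conditional covariance matrices: roughly equal matrices are
    # consistent with LDA, while clearly unequal ones favour QDA
    by(blood[, c("HbA1c", "SBP")], blood$Ethno, cov)

    # Within-class correlation between the two features: values near zero
    # are consistent with the conditional-independence assumption of
    # naive Bayes
    by(blood[, c("HbA1c", "SBP")], blood$Ethno,
       function(d) cor(d$HbA1c, d$SBP))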
Assessment Task 2: Application of a classifier (20 marks)

The nursery data (nursery.csv) were derived from a hierarchical decision model originally developed to rank applications for nursery schools. It was used for several years in the 1980s, when there was excessive enrollment in these schools in Ljubljana, Slovenia, and rejected applications frequently needed an objective explanation. The dataset given here contains eight attributes (column names) and one decision variable for nearly 13,000 applications. The attached folder contains a description of the data. Please complete the following tasks.

1. Randomly split the dataset into a training subset and a test subset containing 80% and 20% of the data, respectively. Provide your R code with appropriate annotation/commentary. (6 marks)
2. With the training data, check whether the attributes "finance", "social" and "parents" have a statistically significant association with the response, "outcome". (1 mark)
Hint: You can choose any suitable statistical test from MA5820.
3. Propose a classification methodology to classify the outcome (response) of a randomly selected nursery application into the categories accept versus reject, using the eight features in this dataset. Give reasons for your choice based on your learning from Weeks 1 and 2. (3 marks)
4. Implement the classifier proposed in Question 3 on the training data subset you created in Question 1. Provide your R code with appropriate annotation/commentary. Fit a classifier with all eight features. Using relevant R outputs, interpret and discuss the relationships between the predictors and the response variable. For this discussion/interpretation you can choose any two of the three attributes you investigated in Question 2. (6 marks)
5. Discuss the accuracy of the fitted model on the test data. Show relevant R code and output to support your discussion. Did the fitted model improve prediction compared with a model with no features? (4 marks)
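As one possible route through Questions 1, 2, 4 and 5, the sketch below uses a naive Bayes classifier, a natural candidate when all predictors are categorical, though the choice in Question 3 is yours to argue. The column names (outcome, finance, social, parents) and the use of the e1071 package are assumptions; check the data description for the exact names.

    # Minimal sketch for Task 2; column names and the choice of naive
    # Bayes are assumptions, not the prescribed solution
    library(e1071)                                 # provides naiveBayes()

    nursery <- read.csv("nursery.csv", stringsAsFactors = TRUE)

    # Q1: random 80/20 train/test split
    set.seed(1)                                    # for reproducibility
    idx   <- sample(nrow(nursery), size = round(0.8 * nrow(nursery)))
    train <- nursery[idx, ]
    test  <- nursery[-idx, ]

    # Q2: chi-squared tests of association between each attribute and
    # the response (one suitable test covered in MA5820)
    chisq.test(table(train$finance, train$outcome))
    chisq.test(table(train$social,  train$outcome))
    chisq.test(table(train$parents, train$outcome))

    # Q4: fit naive Bayes with all eight features and inspect the
    # class-conditional probability tables for two of the attributes
    fit <- naiveBayes(outcome ~ ., data = train)
    fit$tables$finance
    fit$tables$parents

    # Q5: test-set accuracy versus a no-feature (majority-class) baseline
    pred <- predict(fit, newdata = test)
    mean(pred == test$outcome)                     # model accuracy
    max(table(train$outcome)) / nrow(train)        # baseline accuracy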
Assessment Task 3: Implementation of classifiers (20 marks)

In this task, compare the performance of the supervised learning algorithms Linear Discriminant Analysis and the Naive Bayes classifier using the Breast Cancer Wisconsin (Diagnostic) Data Set (wdbc.data). Thirty features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass; they describe characteristics of the cell nuclei present in the image.

1. Implementation. Parts of this question require you to revise Section 4 of ISLR, as well as the notion of the covariance matrix of a multivariate feature vector.
a. For each tumour class (M and B), compute the generalized variance (g.v.) of the feature vectors (consisting of 30 features). Heuristically (no statistical test) compare the two g.v.s and comment on which type of discriminant analysis is more appropriate for these data. (4 marks)
Hint: The generalized variance of a multivariate feature vector is the determinant of its covariance matrix. Generalized variance is the multivariate equivalent of variance.
b. Use a random 90% sub-sample of the data as your training sample. (1 mark)
c. Implement your recommended DA and the Naive Bayes classifiers on the training sample to classify tissue samples into the classes M or B. Show annotated R code and model summaries for both algorithms. (6 marks)
2. For each algorithm (NB and DA), show the true positive, false positive and accuracy rates for the training and test samples. (6 marks)
3. Based on Question 2, recommend the more appropriate algorithm for these data. (3 marks)

Attribute information:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius.
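For illustration, the computations in this task might be organised as below. The sketch assumes wdbc.data has no header row, with column 1 the ID, column 2 the diagnosis (M/B) and columns 3 to 32 the features; lda() is shown only as a placeholder, and qda() should be substituted if the comparison in Question 1a points to unequal class covariance matrices.

    # Minimal sketch for Task 3; the column layout of wdbc.data is an
    # assumption (no header; ID, diagnosis, then 30 features)
    library(MASS)    # lda() / qda()
    library(e1071)   # naiveBayes()

    wdbc      <- read.csv("wdbc.data", header = FALSE)
    diagnosis <- factor(wdbc[, 2])        # M = malignant, B = benign
    X         <- as.matrix(wdbc[, 3:32])  # the 30 features

    # Q1a: generalized variance = determinant of the class covariance
    # matrix; very different values argue against a common covariance
    c(M = det(cov(X[diagnosis == "M", ])),
      B = det(cov(X[diagnosis == "B", ])))

    # Q1b: random 90% training sub-sample
    set.seed(1)
    idx <- sample(nrow(wdbc), size = round(0.9 * nrow(wdbc)))

    # Q1c: fit the discriminant analysis (LDA as a placeholder; use qda()
    # instead if Q1a suggests unequal covariances) and naive Bayes
    da_fit <- lda(x = X[idx, ], grouping = diagnosis[idx])
    nb_fit <- naiveBayes(x = X[idx, ], y = diagnosis[idx])

    # Q2: true positive, false positive and accuracy rates, taking the
    # malignant class M as "positive"
    rates <- function(pred, truth) {
      c(TPR = mean(pred[truth == "M"] == "M"),
        FPR = mean(pred[truth == "B"] == "M"),
        acc = mean(pred == truth))
    }
    rates(predict(da_fit, X[-idx, ])$class, diagnosis[-idx])  # DA, test
    rates(predict(nb_fit, X[-idx, ]),       diagnosis[-idx])  # NB, test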
Marking Criteria and Rubric: MA5810 Assessment 1

Criterion: R code (20%)

High Distinction: Code submitted. Code works correctly, meets the specifications, produces the correct results and displays them correctly. Code is exceptionally well organised and very easy to follow. Code is always very well commented, so that the purpose of each block of code, and which question part it corresponds to, is readily understood. Variable names give the purpose of the variable.

Distinction: Code submitted. Code works correctly, meets the specifications and produces correct results, but may not display all of them correctly. Code is clean, understandable and well organised, with just some minor errors. Code is well commented, so that there is very little ambiguity about its purpose. One or two places could benefit from comments, or the code is overly commented. Variable names clearly describe the purpose of the variable.

Credit: Code submitted. Code mostly works correctly, but functions incorrectly on some inputs. Minor details of the specification are violated. Code is fairly easy to read, although it contains at least one major issue that detracts from clarity. The comments leave some code blocks ambiguous as to their purpose. One or two places could benefit from comments, or the code is overly commented. Variable names do not describe the purpose of the variable.

Pass: Code only provided in the answer document, but looks correct. Code often exhibits incorrect behaviour. Significant details of the specification are violated. Code contains more than one major issue that makes it difficult to read. The code is readable only by someone who already knows what it is supposed to be doing. Comments are not sufficient to see what the code is doing. A significant lack of comments makes it difficult to understand the code.

Fail: Code not submitted, or code not provided in the answer document. Code produces incorrect results, does not compile, or significant errors occur. Code is poorly organised and very difficult to read. Code has no comments.

Criterion: Methodology (40%)

High Distinction: The methodology implemented is expertly documented and justified. The methodology implemented reflects a sophisticated and nuanced understanding of relevant concepts. All assumptions are validated and communicated concisely. The methodology