MA5810 Assessment 2

Please use 'MA5810_Assessment 2.pdf' for instructions.


Weighting: 30%. Total marks: 70. Due date: Week 5, Sunday.

This assessment focuses on machine learning techniques covered during Weeks 2-5, with primary focus on the topics of Weeks 3, 4, and 5. Wherever required you must show evidence of your work using R code and output, as part of your R script or R Markdown submission.

The purpose of the assignment is for you to:
• Demonstrate sound knowledge of the basic theory, principles and concepts that underpin data mining and exemplify the most common tasks and types of data mining problems.
• Apply classic supervised and/or unsupervised data mining methods to analyse and evaluate descriptive analytics tasks.

Submission
You will need to submit the following:
• A PDF file that clearly shows the assignment question number, the associated answers, analyses and discussions. The assignment must be presented in 12pt font on A4 pages using single line spacing and 2.5cm margins.
• An R script or R Markdown file to reproduce your work. Please attach a separate file or copy the code into an Appendix.
• The assignment should not exceed 9 A4 pages. Appendices do not form part of the page limit.
You have up to three attempts to submit your assessment, and only the last submission will be graded.

A word on plagiarism: Plagiarism is the act of using another's words, works or ideas from any source as one's own. Plagiarism has no place in a University. Student work containing plagiarised material will be subject to formal university processes.

Question 1 - Total Marks 40
Consider the Breast Cancer Wisconsin (Diagnostic) Data Set (wdbc.data). Thirty features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A quick recall of the attributes:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

Assignment tasks:
Import the data into your session.
1. Partition the data into 90% training and the remainder as test samples. Fit a logistic regression model for Diagnosis against all numeric features to the training sample. Marks 6
2. Discuss any difficulties in the model fit and interpretation of the coefficients. From the summary of the fitted model, interpret the relationship between Diagnosis and the features Texture and Concavity. Marks 4
3. Return to the unpartitioned data. Use descriptive methods to investigate the correlation between the 30 numeric features on the BC data. Show relevant output. Marks 4
4. Suggest and implement an unsupervised learning method to derive secondary features that address inter-feature correlation. Show R code. Marks 6
5. Select a subset (filter(.)) of secondary features obtained in 4). Marks 12
a. Justify your approach using result(s) obtained in 4).
b. Partition the data containing secondary features into training (90%) versus test samples. Use the data obtained in 5a) to fit a logistic regression model with Diagnosis as response on this new training sample.
c. Use the same features to fit a quadratic discriminant analysis to Diagnosis.
6. Implement both models on the test data along with the logistic regression model with all features (as in Q1). Marks 8
Provide accuracy measures for each case and discuss your findings.

Question 2 - Total Marks 30
Clustering is a common exploratory technique used in bioinformatics where researchers aim to identify subgroups within diseases using gene expression. Imagine you are asked to analyse the gene expression dataset available in the leukemia_dat.Rdata file. This data was originally generated by [Golub et al., Science, 1999] https://science.sciencemag.org/content/sci/286/5439/531.full.pdf and contains the expression level of 1867 selected genes from 72 patients with different types of leukemia. The data in each column are summarized as follows:
• Column 1: patient id = a unique identifier for each patient (observation)
• Column 2: type = a factor variable with two subtypes of leukemia: acute lymphoblastic leukemia (ALL, n = 47) and acute myeloblastic leukemia (AML, n = 25).
• Columns 3 to 1869: gene expression data for 1867 genes, Gene 1, ..., Gene 1867.

Assignment tasks:
The researchers hypothesized that patient samples will cluster by subtype of leukemia based on gene expression. Your task is to use a clustering technique to address this scientific hypothesis and report your results back to the researcher.
(a) Select a clustering technique to apply. Justify your choice. Marks 5
(b) Implement your chosen clustering technique in R. Describe your implementation. You need to provide details of all steps relating to the implementation of the clustering algorithms, such as data preparation including any transformations performed on the data prior to clustering, training the model, and evaluating the performance of the model. Marks 25

Rubric template
Criteria levels: HD, P, Fail

Rmarkdown/R (10%)
HD: Codes are reproducible. Demonstrate superior ability to write code in Rmarkdown/R efficiently and produce accurate results. Code is well organised and very easy to follow. Code is well commented so the purpose of each block of code is readily understood, along with the question part it corresponds to. Variable names give the purpose of the variable.
P: Codes are reproducible. Demonstrate limited ability to use R/Rmarkdown. Some of the results produced by the code are accurate. The code is readable only by someone who already knows what it is supposed to be doing. Comments not sufficient to see what the code is doing. Significant lack of comments makes it difficult to understand code.
Fail: A lack of compliance with the factors described in adjacent columns.

Question 1 (30%)
HD: Demonstrate superior understanding and implementing the logistic regression to classify breast cancer type. Provide full detail of the implementation. The results and discussion are explained correctly, clearly, and in sufficient detail.
P: Demonstrate some understanding and implementing the logistic regression to classify breast cancer type. Provide some steps of the implementation in detail. The results and discussion are explained clearly and in sufficient detail most of the time. There are some misunderstandings in interpreting results.
Fail: A lack of compliance with the factors described in adjacent columns.

Question 2 (30%)
HD: Demonstrate superior understanding of complete and single linkage clustering. Provide all steps to obtain the dendrograms. Writing is authentic, easy to understand with excellent level of detail.
P: Demonstrate limited understanding of complete and single linkage clustering. Lack of explanations on some steps obtaining the dendrograms. Writing is authentic, easy to understand with some level of detail.
Fail: A lack of compliance with the factors described in adjacent columns.

Question 3 (30%)
HD: Demonstrate superior understanding and implementing clustering algorithms. Provide full detail of the implementation. The results and discussion are explained correctly, clearly, and in sufficient detail.
P: Demonstrate some understanding and implementing clustering algorithms. Provide some steps of the implementation in detail. The results and discussion are explained correctly, clearly and in sufficient detail most of the time. There are some misunderstandings in interpreting results.
Fail: A lack of compliance with the factors described in adjacent columns.

Annotations on the rubric (Glenn Fulford): "40%"; "incorporated into Q1 and Q2 marks"; "Question 2".
Answered 4 days after Nov 21, 2022


Amar Kumar answered on Nov 24 2022
Q1.
1.
Calculate the accuracy
This function determines how accurate our algorithm is.
Code 1: The function used to determine accuracy.
We now put logistic regression with two variables into practice.
In the preceding sections, all of the essential functions needed to carry out logistic regression were built. Let us quickly go over each one:
To predict the outcome (malignancy) from two of the 20 non-redundant features in our dataset, we will now write the code that wraps these functions. Because they have a correlation value of 0.32, we can select Radius and Texture as one of the feature pairs from the Step 3 discovery procedure.
The following code uses the DataFrame df to create the feature NumPy array X and the output NumPy vector Y:
    Code 2: Create the NumPy arrays for X and Y.
Plotting the two features
    Code 3: Plot the features.
Figure 1 shows the resulting plot:
            Fig 1. Plot of radius and texture
The yellow and dark markers distinguish the malignant and benign cells.
Next, scale and normalise the data.
We must also keep the mean of the X values in our training set, mu, and their standard deviation, sigma.
Create a new cell in your notebook and write the following:
Code 4: Implement feature scaling and normalisation.
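The original Code 4 is not reproduced on this page. As a rough sketch of what such a step could look like (not the author's code), the function below standardises each column and returns mu and sigma so they can be reused at prediction time. The name feature_scale and the sample rows are hypothetical.

```python
import numpy as np

def feature_scale(X):
    """Standardise each column of X and return the scaled array, mu and sigma."""
    mu = X.mean(axis=0)       # per-feature mean
    sigma = X.std(axis=0)     # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma

# Hypothetical (radius, texture) rows, for illustration only
X_raw = np.array([[17.99, 10.38],
                  [20.57, 17.77],
                  [13.54, 14.36],
                  [9.50, 12.44]])

X_scaled, mu, sigma = feature_scale(X_raw)
print("mu:", mu)
print("sigma:", sigma)
```

After scaling, every column has mean 0 and standard deviation 1, which keeps the two features on a comparable scale for the optimiser.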
We must now add a column of ones to the array X using a stack function:
    Code 5: Add a column of ones to the X matrix.
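Which stack function the original Code 5 used is not recoverable from the text. One possible way to do it, shown here as an assumption, is np.hstack; the small X_scaled array is hypothetical.

```python
import numpy as np

# Hypothetical scaled features (radius, texture), for illustration only
X_scaled = np.array([[0.5, -1.2],
                     [-0.3, 0.8],
                     [1.1, 0.1]])

# Prepend a column of ones so that theta[0] acts as the intercept term
ones = np.ones((X_scaled.shape[0], 1))
X_design = np.hstack([ones, X_scaled])
print(X_design)
```

With the ones column in place, the product X_design @ theta includes the intercept without any special handling.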
Testing
Let's run a few tests: we calculate the cost function and the gradient to check our code. With θ = [0, 0, 0]:
Code 6: Calculate the cost function and gradient for the first test, with initial values of zero.
The value of J(θ) is 0.69, and the gradient is [0.12741652, -0.35265304, -0.20056252].
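The cost of 0.69 at zero is what theory predicts: with all parameters at zero every predicted probability is 0.5, so the cross-entropy cost equals ln 2 ≈ 0.693 regardless of the data. A minimal sketch of such a check follows; the function names and the four-row dataset are hypothetical and are not the author's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y):
    """Cross-entropy cost J(theta) and its gradient for logistic regression."""
    m = y.size
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)  # avoid log(0)
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return cost, grad

# Hypothetical design matrix (ones column + two scaled features) and labels
X = np.array([[1.0, 0.5, -1.2],
              [1.0, -0.3, 0.8],
              [1.0, 1.1, 0.1],
              [1.0, -1.3, 0.3]])
y = np.array([1.0, 0.0, 1.0, 0.0])

cost0, grad0 = cost_and_gradient(np.zeros(3), X, y)
print(cost0, grad0)
```

At zero, the first gradient component is 0.5 minus the fraction of positive labels, which gives a quick sanity check on any implementation.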
We can also try non-zero values to see what happens:
Code 7: Calculate the cost function and gradient for the second test, with non-zero initial values.
The gradient is now [-0.37258348, -0.35265304, -0.20056252], with a corresponding J(θ) value of 8.48.
Advanced Optimization for Gradient Descent
We will use the BFGS optimisation method, the quasi-Newton technique of Broyden, Fletcher, Goldfarb, and Shanno [5]. The SciPy minimize function, which is used in Code 8, applies the BFGS method internally.
Code 8: Advanced Optimization for Gradient Descent
If we do not specify the method we wish to use in the "method" parameter of the minimize function, the BFGS algorithm is used by default (when no bounds or constraints are given). Another method, TNC, uses a truncated Newton algorithm to minimise a function with bounded variables. With SciPy's minimize function, users can try out the different optimisation algorithms that are available. Read more about the minimize function and the other optimisation techniques in the SciPy documentation. Code 8 results in the following:
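Code 8 itself is missing from this page. A minimal sketch of calling scipy.optimize.minimize with method="BFGS" on a logistic regression cost might look like the following; the cost and gradient helpers and the eight-row dataset are hypothetical, and the data are deliberately not linearly separable so that a finite optimum exists.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / y.size

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / y.size

# Hypothetical scaled features with overlapping classes
features = np.array([[-1.0, -0.5], [-0.5, 0.3], [0.2, -0.8], [0.6, 0.1],
                     [0.4, 0.6], [-0.3, 0.2], [0.8, -0.2], [1.2, 0.9]])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
X = np.hstack([np.ones((features.shape[0], 1)), features])

# BFGS with an analytic gradient, starting from theta = 0
result = minimize(cost, x0=np.zeros(3), args=(X, y), jac=gradient, method="BFGS")
print("theta:", result.x)
print("final cost:", result.fun)
```

Passing the analytic gradient through jac avoids finite-difference approximations; the fitted parameters are read from result.x.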
Decision Boundary
Using the BFGS algorithm, the optimal θ was found in the Result.x attribute of the SciPy minimize result as θ = [-0.70755981, 3.72528774, 0.93824469]. In Step 3, we stated that the hypothesis h(x) for logistic regression gives the probability that the outcome is 1. To discretise this probability into the classes "Benign/Malignant", we select a threshold of 0.5: values above it are classified as "1" and values below it as "0". Consequently, we need to find the decision boundary defined earlier. A decision boundary is a property not of the dataset but of the hypothesis and its parameters. We again plot the Radius and Texture features, this time with a red line indicating the decision boundary for the θ we found:
Code 9: Plot the data and the decision boundary.
    Fig 2. Radius and texture plotted together with the decision boundary.
Although the logistic regression hypothesis uses a non-linear (sigmoid) function, it is important to remember that the decision boundary is linear.
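The linearity follows from the threshold: h(x) = 0.5 exactly when θ0 + θ1·x1 + θ2·x2 = 0, which is a straight line in the scaled feature space. The sketch below solves that equation for the texture coordinate using the θ reported above; the helper names are hypothetical and the plotting step is omitted.

```python
import math

# Parameters reported above for (intercept, radius, texture)
theta = [-0.70755981, 3.72528774, 0.93824469]

def boundary_texture(radius_scaled):
    """Scaled texture value on the decision boundary for a given scaled radius."""
    t0, t1, t2 = theta
    return -(t0 + t1 * radius_scaled) / t2

def classify(radius_scaled, texture_scaled, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    z = theta[0] + theta[1] * radius_scaled + theta[2] * texture_scaled
    prob = 1.0 / (1.0 + math.exp(-z))
    return 1 if prob >= threshold else 0

for r in (-1.0, 0.0, 1.0):
    print(r, boundary_texture(r))
```

Two points from boundary_texture are enough to draw the line on the radius and texture scatter plot, provided the plot uses the scaled features.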
Calculate the accuracy
We now want to determine how accurate our algorithm is. This is done with the CalcAccuracy function mentioned earlier:
Code 10: Calculate the accuracy.
CalcAccuracy returns 89.1, which is a good accuracy.
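The body of CalcAccuracy is not shown on this page. One plausible shape for such a function, given purely as an assumption, thresholds the predicted probabilities at 0.5 and reports the percentage of matching labels. The name calc_accuracy and the toy data are hypothetical.

```python
import numpy as np

def calc_accuracy(theta, X, y, threshold=0.5):
    """Percentage of rows whose thresholded prediction matches the label."""
    probs = 1.0 / (1.0 + np.exp(-(X @ theta)))
    preds = (probs >= threshold).astype(int)
    return 100.0 * np.mean(preds == y)

# Hypothetical design matrix, labels and parameters, for illustration only
X = np.array([[1.0, 1.0, 0.0],
              [1.0, -1.0, 0.0],
              [1.0, 0.5, 0.5],
              [1.0, -0.5, -0.5]])
y = np.array([1, 0, 1, 1])
theta = np.array([0.0, 1.0, 1.0])

accuracy = calc_accuracy(theta, X, y)
print(accuracy)
```

Accuracy measured on the same rows used for fitting tends to be optimistic, so a held-out test set gives a fairer figure.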
Make a prediction
Now that we have tested our algorithm and assessed its accuracy, we want to make predictions. A query may look like this: what is the outcome for radius = 18.00 and texture = 10.12? The code below illustrates this.
Code: Calculate the probability of cancer for a radius of 18.00 and a texture of 10.12.
Keep in mind that the query must be scaled and normalised using mu and sigma. With a radius of 18.00 and a texture of 10.12, the predicted probability is 0.79, which is above the 0.5 threshold, so the case is classified as malignant.
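The prediction step can be sketched as follows. The θ values are those reported above, but the mu and sigma used by the author are not shown, so the values here are assumed placeholders (roughly the mean and standard deviation of mean radius and mean texture over the full WDBC data); the result will therefore only approximate the figure quoted in the text.

```python
import math

theta = [-0.70755981, 3.72528774, 0.93824469]  # reported above

# ASSUMED scaling constants (not given in the source): approximate
# mean and standard deviation of (radius, texture) in the WDBC data
mu = [14.127, 19.290]
sigma = [3.524, 4.301]

def predict_probability(radius, texture):
    """Scale the query with mu and sigma, then apply the logistic hypothesis."""
    r = (radius - mu[0]) / sigma[0]
    t = (texture - mu[1]) / sigma[1]
    z = theta[0] + theta[1] * r + theta[2] * t
    return 1.0 / (1.0 + math.exp(-z))

prob = predict_probability(18.00, 10.12)
print(round(prob, 2), "malignant" if prob >= 0.5 else "benign")
```

The key point is that the query goes through the same scaling as the training data; feeding raw values into the hypothesis would give a meaningless probability.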
2.
Interpreting regression p-values
Regression analysis is a form of inferential statistics. Regression p-values help determine whether the relationships observed in your sample also hold in the wider population. The p-value for each independent variable in a linear regression tests the null hypothesis that the variable has no correlation with the dependent variable. If there is no correlation, changes in the independent variable are not associated with changes in the dependent variable. In other words, there is insufficient evidence to conclude that there is an effect at the population level.
If the p-value for a variable is below your significance level, the sample data provide enough evidence to reject the null hypothesis for the entire population. Your data favour the hypothesis that there is a non-zero correlation. Changes in the independent variable are associated with changes in the dependent variable at the population level. This variable is statistically significant and is probably worth including in your regression model.
On the other hand, a p-value greater than the significance level indicates that there is insufficient evidence in your sample to conclude that a non-zero correlation exists.
In the regression output example, the South and North predictor variables are statistically significant because their p-values equal 0.000. However, the p-value for East (0.092) is greater than the common significance level of 0.05, so it is not statistically significant.
The coefficient p-values are typically used to decide which terms to keep in the final model. Given the results above, we would consider removing East. Keeping variables that are not statistically significant can reduce the model's precision.
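The decision rule described here is a comparison of each p-value with the chosen significance level. A small illustration using the three p-values quoted in the text (the dictionary layout and variable names are hypothetical):

```python
# p-values quoted in the text for the example regression output
p_values = {"South": 0.000, "North": 0.000, "East": 0.092}
alpha = 0.05  # common significance level

# A predictor is statistically significant when its p-value is below alpha
significant = {name: p < alpha for name, p in p_values.items()}
candidates_to_drop = [name for name, keep in significant.items() if not keep]

print(significant)
print("Consider removing:", candidates_to_drop)
```

A p-value above alpha flags a variable as a candidate for removal rather than proving it has no effect, so subject knowledge should also inform the final choice.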
3.
Bivariate Analysis
When choosing...