SIT720 Machine Learning Assessment Task 2: Individual Problem Solving task

Question

All attached in file.

SIT720 Machine Learning Assessment Task 2: Individual Problem Solving task
© Deakin University, FutureLearn

SIT720 Machine Learning
Assessment 2: Problem solving task

This document supplies detailed information on assessment tasks for this unit.

Key information
• Due: Wednesday 4 September 2019 by 11.30pm AEST
• Weighting: 25%
• Word count: Max 30 pages

Learning Outcomes
This assessment assesses the following Unit Learning Outcomes (ULO) and related Graduate Learning Outcomes (GLO):
• ULO 2: Perform linear regression, and linear classification for two and more classes using logistic regression model. Related GLOs: GLO 1: Discipline knowledge and capabilities; GLO 4: Critical thinking; GLO 5: Problem solving.
• ULO 5: Perform model assessment and selection for linear and logistic regression models. Related GLOs: GLO 1: Discipline knowledge and capabilities; GLO 4: Critical thinking.

Purpose
Demonstrate your skills in applying regularized logistic regression to perform two-class and multi-class classification for real-world tasks. You also need to demonstrate your skill in recognizing under-fitting/overfitting situations.

Instructions
This is an individual assessment task of maximum 20 pages including all relevant material, graphs, images and tables. Students will be required to provide responses for a series of problem situations related to their analysis techniques. They are also required to provide evidence through articulation of the scenario, application of programming skills and analysis techniques, and to provide a rationale for their response.

Part-1: Binary Classification
For this problem, we will use a subset of the Wisconsin Breast Cancer dataset. Note that this dataset has some information missing.

1.1 Data Munging (3 Marks)
Cleaning the data is essential when dealing with real world problems. Training and testing data is stored in "data/wisconsin_data" folder.
You have to perform the following:
• Read the training and testing data. Print the number of features in the dataset. (0.5 marks)
• For the data label, print the total number of B's and M's in the training and testing data. Comment on the class distribution. Is it balanced or unbalanced? (0.5 marks)
• Print the number of features with missing entries (feature value is zero). (0.5 marks)
• Fill the missing entries. For filling any feature, you can use either mean or median value of the feature values from observed entries. Explain the reason behind your choice. (1.0 marks)
• Normalize the training and testing data. (0.5 marks)

https://cloudstor.aarnet.edu.au/plus/s/hTypqSrQ1vBbd6R

1.2 Logistic Regression (5 Marks)
Train logistic regression models with L1 regularization and L2 regularization using alpha = 0.1 and lambda = 0.1. Report accuracy, precision, recall, f1-score and print the confusion matrix.

1.3 Choosing the best hyper-parameter (7 Marks)
A- For L1 model, choose the best alpha value from the following set: {0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333} based on parameter P. (2 Marks)
B- For L2 model, choose the best lambda value from the following set: {0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 33} based on parameter P. (2 Marks)
[Hints: To choose the best hyperparameter (alpha/lambda) value, you have to do the following:
• For each value of hyperparameter, perform 10 random splits of training data into training (70%) and validation (30%) set.
• Use these 10 sets of data to find the average validation performance P.
• The best hyperparameter will be the one that gives maximum validation performance.
• Performance is defined as: P='accuracy' if fID=0, P='f1-score' if fID=1, P='precision' if fID=2. Calculate fID using modulus operation fID=SID % 3, where SID is your student ID.
For example, if your student ID is 356288 then fID=(356288 % 3)=2, so use 'precision' for selecting the best value of alpha/lambda.]
C- Use the best alpha and lambda parameter to re-train your final L1 and L2 regularized model. Evaluate the prediction performance on the test data and report the following:
• Precision and Accuracy (1 Mark)
• The top 5 features selected in decreasing order of feature weights. (1 Mark)
• Confusion matrix (1 Mark)

Part-2 (Multiclass Classification):
For this experiment, we will use a small subset of MNIST dataset (http://yann.lecun.com/exdb/mnist/) for handwritten digits. This dataset has no missing data. You will have to implement one-versus-rest scheme to perform multi-class classification using a binary classifier based on L1 regularized logistic regression.

2.1 Read and understand the data, create a default One-vs-Rest Classifier (3 Marks)
1- Use the data from the file reduced_mnist.csv in the data directory. Begin by reading the data. Print the following information: (1 Mark)
• Number of data points
• Total number of features
• Unique labels in the data
2- Split the data into 70% training data and 30% test data. Fit a One-vs-Rest Classifier (which uses Logistic regression classifier with alpha=1) on training data, and report accuracy, precision, recall on testing data. (2 Marks)

2.2 Choosing the best hyper-parameter (7 Marks)
1- Choose the best value of alpha from the set a={0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333} by observing average training and validation performance P. On a graph, plot both the average training performance (in red) and average validation performance (in blue) w.r.t. each hyperparameter value. Comment on this graph by identifying regions of overfitting and underfitting. Print the best value of alpha hyperparameter.
(2+1+1=5 Marks)
[Hints: To choose the best hyperparameter alpha value, you have to do the following:
• For each value of hyperparameter, perform 10 random splits of training data into training (70%) and validation (30%) set.
• Use these 10 sets of data to find the average training and validation performance P.
• The best hyperparameter shall be selected from the plot that shows both average training and validation performance against alpha value. While selecting the best alpha value you should consider overfitting and underfitting concepts.
• Performance is defined as: P='accuracy' if fID=0, P='f1-score' if fID=1, P='precision' if fID=2. Calculate fID using modulus operation fID=SID % 3, where SID is your student ID. For example, if your student ID is 356288 then fID=(356288 % 3)=2, so use 'precision' for selecting the best value of alpha.]
2- Use the best alpha and all training data to build the final model and then evaluate the prediction performance on test data and report the following: (1 Mark)
• The confusion matrix
• Precision, recall and accuracy for each class.
3- Discuss if there is any sign of underfitting or overfitting with appropriate reasoning. (1 Mark)

References that may be helpful:
• Finding missing values: https://chartio.com/resources/tutorials/how-to-check-if-any-value-is-nan-in-a-pandas-dataframe
• Titanic Problem: http://nbviewer.jupyter.org/github/agconti/kaggle-titanic/blob/master/Titanic.ipynb
• Numpy: Sorting and Searching: http://docs.scipy.org/doc/numpy/reference/routines.sort.html
• Multiclass Classification: http://scikit-learn.org/stable/modules/multiclass.html

Submission details
• Deakin University has a strict standard on plagiarism as a part of Academic Integrity. To avoid any issues with plagiarism, students are strongly encouraged to run the similarity check with the Turnitin system, which is available through Unistart. A Similarity score MUST NOT exceed 39% in any case.
• Late submission penalty is 5% per 24 hours from 11.30pm, 4th of September.
• No marking on any submission after 5 days (24 hours X 5 days from 11.30pm 4th of September)
• Be sure to downsize the photos in your report before your submission in order to have your file uploaded in time.

Extension requests
Requests for extensions should be made to Unit/Campus Chairs well in advance of the assessment due date. If you wish to seek an extension for an assignment, you will need to apply by email directly to Chandan Karmakar (karmakar@deakin.edu.au), as soon as you become aware that you will have difficulty in meeting the scheduled deadline, but at least 3 days before the due date. When you make your request, you must include appropriate documentation (medical certificate, death notice) and a copy of your draft assignment.

Conditions under which an extension will normally be approved include:

Medical: To cover medical conditions of a serious nature, e.g. hospitalisation, serious injury or chronic illness. Note: Temporary minor ailments such as headaches, colds and minor gastric upsets are not serious medical conditions and are unlikely to be accepted. However, serious cases of these may be considered.

Compassionate: e.g. death of close family member, significant family and relationship problems.

Hardship/Trauma: e.g. sudden loss or gain of employment, severe disruption to domestic arrangements, victim of crime. Note: Misreading the timetable, exam anxiety or returning home will not be accepted as grounds for consideration.

Special consideration
You may be eligible for special consideration if circumstances beyond your control prevent you from undertaking or completing an assessment task at the scheduled time.
See the following link for advice on the application process: http://www.deakin.edu.au/students/studying/assessment-and-results/special-consideration

Assessment feedback
The results with comments will be released within 15 business days from the due date.

Referencing
You must correctly use the Harvard method in this assessment. See the Deakin referencing guide.

Academic integrity, plagiarism and collusion
Plagiarism and collusion constitute extremely serious breaches of academic integrity. They are forms of cheating, and severe penalties are associated with them, including cancellation of marks for a specific assignment, for a specific unit or even exclusion from the course. If you are ever in doubt about how to properly use and cite a source of information refer to the referencing site above.

Plagiarism occurs when a student passes off as the student’s own work, or copies without acknowledgement as to its authorship, the work of any other person or resubmits their own work from a previous assessment task.

Collusion occurs when a student obtains the agreement of another person for a fraudulent purpose,

Kshitij · Accepted Answer

SIT720 Machine Learning
Binary Classification 
* Read the training and testing data. Print the number of features in the dataset. 
* For the data label, print the total number of B's and M's in the training and testing data. Comment on the class distribution. Is it balanced or unbalanced? 
* Print the number of features with missing entries. 
* Fill the missing entries. For filling any feature, you can use either mean or median value of the feature values from observed entries. 
* Normalize the training and testing data. 
* Train logistic regression model with L1 regularization using alpha = 0.1.
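
The steps listed above can be sketched roughly as follows. This is only an illustrative outline, not the downloadable answer: the real files under "data/wisconsin_data" are not shown here, so a hypothetical `make_synthetic` helper stands in for them, and all function names are made up for this sketch. It also assumes that zeros mark missing entries (as the task states), that the median is the chosen fill value (the task allows mean or median), that "normalize" means standardization with scikit-learn's `StandardScaler`, and that alpha is a regularization strength mapped to scikit-learn's inverse strength as `C = 1 / alpha`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler


def class_counts(y):
    """Return {label: count}, e.g. how many 'B' and 'M' labels there are."""
    labels, counts = np.unique(y, return_counts=True)
    return {str(label): int(count) for label, count in zip(labels, counts)}


def count_features_with_missing(X):
    """Number of feature columns that contain at least one zero (missing) entry."""
    return int((X == 0).any(axis=0).sum())


def fill_missing_with_median(X_train, X_test):
    """Replace zeros with the per-feature median of the observed training values."""
    X_train = X_train.astype(float).copy()
    X_test = X_test.astype(float).copy()
    for j in range(X_train.shape[1]):
        observed = X_train[X_train[:, j] != 0, j]
        fill = np.median(observed) if observed.size else 0.0
        X_train[X_train[:, j] == 0, j] = fill
        X_test[X_test[:, j] == 0, j] = fill  # reuse training statistics on test data
    return X_train, X_test


def normalize(X_train, X_test):
    """Standardize both sets using the mean and std of the training set only."""
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test)


def train_l1_logreg(X_train, y_train, alpha=0.1):
    """Fit L1-regularized logistic regression; C = 1 / alpha is an assumption."""
    model = LogisticRegression(penalty="l1", C=1.0 / alpha, solver="liblinear")
    return model.fit(X_train, y_train)


def make_synthetic(n, rng):
    """Hypothetical stand-in for the real data: 5 features, labels 'B' and 'M'."""
    y = rng.choice(["B", "M"], size=n, p=[0.6, 0.4])
    X = rng.normal(loc=10.0, scale=1.0, size=(n, 5))
    X[y == "M", 0] += 4.0              # feature 0 separates the two classes
    X[rng.random(n) < 0.1, 1] = 0.0    # zeros play the role of missing entries
    return X, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train, y_train = make_synthetic(200, rng)
    X_test, y_test = make_synthetic(80, rng)

    print("number of features:", X_train.shape[1])
    print("train labels:", class_counts(y_train))
    print("test labels:", class_counts(y_test))
    print("features with missing entries:", count_features_with_missing(X_train))

    X_train, X_test = fill_missing_with_median(X_train, X_test)
    X_train, X_test = normalize(X_train, X_test)

    model = train_l1_logreg(X_train, y_train, alpha=0.1)
    y_pred = model.predict(X_test)
    print("accuracy:", accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred, labels=["B", "M"]))
```

The median is used here because it is less sensitive to outliers than the mean; either choice is permitted by the task as long as it is justified. Fill values and scaling statistics are computed on the training set and reused on the test set, so no test information leaks into preprocessing.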
