Please find the assignment attached.


SIT720 Machine Learning
Assessment Task 5: Machine Learning Project
©Deakin University

This document supplies detailed information on Assessment Task 5 for this unit.

Key information
• Due: Sunday 10 October 2021 by 8.00 pm (AEST)
• Weighting: 35%

Learning Outcomes
This assessment assesses the following Unit Learning Outcomes (ULO) and related Graduate Learning Outcomes (GLO):

Unit Learning Outcomes (ULO)
ULO6 - Perform model selection and compute relevant evaluation measures for a given problem.
ULO7 - Use concepts of machine learning algorithms to design solutions and compare multiple solutions.

Graduate Learning Outcomes (GLO)
GLO1 - Discipline-specific knowledge and capabilities
GLO2 - Communication
GLO4 - Critical thinking
GLO5 - Problem solving
GLO6 - Self-management

Purpose
This assessment is an extensive machine learning project. The task is open in nature: students must make all design decisions to solve a problem and justify those decisions. In addition, they have to design and develop solutions that are better than existing ones.

Assessment 5: total marks = 35

Submission Instructions
a) Submit your solution code in a notebook file with the ".ipynb" extension. Write discussions and explanations, including outputs and figures, in a separate file and submit it as a PDF.
b) Submissions in other file formats will not be assessed and will receive zero for the entire submission.
c) Insert your Python code responses into the cells of your submitted ".ipynb" file, each preceded by the question (i.e., copy the question into a cell added before the solution cell). If you need multiple cells for better presentation of the code, add the question only before the first solution cell.
d) Your submitted code must be executable. If your code does not generate the submitted solution, you will get zero for that part of the marks.
e) Answers must be relevant and precise.
f) No hard coding is allowed. Avoid using specific values that can be calculated from the data provided.
g) Use all the topics covered in the unit when answering this assignment.
h) Submit your assignment after running each cell individually.
i) The submitted notebook file name should be of the form "SIT720_A5_studentID.ipynb". For example, if your student ID is 1234, the submitted file name should be "SIT720_A5_1234.ipynb".

________________________________________________________________________________

Questions

Background
In this project you are given a dataset and an article that uses this dataset. The authors have developed ten ML models for predicting survival of patients with heart failure and compared their performance. You must read the article to understand the problem, the dataset, and the methodology in order to complete the following tasks.

Dataset
The dataset contains the medical records of patients who had heart failure, collected during their follow-up period. Each patient profile has 13 clinical features. A detailed description of the dataset can be found in the Dataset section of the provided article (patient_survival_prediction.pdf).

Tasks

1. Read the article and reproduce the results presented in Table 4 using Python modules and packages (including your own scripts or customised code). Write a report summarising the dataset, the ML methods used, the experiment protocol and the results, including any variations. (10 marks)

While reproducing the results:
i) use the same set of features used by the authors;
ii) use the same classifiers with exactly the same parameter values;
iii) use the same training/test splitting approach as used by the authors;
iv) use the same pre-/post-processing, if any, used by the authors;
v) report the same performance metrics as shown in Table 4.

N.B.
(i) Some of the ML methods are not covered in the current unit. Consider them HD tasks; based on the knowledge gained in the unit you should be able to find the necessary packages and modules to reproduce the results.
(ii) If you find any issue in reproducing the results, or subtle variations arise due to implementation differences of packages and modules in Python, an appropriate explanation of them will be considered during evaluation of your submission.
(iii) Similarly, variation in results due to the randomness of data splitting will also be considered during evaluation, based on your explanation.
(iv) The marks obtained will be proportional to the number of ML methods for which you report correctly reproduced results.
(v) Make sure your Python code segments generate the reported results; otherwise you will receive zero marks for this task.

Marking criteria:
i) Unsatisfactory (x < 5): tried to implement the methods but unable to follow the approach presented in the article. Variation of marks in this group will depend on the quality of the report.
ii) Fair (5 <= x < 6): appropriately implemented 50% of the methods presented in the article. Variation of marks in this group will depend on the quality of the report.
iii) Good (6 <= x < 8): appropriately implemented 70% of the methods presented in the article. Variation of marks in this group will depend on the quality of the report.
iv) Excellent (x >= 8): appropriately implemented >= 90% of the methods presented in the article. Variation of marks in this group will depend on the quality of the report.

2. Design and develop your own ML solution for this problem. The proposed solution should be different from all approaches mentioned in the provided article. This does not mean that you must choose a new ML algorithm: you can develop a novel solution by changing the feature selection approach, the parameter optimisation process of the ML methods used, using different ML methods, or different combinations of them. In other words, the proposed system should be substantially different from the methods presented in the article, but not limited to only a change of ML methods. Compare the results with the methods reported in the article. Write a technical report summarising your solution design and outcomes. (20 marks)

The report should include:
i) the motivation behind the proposed solution;
ii) how the proposed solution differs from existing ones;
iii) a detailed description of the model, including all parameters, so that any reader can implement your model;
iv) a description of the experimental protocol;
v) the evaluation metrics;
vi) results presented using tables and graphs;
vii) comparison and discussion of results with respect to the existing literature;
viii) appropriate references (IEEE numbered).

N.B. This is a HD (High Distinction) level question. Students who target a HD grade should answer this question (in addition to all the questions above); for others it is optional. This question aims to demonstrate your expertise in the subject area and your ability to do your own research in the related area.

Marking criteria:
(i) Unsatisfactory (x < 10): an appropriate solution is presented whose performance is lower than the performances reported in the article (Table 11). The variation in marking in this group will depend on the quality of the report.
(ii) Fair (10 <= x < 14): an appropriate solution is presented whose performance is at least equal to the lowest performance reported in the article (Table 11). The variation in marking in this group will depend on the quality of the report.
(iii) Good (x >= 14): an appropriate solution is presented whose performance is better than the best reported performances in the article.
Pritam Kumar answered on Oct 10 2021
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
ageanaemiacreatinine_phosphokinasediabetesejection_fractionhigh_blood_pressureplateletsserum_creatinineserum_sodiumsexsmokingtimeDEATH_EVENT
075.005820201265000.001.913010 41
155.0078610380263358.031.11361061
265.001460200162000.001.31291171
350.011110200210000.001.91371071
465.011601200327000.002.71160081
\n",
"
"
],
"text/plain": [
" age anaemia creatinine_phosphokinase diabetes ejection_fraction \\\n",
"0 75.0 0 582 0 20 \n",
"1 55.0 0 7861 0 38 \n",
"2 65.0 0 146 0 20 \n",
"3 50.0 1 111 0 20 \n",
"4 65.0 1 160 1 20 \n",
"\n",
" high_blood_pressure platelets serum_creatinine serum_sodium sex \\\n",
"0 1 265000.00 1.9 130 1 \n",
"1 0 263358.03 1.1 136 1 \n",
"2 0 162000.00 1.3 129 1 \n",
"3 0 210000.00 1.9 137 1 \n",
"4 0 327000.00 2.7 116 0 \n",
"\n",
" smoking time DEATH_EVENT \n",
"0 0 4 1 \n",
"1 0 6 1 \n",
"2 1 7 1 \n",
"3 0 7 1 \n",
"4 0 8 1 "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"data = pd.read_csv(\"D:\\\\New\\\\heartfailureclinicalrecordsdataset.csv\")\n",
"data.head()"
]
},
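{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quick sanity check (optional): the article's dataset contains 299 patient records with 13 columns (the clinical features plus the DEATH_EVENT target). The next cell only inspects the shape and class balance and does not affect the analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional sanity check: dataset shape and class balance of the target\n",
"print(\"shape:\", data.shape)\n",
"print(data['DEATH_EVENT'].value_counts())"
]
},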
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"age\n",
"anaemia\n",
"creatinine_phosphokinase\n",
"diabetes\n",
"ejection_fraction\n",
"high_blood_pressure\n",
"platelets\n",
"serum_creatinine\n",
"serum_sodium\n",
"sex\n",
"smoking\n",
"time\n",
"DEATH_EVENT\n"
]
}
],
"source": [
"for col in data.columns:\n",
" print(col)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Task 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Random Forests feature selection through accuracy reduction"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.inspection import permutation_importance\n",
"from matplotlib import pyplot as plt\n",
"\n",
"plt.rcParams.update({'figure.figsize': (12.0, 8.0)})\n",
"plt.rcParams.update({'font.size': 14})"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"#train-test split for random forest feature selection process\n",
"\n",
"X = data.iloc[:, 0:11]\n",
"y = data['DEATH_EVENT']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=12)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.1487753 , 0.01530943, 0.12832902, 0.01560628, 0.15776899,\n",
" 0.01319197, 0.12878872, 0.26402424, 0.09584213, 0.02159089,\n",
" 0.01077302])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rf = RandomForestRegressor(n_estimators=100)\n",
"rf.fit(X_train, y_train)\n",
"\n",
"rf.feature_importances_"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Features sorted by their score:\n",
"[(0.264, 'serum_creatinine'), (0.1578, 'ejection_fraction'), (0.1488, 'age'), (0.1288, 'platelets'), (0.1283, 'creatinine_phosphokinase'), (0.0958, 'serum_sodium'), (0.0216, 'sex'), (0.0156, 'diabetes'), (0.0153, 'anaemia'), (0.0132, 'high_blood_pressure'), (0.0108, 'smoking')]\n"
]
}
],
"source": [
"names = X.columns\n",
"print (\"Features sorted by their score:\")\n",
"print (sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names), \n",
" reverse=True))"
]
},
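{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cross-check (optional): the article ranks features by accuracy reduction; permutation importance, which measures the average drop in the model's score when a feature is shuffled, is closer to that criterion than the impurity-based importances above. The sketch below uses the permutation_importance function already imported; n_repeats and random_state are arbitrary choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# permutation importance of the fitted random forest on the held-out split\n",
"# (approximates the article's accuracy-reduction ranking; settings below are arbitrary)\n",
"perm = permutation_importance(rf, X_test, y_test, n_repeats=30, random_state=12)\n",
"print(\"Features sorted by permutation importance:\")\n",
"print(sorted(zip(perm.importances_mean.round(4), names), reverse=True))"
]
},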
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the results, we select \"serum_creatinine\" and \"ejection_fraction\" as the best two features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Classification tasks with different classifiers"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"#train-test split for classification tasks\n",
"\n",
"X = pd.DataFrame(data['serum_creatinine'])\n",
"X['ejection_fraction'] = data['ejection_fraction']\n",
"y = data['DEATH_EVENT']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=12)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"#random forest classifier\n",
"\n",
"ranfor_clf = RandomForestClassifier(n_estimators=1000, random_state=42)\n",
"ranfor_model = ranfor_clf.fit(X_train,y_train)\n",
"ranfor_model_fit = ranfor_model.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import (\n",
" f1_score,\n",
" accuracy_score,\n",
" matthews_corrcoef,\n",
" roc_auc_score,\n",
" average_precision_score, \n",
" confusion_matrix \n",
")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Random Forest:\n",
"\n",
"MCC is: 0.327\n",
"F1-score is: 0.55\n",
"Accuracy is: 0.7\n",
"TP Rate is: 0.524\n",
"TN Rate is: 0.795\n",
"ROC-AUC Score is: 0.659\n",
"PR-AUC Score is: 0.47\n"
]
}
],
"source": [
"tn, fp, fn, tp = confusion_matrix(y_test, ranfor_model_fit).ravel()\n",
"\n",
"print(\"Random Forest:\\n\")\n",
"print(\"MCC is:\",\"\", matthews_corrcoef(y_test, ranfor_model_fit).round(3))\n",
"print(\"F1-score is:\",\"\", f1_score(y_test, ranfor_model_fit).round(3))\n",
"print(\"Accuracy is:\",\"\", accuracy_score(y_test, ranfor_model_fit).round(3))\n",
"print(\"TP Rate is:\",\"\", (tp/(tp + fn)).round(3))\n",
"print(\"TN Rate is:\",\"\", (tn/(tn + fp)).round(3))\n",
"print(\"ROC-AUC Score is:\",\"\", roc_auc_score(y_test, ranfor_model_fit).round(3))\n",
"print(\"PR-AUC Score is:\",\"\", average_precision_score(y_test, ranfor_model_fit).round(3))"
]
},
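{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: the same seven metrics (MCC, F1-score, accuracy, TP rate, TN rate, ROC AUC, PR AUC) are printed for every classifier below. The optional helper sketched here (the name report_metrics is ours, not from the article) computes the same quantities from test labels and predicted labels and could replace the repeated blocks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional helper: print the seven reported metrics for a set of test-set predictions\n",
"def report_metrics(name, y_true, y_pred):\n",
"    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()\n",
"    print(name + \":\\n\")\n",
"    print(\"MCC is:\", round(matthews_corrcoef(y_true, y_pred), 3))\n",
"    print(\"F1-score is:\", round(f1_score(y_true, y_pred), 3))\n",
"    print(\"Accuracy is:\", round(accuracy_score(y_true, y_pred), 3))\n",
"    print(\"TP Rate is:\", round(tp / (tp + fn), 3))\n",
"    print(\"TN Rate is:\", round(tn / (tn + fp), 3))\n",
"    print(\"ROC-AUC Score is:\", round(roc_auc_score(y_true, y_pred), 3))\n",
"    print(\"PR-AUC Score is:\", round(average_precision_score(y_true, y_pred), 3))\n",
"\n",
"# example usage:\n",
"# report_metrics(\"Random Forest\", y_test, ranfor_model_fit)"
]
},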
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"#decision tree classifier\n",
"\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"dt_clf = DecisionTreeClassifier(random_state=42)\n",
"dt_model = dt_clf.fit(X_train, y_train)\n",
"dt_model_fit = dt_model.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Decision Tree:\n",
"\n",
"MCC is: 0.282\n",
"F1-score is: 0.513\n",
"Accuracy is: 0.683\n",
"TP Rate is: 0.476\n",
"TN Rate is: 0.795\n",
"ROC-AUC Score is: 0.636\n",
"PR-AUC Score is: 0.448\n"
]
}
],
"source": [
"tn, fp, fn, tp = confusion_matrix(y_test, dt_model_fit).ravel()\n",
"\n",
"print(\"Decision Tree:\\n\")\n",
"print(\"MCC is:\",\"\", matthews_corrcoef(y_test, dt_model_fit).round(3))\n",
"print(\"F1-score is:\",\"\", f1_score(y_test, dt_model_fit).round(3))\n",
"print(\"Accuracy is:\",\"\", accuracy_score(y_test, dt_model_fit).round(3))\n",
"print(\"TP Rate is:\",\"\", (tp/(tp + fn)).round(3))\n",
"print(\"TN Rate is:\",\"\", (tn/(tn + fp)).round(3))\n",
"print(\"ROC-AUC Score is:\",\"\", roc_auc_score(y_test, dt_model_fit).round(3))\n",
"print(\"PR-AUC Score is:\",\"\", average_precision_score(y_test, dt_model_fit).round(3))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"#gradient boosting\n",
"\n",
"from sklearn.ensemble import GradientBoostingClassifier\n",
"gradboost_clf = GradientBoostingClassifier()\n",
"gradboost_model = gradboost_clf.fit(X_train,y_train)\n",
"gradboost_model_fit = gradboost_model.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Gradient Boosting:\n",
"\n",
"MCC is: 0.332\n",
"F1-score is: 0.485\n",
"Accuracy is: 0.717\n",
"TP Rate is: 0.381\n",
"TN Rate is: 0.897\n",
"ROC-AUC Score is: 0.639\n",
"PR-AUC Score is: 0.471\n"
]
}
],
"source": [
"tn, fp, fn, tp = confusion_matrix(y_test, gradboost_model_fit).ravel()\n",
"\n",
"print(\"Gradient Boosting:\\n\")\n",
"print(\"MCC is:\",\"\", matthews_corrcoef(y_test, gradboost_model_fit).round(3))\n",
"print(\"F1-score is:\",\"\", f1_score(y_test, gradboost_model_fit).round(3))\n",
"print(\"Accuracy is:\",\"\", accuracy_score(y_test, gradboost_model_fit).round(3))\n",
"print(\"TP Rate is:\",\"\", (tp/(tp + fn)).round(3))\n",
"print(\"TN Rate is:\",\"\", (tn/(tn + fp)).round(3))\n",
"print(\"ROC-AUC Score is:\",\"\", roc_auc_score(y_test, gradboost_model_fit).round(3))\n",
"print(\"PR-AUC Score is:\",\"\", average_precision_score(y_test, gradboost_model_fit).round(3))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"#linear regression\n",
"\n",
"from sklearn.linear_model import LogisticRegression\n",
"linreg_clf = LogisticRegression()\n",
"linreg_model = linreg_clf.fit(X_train,y_train)\n",
"linreg_model_fit = linreg_model.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Linear Regression:\n",
"\n",
"MCC is: 0.475\n",
"F1-score is: 0.533\n",
"Accuracy is: 0.767\n",
"TP Rate is: 0.381\n",
"TN Rate is: 0.974\n",
"ROC-AUC Score is: 0.678\n",
"PR-AUC Score is: 0.555\n"
]
}
],
"source": [
"tn, fp, fn, tp = confusion_matrix(y_test, linreg_model_fit).ravel()\n",
"\n",
"print(\"Linear Regression:\\n\")\n",
"print(\"MCC is:\",\"\", matthews_corrcoef(y_test, linreg_model_fit).round(3))\n",
"print(\"F1-score is:\",\"\", f1_score(y_test, linreg_model_fit).round(3))\n",
"print(\"Accuracy is:\",\"\", accuracy_score(y_test, linreg_model_fit).round(3))\n",
"print(\"TP Rate is:\",\"\", (tp/(tp + fn)).round(3))\n",
"print(\"TN Rate is:\",\"\", (tn/(tn + fp)).round(3))\n",
"print(\"ROC-AUC Score is:\",\"\", roc_auc_score(y_test, linreg_model_fit).round(3))\n",
"print(\"PR-AUC Score is:\",\"\", average_precision_score(y_test, linreg_model_fit).round(3))"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"#One rule clssifier\n",
"\n",
"from sklearn.dummy import DummyClassifier\n",
"dummy_clf = DummyClassifier(strategy=\"stratified\")\n",
"onerule_model = dummy_clf.fit(X_train, y_train)\n",
"onerule_model_fit = dummy_clf.predict(X_test)"
]
},
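{
"cell_type": "markdown",
"metadata": {},
"source": [
"The article's One Rule (OneR) classifier builds a single rule from the most predictive feature; the DummyClassifier above is only a frequency-based baseline stand-in. A decision stump, i.e. a depth-1 decision tree, is a closer analogue. The sketch below reuses the two selected features and the same split; it is an alternative, not the article's exact implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# closer analogue of OneR: a decision stump (single split) on the two selected features\n",
"stump_clf = DecisionTreeClassifier(max_depth=1, random_state=42)\n",
"stump_model = stump_clf.fit(X_train, y_train)\n",
"stump_model_fit = stump_model.predict(X_test)\n",
"\n",
"tn, fp, fn, tp = confusion_matrix(y_test, stump_model_fit).ravel()\n",
"print(\"Decision Stump (OneR analogue):\\n\")\n",
"print(\"MCC is:\", round(matthews_corrcoef(y_test, stump_model_fit), 3))\n",
"print(\"Accuracy is:\", round(accuracy_score(y_test, stump_model_fit), 3))\n",
"print(\"TP Rate is:\", round(tp / (tp + fn), 3))\n",
"print(\"TN Rate is:\", round(tn / (tn + fp), 3))"
]
},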
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"One Rule Classifier:\n",
"\n",
"MCC is: -0.172\n",
"F1-score is: 0.238\n",
"Accuracy is: 0.467\n",
"TP Rate is: 0.238\n",
"TN Rate is: 0.59\n",
"ROC-AUC Score is: 0.414\n",
"PR-AUC Score is: 0.323\n"
]
}
],
"source": [
"tn, fp, fn, tp = confusion_matrix(y_test, onerule_model_fit).ravel()\n",
"\n",
"print(\"One Rule Classifier:\\n\")\n",
"print(\"MCC is:\",\"\", matthews_corrcoef(y_test, onerule_model_fit).round(3))\n",
"print(\"F1-score...