Please let me know how much time it's going to take to do.


SIT720 Machine Learning Assessment Task 4: Individual ML project © Deakin University FutureLearn

Assessment 4: Machine Learning Project

This document supplies detailed information on assessment tasks for this unit.

Key information
• Due: Wednesday 25 September 2019 by 11.30pm (AEST)
• Weighting: 30%

Learning Outcomes
This assessment assesses the following Unit Learning Outcomes (ULO) and related Graduate Learning Outcomes (GLO):

Unit Learning Outcome (ULO) | Graduate Learning Outcome (GLO)
ULO 2: Perform linear regression, classification using logistic regression and linear Support Vector Machines. | GLO 1: Discipline knowledge and capabilities; GLO 5: Problem solving
ULO 3: Perform non-linear classification using Support Vector Machines with kernels, Decision trees and Random forests. | GLO 1: Discipline knowledge and capabilities; GLO 5: Problem solving
ULO 4: Understand the concept of maximum likelihood and Bayesian estimation. | GLO 1: Discipline knowledge and capabilities; GLO 5: Problem solving
ULO 5: Construct a multi-layer neural network using backpropagation training algorithm. | GLO 1: Discipline knowledge and capabilities
ULO 6: Perform model selection and compute relevant evaluation measure for a given problem. | GLO 2: Communication

Purpose
This assessment is an extensive machine learning project. You will be given a specific data set for analysis and will be required to develop and compare various classification techniques. You must demonstrate skills acquired in data representation, classification and evaluation. You will use a lot of concepts learnt in this unit to come up with a good solution for a given human activity recognition problem.

Instructions
The dataset consists of training and testing data in "train" and "test" folders. Use training data: X_train.txt, labels: y_train.txt and testing data: X_test.txt, labels: y_test.txt. There are other files that also come with the dataset and may be useful in understanding the dataset better.
Please read the pdf file "dataset-paper.pdf" to answer Part 1.

Part 1: Understanding the data (2 Marks)
Answer the following questions briefly, after reading the paper.
• What is the objective of the data collection process? (0.5 Marks)
• What human activity types does this dataset have? How many subjects/people have performed these activities? (0.5 Marks)
• How many instances are available in the training and test sets? How many features are used to represent each instance? Summarize the type of features extracted in 2-3 sentences. (0.5 Marks)
• Describe briefly what machine learning model is used in this paper for activity recognition and how it is trained. What is the maximum accuracy achieved? (0.5 Marks)

https://cloudstor.aarnet.edu.au/plus/s/6Uxker9axWrCJAx

Part 2: K-Nearest Neighbour Classification (5 Marks)
Build a K-Nearest Neighbour classifier for this data.
• Let K take values from 1 to 50. Show a plot of cross-validation accuracy with respect to K. (1 Mark)
• Choose the best value of K based on model performance P. (2 Marks)
• Using the best K value, evaluate the model performance on the supplied test set. Report the confusion matrix, multi-class averaged F1-score and accuracy. (2 Marks)
[Hints: To choose the best K value, you have to do the following:
• For each value of K, use 10 fold cross-validation to compute the performance P.
• The best hyperparameter will be the one that gives maximum validation performance.
• Performance is defined as: P='f1-score' if fID=0, P='accuracy' if fID=1. Calculate fID using modulus operation fID=SID % 2, where SID is your student ID. For example, if your student ID is 356289 then fID=(356289 % 2)=1 then use 'accuracy' for selecting the best value of K.]

Part 3: Multiclass Logistic Regression with Elastic Net (5 Marks)
Build an elastic-net regularized logistic regression classifier for this data.
• Elastic-net regularizer takes in 2 parameters: alpha and l1-ratio. Use the following values for alpha: 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2. Use the following values for l1-ratio: 0, 0.15, 0.5, 0.7, 1. Choose the best values of alpha and l1-ratio based on model performance P. (2 Marks)
• Draw a surface plot of F1-score with respect to alpha and l1-ratio values. (1 Mark)
• Use the best value of alpha and l1-ratio to re-train the model on the training set and use it to predict the labels of the test set. Report the confusion matrix, multi-class averaged F1-score and accuracy. (1+1=2 Marks)
[Hints: To choose the best alpha/l1-ratio value, you have to do the following:
• For each value of hyperparameter, use 10 fold cross-validation to compute the performance P.
• The best hyperparameter will be the one that gives maximum validation performance.
• Performance is defined as: P='accuracy' if fID=0, P='f1-score' if fID=1. Calculate fID using modulus operation fID=SID % 2, where SID is your student ID. For example, if your student ID is 356289 then fID=(356289 % 2)=1 then use 'f1-score' for selecting the best value of alpha/l1-ratio.]

Part 4: Support Vector Machine (RBF Kernel) (6 Marks)
Build a SVM (with RBF Kernel) classifier for this data.
• SVM with RBF takes 2 parameters: gamma (length scale of the RBF kernel) and C (the cost parameter). Use the following values for gamma: 1e-3, 1e-4. Use the following values for C: 1, 10, 100, 1000. Choose the best values of gamma and C based on model performance P. (2 Marks)
• Draw a surface plot of F1-score with respect to gamma and C. Describe the graph. (1+1=2 Marks)
• Use the best value of gamma and C to re-train the model on the training set and use it to predict the labels of the test set. Report the confusion matrix, multi-class averaged F1-score and accuracy. (1+1=2 Marks)
[Hints: To choose the best gamma/C value, you have to do the following:
• For each value of hyperparameter, use 10 fold cross-validation to compute the performance P.
• The best hyperparameter will be the one that gives maximum validation performance.
• Performance is defined as: P='f1-score' if fID=0, P='precision' if fID=1, P='accuracy' if fID=2. Calculate fID using modulus operation fID=SID % 3, where SID is your student ID. For example, if your student ID is 356289 then fID=(356289 % 3)=0 then use 'f1-score' for selecting the best value of gamma/C.]

Part 5: Random Forest (6 Marks)
Build a Random forest classifier for this data.
• Random forest uses two parameters: the tree-depth for each decision tree and the number of trees. Use the following values for the tree-depth: 300, 500, 600 and the number of trees: 200, 500, 700. Choose the best values of tree-depth and number of trees based on model performance P. (2 Marks)
• Draw a surface plot of F1-score with respect to tree-depth and number of trees. Describe the graph. (1+1=2 Marks)
• Use the best value of tree-depth and number of trees to re-train the model on the training set and use it to predict the labels of the test set. Report the confusion matrix, multi-class averaged F1-score and accuracy. (1+1=2 Marks)
[Hints: To choose the 'tree-depth'/'number of trees' value, you have to do the following:
• For each value of hyperparameter, use 10 fold cross-validation to compute the performance P.
• The best hyperparameter will be the one that gives maximum validation performance.
• Performance is defined as: P='f1-score' if fID=0, P='precision' if fID=1, P='accuracy' if fID=2, P='recall' if fID=3. Calculate fID using modulus operation fID=SID % 4, where SID is your student ID. For example, if your student ID is 356289 then fID=(356289 % 4)=1 then use 'precision' for selecting the best value of 'tree-depth'/'number of trees'.]

Part 6: Discussion (6 Marks)
• Write a brief discussion about which classification method achieved the best performance and your thoughts on the reason behind this. (2 Marks)
• Which method performed the worst and why? (2 Marks)
• Do you have any suggestions to further improve model performances? (2 Marks)

Submission details
Deakin University has a strict standard on plagiarism as a part of Academic Integrity. To avoid any issues with plagiarism, students are strongly encouraged to run the similarity check with the Turnitin system, which is available through Unistart. A Similarity score MUST NOT exceed 39% in any case.
Late submission penalty is 5% per each 24 hours from 11.30pm, 25th of September. No marking on any submission after 5 days (24 hours X 5 days from 11.30pm 25th of September).
Be sure to downsize the photos in your report before your submission in order to have your file uploaded in time.

Extension requests
Requests for extensions should be made to Unit/Campus Chairs well in advance of the assessment due date. If you wish to seek an extension for an assignment, you will need to apply by email directly to Chandan Karmakar ([email protected]), as soon as you become aware that you will have difficulty in meeting the scheduled deadline, but at least 3 days before the due date. When you make your request, you must include appropriate documentation (medical certificate, death notice) and a copy of your draft assignment.
Conditions under which an extension will normally be approved include:
Medical: To cover medical conditions of a serious nature, e.g. hospitalisation, serious injury or chronic illness. Note: Temporary minor ailments such as headaches, colds and minor gastric upsets are not serious medical conditions and are unlikely to be accepted. However, serious cases of these may be considered.
Compassionate: e.g. death of close family member, significant family and relationship problems.
Hardship/Trauma: e.g. sudden loss or gain of employment, severe disruption to domestic arrangements, victim of crime.
Note: Misreading the timetable, exam anxiety or returning home will not be accepted as grounds for consideration.

Special consideration
You may be eligible for special consideration if circumstances beyond your control prevent you from undertaking or completing an assessment task at the scheduled time. See the following link for
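
The hints in Parts 2 to 5 pick the selection metric P from the student ID with a modulus, and the metric order differs from part to part. A minimal sketch of that lookup follows; the names `METRICS_BY_PART` and `select_metric` are illustrative and not part of the assignment, while the metric lists are copied from the hints.

```python
# Metric options per part, in the order given by the hints (index = fID).
METRICS_BY_PART = {
    2: ["f1-score", "accuracy"],
    3: ["accuracy", "f1-score"],
    4: ["f1-score", "precision", "accuracy"],
    5: ["f1-score", "precision", "accuracy", "recall"],
}


def select_metric(part, student_id):
    """Return the metric P for a part, using fID = SID % (number of options)."""
    options = METRICS_BY_PART[part]
    f_id = student_id % len(options)
    return options[f_id]


# The example student ID from the hints.
for part in sorted(METRICS_BY_PART):
    print(part, select_metric(part, 356289))
```

With scikit-learn, these metrics would typically be passed to `cross_val_score` through its `scoring` argument, for example `'accuracy'`, `'f1_macro'`, `'precision_macro'` or `'recall_macro'`. The macro averaging is an assumption here, since the assignment only says "multi-class averaged".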
Answered Same Day | Sep 22, 2021 | SIT720 | Deakin University

Ximi answered on Sep 25 2021
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SIT 720 - Machine Learning\n",
"\n",
"---\n",
"Lecturer: Chandan Karmakar | [email protected]\n",
"\n",
"\n",
"School of Information Technology,\n",
"Deakin University, VIC 3125, Australia.\n",
"\n",
"---\n",
"\n",
"## Assignment 4\n",
"\n",
"\n",
"In this assignment, you will use a lot of concepts learnt in this unit to come up with a good solution for a given human activity recognition problem.\n",
"\n",
"**Instructions**\n",
"1. The dataset consists of training and testing data in \"train\" and \"test\" folders. Use training data: X_train.txt labels: y_train.txt and testing data: X_test.txt labels: y_test.txt. There are other files that also come with the dataset and may be useful in understanding the dataset better.\n",
"\n",
"2. Please read the pdf file \"dataset-paper.pdf\" to answer Part 1.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 1: Understanding the data **(2 Marks)**\n",
"\n",
"Answer the following questions briefly, after reading the paper "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* What is the objective of the data collection process? **(0.5 Marks)**\n",
"\n",
"\n",
"\n",
"* What human activity types does this dataset have? How many subjects/people have performed these activities? **(0.5 Marks)** \n",
"\n",
"\n",
"\n",
"* How many instances are available in the training and test sets? How many features are used to represent each instance? Summarize the type of features extracted in 2-3 sentences. **(0.5 Marks)**\n",
"\n",
"\n",
"\n",
"* Describe briefly what machine learning model is used in this paper for activity recognition and how is it trained. How much is the maximum accuracy achieved? **(0.5 Marks)**\n"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [],
"source": [
"# Reading training data \n",
"with open('train/X_train.txt') as f:\n",
" train_x = f.read().split('\\n')\n",
"with open('train/y_train.txt') as f:\n",
" train_y = f.read().split('\\n')\n",
"# Reading testing data \n",
"with open('test/X_test.txt') as f:\n",
" test_x = f.read().split('\\n')\n",
"with open('test/y_test.txt') as f:\n",
" test_y = f.read().split('\\n')\n"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"# Convert each whitespace-separated row of strings to floats,\n",
"# keeping only complete rows with all 561 features (works on Python 2 and 3)\n",
"train_x = [row.strip().split() for row in train_x]\n",
"train_x = np.array([[float(v) for v in row] for row in train_x if len(row) == 561])\n",
"\n",
"test_x = [row.strip().split() for row in test_x]\n",
"test_x = np.array([[float(v) for v in row] for row in test_x if len(row) == 561])"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [],
"source": [
"# Convert each single-character label to float, dropping empty lines\n",
"train_y = np.array([float(label) for label in train_y if len(label) == 1])\n",
"\n",
"test_y = np.array([float(label) for label in test_y if len(label) == 1])"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data size: 7352, and features: 561\n"
]
}
],
"source": [
"print (\"Data size: %d, and features: %d\"%train_x.shape)"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"# Sampling training and test data for quick running of models\n",
"# Please comment out this code for actual model training\n",
"SAMPLE_SIZE = 100\n",
"train_x = train_x[:SAMPLE_SIZE]\n",
"train_y = train_y[:SAMPLE_SIZE]\n",
"\n",
"test_x = test_x[:SAMPLE_SIZE]\n",
"test_y = test_y[:SAMPLE_SIZE]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 2: K-Nearest Neighbour Classification **(5 Marks)**\n",
"\n",
"Build a K-Nearest Neighbor classifier for this data. \n",
"\n",
"- Let K take values from 1 to 50. Show a plot of cross-validation accuracy with respect to K. **(1 Mark)** \n",
"- Choose the best value of K based on model performance P. **(2 Marks)**\n",
"- Using the best K value, evaluate the model performance on the supplied test set. Report the confusion matrix, multi-class averaged F1-score and accuracy. **(2 Marks)**\n",
"\n",
"*[Hints: To choose the best K value, you have to do the following:*\n",
"- *For each value of K, use 10 fold cross-validation to compute the performance P.*\n",
"- *The best hyperparameter will be the one that gives maximum validation performance.*\n",
"- *Performance is defined as: P='f1-score' if fID=0, P='accuracy' if fID=1. Calculate fID using modulus operation fID=SID % 2, where SID is your student ID. For example, if your student ID is 356289 then fID=(356289 % 2)=1 then use 'accuracy' for selecting the best value of K.]*\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# Values of K from 1-50\n",
"K = list(range(1, 51))\n",
"accuracies = []\n",
"def model(k, X, y):\n",
" \"\"\"\n",
" Defining Model and performing 10-fold CV\n",
" \"\"\"\n",
" clf = KNeighborsClassifier(n_neighbors=k)\n",
" scores = cross_val_score(clf, X, y, cv=10)\n",
" return np.average(np.array(scores))\n",
"\n",
"for k_value in K:\n",
" accuracies.append(model(k_value, train_x, train_y))\n"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"# Structuring accuracy data for plotting\n",
"import pandas as pd\n",
"df = pd.DataFrame(list(zip(K, accuracies)), columns = ['K', 'Accuracy'])"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import seaborn as sns; sns.set()\n",
"sns.lineplot(x='K', y='Accuracy', data=df)"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"# Finding best K in terms of accuracy\n",
"best_k = df.sort_values('Accuracy', ascending=False)['K'].iloc[0]"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9482517482517483"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
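],
"source": [
"# 10-fold cross-validated accuracy of the best K, computed on the test split only\n",
"# (this does not score a model trained on the training set)\n",
"model(best_k, test_x, test_y)"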
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
" metric_params=None, n_jobs=None, n_neighbors=3, p=2,\n",
" weights='uniform')"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fit the model with the best K on the training set\n",
"clf = KNeighborsClassifier(n_neighbors=best_k)\n",
"clf.fit(train_x, train_y)"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [],
"source": [
"# Predictions on test set\n",
"test_y_predictions = clf.predict(test_x)"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 1.0 1.00 1.00 1.00 21\n",
" 4.0 0.60 0.12 0.21 24\n",
" 5.0 0.58 0.94 0.72 31\n",
" 6.0 1.00 1.00 1.00 24\n",
"\n",
" micro avg 0.77 0.77 0.77 100\n",
" macro avg 0.80 0.77 0.73 100\n",
"weighted avg 0.77 0.77 0.72 100\n",
"\n"
]
}
],
"source": [
"# Classification report on test set predictions\n",
"from sklearn.metrics import classification_report\n",
"print(classification_report(test_y, test_y_predictions))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 3: Multiclass Logistic Regression with Elastic Net **(5 Marks)**\n",
"\n",
"Build an elastic-net regularized logistic regression classifier for this data. \n",
"\n",
"- Elastic-net regularizer takes in 2 parameters: alpha and l1-ratio. Use the following values for alpha: 1e-4,3e-4,1e-3,3e-3, 1e-2,3e-2. Use the following values for l1-ratio: 0,0.15,0.5,0.7,1. Choose the best values of alpha and l1-ratio based on model performance P. **(2 Marks)**\n",
"\n",
"- Draw a surface plot of F1-score with respect to alpha and l1-ratio values. **(1 Mark)**\n",
"\n",
"- Use the best value of alpha and l1-ratio to re-train the model on the training set and use it to predict the labels of the test set. Report the confusion matrix, multi-class averaged F1-score and accuracy. **(1+1=2 Marks)**\n",
"\n",
"*[Hints: To choose the best alpha/l1-ratio value, you have to do the following:*\n",
"- *For each value of hyperparameter, use 10 fold cross-validation to compute the performance P.* \n",
"- *The best hyperparameter will be the one that gives maximum validation performance.*\n",
"- *Performance is defined as: P='accuracy' if fID=0, P='f1-score' if fID=1. Calculate fID using modulus operation fID=SID % 2, where SID is your student ID. For example, if your student ID is 356289 then fID=(356289 % 2)=1 then use 'f1-score' for selecting the best value of alpha/l1-ratio.]*\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Library/Python/2.7/site-packages/sklearn/linear_model/stochastic_gradient.py:166: FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.\n",
" FutureWarning)\n"
]
}
],
"source": [
"from sklearn.linear_model import SGDClassifier\n",
"\n",
"# Defining hyperparameters for grid searching\n",
"ALPHA = [1e-4,3e-4,1e-3,3e-3, 1e-2,3e-2]\n",
"L1_RATIO = [0,0.15,0.5,0.7,1]\n",
"\n",
"accuracies = []\n",
"def model(alpha, l1_ratio, X, y):\n",
" \"\"\"\n",
" Defining Model and performing 10-fold CV\n",
" \"\"\"\n",
" # SGDClassifier defaults to hinge loss (a linear SVM); loss='log' gives logistic regression\n",
" # (scikit-learn 1.1 and later name this loss 'log_loss')\n",
" clf = SGDClassifier(loss='log', penalty='elasticnet', alpha=alpha, l1_ratio=l1_ratio)\n",
" scores = cross_val_score(clf, X, y, cv=10)\n",
" return np.average(np.array(scores))\n",
"\n",
"for alpha in ALPHA:\n",
" for l1_ratio in L1_RATIO:\n",
" accuracy = model(alpha, l1_ratio, train_x, train_y)\n",
" accuracies.append({'alpha': alpha, 'l1_ratio': l1_ratio, 'accuracy': accuracy})"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame.from_records(accuracies)"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png":...
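
Parts 2 to 5 each ask for the confusion matrix, multi-class averaged F1-score and accuracy on the test set, while the visible part of the notebook prints only a classification report. The sketch below shows how those three quantities relate, in plain Python on made-up toy labels (not results from the dataset). The helper names are illustrative, and macro averaging is an assumption because the assignment does not say which average to use.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return float(correct) / total


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    n = len(matrix)
    scores = []
    for i in range(n):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(n)) - tp  # predicted i, true class differs
        fn = sum(matrix[i]) - tp                       # true i, predicted class differs
        denom = 2 * tp + fp + fn
        scores.append(2.0 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Toy labels only, chosen so one class-4 sample is mistaken for class 5.
y_true = [1, 1, 4, 4, 5, 5, 6, 6]
y_pred = [1, 1, 4, 5, 5, 5, 6, 6]
labels = [1, 4, 5, 6]

cm = confusion_matrix(y_true, y_pred, labels)
print(cm)
print("accuracy:", accuracy(cm))
print("macro F1:", macro_f1(cm))
```

In the notebook itself, the same numbers could be obtained from `confusion_matrix`, `accuracy_score` and `f1_score(..., average='macro')` in `sklearn.metrics`, applied to `test_y` and `test_y_predictions`.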