{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SIT 720 - Machine Learning\n",
"\n",
"---\n",
"Lecturer: Chandan Karmakar | [email protected]
\n",
"\n",
"\n",
"School of Information Technology,
\n",
"Deakin University, VIC 3125, Australia.\n",
"\n",
"---\n",
"\n",
"## Assignment 4\n",
"\n",
"\n",
"In this assignment, you will use a lot of concepts learnt in this unit to come up with a good solution for a given human activity recognition problem.\n",
"\n",
"**Instructions**\n",
"1. The dataset consists of training and testing data in \"train\" and \"test\" folders. Use training data: X_train.txt labels: y_train.txt and testing data: X_test.txt labels: y_test.txt. There are other files that also come with the dataset and may be useful in understanding the dataset better.\n",
"\n",
"2. Please read the pdf file \"dataset-paper.pdf\" to answer Part 1.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 1: Understanding the data **(2 Marks)**\n",
"\n",
"Answer the following questions briefly, after reading the paper "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* What is the objective of the data collection process? **(0.5 Marks)**\n",
"\n",
"\n",
"\n",
"* What human activity types does this dataset have? How many subjects/people have performed these activities? **(0.5 Marks)** \n",
"\n",
"\n",
"\n",
"* How many instances are available in the training and test sets? How many features are used to represent each instance? Summarize the type of features extracted in 2-3 sentences. **(0.5 Marks)**\n",
"\n",
"\n",
"\n",
"* Describe briefly what machine learning model is used in this paper for activity recognition and how is it trained. How much is the maximum accuracy achieved? **(0.5 Marks)**\n"
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [],
"source": [
"# Reading training data \n",
"with open('train/X_train.txt') as f:\n",
" train_x = f.read().split('\\n')\n",
"with open('train/y_train.txt') as f:\n",
" train_y = f.read().split('\\n')\n",
"# Reading testing data \n",
"with open('test/X_test.txt') as f:\n",
" test_x = f.read().split('\\n')\n",
"with open('test/y_test.txt') as f:\n",
" test_y = f.read().split('\\n')\n"
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"# Converting each data point into float for training and test data as they are in string\n",
"train_x = map(lambda x: x.strip().split(), train_x)\n",
"train_x = [np.array(map(float, row)) for row in train_x]\n",
"train_x = np.array(filter(lambda x: len(x) == 561, train_x))\n",
"\n",
"test_x = map(lambda x: x.strip().split(), test_x)\n",
"test_x = [np.array(map(float, row)) for row in test_x]\n",
"test_x = np.array(filter(lambda x: len(x) == 561, test_x))"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [],
"source": [
"# Converting each data point into float\n",
"train_y = map(float, filter(lambda x: len(x) == 1, train_y))\n",
"train_y = np.array(train_y)\n",
"\n",
"test_y = map(float, filter(lambda x: len(x) == 1, test_y))\n",
"test_y = np.array(test_y)"
]
},
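{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parsing in the cells above can also be sketched with `np.loadtxt`, assuming each file is plain whitespace-delimited numeric text. The helper `load_split` and the tiny demo files are hypothetical and only illustrate the idea; they are not part of the supplied dataset.\n",
"\n",
"```python\n",
"import os\n",
"import tempfile\n",
"\n",
"import numpy as np\n",
"\n",
"\n",
"def load_split(x_path, y_path):\n",
"    # Hypothetical helper: load a feature matrix and its label vector\n",
"    X = np.loadtxt(x_path, ndmin=2)  # one row per instance\n",
"    y = np.loadtxt(y_path, ndmin=1)  # one label per instance\n",
"    if X.shape[0] != y.shape[0]:\n",
"        raise ValueError('feature and label counts differ')\n",
"    return X, y\n",
"\n",
"\n",
"# Tiny demo files standing in for X_train.txt / y_train.txt (assumption)\n",
"demo_dir = tempfile.mkdtemp()\n",
"x_path = os.path.join(demo_dir, 'X_demo.txt')\n",
"y_path = os.path.join(demo_dir, 'y_demo.txt')\n",
"with open(x_path, 'w') as f:\n",
"    f.write('0.1 0.2 0.3\\n0.4 0.5 0.6\\n')\n",
"with open(y_path, 'w') as f:\n",
"    f.write('5\\n4\\n')\n",
"\n",
"X_demo, y_demo = load_split(x_path, y_path)\n",
"print(X_demo.shape, y_demo)\n",
"```\n",
"\n",
"With the real data, the call would look like `load_split('train/X_train.txt', 'train/y_train.txt')`."
]
},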
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data size: 7352, and features: 561\n"
]
}
],
"source": [
"print (\"Data size: %d, and features: %d\"%train_x.shape)"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"# Sampling training and test data for quick running of models\n",
"# Please comment out this code for actual model training\n",
"SAMPLE_SIZE = 100\n",
"train_x = train_x[:SAMPLE_SIZE]\n",
"train_y = train_y[:SAMPLE_SIZE]\n",
"\n",
"test_x = test_x[:SAMPLE_SIZE]\n",
"test_y = test_y[:SAMPLE_SIZE]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 2: K-Nearest Neighbour Classification **(5 Marks)**\n",
"\n",
"Build a K-Nearest Neighbor classifier for this data. \n",
"\n",
"- Let K take values from 1 to 50. Show a plot of cross-validation accuracy with respect to K. **(1 Mark) ** \n",
"- Choose the best value of K based on model performance P. **(2 Marks) **\n",
"- Using the best K value, evaluate the model performance on the supplied test set. Report the confusion matrix, multi-class averaged F1-score and accuracy. **(2 Marks)**\n",
"\n",
"*[Hints: To choose the best K value, you have to do the following:*\n",
"- *For each value of K, use 10 fold cross-validation to computer the performance P. *\n",
"- *The best hyperparameter will be the one that gives maximum validation performance.*\n",
"- *Performance is defined as: P='f1-score' if fID=0, P='accuracy' if fID=1. Calculate fID using modulus operation fID=SID % 2, where SID is your student ID. For example, if your student ID is 356289 then fID=(356289 % 2)=1 then use 'accuracy' for selecting the best value of K.]*
\n",
"\n"
]
},
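{
"cell_type": "markdown",
"metadata": {},
"source": [
"The selection rule in the hints can be sketched as follows. This is a minimal illustration, not a model answer: it runs on synthetic data from `make_classification` instead of the HAR files, uses the example student ID from the hint, and assumes that the multi-class averaged F1-score means macro averaging (scikit-learn's `f1_macro`).\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.metrics import accuracy_score, confusion_matrix, f1_score\n",
"from sklearn.model_selection import cross_val_score, train_test_split\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"SID = 356289  # example ID from the hint; replace with your own\n",
"fID = SID % 2\n",
"scoring = 'f1_macro' if fID == 0 else 'accuracy'  # macro averaging is an assumption\n",
"\n",
"# Synthetic stand-in for the HAR features and labels (illustration only)\n",
"X, y = make_classification(n_samples=300, n_features=20, n_informative=10,\n",
"                           n_classes=3, random_state=0)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,\n",
"                                          stratify=y, random_state=0)\n",
"\n",
"# Mean 10-fold CV performance P for each K from 1 to 50\n",
"k_values = list(range(1, 51))\n",
"cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr,\n",
"                             cv=10, scoring=scoring).mean() for k in k_values]\n",
"best_k = k_values[int(np.argmax(cv_scores))]\n",
"\n",
"# Refit with the best K and evaluate on the held-out split\n",
"clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)\n",
"y_pred = clf.predict(X_te)\n",
"cm = confusion_matrix(y_te, y_pred)\n",
"print('best K:', best_k)\n",
"print(cm)\n",
"print('macro F1:', f1_score(y_te, y_pred, average='macro'))\n",
"print('accuracy:', accuracy_score(y_te, y_pred))\n",
"```\n",
"\n",
"Because `np.argmax` returns the first maximum, ties between K values resolve to the smallest K, which is one reasonable convention among several."
]
},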
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# Values of K from 1-50\n",
"K = list(range(1, 51))\n",
"accuracies = []\n",
"def model(k, X, y):\n",
" \"\"\"\n",
" Defining Model and performing 10-fold CV\n",
" \"\"\"\n",
" clf = KNeighborsClassifier(n_neighbors=k)\n",
" scores = cross_val_score(clf, X, y, cv=10)\n",
" return np.average(np.array(scores))\n",
"\n",
"for k_value in K:\n",
" accuracies.append(model(k_value, train_x, train_y))\n"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"# Structuring accuracy data for plotting\n",
"import pandas as pd\n",
"df = pd.DataFrame(list(zip(K, accuracies)), columns = ['K', 'Accuracy'])"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"
"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"