Shreyan answered on May 31 2021
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simple Nearest Neighbours document classification\n",
"\n",
"For this project, you will write a program that can classify documents into any number of\n",
"classes, based on provided training data.\n",
"\n",
"The training data consists of a number of sets of documents. Each set represents a \n",
"\"class\" of documents, and the task is, given a new document, to find which class it \n",
"belongs to.\n",
"\n",
"There are many classification algorithms; we will look at one of the simplest possible \n",
"methods, which still often works quite well in practice: the *nearest \n",
"neighbour classifier*.\n",
"\n",
"The idea is as follows:\n",
"1. Define a distance function between documents. The distance between two documents \n",
"that are equal is zero. The more different the documents are, the larger the distance.\n",
"You have considerable freedom in choosing which distance function you want to use,\n",
"but some will work better than others!\n",
"2. Given a new document that we want to classify, we calculate the distance between it \n",
"and *all* documents in the training data. We find the closest matching document in the \n",
"training set, and we report the class of that document as the predicted class for the new \n",
"document.\n",
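"\n",
"As a rough illustration of these two steps, here is a minimal sketch. It assumes documents\n",
"are already represented as lists of words, and it uses a hypothetical `distance` function\n",
"(one minus the Jaccard overlap of the two word sets), which is only one of many possible\n",
"choices. The names `distance`, `classify` and `training` are not part of the provided code.\n",
"\n",
"```python\n",
"def distance(doc_a, doc_b):\n",
"    # Assumed distance: 1 - |A & B| / |A | B| over the sets of words.\n",
"    a, b = set(doc_a), set(doc_b)\n",
"    if not a and not b:\n",
"        return 0.0  # two empty documents are treated as equal\n",
"    return 1.0 - len(a & b) / len(a | b)\n",
"\n",
"def classify(new_doc, training):\n",
"    # training: list of (class_name, document) pairs.\n",
"    # Return the class of the training document closest to new_doc.\n",
"    best_class, _ = min(training, key=lambda pair: distance(new_doc, pair[1]))\n",
"    return best_class\n",
"\n",
"training = [('spam', ['win', 'money', 'now']), ('ham', ['meeting', 'at', 'noon'])]\n",
"print(classify(['free', 'money', 'now'], training))  # prints spam\n",
"```\n",
"\n",
"Note that this particular distance ignores word order and word counts, so it is a starting\n",
"point rather than a recommended choice.\n",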
"\n",
"You can test your program on the spam email dataset provided with\n",
"this project. The steps below give many hints on how to process your \n",
"data, which functions and data structures to use, and how to set up your distance function,\n",
"so read them closely!\n",
"\n",
"The process has been split into several steps, and each step lists a date\n",
"by which you should try to finish it. It is absolutely okay to move faster than\n",
"the dates indicate, but try not to fall behind, to avoid stress."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### Step 1 (complete by 19-05)\n",
"\n",
"Make a project folder; download and unzip the mail dataset and save it there.\n",
"\n",
"As a first step to building a classifier, you are provided with some code that will help you read all the email documents. First we need to find out what files actually appear in a given directory. To make sure there are no misunderstandings, we make a couple of **data definitions** regarding files and directories. We will make several more data definitions further down. As I'm sure you're aware by now, data definitions are crucial for understanding any program, so make sure to keep these well in mind! If you get confused, it can be worth it to make a little reference document for yourself with all the data definitions in there.\n",
"\n",
"```\n",
"A Path is a string: a name of a file or directory, \n",
"relative to the project directory. Example: \"email/train/spam\".\n",
"\n",
"A Dirname is a string: the name of a directory, but without its path.\n",
"Example: \"ham\".\n",
"\n",
"A Filename is a string: the name of a file, but without its path.\n",
"Example: \"0001.f0cf04027e74802f09f723cb8916b48e\".\n",
"```\n",
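"\n",
"To make the relationship between these definitions concrete, here is a small sketch using\n",
"`os.path`. It reuses the example names above; `file_path` is a name chosen here for illustration only.\n",
"\n",
"```python\n",
"from os.path import basename, join\n",
"\n",
"path = 'email/train/spam'  # a Path\n",
"filename = '0001.f0cf04027e74802f09f723cb8916b48e'  # a Filename\n",
"\n",
"# Joining a Path and a Filename gives a Path to that file.\n",
"file_path = join(path, filename)\n",
"\n",
"# The last component of a Path is a Filename or a Dirname.\n",
"assert basename(file_path) == filename\n",
"assert basename(path) == 'spam'  # a Dirname\n",
"```\n",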
"\n",
"The functions `filelist` and `dirlist` are supplied. `filelist` returns the names of all files in a given path, and `dirlist` returns the names of all directories in it.\n",
"Try out the `filelist` function below by reading and printing the files in one of the email directories.\n",
"\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [],
"source": [
"from os import listdir # used to see what files are in a directory\n",
"from os.path import isfile, isdir, join # to check if a thing is a file or a directory\n",
"\n",
"# Path -> [Filename]\n",
"def filelist(path):\n",
"    \"Find the names of all files in a given directory.\"\n",
"    return [f for f in listdir(path) if isfile(join(path, f))]\n",
"\n",
"# Path -> [Dirname]\n",
"def dirlist(path):\n",
"    \"Find the names of all directories in a given directory.\"\n",
"    return [d for d in listdir(path) if isdir(join(path, d))]"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1',\n",
" '0001.bfc8d64d12b325ff385cca8d07b84288',\n",
" '0002.24b47bb3ce90708ae29d0aec1da08610',\n",
" '0003.4b3d943b8df71af248d12f8b2e7a224a',\n",
" '0004.1874ab60c71f0b31b580f313a3f6e777',\n",
" '0005.1f42bb885de0ef7fc5cd09d34dc2ba54',\n",
" '0008.9562918b57e044abfbce260cc875acde',\n",
" '0010.7f5fb525755c45eb78efc18d7c9ea5aa',\n",
" '0011.2a1247254a535bac29c476b86c708901',\n",
" '0013.9034ac0917f6fdb82c5ee6a7509029ed',\n",
" '0014.ed99ffe0f452b91be11684cbfe8d349c',\n",
" '0018.259154a52bc55dcae491cfded60a5cd2',\n",
" '0020.4120dc06a0124a8688e96f8cff029113',\n",
" '0026.4f10fab6e6776379c17ee9c9ac7da4a8',\n",
" '0028.83a43dd97923463030349506a56226c1',\n",
" '0032.081c3615bc9b91d09b6cbb9239ba8c99',\n",
" '0034.d5a5e526aa6b249ed6ca184548a44b1a',\n",
" '0040.256602e2cb5a5b373bdd1fb631d9f452',\n",
" '0041.21cc985cc36d931916863aed24de8c27',\n",
" '0042.21cc985cc36d931916863aed24de8c27',\n",
" '0043.8d93819b95ff90bf2e2b141c2909bfc9',\n",
" '0045.75baa6797e2a65053a8373d5aa96f594',\n",
" '0046.0b4fff9cd7cffe94cc4f04bbf3928c28',\n",
" '0049.625bab436c7fc6299cfceeaa24e198ae',\n",
" '0051.374f4d4300a5d39544b2f052e7a9429d',\n",
" '0052.f12ac251d1fbdc679daadc6b97229e63',\n",
" '0053.92bcea73123d0ea0fb26c285d5e045a9',\n",
" '0054.839a9c0a07f13718570da944986a898a',\n",
" '0057.92fdae44bdd1d9e5461eef3c852dfd23',\n",
" '0059.a633106e3ce62fa7b46c2e4dc8c666d3',\n",
" '0061.c148ebba16540e48c7aae2e3f733a8a3',\n",
" '0064.d700742b9815d990b2e5a7921e8d854c',\n",
" '0071.4c3840b98dc207623d0c0e66a6d40af2',\n",
" '0073.d57c16429fa19fbebfb9aec34f391aa2',\n",
" '0075.4568998f41d50bccf8f7c3d4aeb7a425',\n",
" '0078.8ff64b5c77f9c9618bd7b119ae14c8b2',\n",
" '0079.4a5fbaf2e531918c44642b3cfae40089',\n",
" '0082.705cdc08c2ee77821391a847d9c1a4e3',\n",
" '0083.a042c7512d5db5f9fc1857fdc6bbdcc3',\n",
" '0084.df5ac85de3405b6d07c9fa7ba3eecf6a',\n",
" '0087.1cbd88a0c1564cb5d6c9b12c8c4175d8',\n",
" '0090.9a7e76d58065e29e709161dbe569fe54',\n",
" '0091.113ec7122d4046a2754bcf70b9fb5299',\n",
" '0092.bf7453c6b7917ca30074a3030d84e36d',\n",
" '0093.2bb8a2a7e4d2841a14f27f32076dd77e',\n",
" '0095.e1db2d3556c2863ef7355faf49160219',\n",
" '0096.b2cb600e893f7a663ea5f9bff3a6276e',\n",
" '0097.dce08392ba6bc552d13394fa73974b62',\n",
" '0098.01d2958ccb7c2e4c02d0920593962436',\n",
" '0099.c4ff6dba0a5177d3c7d8ef54c8920496',\n",
" '0100.c60d1c697136b07c947fa180ba3e0441',\n",
" '0102.2e3969075728dde7a328e05d19b35976',\n",
" '0103.8c39bfed2079f865e9dfb75f4416a468',\n",
" '0106.fa6df8609cebb6f0f37aec3f70aa5b9a',\n",
" '0108.4506c2ef846b80b9a7beb90315b22701',\n",
" '0111.a163d41592b3a52747d7521341a961af',\n",
" '0114.c104ada3a249e1e1846c0cd156a303e9',\n",
" '0116.8e13644b995f98dbab198b71e26f67ec',\n",
" '0119.07aedc59172c0c25ef617188ada9b80f',\n",
" '0123.68e87f8b736959b1ab5c4b5f2ce7484a',\n",
" '0124.37afd066a74d18b7f14bea0b1fb43d4d',\n",
" '0125.44381546181fc6c5d7ea59e917f232c5',\n",
" '0128.4da9b2cfacbe9bfd128aacbb526d68d4',\n",
" '0129.78a705ff6b3bde3395d067459e6e46e2',\n",
" '0131.0b7281078874ca88f95d6fdf5d905d50',\n",
" '0133.95454d70cc62190c0e167d4c3cb591af',\n",
" '0134.83a63d7a1589ba4cd6aefe20c8e6385f',\n",
" '0135.73d44c9405f00110ae76a3addcb4eed6',\n",
" '0136.7e7d6adf293fa0a3dc56b3f796cf00d1',\n",
" '0140.a2bb669eaf743ed123fca884a40cfbd4',\n",
" '0141.516a4fe92f63469bd4a21d46dd6bb3be',\n",
" '0142.1fd05cffaba260b9ecd3e75b6dddaf73',\n",
" '0143.260a940290dcb61f9327b224a368d4af',\n",
" '0145.ec89d85ec20f9aeda6fe37c0b6e8bbed',\n",
" '0146.6656452972931e859e640f6ac57d2962',\n",
" '0148.7641581f551a1bf533b995087a8a91db',\n",
" '0150.30c44c205041fd95f00ef524ea54e356',\n",
" '0151.6f8f0ec4d897a5285d662ef4ec31d924',\n",
" '0152.c0ea23686b9ad63dfba6040c1539da71',\n",
" '0153.eddc658b08a04641a2494ba6b6eb0a3c',\n",
" '0154.e39fc51ffdb9c2ecd480ce972078aeaa',\n",
" '0156.279e5f92cf12922fbbf0cbda112b7fcb',\n",
" '0159.8a5c778f65ecc30e14507369b9eb8292',\n",
" '0161.00e60d1a3478f1ae99ff49fbd4b30605',\n",
" '0163.e4abb3f86aa9fd5bfa85886055fd923d',\n",
" '0167.1665f2336b63debb3463fcf4d37e8485',\n",
" '0170.fe4f77fa9456b48dffa9288074b2bb2a',\n",
" '0172.e524e85cab354337018e1d0d2fc21ffd',\n",
" '0174.3874b6ff3c86a5ebefb558138a6bfb28',\n",
" '0175.bf85f34d953215bca7d0004aca087812',\n",
" '0176.70022adaab1a9dfe64ae7588ffa5add9',\n",
" '0177.d62ac309d8030ef816f7831c3d5d3f7d',\n",
" '0178.bf2ab7492e5080b07d7397b0662821a7',\n",
" '0181.e3259c0ef889b5c76054abe2fafddeda',\n",
" '0183.4aaadeb40e3362e71e3e4aba15624e3a',\n",
" '0185.9f02f77f7f5a2724c109f598b2245675',\n",
" '0187.e2178f6d01a70dfbdf9c84c4dcaf58dc',\n",
" '0188.6590e73ef71e79c5b6adedbacf91ac8c',\n",
" '0192.2d3e74aaf18c1c4193067f025e757507',\n",
" '0193.4ceae11e1dae2059c9a526eebda8b259',\n",
" '0195.8b276e08dd05b0131faa8fb24764f205',\n",
" '0197.6968d98720065059247cefe4e5bcd192',\n",
" '0198.43bac6df7ea16e4c4b0026779341f14b',\n",
" '0199.955edee89f34960c033c4d1072841356',\n",
" '0200.a56926c058fa84b0ea031b5774e5dcfa',\n",
" '0202.f1c9a17fe805c50677c104743e4f8be2',\n",
" '0205.d3c294d833fd7c79edd96dac71039821',\n",
" '0206.806263422d55d38a151fe3b89d56192f',\n",
" '0209.59817ef0dc8d05d4b49bd5914fa88afa',\n",
" '0213.5f17fdf863726d4704840f86f698d10b',\n",
" '0214.b5ba0ff48cee07a36c6f312de7f77207',\n",
" '0216.feb2a8df9887bc2d84e80c9d2a8faf56',\n",
" '0217.2a937e0b9912e1e40dbf17bad6026372',\n",
" '0219.0f66069db1b4e25ba851233ce4a107c4',\n",
" '0221.ee1d208001fd30265827fb309441d662',\n",
" '0222.6ad799703d958681d6e427762f86f179',\n",
" '0226.409b6577c79d85773d50cb37fde4ba79',\n",
" '0230.035bbcbe1235cb6fdd0a5d6d626dc5c4',\n",
" '0231.30ae582570716a95e79c87a2de31cb30',\n",
" '0233.e9834d55f8185a84ce8a047b2eba2139',\n",
" '0236.ca8e7524e271aec0324e707cb7d420a1',\n",
" '0238.7d0de37650a0c0e2d99e52eef4042602',\n",
" '0239.43b3279a300a122610f91725bb92a538',\n",
" '0240.96467ad3d42ebd44b042599f5aa9c9d9',\n",
" '0241.abb2882a304357a47681f887244c2f76',\n",
" '0243.458c8e32e405b69f561fd77bc16f440c',\n",
" '0245.39c15852204971c72e8d89f9f3f9bb38',\n",
" '0249.c429ab5c1413c4386bf64b228a68e768',\n",
" '0251.d542591a25f8fe8c4accd692113a0554',\n",
" '0252.c90694cf3f09ef0111b761eefd95cc3f',\n",
" '0255.42a6feb4435a0a68929075c0926f085d',\n",
" '0256.ad88c1a165392a509a8b0b8df6d56cbd',\n",
" '0257.554324ab4a8f7093f5222303a4c59a8b',\n",
" '0258.1d61b380a23168881253ed86bb4f79ac',\n",
" '0260.737eefb83e7eedbd531117c273c56241',\n",
" '0261.93d7dab5dc0c469b58aa9b0e5e25bb25',\n",
" '0262.c996a3709ca616fce1bfc6d50cf5bda3',\n",
" '0264.2281c4eb36accd65d9c2cab379de2789',\n",
" '0266.99e95dc7251843f7a2015cb602775694',\n",
" '0267.0bf79a17115bffdf00bb0997f773dfc5',\n",
" '0270.d50e186af7a00114ad967b8f77b70338',\n",
" '0275.0404a07cd99e27d569958716f392082b',\n",
" '0280.2507969221ea95a019506366f6c361d8',\n",
" '0281.7e8c08897b61b9b008238efec9ca8d15',\n",
" '0284.cfe6e278b87c3e9b6abf6cf6a16bf708',\n",
" '0285.b44ae825681c0f28db2e742ab790b191',\n",
" '0287.37dd6b1a54993de94495643ead4fd2cf',\n",
" '0290.13035c75be0d5b447a10e2263f8c1361',\n",
" '0296.c9b10ba5ae2e480e37a6e2e1455671eb',\n",
" '0298.804507b6d4d03a86e53c63249fe70772',\n",
" '0299.9d0b292172cb787eb2ed9e8855222edd',\n",
" '0303.c18c1a0222b07f2b2250fbda5a961b7e',\n",
" '0306.521d917ac6509c499c406647fd0d336b',\n",
" '0310.23036f6ae05720b052b73117b6ecb957',\n",
" '0312.a0e7f2633bd0ceaddf16fba58be54778',\n",
" '0319.e4a20802d12937998f3b3bf805362a3f',\n",
" '0320.e34c9c6f982b8ce353c10aa362d6da17',\n",
" '0321.89f41bbace08275ee298ed419e22bc9a',\n",
" '0323.badf0273f656afd0dfebaa63af1c81f6',\n",
" '0325.78b93ee9713b6594d03c86993286e6c5',\n",
" '0327.5df76bb4359800b5408821285677b5cf',\n",
" '0330.a4df526233e524104c3b3554dd8ab5a8',\n",
" '0331.1de50a02d91a4e0c6daf5e2cc28a60c6',\n",
" '0334.3e4946e69031f3860ac6de3d3f27aadd',\n",
" '0337.4e2d92485e5b880d494821c1fcee790a',\n",
" '0338.033c0109da096486c7d797cccd2c3198',\n",
" '0342.babb5045c49b585808041391599bc05d',\n",
" '0344.8bbe5c7c8269a039761968a1b10a936a',\n",
" '0346.8c8e3c5107bf6bf30b940f79d598c1b9',\n",
" '0347.e74f831074ea17d0721bd06a5fa7857c',\n",
" '0350.0f2ef01282cb99a4eeb9a19923597b3f',\n",
" '0356.86a795300367f707a8b648e0c50253ad',\n",
" '0358.8a6a162daac1368fcfe83a5db1084ee1',\n",
" '0359.2794a4ec8f226ea59a009e972d012f64',\n",
" '0362.d605ea00a259c1245d6e21ecf38264cf',\n",
" '0364.8e5f3385c2deb2c0c32794b403851ec4',\n",
" '0366.539843bed9a06ae77966ccbc9dc2e103',\n",
" '0368.3a53888c2f7fbe52a7293f223375c245',\n",
" '0369.2530542de47d461ccb925fcafc6f0ad5',\n",
" '0372.216f90ef52558ed24402e192586a40e8',\n",
" '0373.2171ee7f8e73e1092279077df2910ff6',\n",
" '0374.ed17ed71f8d321cf8505672678c56e71',\n",
" '0375.ad5939ae436ed745d5222893d5ffe191',\n",
" '0376.d87b4313e6c43a986060d57a0b8515a6',\n",
" '0378.36f7856d38f84ffea7f1fd98044f756e',\n",
" '0380.c4d530b5816543f4f1a23b8ce0d281f5',\n",
" '0383.5b89d5a9c0152070a77e133734f7cd83',\n",
" '0384.e25b766bea2f1efe35eccb7eb6f54e37',\n",
" '0385.8db8e827e6fec2fae5f7e407fe0e0ca3',\n",
" '0386.27345c618f7ca368d7a12b0dd09a9da3',\n",
" '0387.c2b993b46377256bdcb2314c2553b6f0',\n",
" '0388.23ff533336b63fb45d267b8cbe59b7b4',\n",
" '0389.ed4ca8aceef91808c783909351c7bdb4',\n",
" '0391.a52ab775baefe8b277a285560cac7d78',\n",
" '0392.9e194dfff92f7d9957171b04a8d4b957',\n",
" '0393.d3a4d296a35c6a7f39429247c007eeae',\n",
" '0394.9c882c72ddfd810b56776fdaa1c727a6',\n",
" '0400.a152ca3d2735f5dfe48601331471c591',\n",
" '0403.5aa6261d36d1362bcd181ed7738de7f7',\n",
" '0404.a2c9ac35a89a129ce473c5d977409131',\n",
" '0405.18a5c3d971e1def2c3b4a2df122f3583',\n",
" '0406.4b29229820cc5e9675ad369a3a000f43',\n",
" '0409.09cb28cd8753bff06fc8a547c3ed8fe2',\n",
" '0411.e6e37cbb02ad33b4e0ba5fb6caf2bbcf',\n",
" '0415.e241b6184464107168656739bf96c6b9',\n",
" '0419.a42a284750591b454968a76dfab38370',\n",
" '0420.6112350c5fb3dcf5a67a4fafac80702e',\n",
" '0421.a5e7e7b43acb5501368b8c61221477f1',\n",
"...