This programming assignment should be written in C++ at a beginner level.
Spam or Non Spam?

The data for this problem comes from https://github.com/gognjanovski/SpamFilterMachineLearning (you don't need to download anything from here; instead use the link below). A spam filter for email automatically classifies an incoming email as spam or non-spam. One way to build a spam filter is to use machine learning, especially if you want to customize the filter per person. In this problem we'll use a very simple machine learning algorithm called nearest neighbor to build a classifier for spam filtering and test its effectiveness.

A machine learning algorithm learns by being fed examples of the concept to learn. For example, in our spam classifier, we will feed it training examples of emails that are already classified as spam to teach it what spam looks like. Then we will feed it training examples of emails that are classified as non-spam to teach it what non-spam looks like. This set of examples is called the training set.

Once training is complete, we usually want to test our classifier to see how well it works. One way to do this is to take a different set of unseen examples, the test set, run it through the classifier, and see how accurate the trained classifier is on the unseen data. For our spam filter example, we'll have another set of spam emails that we have the machine learning algorithm classify, and we count how many of these emails are classified as spam and how many as non-spam. If the algorithm classifies one of these emails as spam it is correct; a non-spam classification is incorrect. Then we do the same thing with a set of non-spam emails, where now the algorithm is correct if they are classified as non-spam. From the totals we can calculate the overall classification accuracy on the test set. The closer this is to 100%, the better!

Spam Dataset

I have the data and some starter code in this zip file: spam.zip. There are four folders with the following contents:

· spam-train: 350 text files, each is an email considered to be spam, used to train the classifier
· nonspam-train: 350 text files, each is an email considered to be non-spam, used to train the classifier
· spam-test: 130 text files, each is an email considered to be spam, used to test the classifier
· nonspam-test: 130 text files, each is an email considered to be non-spam, used to test the classifier

In the root folder there are four more text files: filelist-nonspam-test.txt, filelist-nonspam-train.txt, filelist-spam-test.txt, and filelist-spam-train.txt. These text files contain the names of each file in the respective folder listed above, to give your program an easy way to read in each file for training or testing.

Representing an Email Document

If you look at one of the text files, for example spmsgb11.txt in the spam-train folder, it looks like this:

re never forget ll never forget again forget friend birthday ...

It is simply a list of words that were in the email document. However, the email has been pre-processed to remove punctuation, capitalization, and common words like "the", "and", "I", etc. As a result, we have a list of what are called terms instead of words. To represent an email document in our program we will use a singly linked list of Term objects. Each Term object contains the string for the term and an integer that counts how many times the term appeared. For the example above, our linked list would contain one node per distinct term: "re" has a count of 1 because it appeared once, while "forget" has a count of 3 because it appeared three times. The order of the nodes doesn't matter.
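To make that picture concrete, here is a small self-contained sketch of the idea. The real Term class is defined in term.h and the real linked list comes from the problem 3 code, so the struct definitions below are only simplified stand-ins for illustration, not the classes you will actually use:

    #include <iostream>
    #include <string>

    // Stand-in for the Term class in term.h: the term string plus how often it appeared.
    struct Term {
        std::string text;
        int count;
    };

    // Stand-in for a singly linked list node holding a Term.
    struct TermNode {
        Term item;
        TermNode* next;
    };

    int main() {
        // Only the first few terms of spmsgb11.txt, built by hand:
        // "forget" appeared three times, "never" twice, "re" once.
        TermNode forgetNode{{"forget", 3}, nullptr};
        TermNode neverNode{{"never", 2}, &forgetNode};
        TermNode reNode{{"re", 1}, &neverNode};

        for (TermNode* cur = &reNode; cur != nullptr; cur = cur->next) {
            std::cout << cur->item.text << " appeared " << cur->item.count << " time(s)\n";
        }
        return 0;
    }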
In our case the Node class will use a template for the item so we can plug in different types. In this assignment, our template type will be the Term class.

Part A (5 pts): Read an email document into a linked list of terms

The zip file above contains the linked list code from problem 3, the Term class in term.h, and a skeleton main.cpp for you to fill in. If you are using unix, it can be compiled from the command line with g++ main.cpp -std=c++11.

To do: Complete this function:

LinkedList* readEmail(string filename) { }

The function should take the pathname to an email file (e.g. spam-train/spmsgb11.txt) and create a linked list of nodes with Term objects that represents the document as described above. Tip: depending on how you read each term from the file, you might end up with an empty read at the end of the file. The function should then return a pointer to the linked list object. I recommend you add some sample calls from main to test the function. You can use the print method in the linked list class to output a linked list.

Part B (5 pts): Read multiple documents into an array of linked lists

To do: Complete this function:

void readEmails(LinkedList* documents[], string filenames, string basepath) { }

The filenames parameter is the name of the file containing the list of files in a folder, e.g. filelist-spam-train.txt. The basepath parameter is the name of the matching folder, e.g. spam-train/. The function should read each email document from the folder into a linked list using the readEmail function from part A. The pointer to each linked list should be stored in the documents array. Note that this array is already created for you in the main function. An example: each entry in the array is a pointer to a linked list representing one email document. You will probably want to add some test code in main to check that the function is working correctly. In main I already have calls to this function for each of the four folders.

Part C (5 pts): Comparing two documents (linked lists)

An email document is represented as a linked list. The nearest neighbor algorithm requires some way to compare one document to another and get a value that represents how similar the documents are. For this assignment we'll use a simple comparison metric that just computes the percentage of terms that are the same, ignoring the frequency counter for now. For example, suppose we have these two documents:

· doc1 → never → forget → hot → krunk
· doc2 → hello → hot → reply

If we compare doc1 to doc2, where doc1 is the source and doc2 is the target, then we go through every term in doc1 and count how many are in doc2. "hot" is the only term that appears in doc2, so the similarity is 1/4 or 0.25. If we compare doc2 to doc1, where doc2 is the source and doc1 is the target, then we go through every term in doc2 and count how many are in doc1. "hot" is once again the only term that appears in doc1, so the similarity is 1/3 or 0.33.

This metric has some obvious shortcomings. For example, a target document that happens to contain every term of the source, along with lots of other extraneous terms, will still get a 100% match. We'll explore a better metric in a future assignment that uses the counter value.

To do: Complete this function:

float compareDocuments(LinkedList* targetList, LinkedList* sourceList) { }

The function compares the two linked lists as described above and returns a float that is their similarity. Once again I recommend you test this function from main before moving on to the last part.
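If it helps to see the metric on its own, here is a small self-contained sketch. It uses a simplified node struct instead of the assignment's LinkedList and Term classes, so the names are only placeholders; the output reproduces the doc1/doc2 example above:

    #include <iostream>
    #include <string>

    // Simplified stand-in for a linked list node; the real code uses the
    // LinkedList and Term classes from the zip file.
    struct TermNode {
        std::string text;
        TermNode* next;
    };

    // Returns true if word appears anywhere in the target list.
    bool containsTerm(TermNode* target, const std::string& word) {
        for (TermNode* cur = target; cur != nullptr; cur = cur->next) {
            if (cur->text == word) return true;
        }
        return false;
    }

    // Fraction of source terms that also appear in the target list.
    float compareDocuments(TermNode* targetList, TermNode* sourceList) {
        int total = 0;
        int matches = 0;
        for (TermNode* cur = sourceList; cur != nullptr; cur = cur->next) {
            ++total;
            if (containsTerm(targetList, cur->text)) ++matches;
        }
        if (total == 0) return 0.0f;   // guard against an empty document
        return static_cast<float>(matches) / total;
    }

    int main() {
        // doc1: never -> forget -> hot -> krunk
        TermNode krunk{"krunk", nullptr}, hot1{"hot", &krunk},
                 forget{"forget", &hot1}, never{"never", &forget};
        // doc2: hello -> hot -> reply
        TermNode reply{"reply", nullptr}, hot2{"hot", &reply}, hello{"hello", &hot2};

        std::cout << compareDocuments(&hello, &never) << "\n";  // doc1 as source: 0.25
        std::cout << compareDocuments(&never, &hello) << "\n";  // doc2 as source: 0.333...
        return 0;
    }

Note the nested loops: for every source term the entire target list is scanned, which is worth keeping in mind when you get to the runtime discussion in Part E.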
Part D (5 pts): Implementing Nearest Neighbor

We can finally implement nearest neighbor! Although this is a machine learning algorithm, there is not too much learning that happens; it is really more like memorization. The way nearest neighbor works is: given an unseen document, find the closest document out of the training data. We can use our similarity function to compute the degree of closeness. If the closest document from the training data is spam, then classify the unseen document as spam. Otherwise, classify the unseen document as non-spam. We can implement this using the function from part C (a rough sketch of the idea appears at the end of this writeup).

To do: Complete this function:

float closestMatch(LinkedList* targetDocuments[], int numDocs, LinkedList* sourceDocument) { }

targetDocuments is an array of either training spam or training non-spam documents. numDocs is the size of the array. sourceDocument is a linked list representing the unseen document that we want to compare to every document in the targetDocuments array. Complete the function to loop through the targetDocuments array, find the greatest similarity to sourceDocument, and return that value.

If this and the other functions work correctly, the code in main should now work! It runs through the test examples for both spam and non-spam and calculates the accuracy. If my code is correct then you should get:

· spam-test: 129/130 correct, or 99% (which is surprisingly good, so if you get a different result you are confident in, let me know in case I made an error)
· nonspam-test: 94/130 correct, or 72%

Part E (5 pts): Runtime

Comment on the runtime of the algorithm, both training and testing, including a Big-O analysis.
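For reference, here is one possible shape for the Part D search described above. It assumes the LinkedList class from the zip file and the compareDocuments function from Part C are available (declarations repeated here only for context); the loop shown is a sketch of one approach, not the official solution:

    class LinkedList;   // from the problem 3 linked list code in the zip
    float compareDocuments(LinkedList* targetList, LinkedList* sourceList);   // Part C

    // Return the greatest similarity between sourceDocument and any document
    // in the targetDocuments array.
    float closestMatch(LinkedList* targetDocuments[], int numDocs, LinkedList* sourceDocument) {
        float best = 0.0f;
        for (int i = 0; i < numDocs; ++i) {
            float similarity = compareDocuments(targetDocuments[i], sourceDocument);
            if (similarity > best) {
                best = similarity;   // remember the closest match seen so far
            }
        }
        return best;
    }

One natural way to make the final decision is to compare the two results: if closestMatch against the spam training array returns a larger value than closestMatch against the non-spam training array, label the unseen email spam, otherwise label it non-spam. Also notice that classifying a single test email this way touches every training document, and each document comparison walks both linked lists; that observation is a good starting point for your Part E runtime comments.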