Document Classification, NLTK Chapter 6
For this problem, you will work extensively with text classification, as covered in chapter 6. You will build a Naive Bayes
classifier from scratch to classify whether an email is spam or not. You will get hands-on
experience building a machine learning classifier using NLTK. You can submit a Python
notebook file for this homework. The answers can be submitted separately in a document.
Don’t hesitate to Google, but don’t copy the code. If you do, I will give you a zero.
1. Read section 1 of NLTK chapter 6 and familiarize yourself with the document classification example for
the movie reviews dataset.
2. There are many spam datasets. Download this one, which is listed on the ACL Wiki:
http://www.aueb.gr/users/ion/data/lingspam_public.tar.gz
Untar the dataset. Google "untar" to find out how to deal with non-zip archives.
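One possible way to handle step 2 from inside a notebook is Python's standard `tarfile` module. The sketch below is only an illustration: the helper name `extract_archive` and the destination folder are made up for this example, and it assumes the archive has already been downloaded to disk.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the top-level names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    # Listing the destination is a quick way to see what was unpacked.
    return sorted(p.name for p in dest.iterdir())
```

A call such as `extract_archive("lingspam_public.tar.gz", "data")` would unpack the download into a hypothetical `data` folder. On the command line, `tar -xzf lingspam_public.tar.gz` does the same job.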
3. Carefully read the readme file.
1. How many folders are there in the archive?
2. What is the difference between the different folders?
4. We will work with Part 1 folder in the lemm_stop folder. Show the code snippet to get marks
for this question.
1. How many documents are marked as spam and how many as not spam? How did you arrive at these
numbers?
2. How many words are there in total across all the documents?
3. What are the 5 most frequent words in the spam documents?
4. What are the 5 most frequent words in the non-spam documents?
5. What is the maximum number of words in a document?
6. What is the minimum number of words in a document?
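The counts in step 4 can be sketched with `collections.Counter`. This is a rough outline under several assumptions that should be checked against the readme: each message is a separate `.txt` file, spam files are recognizable by a file-name prefix (`spmsg` is assumed here), tokens are separated by whitespace, and `latin-1` is used only because it never fails to decode. The function names and the label `"ham"` for non-spam are choices made for this example.

```python
from collections import Counter
from pathlib import Path


def load_documents(folder, spam_prefix="spmsg"):
    """Return a list of (tokens, label) pairs for every .txt file in folder."""
    documents = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: the label is encoded in the file name (verify in the readme).
        label = "spam" if path.name.startswith(spam_prefix) else "ham"
        # Assumption: whitespace tokenization is good enough for counting words.
        tokens = path.read_text(encoding="latin-1").split()
        documents.append((tokens, label))
    return documents


def summarize(documents):
    """Compute the counts asked for in step 4 (expects a non-empty list)."""
    label_counts = Counter(label for _, label in documents)
    word_counts = {"spam": Counter(), "ham": Counter()}
    for tokens, label in documents:
        word_counts[label].update(tokens)
    lengths = [len(tokens) for tokens, _ in documents]
    return {
        "label_counts": label_counts,
        "total_words": sum(lengths),
        "top_spam": word_counts["spam"].most_common(5),
        "top_ham": word_counts["ham"].most_common(5),
        "max_len": max(lengths),
        "min_len": min(lengths),
    }
```

With whitespace tokenization, punctuation and header words count as tokens, so the most frequent "words" may include them. Whether to filter those out is a decision to justify in the written answer.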
5. Create a feature extractor function similar to document_features in the NLTK example. Don’t
copy the code from the NLTK book. Use the feature extractor function to create a training dataset
from Part 1 of the data. Train a Naive Bayes classifier as shown in the book chapter.
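As an illustration of the general shape of step 5, the sketch below builds boolean word-presence features from a frequency-based vocabulary. It is one possible design, not the book's code: the function names, the `has(...)` feature naming, the vocabulary size of 2000, and the toy documents are all assumptions made for this example.

```python
from collections import Counter


def build_vocabulary(documents, size=2000):
    """Pick the `size` most frequent tokens across the training documents."""
    counts = Counter(token for tokens, _ in documents for token in tokens)
    return [word for word, _ in counts.most_common(size)]


def extract_features(tokens, vocabulary):
    """Map a token list to a dict of boolean presence features."""
    present = set(tokens)  # a set makes each membership check cheap
    return {f"has({word})": word in present for word in vocabulary}


def make_dataset(documents, vocabulary):
    """Turn (tokens, label) pairs into (feature_dict, label) pairs."""
    return [(extract_features(tokens, vocabulary), label)
            for tokens, label in documents]


# Toy documents standing in for the real Part 1 data.
toy_docs = [
    ("free money free offer".split(), "spam"),
    ("syntax workshop paper".split(), "ham"),
]
vocab = build_vocabulary(toy_docs, size=3)
train_set = make_dataset(toy_docs, vocab)
```

A list of `(feature_dict, label)` pairs like `train_set` is the input format that `nltk.NaiveBayesClassifier.train` accepts. The vocabulary should be built from the training documents only and then reused unchanged when extracting features from the test documents.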
6. For testing, we will use Part 10 in the lemm_stop folder. Follow similar steps as above to
create a test dataset. Apply the feature extractor function to extract features from the test
dataset. What is the classifier’s accuracy on the test dataset? Show the code. What happens if you test on
the training dataset? If you get an accuracy below 50%, there is a bug in your code.
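Accuracy in step 6 is the fraction of test documents that receive the correct label. The sketch below computes it from two parallel label lists and adds a majority-class baseline for comparison. With two classes, always predicting the more common label already scores at least 50%, which is why a lower figure suggests a bug. The function names are illustrative.

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of documents whose predicted label matches the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


def majority_baseline(gold_labels):
    """Accuracy obtained by always predicting the most common gold label."""
    most_common = max(set(gold_labels), key=gold_labels.count)
    return gold_labels.count(most_common) / len(gold_labels)
```

For a trained NLTK classifier, `nltk.classify.accuracy(classifier, test_set)` reports the same quantity directly from a list of `(feature_dict, label)` pairs.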
7. Evaluation: What are the precision, recall, and F-score of the classifier that you trained? Read
section 3 of the chapter to answer these questions.
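For step 7, the three metrics can be computed from counts of true positives, false positives, and false negatives. The sketch below treats "spam" as the positive class and reports the balanced F-score (F1). Both choices, the function name, and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall_f1(gold_labels, predicted_labels, positive="spam"):
    """Return (precision, recall, F1) for the given positive class."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_labels, predicted_labels):
        if pred == positive and gold == positive:
            tp += 1  # predicted spam, really spam
        elif pred == positive:
            fp += 1  # predicted spam, really not spam
        elif gold == positive:
            fn += 1  # missed a real spam message
    # Convention used here: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Precision answers "of the messages flagged as spam, how many really were spam?", while recall answers "of the real spam messages, how many were flagged?".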
8. Can you try another classifier, such as logistic regression? What do the evaluation metrics look
like? This is a good starting point for using scikit-learn.
1. You can use scikit-learn’s implementation:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
2. An example of working with text data is here:
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
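A minimal sketch of step 8, assuming scikit-learn is installed. The toy texts and labels are invented for illustration, and binary word-presence features are just one reasonable choice to mirror the boolean features used for Naive Bayes. The real training and test texts would come from Part 1 and Part 10.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data standing in for the real training documents.
train_texts = [
    "free money offer click now",
    "win free prize money today",
    "syntax workshop call for papers",
    "linguistics conference paper deadline",
]
train_labels = ["spam", "spam", "ham", "ham"]

# binary=True records word presence rather than word counts.
model = make_pipeline(
    CountVectorizer(binary=True),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Toy data standing in for the real test documents.
test_texts = ["free prize offer", "workshop paper deadline"]
predictions = model.predict(test_texts)
```

Putting the vectorizer and the classifier in one pipeline means the vocabulary is learned from the training texts only and reused for the test texts. `sklearn.metrics.classification_report` can then print precision, recall, and F-score for each class from the gold labels and `predictions`.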
Mar 31, 2021