In this project, you will create a Python program that counts word frequencies in a list of New York Times (NYT) articles, both per category and overall.
First, make sure to download the following files to your working folder (where you save your program):
Stopwords -
https://drive.google.com/file/d/1V9rAioz980HuIigNV5tZlOmAC9qeB1BK/view?usp=sharing
NYT articles (csv) -
https://drive.google.com/file/d/1s-c75Uzzme8irdYuZW9z9kH-j9gp_m0X/view?usp=sharing
NYT articles (text file, UTF-8) -
https://drive.google.com/file/d/1rwFzwcSP3L2B8VSTkOEXFUjuQU-aR7Vq/view?usp=sharing
NYT articles (text file, ANSI) -
https://drive.google.com/file/d/1ry598L-YdtXV8DgntLd8DE8UVJ7lvLlP/view?usp=sharing
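The two text versions differ only in how their characters are stored, so the encoding you pass to open() has to match the file you picked. Below is a minimal sketch of reading a file with an explicit encoding. The helper name read_lines is hypothetical, and treating "ANSI" as Windows-1252 ("cp1252") is an assumption; check which encoding your copy actually uses.

```python
def read_lines(path, encoding):
    """Return the lines of a text file without their trailing newlines."""
    lines = []
    # The encoding argument must match how the file was saved.
    with open(path, "r", encoding=encoding) as in_file:
        for line in in_file:
            lines.append(line.rstrip("\n"))
    return lines

# Assumed usage (file names are placeholders, not the real download names):
# utf8_lines = read_lines("articles_utf8.txt", "utf-8")
# ansi_lines = read_lines("articles_ansi.txt", "cp1252")
```

If the encoding does not match, Python may raise a UnicodeDecodeError or silently show the wrong characters, so it is worth printing the first few lines to check.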
Part I - word count
In the first part you will read the external files and count the word frequencies.
The program should:
Display a message stating its goal
Read the StopWords.txt file (make sure to open it with the correct encoding)
Output how many stop words are in the file
Read ONE of the NYT article files. You can use EITHER one of the text files or the csv file, whichever is more convenient; they all contain the same data. Each file includes the field names in the first line and the articles in the following lines. All values are separated by “ | ”.
For each article, read through ArticleTitle, ArticleSubtitle and ArticleKeywords and extract all the unique words and their frequencies (ignore the difference between lower and upper case).
For each unique word, count its total frequency in each ArticleCategory, as well as overall total frequency (a sum of all the category counts).
Hint: Use dictionaries + nested dictionaries!
Output the following:
For the whole list:
How many articles are in the file?
How many different categories are in the file?
How many unique words are in the file? (Remember to only count the words from the ArticleTitle, ArticleSubtitle and ArticleKeywords fields.)
What's the total number of words (the sum of the unique word frequencies)?
The top ten most frequent words that ARE NOT stop words, with their frequencies
For each category:
Total number of unique words in the specific category
Total number of words in the category (sum of frequencies)
The top ten most frequent words in the category that ARE NOT stop words, with their frequencies
Again, don't forget: for each article only count the words from the ArticleTitle, ArticleSubtitle and ArticleKeywords fields
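To illustrate the nested-dictionary hint, here is a minimal sketch that counts words per category from a few made-up lines in the “ | ” format. The sample rows, the column order, the short stop word list, and the helper names count_words and top_words are all assumptions for illustration; the real files may order their fields differently, and this sketch splits on whitespace only, without stripping punctuation.

```python
# Made-up sample data; the real file may order its columns differently.
sample_lines = [
    "ArticleCategory | ArticleTitle | ArticleSubtitle | ArticleKeywords",
    "Sports | The Big Game | A win for the city | game city",
    "Arts | The City Opera | A big night | opera city",
]
stop_words = ["the", "a", "for"]  # stand-in for the contents of StopWords.txt

def count_words(lines):
    """Build {word: {category: count}} from the three text fields."""
    header = lines[0].strip().split(" | ")
    counts = {}
    for line in lines[1:]:
        values = line.strip().split(" | ")
        category = values[header.index("ArticleCategory")]
        for field in ["ArticleTitle", "ArticleSubtitle", "ArticleKeywords"]:
            text = values[header.index(field)]
            # Lowercase first so "City" and "city" count as the same word.
            for word in text.lower().split():
                if word not in counts:
                    counts[word] = {}
                if category not in counts[word]:
                    counts[word][category] = 0
                counts[word][category] += 1
    return counts

def top_words(counts, stop_words, n):
    """Return up to n (total, word) pairs, most frequent first, skipping stop words."""
    totals = []
    for word in counts:
        if word not in stop_words:
            totals.append((sum(counts[word].values()), word))
    totals.sort(reverse=True)
    return totals[:n]

word_counts = count_words(sample_lines)
print(len(word_counts), "unique words")
print(top_words(word_counts, stop_words, 3))
```

The outer dictionary is keyed by word and the inner one by category, so the overall total for a word is just the sum of its inner values. The per-category output can be produced the same way by looping over the categories and keeping only the words that have a count for that category.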
Part II - save list to file
In this part you will save the word frequency list to a new csv file.
The program should:
Display a message stating its goal
Create a new csv file named with your (the student's) name
In the file, create fields for the word list (where you'll store the unique words), each of the different categories (where you'll store the word count for that specific ArticleCategory), and the total count (the overall frequency sum for the unique word)
Based on the lists / dictionaries created in the previous part, fill your csv file with values: all the unique words + their frequency count in each category + their total frequency count (sum of frequencies in all the categories)
Save and close the file.
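One possible shape for this step is sketched below: one row per unique word, one column per category, and a final total column. It assumes the counts are stored as {word: {category: count}} and that Python's built-in csv module is allowed in this module of the course; the function name save_counts, the column titles "Word" and "Total", and the sample data are hypothetical.

```python
import csv

# Made-up counts in the form {word: {category: count}}.
sample_counts = {
    "city": {"Sports": 2, "Arts": 2},
    "game": {"Sports": 2},
}

def save_counts(counts, path):
    """Write one row per word: the word, a count per category, and the total."""
    # Collect every category name that appears anywhere in the counts.
    categories = []
    for word in counts:
        for category in counts[word]:
            if category not in categories:
                categories.append(category)
    categories.sort()

    # The with block closes the file automatically when it ends.
    with open(path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(["Word"] + categories + ["Total"])
        for word in sorted(counts):
            row = [word]
            total = 0
            for category in categories:
                # A word that never appears in a category gets a count of 0.
                count = counts[word].get(category, 0)
                row.append(count)
                total += count
            row.append(total)
            writer.writerow(row)
    return categories

# Assumed usage (replace the placeholder with your own name):
# save_counts(sample_counts, "student_name.csv")
```

Writing a 0 for missing categories keeps every row the same length, which makes the file easier to open in a spreadsheet.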
Part III - word search
In the last part, your program should read words entered by the user and output their respective frequency counts.
The program should:
Display a message stating its goal
Ask the user to input a word
Check if the word is in the database (disregard lower or upper case)
Notify the user if the word cannot be found
If the word is in the database, output its frequency in each category, as well as total frequency
Ask the user to input “1” to try another word or “0” to exit the program
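The search loop can be sketched as follows, assuming the words were stored in lower case in a {word: {category: count}} dictionary. The names lookup and run_search, the prompts, and the sample data are hypothetical; the ask parameter exists only so the loop can be tried without typing, and by default it is the built-in input().

```python
# Made-up counts in the form {word: {category: count}}, keys in lower case.
sample_counts = {
    "city": {"Sports": 2, "Arts": 2},
    "game": {"Sports": 2},
}

def lookup(counts, word):
    """Return (per-category counts, total) for a word, or None if it is missing."""
    key = word.strip().lower()  # ignore case and stray spaces
    if key not in counts:
        return None
    return counts[key], sum(counts[key].values())

def run_search(counts, ask=input):
    """Keep asking for words until the user enters anything other than 1."""
    again = "1"
    while again == "1":
        word = ask("Enter a word: ")
        result = lookup(counts, word)
        if result is None:
            print("'" + word + "' was not found.")
        else:
            per_category, total = result
            for category in per_category:
                print(category + ": " + str(per_category[category]))
            print("Total: " + str(total))
        again = ask("Enter 1 to try another word or 0 to exit: ")

# Assumed usage: run_search(sample_counts)
```

In this sketch any answer other than "1" ends the loop; a stricter version could keep asking until the user enters exactly "0" or "1".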
Remember:
Only use the material covered in this module -- do not use more advanced functions not covered yet in the course
Make sure to include comments that explain all your steps (a comment starts with #). Also use a comment to sign your name at the beginning of the program!
Work individually and only submit original work
Run the program a few times to make sure it executes and meets all the requirements
Submit one .py file!