CS0012 Introduction to Computing for the Humanities
Project 2 Lab 3

Lab 3

Part 1: Counting unique words per week

Looking over the lists of words you generated at the end of Lab 2, you should see that each list contains many duplicates. To perform some meaningful analysis of the New York Times data, we’ll need to keep a count of how many times each unique word is used per week.

To keep these counts, you should modify process_file so that, instead of returning a list of lists, it returns a list of dictionaries. Each nested dictionary should have each unique word for that week as a key, and the corresponding value should be the number of times that word appears in that week.

Assuming your main function is unchanged from Lab 2, here is the (partial) output from a run of the desired program when processing nytimes_news_articles_SMALL.txt:

    Enter a filename: nytimes_news_articles_SMALL.txt
    [{'behsud': 4, 'afghanistan': 6, 'the': 2136, 'first': 61, 'time': 42,
    'noor': 6, 'ul-haq': 6, 'died': 13, 'his': 203, 'afghan': 12, 'army': 9,
    'outpost': 1, 'was': 293, 'completely': 4, 'cut': 9, 'off': 18, 'by': 189,
    'taliban': 14, 'on': 300, 'a': 975, 'bleak': 1, 'southern': 9,
    'battleground': 1, 'hundreds': 10, 'of': 908, 'insurgent': 3,
    'fighters': 2, 'swept': 3, 'in': 857, 'and':
    ... SKIPPING 1300 LINES OF OUTPUT ...
    'pledges': 1, 'brawl': 1, '1995': 1, 'installed': 1, 'camera': 1,
    'ordered': 1, 'complained': 1, 'untold': 1, 'peek': 1, 'verdes’s': 1,
    'splendor': 1, 'warshaw': 1, 'edits': 1, 'encyclopedia': 1,
    'undercover': 1, 'guardian': 1, 'harassing': 1, 'jeff': 1, 'kepley': 2,
    'swell': 1, 'majestic': 1, 'watched': 1, 'porch': 1, 'snacks': 1,
    'equipment': 1, 'stashed': 1, 'hawk': 1, 'magazine': 1}, {}, {}, {}, {},
    {}, {}, {}, {}, {}, {}]

Part 2: Using the counts

Now that you have a count of how frequently each word is used each week, you should look through this list of dictionaries to determine and print out the 5 most commonly used words for each week.
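Before moving on, the counting step from Part 1 can be sketched as follows. This is only an illustration under assumptions, not the expected solution: count_words is a hypothetical helper name that does not appear in the lab, and it assumes the words for one week are already cleaned and stored in a list, as in Lab 2.

```python
def count_words(words):
    """Return a dictionary mapping each unique word to how often it appears.

    `words` is assumed to be a list of already-cleaned words for one week.
    """
    counts = {}
    for word in words:
        # get() returns 0 the first time a word is seen, so no key check is needed
        counts[word] = counts.get(word, 0) + 1
    return counts


# Tiny made-up example; real input would come from the New York Times file
week = ["the", "army", "the", "outpost", "the", "army"]
print(count_words(week))
# prints {'the': 3, 'army': 2, 'outpost': 1}
```

Inside process_file, something like this could be applied once per week so that the function returns a list of dictionaries rather than a list of lists.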
Write a function named top5 that:

• Takes one argument named tokens, which should be a dictionary containing all of the unique words for a week as keys, with the frequency of use of each word as the associated value (i.e., one of the nested dictionaries from the result of process_file).
• Returns a dictionary containing 5 key/value pairs. Each of the 5 most frequently used words in that week should appear as a key in the returned dictionary, with the associated value being the number of times that word is used in that week.

Once you have top5 working, you should modify your main function so that it:

1. Calls get_filename and assigns the result to the variable name fname.
2. Calls process_file with fname and assigns the result to the variable name weekly_tokens.
3. For each week number (0-10, inclusive), calls top5 with the nested dictionary from weekly_tokens for the current week number, and then prints out the result.

Here is the (partial) output from a run of the desired program when processing nytimes_news_articles_FULL.txt:

    Enter a filename: nytimes_news_articles_FULL.txt
    Top five words for Week0:
    the: 21734
    a: 9276
    to: 8972
    of: 8773
    and: 8258
    Top five words for Week1:
    the: 42232
    a: 19222
    to: 18207
    of: 18182
    and: 17002
    ... SKIPPING 8 WEEKS OF OUTPUT ...
    Top five words for Week10:
    the: 42216
    a: 18665
    to: 17949
    of: 17768
    and: 16928

NOTE: Slight differences in the counts output by your program could be due to the use of slightly different cleaning (from Part 2 of Lab 2).

As you can see, our initial analysis is... relatively boring. But this is to be expected: “the”, “a”, “to”, “of”, and “and” are some of the most commonly used words in the English language. In order to pick out words that are actually relevant per week, we can use TF-IDF. You won’t be expected to implement TF-IDF to complete Project 2.

Once you have completed all parts of the lab, be sure to show your work to the lab instructor.
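One possible shape for top5 is sketched below. It is a minimal sketch rather than the official solution, and it makes one assumption the lab does not state: when two words have the same count, the one seen first in the dictionary is kept first. The sample dictionary is made up for illustration.

```python
def top5(tokens):
    """Return a dictionary of the (up to) 5 most frequent words in `tokens`."""
    # Sort (word, count) pairs by count, largest first.
    # sorted() is stable, so tied words keep their original relative order.
    ranked = sorted(tokens.items(), key=lambda pair: pair[1], reverse=True)
    # Keep the first five pairs and turn them back into a dictionary
    return dict(ranked[:5])


# Made-up counts for one week
sample = {"army": 1, "the": 9, "to": 4, "a": 5, "of": 4, "and": 3}
for word, count in top5(sample).items():
    print(f"{word}: {count}")
```

If a week has fewer than five unique words (for example, an empty dictionary), this version simply returns however many entries exist.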
TF-IDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

Rubric

Your program will be evaluated according to the following rubric:

    get_filename works as specified                              10
    process_file works as specified                              25
    top5 works as specified                                      30
    Counts from top 5 per week are reasonably close              15
    All functions are defined in global scope                    10
    All other code appears inside of a function (aside from
    a call to main and any magic value definitions)              10
Apr 13, 2021