
Assignment has been attached.


In this project, you will create a program that extracts unique words from text files, creates databases to store their frequency counts, and queries them using a GUI.

Important:
· Read all the instructions carefully.
· Go over the material assigned in the course: all the answers are there!
· If you get stuck on a certain part and can’t get it to work, comment it out and move on. It’s better to submit a program that executes some of the requirements than a program that does not run at all (you can use the submission text to explain what your issue was).

Part 1 - creating databases (10 points)

1. Display a message stating the program's goal.
2. Create a database file with SQLite, using your name as the file name (check first that it does not already exist).
3. Create a table named “datasources” with the following information:

   ID | datasource | description | sourceurl
   1 | StopWords | A list of words with little meaning / value | https://raw.githubusercontent.com/nimdvir/teaching/master/stopwords.txt
   2 | IMDB | A list of titles, descriptions and keywords of movies from the website IMDB.com | https://raw.githubusercontent.com/nimdvir/teaching/master/imdb.txt
   3 | NYT | A list of titles, descriptions and keywords of articles from the New York Times | https://raw.githubusercontent.com/nimdvir/teaching/master/nyt.txt

4. (Check that they do not already exist, and then) create 3 tables, one for each data source (StopWords, IMDB, NYT).
5. Each table should include the following columns:
   a. ID - a unique id number for each value
   b. Word - the list of unique words in the data source file
   c. Freq - the frequency count of each unique word (how many times it appears in the text file)
6. For each data source, open and read the text file using its URL from the “datasources” table - use urllib!
7. For each file, extract all the unique words and count them.
8. Insert the values (word and freq) into the corresponding database.

Output:
1. Output the number of unique words and the total number of words in each database.

Tips:
· Use regular expressions for word / string extractions (look at WA #4 for reminders).
· Make sure to strip symbols such as ‘|- etc. and to not count lowercase and uppercase twice. For example, the words “Trump”, “Trump’s”, “trump!” should all be counted together (as “trump”).
· Words that are separated by a sign (such as -|,: etc.) should be counted separately. For example, “super-star” should be split into two unique words (hint: notice the keywords… make sure they are all counted!).
· You can use dictionaries to count the frequencies or use the databases directly (I personally prefer dictionaries).
· While you should check for errors, the URLs should work. If you do get an error, make sure that the URL was typed correctly and that your code is OK. If the error continues and you can’t resolve it, email me.

Part 2 - calculating significance (5 points)

1. In the databases “IMDB” and “NYT”, add a column called “significance”.
2. For each word in the database, calculate and insert its significance score: its frequency count / the total number of words in the database.

Output:
1. For each database, output the 20 words with the highest significance score that are NOT in the StopWords.
2. Include in your output the word, its frequency count and its significance score.

Part 3 - creating a GUI (5 points)

1. Create a graphical user interface (GUI) that accepts a word as input and displays its frequency and significance score in the IMDB and NYT databases.
2. The GUI should include a message stating its goal, a field for the user to enter a word to check, and a button to execute.
3. Once the user clicks the button, the GUI should display the word frequency count and significance score for both the IMDB and NYT databases (if the word is not in a database, the displayed values should be 0).
4. The calculation can be in the same GUI or in a new window.
5. However, make sure that your GUI includes a “try again” button, to allow the user to search for other words without running the program again.

Bonus (up to 5 points)

1. In your GUI, display for the searched word its “total significance score”: the sum of the word frequency count in both databases / the sum of the total number of words in both databases (2 points).
2. In your GUI, display for the searched word its “distinctiveness score” for each database: the word frequency count in the relevant database / the sum of the word frequency count in both databases. For example, if a word appears 2 times in the IMDB database and 8 times in the NYT database, its IMDB distinctiveness score will be 2/(2+8) = 0.2, and its NYT distinctiveness score will be 8/(2+8) = 0.8 (3 points).

Remember:
· Only use the material covered in this module -- do not use more advanced functions not covered yet in the course.
· Make sure to include comments that explain all your steps (starts with #). Also use a comment to sign your name at the beginning of the program!
· Work individually and only submit original work.
· Run the program a few times to make sure it executes and meets all the requirements.
· Submit a .py file!


Write a program that displays a graphical user interface (GUI) for entering and comparing values for an experiment. In the experiment, participants are divided into control and treatment groups, and the success rates in each group are compared.

The program should:

1. Create and display a GUI with 4 input fields (where the users can enter data):
   a. Control group - total participants number
   b. Control group - total success cases number
   c. Treatment group - total participants number
   d. Treatment group - total success cases number
2. Make sure the fields are labeled in a clear way!
3. The GUI should also include a button to execute.
4. Once the user clicks the button, the program should calculate the success rate of each group as a percentage (total success / total participants).
5. The program should then open a new GUI that clearly displays the results and states which group’s rate is higher.

Bonus (up to 5 points)

1. Based on the results, in the second GUI also draw and present a column chart in which each percentage is represented by a different column in a different color.
2. Make sure to label the chart: include some kind of legend and clear labeling for the column values.

Remember:
· Only use the material covered in this module -- do not use more advanced functions not covered yet in the course.
· Make sure to include comments that explain all your steps (starts with #). Also use a comment to sign your name at the beginning of the program!
· Work individually and only submit original work.
· Run the program a few times to make sure it executes and meets all the requirements.
· Submit a .py file!
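
For Part 1 of the word-frequency project, the extraction tips (lowercase everything, strip symbols, split on separators, count with a dictionary) could be sketched roughly as follows. The function name `count_words` and the sample sentence are made up for illustration. The rules for possessives and for ignoring digits are assumptions about what the assignment intends, not a confirmed specification.

```python
import re

def count_words(text):
    """Return a dict mapping each lowercased word to its frequency."""
    text = text.lower()
    # Assumption: drop possessive endings so "trump's" counts as "trump".
    text = re.sub(r"['’]s\b", "", text)
    counts = {}
    # Any run of letters is treated as a word, so "super-star" yields
    # "super" and "star". Assumption: digits are not part of words.
    for word in re.findall(r"[a-z]+", text):
        counts[word] = counts.get(word, 0) + 1
    return counts

# Made-up sample text that exercises the rules from the tips.
sample = "Trump, Trump’s and trump! A super-star|keyword"
counts = count_words(sample)
print(counts)
print("unique words:", len(counts))
print("total words:", sum(counts.values()))
```

The two printed totals correspond to the "number of unique words" and "total number of words" that the Part 1 output asks for.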
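
The download and storage steps of Part 1 might look something like the sketch below, which uses `urllib.request` and `sqlite3` from the standard library. The helper names `fetch_text`, `store_counts` and `summarize` are hypothetical, the column types are one plausible reading of the ID / Word / Freq requirement, and the demo uses an in-memory database with made-up counts so that it runs without a network connection or a file on disk.

```python
import sqlite3
import urllib.request

def fetch_text(url):
    """Download a text file and return it as a string (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def store_counts(conn, table, counts):
    """Create the table if it does not exist and insert one row per word."""
    # Table names cannot be bound as SQL parameters, so pass trusted names only.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(ID INTEGER PRIMARY KEY, Word TEXT UNIQUE, Freq INTEGER)"
    )
    conn.executemany(
        f"INSERT INTO {table} (Word, Freq) VALUES (?, ?)", counts.items()
    )
    conn.commit()

def summarize(conn, table):
    """Return (number of unique words, total number of words) for a table."""
    return conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(Freq), 0) FROM {table}"
    ).fetchone()

# Demo with an in-memory database; a real run would pass a file name instead
# and would build the counts from fetch_text(...) on each source URL.
conn = sqlite3.connect(":memory:")
store_counts(conn, "IMDB", {"movie": 3, "star": 1})
print(summarize(conn, "IMDB"))
```

Because `Word` is declared `UNIQUE` here, calling `store_counts` twice on the same table raises an error; a fuller program would check whether the data is already loaded before inserting.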
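
For Part 2, one possible way to add the significance column and list the top non-stop words is sketched here. The table layout follows the ID / Word / Freq columns from Part 1, the helper names `add_significance` and `top_words` are hypothetical, and the rows are toy values invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy data standing in for the real tables (values are made up for the demo).
for table in ("StopWords", "IMDB"):
    conn.execute(
        f"CREATE TABLE {table} (ID INTEGER PRIMARY KEY, Word TEXT, Freq INTEGER)"
    )
conn.executemany("INSERT INTO StopWords (Word, Freq) VALUES (?, ?)", [("the", 1)])
conn.executemany(
    "INSERT INTO IMDB (Word, Freq) VALUES (?, ?)",
    [("the", 6), ("movie", 3), ("star", 1)],
)

def add_significance(conn, table):
    """Add a significance column holding Freq / total words in the table."""
    conn.execute(f"ALTER TABLE {table} ADD COLUMN significance REAL")
    total = conn.execute(f"SELECT SUM(Freq) FROM {table}").fetchone()[0]
    # Multiplying by 1.0 forces floating-point rather than integer division.
    conn.execute(f"UPDATE {table} SET significance = Freq * 1.0 / ?", (total,))
    conn.commit()

def top_words(conn, table, limit=20):
    """Return (Word, Freq, significance) rows for the top non-stop words."""
    return conn.execute(
        f"SELECT Word, Freq, significance FROM {table} "
        "WHERE Word NOT IN (SELECT Word FROM StopWords) "
        "ORDER BY significance DESC LIMIT ?",
        (limit,),
    ).fetchall()

add_significance(conn, "IMDB")
for word, freq, score in top_words(conn, "IMDB"):
    print(word, freq, round(score, 3))
```

Note that `ALTER TABLE ... ADD COLUMN` fails if the column already exists, so a program that may be run more than once would need to check for the column first.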
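
The lookup behind the Part 3 button and the two bonus scores can be kept separate from the GUI code, which makes them easier to check. The sketch below shows only that logic, not the GUI itself. The function names, the in-memory tables and the word totals (40 and 100) are all assumptions made for the demo; the 2 and 8 frequencies mirror the worked example in the bonus description.

```python
import sqlite3

def lookup(conn, table, word):
    """Return (Freq, significance) for a word, or (0, 0) if it is absent."""
    row = conn.execute(
        f"SELECT Freq, significance FROM {table} WHERE Word = ?",
        (word.lower(),),
    ).fetchone()
    return row if row else (0, 0)

def total_significance(freq_imdb, freq_nyt, total_imdb, total_nyt):
    """Combined frequency divided by the combined number of words."""
    combined_total = total_imdb + total_nyt
    if combined_total == 0:
        return 0
    return (freq_imdb + freq_nyt) / combined_total

def distinctiveness(freq_imdb, freq_nyt):
    """Return (IMDB score, NYT score); both are 0 if the word never appears."""
    combined = freq_imdb + freq_nyt
    if combined == 0:
        return (0, 0)
    return (freq_imdb / combined, freq_nyt / combined)

# Made-up tables: "star" appears 2 times in IMDB and 8 times in NYT.
conn = sqlite3.connect(":memory:")
demo_rows = (("IMDB", [("star", 2, 0.05)]), ("NYT", [("star", 8, 0.08)]))
for table, rows in demo_rows:
    conn.execute(
        f"CREATE TABLE {table} "
        "(ID INTEGER PRIMARY KEY, Word TEXT, Freq INTEGER, significance REAL)"
    )
    conn.executemany(
        f"INSERT INTO {table} (Word, Freq, significance) VALUES (?, ?, ?)", rows
    )

freq_imdb, sig_imdb = lookup(conn, "IMDB", "Star")
freq_nyt, sig_nyt = lookup(conn, "NYT", "Star")
print(freq_imdb, sig_imdb, freq_nyt, sig_nyt)
print(distinctiveness(freq_imdb, freq_nyt))
print(total_significance(freq_imdb, freq_nyt, 40, 100))
print(lookup(conn, "NYT", "absent"))
```

A button callback in the GUI could call `lookup` for each table and pass the results to the two score functions before updating the display.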
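
For the experiment program, the calculation that runs when the button is clicked could be written as a pair of small functions like the ones below. This covers only the arithmetic and the comparison, not the GUI or the chart. The function names and the sample numbers are illustrative, and returning 0 for an empty group is an assumption about how to handle that case.

```python
def success_rate(successes, participants):
    """Return the success rate as a percentage (0.0 for an empty group)."""
    if participants <= 0:
        return 0.0
    return successes * 100 / participants

def compare_groups(control_total, control_success,
                   treatment_total, treatment_success):
    """Return both rates and a short message naming the higher one."""
    control_rate = success_rate(control_success, control_total)
    treatment_rate = success_rate(treatment_success, treatment_total)
    if control_rate > treatment_rate:
        verdict = "The control group's rate is higher."
    elif treatment_rate > control_rate:
        verdict = "The treatment group's rate is higher."
    else:
        verdict = "Both groups have the same rate."
    return control_rate, treatment_rate, verdict

# Made-up inputs: 10 of 50 succeed in control, 12 of 40 in treatment.
control_rate, treatment_rate, verdict = compare_groups(50, 10, 40, 12)
print(f"Control: {control_rate:.1f}%  Treatment: {treatment_rate:.1f}%")
print(verdict)
```

In the full program the four arguments would come from the GUI's input fields, converted from text to numbers with error checking.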
Apr 26, 2021