For this assignment, you are asked to write a Python program that indexes and searches keywords across multiple websites.
Your program should use all the material covered in class (databases, network programming, graphics, etc.).
Specifically, make sure your program meets the following requirements:
Part 1 - Creating / opening a database:
- Ask the users to enter a name for the database
- Check if the database exists, if it can be opened and if it can be read
- If the database does not exist, then:
Create a new database - use input for the file name (the name entered by the users)
Create the following tables and fields:
- Table “webpages” with the fields “Id” (a unique id number for each webpage stored), “url” (the url address of the page), “title” (the page’s title)
- Table “keywords” with the fields “page_id” (the id number from the previous table), “word” (all the unique words that appear in the page), “count” (the frequency of each word in the page), “significance” (the significance score for each word, calculated as the word frequency count divided by the total number of words on the page)
Output the total number of values in each table (if it’s a new database, the values will be “0”).
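The database steps above could be sketched as follows. This is only a minimal illustration, assuming the course uses Python's built-in sqlite3 module (the assignment does not name a database engine). The function names open_database and table_counts are hypothetical, and a full solution would also catch sqlite3.Error to report a database that cannot be opened or read.

```python
import os
import sqlite3


def open_database(db_name):
    """Open db_name and create the two tables if they are missing.

    Returns the connection and a flag saying whether the file is new.
    """
    is_new = not os.path.exists(db_name)
    conn = sqlite3.connect(db_name)
    # Assumption: column types are chosen for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS webpages ("
        "Id INTEGER PRIMARY KEY, url TEXT UNIQUE, title TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS keywords ("
        "page_id INTEGER, word TEXT, count INTEGER, significance REAL)"
    )
    conn.commit()
    return conn, is_new


def table_counts(conn):
    """Return the number of rows stored in each table."""
    counts = {}
    # Table names come from this fixed tuple, never from user input.
    for table in ("webpages", "keywords"):
        query = f"SELECT COUNT(*) FROM {table}"
        counts[table] = conn.execute(query).fetchone()[0]
    return counts
```

For example, `conn, is_new = open_database("pages.db")` followed by `print(table_counts(conn))` reports a count of 0 for both tables when the database has just been created.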
Part 2 - Scraping websites
Ask the user to enter a url address, or “0” to stop
Check if the url address can be opened. If not, display an error message and ask the users to enter a new url. Otherwise, use BeautifulSoup to grab (a) the webpage title and (b) all the visible text on the webpage.
Output the following:
The page title
The total number of words on the page
The total number of keywords / unique words on the page (not case sensitive)
The five most significant keywords on the page (significance = word count on page / total number of words on page)
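The word statistics listed above could be computed along these lines. This is a rough sketch that starts from the visible text already extracted from the page (for instance with BeautifulSoup's get_text()). The function name word_statistics and the tokenization rule (a word is any run of letters or digits) are assumptions, not part of the assignment.

```python
import re
from collections import Counter


def word_statistics(text, top_n=5):
    """Return (total words, unique words, top_n most significant words)."""
    # Lowercase first so that counting is not case sensitive.
    # Assumption: a word is a run of ASCII letters or digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    total = len(words)
    counts = Counter(words)
    # significance = word count on page / total number of words on page
    significance = {word: n / total for word, n in counts.items()}
    # Sort by descending significance, breaking ties alphabetically.
    ranked = sorted(significance.items(), key=lambda item: (-item[1], item[0]))
    return total, len(counts), ranked[:top_n]


total, unique, top = word_statistics("The cat and the dog and THE bird")
print(total, unique, top[0])  # 8 5 ('the', 0.375)
```

Sorting ties alphabetically is a design choice made here so that the output is the same on every run.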
Part 3 - Indexing websites
Check if the url is stored in the database (use the same database from Part 1).
If the url is not stored, then add the following values:
Table 1 - unique ID, page url and page title
Table 2 - relevant page ID, all the unique words (keywords), their word count (frequency in text), their page significance (word count / total number of words)
If the url is stored, then:
Delete all the values associated with the url (title, keywords, etc.)
Output how many values were deleted
Add new values to the database (previous step)
Repeat parts 2 and 3 (steps 5 - 10) until the users enter “0”. Note: run it at least 5 times with different url addresses to make sure it works correctly.
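One possible shape for the indexing step is sketched below, assuming sqlite3 and the table layout from Part 1. The function name index_page and its word_counts argument (a dictionary mapping each unique word to its count) are hypothetical, and "values deleted" is interpreted here as the number of deleted rows.

```python
import sqlite3


def index_page(conn, url, title, word_counts):
    """Store a page and its keywords, replacing any earlier copy of the url.

    Returns the number of rows deleted (0 if the url was not stored).
    """
    total_words = sum(word_counts.values())
    deleted = 0
    row = conn.execute("SELECT Id FROM webpages WHERE url = ?", (url,)).fetchone()
    if row is not None:
        # The url is already stored: remove its keywords, then the page row.
        deleted += conn.execute(
            "DELETE FROM keywords WHERE page_id = ?", (row[0],)
        ).rowcount
        deleted += conn.execute(
            "DELETE FROM webpages WHERE Id = ?", (row[0],)
        ).rowcount
    cursor = conn.execute(
        "INSERT INTO webpages (url, title) VALUES (?, ?)", (url, title)
    )
    page_id = cursor.lastrowid
    conn.executemany(
        "INSERT INTO keywords (page_id, word, count, significance) "
        "VALUES (?, ?, ?, ?)",
        [(page_id, w, n, n / total_words) for w, n in word_counts.items()],
    )
    conn.commit()
    return deleted


# Small demonstration with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE webpages (Id INTEGER PRIMARY KEY, url TEXT, title TEXT)")
conn.execute(
    "CREATE TABLE keywords (page_id INTEGER, word TEXT, count INTEGER, significance REAL)"
)
first = index_page(conn, "https://example.com", "Example", {"example": 2, "domain": 1})
second = index_page(conn, "https://example.com", "Example", {"example": 3})
print(first, second)  # 0 3 (two keyword rows plus one page row were deleted)
```

Using "?" placeholders instead of building the SQL string by hand keeps urls and words with quotes in them from breaking the query.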
Part 4 - Searching keywords
Once the users enter “0” to stop indexing url addresses, display a graphical user interface (GUI) with (a) your name, (b) the objective of the program, (c) an input field to enter a keyword (can be numbers, letters, or both; NOT case sensitive), and (d) a button to execute the search.
Once the users enter a keyword and click the button, advance to the next step (Note: do not advance if the users didn’t enter any input!)
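The search window and the empty-input check could look roughly like this. It is only a sketch and assumes the GUI toolkit used in the course is tkinter, which the assignment does not state. The names clean_keyword and build_search_window, the window title, and the label texts are all placeholders.

```python
def clean_keyword(raw):
    """Return the keyword stripped and lowercased, or None if it is empty."""
    keyword = raw.strip().lower()
    return keyword or None


def build_search_window(on_search):
    """Build the search window; on_search(keyword) runs on valid input."""
    # Imported here so clean_keyword() can be used without a display.
    import tkinter as tk
    from tkinter import messagebox

    root = tk.Tk()
    root.title("Keyword Search")
    tk.Label(root, text="Your Name").pack()  # (a) placeholder for your name
    tk.Label(root, text="Search indexed web pages for a keyword").pack()  # (b)
    entry = tk.Entry(root)  # (c) input field for the keyword
    entry.pack()

    def handle_click():
        keyword = clean_keyword(entry.get())
        if keyword is None:
            # Do not advance when nothing was entered.
            messagebox.showwarning("Missing input", "Please enter a keyword.")
            return
        on_search(keyword)

    tk.Button(root, text="Search", command=handle_click).pack()  # (d)
    return root
```

A call such as `build_search_window(print).mainloop()` opens the window and prints each valid keyword; in the real program, on_search would show the results instead.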
In a new graphic window or in the same GUI, display:
How many times the word appears in total (sum of all counts for all pages in the database)
In how many indexed web pages the keyword appears
All the indexed pages that contain the keyword - page url, page title, and page significance score for the keyword
Allow the users to return to the previous GUI / search more keywords (as many times as they want)
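The three results above can all come from a single query. The sketch below assumes sqlite3, the table layout from Part 1, and that words were stored in lowercase when indexing. The function name search_keyword is hypothetical, and ordering the pages by significance is a choice made here, not a stated requirement.

```python
import sqlite3


def search_keyword(conn, keyword):
    """Return (total count, number of pages, [(url, title, significance)])."""
    rows = conn.execute(
        "SELECT w.url, w.title, k.count, k.significance "
        "FROM keywords AS k JOIN webpages AS w ON w.Id = k.page_id "
        "WHERE k.word = ? ORDER BY k.significance DESC",
        (keyword.strip().lower(),),  # the search is not case sensitive
    ).fetchall()
    total = sum(row[2] for row in rows)
    pages = [(url, title, sig) for url, title, _count, sig in rows]
    return total, len(rows), pages


# Small demonstration with an in-memory database and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE webpages (Id INTEGER PRIMARY KEY, url TEXT, title TEXT)")
conn.execute(
    "CREATE TABLE keywords (page_id INTEGER, word TEXT, count INTEGER, significance REAL)"
)
conn.executemany(
    "INSERT INTO webpages (Id, url, title) VALUES (?, ?, ?)",
    [(1, "https://example.com/a", "Page A"), (2, "https://example.com/b", "Page B")],
)
conn.executemany(
    "INSERT INTO keywords VALUES (?, ?, ?, ?)",
    [(1, "python", 3, 0.3), (2, "python", 2, 0.5), (2, "code", 2, 0.5)],
)
total, page_count, pages = search_keyword(conn, "Python")
print(total, page_count)  # 5 2
```

The returned list can be shown in the GUI one page per line, and an empty list means the keyword does not appear in any indexed page.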
Remember:
Only use the material covered in the course!
Make sure to include comments that explain all your steps (comments start with #). Also use a comment to sign your name at the beginning of the program!
Work individually and only submit original work
Run the program a few times to make sure it executes and meets all the requirements
Submit a .py file!