Python coding project. Please see details in the pdf file. CITS1401 Computational Thinking with...

Question

Python coding project. Please see details in the pdf file.

CITS1401 Computational Thinking with Python
Project 2 Semester 2 2020

Project 2: How Good (Positive and Patriotic) is Australia?

Submission deadline: 5:00 pm, Friday 23rd October 2020
Value: 20% of CITS1401
To be completed individually.

You should construct a Python 3 program containing your solution to the following problem and submit your program electronically on Moodle. No other method of submission is allowed. Your program will be automatically tested on Moodle. Remember your first two checks against the tester on Moodle will not have any penalty. However, any further check will carry a 10% penalty per check.

You are expected to have read and understood the University's guidelines on academic conduct. In accordance with this policy, you may discuss with other students the general principles required to understand this project, but the work you submit must be the result of your own effort. Plagiarism detection, and other systems for detecting potential malpractice, will therefore be used. Besides, if what you submit is not your own work then you will have learnt little and will therefore, likely, fail the final exam.

You must submit your project before the submission deadline listed above. Following UWA policy, a late penalty of 5% will be deducted for each day (or part day), after the deadline, that the assignment is submitted. No submissions will be allowed after 7 days following the deadline except in approved special consideration cases.

Context:

For this project, imagine for a moment that you have successfully completed your UWA course and recently taken up a position for the Department of Prime Minister and Cabinet in Canberra with the Australian Federal Government.
At first you were quite reluctant to leave Perth to move ‘over east’ and, more generally, wondered what use a new graduate with a heavy focus on computing, programming and data could be to this department. Regardless, the opportunity to gain experience in the ‘real world’ was too good, and although it is not quite your own multi-million dollar technology start-up, there was no way you weren’t taking up the offer.

Your first few weeks of orientation were mostly a blur. However, one thing you noticed was that any time you mentioned your skills in programming, and with Python[1] in particular, to any senior bureaucrat, or even some of the savvier politicians, their eyes seemed to ‘light up’ and they suddenly became much more interested in whatever you were saying to them. After reflecting on these experiences, maybe there would be some even more interesting opportunities for you in the near future?

[1] Actually their eyes are more likely to light up if / when you mention your skills in data science and machine learning and big data, for all of which Python is basically the foundational tool.

However, for now you decide to put these aside, as it’s not like the work that you have been doing already has not been interesting, and this is what you need to focus on for today. At an early morning meeting with your immediate supervisor, you were told that the Government is very interested in reducing its spend on trying to understand what (and how) the Australian population currently thinks about it. Instead of spending millions of dollars calling randomised groups of Australian residents every quarter to ask about their opinions on various Government services, many senior bureaucrats have wondered for a while now whether there was any way to use the masses of freely available data on the internet to provide similar insights at a fraction of the cost.
It is within this context that your supervisor has asked you to develop a program, as a proof-of-concept, to demonstrate that it is possible to provide some of these insights at a much lower cost. At your meeting your supervisor noted that, for the proof-of-concept stage, the use of any ‘live’ internet data will not be possible without approval from the legal team (as well as possibly many others). This seemed like quite an obstacle until you thought back to one of your early Python units (maybe this one?) and remembered that there is an open source, freely available corpus collection of billions of recently crawled websites called the Common Crawl (http://commoncrawl.org/). More specifically the Common Crawl corpus consists of tens of thousands of files saved in a certain format (the WARC format, see below), each of which contains the raw HTML of tens of thousands of web pages from a web ‘crawl’ performed in the recent past. Being open source this data is free for you to use so with it you can immediately begin building your proof-of-concept.

The Project:

As your program is to be a proof-of-concept, both you and your supervisor decided that its scope should be kept as narrow as possible (but, of course, it must be broad enough so that it can successfully demonstrate some really good insights). For this reason, it was decided that your program is to focus on providing only four insights:

1. How ‘positive’ is Australia generally?
2. How ‘positive’ does Australia feel towards their Government specifically?
3. How ‘patriotic’ is Australia compared with two other major English speaking countries (UK and Canada)?
4. What are the most referred-to websites (domains) by all Australian websites (your team may want to use this information in the future to better understand how ‘influential’ each Australian web result is to your insights, i.e. highly-referred to web domains should be counted as more influential, and lowly-referred to web domains should be counted as less influential).

As outlined in the ‘context’ section, in order to generate these insights (which will be discussed in greater detail later in this document), your program will need to examine the raw HTML from large quantities of Australian web pages, and such information is available in WARC format from the Common Crawl.

The Common Crawl and WARC format:

The WARC (Web ARChive) format is a standard format for mass storage of large amounts of ‘web pages’ within a single file. The Common Crawl makes the results of their crawl freely available for download in this format (as well as the WAT and WET formats, which will not be used for this project). For this project we will use WARC files from the August 2020 crawl (https://commoncrawl.org/2020/08/august-2020-crawl-archive-now-available/). In order to access these files you need to download the “WARC files” list, which you can access by clicking on the “CC-MAIN-2020-34/warc.paths.gz” hyperlink in the table in the August 2020 crawl homepage.

Clicking on this link will download an archive, which, when opened, will contain a text file. Once you open the text file you can download any of the WARC files from the common crawl by appending https://commoncrawl.s3.amazonaws.com/ to the front of any of the lines of this file and pasting this full address into your browser.

A couple of notes about the Common Crawl WARC files as discussed so far:

• The file list and all Common Crawl WARC files are compressed using gzip. These files can be unzipped automatically if you are using Linux or Mac OSX. For Windows you will have to download a free application to do this; try 7-Zip: https://www.7-zip.org/.
• The Common Crawl WARC files are very large: approximately 900MB compressed and up to 5GB uncompressed. Each file contains approximately 45,000 individual crawl results.

Due to the size of the files above, this project has made available a massively cut down sample Common Crawl WARC file on LMS as well as on the Moodle server. It is expected you will use this file to get familiar with the format and for your (initial) testing of your project. However, your submission will be tested with other WARC files.

To start getting familiar with WARC files, it is recommended you download the sample file and open it in a text editor (for Windows, Wordpad performs better; you can also use Thonny). You will see that a WARC file consists of an overall file header, beginning with the text “WARC/1.0”, and the next time you see this text is to describe either a request (“WARC/1.0\r\nWARC-Type: request”), a response (“WARC/1.0\r\nWARC-Type: response”) or possibly a metadata or other type of WARC category (e.g. “WARC/1.0\r\nWARC-Type: metadata”). For this project we are only interested in WARC responses (“WARC/1.0\r\nWARC-Type: response”), as this is the only category that contains the raw HTML data of the web page we are analysing.[2]

[2] Note the use of ‘\r’ with ‘\n’ to signify a line ending in the WARC (and HTTP) headers. This is a standard line ending code for text files saved with Microsoft Windows and some other scenarios. You will need to account for this when processing these headers.

Looking in more detail at WARC responses, you can see that these are further broken down into three sections, which are separated by blank lines. The first is the WARC response header (beginning with “WARC/1.0”). The second is the HTTP header (usually beginning with “HTTP/1.1 200”) and the third is the raw HTML data (usually but not necessarily beginning with “”). For the purposes of this project, you can assume that the first block of text (before the first blank line) is the WARC header, the second block of text (after the first blank line) is always the HTTP header, and the third block of text (i.e. anything after the second blank line and before the next “WARC/1.0” heading) is the raw HTML that we need to analyse.

Taking into account the above, your program will need to be able to open a WARC file, discard or ignore the overall WARC file header, and then for each result:

1. Extract the URL from the WARC response header (this is stored in the line starting with
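
The three-block layout of a WARC response described above (WARC header, HTTP header, raw HTML, separated by blank lines with ‘\r\n’ line endings) could be split apart roughly as in the sketch below. This is only an illustration under the simplifying assumptions stated in the brief, not the unit's reference solution: the function names `parse_header` and `split_warc_responses` are hypothetical, the sample record is made up, and the header field names in the sample are ordinary WARC fields chosen for the demonstration.

```python
def parse_header(block):
    """Turn 'Name: value' header lines into a dict (hypothetical helper)."""
    fields = {}
    for line in block.split("\r\n"):
        name, sep, value = line.partition(":")
        if sep:  # skip any line that has no colon
            fields[name.strip()] = value.strip()
    return fields


def split_warc_responses(raw):
    """Yield (warc_fields, http_header, html) for each response record.

    Assumes, as the brief allows, that blank lines separate the WARC
    header, the HTTP header and the HTML, and that every record starts
    with the text 'WARC/1.0' followed by a '\r\n' line ending.
    """
    text = raw.decode("utf-8", errors="replace")
    # The chunk before the first marker is empty, so it is dropped.
    for chunk in text.split("WARC/1.0\r\n")[1:]:
        # A blank line in a '\r\n' file is the sequence '\r\n\r\n'.
        parts = chunk.split("\r\n\r\n", 2)
        if len(parts) < 3:
            continue  # nothing after the headers to analyse
        warc_fields = parse_header(parts[0])
        if warc_fields.get("WARC-Type") != "response":
            continue  # file header, requests, metadata and so on
        yield warc_fields, parts[1], parts[2].strip()


# Made-up sample: one file header, one request and one response.
SAMPLE = (
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\nsoftware: demo\r\n\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: request\r\n"
    b"WARC-Target-URI: http://example.com.au/\r\n\r\n"
    b"GET / HTTP/1.1\r\n\r\n\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com.au/\r\n\r\n"
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    b"<html><body>G'day</body></html>\r\n\r\n"
)

records = list(split_warc_responses(SAMPLE))
for fields, http_header, html in records:
    print(fields["WARC-Type"], len(html))
```

For a gzip-compressed WARC file, the bytes passed to `split_warc_responses` could come from `gzip.open(path, "rb").read()`; for a multi-gigabyte file, reading line by line would use less memory than loading everything at once.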

Sandeep Kumar · Accepted Answer

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.feature_extraction.text import TfidfTransformer\n",
    "from sklearn.ensemble import VotingClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.pipeline import Pipeline\n",
    "from nltk.util import ngrams\n",
    "import re\n",
    "from sklearn import linear_model\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fileWriter = open('negativewords.txt','w')\n",
    "\n",
    "mystopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',\n",
    "'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',\n",
    "'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',\n",
    "'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',\n",
    "'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',\n",
    "'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',\n",
    "'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',\n",
    "'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',\n",
    "'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',\n",
    "'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\n",
    "'most', 'other', 'some', 'such', 'no', 'nor',
