Answer To: Objectives: Retrieve text data from an article, speech, story, debate or some other web-based...
Neha answered on Oct 15 2021
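A minimal, dependency-free sketch of the pipeline the notebook below walks through: collect the visible text of a page, replace non-letters with spaces, and lowercase the result. It swaps BeautifulSoup for the standard library's `html.parser` so it runs without installing anything. `SKIPPED_TAGS`, `VisibleTextParser`, `html_to_words`, and the sample HTML are illustrative names and data, not part of the original answer. Unlike the notebook's check on the immediate parent tag, this version skips everything nested inside `script`, `style`, and `noscript`.

```python
import re
from html.parser import HTMLParser

# Tags whose contents a reader never sees (assumed list; extend as needed).
SKIPPED_TAGS = {"script", "style", "noscript"}


class VisibleTextParser(HTMLParser):
    """Collect text nodes that are not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_words(html):
    """Return the visible text of `html` as lowercase, space-separated words."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Replace every non-letter with a space, then collapse whitespace.
    letters_only = re.sub(r"[^a-zA-Z]", " ", " ".join(parser.chunks))
    return " ".join(letters_only.lower().split())


# Hypothetical sample page, used only to exercise the function.
sample = (
    "<html><head><title>Demo</title><style>p { color: red; }</style></head>"
    "<body><p>Hello, World! It's 2021.</p><script>var x = 1;</script></body></html>"
)
print(html_to_words(sample))  # → demo hello world it s
```

Skipping `style` and `script` subtrees is what keeps CSS rules and JavaScript out of the cleaned text; the apostrophe in "It's" becomes a space, which is why a lone `s` survives as a token.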
{
"cells": [
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"#import dependencies\n",
"from urllib.request import urlopen\n",
"from bs4 import BeautifulSoup\n",
"import requests\n",
"import nltk\n",
"from nltk.corpus import stopwords\n",
"import re"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"url = 'https://www.americanrhetoric.com/speeches/gwbush911worldremembers.htm'\n",
"res = requests.get(url)\n",
"html_page = res.content\n",
"soup = BeautifulSoup(html_page, 'html.parser')\n",
"text = soup.find_all(text=True)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"output = ''\n",
"blacklist = [\n",
" '[document]',\n",
"\n",
" 'noscript',\n",
" 'header',\n",
" 'html',\n",
" 'meta',\n",
" 'head', \n",
" 'input',\n",
" 'script',\n",
" # there may be more elements you don't want, such as \"style\", etc.\n",
"]\n",
"\n",
"for t in text:\n",
" if t.parent.name not in blacklist:\n",
" output += '{} '.format(t)\n",
" \n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"def review_to_words( text ):\n",
" # Function to convert a raw review to a string of words\n",
" # The input is a single string (a raw movie review), and \n",
" # the output is a single string (a preprocessed movie review)\n",
" #\n",
" # 1. Remove HTML\n",
" output = BeautifulSoup(text).get_text() \n",
" #\n",
" # 2. Remove non-letters \n",
" letters_only = re.sub(\"[^a-zA-Z]\", \" \", text) \n",
" #\n",
" # 3. Convert to lower case, split into individual words\n",
" words = letters_only.lower().split() \n",
"\n",
" return( \" \".join( words ))"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"troy hunt the million record collection data breach gh post upgrade cta content gh post upgrade cta display flex flex direction column align items center font family apple system blinkmacsystemfont segoe ui roboto oxygen ubuntu cantarell open sans helvetica neue sans serif text align center width color ffffff font size px gh post upgrade cta content border radius px padding px vw gh post upgrade cta h color ffffff font size px letter spacing px margin padding gh post upgrade cta p margin px padding gh post upgrade cta small font size px letter spacing px gh post upgrade cta a color ffffff cursor pointer font weight box shadow none text decoration underline gh post upgrade cta a hover color ffffff opacity box shadow none text decoration underline gh post upgrade cta a gh btn display block background ffffff text decoration none margin px padding px px border radius px font size px font weight gh post upgrade cta a gh btn hover opacity root ghost accent color a home workshops speaking media about contact sponsor sponsored by the million record collection data breach january many people will land on this page after learning that their email address has appeared in a data breach i ve called collection most of them won t have a tech background or be familiar with the concept of credential stuffing so i m going to write this post for the masses and link out to more detailed material for those who want to go deeper let s start with the raw numbers because that s the headline then i ll drill down into where it s from and what it s composed of collection is a set of email addresses and passwords totalling rows it s made up of many different individual data breaches from literally thousands of different sources and yes fellow techies that s a sizeable amount more than a bit integer can hold in total there are unique combinations of email addresses and passwords this is when treating the password as case sensitive but the email address as not case sensitive this also includes 
some junk because hackers being hackers they don t always neatly format their data dumps into an easily consumable fashion i found a combination of different delimiter types including colons semicolons spaces and indeed a combination of different file types such as delimited text files files containing sql statements and other compressed archives the unique email addresses totalled this is the headline you re seeing as this is the volume of data that has now been loaded into have i been pwned hibp it s after as much clean up as i could reasonably do and per the previous paragraph the source data was presented in a variety of different formats and levels of cleanliness this number makes it the single largest breach ever to be loaded into hibp there are unique passwords as with the email addresses this was after implementing a bunch of rules to do as much clean up as i could including stripping out passwords that were still in hashed form ignoring strings that contained control characters and those that were obviously fragments of sql statements regardless of best efforts the end result is not perfect nor does it need to be it ll be x perfect though and that x has very little bearing on the practical use of this data and yes they re all now in pwned passwords more on that soon that s the numbers let s move onto where the data has actually come from data origins last week multiple people reached out and directed me to a large collection of files on the popular cloud service mega the data has since been removed from the service the collection totalled over separate files and more than gb of data one of my contacts pointed me to a popular hacking forum where the data was being socialised complete with the following image as you can see at the top left of the image the root folder is called collection hence the name i ve given this breach the expanded folders and file listing give you a bit of a sense of the nature of the data i ll come back to the word combo later and 
as you can see it s allegedly from many different sources the post on the forum referenced a collection of dehashed databases and combos stored by topic and provided a directory listing of of the files which i ve reproduced here this gives you a sense of the origins of the data but again i need to stress allegedly i ve written before about what s involved in verifying data breaches and it s often a non trivial exercise whilst there are many legitimate breaches that i recognise in that list that s the extent of my verification efforts and it s entirely possible that some of them refer to services that haven t actually been involved in a data breach at all however what i can say is that my own personal data is in there and it s accurate right email address and a password i used many years ago like many of you reading this i ve been in multiple data breaches before which have resulted in my email addresses and yes my passwords circulating in public fortunately only passwords that are no longer in use but i still feel the same sense of dismay that many people reading this will when i see them pop up again they re also ones that were stored as cryptographic hashes in the source data breaches at least the ones that i ve personally seen and verified but per the quoted sentence above the data contains dehashed passwords which have been cracked and converted back to plain text there s an entirely different technical discussion about what makes a good hashing algorithm and why the likes of salted sha is as good as useless in short if you re in this breach one or more passwords you ve previously used are floating around for others to see so that s where the data has come from let me talk about how to assess your own personal exposure checking email addresses and passwords in hibp there ll be a significant number of people that ll land here after receiving a notification from hibp about m people presently use the free notification service and k of them are in this breach many 
others over the years to come will check their address on the site and land on this blog post when clicking in the breach description for more information these people all know they were in collection and if they ve read this far hopefully they have a sense of what it is and why they re in there if you ve come here via another channel checking your email address on hibp is as simple as going to the site entering it in then looking at the results scrolling further down lists the specific data breaches the address was found in but what many people will want to know is what password was exposed hibp never stores passwords next to email addresses and there are many very good reasons for this that link explains it in more detail but in short it poses too big a risk for individuals too big a risk for me personally and frankly can t be done without taking the sorts of shortcuts that nobody should be taking with passwords in the first place but there is another way and that s by using pwned passwords this is a password search feature i built into hibp about months ago the original intention of it was to provide a data set to people building systems so that they could refer to a list of known breached passwords in order to stop people from using them again or at least advise them of the risk this provided a means of implementing guidance from government and industry bodies alike but it also provided individuals with a repository they could check their own passwords against if you re inclined to lose your mind over that last statement read about the k anonymity implementation then continue below here s how it works let s do a search for the word p ssw rd which incidentally meets most password strength criteria upper case lower case number and characters long obviously any password that s been seen over k times is terrible and you d be ill advised to use it anywhere when i searched for that password the data was anonymised first and hibp never received the actual value of it 
yes i m still conscious of the messaging when suggesting to people that they enter their password on another site but in the broader scheme of things if someone is actually using the same one all over the place as the vast majority of people still do then the wakeup call this provides is worth it as of now all passwords from collection have been added to pwned passwords bringing the total number of unique values in the list to whilst i can t tell you precisely what password was against your own record in the breach i can tell you if any password you re interested in has appeared in previous breaches pwned passwords has indexed if one of yours shows up there you really want to stop using it on any service you care about if you have a bunch of passwords and manually checking them all would be painful give this a go if you use password account you now have a brand new watchtower integrated with haveibeenpwned api thank you troyhunt also looks like i have to update some passwords pic twitter com toyynrpi h roustem karimov roustem may this is password s watchtower feature and it can take all your stored passwords and check them against pwned passwords in one go the same anonymity model is used neither password nor hibp ever see your actual password and it enables bulk checking all in one go i m conscious that many people reading this won t be using a password manager of any kind in the first place and that s an absolutely pivotal part of how to deal with this incident so i ll come back to that a little later apparently this feature along with integrated hibp searches and notifications when new breaches pop up is one of the most loved features of password which is pretty cool for some background on that without me knowing in advance they launched an early version of this only a day after i released v with the anonymity model incidentally that was a key motivator for later partnering with them hey you know what would be cool if password was to integrate with my newly 
released pwned passwords k anonymity model so you could securely check your exposure against the service it d have to be opt in of course oh wow look at this https t co rcspu kntr troy hunt troyhunt february for those using pwned passwords in their own systems eve online github okta et al the api is now returning the new data set and all cache has now been flushed you should see a very recent last modified response header all the downloadable files have also been revised up to version and are available on the pwned passwords page via download courtesy of cloudflare or via torrents they re in both sha and ntlm formats with each ordered both alphabetically by hash and by prevalence most common passwords first why load this into hibp every single time i came across a data set that s not clearly a breach of a single easily identifiable service i ask the question should this go into hibp there are a number of factors that influence that decision and one of them is uniqueness is this a sufficiently new set of data with a large volume of records i haven t seen before in determining that i take a slice of the email addresses and ran them against hibp to see how many of them had been seen before here s what it looked like after a few hundred thousand checks in other words there s somewhere in the order of m email addresses in this breach that hibp has never seen before the data was also in broad circulation based on the number of people that contacted me privately about it and the fact that it was published to a well known public forum in terms of the risk this presents more people with the data obviously increases the likelihood that it ll be used for malicious purposes then there s the passwords themselves and of the m unique ones about half of them weren t already in pwned passwords keeping in mind how this service is predominantly used that s a significant number that i want to make sure are available to the organisations that rely on this data to help steer their 
customers away from using higher risk passwords and finally every time i ve asked the question should i load data i can t emphatically identify the source of the response has always been overwhelmingly yes if i have a massive spam list full of personal data being sold to spammers should i load it into haveibeenpwned troy hunt troyhunt november people will receive notifications or browse to the site and find themselves there and it will be one more little reminder about how our personal data is misused if like me you re in that list people who are intent on breaking into your online accounts are circulating it between themselves and looking to take advantage of any shortcuts you may be taking with your online security my hope is that for many this will be the prompt they need to make an important change to their online security posture and if you find yourself in this data and don t feel there s any value in knowing about it ignore it for everyone else let s move on and establish the risk this presents then talk about fixes what s the risk if my data is in there i referred to the word combos earlier on and simply put this is just a combination of usernames usually email addresses and passwords in this case it s almost billion of them compiled into lists which can be used for credential stuffing credential stuffing is the automated injection of breached username password pairs in order to fraudulently gain access to user accounts in other words people take lists like these that contain our email addresses and passwords then they attempt to see where else they work the success of this approach is predicated on the fact that people reuse the same credentials on multiple services perhaps your personal data is on this list because you signed up to a forum many years ago you ve long since forgotten about but because its subsequently been breached and you ve been using that same password all over the place you ve got a serious problem by pure coincidence just last week i 
wrote about credential stuffing attacks and how they led many people to believe that spotify had suffered a data breach in that post i embedded a short video that shows how easily these attacks are automated and i want to include it again here within the first seconds the author of the video has chosen a combo list just like the one three quarters of a billion people are in via this combination breach another seconds and the software is testing those accounts against spotify and reporting back with email addresses and passwords that can logon to accounts there that s how easy it is and also how indiscriminate it is it s not personal you re just on the list for people wanting to go deeper check out shape security s video on credential stuffing to be clear too this is not just a spotify problem automated tools exist to leverage these combo lists against all sorts of other online services including ones you shop at socialise at and bank at if you found your password in pwned passwords and you re using that same one anywhere else you want to change each and every one of those locations to something completely unique which brings us to password managers get a password manager you have too many passwords to remember you know they re not meant to be predictable and you also know they re not meant to be reused across different services if you re in this breach and not already using a dedicated password manager the best thing you can do right now is go out and get one i did that many years ago now and wrote about how the only secure password is the one you can t remember a password manager provides you with a secure vault for all your secrets to be stored in not just passwords i store things like credit card and banking info in mine too and its sole purpose is to focus on keeping them safe and secure a password manager is also a rare exception to the rule that adding security means making your life harder for example logging on to a mobile app is dead easy password managers 
are one of the few security constructs that actually make your life easier take logging onto a mobile app with password on ios tap the email field choose the account face id login button job done not a single character typed pic twitter com zkcghfhhq troy hunt troyhunt january i chose the password manager password all those years ago and have stuck with it ever it since as i mentioned earlier they partnered with hibp to help drive people interested in personal security towards better personal security practices and obviously there s some neat integration with the data in hibp too there s also a dedicated page explaining why i chose them if a digital password manager is too big a leap to take go old school and get an analogue one aka a notebook seriously the lesson i m trying to drive home here is that the real risk posed by incidents like this is password reuse and you need to avoid that to the fullest extent possible it might be contrary to traditional thinking but writing unique passwords down in a book and keeping them inside your physically locked house is a damn sight better than reusing the same one all over the web just think...