{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"# Text Mining\n",
"\n",
"#### Automated Understanding of Text\n",
"\n",
"-----------\n",
"_Authors: Carleton Smith_\n",
"\n",
"## Project Guide\n",
"\n",
"- [Introducing the Amazon Review Dataset](#Introducing-the-Amazon-Review-Dataset)\n",
"- [Counting Positive/Negative Words](#Counting-Positive/Negative-Words)\n",
"- [Sentiment Intensity](#Sentiment-Intensity)\n",
"- [LDA Topics](#LDA-Topics)\n",
"- [Review Scores](#Review-Scores)\n",
"\n",
"\n",
"## Project Overview\n",
"\n",
"----------------------------------\n",
"#### EXPECTED TIME: 1.5 HRS\n",
"\n",
"The lectures this week covered a large amount of material. As should be apparent, text mining offers\n",
"many avenues for investigation. This assignment will focus on how to create a couple different\n",
"features from a text document. In particular, activities will include:\n",
"\n",
"- Picking out positive and negative words\n",
"- Calculating sentiment scores\n",
"- Creating \"topics\" with LDA\n",
"\n",
"# VERY IMPORTANT: READ BELOW\n",
"\n",
"**If you recieve an error when trying to run the `imports` cell, go to the top of the screen;\n",
"select `Kernel` on the tool bar, go down to `Change kernel`, and select `Python 3.5`**\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"### Introducing the Amazon Review Dataset\n",
"\n",
"**DATA CITATION**\n",
"\n",
" Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering\n",
" R. He, J. McAuley\n",
" WWW, 2016\n",
" \n",
" http://jmcauley.ucsd.edu/data/amazon/\n",
" \n",
"The data today is a collection of reviews of outdoor products from `Amazon.com`. The full data-set\n",
"includes many features:\n",
"**DATA DICTIONARY**\n",
"\n",
"1. `reviewerID` - ID of the reviewer, e.g. A2SUAM1J3GNN3B\n",
"2. `asin` - ID of the product, e.g. 0000013714\n",
"3. `reviewerName` - name of the reviewer\n",
"4. `helpful` - helpfulness rating of the review, e.g. 2/3\n",
"5. `reviewText` - text of the review\n",
"6. `overall` - rating of the product\n",
"7. `summary` - summary of the review\n",
"8. `unixReviewTime` - time of the review (unix time)\n",
"9. `reviewTime` - time of the review (raw)"
]
},
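{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two of the fields in the data dictionary need light decoding before they can serve as features. The following is a minimal sketch and not part of the graded work: `helpful_ratio` and `review_datetime` are hypothetical helper names, and treating `helpful` as a `[helpful_votes, total_votes]` pair is an assumption based on values such as `[2, 2]` in the sample rows.\n",
"\n",
"```python\n",
"from datetime import datetime, timezone\n",
"\n",
"def helpful_ratio(helpful):\n",
"    # Assumption: `helpful` is [helpful_votes, total_votes]\n",
"    up, total = helpful\n",
"    # A review with no votes has no defined ratio\n",
"    return up / total if total else None\n",
"\n",
"def review_datetime(unix_review_time):\n",
"    # Interpret the Unix timestamp (in seconds) as UTC\n",
"    return datetime.fromtimestamp(unix_review_time, tz=timezone.utc)\n",
"\n",
"print(helpful_ratio([2, 3]))\n",
"print(review_datetime(1390694400).date())\n",
"```\n",
"\n",
"Returning `None` for unvoted reviews keeps them distinguishable from reviews that were voted unhelpful."
]
},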
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [],
"source": [
"import nbconvert\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import nltk\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"**READ IN THE DATA**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape: (296337, 9)\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
"\n",
"\n",
" | \n",
"reviewerID | \n",
"asin | \n",
"reviewerName | \n",
"helpful | \n",
"reviewText | \n",
"overall | \n",
"summary | \n",
"unixReviewTime | \n",
"reviewTime | \n",
"
\n",
"\n",
"\n",
"\n",
"0 | \n",
"AIXZKN4ACSKI | \n",
"1881509818 | \n",
"David Briner | \n",
"[0, 0] | \n",
"This came in on time and I am veru happy with ... | \n",
"5 | \n",
"Woks very good | \n",
"1390694400 | \n",
"01 26, 2014 | \n",
"
\n",
"\n",
"1 | \n",
"A1L5P841VIO02V | \n",
"1881509818 | \n",
"Jason A. Kramer | \n",
"[1, 1] | \n",
"I had a factory Glock tool that I was using fo... | \n",
"5 | \n",
"Works as well as the factory tool | \n",
"1328140800 | \n",
"02 2, 2012 | \n",
"
\n",
"\n",
"2 | \n",
"AB2W04NI4OEAD | \n",
"1881509818 | \n",
"J. Fernald | \n",
"[2, 2] | \n",
"If you don't have a 3/32 punch or would like t... | \n",
"4 | \n",
"It's a punch, that's all. | \n",
"1330387200 | \n",
"02 28, 2012 | \n",
"
\n",
"\n",
"3 | \n",
"A148SVSWKTJKU6 | \n",
"1881509818 | \n",
"Jusitn A. Watts \"Maverick9614\" | \n",
"[0, 0] | \n",
"This works no better than any 3/32 punch you w... | \n",
"4 | \n",
"It's a punch with a Glock logo. | \n",
"1328400000 | \n",
"02 5, 2012 | \n",
"
\n",
"\n",
"4 | \n",
"AAAWJ6LW9WMOO | \n",
"1881509818 | \n",
"Material Man | \n",
"[0, 0] | \n",
"I purchased this thinking maybe I need a speci... | \n",
"4 | \n",
"Ok,tool does what a regular punch does. | \n",
"1366675200 | \n",
"04 23, 2013 | \n",
"
\n",
"\n",
"
\n",
"
"
],
"text/plain": [
" reviewerID asin reviewerName helpful \\\n",
"0 AIXZKN4ACSKI 1881509818 David Briner [0, 0] \n",
"1 A1L5P841VIO02V 1881509818 Jason A. Kramer [1, 1] \n",
"2 AB2W04NI4OEAD 1881509818 J. Fernald [2, 2] \n",
"3 A148SVSWKTJKU6 1881509818 Jusitn A. Watts \"Maverick9614\" [0, 0] \n",
"4 AAAWJ6LW9WMOO 1881509818 Material Man [0, 0] \n",
"\n",
" reviewText overall \\\n",
"0 This came in on time and I am veru happy with ... 5 \n",
"1 I had a factory Glock tool that I was using fo... 5 \n",
"2 If you don't have a 3/32 punch or would like t... 4 \n",
"3 This works no better than any 3/32 punch you w... 4 \n",
"4 I purchased this thinking maybe I need a speci... 4 \n",
"\n",
" summary unixReviewTime reviewTime \n",
"0 Woks very good 1390694400 01 26, 2014 \n",
"1 Works as well as the factory tool 1328140800 02 2, 2012 \n",
"2 It's a punch, that's all. 1330387200 02 28, 2012 \n",
"3 It's a punch with a Glock logo. 1328400000 02 5, 2012 \n",
"4 Ok,tool does what a regular punch does. 1366675200 04 23, 2013 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_path = 'C:/Users/Atal/PycharmProjects/GreyNodes/Dataset/reviews_Sports_and_Outdoors_5.json.gz'\n",
"reviews = pd.read_json(data_path, lines=True, compression='gzip')\n",
"print(\"Shape: \", reviews.shape)\n",
"reviews.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"**PREPARE DATASET** \n",
"\n",
"However, we will only be using a portion of this data; much of the provided data is auxilliary\n",
"to our text-mining purposes:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[\"These are really useful. I wear them with a short-sleeve road jersey under a light cycling windbreaker then, as I warm up, pull off the jacket and then the sleeves. The fit is comfortable -- they're quite stretchy, so they should fit arms of reasonable size (mine are long and slender). Rolled up, they fit easily into a jersey pocket.\", 'I recently added Profile Design Aqua Rack 2-bottle Aero Bottle Holder and needed to move to a smaller saddle bag to accommodate it on my small frame road bike (48\"). Since space is a premium, I started looking for another bag option that wouldn\\'t add bulk but would hold my iPhone and possibly a gu or two. This pack fits the bill perfectly - it attaches within seconds and cinches down tight. I have plenty of room and it has never been in my way. A couple of added bonuses is that I can see my phone\\'s screen through the see-through top or a printed route map, which is a big big plus.', \"Nice to know when you're shooting a 40 Cal that your guide rod is not plastic anymore never know if or when that will break this is nice heavy-duty looks like it will last a life time or at least the life of the gun.\", 'This is my first Spyderco folder. I have a fixed blade that I love from them. I have a nice assortment of knives from SOG, Benchmade, Kershaw, Zero Tolerance and others. I really like this knife, especially with the brown handle. The feel is great and the blade makes it a bargain for under $60.00. Might be a bit large for sme for every day carry but it does not bother me.. Nice job Spyderco...it is a keeper!', \"I bought this for my M&P Shield. This fits perfectly like it's from the factory. These holsters are sown super tough. I use this for my everyday conceal carry. I am extremely satisfied. Great product!\"] \n",
"\n",
"\n",
"['So it worked well for a couple weeks, but during a lunge workout, it snapped on me. I liked it and thought it was a great product until this happened. I noticed small rips on the band. This could have been the issue.', 'I have several different bands and these are the least likable, they do not stretch enough and make it way to difficult to use', 'The red band(heaviest resistance) started tearing the first time I used it, so for obvious reason I stopped and tried the black one. The black one lasted fine for the exercise, but the next day it started ripping in the same place as the red one. I will not use the other colors as they simply do not have enough resistance. A big minus is that this set has only one pair of handles and to switch from one handle to another is a big inconvenience and a waist of time.', 'As most exercise products, these were used a few times, and stored away in the bag for a few months. I took them out to start using them again after a car accident has kept me off the free weights, and the bands are starting to discolor and fade. Makes me nervous that the integrity of the rubber is weakening...', 'I wish I would have taken the warnings I read in some of the other reviews seriously!I purchased these bands and loved them the first 2 times I used them. The handles make the bands very comfortable to hold. They seemed to be high quality. I am 5\\'3\" and the stretch was plenty for me, though if you are tall I\\'m not so sure this set of bands would work for you. After my first 2 uses I would have given these bands 5 out of 5 stars, but after today\\'s workout there\\'s no chance!While I was using the yellow band the hardware snapped and hit my leg. Thankfully it hit our concrete garage floor before hitting my leg. I\\'ve attached photos of the hardware issue-- as you can see one of the handles becomes disconnected even with the slightest amount of pull. 
It seems that the peg that keeps the clip attached to the band is too small for the hole.All in all I would say don\\'t waste your money! I\\'m going to cough up a little more money and buy actual dumbbells.']\n"
]
}
],
"source": [
"# Drop unnecessary columns\n",
"cols_to_keep = ['overall', 'reviewText']\n",
"reviews = reviews.loc[:,cols_to_keep]\n",
"\n",
"# Take a sample of 20,000 5-star reviews (since they are majority)\n",
"five_star_sample = reviews.loc[reviews['overall'] == 5,:].sample(20000, random_state=24)\n",
"\n",
"# Grab the ~19,000+ reviews of 1 and 2 stars\n",
"one_and_two_stars = reviews.loc[reviews['overall'].isin([1,2]),:]\n",
"\n",
"# Display first 5 entries 5-star and low-star corpora \n",
"five_star_corpus = list(five_star_sample['reviewText'])\n",
"low_star_corpus = list(one_and_two_stars['reviewText'])\n",
"print(five_star_corpus[:5], \"\\n\\n\")\n",
"print(low_star_corpus[:5])"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"At This point there are two \"corpora\" one, a list of review text from 5-star reviews,\n",
"and the other a list of review text from 1/2 star reviews.\n",
"\n",
"Of course we would expect significant difference betweeen the text of 1/2-star\n",
"reviews and 5-star reviews. This is by design -- we want to see exactly how the reviews look different."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"### Counting Positive/Negative Words\n",
"\n",
"Below is the `\"get_words()\"` function used in lecture, along with the calls that will collect the lists of positive and negative words. \n",
"\n",
"Below that is the function `\"count_pos_and_neg()\"`. \n",
"\n",
"`count_pos_and_neg()` functionalizes the counting of positive and negative words demonstrated in lecture for the restaurants \"Community\" and \"Le Monde\". \n",
"\n",
"Finally, `\"count_pos_and_neg()\"` is used, on our positive/negative word lists (to see the cross-over)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Counting on positive words: (2006, 3, 2006)\n",
"Counting on negative words: (3, 4778, 4796)\n",
"Overlapping words: ['envious', 'enviously', 'enviousness']\n"
]
}
],
"source": [
"def get_words(file):\n",
" import requests\n",
" f = open(file, 'r', encoding = \"ISO-8859-1\")\n",
" lines = [l for l in f.readlines() if (l[0] != \";\") and l[0:2] != \"\\n\"]\n",
" word_list = [w.replace(\"\\n\",\"\") for w in lines]\n",
"\n",
" return word_list\n",
"\n",
"p_url = 'C:/Users/Atal/PycharmProjects/GreyNodes/Dataset/positive-words.txt'\n",
"n_url = 'C:/Users/Atal/PycharmProjects/GreyNodes/Dataset/negative-words.txt'\n",
"positive_words = get_words(p_url)\n",
"negative_words = get_words(n_url)\n",
"\n",
"def count_pos_and_neg(text, positive = positive_words, negative = negative_words):\n",
" pos = 0\n",
" neg = 0\n",
" words = nltk.word_tokenize(text)\n",
" for word in words:\n",
" if word in positive: pos +=1\n",
" if word in negative: neg +=1\n",
" \n",
" return pos, neg, len(words)\n",
"\n",
"def proportion_pos_neg(pos, neg, total):\n",
" print(\"Positive: {:.2f}%\\t Negative: {:.2f}%\".format(pos/total*100, neg/total*100))\n",
" \n",
"print(\"Counting on positive words:\" , count_pos_and_neg(\" \".join(positive_words)))\n",
"print(\"Counting on negative words:\" , count_pos_and_neg(\" \".join(negative_words)))\n",
"print(\"Overlapping words: \", [w for w in positive_words if w in negative_words])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Positive: 8.70%\t Negative: 4.35%\n"
]
}
],
"source": [
"### Example of using `proportion_pos_neg`\n",
"proportion_pos_neg(4,2,46)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"Remember: the output from `count_pos_and_neg` is `(count positive words, count negative words, count total words)`\n",
"\n",
"Below is an example of counting the positive / negative words in a few of the reviews:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Counting in a five-star review: (4, 2, 46)\n",
"Proportions in a five-star review:\n",
"Positive: 8.70%\t Negative: 4.35%\n",
"\n",
"Counting in a one/two-star review: (20, 14, 772)\n",
"Proportions in a one/two-star review:\n",
"Positive: 2.59%\t Negative: 1.81%\n"
]
}
],
"source": [
"print(\"Counting in a five-star review:\", count_pos_and_neg(five_star_corpus[100]))\n",
"print(\"Proportions in a five-star review:\")\n",
"proportion_pos_neg(*count_pos_and_neg(five_star_corpus[100]))\n",
"\n",
"print(\"\\nCounting in a one/two-star review:\", count_pos_and_neg(low_star_corpus[100]))\n",
"print(\"Proportions in a one/two-star review:\")\n",
"proportion_pos_neg(*count_pos_and_neg(low_star_corpus[100]))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"As you can see, there may be many positve words in a \"negative\" review (20!) Which is even more than the number of negative words (14). Full review texts below"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
},
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Low-Star\n",
" To be honest, I really wanted a Burley trailer for my kids and I do even more now after having dealt with the Quick N EZ trailer for a year. Our first trailer we got for $25 at a garage sale. It was also InStep but I don't see that they make this style any longer. It had a hard plastic base as opposed to the canvas most (if not all) trailers are made of these days. My son would slip and slide all over. After just a few minutes of biking with no suspension and no seats, he'd be slumped down with the straps around his neck.Needless to say, I wasn't going to make him endure that any longer! So we picked up the Quick and EZ for $80. We were pleased with how easy it was to set up and how quickly it folded down, though it still wasn't small enough to fit in the trunk of our Grand Prix. We now have a Sienna so it's no longer an issue, but my son's comfort was immediately a factor from the first time he was put in it. The hammock-style seat is straight up in the back, so when he falls asleep he just slumps over and looks very uncomfortable.Not only that, but the airflow inside the trailer is a HUGE concern for me. After 10 minutes of biking in 80ish degree weather, he is beet red and soaked in sweat. Now, he is not in direct sunlight so that's not what's heating him up. It's that because the hammock seat is canvas and because the seat effectively goes from \"wall to wall\" inside the trailer, there is nowhere for the air to go, even though there is a vent in the back of the trailer. We started putting him in shorts and nothing else to try to keep him comfortable but he still gets extremely hot. How can I enjoy my ride when I know he is miserable?And you can forget about putting two kids in this thing unless they're very easy-going and don't mind being squished together. When I was a daycare mom, I would regularly take my son and one of our daycare kids who was only 4 days younger than my son and about the same build, and the first few times they did nothing but fight! 
Sure they were two 2-year-olds, but can you blame them? The hammock seat forced them to lean into each other and make them even MORE hot! Would you really enjoy riding in the car if you were pressed up against the person in the seat beside you the entire time? Not likely!Aside from that major complaint, I do have to complain that the small wheels and lack of suspension make for a very bumpy ride and I also feel bad subjecting my kids to that. Also, it would be nice if you could only partially lower the sun shade to keep your baby protected. When we're going in the wrong direction, he always gets sun on his thighs and sometimes even his arms if the sun is high. We always put sunscreen on of course, but if I'm going to expect him to sit in there while I bike for two hours, I'd like to make sure he's comfortable.So, all complaints aside, the trailer has one big plus. It's price. You can't beat the price, but you get what you pay for. This summer I am seriously considering spending one of my bonus checks to get my son a wonderful Burley trailer. Yes, it's $450 and probably not practical if you only bike once in a while, but seeing as we bike daily in warm weather and sometimes 2-3 times a day, I think the price is justified to make sure he enjoys it as much as I do. Burley trailers come with large wheels with suspension, a roll bar, PADDED BUCKET SEATS!! with mesh on the upper part to allow for good airflow, and padded seatbelts.Yep, definitely getting a Burley this year! I've waited long enough! ;)\n",
"\n",
"\n",
"High-Star\n",
" WOw, I love having sights that glow in the dark and glow in the light! Sooooo sick! You seriously need these! Day time shooting with these sights are liking shooting at night because the sights do what they say, GLOW!\n"
]
}
],
"source": [
"print(\"Low-Star\\n\", low_star_corpus[100])\n",
"print(\"\\n\\nHigh-Star\\n\", five_star_corpus[100])"
]
},
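{
"cell_type": "markdown",
"metadata": {},
"source": [
"Raw counts favor long reviews: the 772-token low-star review contains more positive words (20) than the 46-token five-star review (4), yet its positive proportion is lower (2.59% against 8.70%). A minimal sketch of a length-normalized comparison follows. `TOY_POSITIVE`, `TOY_NEGATIVE`, and `net_sentiment_rate` are hypothetical names, the word sets are tiny stand-ins for the real lexicons, and whitespace splitting is a simplification of `nltk.word_tokenize`.\n",
"\n",
"```python\n",
"TOY_POSITIVE = {'love', 'great', 'comfortable'}\n",
"TOY_NEGATIVE = {'hot', 'miserable', 'complaint'}\n",
"\n",
"def net_sentiment_rate(text, positive=TOY_POSITIVE, negative=TOY_NEGATIVE):\n",
"    # Lowercase and split on whitespace (simplified tokenization)\n",
"    tokens = text.lower().split()\n",
"    if not tokens:\n",
"        return 0.0\n",
"    pos = sum(token in positive for token in tokens)\n",
"    neg = sum(token in negative for token in tokens)\n",
"    # Positive minus negative matches, divided by the token count\n",
"    return (pos - neg) / len(tokens)\n",
"\n",
"print(net_sentiment_rate('love these great sights'))\n",
"print(net_sentiment_rate('he is hot and miserable in there'))\n",
"```\n",
"\n",
"Dividing by the token count puts short and long reviews on the same scale, which the raw tuple from `count_pos_and_neg` does not."
]
},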
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"#### Question 1"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"### GRADED\n",
"### How many positive words are in the review at index 10 in `low_star_corpus`?\n",
"### Assign int to ans1\n",
"\n",
"### YOUR ANSWER BELOW\n",
"text = low_star_corpus[10]\n",
"ans1 = '23'"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": true,
"grade_id": "Question 1",
"locked": true,
"points": "10",
"solution": false
}
},
"outputs": [],
"source": [
" ###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [
{
"data": {
"text/plain": [
"(23, 10, 620)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"no_of_positive_words = count_pos_and_neg(text)\n",
"no_of_positive_words"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"#### Question 2"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"### GRADED\n",
"### What is the average number of positive words in the first 10 texts in `low_star_corpus`?\n",
"\n",
"### Assign numeric answer to ans1\n",
"\n",
"### YOUR ANSWER BELOW\n",
"\n",
"texts = low_star_corpus[0:10]\n",
"\n",
"ans1 = '2'"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2\n"
]
}
],
"source": [
"avg = 0\n",
"for i in range(0, len(texts)):\n",
" no_of_positive_word, _, _ = count_pos_and_neg(texts[i])\n",
" avg = avg + no_of_positive_word\n",
"print(int(avg / len(texts)))"
]
},
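{
"cell_type": "markdown",
"metadata": {},
"source": [
"An average of counts can also be computed with `statistics.mean`, which makes it easier to see what wrapping the result in `int()` discards. A small sketch with made-up numbers follows; `toy_counts` is hypothetical and is not derived from the corpus.\n",
"\n",
"```python\n",
"from statistics import mean\n",
"\n",
"# Hypothetical per-review positive-word counts\n",
"toy_counts = [3, 1, 4, 2, 5, 0, 2, 3, 1, 4]\n",
"\n",
"average = mean(toy_counts)\n",
"print(average)       # 2.5\n",
"print(int(average))  # 2\n",
"```"
]
},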
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": true,
"grade_id": "Question 2",
"locked": true,
"points": "10",
"solution": false
}
},
"outputs": [],
"source": [
"###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"#### Question 3"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### GRADED\n",
"### How many negative words are in the the text at index 30 in `low_star_corpus`?\n",
"\n",
"### Assign integer to ans1\n",
"\n",
"### YOUR ANSWER BELOW\n",
"\n",
"text = low_star_corpus[30]\n",
"_, count_of_negative_words, _ = count_pos_and_neg(text)\n",
"\n",
"ans1 = '4'\n",
"count_of_negative_words"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": true,
"grade_id": "Question 3",
"locked": true,
"points": "10",
"solution": false
}
},
"outputs": [],
"source": [
"###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"### Sentiment Intensity\n",
"The lecture demonstrated using the `SentimentIntensityAnalyzer` from the `vaderSentiment` package.\n",
"In this assignment, `SentimentIntensityAnalyzer` will be imported from `nltk.sentiment`. However,\n",
"after that import, the functioning should be identical; it still uses the Vader Sentiment metrics."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"High-star: {'neg': 0.098, 'neu': 0.668, 'pos': 0.233, 'compound': 0.8346}\n",
"Low-star: {'neg': 0.066, 'neu': 0.793, 'pos': 0.141, 'compound': 0.9962}\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package vader_lexicon to\n",
"[nltk_data] C:\\Users\\Atal\\AppData\\Roaming\\nltk_data...\n",
"[nltk_data] Package vader_lexicon is already up-to-date!\n"
]
}
],
"source": [
"# Import and instantiate a SentimentIntensityAnalyzer\n",
"from nltk.sentiment import SentimentIntensityAnalyzer\n",
"nltk.download('vader_lexicon')\n",
"sia = SentimentIntensityAnalyzer()\n",
"\n",
"print(\"High-star: \", sia.polarity_scores(five_star_corpus[100]))\n",
"print(\"Low-star: \", sia.polarity_scores(low_star_corpus[100]))\n"
]
},
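{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `compound` value is VADER's single normalized score, ranging from -1 (most negative) to +1 (most positive). A common convention, described in the vaderSentiment documentation, treats scores of 0.05 or higher as positive and scores of -0.05 or lower as negative. The sketch below applies that convention to the two score dictionaries printed for review 100 of each corpus; `label_from_compound` is a hypothetical helper, not part of NLTK.\n",
"\n",
"```python\n",
"def label_from_compound(compound, threshold=0.05):\n",
"    # Conventional VADER cut-offs; the threshold is adjustable\n",
"    if compound >= threshold:\n",
"        return 'positive'\n",
"    if compound <= -threshold:\n",
"        return 'negative'\n",
"    return 'neutral'\n",
"\n",
"high_star = {'neg': 0.098, 'neu': 0.668, 'pos': 0.233, 'compound': 0.8346}\n",
"low_star = {'neg': 0.066, 'neu': 0.793, 'pos': 0.141, 'compound': 0.9962}\n",
"\n",
"print(label_from_compound(high_star['compound']))\n",
"print(label_from_compound(low_star['compound']))\n",
"```\n",
"\n",
"Both reviews receive the same label, including the low-star one, so the compound score alone does not separate them."
]
},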
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"As is shown in simply counting positive and negative words; low-star reviews\n",
"might not always show as \"negative\".\n",
"#### Question 4"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"score = []\n",
"s = []\n",
"for i in range(0, len(low_star_corpus)):\n",
" a = sia.polarity_scores(low_star_corpus[i])\n",
" score = (low_star_corpus[i], a, a['compound'])\n",
" s.append(score)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9995\n",
"-0.9962\n"
]
}
],
"source": [
"com = []\n",
"for i in range(0, len(s)):\n",
" com.append(s[i][2])\n",
"com_max = max(com)\n",
"com_min = min(com)\n",
"print(com_max)\n",
"print(com_min)\n"
]
},
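{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see which reviews produce the extreme scores, `min` and `max` accept a `key` function, so each text can travel with its score. A minimal sketch on made-up `(text, compound)` pairs follows; `toy_scored`, `by_compound`, and the values are hypothetical and are not taken from the corpus.\n",
"\n",
"```python\n",
"toy_scored = [\n",
"    ('broke after a week', -0.42),\n",
"    ('works as described', 0.0),\n",
"    ('best purchase this year', 0.64),\n",
"]\n",
"\n",
"def by_compound(pair):\n",
"    # Comparison key: the compound score stored second in each pair\n",
"    return pair[1]\n",
"\n",
"lowest = min(toy_scored, key=by_compound)\n",
"highest = max(toy_scored, key=by_compound)\n",
"print('min:', lowest)\n",
"print('max:', highest)\n",
"```"
]
},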
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"### GRADING\n",
"\n",
"### What is the minimum score for 'compound'? Assign number to comp_min\n",
"### What is the maximum score for 'compound'? Assign number to comp_max\n",
"\n",
"### Covered early in Lecture 10-10\n",
"### YOUR ANSWER BELOW\n",
"\n",
"comp_min = '0.9995'\n",
"comp_max = '-0.9962'"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": true,
"grade_id": "Question 4",
"locked": true,
"points": "10",
"solution": false
}
},
"outputs": [],
"source": [
"###\n",
"### AUTOGRADER TEST - DO NOT REMOVE\n",
"###\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"#### Question 5"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"### GRADING\n",
...