
Python Data Analysis with given Files


IS6713 - Spring 2021
Requirements for removing the incomplete

Assignment #1 (120 points total available)

This assignment focuses on data visualization tasks. Please refer to the labs and posted videos for hints.

Write your name and abc123
Name: Rodriguez
UTSA ID: gdh898

In [2]: pip install python-Levenshtein

        Successfully installed python-Levenshtein-0.12.2
        Note: you may need to restart the kernel to use updated packages.

In [3]: import pandas as pd
        import numpy as np
        import seaborn as sns
        from gensim.utils import simple_preprocess

Task 1. Read the data file (10 points)

Read the data file reviews.csv into a pandas dataframe. Check the structure of the dataset and look at some of the observations.

In [ ]: # your code here

Task 2. Decode the date column "at" (20 points)

Check out the content of the column "at" and perform the following steps:

1. Create two new numeric columns, "year_at" and "month_at", containing the year and month of the review respectively (hint: you can "slice" these values out of the "at" column)
2. Find the review's initial year (hint: you can use the min function)
3. Create a new column "time" calculated as (year_at - initial_year) * 12 + month_at

In [ ]: # 1. your code here

In [ ]: # 2. your code here

In [ ]: # 3. your code here

Task 3. Lexicon-based sentiment analysis (30 points)

Using the lexicons for negative and positive words provided with this notebook, do the following:

1. Create a new numeric column containing the number of positive words for each review
2. Create a new numeric column containing the number of negative words for each review
3. Create a new numeric column containing the total sentiment score (positive - negative) for each review
4. Create a new text column with the predicted sentiment class for each review (positive, neutral, negative)

In [ ]: ## Loading the lexicons
        def file_to_set(file):
            c = set()
            for x in file:
                c.add(x.strip())
            return c  # You should return a set

        positive_file = open('./positive-words.txt', encoding='utf8')
        positive_words = file_to_set(positive_file)
        positive_file.close()

        negative_file = open('./negative-words.txt', encoding='iso-8859-1') # If you get a weir
        negative_words = file_to_set(negative_file)
        negative_file.close()

In [ ]: # 1. your code here

In [ ]: # 2. your code here

In [ ]: # 3. your code here

In [ ]: # 4. your code here

Task 4. Exploratory Data Analysis (30 points)

This task focuses on the numeric and categorical columns, namely:

score (numeric)
thumbsUpCount (numeric)
appId (categorical)
year_at (numeric, created in task 2)
month_at (numeric, created in task 2)
time (numeric, created in task 2)
positive_words (numeric, created in task 3)
negative_words (numeric, created in task 3)
sentiment_score (numeric, created in task 3)
sentiment (categorical, created in task 3)

Describe the basic statistics of each numeric column (min, max, average, etc.) and summarize each categorical column (number of categories, top category, frequency, etc.).

In [ ]: # 1. your code here - numeric columns

In [ ]: # 1. your code here - categorical columns

Task 5. Visually represent your data (30 points)

1. Visually represent each of the following columns by selecting the appropriate type of graph for each column:

   score (numeric)
   thumbsUpCount (numeric)
   appId (categorical)
   year_at (numeric, created in task 2)
   month_at (numeric, created in task 2)
   time (numeric, created in task 2)
   positive_words (numeric, created in task 3)
   negative_words (numeric, created in task 3)
   sentiment_score (numeric, created in task 3)
   sentiment (categorical, created in task 3)

2. Plot the trend over time (x=time) for the following columns. Hint: you may need to group the data by time and take the mean before plotting the trend of a column, i.e. groupby("time", as_index=False).agg('mean'):

   score
   thumbsUpCount
   positive_words
   negative_words
   sentiment_score

3. Plot the score over time by appId (15 plots, one for each appId)
4. Use the seaborn lmplot function to plot a linear regression for score over time (x='time', y='score'). Do you think that the linear relationship fits the data?

Note: For each plot, add a comment to explain what you are trying to show (example: "This graph plots the average score vs time" or "This graph shows the number of reviews per appId over time").

In [ ]: # 1. your code here

In [ ]: # 2. your code here

In [ ]: # Plot over time here. Remember, you may have to groupby('time', as_index=False) your d

In [1]: ## Plot the score over time by appId (15 plots, can be together, with one line of code,
        # 3. your code here

In [ ]: # 4. Write your code here

In [ ]: ## Write your conclusions about the plot score vs time. Does the linear relationship f
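The slicing hint and the "time" formula from Task 2 can be illustrated with a minimal sketch. The three sample rows are hypothetical, and the sketch assumes that "at" holds strings laid out as "YYYY-MM-DD HH:MM:SS"; the real layout should be checked in reviews.csv first.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from reviews.csv.
# Assumption: "at" is a string formatted as "YYYY-MM-DD HH:MM:SS".
df = pd.DataFrame(
    {"at": ["2019-11-03 08:15:00", "2020-02-17 21:40:12", "2021-01-05 10:00:00"]}
)

# Slice the year (first four characters) and the month (characters 5-6),
# then cast both to integers so they are numeric columns.
df["year_at"] = df["at"].str[:4].astype(int)
df["month_at"] = df["at"].str[5:7].astype(int)

# The initial year is the smallest year present in the data.
initial_year = df["year_at"].min()

# Month index counted from January of the initial year (January = 1).
df["time"] = (df["year_at"] - initial_year) * 12 + df["month_at"]

print(df[["at", "year_at", "month_at", "time"]])
```

If the column is parsed with pd.to_datetime instead, the .dt.year and .dt.month accessors would give the same two values without slicing.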
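The four steps of Task 3 (count positive words, count negative words, take the difference, assign a class) can be sketched in plain Python. Everything named here is an assumption for illustration: the tiny lexicons stand in for the sets loaded from positive-words.txt and negative-words.txt, the helper names are hypothetical, and the regular-expression tokenizer is a simple substitute for gensim's simple_preprocess, not the notebook's own method.

```python
import re

# Hypothetical miniature lexicons; the assignment loads the real ones
# from positive-words.txt and negative-words.txt into sets.
positive_words = {"good", "great", "love"}
negative_words = {"bad", "slow", "crash"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens.

    Assumption: a stand-in for gensim's simple_preprocess.
    """
    return re.findall(r"[a-z]+", str(text).lower())


def count_hits(text, lexicon):
    """Count how many tokens of the text appear in the lexicon."""
    return sum(1 for token in tokenize(text) if token in lexicon)


def sentiment_class(score):
    """Map a sentiment score to positive, neutral, or negative."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical review texts.
reviews = ["Great app, I love it", "Slow and bad, it will crash", "It is an app"]

results = []
for text in reviews:
    pos = count_hits(text, positive_words)
    neg = count_hits(text, negative_words)
    score = pos - neg
    results.append((pos, neg, score, sentiment_class(score)))

for text, row in zip(reviews, results):
    print(text, row)
```

With a dataframe, the same helpers could be passed to Series.apply on the review-text column; the source does not name that column, so it would need to be looked up in the data.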
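The groupby hint in Task 5, item 2, amounts to averaging each column per value of "time" before plotting. Below is a minimal sketch on hypothetical rows. It selects the numeric columns explicitly, which is a variation on the hint's agg('mean'): recent pandas versions may raise an error when a mean is requested over text columns such as appId.

```python
import pandas as pd

# Hypothetical rows; the real dataframe comes from reviews.csv plus the
# columns created in Tasks 2 and 3.
df = pd.DataFrame(
    {
        "appId": ["app.a", "app.b", "app.a", "app.b", "app.a"],
        "time": [11, 11, 12, 12, 12],
        "score": [5, 3, 4, 2, 3],
        "thumbsUpCount": [0, 2, 1, 1, 4],
    }
)

# Average the numeric columns for each value of "time".
# as_index=False keeps "time" as a regular column, ready to use as x.
trend = df.groupby("time", as_index=False)[["score", "thumbsUpCount"]].mean()

print(trend)
```

The resulting frame has one row per month index, so a call such as sns.lineplot(data=trend, x="time", y="score") would draw the average score over time.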
May 28, 2021