Python Code for Data Wrangling and Machine Learning with Report. All information on the task has been included in the file named 'Task Info' in the zipped file attached. A marking rubric has also been...

Python Code for Data Wrangling and Machine Learning with Report. All information on the task has been included in the file named 'Task Info' in the zipped file attached. A marking rubric has also been included for reference.
Answered Same Day, May 31, 2021

Sandeep Kumar answered on Jun 03 2021
Introduction
With the boom in services that can be easily accessed through mobile applications, testing their reliability has become vital. Traditionally, reviews and ratings have been the most robust way to assess a service's value. With tens of thousands of users and their reviews, however, it is difficult to track how reliable and helpful those reviews are. Doing so by hand would take a human years, so the task needs to be automated. In this project I apply various machine learning algorithms to train a model that performs sentiment analysis on the reviews and predicts the review ratings.
Data Source
For this project, the Yelp dataset is used. It comprises 28,068 training records and 7,018 testing records, each provided as a review dataset and a review metadata dataset. The review metadata contains the columns date, business_id, review_id, reviewer_id, vote_cool, vote_useful, vote_cool, while the review dataset has only the review column.
Features and Preprocessing
Initially, the model was trained on the review texts and review ratings. We therefore combined the separate review and review metadata datasets and created a new column holding the number of characters in each review. The review text was also cleaned: formatting, style and extra whitespace were discarded. Word suffixes were stripped to reduce words to their roots, and stopwords, which are words that appear regularly in the language but carry no informative value, were discarded as well. The text collection was then vectorized into a matrix of token counts with over 75,657 units: the frequency of every word was counted and the resulting sparse matrix was extracted.
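The cleaning and counting steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used in the project: the stopword list, the crude suffix-stripping rule (a stand-in for a real stemmer), the function names and the two sample reviews are all hypothetical, and the merge of review and metadata tables is not shown.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; a real list is much longer).
STOPWORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of", "i"}

def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, drop punctuation and whitespace, remove stopwords, stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]

def count_matrix(texts):
    """Return (vocabulary, rows); each row is a sparse {column: count} dict."""
    docs = [Counter(tokenize(t)) for t in texts]
    vocab = sorted(set().union(*docs)) if docs else []
    index = {term: i for i, term in enumerate(vocab)}
    rows = [{index[t]: c for t, c in doc.items()} for doc in docs]
    return vocab, rows

# Hypothetical sample reviews, for illustration only.
reviews = ["The food was amazing, loved it!", "Waited ages and the food was cold."]
lengths = [len(r) for r in reviews]  # extra feature: characters per review
vocab, rows = count_matrix(reviews)
```

Storing each row as a dictionary of non-zero counts mirrors the idea of a sparse matrix: most reviews use only a small fraction of the full vocabulary, so only the columns that actually occur are kept.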
Models
The models targeted positive, neutral and negative review ratings, so the models used were classifiers suited to sentiment analysis: random forest, decision tree, support vector machine and gradient boosting classifier, as well as k-nearest neighbors and XGBoost...
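To make the classification setup concrete, here is a small sketch of one of the listed approaches, k-nearest neighbors, written in plain Python. It is an illustration only: the mapping from star ratings to three sentiment classes (1-2 negative, 3 neutral, 4-5 positive), the toy feature vectors, the function names and the choice of k = 3 are all assumptions, and a library implementation would normally be used on data of this size.

```python
import math
from collections import Counter

def rating_to_sentiment(stars):
    """Map a star rating to a class label (thresholds are an assumption)."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

def knn_predict(train_x, train_y, query, k=3):
    """Predict by majority vote among the k nearest training vectors."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy token-count vectors (hypothetical columns: "great", "okay", "awful").
train_x = [(2, 0, 0), (1, 0, 0), (0, 1, 0), (0, 2, 0), (0, 0, 1), (0, 0, 2)]
train_stars = [5, 4, 3, 3, 1, 2]
train_y = [rating_to_sentiment(s) for s in train_stars]

# A review that uses "awful" three times lands nearest the low-rated examples.
prediction = knn_predict(train_x, train_y, (0, 0, 3))
```

The same `(features, labels)` interface would apply to the other classifiers named above; only the prediction rule changes.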