C++ Programming Assignment 5
Sentiment Analysis
Instructions:
This assignment uses pointers to perform a task that is common in computer applications: parsing text to find “words” (i.e., strings separated by a delimiter).
Assume you are working in the marketing department of a company that sells computers and accessories. Your team has been tasked with determining how customers perceive your products based on customer surveys. The questions in the surveys are open-ended. That is, customers write product reviews as paragraphs about their perceptions of your products. Your job is to prepare the customer responses to facilitate a sentiment analysis.
If you are unfamiliar with the term “Sentiment Analysis,” a very good description of it is provided by Revuze (https://www.revuze.it/blog/what-is-sentiment-analysis/?gclid=EAIaIQobChMIt_T_j--C5QIVFIiGCh00ng7fEAAYAiAAEgJX3vD_BwE). I’ve pasted an excerpt of it below:
[Note: Don’t panic as you read through this description. I included it only to provide you with a “big picture” idea of the tasks and challenges involved in performing a full-blown sentiment analysis. In this assignment, you are only required to perform the preliminary task of parsing the data to prepare it for further analysis.]
Sentiment analysis is the automated process to analyze a text and interpret the sentiments behind it. Through machine learning, algorithms can classify statements as positive, negative, and neutral.
This process, also known as “opinion mining,” is often used by companies and brands as a strategy for social media monitoring to manage large amounts of data and gain consumer insights to learn more about customers and competitors.
What Is Sentiment Analysis Used For?
Sentiment analysis is used to analyze social media posts, tweets, and online product reviews, as a way to track opinions, reactions, and ultimately improve customer service and experience. It’s great for market research, brand and product reputation monitoring, and customer experience analysis.
These tools are not only used for analysis purposes, but also for predictions. Previous research suggests, for example, that positive sentiments may have an upward effect on stock prices.
The Definition of Sentiment Analysis
Millions of people around the globe today express their feelings on products and brands through the internet, whether it’s in a Yelp review, in a Twitter thread, or in a Facebook post. Companies have a strong interest in intercepting these online “conversations,” so that they can learn more about their customers and users, as well as the customers of their competitors in the market.
Given the magnitude of available data, it’s impossible for companies to manually search for reviews and comments, analyze them, and classify them as positive or negative. Thanks to sentiment analysis, this process can be automated, so that these insights can be gathered and evaluated through algorithms.
Of course, training machines to interpret sentiments written in textual form can be challenging, and that’s why not all companies that offer sentiment analysis solutions actually succeed at performing this task.
How Does Sentiment Analysis Work, Exactly?
Let’s say you own a restaurant and you scout for online reviews. Sentiment analysis can analyze them and quickly classify them as “positive,” “negative,” or “neutral.” For example, “The food was delicious!” can be easily classified as strongly positive, while “The service [stinks]” will be identified as a strongly negative comment. Thanks to a “sentiment library,” a sentiment analysis tool can easily identify nouns, verbs, adjectives, and adverbs in these texts and recognize that “delicious” is an indicator of a positive reaction, while “[stinks]” is an indicator of a negative one.
If all reviews were so straightforward, it would be quite easy to train a machine to do the job. However, most reviews are more subtle and nuanced.
For instance, one reviewer may say, “The food was good, but the music was too loud.” Another might call the restaurant “Not bad.”
Sentiment analysis usually assesses the “score” of a text, placing it on a spectrum of attitudes that goes between +1 (totally positive) and -1 (totally negative). This way, machines are able to distinguish between an enthusiastic comment and a milder, still positive one.
For instance, let’s say your brand has recently put out a new commercial that has been played on television. You can use social media listening to see if people on Twitter have been commenting on your new ad. A sentiment analysis tool will be able to distinguish between different scores of positivity in the two following comments: (1) “I’m obsessed with this new commercial!” and (2) “That’s a cute commercial.” While both of them are positive, the first one will receive a higher score, as it’s clearly more enthusiastic.
Some Of The Challenges In Sentiment Analysis
As we mentioned earlier, a text can be quite hard for a machine to dissect and interpret.
A user may write: “We had to wait 45 minutes to get a table. Great!” To a human being, it’s clear that the adjective “Great!” is used in a sarcastic way. How do we know it? Because of context. We read the previous sentence, which talks about a long wait time, and we understand that the comment is not positive at all. A good sentiment analysis tool has to be able to detect sarcasm from the broader context, otherwise you’ll end up getting inaccurate data about your brand at the end of the analysis.
Another issue has to do with nuance. The comment “The movie was not bad” is literally saying that the movie was not bad, maybe even good; but it’s also implying that the expectations regarding this movie were so low that the movie is not as bad as one would have expected it to be. This is called “negator.”
Also “intensifiers” can be challenging for sentiment analysis. A user who writes “The company’s comment on this issue was pretty good,” creates a nuance that would not be there if we read the same sentence without the word “pretty.”
In conclusion, it’s important not to rely on very basic and simple sentiment analysis tools, which are definitely not going to capture the complexity of human sentiments expressed through text.
[End of Revuze excerpt]
The Program: Setting up classes and understanding tokenization
The first step in a sentiment analysis is to perform text mining to parse sentences in a product review document into words and store them in an array (or vector) along with their frequencies.
Tokenization:
The process of parsing sentences into words is called tokenization. To tokenize a sentence into words, use the function strtok_s(), which is available in Microsoft Visual C++. [Note: do not use the standard strtok function. It keeps its position in hidden static state, and Visual C++ flags it as unsafe and issues a deprecation warning.]
Your client code (main()) should contain a character array into which you will read an entire product review from a file. [You may define this character array to contain 1000 characters.] Using this character array, main() should call the function strtok_s() to tokenize the character array into separate words. Here is an excellent discussion of tokenizing and the strtok_s function:
https://msdn.microsoft.com/en-us/library/ftsafwz3.aspx
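As a rough illustration of the tokenizing loop described above, here is a minimal sketch. It assumes the delimiters are blanks, commas, periods, and newlines. NextToken and Tokenize are hypothetical helper names that are not required by the assignment. NextToken calls strtok_s under Visual C++ and, as an assumption for other compilers, falls back to the POSIX function strtok_r, which takes the same three arguments.

```cpp
#include <cassert>
#include <cstring>   // strtok_s (Visual C++) or strtok_r (POSIX)
#include <vector>

// Hypothetical wrapper: picks the reentrant tokenizer the compiler provides.
// Both functions take (text, delimiters, context) in this order.
inline char* NextToken(char* text, const char* delimiters, char** context) {
#ifdef _MSC_VER
    return strtok_s(text, delimiters, context);
#else
    return strtok_r(text, delimiters, context);
#endif
}

// Splits `review` in place and returns a pointer to each word found.
// The buffer is modified: the delimiter after each token becomes '\0'.
std::vector<char*> Tokenize(char* review) {
    assert(review != nullptr);
    const char* delimiters = " ,.\n";  // assumed delimiter set
    char* context = nullptr;            // tokenizer's saved position
    std::vector<char*> tokens;

    // The first call passes the buffer; later calls pass nullptr to continue.
    for (char* token = NextToken(review, delimiters, &context);
         token != nullptr;
         token = NextToken(nullptr, delimiters, &context)) {
        tokens.push_back(token);
    }
    return tokens;
}
```

In your own program the loop would normally sit in main() and hand each token to the Review object instead of collecting the pointers in a vector. Note that the pointers refer to positions inside the original character array, so the array must stay alive while the tokens are in use.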
Class Construction:
Write a class called Word that stores a word from a product review in a data member called “word.” The class should also contain an integer variable representing the number of times (i.e. frequency) that the word was found in a product review document. The class should have a one-argument constructor that receives a pointer to a c-string (character array) containing the word as its one parameter. (Note that the output of the strtok_s function described above is a pointer to a c-string containing the word that was parsed. This is what you will pass in to the Word constructor.) The Word constructor should also set the frequency of this Word object to 1. Appropriate set and get functions should be included for both the word and frequency data members.
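One possible shape for the Word class is sketched below. The assignment does not say how the word should be stored, so holding it in a std::string is an assumption, as are the exact getter and setter names. The lowercase conversion discussed under Processing is left out here to keep the sketch short.

```cpp
#include <cassert>
#include <string>

class Word {
public:
    // Receives a pointer to a C-string, such as a token from strtok_s,
    // and starts the frequency at 1.
    explicit Word(const char* text) : frequency(1) {
        assert(text != nullptr);
        word = text;  // copies the characters out of the token buffer
    }

    // Accessor names below are illustrative choices.
    const std::string& getWord() const { return word; }
    void setWord(const std::string& newWord) { word = newWord; }

    int getFrequency() const { return frequency; }
    void setFrequency(int newFrequency) { frequency = newFrequency; }

private:
    std::string word;  // assumed storage type
    int frequency;     // number of times the word was found
};
```

Because the constructor copies the characters, a Word object stays valid even after the character array that held the review is reused or goes out of scope.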
Write a class called Review that contains a vector of objects of the Word class. The class should contain functions to add a new Word object to the vector and to print out all of the Words in the vector with their respective frequencies.
Logic of the program
main()
In main(), create an object of the Review class. You will use this object to call the functions in the Review class that will manage the words retrieved from the document.
Data Inputs:
In this assignment, you will be reading in a single product review from a file.
You will prompt the user to enter the name of a file that contains the product review.
(Be sure to include error checking to ensure that the file can be successfully opened.) Assume that the sentences in the review consist of words separated by blanks, commas, and periods. There are no other punctuation marks, such as exclamation points, semi-colons, or colons. Only periods and commas exist in each sentence.
A product review contains fewer than 1000 characters.
Here is an example of what a product review could look like:
The laptop I purchased was awesome. It had twice as much memory and processing speed as my last computer, yet it was small and lightweight. I am very pleased with this purchase and the awesome customer service I received. I also purchased some accessories that were really great, too. Thanks for being such a great and awesome company. I will be purchasing more from you in the future.
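The file handling described above could look something like the following sketch. ReadReview is a hypothetical helper name, and returning false on failure is one assumed way to report that the file could not be opened. The function reads at most capacity - 1 characters so that the character array always has room for the terminating null character.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Reads the review file into `buffer` and null-terminates it.
// Returns false if the file cannot be opened.
bool ReadReview(const std::string& filename, char* buffer, std::size_t capacity) {
    assert(buffer != nullptr && capacity > 0);

    std::ifstream input(filename);
    if (!input.is_open()) {
        return false;  // caller can report the error and ask again
    }

    // read() stops at end of file; gcount() reports how many characters arrived.
    input.read(buffer, static_cast<std::streamsize>(capacity - 1));
    buffer[input.gcount()] = '\0';
    return true;
}
```

In main(), you might declare char review[1000], prompt for the filename, and call ReadReview(filename, review, sizeof(review)), printing an error message and prompting again whenever it returns false.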
Processing:
After an entire product review is read in to your character array, print out the entire review in sentence form. Next, tokenize each word using the strtok_s function described above. As each word in the review is tokenized, call a function in the Review class (passing in the token pointer that is returned from the strtok_s function) to add the new word to the Review object’s vector of words. Call this function AddWord.
Before storing the word as an object, the AddWord function in the Review class should search the existing vector of Word objects to determine if the word already exists in it. To do so, loop over the vector and compare each Word’s “word” with the new word to be added. If it does not already exist, the function should create a new Word object (passing in the pointer to the word as a parameter to the constructor of the Word class, which sets its frequency to 1) and add the new object to the vector in the Review class. If the word does already exist in the vector, you should merely call a function in the Word class that will increase the frequency of that Word object by 1.
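The search-then-add logic just described can be sketched as follows. The Word class here is a minimal stand-in so the example compiles on its own. IncrementFrequency and getWords are hypothetical names, and the output format of DisplayWords is an assumption, since the assignment leaves these details open.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

// Minimal stand-in for the Word class so this sketch compiles on its own.
class Word {
public:
    explicit Word(const char* text) : word(text), frequency(1) {}
    const std::string& getWord() const { return word; }
    int getFrequency() const { return frequency; }
    void IncrementFrequency() { ++frequency; }  // hypothetical name
private:
    std::string word;
    int frequency;
};

class Review {
public:
    // Adds the token as a new Word, or raises the count if it is already stored.
    void AddWord(const char* token) {
        assert(token != nullptr);
        for (Word& existing : words) {
            if (existing.getWord() == token) {
                existing.IncrementFrequency();
                return;  // found, so nothing new is stored
            }
        }
        words.push_back(Word(token));  // not found: store with frequency 1
    }

    // Prints each stored word with its frequency, one per line (assumed format).
    void DisplayWords(std::ostream& out = std::cout) const {
        for (const Word& w : words) {
            out << w.getWord() << ": " << w.getFrequency() << '\n';
        }
    }

    // Read-only access, included here only so the result can be inspected.
    const std::vector<Word>& getWords() const { return words; }

private:
    std::vector<Word> words;
};
```

The comparison in this sketch is case-sensitive. If your Word objects store lowercase text, the incoming token needs the same conversion before it is compared, or “Thanks” will never match a stored “thanks.”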
One note of caution: string comparisons in C++ are case-sensitive. Therefore, the word “thanks” is considered different from the word “Thanks.” Semantically, they are the same word, so we only want to store one instance of it. My recommendation is to create a function in the Word class that will convert the word into lowercase before assigning it to the “word” data member. The constructor of the class would be an excellent place to call this function (i.e., before the word is ever stored in the member variable). To summarize: store only the lowercase versions of each word in the Word objects.
For information on how to convert a character array into all lowercase characters, this question on Stack Overflow is a good resource:
https://stackoverflow.com/questions/27054353/c-converting-array-of-char-into-lower-case
Be sure to cite it in the resources section of your program if you use any of this code.
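For reference, one common way to do the conversion with the standard library is sketched below. It is written as a free function that returns a std::string so that it stands alone. Both choices are assumptions, and in your program the same loop could live in a member function of the Word class.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Returns a lowercase copy of a C-string; the original text is not modified.
std::string ConvertToLowerCase(const char* text) {
    assert(text != nullptr);
    std::string lowered(text);
    for (char& c : lowered) {
        // Cast to unsigned char first: passing a negative char value
        // to std::tolower is undefined behavior.
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return lowered;
}
```

With this helper, ConvertToLowerCase("Thanks") and ConvertToLowerCase("thanks") both yield "thanks", so the two spellings end up as a single entry.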
Naturally, in real life, you would want to eliminate words such as “the,” “an,” “and,” “but,” “or,” etc. You would also want to only include root words. That is, there is no need to store “enjoy,” “enjoyed,” and “enjoying” as separate words. Reducing words to their roots is called “stemming.” However, pruning your list to exclude words with no meaning and including only root words are tasks that are beyond the scope of this assignment. Therefore, your vector should contain a unique list of all the words in the product review with their respective frequencies.
Output:
The output of the program should be a listing of all of the words in the review with a count of their frequencies. You do not have to assign meaning to any of these words as would be required in a full-blown sentiment analysis. However, this program performs the necessary first steps in preparing the data for a more extensive investigation of consumer sentiment.
Here is an example of how your output might look:
To give you an idea of the general criteria that will be used for grading, here is a checklist that you might find helpful:
Program executes without crashing
Appropriate Internal Documentation
main()
    Declares variables appropriately (uses pointers where needed for tokenization and creates a Review object for storage of Words)
    Prompts user to enter a filename with appropriate error checking
    Contains a loop to tokenize each word in the product review using the strtok_s function
    Calls the AddWord function in the Review class (passing in each token)
Review class
    Constructor(s) as appropriate
    Data members: vector of Word objects
    Functions: AddWord, DisplayWords, getters/setters as appropriate
Word class
    Constructor(s) as appropriate
    Data members: word and frequency
    Functions: ConvertToLowerCase, getters/setters as appropriate
Review the submission instructions document prior to uploading your work to Blackboard.
Submit this assignment by 11:59 p.m. (ET) on Monday of Module/Week 5.