Assignment Goals
The goal of this assignment is to write a class that 'learns' from a file of movie reviews (with a 0-4 score and some text), and uses that information to predict what score a new review might get.
From a technical point of view, this assignment practices structuring and developing a class, and builds on examples of using HashMaps and ArrayLists.
Getting Started
To start, download the files in this folder. It contains a Java file and two text files.
In Eclipse, make an a7 package. Add the three files to this package.
How the Predictor Works
The predictor reads in a file of reviews. Each review is on its own line, with a 0-4 score as the first item on the line.
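As a rough illustration of that line format, here is a minimal sketch of splitting one review line into its score and its words. The class and method names (ReviewLineSketch, scoreOf, wordsOf) are hypothetical and not part of the starter code, and the sketch assumes tokens are separated by whitespace with no special handling of punctuation or capitalization.

```java
import java.util.Arrays;

class ReviewLineSketch {
    // The score is assumed to be the first whitespace-separated token.
    static int scoreOf(String line) {
        String[] tokens = line.trim().split("\\s+");
        return Integer.parseInt(tokens[0]);
    }

    // Everything after the score is treated as the review's words.
    static String[] wordsOf(String line) {
        String[] tokens = line.trim().split("\\s+");
        return Arrays.copyOfRange(tokens, 1, tokens.length);
    }

    public static void main(String[] args) {
        String line = "3 pretty decent movie";
        System.out.println(scoreOf(line));                  // prints 3
        System.out.println(Arrays.toString(wordsOf(line))); // prints [pretty, decent, movie]
    }
}
```

Note that with this simple split, a token such as "idea," keeps its trailing comma. Whether to clean that up is a design choice the starter code comments may address.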
When the predictor sees a word, it associates that word with the score of the review it is in. For example, the word "good" might be in a 3-star review, so we want to associate "good" with the value 3.
However, the word "good" may appear in multiple reviews, or even multiple times in a single review. "Good" may appear in a review like "1 good idea, but bad movie". Rather than trying to understand the context of its use, the predictor takes the simple approach of averaging all the scores it is associated with. If "good" appears in a 2-star review, a 1-star review, and four 3-star reviews, the predictor adds up all those scores (2 + 1 + 3 + 3 + 3 + 3 = 15), counts the number of times the word appears (6), and calculates an average value of 15/6, or 2.5.
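The arithmetic in the "good" example can be sketched with two maps, one for running totals and one for running counts. This is only an illustration under assumptions: the class name WordAverageSketch and the helper methods record and average are hypothetical, and the starter code may organize this work differently.

```java
import java.util.HashMap;
import java.util.Map;

class WordAverageSketch {
    // Records one appearance of a word in a review with the given score.
    static void record(Map<String, Integer> totalScores, Map<String, Integer> wordCount,
                       String word, int score) {
        totalScores.merge(word, score, Integer::sum); // running sum of scores
        wordCount.merge(word, 1, Integer::sum);       // running number of appearances
    }

    // Average score for a word that has been recorded at least once.
    static double average(Map<String, Integer> totalScores, Map<String, Integer> wordCount,
                          String word) {
        int total = totalScores.get(word);
        int count = wordCount.get(word);
        return (double) total / count; // cast avoids integer division
    }

    public static void main(String[] args) {
        Map<String, Integer> totalScores = new HashMap<>();
        Map<String, Integer> wordCount = new HashMap<>();

        // "good" seen in a 2-star review, a 1-star review, and four 3-star reviews.
        int[] reviewScores = {2, 1, 3, 3, 3, 3};
        for (int score : reviewScores) {
            record(totalScores, wordCount, "good", score);
        }

        System.out.println(totalScores.get("good"));                   // prints 15
        System.out.println(wordCount.get("good"));                     // prints 6
        System.out.println(average(totalScores, wordCount, "good"));   // prints 2.5
    }
}
```

The cast to double matters: 15 / 6 with two int values would give 2, not 2.5.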
This process is repeated for all words. One adjustment to make is that average words (those with a value around 2) are ignored. For example, "a" and "the" appear in a lot of reviews, but don't really contribute to the meaning. They also tend to overwhelm the values of the interesting words. So the predictor ignores those. The code has some comments on how to approach this.
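One possible way to drop the neutral words is sketched below. The cutoff is an assumption: NEUTRAL_BAND = 0.25 is a made-up value chosen only for illustration, and the comments in the starter code are the authority on what "around 2" should mean. The class and method names are likewise hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class NeutralWordFilterSketch {
    static final double NEUTRAL_SCORE = 2.0;
    static final double NEUTRAL_BAND = 0.25; // hypothetical cutoff, not from the assignment

    // Builds word -> average score, skipping words whose average is close to 2.
    static Map<String, Double> buildWordValues(Map<String, Integer> totalScores,
                                               Map<String, Integer> wordCount) {
        Map<String, Double> wordValue = new HashMap<>();
        for (Map.Entry<String, Integer> entry : totalScores.entrySet()) {
            int total = entry.getValue();
            int count = wordCount.get(entry.getKey());
            double average = (double) total / count;
            if (Math.abs(average - NEUTRAL_SCORE) >= NEUTRAL_BAND) {
                wordValue.put(entry.getKey(), average);
            }
        }
        return wordValue;
    }

    public static void main(String[] args) {
        Map<String, Integer> totalScores = new HashMap<>();
        Map<String, Integer> wordCount = new HashMap<>();
        totalScores.put("a", 4);         wordCount.put("a", 2);         // average 2.0
        totalScores.put("excellent", 4); wordCount.put("excellent", 1); // average 4.0
        totalScores.put("bad", 0);       wordCount.put("bad", 1);       // average 0.0

        Map<String, Double> wordValue = buildWordValues(totalScores, wordCount);
        System.out.println(wordValue.containsKey("a"));   // prints false
        System.out.println(wordValue.get("excellent"));   // prints 4.0
        System.out.println(wordValue.get("bad"));         // prints 0.0
    }
}
```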
Finally, given a new review that needs a predicted score, each word in the review is checked for its value. The total word value is added up and then an average for the review is found. As an example, the predictor may have learned the following associations from the set of reviews: { "good" = 3, "bad" = 1, "wonderful" = 4, "film" = 3}. Then, given a review "a good film", the predictor would ignore "a" and add up the values of "good" and "film" to get a total score of 6. Since 2 words were used to get that total, the average is 3, which is the predicted score. If the review were a real review with a score, we can compare the predicted and actual score to see how the approach is working.
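The "a good film" example can be sketched as follows. The names PredictSketch and predict are hypothetical, and returning 2.0 when no word in the review is known is an assumption made here for illustration; the Javadocs in the starter code should be followed if they say otherwise.

```java
import java.util.HashMap;
import java.util.Map;

class PredictSketch {
    // Averages the values of the known words in the review text.
    static double predict(Map<String, Double> wordValue, String reviewText) {
        double total = 0.0;
        int wordsUsed = 0;
        for (String word : reviewText.trim().split("\\s+")) {
            Double value = wordValue.get(word); // null if the word was never learned
            if (value != null) {
                total += value;
                wordsUsed++;
            }
        }
        // Assumption: fall back to the neutral score when no word is known.
        return wordsUsed == 0 ? 2.0 : total / wordsUsed;
    }

    public static void main(String[] args) {
        Map<String, Double> wordValue = new HashMap<>();
        wordValue.put("good", 3.0);
        wordValue.put("bad", 1.0);
        wordValue.put("wonderful", 4.0);
        wordValue.put("film", 3.0);

        // "a" is not in the map, so only "good" and "film" count: (3 + 3) / 2.
        System.out.println(predict(wordValue, "a good film")); // prints 3.0
    }
}
```

Dividing by the number of words actually used, rather than the number of words in the review, is what keeps ignored words like "a" from pulling the prediction down.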
The Structure of the Starter Code
The beginning of a class structure is provided to you. You should follow this structure. There are Javadocs describing what each method does. In addition, there are comments giving the outline of a solution approach.
Here is the basic design of the class. It is intended that an instance of this class will have learned word values from a file and will store those values in a wordValue instance variable that maps between a word key and a score value.
The constructor's job is to make a ready-to-go predictor. It takes a filename in the constructor, and calls methods to read the file and build the HashMap.
The main method drives the overall program. It makes an instance of the predictor class, then reads in a file of reviews to test and prints out the predicted and actual score for each review.
Other methods need to be implemented as described in their Javadocs and comments.
This assignment closely follows the example of the vote counting program in the Week 9 recitation. You are urged to look carefully at that code and to adapt it as needed for this assignment.
Some variables are instance variables. The wordValue HashMap gets used repeatedly as each review asks for a prediction. Other data in the program is transitory: it is used to build up the final word values, but it does not need to persist with each instance. For that data, the intent is for you to make local variables as needed in the methods.
Finally, note that the linesFromFile method is static. It has more of a helper status in the class, and doesn't need access to the instance variables.
And definitely finally, some of the methods are public that are not called in main. This suggests that they could have been made private, as they are only called from other methods of the class. I have kept them public as an aid to testing during submission.
Developing Your Program
My suggestion for developing the program is to:
- Review the vote counting recitation. There is a connection between counting the votes associated with a candidate and totaling the scores associated with a word.
- Work through an example by hand. Use smallReviews.txt to make a totalScores HashMap and a wordCount HashMap, and then from those make a wordValue HashMap.
- Implement the code and check each method with the results you expect from doing it by hand.
- Try it on the full reviews file.
- Study the predicted results and actual scores.
At the very bottom of the Java file, add a block comment and write a short paragraph describing how well you think the predictor works, any trends you see, and any explanation you have as to why the predicted and actual scores might not match.
You do not need to modify comments, but they should be in the correct place in the code.
Here is some example output from learning from smallReviews.txt and then testing on the same file:
Word totals: {a=4, pretty=3, movie=7, bad=0, excellent=4, decent=3}
Word counts: {a=2, pretty=1, movie=3, bad=1, excellent=1, decent=1}
{pretty=3.0, movie=2.3333333333333335, bad=0.0, excellent=4.0, decent=3.0}
Predicted: 3.2 actual: 4 a excellent movie
Predicted: 1.2 actual: 0 a bad movie
Predicted: 2.8 actual: 3 pretty decent movie
Submitting
Submit just the Java file with the working predictor. Make sure to remove the debugging prints I have in the computeScoreAndCounts method. Make sure you have some explanatory text at the bottom of the file. Add your name with an additional @author tag at the top.