You can submit the assignment in groups of 2. I strongly suggest that you work in groups of 2.
Use the GUM treebank training data from here:
https://github.com/UniversalDependencies/UD_English-GUM/blob/master/en_gum-ud-train.conllu
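The treebank is in CoNLL-U format: one token per line with ten tab-separated columns, comment lines starting with `#`, and a blank line between sentences. A minimal reading sketch follows. It assumes you tag with the UPOS column (column 4) and skip multiword-token ranges and empty nodes; the assignment does not say which tag column to use, so treat that choice, and the helper name `read_conllu`, as assumptions.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (word form, UPOS tag) pairs


def read_conllu(lines: Iterable[str]) -> List[Sentence]:
    """Collect (form, UPOS) pairs per sentence from CoNLL-U lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment / metadata
        cols = line.split("\t")
        # IDs like "1-2" mark multiword tokens, IDs like "1.1" mark empty nodes.
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append((cols[1], cols[3]))  # FORM, UPOS
    if current:
        sentences.append(current)
    return sentences


# Tiny hand-written sample (not taken from GUM) to show the expected shape.
sample_lines = [
    "# sent_id = demo-1",
    "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_",
    "2\tdog\tdog\tNOUN\tNN\t_\t3\tnsubj\t_\t_",
    "3\tbarks\tbark\tVERB\tVBZ\t_\t0\troot\t_\t_",
    "",
    "# sent_id = demo-2",
    "1-2\tdon't\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tdo\tdo\tAUX\tVBP\t_\t3\taux\t_\t_",
    "2\tn't\tnot\tPART\tRB\t_\t3\tadvmod\t_\t_",
    "3\tgo\tgo\tVERB\tVB\t_\t0\troot\t_\t_",
]
sentences = read_conllu(sample_lines)
print(sentences)
```

For the real data, you could pass an open file handle instead of the list, for example `read_conllu(open("en_gum-ud-train.conllu", encoding="utf-8"))` after downloading the file.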
HMMs are described well in Section 8.4 of the linked chapter: https://web.stanford.edu/~jurafsky/slp3/8.pdf
Components of an HMM tagger (40 points) [For everybody]
Undergrads and graduates: Use the equations in 8.4.3 to implement the emission and transition probabilities. If you want to add smoothing, check equation 3.23 in chapter 3 of the book; it can be applied to both the transition and emission probabilities. Don't forget to add the start-of-sentence token (for example, <s>) when computing the transition probabilities.
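The maximum-likelihood estimates are count ratios: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}) and P(w_i | t_i) = C(t_i, w_i) / C(t_i). One way to sketch this is shown below. The names `train_hmm` and `START`, the add-k parameter `k`, and the extra vocabulary slot reserved for unknown words are all assumptions made for illustration; with `k=0` the functions return the unsmoothed estimates.

```python
from collections import Counter, defaultdict

START = "<s>"  # assumed name for the start-of-sentence pseudo-tag


def train_hmm(sentences, k=0.0):
    """Estimate transition and emission probabilities from tagged sentences.

    sentences: list of sentences, each a list of (word, tag) pairs.
    k: add-k smoothing constant (an assumption; k=0 means no smoothing).
    """
    transition_counts = defaultdict(Counter)  # previous tag -> next tag -> count
    emission_counts = defaultdict(Counter)    # tag -> word -> count
    tag_counts = Counter()                    # C(tag), including START
    vocab = set()

    for sentence in sentences:
        prev = START
        tag_counts[START] += 1
        for word, tag in sentence:
            transition_counts[prev][tag] += 1
            emission_counts[tag][word] += 1
            tag_counts[tag] += 1
            vocab.add(word)
            prev = tag

    tags = sorted(emission_counts)  # START is never emitted, so it is excluded

    def transition(prev, tag):
        # (C(prev, tag) + k) / (C(prev) + k * number_of_tags)
        denom = tag_counts[prev] + k * len(tags)
        return (transition_counts[prev][tag] + k) / denom if denom else 0.0

    def emission(tag, word):
        # (C(tag, word) + k) / (C(tag) + k * (|V| + 1)); the +1 is a slot
        # reserved for unseen words, which is one possible choice among several.
        denom = tag_counts[tag] + k * (len(vocab) + 1)
        return (emission_counts[tag][word] + k) / denom if denom else 0.0

    return tags, transition, emission


# Toy corpus, invented for illustration.
toy = [
    [("the", "DET"), ("dog", "NOUN")],
    [("the", "DET"), ("cat", "NOUN")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]
tags, transition, emission = train_hmm(toy)
print(tags, transition(START, "DET"), emission("NOUN", "dog"))
```

Without smoothing, any word that never occurred with a tag in training gets emission probability 0 for that tag, so unseen test words score 0 for every tag. That is the main practical reason to consider smoothing here.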
Greedy Tagger (60 points) [For everybody]
Implement a greedy tagger. At each step, choose the best tag: the one that maximizes the product of the transition probability (given the tag chosen at the previous step) and the emission probability. You don't have to implement the Viterbi algorithm to find the best tag sequence. Think greedy!
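The greedy decision rule can be sketched as a single left-to-right pass. The function below assumes `transition(prev_tag, tag)` and `emission(tag, word)` are callables returning probabilities; those signatures, the name `greedy_tag`, and the toy probability tables are hypothetical and only serve the demonstration.

```python
def greedy_tag(words, tags, transition, emission, start="<s>"):
    """Tag left to right, committing to the locally best tag at each step."""
    prev = start
    output = []
    for word in words:
        # Score each candidate by P(tag | prev) * P(word | tag).
        best = max(tags, key=lambda t: transition(prev, t) * emission(t, word))
        output.append(best)
        prev = best  # the choice is final; later words cannot revise it
    return output


# Hand-made toy probabilities (invented numbers, not estimated from GUM).
TRANS = {
    ("<s>", "DET"): 0.8, ("<s>", "NOUN"): 0.2,
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3,
}
EMIT = {
    ("DET", "the"): 1.0,
    ("NOUN", "dog"): 0.6, ("NOUN", "barks"): 0.1,
    ("VERB", "barks"): 0.9, ("VERB", "dog"): 0.05,
}


def toy_transition(prev, tag):
    return TRANS.get((prev, tag), 0.0)


def toy_emission(tag, word):
    return EMIT.get((tag, word), 0.0)


result = greedy_tag(["the", "dog", "barks"], ["DET", "NOUN", "VERB"],
                    toy_transition, toy_emission)
print(result)
```

Note that when every candidate scores 0 (for example, an unseen word without smoothing), `max` simply returns the first tag in `tags`, so you may want an explicit fallback.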
Viterbi Tagger (50 points) [For extra credit]
Implement the Viterbi tagger as given in 8.4.5. The backpointer part needs to be implemented for outputting the best sequence.
Reading: Read Section A.4 for worked-out examples of the Viterbi algorithm.
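As a rough illustration of the recursion and the backpointer step, here is a small sketch. It assumes the same hypothetical `transition(prev_tag, tag)` and `emission(tag, word)` callables as a greedy tagger would use, multiplies raw probabilities (fine for short sentences; summing log probabilities is the usual way to avoid underflow on long ones), and uses an invented two-tag example in which committing early to the locally best tag gives a worse overall sequence.

```python
def viterbi_tag(words, tags, transition, emission, start="<s>"):
    """Return the most probable tag sequence for words under the HMM."""
    if not words:
        return []
    # viterbi[i][t]: probability of the best path ending in tag t at position i.
    # backpointer[i][t]: the previous tag on that best path.
    viterbi = [{t: transition(start, t) * emission(t, words[0]) for t in tags}]
    backpointer = [{t: None for t in tags}]

    for i in range(1, len(words)):
        viterbi.append({})
        backpointer.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: viterbi[i - 1][p] * transition(p, t))
            viterbi[i][t] = (viterbi[i - 1][best_prev]
                             * transition(best_prev, t)
                             * emission(t, words[i]))
            backpointer[i][t] = best_prev

    # Start from the best final tag and follow the backpointers to position 0.
    path = [max(tags, key=lambda t: viterbi[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(backpointer[i][path[-1]])
    path.reverse()
    return path


# Invented toy model with abstract tags X and Y.
TRANS = {("<s>", "X"): 0.6, ("<s>", "Y"): 0.4,
         ("X", "X"): 0.5, ("X", "Y"): 0.5, ("Y", "Y"): 1.0}
EMIT = {("X", "a"): 0.5, ("Y", "a"): 0.5, ("X", "b"): 0.1, ("Y", "b"): 0.9}


def toy_transition(prev, tag):
    return TRANS.get((prev, tag), 0.0)


def toy_emission(tag, word):
    return EMIT.get((tag, word), 0.0)


best_path = viterbi_tag(["a", "b"], ["X", "Y"], toy_transition, toy_emission)
print(best_path)
```

In this toy model, the locally best first tag is X (0.6 * 0.5 = 0.3 versus 0.4 * 0.5 = 0.2), but the best full sequence starts with Y, because Y, Y scores 0.2 * 1.0 * 0.9 = 0.18 while the best sequence starting with X scores 0.3 * 0.5 * 0.9 = 0.135.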
Don't hesitate to contact me if you have questions about your code. Best of luck.
Testing:
Test your tagger on the test dataset here:
https://github.com/UniversalDependencies/UD_English-GUM/blob/master/en_gum-ud-test.conllu
What are the accuracy and F-scores of your tagger? You can use sklearn's metrics module to compute them.
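A short sketch of the evaluation, assuming you have flattened the gold and predicted tags of all test sentences into two parallel lists. The tag lists below are invented placeholders, and reporting both macro- and micro-averaged F1 is a choice made here since the assignment does not say which average to use.

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder data: one gold tag and one predicted tag per test token.
gold_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
predicted_tags = ["DET", "NOUN", "VERB", "DET", "VERB"]

accuracy = accuracy_score(gold_tags, predicted_tags)
# Macro-F1 averages the per-tag F1 scores, so rare tags count as much as frequent ones.
macro_f1 = f1_score(gold_tags, predicted_tags, average="macro")
# Micro-F1 pools all tokens; for single-label tagging it equals accuracy.
micro_f1 = f1_score(gold_tags, predicted_tags, average="micro")

print(f"accuracy={accuracy:.3f} macro-F1={macro_f1:.3f} micro-F1={micro_f1:.3f}")
```

`sklearn.metrics.classification_report(gold_tags, predicted_tags)` prints per-tag precision, recall, and F1 if you want a breakdown by tag.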
Grading: You will get partial credit for any submitted work.