Viterbi algorithm

You will develop a first-order HMM (Hidden Markov Model) for POS (part of speech) tagging in Python. This involves:
• counting occurrences of one part of speech following another in a training corpus,
• counting occurrences of words together with parts of speech in a training corpus,
• relative frequency estimation with smoothing,
• finding the best sequence of parts of speech for a list of words in the test corpus, according to an HMM with smoothed probabilities,
• computing the accuracy, that is, the percentage of parts of speech that are guessed correctly.

As discussed in the lectures, smoothing is necessary to avoid zero probabilities for events that were not witnessed in the training corpus. Rather than implementing a form of smoothing yourself, for this assignment you can use the implementation of Witten-Bell smoothing in NLTK (among the forms of smoothing in NLTK, this seems to be the most robust one). An example of its use for emission probabilities is in the file smoothing.py; smoothing can be applied to transition probabilities in the same way.

Run your application on the English (EWT) training and testing corpora. You should get an accuracy above 89%. If your accuracy is much lower, then you are probably doing something wrong.

Comparisons between languages

Investigate, by visual inspection and by computational means, the upos parts of speech in different treebanks from Universal Dependencies. (Take a few languages based on your own interests, but no more than about 10. Go for the quality of your submission, not quantity!) Two examples of specific questions you could address:
• Which of the chosen languages have a rich morphology and which have a poor morphology?
• How similar are the chosen languages, in terms of bigram models of their parts of speech?

For the first question, know that you can access the lemma of a token by token['lemma'].
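As a rough starting point for the first question, the sketch below counts how many distinct word forms are observed per lemma. It assumes each token is dict-like with 'form' and 'lemma' keys, in line with the token['lemma'] access mentioned above; the function name forms_per_lemma and the toy data are hypothetical, and lowercasing is only one possible normalisation.

```python
from collections import defaultdict


def forms_per_lemma(sentences):
    """Average number of distinct word forms observed per lemma.

    `sentences` is an iterable of token lists; each token is assumed to be
    dict-like with 'form' and 'lemma' keys (an assumption of this sketch).
    """
    forms_by_lemma = defaultdict(set)
    for sentence in sentences:
        for token in sentence:
            lemma = token['lemma']
            if lemma is None:
                # Skip tokens that carry no lemma annotation.
                continue
            forms_by_lemma[lemma.lower()].add(token['form'].lower())
    if not forms_by_lemma:
        return 0.0
    total_forms = sum(len(forms) for forms in forms_by_lemma.values())
    return total_forms / len(forms_by_lemma)


# Toy data standing in for parsed treebank sentences.
toy = [
    [{'form': 'walks', 'lemma': 'walk'}, {'form': 'walked', 'lemma': 'walk'}],
    [{'form': 'walking', 'lemma': 'walk'}, {'form': 'dog', 'lemma': 'dog'}],
]
ratio = forms_per_lemma(toy)
print(ratio)
```

The ratio tends to grow with corpus size, so a comparison between languages is probably fairer on samples with a similar number of tokens.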
What can you say about the relation between forms and lemmas in the case of languages with rich morphology?

For the second question, consider that the transition probabilities of two related languages may be very similar, even though the emission probabilities may be incomparable due to the mostly disjoint vocabularies. How could we measure the similarity between two bigram models trained from corpora?

Feel free to think of further questions to address. It is worth noting that next to the ('universal') upos tags, the Universal Dependencies treebanks sometimes also contain language-specific (xpos) tags.
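One possible answer to the similarity question, offered as a sketch and not as the intended solution, is the Jensen-Shannon divergence between the relative frequencies of tag bigrams in two corpora. It is symmetric and stays finite when a bigram occurs in only one corpus, so no smoothing is needed for the comparison itself. The function names, the '<s>' and '</s>' boundary markers, and the toy tag sequences are assumptions of this example.

```python
import math
from collections import Counter


def bigram_distribution(tag_sequences):
    """Relative frequencies of tag bigrams, with sentence boundary markers."""
    counts = Counter()
    for tags in tag_sequences:
        padded = ['<s>'] + list(tags) + ['</s>']
        counts.update(zip(padded, padded[1:]))
    total = sum(counts.values())
    return {bigram: n / total for bigram, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    divergence = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture of the two distributions
        if pk > 0:
            divergence += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * math.log2(qk / mk)
    return divergence


# Toy upos sequences standing in for two treebanks.
lang_a = [['DET', 'NOUN', 'VERB'], ['PRON', 'VERB']]
lang_b = [['NOUN', 'DET', 'VERB'], ['PRON', 'VERB']]
dist_a = bigram_distribution(lang_a)
dist_b = bigram_distribution(lang_b)
score = js_divergence(dist_a, dist_b)
print(score)
```

This compares joint bigram frequencies. Comparing the conditional transition probabilities per preceding tag is an alternative that separates tag order from tag frequency.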

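For the tagging task described above (finding the best sequence of parts of speech, then computing the accuracy), the decoding step can be sketched as follows. This is a minimal log-space Viterbi under stated assumptions, not the reference solution: trans_logprob and emit_logprob are hypothetical callables, and the toy tables with a fixed FLOOR value merely stand in for the Witten-Bell smoothed estimates that the assignment takes from NLTK.

```python
import math


def viterbi(words, tags, trans_logprob, emit_logprob, start='<s>', end='</s>'):
    """Most probable tag sequence for `words` under a first-order HMM."""
    if not words:
        return []
    # best[i][t]: log-probability of the best path ending in tag t at position i.
    best = [{t: trans_logprob(start, t) + emit_logprob(t, words[0]) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + trans_logprob(p, t))
            best[i][t] = (best[i - 1][prev] + trans_logprob(prev, t)
                          + emit_logprob(t, words[i]))
            back[i][t] = prev
    # Include the transition to the end-of-sentence marker.
    last = max(tags, key=lambda t: best[-1][t] + trans_logprob(t, end))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


def accuracy(gold_sequences, predicted_sequences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_sequences, predicted_sequences):
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return correct / total if total else 0.0


# Toy probabilities (hypothetical); FLOOR stands in for smoothed unseen events.
FLOOR = 1e-6
TRANS = {('<s>', 'DET'): 0.8, ('<s>', 'NOUN'): 0.2,
         ('DET', 'NOUN'): 0.9, ('DET', 'DET'): 0.05, ('DET', '</s>'): 0.05,
         ('NOUN', 'NOUN'): 0.1, ('NOUN', 'DET'): 0.1, ('NOUN', '</s>'): 0.8}
EMIT = {('DET', 'the'): 0.9, ('NOUN', 'dog'): 0.8,
        ('NOUN', 'the'): 0.01, ('DET', 'dog'): 0.01}


def trans_logprob(prev, tag):
    return math.log(TRANS.get((prev, tag), FLOOR))


def emit_logprob(tag, word):
    return math.log(EMIT.get((tag, word), FLOOR))


predicted = viterbi(['the', 'dog'], ['DET', 'NOUN'], trans_logprob, emit_logprob)
print(predicted, accuracy([['DET', 'NOUN']], [predicted]))
```

Working with log-probabilities avoids numerical underflow on long sentences, which products of raw probabilities would run into.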
Jun 07, 2022