Refer to Part-of-speech tagging, NLTK Chapter 5, to learn more about part-of-speech taggers.
For this problem, you will do a lot of processing on the tagged version of the Universal Dependencies (UD)
corpora. Large POS-tagged corpora are typically distributed in CoNLL-U format, which is explained here:
https://universaldependencies.org/docs/format.html
Read the explanation so that you can answer the questions below. Download an example POS-tagged corpus here:
https://github.com/UniversalDependencies/UD_English-GUM/blob/master/en_gum-ud-train.conllu
You will work with the UPOSTAG field in the data.
You will need to submit your code as well as the answers to the questions.
HINT: Use a dictionary of dictionaries for many of the questions below.
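As a starting point for the hint above, here is a minimal sketch of reading CoNLL-U lines and building a dictionary of dictionaries that maps each word to its tag counts. The names read_conllu, count_word_tags, row, and SAMPLE are made up for illustration, and lowercasing the word forms is an assumption you may want to revisit. The sketch relies on the CoNLL-U layout of ten tab-separated columns, where the second column is the word form and the fourth is the UPOSTAG; multiword-token ranges (IDs such as 1-2) and empty nodes (IDs such as 1.1) are skipped.

```python
from collections import Counter, defaultdict


def read_conllu(lines):
    """Yield each sentence as a list of (form, upos) pairs."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:                  # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # comment or metadata line
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append((cols[1], cols[3]))   # FORM, UPOSTAG
    if sentence:                      # file may not end with a blank line
        yield sentence


def count_word_tags(sentences, lowercase=True):
    """Build word -> {tag: count} (lowercasing is an assumption)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for form, tag in sentence:
            key = form.lower() if lowercase else form
            counts[key][tag] += 1
    return counts


def row(idx, form, upos):
    """Build one 10-column CoNLL-U line for the toy sample below."""
    return "\t".join([idx, form, "_", upos, "_", "_", "_", "_", "_", "_"])


# A tiny hand-made sample, not taken from the GUM corpus.
SAMPLE = [
    "# text = The bark",
    row("1", "The", "DET"),
    row("2", "bark", "NOUN"),
    "",
    "# text = Don't bark",
    row("1-2", "Don't", "_"),
    row("1", "Do", "AUX"),
    row("2", "n't", "PART"),
    row("3", "bark", "VERB"),
    "",
]

sentences = list(read_conllu(SAMPLE))
counts = count_word_tags(sentences)
# Words that appear with at least two distinct tags.
ambiguous = {word for word, tags in counts.items() if len(tags) > 1}
print(dict(counts["bark"]), ambiguous)
```

To run this on the real data, pass an open file object instead of SAMPLE, for example open("en_gum-ud-train.conllu", encoding="utf-8"), assuming you saved the raw file under that name.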
1. Write programs to process the above corpus and find answers to the following questions:
   1. Which nouns are more common in their plural form than in their singular form? (Only
      consider regular plurals, formed with the -s suffix; don't worry about irregular plurals for
      now.)
   2. Which word has the greatest number of distinct tags? What are the tags, and which parts of
      speech do they represent?
   3. List the tags in order of decreasing frequency. Which parts of speech do the 20 most
      frequent tags represent? Do you see any patterns?
   4. Which tags most commonly precede nouns, and which parts of speech do these tags
      represent?
2. Generate some statistics for tagged data to answer the following questions:
   1. What proportion of word types are always assigned the same part-of-speech tag?
   2. How many words are ambiguous, in the sense that they appear with at least two tags?
   3. What percentage of word tokens in the UD Corpus involve these ambiguous words?
3. Now that you are familiar with the corpus, you will build a most-frequent-tag tagger. This
   tagger does not take context into account. It is a simple dictionary-based tagger that assigns
   each word its most frequent tag. You will get the word-tag counts from the downloaded training corpus.
   You will test your dictionary-based tagger on the test corpus available here:
   https://github.com/UniversalDependencies/UD_English-GUM/blob/master/en_gum-ud-test.conllu
   1. Write a function that takes a sentence and the dictionary as input. The output
      should consist of a tag for each word in the sentence. Repeat this step for every sentence in
      the test corpus.
   2. Now evaluate the accuracy of the tagger that you just built.
      1. Across all the sentences, how many words were tagged correctly? You can ignore
         sentence boundaries. What are the precision, recall, and F-score?
      2. How many sentences were tagged completely correctly, that is, with every word assigned
         its correct tag? You will need to loop through each sentence to determine this count.
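The tagger and evaluation described in part 3 can be sketched as follows. This is one possible outline under stated assumptions, not a reference solution: the function names are invented for illustration, words are lowercased, unseen words fall back to NOUN (an arbitrary default), and the toy sentences stand in for the real training and test files. Because every token receives exactly one tag, micro-averaged precision, recall, and F-score all equal token accuracy, so the sketch also computes per-tag scores.

```python
from collections import Counter, defaultdict


def train_most_frequent(tagged_sentences):
    """Map each (lowercased) word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # most_common(1) returns a one-element list [(tag, count)].
    return {w: tags.most_common(1)[0][0] for w, tags in counts.items()}


def tag_sentence(words, model, default_tag="NOUN"):
    """Tag each word; default_tag for unseen words is an assumption."""
    return [model.get(word.lower(), default_tag) for word in words]


def evaluate(test_sentences, model):
    """Return token accuracy, perfect-sentence count, and per-tag P/R/F."""
    correct = total = perfect = 0
    tp, gold_n, pred_n = Counter(), Counter(), Counter()
    for sentence in test_sentences:
        words = [w for w, _ in sentence]
        gold = [t for _, t in sentence]
        pred = tag_sentence(words, model)
        hits = 0
        for g, p in zip(gold, pred):
            gold_n[g] += 1
            pred_n[p] += 1
            if g == p:
                tp[g] += 1
                hits += 1
        correct += hits
        total += len(gold)
        if hits == len(gold):         # every word in the sentence is right
            perfect += 1
    per_tag = {}
    for tag in set(gold_n) | set(pred_n):
        precision = tp[tag] / pred_n[tag] if pred_n[tag] else 0.0
        recall = tp[tag] / gold_n[tag] if gold_n[tag] else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        per_tag[tag] = (precision, recall, f_score)
    return {
        "correct": correct,
        "total": total,
        "accuracy": correct / total if total else 0.0,
        "perfect_sentences": perfect,
        "per_tag": per_tag,
    }


# Toy data for illustration only; use the parsed corpora in practice.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("bark", "NOUN")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("they", "PRON"), ("bark", "VERB")],
]
test = [
    [("the", "DET"), ("dogs", "NOUN"), ("bark", "VERB")],
    [("the", "DET"), ("bark", "NOUN")],
]

model = train_most_frequent(train)
result = evaluate(test, model)
print(result["correct"], result["total"], result["perfect_sentences"])
```

In the toy data, "bark" is seen twice as VERB and once as NOUN, so the tagger always outputs VERB and gets the second test sentence wrong. That is the kind of error a context-free tagger cannot avoid.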