CS 6320.002: Natural Language Processing, Fall 2020
Homework 3 – 55 points
Issued 04 Oct. 2021
Due 11:59pm CDT, 18 Oct. 2021

Deliverables: Your completed hw3.py file, uploaded to Gradescope.

Warning! This assignment uses a large dataset. Training and decoding can take a long time. Start early so you have time to train and run your model and decode its predictions!

0 Getting Started

Make sure you have downloaded the test data file, test.txt. Make sure you have installed the following libraries:
• NLTK, https://www.nltk.org/
• NLTK Corpora, https://www.nltk.org/data.html
• Numpy, https://numpy.org/
• Scikit-Learn, https://scikit-learn.org/stable/
• Scipy, https://www.scipy.org/

We are going to train a bigram maximum entropy Markov model (MEMM) to do part-of-speech tagging. The function load_training_corpus() loads the training data for you. We are using the Brown corpus with the Universal Dependencies tagset. The function returns a tuple of lists (sents, tags), where sents is a list of sentences, i.e. lists of strings (words), and tags is a list of tag sequences, i.e. lists of strings (tags). Note that the Brown corpus is very large (just over 57k sentences), so we will not use the whole thing, or training would take forever.
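You do not need to write load_training_corpus() yourself; it is provided in the starter code. Purely to illustrate the (sents, tags) format described above, here is a rough sketch of how such a loader might look using NLTK's Brown corpus with the universal tagset. The function name and the num_sents cutoff are hypothetical and not part of the assignment.

    from nltk.corpus import brown

    def load_training_corpus_sketch(num_sents=10000):
        # Hypothetical stand-in for the provided load_training_corpus().
        # Requires the 'brown' and 'universal_tagset' NLTK data packages.
        # num_sents is an illustrative cutoff, not the assignment's actual value.
        tagged_sents = brown.tagged_sents(tagset='universal')[:num_sents]
        sents = [[word for word, tag in sent] for sent in tagged_sents]
        tags = [[tag for word, tag in sent] for sent in tagged_sents]
        return sents, tags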
1 Generating Features – 15 points

The first thing we need to do is generate features for our input. Recall that an MEMM is just a multiclass logistic regression. The sequence information is part of the feature vector, not part of the model itself, so a training example is a single word/tag pair, not an entire sentence, and we will write functions to generate features at the word level. For ease of implementation, we are using simple features, not the pairwise features shown in the lecture slides.

We break down the word-based features into two types: n-gram features and word property features. The n-gram features capture word sequence information, while the word property features capture the word's general characteristics. We will also use tag bigram features.

Fill in the function get_ngram_features(words, i). The argument words is a list of strings, and the argument i is an int indicating the current word to generate features for. The function should return a list containing the following strings:
• 'prevbigram-[x]', where [x] is the previous word w_{i-1}
• 'nextbigram-[x]', where [x] is the next word w_{i+1}
• 'prevskip-[x]', where [x] is the word two positions back, w_{i-2}
• 'nextskip-[x]', where [x] is the word two positions ahead, w_{i+2}
• 'prevtrigram-[x]-[y]', where [x] is w_{i-1} and [y] is w_{i-2}
• 'nexttrigram-[x]-[y]', where [x] is w_{i+1} and [y] is w_{i+2}
• 'centertrigram-[x]-[y]', where [x] is w_{i-1} and [y] is w_{i+1}

Make sure to check for corner cases where i ± 1 and/or i ± 2 are out of range of the sentence length. In these cases, use the meta-words '<s>' and '</s>' to fill in those features. Also, note that the square brackets in [x] and [y] are not part of the features; they only indicate that you are meant to replace them with the actual words. For example, if words = ['the', 'happy', 'cat'] and i = 0, the expected output would be ['prevbigram-<s>', 'nextbigram-happy', 'prevskip-<s>', 'nextskip-cat', 'prevtrigram-<s>-<s>', 'nexttrigram-happy-cat', 'centertrigram-<s>-happy'].
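A minimal sketch of get_ngram_features(), assuming the boundary meta-words are '<s>' and '</s>' as in the example above:

    def get_ngram_features(words, i):
        # Return the word at offset k from position i, or a boundary meta-word.
        def tok(k):
            j = i + k
            if j < 0:
                return '<s>'
            if j >= len(words):
                return '</s>'
            return words[j]

        return [
            'prevbigram-' + tok(-1),
            'nextbigram-' + tok(1),
            'prevskip-' + tok(-2),
            'nextskip-' + tok(2),
            'prevtrigram-' + tok(-1) + '-' + tok(-2),
            'nexttrigram-' + tok(1) + '-' + tok(2),
            'centertrigram-' + tok(-1) + '-' + tok(1),
        ]

Calling get_ngram_features(['the', 'happy', 'cat'], 0) reproduces the expected output shown above.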
Fill in the function get_word_features(word). The argument word is a string. The function should return a list containing the following strings:
• 'word-[word]'
• 'capital', if word is capitalized
• 'allcaps', if word is all capital letters
• 'wordshape-[x]', where [x] is an abstracted version of the word in which lowercase letters are replaced with 'x', uppercase letters are replaced with 'X', and digits are replaced with 'd'
• 'short-wordshape-[x]', which is like the previous word shape feature, except runs of the same character type are merged into a single character (e.g. "Hello" → 'Xxxxx' → 'Xx')
• 'number', if word contains a digit
• 'hyphen', if word contains a hyphen
• 'prefix[j]-[x]', where [j] ranges from 1 to 4 and [x] is the substring containing the first [j] letters of word
• 'suffix[j]-[x]', where [j] ranges from 1 to 4 and [x] is the substring containing the last [j] letters of word

Make sure to check for corner cases where [j] is out of range of the length of word; in those cases, skip the prefix/suffix features for those values of [j]. As before, the square brackets are not part of the features; they indicate that you are meant to replace them with the actual words or values. For example, if word = 'UTDallas', the expected output would be ['word-UTDallas', 'capital', 'wordshape-XXXxxxxx', 'short-wordshape-Xx', 'prefix1-U', 'prefix2-UT', 'prefix3-UTD', 'prefix4-UTDa', 'suffix1-s', 'suffix2-as', 'suffix3-las', 'suffix4-llas'].

Finally, fill in the function get_features(words, i, prevtag). The argument words is a list of strings, the argument i is an int, and the argument prevtag is a string (the tag of w_{i-1}). The function should do the following:
• Call the two feature functions we wrote earlier and combine their outputs into a single list.
• Add the tag bigram feature, 'tagbigram-[prevtag]'.
• Convert all features to lowercase except the word shape and short word shape features.
• Return the full list of features.
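A possible sketch of get_word_features() and get_features(), following the spec above. The feature order matches the UTDallas example, and details such as using word[0].isupper() for the 'capital' test are one reasonable interpretation, not the official solution. get_features() reuses the get_ngram_features() sketch from earlier.

    def get_word_features(word):
        features = ['word-' + word]
        if word and word[0].isupper():
            features.append('capital')
        if word.isupper():
            features.append('allcaps')
        # Word shape: lowercase -> x, uppercase -> X, digit -> d, other characters unchanged
        shape = ''.join('d' if c.isdigit() else 'X' if c.isupper() else
                        'x' if c.islower() else c for c in word)
        features.append('wordshape-' + shape)
        # Short word shape: collapse runs of identical shape characters
        short = ''.join(c for k, c in enumerate(shape) if k == 0 or c != shape[k - 1])
        features.append('short-wordshape-' + short)
        if any(c.isdigit() for c in word):
            features.append('number')
        if '-' in word:
            features.append('hyphen')
        for j in range(1, 5):
            if j <= len(word):
                features.append('prefix' + str(j) + '-' + word[:j])
        for j in range(1, 5):
            if j <= len(word):
                features.append('suffix' + str(j) + '-' + word[-j:])
        return features

    def get_features(words, i, prevtag):
        features = get_ngram_features(words, i) + get_word_features(words[i])
        features.append('tagbigram-' + prevtag)
        # Lowercase everything except the two word shape features
        return [f if f.startswith(('wordshape-', 'short-wordshape-')) else f.lower()
                for f in features]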
2 Training – 25 points

We are ready to work with the training data. First, we will assume we have already generated the features for the entire training corpus (we will do this in a later function).

Fill in the function remove_rare_features(corpus_features, threshold=5). The argument corpus_features is a list of lists, where each sublist represents a sentence and contains the individual feature lists for each word in that sentence (i.e. the outputs of individual calls to get_features()). For example, corpus_features[0][0] is the feature list for the first word of the first sentence in the corpus. The function should do the following:
• Iterate through corpus_features and count the number of times each feature occurs.
• Create two sets, one containing the rare features (fewer than threshold occurrences) and one containing the common features (threshold or more occurrences).
• Create a new version of corpus_features with the rare features removed. Make sure you don't change the nested structure of corpus_features.
• Return a tuple (corpus_features, common_features) containing the new version of corpus_features and the common feature set (we can reuse it to build the feature dictionary).

Removing rare features is helpful because it reduces the overall size of the model by reducing the total number of features, and therefore the feature vector length. It also helps prevent overfitting on rare features that may be strongly correlated with some tag.

Now we can generate the feature dictionary. This is also a good time to create the tag dictionary. In HW2, we didn't need a label dictionary because there were only two labels, positive and negative, which we assigned to 1 and 0, respectively. Now we have 12 tags, and recall that the MEMM outputs a probability distribution over labels, so we need to assign an index/position to each tag.

Fill in the function get_feature_and_label_dictionaries(common_features, corpus_tags). The argument common_features is the set of common features from remove_rare_features(), and the argument corpus_tags is a list of lists, where each sublist represents a sentence and contains the tags for that sentence (for example, corpus_tags[0][0] is the tag for the first word of the first sentence in the corpus). The function should do the following:
• Create a feature dictionary and assign an index to each feature in common_features, as in HW2.
• Create a tag dictionary and assign an index to each of the 12 tags.
• Return a tuple (feature_dict, tag_dict).

Now we have the tools to build X_train and Y_train. Again, note that a training example isn't an entire sequence of words and tags, it's a single word/tag pair; thus, the number of training examples isn't the number of sentences in the training corpus, it's the number of tokens. Y_train is easy and pretty much the same as in HW2, so let's do that first.

Fill in the function build_Y(corpus_tags, tag_dict). The argument corpus_tags is a list of lists of tags, and the argument tag_dict is the dictionary from get_feature_and_label_dictionaries(). The function should do the following:
• Iterate through corpus_tags by first iterating through the first sentence's tags, then the second sentence's tags, and so on.
• For each tag, look up its index in the tag dictionary and append the index to a list. This "flattens" the nested list input format into a single, non-nested list for output.
• Convert the completed list to a Numpy array using numpy.array() and return it.
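One way the three functions above might fit together is sketched below. The sorted() calls only make the index assignment deterministic; the handout does not mandate a particular ordering, so any consistent assignment should work.

    from collections import Counter
    import numpy

    def remove_rare_features(corpus_features, threshold=5):
        # Count how many times each feature occurs across the whole corpus
        counts = Counter(f for sent in corpus_features
                           for word_feats in sent for f in word_feats)
        common_features = {f for f, c in counts.items() if c >= threshold}
        rare_features = set(counts) - common_features  # per the spec; not needed below
        # Drop rare features while preserving the sentence/word nesting
        filtered = [[[f for f in word_feats if f in common_features]
                     for word_feats in sent] for sent in corpus_features]
        return filtered, common_features

    def get_feature_and_label_dictionaries(common_features, corpus_tags):
        feature_dict = {f: i for i, f in enumerate(sorted(common_features))}
        all_tags = sorted({t for sent in corpus_tags for t in sent})
        tag_dict = {t: i for i, t in enumerate(all_tags)}
        return feature_dict, tag_dict

    def build_Y(corpus_tags, tag_dict):
        # One label index per token, flattened sentence by sentence
        return numpy.array([tag_dict[t] for sent in corpus_tags for t in sent])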
X_train is going to be harder because we are going to use a sparse matrix to represent it. The features that we generated in Part 1 are all binary, i.e. for a given word, each feature is either present (1) or absent (0). So for any given word, we want its feature vector to have value 1 at the positions of its generated features, and 0 for all other features. Since most features will be 0 for a given word, we can simply save the positions of all entries that are 1, and all other positions are assumed to be 0. This representation is called a sparse matrix. Sparse matrices are much smaller in memory than dense ("normal") matrices, which is important for us because the training corpus is so large.

We are going to use the compressed sparse row (CSR) format for matrices. The idea is that we can construct a sparse matrix using three lists, row, col, and value, to indicate which positions are not 0: X[row[i], col[i]] = value[i], and all other positions in X are assumed to be 0. Of course, all non-0 positions have value 1 for us, so value will just be a list of 1s.

Fill in the function build_X(corpus_features, feature_dict). The argument corpus_features is a list of lists of feature lists, where each sublist represents a sentence and contains the sublists representing the words in that sentence, and each word sublist contains features; for example, corpus_features[0][0][0] is the first feature of the first word in the first sentence. The argument feature_dict is the dictionary from get_feature_and_label_dictionaries(). The function should do the following:
• Create two empty lists, rows and cols – recall that in an input matrix, the rows correspond to training examples, and the columns correspond to features.
• Iterate through corpus_features as we iterated through corpus_tags in build_Y().
  – For each feature, append to rows the index of the training example it belongs to.
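The excerpt cuts off mid-list here. Assuming the remaining steps are to append each feature's column index (from feature_dict) to cols and then assemble the matrix with scipy.sparse.csr_matrix, as the CSR description above suggests, a sketch of build_X() might look like this:

    import numpy
    from scipy.sparse import csr_matrix

    def build_X(corpus_features, feature_dict):
        rows, cols = [], []
        example_index = 0                        # one training example (row) per token
        for sent_features in corpus_features:
            for word_features in sent_features:
                for feature in word_features:
                    if feature in feature_dict:  # skip features with no assigned column
                        rows.append(example_index)
                        cols.append(feature_dict[feature])
                example_index += 1
        values = numpy.ones(len(rows))           # every stored entry is a 1
        return csr_matrix((values, (rows, cols)),
                          shape=(example_index, len(feature_dict)))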