Modify the skeleton .py file to translate words from English to Pig Latin. Please do not import anything beyond what is already in the skeleton .py file. I've also attached the related lecture slides in case they help. If you can, please comment your code so that a first-year college student can understand it. Thank you!
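For reference, a minimal sketch of one common Pig Latin convention is below (vowel-initial words get "way" appended; otherwise the leading consonant cluster moves to the end, followed by "ay"). The function name to_pig_latin is a placeholder and the exact rules are assumptions, so the skeleton's own function names and the assignment's stated rules should take precedence. It needs no imports:

```python
def to_pig_latin(word):
    """Translate a single English word to Pig Latin (one common convention)."""
    if not word:               # nothing to translate
        return word

    vowels = "aeiou"
    lower = word.lower()

    # Assumed rule 1: if the word starts with a vowel, just append "way".
    if lower[0] in vowels:
        return word + "way"

    # Assumed rule 2: otherwise, move the leading consonant cluster
    # (every letter up to the first vowel) to the end and append "ay".
    for i, ch in enumerate(lower):
        if ch in vowels:
            return word[i:] + word[:i] + "ay"

    # Word with no vowels at all: just append "ay".
    return word + "ay"


# Example usage:
# to_pig_latin("apple")  -> "appleway"
# to_pig_latin("string") -> "ingstray"
```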
Attached lecture slides:

Today’s Topics
• Motivation: machine neural translation for long sentences
• Decoder: attention
• Transformer overview
• Self-attention
(Slides thanks to Dana Gurari)

Converting Text to Vectors
1. Tokenize the training data; convert the data into a sequence of tokens (e.g., data -> “This is tokening”)
2. Learn the vocabulary by identifying all unique tokens in the training data
3. Encode the data as vectors (e.g., one-hot vectors)

Two common approaches to tokenization, character-level and word-level:
https://nlpiation.medium.com/how-to-use-huggingfaces-transformers-pre-trained-tokenizers-e029e8d6d1fa

Character-level vocabulary:
  Token:  a  b  c  ...  0   1   ...  !    @    ...
  Index:  1  2  3  ...  27  28  ...  119  120  ...

Word-level vocabulary:
  Token:  a  an  at  ...  bat  ball  ...  zipper  zoo    ...
  Index:  1  2   3   ...  527  528   ...  9,842   9,843  ...

One-hot encodings (https://github.com/DipLernin/Text_Generation): an input sequence of 40 tokens representing characters or words.

What are the pros and cons of using word tokens instead of character tokens?
• Pros: input/output sequences are shorter, which simplifies learning semantics
• Cons: an “UNK” token is needed for out-of-vocabulary words; the vocabulary can be large
Word-level representations are more commonly used.

Problems with One-Hot Encoding Words
• Huge memory burden and computationally expensive: the dimensionality equals the vocabulary size (e.g., English has ~170,000 words, with ~10,000 commonly used)
• No notion of which words are similar, yet such understanding can improve generalization; e.g., “walking”, “running”, and “skipping” are all suitable for “He was ____ to school.”, but with one-hot vectors the distance between all words (walking, soap, fire, skipping) is equal.
(Kamath, Liu, and Whitaker. Deep Learning for NLP and Speech Recognition. 2019.)
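To make the tokenize / learn-vocabulary / one-hot pipeline above concrete, here is a minimal sketch in plain Python (no imports). The toy corpus and variable names are made up for illustration:

```python
# Toy training text (made up for illustration).
corpus = "background music from a berimbau offers a beautiful escape"

# 1. Tokenize: split the text into word tokens.
tokens = corpus.split()

# 2. Learn the vocabulary: collect the unique tokens and give each an index.
vocab = sorted(set(tokens))
token_to_index = {token: i for i, token in enumerate(vocab)}

# 3. Encode each token as a one-hot vector (all zeros except a single 1).
def one_hot(token):
    vector = [0] * len(vocab)           # dimensionality = vocabulary size
    vector[token_to_index[token]] = 1   # 1 at this token's index
    return vector

encoded = [one_hot(t) for t in tokens]
print(len(vocab))   # vocabulary size (the length of every one-hot vector)
print(encoded[0])   # one-hot vector for the first token, "background"
```

This also makes the memory problem visible: every vector is as long as the vocabulary, which is why the slides move to dense embeddings next.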
Today’s Topics
• Introduction to natural language processing
• Text representation
• Neural word embeddings
• Programming tutorial

Idea: Represent Each Word Compactly in a Space Where Vector Distance Indicates Word Similarity
(Kamath, Liu, and Whitaker. Deep Learning for NLP and Speech Recognition. 2019.)

Inspiration: Distributional Semantics
“The distributional hypothesis says that the meaning of a word is derived from the context in which it is used, and words with similar meaning are used in similar contexts.”
• Origins: Harris in 1954 and Firth in 1957
(Kamath, Liu, and Whitaker. Deep Learning for NLP and Speech Recognition. 2019.)

• What is the meaning of “berimbau” based on its context? Idea: context makes it easier to understand a word’s meaning.
  1. Background music from a berimbau offers a beautiful escape.
  2. Many people danced around the berimbau player.
  3. I practiced for many years to learn how to play the berimbau.
• What other words could fit into these contexts?
(https://capoeirasongbook.wordpress.com/instruments/berimbau/; adapted from slides by Lena Voita)

Word-context matrix (1 if the word can appear in the context, 0 otherwise):

            Context 1   Context 2   Context 3
  Berimbau      1           1           1
  Soap          0           0           0
  Fire          0           0           0
  Guitar        1           1           1

The hypothesis is that words with similar row values have similar meanings.

Approach
• Learn a dense (lower-dimensional) vector for each word by characterizing its context, which inherently reflects its similarity to and difference from other words.
• Now the distance between each pair of words differs; e.g., berimbau and guitar are the closest word pair. (Note: there are many ways to measure distance, e.g., cosine distance.)
• We embed words in a shared space so they can be compared with a few features. What features would discriminate berimbau, soap, fire, and guitar? Potential, interpretable features: wooden, commodity, cleaner, food, temperature, noisy, weapon.

Approach: Learn Word Embedding Space
• An embedding space represents a finite number of words, decided during training.
• A word embedding is represented as a vector indicating its context.
• The dimensionality of all word embeddings in an embedding space matches.
• In practice, the learned discriminating features are hard to interpret.

Embedding Matrix
• The embedding matrix converts an input word into a dense vector; its size is (size of vocabulary) by (target dimensionality, e.g., 5).
• The one-hot encoding dictates which word embedding to use; equivalently, a word’s embedding can be extracted efficiently when we know the word’s index.
(Kamath, Liu, and Whitaker. Deep Learning for NLP and Speech Recognition. 2019.)
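To illustrate the last two points, here is a tiny sketch in plain Python showing that multiplying a one-hot vector by the embedding matrix selects the same row as a direct index lookup. The 4-word vocabulary, the 5-d target dimensionality, and all matrix values are made up:

```python
# Tiny made-up vocabulary and a 5-d embedding for each word (invented values).
vocab = ["berimbau", "soap", "fire", "guitar"]
embedding_matrix = [
    [0.8, 0.1, 0.0, 0.0, 0.9],  # berimbau
    [0.0, 0.9, 0.7, 0.0, 0.0],  # soap
    [0.0, 0.0, 0.1, 0.9, 0.2],  # fire
    [0.7, 0.2, 0.0, 0.0, 0.8],  # guitar
]

word = "guitar"
index = vocab.index(word)

# Multiplying the word's one-hot vector by the embedding matrix...
one_hot = [1 if i == index else 0 for i in range(len(vocab))]
via_matmul = [sum(one_hot[i] * embedding_matrix[i][j] for i in range(len(vocab)))
              for j in range(5)]

# ...gives exactly the same 5-d vector as looking up the word's row directly.
via_lookup = embedding_matrix[index]
assert via_matmul == via_lookup
print(via_lookup)  # the dense embedding for "guitar"
```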
Popular Word Embeddings
• Bengio method
• Word2vec (skip-gram model)
• And more…

Idea: Learn Word Embeddings That Help Predict Viable Next Words
e.g.,
1. Background music from a _______
2. Many people danced around the _______
3. I practiced for many years to learn how to play the _______
(Bengio et al. A Neural Probabilistic Language Model. JMLR 2003.)

Task: Predict Next Word Given Previous Ones
• e.g., a vocabulary size of 17,000 was used in the experiments. What is the dimensionality of the output layer? 17,000 (one score per vocabulary word).

Architecture
• The embedding matrix C produces the word embeddings; a projection layer is followed by a hidden layer with a non-linearity.
• e.g., a vocabulary size of 17,000 was used with embedding sizes of 30, 60, and 100 in the experiments.
• Assume a 30-d word embedding: what are the dimensions of the embedding matrix C? 30 x 17,000 (i.e., 510,000 weights).
• What are the dimensions of each word embedding? 1 x 30.

Training
• Input: the paper tried 1, 3, 5, and 8 input words, and used 2 datasets with ~1 million and ~34 million words respectively.
• Use a sliding window over the input data; e.g., with 3 input words, “Background music from a berimbau offers a beautiful escape…” yields the pairs (Background, music, from) -> a, (music, from, a) -> berimbau, and so on.
• Cost function: minimize the cross-entropy loss plus regularization (L2 weight decay); the word embeddings are iteratively updated.

Summary: Word Embeddings Are Learned That Support Predicting Viable Next Words
e.g.,
1. Background music from a _______
2. Many people danced around the _______
3. I practiced for many years to learn how to play the _______
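To connect the Architecture and Training slides, here is a rough sketch in plain Python of how the sliding-window training pairs and the concatenated embedding input could be built. The 3-word window follows the slides' example; the toy sentence, the 3-d embeddings (instead of 30-d), and the matrix values are made up for illustration, and the projection, hidden, and softmax layers are not implemented:

```python
sentence = "background music from a berimbau offers a beautiful escape".split()

# Vocabulary and a tiny embedding matrix C (made-up values; 3-d instead of 30-d).
vocab = sorted(set(sentence))
word_to_index = {w: i for i, w in enumerate(vocab)}
C = [[0.1 * (i + j) for j in range(3)] for i in range(len(vocab))]  # |V| x 3

# Sliding window: the previous 3 words are the input, the next word is the target.
window = 3
training_pairs = []
for t in range(window, len(sentence)):
    context = sentence[t - window:t]   # e.g., ["background", "music", "from"]
    target = sentence[t]               # e.g., "a"
    training_pairs.append((context, target))

# The network input is the concatenation of the context words' embeddings
# (3 words x 3 dims = 9 numbers here); in the paper this feeds a projection
# layer, a non-linear hidden layer, and a softmax over the whole vocabulary,
# and the rows of C are updated by backpropagation like any other weights.
context, target = training_pairs[0]
network_input = [x for w in context for x in C[word_to_index[w]]]
print(context, "->", target, "| input size:", len(network_input))
```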