1. Answer the following to the best of your ability: (10 points)
a) Define Corpus:
A corpus is a machine-readable collection of texts produced in natural communicative settings. Texts such as newspaper articles, literary fiction, spoken dialogue, blogs and diaries, and court documents are chosen for their representative and balanced characteristics. A corpus is said to be "representative of a language variety" if what is found in the corpus can be generalised to that variety.
b) How might you make a corpus for the following problem: I want to be able to learn characteristics of a politician’s language
One approach, drawn from a corpus study of a major event in UK climate policy: collect the parliamentary debates around the Climate Change Bill, which in 2008 became an Act of Parliament, and examine them with corpus techniques and Critical Discourse Analysis (CDA). The Act established the independent Committee on Climate Change to monitor and advise on climate-change activities, including statutory carbon-reduction targets. Only five parliamentarians voted against it, with all major parties in favour, so the Act had broad cross-party support and the debates sample a wide range of politicians' language.
Analysis:
- Keyword analysis
- Collocation
- Semantic tagging
2. a) Describe briefly 4 difficulties with identifying word boundaries algorithmically. (8 points)
i) Mixed languages: the failure reports come from all around the world, so even though they are filtered for English, words from other languages appear alongside the English text due to regional linguistic influence.
ii) Algorithms misinterpret these non-English terms as English and attempt to segment them. For example, the Portuguese word "engatava" (meaning "engaged") can be broken into 'eng' 'at' 'a' 'v' 'a'. Segmentation is driven by each word's frequency of occurrence in the dictionary, and the dictionary contains only English terms; the difficulty can be addressed by expanding the dictionary with words from other languages, since the methodologies are language-agnostic.
iii) Classification accuracy did not improve substantially after merging word segmentation with the auto-sort model, because approximately 56% of the reports are free of mistakes in the first place, and the bulk of the report descriptions contain no more than three errors.
iv) Failure reports cannot be segmented using context-based segmentation approaches: even though the vocabulary developed for them includes relevant technical terms and codes, these are not contextually connected. Deep-learning algorithms are likewise ineffective on this data set. Dictionary-based segmentation algorithms, on the other hand, are fast, need less processing, and use less memory, making them a good fit.
b) What are the possible differences in the following two implementations of a word identifier? (5 points)
tokens = nltk.word_tokenize(sentence)
and
tokens = sentence.split(" ")
word_tokenize() belongs to NLTK's tokenize module, which has two main sub-functions: word_tokenize(), used to separate a sentence into tokens or words, and sent_tokenize(), used to separate a text or paragraph into sentences. Because word_tokenize() uses a trained tokenizer, it splits punctuation marks and contraction parts into tokens of their own.
The split() method divides a string at the separator supplied and returns a list of string pieces, so sentence.split(" ") cuts only at literal space characters and leaves punctuation attached to the neighbouring words. (Called with no argument, split() treats any whitespace character, such as space, \t, \n, and so on, as the separator.)
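To make the difference concrete, here is a small sketch (the example sentence is our own; expected outputs are shown as comments, and nltk's punkt models must be downloaded first):

import nltk
# nltk.download('punkt')  # one-time download of the tokenizer models

sentence = "Don't put the cat's hat on!"

print(nltk.word_tokenize(sentence))
# ['Do', "n't", 'put', 'the', 'cat', "'s", 'hat', 'on', '!']

print(sentence.split(" "))
# ["Don't", 'put', 'the', "cat's", 'hat', 'on!']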
c) Why do we use ‘tokens’ instead of ‘word’ (5 points)
Tokenization is the process of dividing a large volume of text into smaller units called tokens. We say "tokens" rather than "words" because a token is whatever unit tokenization yields (punctuation marks, numbers, and contraction parts included), not only dictionary words. These tokens serve as the starting point for stemming and lemmatization, and they are critical for pattern recognition. (In a separate sense, tokenization can also mean swapping sensitive data for non-sensitive placeholders.)
3. With the following sentence “The Cat in the Hat” (12 points)
a) List the Uni-grams
('The',)
('Cat',)
('in',)
('the',)
('Hat',)
b) List the Bi-grams
('The', 'Cat')
('Cat', 'in')
('in', 'the')
('the', 'Hat')
c) List the Tri-grams
('The', 'Cat', 'in')
('Cat', 'in', 'the')
('in', 'the', 'Hat')
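These lists can be generated with NLTK's ngrams helper, for example (assuming the punkt models for word_tokenize are available):

import nltk
# nltk.download('punkt')  # needed once for word_tokenize

tokens = nltk.word_tokenize("The Cat in the Hat")
for n in (1, 2, 3):
    # n=1 gives the uni-grams, n=2 the bi-grams, n=3 the tri-grams listed above
    print(list(nltk.ngrams(tokens, n)))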
4. Answer the following about predictive models: (10 points)
a) What is a backoff model?
Backoff means going back to the (n-1)-gram level to calculate the probability when you encounter an n-gram with probability 0. A popular simple method is known as "stupid backoff": every time you go back one level, you multiply the score by a fixed factor of 0.4.
b) Give an example of how a backoff may help your model.
Say you are using 4-grams to estimate the probability of the next word, and you have "this is a very" followed by "sunny". If "sunny" never occurred in the context "this is a very", the 4-gram model gives "sunny" probability 0, which is not good, because we know "sunny" is more probable there than, say, "giraffe".
So you back off to a 3-gram model and calculate the probability of "sunny" in the context "is a very". If "sunny" exists in the 3-gram model, the score is 0.4 * P("sunny" | "is a very"). You can keep backing off, all the way down to the unigram model if needed, multiplying by 0.4^n where n is the number of times you backed off.
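Here is a minimal sketch of stupid backoff over precomputed n-gram counts; the counts dictionary and the figures in it are made up purely for illustration:

def stupid_backoff(word, context, counts, alpha=0.4):
    # counts maps n-gram tuples of any length to raw frequencies;
    # the result is a relative score, not a normalised probability
    if not context:
        # base case: unigram relative frequency
        total = sum(c for gram, c in counts.items() if len(gram) == 1)
        return counts.get((word,), 0) / max(total, 1)
    if counts.get(context + (word,), 0) > 0 and counts.get(context, 0) > 0:
        return counts[context + (word,)] / counts[context]
    # back off: drop the leftmost context word and discount by alpha
    return alpha * stupid_backoff(word, context[1:], counts, alpha)

counts = {('is', 'a', 'very'): 10, ('a', 'very'): 25,
          ('a', 'very', 'sunny'): 2, ('sunny',): 40, ('giraffe',): 1}
print(stupid_backoff('sunny', ('is', 'a', 'very'), counts))
# 0.032 = 0.4 * (2 / 25): the 4-gram was unseen, so we backed off once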
5. Why do we need sent_tokenize_list = sent_tokenize(text) in NLTK instead of just breaking sentences apart by punctuation? (5 points)
Because punctuation is ambiguous as a sentence boundary: periods also appear inside abbreviations, initials, and decimal numbers, so naively splitting on punctuation breaks sentences in the wrong places. sent_tokenize() ships with a pre-trained model (Punkt) that has learned where sentence boundaries actually fall, so the rest of the pipeline can understand the core meaning of each true sentence. Using sent_tokenize is therefore better than splitting on punctuation.
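For instance, with a made-up sentence (the pre-trained Punkt model behind sent_tokenize knows that the periods in "Dr." and "9.50" do not end sentences):

from nltk.tokenize import sent_tokenize

text = "Dr. Smith paid $9.50 for the book. He liked it."
print(sent_tokenize(text))
# typically: ['Dr. Smith paid $9.50 for the book.', 'He liked it.']
print(text.split("."))
# ['Dr', ' Smith paid $9', '50 for the book', ' He liked it', '']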
6. Briefly explain Transformation Based Tagging and how it differs from Ngram tagging for Part-of-Speech (8 points)
The size of nth-order taggers can be an issue. To use tagging in the range of language technologies deployed on mobile devices, it is important to find ways to shrink model size without sacrificing performance. A backoff nth-order tagger must store trigram and bigram tables, which are enormous sparse arrays with potentially hundreds of millions of entries, and because such tables would grow even larger, n-gram models simply cannot be conditioned on the identities of the words in the context, only on their tags.
Transformation-based (Brill) tagging takes a different approach: an initial tagger makes a first guess at every tag, and an ordered list of learned transformation rules then corrects the mistakes; the rules can refer to the surrounding words as well as their tags. It matches the performance of nth-order taggers using models only a fraction (roughly one-fourth) of their size.
An n-gram is a subsequence of n items. The idea behind the NgramTagger subclasses is that the part-of-speech tag of the current word can be guessed by looking at the previous words and their POS tags; each tagger keeps a context dictionary, implemented in the ContextTagger parent class.
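For contrast, here is a minimal n-gram tagger chain in NLTK, each level backing off to the next; the corpus choice and training slice are arbitrary:

import nltk
from nltk.corpus import treebank
# nltk.download('treebank'); nltk.download('punkt')  # one-time downloads

train_sents = treebank.tagged_sents()[:3000]

# bigram tagger backs off to unigram, which backs off to a default tag
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

print(t2.tag(nltk.word_tokenize("The cat sat in the hat")))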
7. Answer the following: (12 points)
a) What is a False Negative?
A false negative arises when the model erroneously predicts the negative class for an example that is actually positive. (By contrast, a false positive emerges when the model wrongly predicts the positive class.) For example, a test designed to identify cancer delivers a false negative when it returns a negative result despite the fact that the person has cancer.
b) What is a True Positive?
A true positive is when the model predicts the positive class accurately; on the other side, a true negative is when the model correctly predicts the negative class. For example, if the condition is an illness, "true positive" means "correctly identified as diseased."
c) When should Accuracy be used as a metric?
Accuracy should be used when the classes are roughly balanced and the costs of the different kinds of error are similar. On imbalanced data it misleads: a classifier that always predicts the majority class still scores 60 percent on a test set of 60 percent class A and 40 percent class B samples, giving the illusion of good performance. The real problem emerges when the cost of misclassifying data from the smaller class is extremely high: with a rare but lethal disease, the cost of failing to diagnose a sick person is considerably greater than that of submitting a healthy person to further testing.
d) What is the difference between Precision and Recall? When would you use them?
Precision is the number of relevant documents retrieved divided by the total number of documents retrieved, whereas recall is the number of relevant documents retrieved divided by the total number of relevant documents that exist.
Precision is favoured over recall when there is an unbalanced class and the positives you flag must be right, since false positives are then the costly error; note that precision's formula does not include false negatives, so missed positives do not affect it. Recall is favoured when missing a positive is the costly error, as in screening for disease.
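Both metrics, plus the F-measure needed in Part II, come straight from the confusion-matrix counts. A small helper, with made-up counts for illustration:

def precision_recall_f1(tp, fp, fn):
    # precision: of everything flagged positive, how much was right
    precision = tp / (tp + fp) if tp + fp else 0.0
    # recall: of everything actually positive, how much was found
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1(tp=8, fp=2, fn=4))
# (0.8, 0.666..., 0.727...): precise, but missing a third of the positives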
8. Fill in the 3 empty boxes for a typical machine learning cycle: (9 points)
1. Data
2. Feature Extractor
3. Classifier Model
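As a sketch of how the three boxes fit together in NLTK, here is the classic gender-classifier example from the NLTK book (requires the names corpus):

import random
import nltk
from nltk.corpus import names
# nltk.download('names')  # one-time download

# 1. Data: labelled examples
labeled = ([(n, 'male') for n in names.words('male.txt')] +
           [(n, 'female') for n in names.words('female.txt')])
random.shuffle(labeled)

# 2. Feature extractor: turn each example into a feature dict
def gender_features(word):
    return {'last_letter': word[-1]}

featuresets = [(gender_features(n), g) for n, g in labeled]
train_set, test_set = featuresets[500:], featuresets[:500]

# 3. Classifier model: train and evaluate
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))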
9. What are the two differences when you test on your training data versus testing on your test data? (4 points)
First, when you test on your training data, accuracy will be high compared to the test dataset, because the model has already been trained on, and has seen, the training data, so it is easy for it to classify. Second, only the test dataset measures generalisation: all of its data are new and unknown to the model, which must classify them using only what it learned from the training dataset.
10. Explain (or draw) k-fold validation when k=5 (6 points)
K-fold cross-validation divides a data set into K sections (folds), each of which takes a turn as the test set. With K=5, the data are divided into five folds. In the first iteration, the first fold is used for testing while the other four are used for training. In the second iteration, the second fold serves as the test set and the remaining folds as the training set. The procedure is repeated until all five folds have been tested, and the five scores are averaged.
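A bare-bones sketch of how the K=5 splits are formed (plain Python; index lists stand in for the data, and n_samples is assumed divisible by k):

def k_fold_indices(n_samples, k=5):
    # yield (train, test) index lists; each fold is the test set exactly once
    fold_size = n_samples // k
    for i in range(k):
        test = list(range(i * fold_size, (i + 1) * fold_size))
        train = [j for j in range(n_samples) if j not in test]
        yield train, test

for train, test in k_fold_indices(10, k=5):
    print("test:", test, "train:", train)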
11. Show 3 examples where a Named Entity System can get confused by ambiguity (6 points)
1. "The pen is good. This cost 10 USD." The amount "10 USD" may be assigned to an unknown entity type instead of being recognised as money.
2. "The flowers are good. I like the lilly." "Lilly" looks like a person's name, so the system may tag it as PERSON rather than recognising the flower.
3. "The politician's speech changed over time. Kennedy and Robert seem uninterested." "Kennedy" and "Robert" are ambiguous between different people; they could be politicians or reporters.
PART II
1. Use given script to download 1 Wikipedia page
2. Run NLTK’s NER tool.
3. Examine both the PERSON and LOCATION Entities
4. Calculate Precision, Recall, and F-Measure for both Person or Location (whichever your document has)
pip install pymediawiki
To Harvest WIKI
from mediawiki import MediaWiki
wikipedia = MediaWiki()
p = wikipedia.page('Marymount University')
content = p.content
print(content)
To Tag Named Entities
1) Word Tokenize a Sentence
2) POS tag the tokens
3) print(nltk.ne_chunk(tagged))
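Putting those steps together, a sketch of the whole pipeline (assumes nltk's punkt, averaged_perceptron_tagger, maxent_ne_chunker and words resources are downloaded). The listing below is the entity output for the Marymount University page:

import nltk
from mediawiki import MediaWiki

# harvest the page
content = MediaWiki().page('Marymount University').content

# tokenize, POS-tag and NE-chunk sentence by sentence
for sentence in nltk.sent_tokenize(content):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    for subtree in nltk.ne_chunk(tagged).subtrees():
        if subtree.label() != 'S':   # skip the sentence root, keep entities
            print(subtree.label(), ' '.join(w for w, pos in subtree.leaves()))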
PERSON Marymount
ORGANIZATION University
ORGANIZATION Catholic
GPE Arlington
GPE Virginia
PERSON Marymount
PERSON Marymount
ORGANIZATION Religious
ORGANIZATION Sacred Heart
GPE Mary
ORGANIZATION RSHM
PERSON Marymount College
ORGANIZATION Marymount
GPE New York
GPE California
ORGANIZATION Admiral
PERSON Naval Surgeon General
PERSON William McKinley
PERSON Rixey Mansion
PERSON Main House
PERSON Marymount
ORGANIZATION Physical Therapy
PERSON Marymount University
ORGANIZATION Center
PERSON Ethical Concerns
PERSON Marymount
ORGANIZATION Caruthers Hall
PERSON Rose Benté Lee Ostapenko Hall
ORGANIZATION Malek Plaza
PERSON Sister Majella Berg
ORGANIZATION RSHM
PERSON Marymount
PERSON Marymount
PERSON Ballston Center
ORGANIZATION LEED Gold
GPE Rixey
PERSON Ballston
PERSON Center
ORGANIZATION Reinsch Pierce Family Courtyard
PERSON Marymount University
ORGANIZATION Commission
ORGANIZATION Colleges
LOCATION Southern Association
GPE Colleges
PERSON Schools
ORGANIZATION College
GPE Health
PERSON Education
PERSON College
GPE Business
GPE Innovation
ORGANIZATION Leadership
ORGANIZATION Technology
ORGANIZATION College
GPE Science
ORGANIZATION Humanities
PERSON Marymount
ORGANIZATION Consortium
ORGANIZATION Universities
ORGANIZATION Washington Metropolitan Area
PERSON Campuses
PERSON Campus
PERSON Marymount
LOCATION North Arlington
GPE Arlington
GPE Virginia
PERSON Rose Benté Lee Ostapenko Hall
PERSON Rowley Hall
PERSON Butler Hall
GPE St. Joseph
PERSON Berg Hall
PERSON Gerard Phelan Hall
PERSON Rowley Academic Center
PERSON Caruthers Hall
PERSON Gailhac Hall
GPE St. Joseph
ORGANIZATION Hall
ORGANIZATION Rose Benté Lee Center
PERSON Bernie
ORGANIZATION Cafe
ORGANIZATION Gerard Phelan Cafeteria
ORGANIZATION Emerson
GPE Auditorium
GPE Lodge
PERSON Ireton Hall
FACILITY Main House
ORGANIZATION Sacred Heart
GPE Mary
PERSON Ballston Center
PERSON Marymount
PERSON Ballston Center
GPE Arlington
ORGANIZATION College
GPE Business
GPE Innovation
GPE Leadership
PERSON Technology
ORGANIZATION Human Services
ORGANIZATION Forensic
PERSON Legal Psychology
ORGANIZATION College
GPE Health
ORGANIZATION Education
ORGANIZATION Graduate Enrollment Services
ORGANIZATION Cody Gallery
PERSON Eats Café
PERSON Coakley Family Library
ORGANIZATION Rixey
PERSON Marymount
ORGANIZATION Ballston Center
PERSON Main Campus
PERSON Ballston Center
ORGANIZATION Center
PERSON Marymount
ORGANIZATION MedStar Clinic
PERSON Marymount
ORGANIZATION MedStar
PERSON Network
PERSON Marymount
ORGANIZATION Main Campus
PERSON Marymount
GPE Saints
ORGANIZATION NCAA Division
ORGANIZATION...