605 There are 3 questions to answer below: We learned about text clustering methods for documents by representing each document as a vector of non-stop-words and comparing the similarity of documents...



605 There are 3 questions to answer below:

We learned about text clustering methods for documents by representing each document as a vector of non-stop-words and comparing the similarity of documents using the Tanimoto Cosine Distance metric.

1. Write pseudocode that takes as input a corpus (set) of documents and creates a vector for each document, where the vectors do not contain stop-words and are weighted by the term frequency multiplied by the log of inverse document frequency, as described in the course module.

DocumentVectorSet documentVectorSet = CreateDocumentVectors(documentSet);

2. Write pseudocode that takes two document vectors and measures their similarity.

Similarity similarity = DocumentSimilarity(documentVectorA, documentVectorB);

After performing K-means clustering, let us suppose that we examine the clusters by sight and assign names to them. For example, one cluster may represent documents about sports, another documents about politics, and yet another documents about animals. Let us assume that we assign each cluster a name such as sports, politics, and animals.

Sometimes words are used in multiple contexts. For example, the word duck is ambiguous. Sometimes it means a waterfowl and would fall into the animals category. Sometimes it is used in politics, as in a lame duck congress, and would fall into the politics category. Sometimes it is used in sports, as in the name of the National Hockey League team the Anaheim Ducks, and would fall into the sports category.

Knowing the context in which a word is used makes the clustering much better. To understand why, suppose that we had two documents, one with the words duck and water, and the other with the words duck and ice. Without understanding the context of the word duck, our similarity metric may find that these documents are similar. However, when duck appears with water, the word duck probably refers to an animal, whereas when duck appears with ice, it probably refers to sports. With this knowledge, our similarity metric would find these documents not very similar at all.

Suppose we had a library of words that are used in multiple contexts, such as:

String[] multiContextWords = {"duck", "crane", "book", …};

Suppose also that we have a multi-dimensional array that shows the multi-context words and the common words that are used with them:

String[][] wordContext = {
    {"duck (animal)", "zoo", "feathers", "water", …},
    {"duck (sports)", "hockey", "Anaheim", "ice", …},
    {"duck (politics)", "congress", "lame", …},
    {"crane (animal)", "bird", "water", …},
    {"crane (construction)", "building", "equipment", …},
    …};

3. Modify the CreateDocumentVectors() pseudocode from above to take advantage of the multiContextWords[] and wordContext[][] arrays to create better document vectors, so that the subsequent call to DocumentSimilarity() will better distinguish contexts.
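To make the idea behind question 3 concrete, here is a minimal Python sketch of one possible preprocessing step. It is an illustration under stated assumptions, not the course's or the answerer's method. The helper name disambiguate and the scoring rule (count how many cue words of each sense also occur in the document) are hypothetical, and the arrays contain only the entries listed in the question, with the elided entries left out.

```python
# Hypothetical illustration; names and scoring rule are assumptions.
# Only the entries shown in the question are included here.
MULTI_CONTEXT_WORDS = ["duck", "crane", "book"]
WORD_CONTEXT = [
    ["duck (animal)", "zoo", "feathers", "water"],
    ["duck (sports)", "hockey", "Anaheim", "ice"],
    ["duck (politics)", "congress", "lame"],
    ["crane (animal)", "bird", "water"],
    ["crane (construction)", "building", "equipment"],
]


def disambiguate(words, multi_context_words=MULTI_CONTEXT_WORDS,
                 word_context=WORD_CONTEXT):
    """Replace each ambiguous word with the sense label whose cue
    words overlap most with the other words in the document."""
    present = {w.lower() for w in words}
    result = []
    for word in words:
        key = word.lower()
        if key not in multi_context_words:
            result.append(word)  # unambiguous word: keep as-is
            continue
        best_label, best_score = word, 0
        for row in word_context:
            label, cues = row[0], row[1:]
            if not label.startswith(key + " ("):
                continue  # this row describes a different word
            score = sum(1 for cue in cues if cue.lower() in present)
            if score > best_score:
                best_label, best_score = label, score
        # With no matching cue words, the word is left untagged.
        result.append(best_label)
    return result
```

Running each document's word list through a step like this before computing weights would make "duck (animal)" and "duck (sports)" separate vector dimensions, so the duck/water and duck/ice documents from the example would no longer share a term.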
Answered Same Day May 06, 2020

Answer To: 605 There are 3 questions to answer below: We learned about text clustering methods for documents by...

Abr Writing answered on May 08 2020
605
There are 3 questions to answer below:
We learned about text clustering methods for documents by representing each document as a vector of non-stop-words and comparing the similarity of documents using the Tanimoto Cosine Distance metric.
1. Write pseudocode that takes as input a corpus (set) of documents and creates a vector for each document, where the vectors do not contain stop-words and are weighted by the term frequency multiplied by the log of inverse document frequency, as described in the course module.
DocumentVectorSet documentVectorSet = CreateDocumentVectors(documentSet);
Solution:
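import math
from collections import Counter

# Assumption: a small sample stop-word list; the course module's list is not given here.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def CreateDocumentVectors(documentSet):
    # Assumption: each document is a string; split into lowercase words, drop stop-words
    docs = [[w for w in document.lower().split() if w not in STOP_WORDS]
            for document in documentSet]
    # Document frequency: the number of documents that contain each word
    df = Counter(word for words in docs for word in set(words))
    n = len(docs)
    documentVectorSet = []
    for words in docs:
        tf = Counter(words)  # term frequency within this document
        # Weight = tf * log(N / df); a word found in every document gets weight 0
        documentVectorSet.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return documentVectorSet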
    
2. Write pseudocode that takes two document vectors and measures their similarity.
Similarity similarity = DocumentSimilarity(documentVectorA, documentVectorB);
Solution:
def DocumentSimilarity(documentVectorA, documentVectorB)
    documentVectorA = items of documentVectorA
    documentVectorB = items of...
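The posted solution is cut off at this point in the source. As a stand-in, here is a minimal Python sketch of one way two sparse document vectors could be compared. It assumes each vector is a dict mapping a term to its weight, and it assumes that the "Tanimoto" metric named in the question means the extended Tanimoto coefficient A·B / (|A|² + |B|² − A·B). The function name document_similarity is hypothetical, and this is not the truncated original.

```python
def document_similarity(vector_a, vector_b):
    """Tanimoto coefficient between two sparse term-weight vectors.

    Assumption: each vector is a dict mapping term -> weight.
    Returns a value in [0, 1] for non-negative weights; 1.0 means
    identical vectors, 0.0 means no shared terms.
    """
    # Dot product over the terms the two vectors share.
    dot = sum(weight * vector_b[term]
              for term, weight in vector_a.items()
              if term in vector_b)
    # Squared Euclidean norms of each vector.
    norm_a = sum(w * w for w in vector_a.values())
    norm_b = sum(w * w for w in vector_b.values())
    denominator = norm_a + norm_b - dot
    if denominator == 0:
        return 0.0  # both vectors are empty or all-zero
    return dot / denominator
```

If plain cosine similarity is what the course module intends, the only change would be the denominator, which becomes the product of the two (unsquared) norms.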