Course Project: News Stance Detection

The instructions are in the project-news-stance.pdf file. I have also attached the relevant lecture notes. The Q&A document has notes from the lecturer on what can and cannot be done. Finally I have attached the data files.


Course Project: News Stance Detection
Qiang Zhang, Bill Lampos
February 23, 2018

1 Task Definition

In the context of news, a claim is made in a news headline as well as in the text of an article body. Quite often, the headline of a news article is written to attract readers, even though the body of the article may be about a different subject or make a different claim than the headline. Stance detection involves estimating the relative perspective (or stance) of two pieces of text, i.e. whether the two pieces agree with, disagree with, discuss, or are unrelated to one another. Your task in this project is to estimate the stance of a body text from a news article relative to a headline. The goal in stance detection is to detect whether the headline and the body of an article make the same claim. The stance is categorized as one of four labels: "agree", "disagree", "discuss" and "unrelated". Formal definitions of the four stances are as follows:
• "agree" – the body text agrees with the headline;
• "disagree" – the body text disagrees with the headline;
• "discuss" – the body text discusses the same claim as the headline, but does not take a position;
• "unrelated" – the body text discusses a different claim, not the one in the headline.

2 Dataset

We will be using the publicly available FNC-1 dataset (https://github.com/FakeNewsChallenge/fnc-1/). This dataset is divided into a training set and a testing set; the ratio of training data to testing data is about 2:1. Every data sample is a pair of a headline and a body. There are 49972 pairs in the training set, with 49972 unique headlines and 1683 unique bodies, which means that an article body can appear in more than one pair. "unrelated" pairs form the majority (over 70%) in both sets, while the percentage of "disagree" pairs is less than 3%; the percentages of "agree" and "discuss" are less than 20% and 10%, respectively. Severe class imbalance therefore exists in the FNC-1 dataset. FNC-1 provides an official baseline (https://github.com/FakeNewsChallenge/fnc-1-baseline) that may be helpful for reading the files and for splitting the training data into a training subset and a validation subset.

3 Involved Subtasks

The course project involves several subtasks that are required to be solved. This is a research-oriented project, so you are expected to be creative, and coming up with your own solutions is strongly encouraged for any part of the project.
• Split the training set into a training subset and a validation subset with a data proportion of about 9:1. The training subset and the validation subset should have similar ratios of the four classes. Statistics of the ratios should be presented.
• Extract vector representations of headlines and bodies in all the datasets, and compute the cosine similarity between these two vectors. You can use representations based on bag-of-words or other methods such as Word2Vec. You are encouraged to explore alternative representations as well. (A minimal sketch of this and the next feature appears right after this list.)
• Establish language-model-based representations of the headlines and the article bodies in all the datasets and calculate the KL-divergence for each pair of headline and article body. Feel free to explore different smoothing techniques for the language-model-based representations. (See the sketch after this list.)
• Propose and implement alternative features/distances that might be helpful for the stance detection task. Describe the meaning of each feature and its extraction process.
• Choose two kinds of representative distances/features that you think may be most important for stance detection and plot the distance distribution for the four stances. Comment on why you think these are the important features and try to validate their importance using the data.
• Using the features that you have created, implement a linear regression and a logistic regression model using gradient descent for stance classification. The implementations of these learning algorithms should be your own. (A minimal sketch of the logistic case appears after the Deadline section.)
• Analyse the performance of your models using the test set. Describe the evaluation metric you use and explain why you think it is suited to this task. Feel free to use alternative metrics that you think may fit. Compare and contrast the performance of the two models you have implemented. Analyse the effect of the learning rate on both models.
• Explore which features are the most important for the stance detection task by analysing their importance for the machine learning models you have built.
• Do a literature review of the stance detection task; briefly summarize and compare the features and models that have been proposed for it.
• Propose ways to improve the machine learning models you have implemented. You can propose new machine learning models, new ways of sampling/using the training data, or new features. You are allowed to use existing libraries/packages for this part.
4 What to submit

You are expected to submit all the code you have written, together with a written report of up to 5 pages. Your report should describe the work you have done for each of the aforementioned steps. Unless otherwise stated above, all the code should be your own and you are not allowed to reuse any code that is available online. You are allowed to use either Python or Java as the programming language.

5 Deadline

The deadline for submitting your project is midnight on April 6th.
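One of the subtasks above asks for linear and logistic regression trained with your own gradient descent implementation. The sketch below covers only the logistic case and frames it as a binary problem (e.g. related vs. unrelated) to keep it short; extending it to the four stance labels (for example via one-vs-rest) and choosing the feature matrix are left open, and the function names are hypothetical.

# Illustrative sketch only: binary logistic regression trained with batch
# gradient descent. The binary framing and function names are assumptions
# made for this example; the assignment covers four stance labels and
# requires your own implementation.
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_logistic_regression(X, y, learning_rate=0.1, n_epochs=1000):
    """X: (n_samples, n_features) array, y: 0/1 labels. Returns weights and bias."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_epochs):
        p = sigmoid(X.dot(w) + b)            # predicted probabilities
        grad_w = X.T.dot(p - y) / n_samples  # gradient of the mean log loss w.r.t. w
        grad_b = np.mean(p - y)              # gradient w.r.t. the bias
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict(X, w, b, threshold=0.5):
    return (sigmoid(X.dot(w) + b) >= threshold).astype(int)


if __name__ == "__main__":
    # Tiny synthetic check with two features (think: cosine similarity and KL divergence).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] - X[:, 1] > 0).astype(int)
    w, b = train_logistic_regression(X, y)
    print("training accuracy:", np.mean(predict(X, w, b) == y))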
Attached lecture notes

Machine Learning for Data Mining and Information Retrieval
Association Rule Mining and Machine Learning
Emine Yilmaz ([email protected])
Some slides courtesy Andrew Ng (Stanford) and Bing Liu (UIC)

Identifying relationships between items: association rule mining
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied extensively by the database and data mining community.
• Assume all data are categorical.
• Initially used for market basket analysis, to find how items purchased by customers are related, e.g. Bread → Milk [sup = 5%, conf = 100%].

Transaction data: supermarket data
• Market basket transactions:
  t1: {bread, cheese, milk}
  t2: {apple, eggs, salt, yogurt}
  … …
  tn: {biscuit, eggs, milk}
• Concepts:
  • An item i: an item/article in a basket.
  • I = {i1, i2, …, im}: the set of all items sold in the store.
  • A transaction t, with t ⊆ I: the items purchased in a basket.
  • A transactional dataset: a set of transactions T = {t1, t2, …, tn}.

Transaction data: a set of documents
• A text document data set; each document is treated as a "bag" of keywords:
  doc1: Student, Teach, School
  doc2: Student, School
  doc3: Teach, School, City, Game
  doc4: Baseball, Basketball
  doc5: Basketball, Player, Spectator
  doc6: Baseball, Coach, Game, Team
  doc7: Basketball, Team, City, Game

The model: rules
• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
• An association rule is an implication of the form X → Y, where X, Y ⊂ I and X ∩ Y = ∅.
• An itemset is a set of items, e.g. X = {milk, bread, cereal}.
• A k-itemset is an itemset with k items, e.g. {milk, bread, cereal} is a 3-itemset.

Rule strength measures
• Support: the rule holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y; sup = Pr(X ∪ Y).
• Confidence: the rule X → Y holds in T with confidence conf if conf% of the transactions that contain X also contain Y; conf = Pr(Y | X).
• An association rule is a pattern stating that when X occurs, Y occurs with a certain probability.

Support and confidence
• Support count: the support count of an itemset X, denoted X.count, in a data set T is the number of transactions in T that contain X. Assume T has n transactions.
• Then:
  support(X → Y) = (X ∪ Y).count / n
  confidence(X → Y) = (X ∪ Y).count / X.count

Goal and key features
• Goal: find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf).
• Key features:
  • Completeness: find all rules.
  • Mining with data on hard disk (not in memory).

An example
• Transaction data:
  t1: Beef, Chicken, Milk
  t2: Beef, Cheese
  t3: Cheese, Boots
  t4: Beef, Chicken, Cheese
  t5: Beef, Chicken, Clothes, Cheese, Milk
  t6: Chicken, Clothes, Milk
  t7: Chicken, Milk, Clothes
• Assume minsup = 30% and minconf = 80%.
• An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7].
• Association rules from the itemset:
  Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
  … …
  Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]

Many mining algorithms
• There are a large number of them, using different strategies and data structures.
• Their resulting sets of rules are all the same: given a transaction data set T, a minimum support and a minimum confidence, the set of association rules existing in T is uniquely determined.
• Any algorithm should find the same set of rules, although their computational efficiency and memory requirements may differ.
• We study only one: the Apriori algorithm.

The Apriori algorithm
• Probably the best-known algorithm. Two steps:
  • Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
  • Use the frequent itemsets to generate rules.
• E.g., from the frequent itemset {Chicken, Clothes, Milk} [sup = 3/7], one rule is Clothes → Milk, Chicken [sup = 3/7, conf = 3/3].

Step 1: mining all frequent itemsets
• A frequent itemset is an itemset whose support is ≥ minsup.
• Key idea: the Apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset.
  [Diagram: itemset lattice over items A, B, C, D, showing the 1-itemsets (A, B, C, D), 2-itemsets (AB, AC, AD, BC, BD, CD) and 3-itemsets (ABC, ABD, ACD, BCD).]

The algorithm
• Iterative algorithm (also called level-wise search): find all 1-item frequent itemsets, then all 2-item frequent itemsets, and so on.
• In each iteration k, only
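The support and confidence definitions above can be checked directly against the seven example transactions from the slides. The sketch below does that, and also enumerates frequent itemsets level by level in a brute-force way; it is not the full Apriori algorithm (no candidate generation or pruning), and the function names are choices made for this example.

# Illustrative sketch: support and confidence over the example transactions
# from the slides, plus a brute-force level-wise enumeration of frequent
# itemsets. The real Apriori algorithm additionally generates and prunes
# candidates using the downward closure property.
from itertools import combinations

TRANSACTIONS = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]


def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)


def confidence(lhs, rhs, transactions):
    """conf(lhs -> rhs) = support(lhs and rhs together) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)


def frequent_itemsets(transactions, minsup):
    """Brute-force level-wise search for all itemsets with support >= minsup."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        level = {frozenset(c): support(frozenset(c), transactions)
                 for c in combinations(items, k)}
        level = {s: sup for s, sup in level.items() if sup >= minsup}
        if not level:
            break  # no frequent k-itemsets, so no larger ones can exist either
        frequent.update(level)
    return frequent


if __name__ == "__main__":
    # Matches the slide example: sup({Chicken, Clothes, Milk}) = 3/7 and
    # conf(Clothes -> Milk, Chicken) = 3/3.
    print(support({"Chicken", "Clothes", "Milk"}, TRANSACTIONS))
    print(confidence({"Clothes"}, {"Milk", "Chicken"}, TRANSACTIONS))
    print(sorted(frequent_itemsets(TRANSACTIONS, minsup=0.3).items(),
                 key=lambda kv: (len(kv[0]), sorted(kv[0]))))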

Saurabh answered on Mar 31 2020
Solution/baseline.py

import numpy as np
from nltk import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from scipy.sparse import hstack
# Official FNC-1 baseline utilities for loading the data, generating the
# train/dev/test splits, and computing the FNC-1 score.
from utils.dataset import DataSet
from utils.generate_test_splits import split
from utils.score import report_score
# Load the FNC-1 data and split the training portion into training/dev/test.
dataset = DataSet()
data_splits = split(dataset)
training_data = data_splits['training']
dev_data = data_splits['dev']
test_data = data_splits['test']
LABELS = ['agree', 'disagree', 'discuss', 'unrelated']
class Preprocessor(object):
def __init__(self):
self.tokenizer = RegexpTokenizer(r'\w+')
self.stopwords_eng = stopwords.words('english')
self.lemmatizer = WordNetLemmatizer()

def __call__(self, doc):
return [self.lemmatizer.lemmatize(t) for t in self.tokenizer.tokenize(doc)]

def process(self, text):
tokens = self.tokenizer.tokenize(text.lower())
tokens_processed = []
for t in tokens:
if t in self.stopwords_eng: continue
tokens_processed.append(self.lemmatizer.lemmatize(t))
return tokens_processed

class Document(object):
def __init__(self, data):
self.stances = []
self.headlines = []
self.body_texts = []
self.size = 0
for dict_item in data:
label_index = LABELS.index(dict_item['Stance'])
headline = dict_item['Headline']
body = dataset.articles[dict_item['Body ID']]
self.stances.append(label_index)
self.headlines.append(headline)
self.body_texts.append(body)
self.size = len(self.stances)
self.stances = np.asarray(self.stances)

def get_full_text(self):
full_texts = []
for i in range(self.size):
text = '\n'.join((self.headlines[i], self.body_texts[i]))
full_texts.append(text)
return full_texts
if __name__ == '__main__':
#preprocessor = Preprocessor()
training_doc = Document(training_data)
test_doc = Document(test_data)

vectorizer = CountVectorizer(ngram_range=(1,2), min_df=2,
stop_words='english')
train_headline = vectorizer.fit_transform(training_doc.headlines)
test_headline = vectorizer.transform(test_doc.headlines)
train_body = vectorizer.fit_transform(training_doc.body_texts)
test_body = vectorizer.transform(test_doc.body_texts)

ch2 = SelectKBest(chi2, k=1000)
ch2.fit(train_headline, training_doc.stances)
train_headline = ch2.transform(train_headline)
test_headline = ch2.transform(test_headline)
ch2.fit(train_body, training_doc.stances)
train_body = ch2.transform(train_body)
test_body = ch2.transform(test_body)

train_features = hstack((train_headline, train_body))
test_features = hstack((test_headline, test_body))

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(train_features, training_doc.stances)

prediction = classifier.predict(test_features)

actual_label = [LABELS[x] for x in test_doc.stances]
predicted_label = [LABELS[x] for x in prediction]
report_score(actual_label, predicted_label)
Solution/fnc-1.py

from __future__ import print_function
import numpy as np
from gensim.models import KeyedVectors
#from keras.preprocessing import sequence
#from keras.models import Sequential
#from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
#from keras.datasets import imdb
from nltk import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
#from sklearn.preprocessing import StandardScaler
#from sklearn.neural_network import MLPClassifier
from scipy.sparse import hstack, csr_matrix
from utils.dataset import DataSet
from utils.generate_test_splits import split
from utils.score import report_score
dataset = DataSet()
data_splits = split(dataset)
training_data = data_splits['training']
dev_data = data_splits['dev']
test_data =...