Answer To: SET11121 School of Computing, Napier University Assessment Brief 1. Module number SET11121 /...
Ximi answered on Mar 08 2021
eda_1.pdf
INTRODUCTION
Detecting and fighting abusive language on internet sources, whether on social media, in
news articles, or elsewhere, has been a growing problem in recent years. Researchers have
continuously applied AI-based methods to detect such content and remove it from
internet sources.
We will discuss here three methods published after 2016 that describe how the problem of
abusive language detection is being solved in the literature.
The 1st approach solves the problem using feature-selection-based methods built on
convolutional neural networks (CNNs).
The 2nd approach involves cross-domain identification of abusive language. The authors
propose validating a model trained on one domain against another domain, and introduce
mixtures of training sets drawn from data in several domains.
The 3rd approach focuses on the specific targets of the abusive language written on
social media or in news articles. It aims to identify the subject the abusive language was
written about, analysing articles according to whether the language is directed at a
specific person or entity or is generalised.
BACKGROUND
We now dive into the methods used in the approaches described in the introduction.
1st approach:
They proposed three CNN-based models to classify sexist and racist
abusive language: CharCNN, WordCNN, and HybridCNN. The major difference among
these models is whether the input features are characters, words, or both.
CNNs provide a range of filters that capture a "window" of words, characters, or both,
depending on the model, and use these features to make the classification decision.
This reduces the burden of explicitly capturing n-grams during the feature extraction
and selection process in text analytics.
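As a toy illustration (not the paper's implementation), a single filter of width 3 sliding over word vectors scores every 3-word window, which is how a convolution captures n-gram-like features; the embedding values here are random stand-ins:

```python
import numpy as np

# Toy sentence of 5 "words", each represented by a 4-dimensional embedding
rng = np.random.default_rng(0)
sentence = rng.normal(size=(5, 4))
filt = rng.normal(size=(3, 4))  # one filter spanning a window of 3 words

# Slide the filter over every 3-word window (a "valid" convolution)
scores = np.array([np.sum(sentence[i:i + 3] * filt) for i in range(5 - 3 + 1)])

# Max-pooling keeps only the strongest window response as the feature
feature = scores.max()
```

A CNN learns many such filters of several widths, so each filter specialises in a different character- or word-level pattern.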
CharCNN is the character-level convolutional network of Zhang et al. (2015). Each
character of the input sentence is transformed into a one-hot encoding over 70
characters: 26 English letters, 10 digits, 33 punctuation and special characters, and a
newline character. WordCNN is a CNN in which the input sentence is first segmented into
words, each converted into a 300-dimensional word2vec embedding trained on 100 billion
words from Google News (Mikolov et al., 2013). Incorporating pre-trained vectors is a
widely used way to improve performance, especially when using a relatively small
dataset; the authors kept the embeddings non-trainable since their dataset is small.
HybridCNN is a variation of WordCNN, which is limited to taking only word features as
input. Abusive language often contains purposely or mistakenly misspelled words and
made-up vocabulary such as #feminazi. Since the two models above do not use character
and word inputs at the same time, the authors designed HybridCNN to test whether a
model can capture features from both types of input.
2nd approach:
The aim of the paper was to assess how well models trained on one dataset of abusive
language perform on a different test dataset. The differences in performance can be
traced back to two factors:
(1) differences in the types of abusive language the datasets were labelled with, and
(2) differences in dataset sizes. The work observes the joint effect of both factors.
They used a linear Support Vector Machine (SVM), which has already been applied
successfully to abusive text classification and detection in the literature.
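A linear SVM text classifier of this kind can be sketched with scikit-learn; the four training sentences and their labels below are made-up stand-ins for the real datasets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative corpus; 1 = abusive, 0 = clean
train_texts = [
    "you are a disgusting idiot",
    "go back to where you came from",
    "what a lovely day today",
    "great match, well played everyone",
]
train_labels = [1, 1, 0, 0]

# TF-IDF features over unigrams and bigrams fed into a linear SVM
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)
```

For the cross-domain experiments described above, the same fitted pipeline would simply be scored against a test set drawn from a different dataset.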
3rd approach:
Much of the work on abusive language subtasks can be synthesized into a two-fold
typology that considers (i) whether the abuse is directed at a specific target, and (ii) the
degree to which it is explicit.
DISCUSSION
The 1st and 2nd methods describe the problem in a more technical fashion. The 3rd is
largely a hypothesis, with no technical implementation. But in my view, the 3rd approach,
if implemented, could yield better insights into the direction and specifics of the abusive
language being spread over the internet.
The 1st approach is precise and among the more advanced approaches used in industry;
its exploration of the methodologies can help in deciding model quality and
hyperparameter tuning. The 2nd approach deals with a typical machine learning
generalisation problem, although this method should be encouraged and deployed in
practice for real abusive language detection. I did some quick analytics on the dataset
and loaded the JSON files with pandas.
The JSON files are line-delimited and hence easy to load with pandas. Pandas is a
framework for working with record-type data and scales well enough for medium-sized
data problems.
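Loading a line-delimited JSON file with pandas can be sketched as follows; the record fields here are illustrative stand-ins, and an in-memory string substitutes for the actual dataset file:

```python
import io

import pandas as pd

# Each line of the file is one self-contained JSON record (JSON-lines format)
raw = (
    '{"text": "first example tweet", "Annotation": "racism"}\n'
    '{"text": "second example tweet", "Annotation": "none"}\n'
)

# lines=True tells pandas to parse one JSON object per line
df = pd.read_json(io.StringIO(raw), lines=True)
```

With a file on disk, the same call is `pd.read_json("racism.json", lines=True)`.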
Pandas provides a dataframe format for working with data. The analytics gave me the set
of unique tokens, the total token count, and the frequency distribution, which let me see
which words were used most often in a particular context and which were least frequent.
This gives a quick peek at the data; furthermore, there are numerous other ways to
explore the data and visualise it.
This approach is known as Exploratory Data Analysis (EDA).
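The token statistics above can be computed with Python's built-in `collections.Counter` (nltk's `FreqDist` offers the same behaviour); the token list here is a toy stand-in for the tokenised corpus:

```python
from collections import Counter

# Toy token list standing in for the tokenised corpus
tokens = "the cat sat on the mat the end".split()

freq = Counter(tokens)             # frequency distribution of tokens
unique_tokens = len(freq)          # vocabulary size
total_tokens = sum(freq.values())  # total token count
most_common = freq.most_common(2)  # the most frequent words
```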
eda_2.pdf
Data Loading
The dataset files are line-delimited JSON, so each record loads cleanly into a pandas
dataframe, which the following steps then operate on.
Data Preprocessing
Some basic preprocessing, removing stop words and applying regular expressions, helped
clear out some of the vocabulary so that the feature space was automatically reduced.
The text data was converted into tokens to form a corpus.
The dataframes were concatenated and a label column was added.
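A minimal sketch of this preprocessing step, using a small stand-in stopword list (nltk ships a much fuller one) and a simple regular expression:

```python
import re

# Small illustrative stopword list; nltk.corpus.stopwords provides a full one
STOPWORDS = {"the", "a", "an", "is", "to", "and"}

def preprocess(text):
    # Lower-case and strip everything except letters and whitespace
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenise on whitespace and drop stop words
    return [tok for tok in text.split() if tok not in STOPWORDS]

corpus = [preprocess(t) for t in ["The cat is ANGRY!!", "A dog barks to the moon."]]
```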
Feature Engineering
Feature extraction was done in two ways, according to the model used.
The first used scikit-learn's CountVectorizer, which simply converts words into
vectors of their respective word counts.
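CountVectorizer usage can be sketched on two toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["you are great", "you are terrible you"]  # toy documents

vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse matrix of raw word counts per document
```

Each row of `X` is a document and each column a vocabulary word, so repeated words simply get higher counts.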
In the second scenario a CNN was used, so word embeddings were created accordingly:
word-level vectors learnt by the model itself.
In the deep learning world, word vectors alone are powerful features for any machine
learning or deep learning model. The data was separated into training and
testing sets.
Model Training and Evaluation
Two models were trained. The first used sklearn's logistic regression classifier; the
metric used for evaluation was the accuracy score.
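This first pipeline can be sketched end to end; the eight labelled sentences are made-up stand-ins for the real labelled corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy labelled corpus (1 = abusive, 0 = clean)
texts = ["you idiot", "total moron", "nice work", "good job",
         "you fool", "utter trash", "well done", "lovely post"]
labels = [1, 1, 0, 0, 1, 1, 0, 0]

# Count features, then a train/test split as described in the report
X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```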
The second model was a single-layer CNN with a word-embedding feature layer.
It was trained for 10 epochs so that training and evaluation were quick; the accuracy
metric was used here as well. During training, a cross-validation set was used to manage
the trade-off between bias and variance in the model.
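A minimal Keras sketch of such a single-layer CNN; the vocabulary size, embedding dimension, sequence length, and filter settings are illustrative assumptions, not the report's actual values:

```python
from tensorflow.keras import layers, models

# Illustrative sizes, not taken from the report
vocab_size, embed_dim, seq_len = 5000, 100, 50

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),   # word embeddings learnt by the model
    layers.Conv1D(128, 5, activation="relu"),  # the single convolutional layer
    layers.GlobalMaxPooling1D(),               # keep the strongest filter response
    layers.Dense(1, activation="sigmoid"),     # abusive / not-abusive
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training for 10 epochs with a held-out split, as in the report:
# model.fit(X_train, y_train, epochs=10, validation_split=0.1)
```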
abusive_language_detection_1.ipynb
import pandas as pd
import nltk

nltk.download('punkt')
# [nltk_data] Downloading package punkt to /Users/Ximi-Hoque/nltk_data...
# [nltk_data] Unzipping tokenizers/punkt.zip.
# Out: True

# Reading data in pandas dataframe as it allows applying operations in a proper way
data = pd.read_json("abusive_data/racism.json", lines=True)

# Out: Index(['Annotation', 'contributors', 'coordinates', 'created_at', 'entities',
...