Answer To: NIT3202 Data Analytics for Cyber Security Assignment
Aditya Kumar answered on Feb 25 2021
Comparison of Different Classifiers for Filtering Spam Twitter Accounts
Introduction: In the last decade, social media has become an integral part of our daily life. Platforms such as Facebook, Twitter and Instagram have grown considerably in this context. Many companies and institutions now want to mine text data from online sources to learn what people really think about their organization. Among these sources, social platforms have gained the highest priority, and Twitter is one of the first choices for most organizations. In this report, I work with a set of Twitter data indicating which accounts are spam and which are not, along with detailed information about each account, consisting of “Account Age”, “No of Followers”, “No of Following”, “No of User Favourites”, “No of Lists” and “No of Tweets”, and detailed information about the content of those accounts, such as “No of Retweets”, “No of Tweet Favourites”, “No of Hash Tag”, “No of User Mention”, “No of URLs”, “No of Characters” and “No of Digits”. I apply five algorithms to classify whether an account is spam or not: Logistic Regression, Naïve Bayes, Support Vector Machine (SVM), k-Nearest Neighbours (k-NN) and Decision Tree.
Literature Review: In [1], the authors observe that people generally tweet about trending topics, but in most cases it is hard to determine what a topic is about; their work is a step towards generalizing trending topics with higher accuracy for better information retrieval.
[2] discusses different evaluation metrics that may be used for classification.
[3] is based on Logistic Regression, a well-known machine learning algorithm for classification when the response variable has only two levels.
In [4], the authors analyse the data characteristics that may affect the performance of Naïve Bayes.
[5] presents a detailed analysis of which Naïve Bayes classifier is best suited to spam filtering.
[6] provides detailed information on the Support Vector Machine (SVM).
In [7], a Decision Tree approach is used to solve land cover mapping problems.
In [8], the k-NN algorithm is used for a breast cancer classification problem.
In [9], a k-NN classifier is used to predict a water pollution index.
In [10], a new k-NN classifier for Big Data, based on efficient data pruning, is introduced.
Technical Demonstration: In this work, I have used R to perform the classification. As the given data is in .txt format, the “read.table()” function is used to import the dataset into R. Next, the variable names are assigned to the corresponding columns of the dataset. The training dataset has also been attached, which makes it possible to refer to the variables directly instead of going through the data frame every time.
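The import step can be sketched as follows. This is only an illustration under stated assumptions: the file contents, the absence of a header row, the column names and the 0/1 coding of the class label are all hypothetical, since the original file and code are not reproduced in the text.

```r
# Hypothetical stand-in for the assignment's training file: a small
# whitespace-separated text file without a header row (values are made up).
train_path <- tempfile(fileext = ".txt")
writeLines(c("120 35 80 4 1 210 0 0 2 1 1 98 3 0",
             "15 4 900 0 0 55 1 0 5 0 2 130 9 1",
             "400 250 180 12 3 1500 2 0 1 2 0 76 1 0"),
           train_path)

# read.table() reads whitespace-delimited text; header = FALSE because
# the column names are assigned in the next step.
train <- read.table(train_path, header = FALSE)

# Assumed column names, based on the features listed in the introduction.
names(train) <- c("account_age", "no_follower", "no_following",
                  "no_userfavourites", "no_lists", "no_tweets",
                  "no_retweets", "no_tweetfavourites", "no_hashtag",
                  "no_usermention", "no_urls", "no_char", "no_digits",
                  "class")

# Store the response as a factor (assumed coding: 1 = spam, 0 = not spam).
train$class <- factor(train$class, levels = c(0, 1),
                      labels = c("NotSpam", "Spam"))

# attach() lets the columns be referenced by bare name.
attach(train)
print(mean(account_age))
detach(train)

str(train)
```

Calling detach() once the work is done avoids name clashes when several data frames share column names.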
Next, the training dataset has been checked thoroughly for missing data, as missing values could mislead the analysis.
There were no missing values in the data, which means the analysis can be carried out without any missing value treatment.
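A missing-value check of this kind might look like the sketch below. The small data frame and its column names are made up for illustration and stand in for the real training set.

```r
# Toy data frame standing in for the training set (values are made up).
train <- data.frame(account_age = c(120, 15, 400),
                    no_tweets   = c(210, 55, 1500),
                    class       = c(0, 1, 0))

# Total number of missing cells, and a per-column breakdown.
total_missing  <- sum(is.na(train))
missing_by_col <- colSums(is.na(train))

print(missing_by_col)
if (total_missing == 0) {
  message("No missing values: no imputation or row removal is needed.")
}
```

The per-column counts are useful when some values are missing, because they show which variables would need treatment.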
1. Logistic Regression: In this dataset the response variable has only two levels, Spam and Not Spam, so logistic regression can be used: the model is suited to classifying a binary response, where the occurrence of one level excludes the other. A Twitter account cannot be both spam and not spam simultaneously, so this property holds. The following screenshot shows an initial Logistic Regression model.
In this block of code, all the features are included to check whether they are important for predicting the class. The model result is as follows:
Here, the rightmost column of the table shows the p-value of each predictor. If the p-value is greater than 0.05, the predictor is not statistically significant at the 5% level, meaning there is insufficient evidence that it helps to predict the class of the response variable. Hence I keep only those predictors whose p-value is less than 0.05. The final model is as follows:
And the model output is as follows:
Here, in the final model, all the predictors are statistically significant at the 5% level for predicting the class of the response variable.
So, for the Logistic Regression approach, “Account Age”, “No of Tweets”, “No of Retweets”, “No of Hashtag”, “No of URLs”, “No of Characters” and “No of Digits” are the predictor variables that play a vital role in classifying whether a Twitter account is spam or not.
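The selection procedure described in this section (fit a model with all predictors, then keep those with a p-value below 0.05 and refit) can be sketched on simulated data. The data, the variable names and the effect sizes below are assumptions made purely for illustration; they are not the assignment's dataset or its results.

```r
set.seed(1)
n <- 500

# Simulated predictors; the names are hypothetical stand-ins for real features.
sim <- data.frame(account_age = rnorm(n),
                  no_urls     = rnorm(n),
                  no_lists    = rnorm(n))   # unrelated to the response by construction

# The response depends on the first two predictors only.
eta <- 0.5 - 1.5 * sim$account_age + 2 * sim$no_urls
sim$spam <- rbinom(n, size = 1, prob = plogis(eta))

# Initial model with every predictor.
full_fit <- glm(spam ~ ., data = sim, family = binomial)

# The rightmost column of the coefficient table holds the p-values.
coef_table <- summary(full_fit)$coefficients
p_values   <- coef_table[-1, "Pr(>|z|)"]   # drop the intercept row
keep       <- names(p_values)[p_values < 0.05]

# Refit using only the predictors that are significant at the 5% level.
final_fit <- glm(reformulate(keep, response = "spam"),
                 data = sim, family = binomial)
print(summary(final_fit)$coefficients)
```

Removing all non-significant predictors in a single pass mirrors the approach described in the text; a stepwise procedure that drops one predictor at a time would be an alternative.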
The prediction code applies the final Logistic Regression model to the two given test datasets and calculates the 5 metrics for each set of predictions. The ROC curves for the two test datasets are given below:
The metric values for both test sets are provided together at the end.
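As an illustration of this evaluation step, the sketch below scores a fitted logistic model on two test sets. The text does not name the five metrics, so the choice of accuracy, precision, recall, F1 and AUC is an assumption, as are the helper `binary_metrics`, the 0.5 threshold and the simulated data.

```r
# Hypothetical helper: confusion-matrix metrics for a binary classifier.
# `actual` is coded 0/1 and `prob` is the predicted probability of class 1.
binary_metrics <- function(actual, prob, threshold = 0.5) {
  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & actual == 1)
  tn <- sum(pred == 0 & actual == 0)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  # Area under the ROC curve via the rank-sum (Mann-Whitney) identity.
  r  <- rank(prob)
  n1 <- sum(actual == 1)
  n0 <- sum(actual == 0)
  auc <- (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  c(accuracy  = (tp + tn) / length(actual),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall),
    auc       = auc)
}

# Simulated training set and two test sets (made-up data).
set.seed(2)
make_data <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, spam = rbinom(n, 1, plogis(2 * x)))
}
train <- make_data(400)
test1 <- make_data(200)
test2 <- make_data(200)

fit <- glm(spam ~ x, data = train, family = binomial)

# type = "response" returns probabilities rather than log-odds.
results <- sapply(list(test1 = test1, test2 = test2), function(d) {
  binary_metrics(d$spam, predict(fit, newdata = d, type = "response"))
})
print(round(results, 3))
```

The AUC summarises the ROC curve in a single number, which makes the two test sets easy to compare side by side.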
2. Support Vector Machine (SVM): A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for classification purposes. SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes.
The following code shows the model for this algorithm:
All the predictor variables have been used here except “No of Tweet Favourites”, as this column contains no values other than 0 and hence becomes unnecessary for...