


SIT 384 Data Analytics for Cyber Security
Assignment 2, Trimester 1, 2017

Objectives
•To apply skills and knowledge acquired throughout the trimester in classification algorithms and the machine learning process.
•To rationalize the use of machine learning algorithms to effectively and efficiently process data of big size.
•To demonstrate the ability to use R to perform email classification tasks that are common for a corporate security analyst.
•To scientifically conduct and document machine learning experiments for analytics purposes.

Due Date: 2pm, Friday, May 19, 2017. This assignment consists of a report worth 20 marks. Delays caused by a student's own computer downtime cannot be accepted as a valid reason for late submission without penalty. Students must plan their work to allow for both scheduled and unscheduled downtime.

Submission instructions: You must submit an electronic copy of all your assignment files via Cloud-Deakin. You must include your report, source code, necessary data files and, optionally, a presentation file. Assignments will not be accepted through any other manner of submission. Students should note that email and paper-based submissions will ordinarily be rejected.

Special requirements to prove the originality of your work: On-campus students (B and G) are required to demonstrate the execution of their classification programs in R to their tutor in Week 10; Cloud students are required to attach a 3-5 minute video presentation demonstrating how their R code is executed to derive the claimed results. The video should be uploaded to cloud storage (you can find out how to upload a video from https://video.deakin.edu.au/). Failure to do so will result in a delayed assessment of your submission.

Late submissions: Submissions received after the due date are penalized at a rate of 5% (of the full mark) per day, no exceptions. Late submission after 5 days will be penalized at a rate of 100% of the full mark. Close of submissions on the due date and each day thereafter for penalties will occur at 05:00 pm Australian Eastern Time (UTC+10 hours). Students outside of Victoria should note that the normal time zone in Victoria is UTC+10 hours. No extension will be granted. It is the student's responsibility to ensure that they understand the submission instructions. If you have ANY difficulties, ask the Lecturer/Tutor for assistance (prior to the submission date).

Copying, Plagiarism Notice: This is an individual assignment. You are not permitted to work as part of a group when writing this assignment. The University's policy on plagiarism can be viewed online at http://www.deakin.edu.au/students/study-support/referencing/plagiarism
Overview
The popularity of social media networks, such as Twitter, has led to an increasing number of spamming activities. Researchers have employed various machine learning methods to detect Twitter spam. In this assignment, you are required to classify spam tweets using the provided datasets. The features have been extracted and clearly structured in JSON format. The extracted features can be categorized into two groups: user profile-based features and tweet content-based features, as summarized in Table 1. The provided training dataset and testing datasets are listed separately in Table 2 and Table 3. In the testing datasets, the ratio of spam to non-spam is 1:1 in Dataset 1, while the ratio is 1:19 in Dataset 2. In most previous work, the testing datasets are nearly evenly distributed. However, in the real world only around 5% of tweets on Twitter are spam, which indicates that testing Dataset 2 simulates the real-world scenario. You are required to classify spam tweets, evaluate the classifiers' performance, and compare the Dataset 1 and Dataset 2 outcomes by conducting experiments.


Summary
Introduction
This paper gives an overview of the state of the art of machine learning applications for spam
detection in a Twitter dataset, using the five machine learning classifiers listed below (a brief
R fitting sketch follows the list):
1. K-Nearest Neighbors (KNN)
2. Support Vector Machines (SVM)
3. Naïve Bayes (NB) Classifier
4. Adaptive Boosting (AdaBoost)
5. Random Forest Classifier
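The sketch below illustrates, at a high level, how these five classifiers might be fitted in R. It is a minimal illustration only: the data frame names train and test, the factor column label (spam / non-spam), the chosen packages (class, e1071, adabag, randomForest) and the tuning values (k, kernel, mfinal, ntree) are all assumptions made for illustration, not details taken from the report or the assignment brief.

# Minimal sketch of fitting the five classifiers; all object names are assumed.
library(class)         # knn()
library(e1071)         # svm(), naiveBayes()
library(adabag)        # boosting()
library(randomForest)  # randomForest()

features <- setdiff(names(train), "label")   # the 13 numeric feature columns

# 1. K-Nearest Neighbors: a lazy learner, so prediction happens in one call
knn_pred <- knn(train = train[, features], test = test[, features],
                cl = train$label, k = 5)

# 2. Support Vector Machine with a radial-basis kernel
svm_fit <- svm(label ~ ., data = train, kernel = "radial")

# 3. Naive Bayes
nb_fit <- naiveBayes(label ~ ., data = train)

# 4. AdaBoost (adabag grows classification trees as weak learners)
ada_fit <- boosting(label ~ ., data = train, mfinal = 50)

# 5. Random Forest
rf_fit <- randomForest(label ~ ., data = train, ntree = 500)

Predictions on a testing dataset would then be obtained with predict(svm_fit, test), predict(nb_fit, test), and so on, assuming a test data frame with the same columns exists.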
For evaluating and comparing the performance of the various classifiers, five different performance
evaluation metrics were used, and are defined below:
1. Accuracy
It is the ratio of the number of correct predictions to the total number of cases examined. That is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Recall
It is the ratio of the number of correctly detected spam to the actual number of spam in the
dataset. That is:

Recall = TP / (TP + FN)
3. Precision
It is the ratio of the number of correctly detected spam to the total number of spam predicted.
That is:

Precision = TP / (TP + FP)
4. Error rate
It is the ratio of the number of incorrect predictions to the total number of cases in the dataset. That is:

Error rate = (FP + FN) / (TP + TN + FP + FN)
5. False positive rate
It is the ratio of the number of incorrectly labelled legitimate data points (account/mail/…) to the total
number of legitimate data points. That is:

False positive rate = FP / (FP + TN)
where TP, FN, FP and TN are defined by the confusion matrix below:

                        Predicted positive      Predicted negative
Actual positive         TP (True Positive)      FN (False Negative)
Actual negative         FP (False Positive)     TN (True Negative)
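These five metrics can be computed directly from predicted and actual labels. The short R helper below is a minimal sketch of that computation; the function name spam_metrics, the argument names and the positive-class label "spam" are illustrative assumptions, not names taken from the assignment or datasets.

# Compute the five evaluation metrics from predicted and actual class labels.
# `pred` and `actual` are vectors/factors of class labels; `positive` names the
# spam class. All object names here are assumptions made for illustration.
spam_metrics <- function(pred, actual, positive = "spam") {
  tp <- sum(pred == positive & actual == positive)   # correctly detected spam
  fn <- sum(pred != positive & actual == positive)   # missed spam
  fp <- sum(pred == positive & actual != positive)   # legitimate flagged as spam
  tn <- sum(pred != positive & actual != positive)   # correctly kept legitimate
  c(accuracy   = (tp + tn) / (tp + tn + fp + fn),
    recall     = tp / (tp + fn),
    precision  = tp / (tp + fp),
    error_rate = (fp + fn) / (tp + tn + fp + fn),
    fpr        = fp / (fp + tn))
}

For example, spam_metrics(knn_pred, test$label) would return all five metrics for the KNN predictions on a test set, assuming objects with those names exist.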
The data for training and testing the spam filter consists of six account-based features and seven
content-based features, as listed below:

Category                    Feature              Description
Account-based features      account_age          The age of an account
                            no_follower          # of followers
                            no_following         # of following
                            no_userfavorites     # of favorites the user received
                            no_lists             # of lists in which the user is a member
                            no_tweets            # of tweets posted by the user
Content-based features      no_retweets          # of times this tweet has been retweeted
                            no_tweetfavorites    # of favorites this tweet received
                            no_hashtag           # of hashtags in this tweet
                            no_usermention       # of user mentions in this tweet
                            no_urls              # of URLs contained in this tweet
                            no_char              # of characters in this tweet
                            no_digits            # of digits in this tweet
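Because the assignment provides the extracted features in JSON format, one plausible way to read them into R is sketched below using the jsonlite package. The file name "train.json" and the presence of a label field are assumptions about the provided files, not details confirmed by the brief.

# Sketch of loading a JSON feature file into a data frame (assumed file name).
library(jsonlite)

train <- fromJSON("train.json", flatten = TRUE)   # one row per tweet
str(train)   # expect the 13 feature columns from the table above, plus a label

# Make sure the class label is treated as categorical by the classifiers
train$label <- factor(train$label)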
Principal Component Analysis (PCA) was carried out on the above 13 features, and of the new
principal components obtained from the analysis, 3 components were retained, as they were able to
describe almost 99% of the variation in the data.
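A minimal sketch of how this PCA step could be reproduced in R with prcomp is shown below. The report does not state whether the features were centred and scaled, so that choice, along with all object names, is an assumption made for illustration.

# PCA on the 13 features; `train` is assumed to hold the training data.
features <- setdiff(names(train), "label")
pca_fit  <- prcomp(train[, features], center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the principal components
cum_var <- cumsum(pca_fit$sdev^2) / sum(pca_fit$sdev^2)
n_comp  <- which(cum_var >= 0.99)[1]   # 3 components according to the report

# Project the data onto the retained components for use by the classifiers
train_pca <- as.data.frame(predict(pca_fit, train[, features])[, 1:n_comp, drop = FALSE])
train_pca$label <- train$label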
The following table shows the datasets used for training and testing the Twitter spam filters:

Dataset                 # of spam tweets    # of non-spam tweets
Training                1000                1000
Testing (Dataset 1)     1000                1000
Testing (Dataset 2)     190                 1900
The second testing dataset is much more realistic because, in real life, the number of spam
tweets is much lower than the number of non-spam tweets.
The Technical Demonstration chapter contains all the details of the PCA analysis and the step-by-step
training and prediction on both testing datasets using the five classification methods listed
above. The Performance Evaluation chapter shows the performance of each classifier based on
the five evaluation metrics listed above.
Literature review
Spamming, one of the major problems of the information age, has been studied rigorously by
researchers and practitioners in the field of security data analytics using different machine
learning techniques. According to Alexa, Twitter is one of the most visited websites in the
world. The ever-increasing traffic on Twitter also attracts spammers, and it therefore becomes
very important to detect and remove any spammers on the site to ensure a quality experience for
the general public on Twitter.
The problem of spam detection is difficult because it is very easy for spammers to fabricate
the features of a benign account. So, to detect spammers, researchers have to consider
many different features, such as account-based features like the age of an account, the number of
followers, the number of accounts followed, etc., and content-based features like the number of times a tweet has
been...