


SIT 384 Data Analytics for Cyber Security
Assignment 2, Trimester 1, 2017

Objectives
•To apply skills and knowledge acquired throughout the trimester in classification algorithms and the machine learning process.
•To rationalize the use of machine learning algorithms to effectively and efficiently process data of big size.
•To demonstrate the ability to use R to perform email classification tasks that are common for a corporate security analyst.
•To scientifically conduct and document machine learning experiments for analytics purposes.

Due Date: 2pm, Friday, May 19, 2017. This assignment consists of a report worth 20 marks. Delays caused by a student's own computer downtime cannot be accepted as a valid reason for late submission without penalty. Students must plan their work to allow for both scheduled and unscheduled downtime.

Submission instructions: You must submit an electronic copy of all your assignment files via Cloud-Deakin. You must include your report, source code, necessary data files and, optionally, a presentation file. Assignments will not be accepted through any other manner of submission. Students should note that email and paper-based submissions will ordinarily be rejected.

Special requirements to prove the originality of your work: On-campus students (B and G) are required to demonstrate the execution of their classification programs in R to their tutor in Week 10; Cloud students are required to attach a 3-5 minute video presentation demonstrating how their R code is executed to derive the claimed results. The video should be uploaded to cloud storage (you can find out how to upload a video from https://video.deakin.edu.au/). Failure to do so will result in a delayed assessment of your submission.

Late submissions: Submissions received after the due date are penalized at a rate of 5% (of the full mark) per day, no exceptions. Late submission after 5 days will be penalized at a rate of 100% of the full mark. Close of submissions on the due date and each day thereafter for penalties will occur at 05:00 pm Australian Eastern Time (UTC+10 hours). Students outside of Victoria should note that the normal time zone in Victoria is UTC+10 hours. No extension will be granted. It is the student's responsibility to ensure that they understand the submission instructions. If you have ANY difficulties, ask the Lecturer/Tutor for assistance (prior to the submission date).

Copying, Plagiarism Notice: This is an individual assignment. You are not permitted to work as part of a group when writing this assignment. The University's policy on plagiarism can be viewed online at http://www.deakin.edu.au/students/study-support/referencing/plagiarism
Overview
The popularity of social media networks, such as Twitter, has led to an increasing number of spamming activities. Researchers have employed various machine learning methods to detect Twitter spam. In this assignment, you are required to classify spam tweets using the provided datasets. The features have been extracted and clearly structured in JSON format. The extracted features can be categorized into two groups: user profile-based features and tweet content-based features, as summarized in Table 1. The provided training dataset and testing datasets are listed separately in Table 2 and Table 3. In the testing datasets, the ratio of spam to non-spam is 1:1 in Dataset 1, while the ratio is 1:19 in Dataset 2. In most previous work, the testing datasets are nearly evenly distributed. However, in the real world only around 5% of tweets on Twitter are spam, which indicates that testing Dataset 2 simulates the real-world scenario. You are required to classify spam tweets, evaluate the classifiers' performance, and compare the Dataset 1 and Dataset 2 outcomes by conducting experiments.


Summary
Introduction
This paper gives an overview of the state of the art of machine learning applications for spam
detection in a Twitter dataset, using the five machine learning classifiers listed below (a brief
R fitting sketch follows the list):
1. K-Nearest Neighbors (KNN)
2. Support Vector Machines (SVM)
3. Naïve Bayes (NB) Classifier
4. Adaptive Boosting (AdaBoost)
5. Random Forest Classifier
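The sketch below illustrates, at a high level, how these five classifiers might be fitted in R. It is a minimal illustration only: the data frame names train and test, the factor column label (spam / non-spam), the chosen packages (class, e1071, adabag, randomForest) and the tuning values (k, kernel, mfinal, ntree) are all assumptions made for illustration, not details taken from the report or the assignment brief.

# Minimal sketch of fitting the five classifiers; all object names are assumed.
library(class)         # knn()
library(e1071)         # svm(), naiveBayes()
library(adabag)        # boosting()
library(randomForest)  # randomForest()

features <- setdiff(names(train), "label")   # the 13 numeric feature columns

# 1. K-Nearest Neighbors: a lazy learner, so prediction happens in one call
knn_pred <- knn(train = train[, features], test = test[, features],
                cl = train$label, k = 5)

# 2. Support Vector Machine with a radial-basis kernel
svm_fit <- svm(label ~ ., data = train, kernel = "radial")

# 3. Naive Bayes
nb_fit <- naiveBayes(label ~ ., data = train)

# 4. AdaBoost (adabag grows classification trees as weak learners)
ada_fit <- boosting(label ~ ., data = train, mfinal = 50)

# 5. Random Forest
rf_fit <- randomForest(label ~ ., data = train, ntree = 500)

Predictions on a testing dataset would then be obtained with predict(svm_fit, test), predict(nb_fit, test), and so on, assuming a test data frame with the same columns exists.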
For evaluating and comparing the performance of the various classifiers, five different performance
evaluation metrics were used, and are defined below:
1. Accuracy
It is the ratio of the number of correct predictions to the total number of cases examined. That is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Recall
It is the ratio of the number of correctly detected spam to the actual number of spam in the
dataset. That is:

Recall = TP / (TP + FN)
3. Precision
It is the ratio of the number of correctly detected spam to the total number of spam predicted.
That is:

Precision = TP / (TP + FP)
4. Error rate
It is the ratio of the number of incorrect predictions to the total number of cases in the dataset. That is:

Error rate = (FP + FN) / (TP + TN + FP + FN)
5. False positive rate
It is the ratio of the number of incorrectly labelled legitimate data points (account/mail/…) to the total
number of legitimate data points. That is:

False positive rate = FP / (FP + TN)
where TP, FN, FP and TN are defined by the confusion matrix below:

                        Predicted positive      Predicted negative
Actual positive         TP (True Positive)      FN (False Negative)
Actual negative         FP (False Positive)     TN (True Negative)
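These five metrics can be computed directly from predicted and actual labels. The short R helper below is a minimal sketch of that computation; the function name spam_metrics, the argument names and the positive-class label "spam" are illustrative assumptions, not names taken from the assignment or datasets.

# Compute the five evaluation metrics from predicted and actual class labels.
# `pred` and `actual` are vectors/factors of class labels; `positive` names the
# spam class. All object names here are assumptions made for illustration.
spam_metrics <- function(pred, actual, positive = "spam") {
  tp <- sum(pred == positive & actual == positive)   # correctly detected spam
  fn <- sum(pred != positive & actual == positive)   # missed spam
  fp <- sum(pred == positive & actual != positive)   # legitimate flagged as spam
  tn <- sum(pred != positive & actual != positive)   # correctly kept legitimate
  c(accuracy   = (tp + tn) / (tp + tn + fp + fn),
    recall     = tp / (tp + fn),
    precision  = tp / (tp + fp),
    error_rate = (fp + fn) / (tp + tn + fp + fn),
    fpr        = fp / (fp + tn))
}

For example, spam_metrics(knn_pred, test$label) would return all five metrics for the KNN predictions on a test set, assuming objects with those names exist.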
The data for training and testing the spam filter consists of six account-based features and seven
content-based features, as listed below:

Category                    Feature              Description
Account-based features      account_age          The age of an account
                            no_follower          # of followers
                            no_following         # of following
                            no_userfavorites     # of favorites the user received
                            no_lists             # of lists in which the user is a member
                            no_tweets            # of tweets posted by the user
Content-based features      no_retweets          # of times this tweet has been retweeted
                            no_tweetfavorites    # of favorites this tweet received
                            no_hashtag           # of hashtags in this tweet
                            no_usermention       # of user mentions in this tweet
                            no_urls              # of URLs contained in this tweet
                            no_char              # of characters in this tweet
                            no_digits            # of digits in this tweet
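Because the assignment provides the extracted features in JSON format, one plausible way to read them into R is sketched below using the jsonlite package. The file name "train.json" and the presence of a label field are assumptions about the provided files, not details confirmed by the brief.

# Sketch of loading a JSON feature file into a data frame (assumed file name).
library(jsonlite)

train <- fromJSON("train.json", flatten = TRUE)   # one row per tweet
str(train)   # expect the 13 feature columns from the table above, plus a label

# Make sure the class label is treated as categorical by the classifiers
train$label <- factor(train$label)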
Principal Component Analysis (PCA) was carried out on the above 13 features, and of the new
principal components obtained from the analysis, 3 components were retained, as they were able to
describe almost 99% of the variation in the data.
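A minimal sketch of how this PCA step could be reproduced in R with prcomp is shown below. The report does not state whether the features were centred and scaled, so that choice, along with all object names, is an assumption made for illustration.

# PCA on the 13 features; `train` is assumed to hold the training data.
features <- setdiff(names(train), "label")
pca_fit  <- prcomp(train[, features], center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the principal components
cum_var <- cumsum(pca_fit$sdev^2) / sum(pca_fit$sdev^2)
n_comp  <- which(cum_var >= 0.99)[1]   # 3 components according to the report

# Project the data onto the retained components for use by the classifiers
train_pca <- as.data.frame(predict(pca_fit, train[, features])[, 1:n_comp, drop = FALSE])
train_pca$label <- train$label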
The following table shows the datasets used for training and testing the Twitter spam filters:

Dataset                 # of spam tweets    # of non-spam tweets
Training                1000                1000
Testing (Dataset 1)     1000                1000
Testing (Dataset 2)     190                 1900
The second testing dataset is much more realistic because, in real life, the number of spam
tweets is much lower than the number of non-spam tweets.
The Technical Demonstration chapter contains all the details of the PCA analysis and the step-by-step
training and prediction on both testing datasets using the five classification methods listed
above. The Performance Evaluation chapter shows the performance of each classifier based on
the five evaluation metrics listed above.
Literature review
Spamming, one of the major problems of the information age, has been studied rigorously by
researchers and practitioners in the field of security data analytics using different machine
learning techniques. According to Alexa, Twitter is one of the most visited websites in the
world. The ever-increasing traffic on Twitter also attracts spammers, and it therefore becomes
very important to detect and remove any spammers on the site to ensure a quality experience for
the general public on Twitter.
The problem of spam detection is difficult because it is very easy for spammers to fabricate
the features of a benign account. So, to detect spammers, researchers have to consider
many different features, such as account-based features like the age of an account, the number of
followers, the number of accounts followed, etc., and content-based features like the number of times a tweet has
been...