security and privacy issues1.1P Basic Linux Security SIT719 Security and Privacy Issues in...

Question

1.1P Basic Linux Security

SIT719 Security and Privacy Issues in Analytics

Distinction/Higher Distinction Task 5.1
End-to-end project delivery on cyber-security data analytics

Overview

Do you know what is an end-to-end data science project? See the lifecycle of an end-to-end data science project. If you are doing data science application for security analysis, your problem will be related to the cybersecurity and your data analysis needs to follow the below steps. See the task description for the detailed instructions.

Figure 1: Data Science Lifecycle [source: http://sudeep.co/]

In this Distinction/Higher Distinction Task, you will experiment with Machine Learning classification algorithms. Please see more details in the Task description. Before attempting this task, please make sure you are already up to date with all Credit and Pass tasks.

Task Description

Instructions:

Suppose, you are working in an organization as a security analyst. You need to conduct an end to end project on “cyber-attack classification in the network traffic database”. To complete the project you follow the steps in Figure 1. Here, some of the steps are already solved for you (by the teaching team, you don’t need to take any action) and the remaining you need to complete (highlighted in blue) by yourself to submit this task.

Step 1: Business Understanding (Problem Definitions)

Your task is to develop a 5-class machine learning-based classification model to identify the normal network traffics and attack classes.

Step 2: Data Gathering (Identify the source of data)

In the industry, you need to communicate either with your manager, client, other stakeholder and/or IT team to understand the source of data and to gather it. Here, the teaching team already gathered data for you. Please see the below link to obtain the NSL-KDD dataset with 5 classes.

Dataset Link (cloudDeakin):
https://d2l.deakin.edu.au/d2l/le/content/881325/home?ou=881325

If you are interested to learn more about the dataset, please visit the website below:
https://www.unb.ca/cic/datasets/nsl.html

***A starting example code for the 5 class classification is also given for your benefit, where some of the steps are already implemented. Please see the following link to access it:
https://d2l.deakin.edu.au/d2l/le/content/881325/home?ou=881325

Step 3: Data Cleaning (Filtering anomalous data)

You need to take care of missing values and inconsistent data. In week 2, you have learnt how to deal with missing values and manipulate a database. Here, it has already been taken care of for this dataset (so no action is needed for this task).

Step 4: Data Exploration (Understanding the data)

Here, you need to do the following tasks and write in your report:
1. Identify the attribute names (Header)
2. Check the length of the Train and Test dataset
3. Check the total number of samples that belong to each of the five classes of the training dataset.

Step 5: Feature Engineering (Select Important Feature)

You may need to do feature extraction or selection during your data analysis process. Some fundamental concepts of feature engineering using python is discussed in the following link:
https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html

Here, relevant feature engineering is already done for you in the sample code.

Step 6: Predictive Modelling (Prediction of the classes)

The DecisionTreeClassifier has been implemented for you. Now, you need to implement other techniques and compare. Please do the following tasks:
1. Implement at least 5 benchmark classification algorithms.
2. Tune the parameters if applicable to obtain a good solution.
3. Obtain the confusion matrix for each of the scenarios.
4. Calculate the performance measures of the each of the classification algorithms that includes Accuracy (%), Precision (%), Recall (%), F-Score (%), False Alarm- FPR (%)

You need to compare the results following the table below. Create one table for each algorithm.

Attack Class    Accuracy (%)    Precision (%)    Recall (%)    ...
DoS
Normal
Probe
R2L
U2R

Finally, you summarize the results similar to the below table:

Algorithms      Accuracy (%)    Precision (%)    Recall (%)    ...
Alg 1
Alg 2
…

Your results need to be comparable against benchmark algorithms. For example, see the below results obtained from a recent article “An Adaptive Ensemble Machine Learning Model for Intrusion Detection” published in IEEE ACCESS, July 2019.

Step 7: Data Visualization

Perform the following tasks:
1. Visualize and compare the accuracy of different algorithms.
2. Plot the confusion matrix for each scenarios.

Step 8: Results delivery

Once you have completed the data analysis task for your security project, you need to deliver the outcome. Results can be delivered as a product/tool/web-app or through a presentation or by submitting the report.

Here, you need to write a report (at least 2000 word) based on the outcome and results you obtained by performing the above steps. The report will describe the algorithms used, their working principle, key parameters, and the results. Results should consider all the key performance measures and comparative results in the form of tables, graphs, etc.

Compile everything in a PDF and submit through onTrack. You also need to include the python script during submission.

Sandeep Kumar · Accepted Answer

The columns (features) of the NSL-KDD dataset are:
'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'outcome', 'difficulty'
The training set has 125973 rows, while the testing set has 22543 rows.
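As a rough sketch of how the header and the row counts could be checked with pandas: the helper name load_nsl_kdd is hypothetical, the files are assumed to be comma-separated with no header row, and the two rows below are synthetic stand-ins rather than real NSL-KDD records.

```python
import io

import pandas as pd

# Column names exactly as listed above: 41 features plus 'outcome' and 'difficulty'.
COLUMNS = """
duration protocol_type service flag src_bytes dst_bytes land wrong_fragment urgent hot
num_failed_logins logged_in num_compromised root_shell su_attempted num_root
num_file_creations num_shells num_access_files num_outbound_cmds is_host_login
is_guest_login count srv_count serror_rate srv_serror_rate rerror_rate srv_rerror_rate
same_srv_rate diff_srv_rate srv_diff_host_rate dst_host_count dst_host_srv_count
dst_host_same_srv_rate dst_host_diff_srv_rate dst_host_same_src_port_rate
dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_srv_serror_rate
dst_host_rerror_rate dst_host_srv_rerror_rate outcome difficulty
""".split()


def load_nsl_kdd(source):
    """Read a header-less, comma-separated file (path or file-like object)."""
    return pd.read_csv(source, header=None, names=COLUMNS)


# Synthetic stand-in rows (NOT real NSL-KDD records) so the sketch runs without the files.
row_normal = ["0", "tcp", "http", "SF"] + ["0"] * 37 + ["normal", "20"]
row_attack = ["0", "tcp", "private", "S0"] + ["0"] * 37 + ["neptune", "21"]
sample = io.StringIO("\n".join(",".join(r) for r in (row_normal, row_attack)))

df = load_nsl_kdd(sample)
print(list(df.columns))               # attribute names (header)
print(len(df))                        # number of rows
print(df["outcome"].value_counts())   # number of samples per outcome label
```

With the real files, the same function would be called once on the training file and once on the testing file, and len() of each result gives the dataset lengths reported above.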
The training set has 23 possible outcomes: normal, neptune, warezclient, ipsweep, portsweep, teardrop, nmap, satan, smurf, pod, back, guess_passwd, ftp_write, multihop, rootkit, buffer_overflow, imap, warezmaster, phf, land, loadmodule, spy, perl.
The testing set has 38 possible outcomes: neptune, normal, saint, mscan, guess_passwd, smurf, apache2, satan, buffer_overflow, back, warezmaster, snmpgetattack, processtable, pod, httptunnel, nmap, ps, snmpguess, ipsweep, mailbomb, portsweep, multihop, named, sendmail, loadmodule, xterm, worm, teardrop, rootkit, xlock, perl, land, xsnoop, sqlattack, ftp_write, imap, udpstorm, phf.
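The two outcome lists can be compared directly with set operations. This is a small sketch that uses only the labels quoted in this answer; the variable names are illustrative.

```python
# Outcome labels copied from the two lists in this answer.
train_outcomes = set("""
normal neptune warezclient ipsweep portsweep teardrop nmap satan smurf pod back
guess_passwd ftp_write multihop rootkit buffer_overflow imap warezmaster phf land
loadmodule spy perl
""".split())

test_outcomes = set("""
neptune normal saint mscan guess_passwd smurf apache2 satan buffer_overflow back
warezmaster snmpgetattack processtable pod httptunnel nmap ps snmpguess ipsweep
mailbomb portsweep multihop named sendmail loadmodule xterm worm teardrop rootkit
xlock perl land xsnoop sqlattack ftp_write imap udpstorm phf
""".split())

# Labels the model never sees during training, and labels that only occur in training.
unseen_in_train = sorted(test_outcomes - train_outcomes)
only_in_train = sorted(train_outcomes - test_outcomes)

print(len(train_outcomes), len(test_outcomes))
print(unseen_in_train)
print(only_in_train)
```

Labels that appear only in the testing set are the reason a coarser labelling is needed before training.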
Preprocessing starts with extracting the labels. As the two lists above show, the testing set contains 17 attack types that are absent from the training data (the training set in turn has two, warezclient and spy, that do not appear in the testing set), so we need more general labels to train the model for the classification task.
All the attack types present in the given dataset (22 in the training set, 37 in the testing set) have been condensed into four general attack categories which, together with normal traffic, form the 5 classes used in this report:
· Denial of service (DoS) attacks
· Remote to Local (R2L) attacks
· User to Root (U2R) attacks
· Probe attacks
· Normal traffic
Our models, namely the decision tree, AdaBoost, gradient boosting, random forest and extra tree classifiers, assign each record to one of five classes: normal traffic or one of the four attack categories listed above. We then use these five classes to analyze the results and calculate performance metrics for each class.
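One way the per-class measures could be derived from a confusion matrix is sketched below. It assumes the scikit-learn convention (rows are true classes, columns are predicted classes) and treats each class one-vs-rest; the function name per_class_metrics and the 2x2 demo matrix are illustrative only, not results from this project.

```python
import numpy as np


def _safe_div(num, den):
    """Element-wise division that returns 0 where the denominator is 0."""
    return np.divide(num, den, out=np.zeros_like(num), where=den != 0)


def per_class_metrics(cm):
    """Per-class accuracy, precision, recall, F-score and FPR, all in percent.

    Assumes cm[i, j] counts samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)              # correctly predicted samples of each class
    fp = cm.sum(axis=0) - tp      # predicted as the class but belonging to another
    fn = cm.sum(axis=1) - tp      # belonging to the class but predicted as another
    tn = total - tp - fp - fn     # everything else

    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    return {
        "accuracy": 100 * (tp + tn) / total,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f_score": 100 * _safe_div(2 * precision * recall, precision + recall),
        "fpr": 100 * _safe_div(fp, fp + tn),   # false alarm rate
    }


# Illustrative two-class matrix (made-up counts, not project results).
demo_cm = [[8, 2],
           [1, 9]]
metrics = per_class_metrics(demo_cm)
for name, values in metrics.items():
    print(name, np.round(values, 2))
```

For the 5-class problem the same function would take the 5x5 matrix of each classifier and return one value per class for each measure.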
The next section replaces the current outcome field with a Class field that has one of the following values:
· Normal
· DoS
· R2L
· U2R
· Probe
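The replacement could look roughly like the sketch below. The grouping of individual attacks into categories is not given in this answer; the mapping used here is the one commonly applied to NSL-KDD and should be treated as an assumption, as should the helper name add_class_column.

```python
import pandas as pd

# Assumed grouping (the conventional NSL-KDD one; not stated in this answer).
ATTACK_CLASSES = {
    "DoS": ["back", "land", "neptune", "pod", "smurf", "teardrop", "apache2",
            "udpstorm", "processtable", "worm", "mailbomb"],
    "Probe": ["satan", "ipsweep", "nmap", "portsweep", "mscan", "saint"],
    "R2L": ["guess_passwd", "ftp_write", "imap", "phf", "multihop", "warezmaster",
            "warezclient", "spy", "xlock", "xsnoop", "snmpguess", "snmpgetattack",
            "httptunnel", "sendmail", "named"],
    "U2R": ["buffer_overflow", "loadmodule", "rootkit", "perl", "sqlattack",
            "xterm", "ps"],
}

# Flatten to a lookup table: specific outcome label -> general class.
OUTCOME_TO_CLASS = {
    outcome: general
    for general, outcomes in ATTACK_CLASSES.items()
    for outcome in outcomes
}
OUTCOME_TO_CLASS["normal"] = "Normal"


def add_class_column(df):
    """Return a copy of df with 'outcome' replaced by a 5-valued 'Class' column."""
    out = df.copy()
    out["Class"] = out["outcome"].map(OUTCOME_TO_CLASS)
    return out.drop(columns=["outcome"])


demo = pd.DataFrame({"outcome": ["normal", "neptune", "saint", "guess_passwd", "rootkit"]})
labelled = add_class_column(demo)
print(labelled["Class"].tolist())
```

Any outcome missing from the lookup table would map to NaN, which makes an incomplete mapping easy to spot before training.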
Feature Engineering
For continuous features we use the MinMaxScaler provided by the scikit-learn library. The scaler is fitted on the training set values only and is then used to scale both the training and testing sets; the minmax_scale_values helper function does this. For discrete features we use one-hot encoding, which the encode_text function performs.
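The two helpers are named in this answer, but their code is not shown, so the signatures and bodies below are assumptions: a minimal sketch of fitting MinMaxScaler on the training column only, and of one-hot encoding with the category list taken from both sets so that the two frames end up with identical columns.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def minmax_scale_values(train_df, test_df, col):
    """Scale one continuous column; the scaler is fitted on the training data only."""
    scaler = MinMaxScaler()
    train_out, test_out = train_df.copy(), test_df.copy()
    train_out[col] = scaler.fit_transform(train_df[[col]]).ravel()
    test_out[col] = scaler.transform(test_df[[col]]).ravel()   # may fall outside [0, 1]
    return train_out, test_out


def encode_text(train_df, test_df, col):
    """One-hot encode one discrete column with the same dummy columns in both frames."""
    # Assumption: categories are collected from both sets so the columns line up.
    categories = sorted(set(train_df[col]) | set(test_df[col]))
    dtype = pd.CategoricalDtype(categories=categories)

    def one_hot(df):
        dummies = pd.get_dummies(df[col].astype(dtype), prefix=col, dtype=float)
        return pd.concat([df.drop(columns=[col]), dummies], axis=1)

    return one_hot(train_df), one_hot(test_df)


# Tiny made-up frames, only to show the behaviour.
train = pd.DataFrame({"src_bytes": [0, 50, 100], "protocol_type": ["tcp", "udp", "tcp"]})
test = pd.DataFrame({"src_bytes": [25, 200], "protocol_type": ["icmp", "tcp"]})

train_s, test_s = minmax_scale_values(train, test, "src_bytes")
train_e, test_e = encode_text(train_s, test_s, "protocol_type")
print(train_e)
print(test_e)
```

Because the scaler only sees the training values, a test value larger than the training maximum is scaled to a number above 1, which is expected with this approach.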
The Model
To cope with the imbalance between the number of samples for each attack type in the training data, and with attack types in the testing set that the model never sees during training, we use classifier-based machine learning algorithms provided by the scikit-learn library: the decision tree classifier, AdaBoost classifier, gradient boosting classifier, random forest classifier and extra tree classifier.
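A minimal sketch of fitting and comparing the five classifiers follows. It runs on a synthetic 5-class dataset from make_classification rather than on NSL-KDD, the hyperparameters other than the decision tree's random_state of 17 are arbitrary choices, and "extra tree classifier" is assumed to mean scikit-learn's ExtraTreesClassifier ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed NSL-KDD features and 5-class labels.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=5, n_clusters_per_class=1, random_state=17,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=17
)

# Settings are illustrative; only the decision tree's random_state comes from the answer.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=17),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=17),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=17),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=17),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=17),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=list(range(5))),
    }

for name, res in results.items():
    print(f"{name}: accuracy = {100 * res['accuracy']:.2f}%")
```

On the real data, X_train and X_test would be the scaled and encoded feature matrices and y the Class column, with the official training and testing files used instead of a random split.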
1) Decision Tree classifier: Decision trees are a non-parametric supervised learning technique used for both classification and regression. The goal is to build a model that predicts the value of a target variable by learning simple decision rules from the data features. The model is simple to interpret because it can be visualized easily, it requires little to no data preparation, and it handles multi-class classification such as the 5-class problem in this assignment. The classifier takes as input an array of shape (number of samples, number of features). The only parameter set for this model is random_state, which is 17.
2) Random Forest Classifier:
