Answer To: 1.1P Basic Linux Security SIT719 Security and Privacy Issues in Analytics Distinction/Higher...
Sandeep Kumar answered on Jun 08 2021
The columns (features) of the NSL-KDD dataset are:
'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'outcome', 'difficulty'
The training set has 125973 rows, while the testing set has 22543 rows.
Also, the training set has 23 possible outcomes, namely normal, neptune, warezclient, ipsweep, portsweep, teardrop, nmap, satan, smurf, pod, back, guess_passwd, ftp_write, multihop, rootkit, buffer_overflow, imap, warezmaster, phf, land, loadmodule, spy, perl, while the testing set has 38 possible outcomes:
neptune, normal, saint, mscan, guess_passwd, smurf, apache2, satan, buffer_overflow, back, warezmaster, snmpgetattack, processtable, pod, httptunnel, nmap, ps, snmpguess, ipsweep, mailbomb, portsweep, multihop, named, sendmail, loadmodule, xterm, worm, teardrop, rootkit, xlock, perl, land, xsnoop, sqlattack, ftp_write, imap, udpstorm, phf.
Preprocessing starts with extracting the labels. As the two lists above show, the testing set contains 17 attack types that are absent from the training data (and the training set contains two, warezclient and spy, that are absent from the testing set). Hence, we need more general labels to train the model for the classification task.
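Comparing the two outcome lists given above shows which labels appear in only one of the sets. The following is a minimal sketch using plain Python sets; the variable names are illustrative and are not taken from the original notebook.

```python
# Outcome labels copied from the training and testing lists above.
train_outcomes = {
    "normal", "neptune", "warezclient", "ipsweep", "portsweep", "teardrop",
    "nmap", "satan", "smurf", "pod", "back", "guess_passwd", "ftp_write",
    "multihop", "rootkit", "buffer_overflow", "imap", "warezmaster", "phf",
    "land", "loadmodule", "spy", "perl",
}
test_outcomes = {
    "neptune", "normal", "saint", "mscan", "guess_passwd", "smurf", "apache2",
    "satan", "buffer_overflow", "back", "warezmaster", "snmpgetattack",
    "processtable", "pod", "httptunnel", "nmap", "ps", "snmpguess", "ipsweep",
    "mailbomb", "portsweep", "multihop", "named", "sendmail", "loadmodule",
    "xterm", "worm", "teardrop", "rootkit", "xlock", "perl", "land", "xsnoop",
    "sqlattack", "ftp_write", "imap", "udpstorm", "phf",
}

# Labels the model never sees during training, and labels it is never tested on.
test_only = sorted(test_outcomes - train_outcomes)
train_only = sorted(train_outcomes - test_outcomes)

print(len(test_only), test_only)
print(len(train_only), train_only)
```

Because some test labels never occur in training, a classifier trained on the raw outcome names could not predict them, which is the motivation for the coarser classes.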
All the attack types present in the given dataset (the 37 in the testing set plus warezclient and spy, which appear only in the training set) have been condensed into four general attack categories which, together with normal traffic, give the 5 classes used in this report:
· Denial of service attacks
· Remote to Local attacks
· User to Root attacks
· Probe attacks
· Normal traffic
Our models, namely the decision tree model, AdaBoost classifier, gradient boosting classifier, random forest classifier and extra tree classifier, classify the data into five classes: normal traffic or one of the four attack categories mentioned above. We then use these five classes to analyze the results and calculate performance metrics for each one.
The next section replaces the current outcome field with a Class field that has one of the following values:
· Normal
· DoS
· R2L
· U2R
· Probe
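A minimal sketch of this relabelling is shown below. The original answer does not list which outcome goes into which class, so the grouping here is an assumption based on the grouping commonly used for NSL-KDD; the names ATTACK_CLASSES, OUTCOME_TO_CLASS and to_class are hypothetical.

```python
# ASSUMPTION: this outcome-to-class grouping is the one commonly used for
# NSL-KDD; the original answer does not spell out its own mapping.
ATTACK_CLASSES = {
    "DoS": ["back", "land", "neptune", "pod", "smurf", "teardrop", "apache2",
            "udpstorm", "processtable", "worm", "mailbomb"],
    "Probe": ["satan", "ipsweep", "nmap", "portsweep", "mscan", "saint"],
    "R2L": ["guess_passwd", "ftp_write", "imap", "phf", "multihop",
            "warezmaster", "warezclient", "spy", "xlock", "xsnoop",
            "snmpguess", "snmpgetattack", "httptunnel", "sendmail", "named"],
    "U2R": ["buffer_overflow", "loadmodule", "rootkit", "perl", "sqlattack",
            "xterm", "ps"],
}

# Invert the grouping into a flat lookup table: outcome name -> class name.
OUTCOME_TO_CLASS = {"normal": "Normal"}
for class_name, outcome_names in ATTACK_CLASSES.items():
    for outcome_name in outcome_names:
        OUTCOME_TO_CLASS[outcome_name] = class_name


def to_class(outcome):
    """Return the general class for a raw outcome label (KeyError if unknown)."""
    return OUTCOME_TO_CLASS[outcome]


outcomes = ["normal", "neptune", "guess_passwd", "rootkit", "satan"]
classes = [to_class(outcome) for outcome in outcomes]
print(classes)
```

With a pandas DataFrame, the same lookup table could be applied to the outcome column with `Series.map`.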
Feature Engineering
For continuous features we use the MinMaxScaler provided by the scikit-learn library. The scaler is fitted on the training set values only and is then used to scale both the training and testing sets, so that no information from the testing set leaks into preprocessing. The minmax_scale_values helper function does this task. For the discrete features we use one-hot encoding, which the encode_text function performs.
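The two helpers could look roughly like the sketch below. Only the names minmax_scale_values and encode_text come from the text; their signatures, the use of scikit-learn's OneHotEncoder and the toy values are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


def minmax_scale_values(train_values, test_values):
    """Fit a MinMaxScaler on the training values only, then scale both sets.

    Hypothetical signature: both arguments are 2-D arrays (samples x features).
    """
    scaler = MinMaxScaler()
    scaler.fit(train_values)
    return scaler.transform(train_values), scaler.transform(test_values)


def encode_text(train_values, test_values):
    """One-hot encode discrete features using categories seen in training.

    Hypothetical signature; categories that occur only in the testing set
    are encoded as all zeros (handle_unknown="ignore").
    """
    encoder = OneHotEncoder(handle_unknown="ignore")
    encoder.fit(train_values)
    return (encoder.transform(train_values).toarray(),
            encoder.transform(test_values).toarray())


# Toy values standing in for a continuous and a discrete column.
train_bytes = np.array([[0.0], [50.0], [100.0]])
test_bytes = np.array([[25.0], [200.0]])
scaled_train, scaled_test = minmax_scale_values(train_bytes, test_bytes)

train_protocol = np.array([["tcp"], ["udp"], ["icmp"]])
test_protocol = np.array([["udp"], ["tcp"]])
encoded_train, encoded_test = encode_text(train_protocol, test_protocol)

print(scaled_test.ravel())
print(encoded_test)
```

Because the scaler only sees the training range, a test value larger than the training maximum is scaled to a value above 1, as the second test value in this toy example shows.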
The Model
The training data is imbalanced in the number of samples representing each attack type, and a model cannot learn about new attack types simply by observing existing ones. To address both problems, we use a set of classifier algorithms provided by the scikit-learn library: the decision tree classifier, AdaBoost classifier, gradient boosting classifier, random forest classifier and extra tree classifier.
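As a rough sketch, the five classifiers can be created and compared as follows. The data here is synthetic and stands in for the preprocessed NSL-KDD features. Setting random_state=17 on every model and using ExtraTreesClassifier from sklearn.ensemble for the extra tree classifier are assumptions; the text only states the random state for the decision tree.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 4 features, label decided by the first feature.
class_names = np.array(["Normal", "DoS", "R2L", "U2R", "Probe"])
rng = np.random.default_rng(0)
X = rng.random((300, 4))
y = class_names[np.digitize(X[:, 0], [0.2, 0.4, 0.6, 0.8])]
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

# ASSUMPTION: random_state=17 is stated only for the decision tree;
# it is reused for the other models here to keep the sketch reproducible.
models = {
    "decision tree": DecisionTreeClassifier(random_state=17),
    "AdaBoost": AdaBoostClassifier(random_state=17),
    "gradient boosting": GradientBoostingClassifier(random_state=17),
    "random forest": RandomForestClassifier(random_state=17),
    "extra trees": ExtraTreesClassifier(random_state=17),
}

# Fit each model on the training split and record test accuracy.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)

for name, value in accuracy.items():
    print(f"{name}: {value:.3f}")
```

All five estimators share the same fit/predict/score interface, so swapping the synthetic arrays for the scaled and encoded NSL-KDD features requires no change to the loop.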
1) Decision Tree classifier: Decision trees are a non-parametric supervised learning technique used for both classification and regression. The objective is to build a model that predicts the value of a target variable by learning simple decision rules from the data features. The model is simple to interpret, since it can be easily visualized, it requires little to no data preparation, and it is well suited to multi-class classification such as the 5-class classification done in this assignment. The classifier takes as input an array of shape (number of samples, number of features). The only parameter we set is random_state, which is 17 for this model.
2) Random Forest Classifier: A random forest classifier is a meta estimator that fits a number of decision tree classifiers on...