Pre-processing the NSL-KDD data set


I want to load the NSL-KDD dataset used in the notebook linked below, using Python.

https://github.com/smellslikeml/deepIDS/blob/master/deep_IDS.ipynb

In this dataset, the traffic records for training and testing are supposed to be classified into 5 separate classes (Normal, DOS, U2R, R2L, Probe).
But when I run the line y_test = pd.get_dummies(y_test), instead of being categorized into 5 classes it gives me 22 columns, while doing the same thing on the training data (target = pd.get_dummies(target)) gives the correct result. Here is the code I am using:
import pandas as pd
import numpy as np

with open('G:/RUN_PYTHON/kddcup.names.txt', 'r') as infile:
    kdd_names = infile.readlines()
kdd_cols = [x.split(':')[0] for x in kdd_names[1:]]
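For reference, this parsing assumes the usual kddcup.names layout: the first line lists the attack labels, and every later line looks like "duration: continuous.", so skipping line 0 and splitting on ':' yields the feature names. A small self-contained illustration of that assumption:

# Illustrative only: the assumed structure of kddcup.names
sample_names = [
    'back,buffer_overflow,ftp_write,...,teardrop.\n',  # line 0: label list, skipped below
    'duration: continuous.\n',
    'protocol_type: symbolic.\n',
    'service: symbolic.\n',
]
print([x.split(':')[0] for x in sample_names[1:]])  # ['duration', 'protocol_type', 'service']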

The Train+/Test+ datasets include a sample difficulty rating and the attack class

kdd_cols += ['class', 'difficulty']

kdd = pd.read_csv('G:/RUN_PYTHON/KDDTrain+.txt', names=kdd_cols)
kdd_t = pd.read_csv('G:/RUN_PYTHON/KDDTest+.txt', names=kdd_cols)
#kdd = pd.read_csv('G:/RUN_PYTHON/kddcup.txt.data_10_percent_corrected', names=kdd_cols)
#kdd_t = pd.read_csv('G:/RUN_PYTHON/kddcup.testdata.unlabeled_10_percent', names=kdd_cols)

Consult the linked references for attack categories:

https://www.researchgate.net/post/What_are_the_attack_types_in_the_NSL-KDD_TEST_set_For_example_processtable_is_a_attack_type_in_test_set_Im_wondering_is_it_prob_DoS_R2L_U2R

The traffic can be grouped into 5 categories: Normal, DOS, U2R, R2L, Probe

or more coarsely into Normal vs Anomalous for the binary classification task
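For the binary task, a possible shortcut (a sketch, not part of the notebook, assuming the class column holds the NSL-KDD labels as plain strings such as 'normal') is to collapse every label other than normal into a single anomalous value before encoding:

# Sketch of the coarse binary labelling: everything that is not 'normal' becomes 'anomalous'
binary_train = kdd['class'].where(kdd['class'] == 'normal', 'anomalous')
binary_test = kdd_t['class'].where(kdd_t['class'] == 'normal', 'anomalous')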

kdd_cols = [kdd.columns[0]] + sorted(list(set(kdd.protocol_type.values))) + sorted(list(set(kdd.service.values))) + sorted(list(set(kdd.flag.values))) + kdd.columns[4:].tolist()
attack_map = [x.strip().split() for x in open('G:/RUN_PYTHON/training_attack_types.txt', 'r')]
attack_map = {x[0]: x[1] for x in attack_map if x}
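training_attack_types.txt is assumed to contain one "attack category" pair per line (for example back dos, buffer_overflow u2r), so attack_map ends up mapping each specific attack name to its coarse class. Note that, as the file name says, it only covers attacks seen in training; the test-only names that appear in the y_test output below (apache2, httptunnel, mscan, snmpguess, ...) have no entry in it. A quick check:

# attack_map is expected to look like {'back': 'dos', 'buffer_overflow': 'u2r', ...}
print(list(attack_map.items())[:5])
# Test-only attack names are missing from the training-derived map:
print('apache2' in attack_map, 'mscan' in attack_map)   # likely False False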

Here we opt for the 5-class problem

kdd['class'] = kdd['class'].replace(attack_map)
kdd_t['class'] = kdd_t['class'].replace(attack_map)
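This replace step is where the 22 test columns come from: any label without a key in attack_map is left as the raw attack name, and pd.get_dummies later creates one column per distinct value. A small diagnostic sketch (assuming the frames built above):

# Labels that are not one of the 5 coarse classes were left unmapped by .replace()
coarse = {'normal', 'dos', 'probe', 'r2l', 'u2r'}
unmapped = sorted(set(kdd_t['class'].unique()) - coarse)
print(unmapped)   # expected to list the test-only attacks: 'apache2', 'mscan', ...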

def cat_encode(df, col):
    # one-hot encode `col` and append the dummies in place of the original column
    return pd.concat([df.drop(col, axis=1), pd.get_dummies(df[col].values)], axis=1)

def log_trns(df, col):
    # log(1 + x) transform for heavily skewed counters such as byte counts
    return df[col].apply(np.log1p)

cat_lst = ['protocol_type', 'service', 'flag']
for col in cat_lst:
    kdd = cat_encode(kdd, col)
    kdd_t = cat_encode(kdd_t, col)

log_lst = ['duration', 'src_bytes', 'dst_bytes']
for col in log_lst:
    kdd[col] = log_trns(kdd, col)
    kdd_t[col] = log_trns(kdd_t, col)

kdd = kdd[kdd_cols]
for col in kdd_cols:
    if col not in kdd_t.columns:
        kdd_t[col] = 0
kdd_t = kdd_t[kdd_cols]

Now we have used one-hot encoding and log scaling
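Before popping the targets it may be worth verifying that the two frames really share the same columns in the same order; a quick sanity check (a sketch, not in the original notebook):

# Sanity check: after re-indexing with kdd_cols the train and test frames
# should line up column for column.
assert list(kdd.columns) == list(kdd_t.columns), 'train/test feature columns differ'
print(kdd.shape, kdd_t.shape)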

difficulty = kdd.pop('difficulty')
target = kdd.pop('class')
y_diff = kdd_t.pop('difficulty')
y_test = kdd_t.pop('class')

target = pd.get_dummies(target)
print(target)
y_test = pd.get_dummies(y_test)
print(y_test)

the output of target:
Out[27]:
        dos  normal  probe  r2l  u2r
0         0       1      0    0    0
1         0       1      0    0    0
2         1       0      0    0    0
...     ...     ...    ...  ...  ...
125971    1       0      0    0    0
125972    0       1      0    0    0

[125973 rows x 5 columns]

the output of y_test:
        apache2  dos  httptunnel  mailbomb  mscan  named  normal  probe  \
0             0    1           0         0      0      0       0      0
1             0    1           0         0      0      0       0      0
2             0    0           0         0      0      0       1      0
...         ...  ...         ...       ...    ...    ...     ...    ...
22542         0    0           0         0      0      0       1      0
22543         0    0           0         0      1      0       0      0

        processtable  ps  ...  sendmail  snmpgetattack  snmpguess  sqlattack  \
0                  0   0  ...         0              0          0          0
...              ...  ..  ...       ...            ...        ...        ...
22543              0   0  ...         0              0          0          0

        u2r  udpstorm  worm  xlock  xsnoop  xterm
0         0         0     0      0      0      0
...     ...       ...   ...    ...    ...    ...
22543     0         0     0      0      0      0

[22544 rows x 22 columns]

Best regards

May 13 by arezoo

1 answer to this question.


Hi @arezoo,

I don't know why y_test = pd.get_dummies(y_test) is not giving you the proper output, but you can do this task in another way, like this:

y_test = kdd['class']
Y_test = pd.get_dummies(y_test)

It will give you the categorical output.
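If the goal is the 5-class frame for the test labels, one more option (just a sketch, not from the notebook): the test set contains attack names such as apache2 or mscan that are missing from training_attack_types.txt, so .replace() leaves them untouched and get_dummies produces one column per raw name. Mapping those leftover names to their coarse category before encoding, and then aligning the columns with the training targets, should give the expected 5 columns. The category assignments below are illustrative only; take the real groupings from the references linked in the question.

# Sketch: extend the mapping for test-only attacks before one-hot encoding.
extra_map = {'apache2': 'dos', 'mscan': 'probe', 'snmpguess': 'r2l', 'xterm': 'u2r'}
y_test = y_test.replace(extra_map)        # y_test must still be the raw label Series here
y_test = pd.get_dummies(y_test)
# Once every label is mapped, force the same column order as the training targets:
y_test = y_test.reindex(columns=target.columns, fill_value=0)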

answered May 13 by MD

Hi @MD,

When I replace the line y_test = pd.get_dummies(y_test) with

y_test = kdd['class']
Y_test = pd.get_dummies(y_test)

I received this error:

 y_test = kdd['class']
Traceback (most recent call last):

  File "<ipython-input-19-b667b206b69b>", line 1, in <module>
    y_test = kdd['class']

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1964, in __getitem__
    return self._getitem_column(key)

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1971, in _getitem_column
    return self._get_item_cache(key)

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1645, in _get_item_cache
    values = self._data.get(item)

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3590, in get
    loc = self.items.get_loc(item)

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2444, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))

  File "pandas\_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\hashtable_class_helper.pxi", line 1210, in pandas._libs.hashtable.PyObjectHashTable.get_item

  File "pandas\_libs\hashtable_class_helper.pxi", line 1218, in pandas._libs.hashtable.PyObjectHashTable.get_item

KeyError: 'class'

Could you please help me?

I would suggest that you create your model in a Jupyter notebook and run your code step by step; that makes it easier to troubleshoot the issue. Once you find that your code runs correctly, save it with a .py extension.

Regarding the error, the above code should work: kdd is a DataFrame, and we can slice a single column from it.
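One likely reason for the KeyError, going by the script in the question: by the time y_test = kdd['class'] runs, the 'class' column has already been removed by the earlier kdd.pop('class'), so it no longer exists in the frame. A minimal sketch that avoids it, using the popped label Series instead:

# 'class' was popped out of kdd/kdd_t earlier, so encode the saved Series directly:
Y_train = pd.get_dummies(target)   # `target` holds the popped training labels
Y_test = pd.get_dummies(y_test)    # `y_test` holds the popped test labels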

Thanks for your response.
