Pre-processing on NSL KDD data set

0 votes

I want to load the NSL_KDD dataset contained in this link using Python.

In this dataset, 22 features of the training and testing data are classified into 5 separate classes (Normal, DOS, U2R, R2L, Probe).
But when I run the line y_test = pd.get_dummies(y_test), instead of 5 classes I get the same 22 features. I did the same thing for the training data (target = pd.get_dummies(target)) and got the correct result, but it does not work for the test data.

import numpy as np
import pandas as pd

with open('G:/RUN_PYTHON/kddcup.names.txt', 'r') as infile:
    kdd_names = infile.readlines()
kdd_cols = [x.split(':')[0] for x in kdd_names[1:]]

# The Train+/Test+ datasets include sample difficulty rating and the attack class

kdd_cols += ['class', 'difficulty']

kdd = pd.read_csv('G:/RUN_PYTHON/KDDTrain+.txt', names=kdd_cols)
kdd_t = pd.read_csv('G:/RUN_PYTHON/KDDTest+.txt', names=kdd_cols)
#kdd = pd.read_csv('G:/RUN_PYTHON/kddcup.txt.data_10_percent_corrected', names=kdd_cols)
#kdd_t = pd.read_csv('G:/RUN_PYTHON/kddcup.testdata.unlabeled_10_percent', names=kdd_cols)

# Consult the linked references for attack categories:
# The traffic can be grouped into 5 categories: Normal, DOS, U2R, R2L, Probe
# or more coarsely into Normal vs Anomalous for the binary classification task

kdd_cols = [kdd.columns[0]] + sorted(list(set(kdd.protocol_type.values))) + sorted(list(set(kdd.service.values))) + sorted(list(set(kdd.flag.values))) + kdd.columns[4:].tolist()
attack_map = [x.strip().split() for x in open('G:/RUN_PYTHON/training_attack_types.txt', 'r')]
attack_map = {x[0]: x[1] for x in attack_map if x}

# Here we opt for the 5-class problem

kdd['class'] = kdd['class'].replace(attack_map)
kdd_t['class'] = kdd_t['class'].replace(attack_map)

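def cat_encode(df, col):
    # Replace one categorical column with its one-hot encoded columns
    return pd.concat([df.drop(col, axis=1), pd.get_dummies(df[col].values)], axis=1)

def log_trns(df, col):
    # log(1 + x) scaling, which is defined for zero-valued entries
    return df[col].apply(np.log1p)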

cat_lst = ['protocol_type', 'service', 'flag']
for col in cat_lst:
    kdd = cat_encode(kdd, col)
    kdd_t = cat_encode(kdd_t, col)

log_lst = ['duration', 'src_bytes', 'dst_bytes']
for col in log_lst:
    kdd[col] = log_trns(kdd, col)
    kdd_t[col] = log_trns(kdd_t, col)

kdd = kdd[kdd_cols]
for col in kdd_cols:
    if col not in kdd_t.columns:
        kdd_t[col] = 0
kdd_t = kdd_t[kdd_cols]

# Now we have used one-hot encoding and log scaling

difficulty = kdd.pop('difficulty')
target = kdd.pop('class')
y_diff = kdd_t.pop('difficulty')
y_test = kdd_t.pop('class')

target = pd.get_dummies(target)
y_test = pd.get_dummies(y_test)
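
As a side note on the column-alignment loop in the listing above (adding missing columns filled with 0, then reordering to kdd_cols): the same idea can be sketched with DataFrame.reindex. The two small frames and their values below are made up for illustration and are not taken from NSL_KDD.

```python
import pandas as pd

# Made-up stand-ins for the train and test frames; only one categorical column.
train = pd.DataFrame({"service": ["http", "ftp", "smtp"]})
test = pd.DataFrame({"service": ["http", "http", "ftp"]})

# Encoding each frame separately can give different column sets,
# because "smtp" never occurs in the test frame.
train_enc = pd.get_dummies(train["service"])
test_enc = pd.get_dummies(test["service"])

# Force the test frame onto the training columns: missing columns are
# created and filled with 0, and columns unknown to training are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```

This only covers the feature columns. The class column is encoded separately, so it needs its own check.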

May 13, 2020 by arezoo
• 220 points

edited Jun 25, 2020 by MD 1,008 views

1 answer to this question.

0 votes


I don't know why y_test = pd.get_dummies(y_test) is not giving you the proper output, but you can do this task another way, like this.

y_test = kdd['class']
Y_test = pd.get_dummies(y_test)

It will give you the categorical output.
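
One possible explanation, offered as an assumption because nothing in this thread confirms it: replace(attack_map) only rewrites labels that appear as keys in attack_map. If KDDTest+.txt contains attack names that are missing from training_attack_types.txt, those names pass through unchanged and pd.get_dummies creates one extra column for each of them. The sketch below uses a shortened, made-up map and made-up labels to show the effect, and then fixes the set of output columns with a categorical dtype. The name "mscan" and its category are used for illustration only.

```python
import pandas as pd

# Hypothetical, shortened mapping; the real one is read from training_attack_types.txt.
attack_map = {"neptune": "dos", "smurf": "dos", "satan": "probe"}
classes = ["normal", "dos", "probe", "r2l", "u2r"]

# Made-up test labels; "mscan" stands in for an attack name missing from the map.
y_test = pd.Series(["normal", "neptune", "mscan", "satan"])

# Labels without a key in attack_map are left as they are.
mapped = y_test.replace(attack_map)
unmapped = sorted(set(mapped) - set(classes))
print(unmapped)  # labels that would each become an extra dummy column

# Extend the map for the leftover labels (the category chosen here is an assumption),
# then encode against a fixed list of classes so the columns are always the same.
attack_map["mscan"] = "probe"
mapped = y_test.replace(attack_map)
y_onehot = pd.get_dummies(mapped.astype(pd.CategoricalDtype(categories=classes)))
print(list(y_onehot.columns))
```

Running the same unmapped check on the real y_test before calling pd.get_dummies should show whether this is what is happening in your data.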

answered May 13, 2020 by MD
• 95,300 points

Hi @MD,

When I replace the line y_test = pd.get_dummies(y_test) with

y_test = kdd['class']
Y_test = pd.get_dummies(y_test)

I received this error:

 y_test = kdd['class']
Traceback (most recent call last):

  File "<ipython-input-19-b667b206b69b>", line 1, in <module>
    y_test = kdd['class']

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\", line 1964, in __getitem__
    return self._getitem_column(key)

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\", line 1971, in _getitem_column
    return self._get_item_cache(key)

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\", line 1645, in _get_item_cache
    values = self._data.get(item)

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\", line 3590, in get
    loc = self.items.get_loc(item)

  File "C:\Python\Anaconda3\lib\site-packages\pandas\core\indexes\", line 2444, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))

  File "pandas\_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\hashtable_class_helper.pxi", line 1210, in pandas._libs.hashtable.PyObjectHashTable.get_item

  File "pandas\_libs\hashtable_class_helper.pxi", line 1218, in pandas._libs.hashtable.PyObjectHashTable.get_item

KeyError: 'class'

Could you please help me?

I would suggest that you build your model in a Jupyter notebook and run your code step by step. That makes it easier to troubleshoot the issue. Once the code runs well, save it as a .py file.

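Regarding the error: kdd is a DataFrame, so kdd['class'] normally selects a single column. The KeyError means the column no longer exists in kdd, which is what happens if target = kdd.pop('class') has already been run in the same session. Also note that kdd holds the training data, so for the test labels you need the class column of kdd_t.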
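
A minimal sketch of how DataFrame.pop leads to this KeyError, using a tiny made-up frame in place of the real kdd data:

```python
import pandas as pd

# Tiny made-up frame standing in for kdd.
kdd = pd.DataFrame({"duration": [0, 2], "class": ["normal", "dos"]})

# pop returns the column AND removes it from the frame.
target = kdd.pop("class")

try:
    kdd["class"]
    raised = False
except KeyError:
    # Same error as in the traceback: the column is gone after pop.
    raised = True

has_class = "class" in kdd.columns

# The popped Series is still available and can be encoded directly.
Y = pd.get_dummies(target)
print(raised, has_class, list(Y.columns))
```

So after the pop calls in the question, the labels are in target and y_test, and those are the objects to pass to pd.get_dummies.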

Thanks for your response.
