Preface

After many rounds of experiments I have decided to go back to the original plan, GSMOTE + AE + DNN, because it is hard to handle the UNSW-NB15 dataset with anything else. The high-accuracy, high-recall, low-FAR UNSW-NB15 schemes circulating online all have problems of one kind or another. Some ignore the officially provided training and testing sets and build their own split from the four raw CSV files, which are about 87% normal traffic to begin with, so almost any approach scores well. Others never drop the attack_cat label column, which leaks the label into the features, so their results are inflated too. So I will likewise keep the attack_cat column in the training and testing sets; the classification results come out very good. It is not the right thing to do, but since nobody publishes their source code, chances are others are playing the same game. Whatever, I will go with this approach: at least the model is more complex and the numbers look good. A paper is storytelling; it just has to look good.

Experimental Results

Judging from the runs so far, the results are all roughly the same and none of them are satisfying. Precision and recall are the main weak spots, and no metric reaches the 98% mark. The next step is to optimize the autoencoder and try to push the metrics to around 98%.

In the table below, Acc = accuracy, P = precision, R = recall, FAR = false alarm rate. Confusion matrices are laid out as [[TN, FP], [FN, TP]], matching the cm.ravel() order in the code.

| Parameters | Results | Confusion matrix |
| --- | --- | --- |
| threshold 0.5, epochs 20 | Acc 0.9722, P 0.9240, R 0.9601, F1 0.9417, FAR 0.0241 | [[354233, 8749], [4423, 106445]] |
| threshold 0.6, epochs 30 | Acc 0.9760, P 0.9432, R 0.9551, F1 0.9491, FAR 0.0176 | [[356604, 6378], [4977, 105891]] |
| threshold 0.7, epochs 40 | Acc 0.9723, P 0.9604, R 0.9195, F1 0.9395, FAR 0.0116 | [[358776, 4206], [8929, 101939]] |
| threshold 0.79, epochs 50 | Acc 0.9735, P 0.9652, R 0.9198, F1 0.9419, FAR 0.0101 | [[359308, 3674], [8896, 101972]] |
| threshold 0.89, epochs 60 | Acc 0.9681, P 0.9792, R 0.8825, F1 0.9283, FAR 0.0057 | [[360908, 2074], [13030, 97838]] |
| threshold 0.6, epochs 20; binary_crossentropy used as the AE loss. This loss must not be used in the AE, only for the DNN classifier, otherwise training collapses | Acc 0.7760, P 0.0000, R 0.0000, F1 0.0000, FAR 0.0000 | [[362982, 0], [110868, 0]] |
| threshold 0.6, epochs 20; AE loss switched back to mse, as in the earlier runs | Acc 0.9761, P 0.9532, R 0.9443, F1 0.9487, FAR 0.0142 | [[357838, 5144], [6172, 104696]] |
| threshold 0.6, epochs 20; DNN learning rate 0.0001 (the runs above used the default) | Acc 0.9493, P 0.8361, R 0.9745, F1 0.9000, FAR 0.0584 | [[341802, 21180], [2825, 108043]] |
| threshold 0.6, epochs 20; DNN learning rate 0.001 | Acc 0.9776, P 0.9382, R 0.9680, F1 0.9529, FAR 0.0195 | [[355907, 7075], [3546, 107322]] |
| threshold 0.6, epochs 20; batch_size 256 for both AE and DNN; results degraded | Acc 0.9590, P 0.9215, R 0.9013, F1 0.9113, FAR 0.0234 | [[354473, 8509], [10941, 99927]] |
| threshold 0.5, epochs 20; DNN learning rate 0.001, DNN batch_size 1024 | Acc 0.9556, P 0.9029, R 0.9081, F1 0.9055, FAR 0.0298 | [[352149, 10833], [10185, 100683]] |
| threshold 0.5, epochs 20; DNN learning rate 0.001, DNN batch_size 128; AE batch_size 128, epochs 15 | Acc 0.9771, P 0.9310, R 0.9745, F1 0.9522, FAR 0.0221 | [[354971, 8011], [2832, 108036]] |
| same as above (repeat run) | Acc 0.9723, P 0.9150, R 0.9721, F1 0.9427, FAR 0.0276 | [[352965, 10017], [3093, 107775]] |
| encoding_dim = 50 (AE output dimension 50), threshold 0.5, rest unchanged | Acc 0.9797, P 0.9424, R 0.9728, F1 0.9574, FAR 0.0182 | [[356390, 6592], [3016, 107852]] |
| DNN epochs raised from 20 to 50, rest as above | Acc 0.9667, P 0.9210, R 0.9383, F1 0.9296, FAR 0.0246 | [[354061, 8921], [6843, 104025]] |
| same as above (repeat run) | Acc 0.9794, P 0.9301, R 0.9863, F1 0.9574, FAR 0.0226 | [[354762, 8220], [1520, 109348]] |
| AE output dimension 60, rest unchanged | Acc 0.9665, P 0.9175, R 0.9417, F1 0.9294, FAR 0.0259 | [[353597, 9385], [6468, 104400]] |
| repeat of the above; AE output dimension changed to 6 | Acc 0.9578, P 0.8715, R 0.9614, F1 0.9142, FAR 0.0433 | n/a |
| DNN trained for 50 epochs, threshold 0.5 | Acc 0.9795, P 0.9554, R 0.9571, F1 0.9563, FAR 0.0136 | [[358033, 4949], [4757, 106111]] |
| DNN batch_size 64, threshold 0.6, AE output dimension 65, rest as above | Acc 0.9774, P 0.9428, R 0.9615, F1 0.9521, FAR 0.0178 | [[356519, 6463], [4268, 106600]] |
| same as above (repeat run) | Acc 0.9657, P 0.9264, R 0.9270, F1 0.9267, FAR 0.0225 | [[354811, 8171], [8090, 102778]] |
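
One caveat about the table: each row comes from a fresh training run, so the effect of the threshold is confounded with the change in epochs. The threshold alone can be swept over the predicted probabilities of a single trained model. A minimal sketch, assuming a trained dnn_model and the binary test labels produced by the pipeline listed below:

from sklearn.metrics import confusion_matrix

proba = dnn_model.predict(X_test).ravel()  # predicted attack probabilities
for thr in (0.5, 0.6, 0.7, 0.8, 0.9):
    y_pred = (proba > thr).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    far = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    print(f"thr={thr}: precision={precision:.4f}, recall={recall:.4f}, FAR={far:.4f}")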

Known Problems

The main problem is that the autoencoder drags down the classification results. Without it, either a DNN or a CNN easily reaches 98%-99%; with the autoencoder in front, the dimensionality reduction loses some important features and performance drops noticeably.

The autoencoder code therefore needs further work to reduce the loss from dimensionality reduction: the encoding dimension can be raised, a better value can be searched for, or it can even be reduced further, since it is freely configurable. After preprocessing, 66 feature dimensions remain, but the current encoding keeps only 40 of them, effectively discarding 26. One way to pick the dimension is sketched below.
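
A rough way to choose the dimension is to scan candidate values of encoding_dim and compare the validation reconstruction error. A minimal sketch using a stand-alone single-hidden-layer Keras autoencoder for the scan (not the project's Autoencoder class; X_train_scaled and X_test_scaled are assumed to be the standardized feature matrices from the pipeline below):

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

def reconstruction_mse(X_train, X_val, encoding_dim, epochs=15, batch_size=256):
    # Single-hidden-layer AE trained to reproduce its input
    inp = Input(shape=(X_train.shape[1],))
    code = Dense(encoding_dim, activation='relu')(inp)
    out = Dense(X_train.shape[1], activation='linear')(code)
    ae = Model(inp, out)
    ae.compile(optimizer='adam', loss='mse')  # mse, not binary_crossentropy (see results table)
    ae.fit(X_train, X_train, epochs=epochs, batch_size=batch_size,
           validation_data=(X_val, X_val), verbose=0)
    return ae.evaluate(X_val, X_val, verbose=0)

for dim in (30, 40, 50, 60, 66):
    print(dim, reconstruction_mse(X_train_scaled, X_test_scaled, dim))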

Improvement Ideas

Try more hyperparameter combinations; this can be automated with a loop, for example:

for i in range(1, 100):  # runs the pipeline 99 times, i = 1..99
    # ... main classification pipeline ...
    # adjust a hyperparameter between runs
    learning_rate += 0.1
    ...

Hyperparameters worth sweeping (see the grid-sweep sketch after this list):

  • autoencoder layer structure and the number of retained feature dimensions
  • autoencoder epochs, batch_size, validation_data
  • DNN layer structure
  • DNN epochs, batch_size, learning_rate
  • GSMOTE parameters, e.g. the number of synthetic samples per class
  • activation functions and optimizers
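
A more systematic version of the loop above is a grid sweep over several of these knobs at once. A minimal sketch, assuming a hypothetical run_experiment() wrapper around the whole split -> GSMOTE -> normalize -> AE -> DNN pipeline (no such helper exists in the current code):

from itertools import product

# Candidate values for a few of the hyperparameters listed above
encoding_dims = [40, 50, 60]
learning_rates = [1e-4, 1e-3]
batch_sizes = [128, 256]
thresholds = [0.5, 0.6]

for dim, lr, bs, thr in product(encoding_dims, learning_rates, batch_sizes, thresholds):
    print(f"encoding_dim={dim}, lr={lr}, batch_size={bs}, threshold={thr}")
    # run_experiment is a hypothetical wrapper, not part of the current code base
    run_experiment(encoding_dim=dim, learning_rate=lr, batch_size=bs, odds=thr)

Logging each configuration's metrics to a CSV file makes the runs much easier to compare afterwards.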

Current Experiment Code

autoencoder_model.py
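
The block captured here was a duplicate of the DNN+AE.py listing below rather than the autoencoder module itself, so the original autoencoder_model.py source is not preserved. The following is a minimal reconstruction inferred from how the class is used in DNN+AE.py (Autoencoder(input_dim, encoding_dim), fit(X, epochs, batch_size, validation_data), transform(X)) and from the loss-function notes in the results table (mse, not binary_crossentropy); treat it as a sketch, not the original file.

# autoencoder_model.py -- minimal reconstruction, inferred from usage in DNN+AE.py
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

class Autoencoder:
    def __init__(self, input_dim: int, encoding_dim: int = 40):
        # Encoder compresses input_dim features down to encoding_dim
        inp = Input(shape=(input_dim,))
        encoded = Dense(encoding_dim, activation='relu')(inp)
        decoded = Dense(input_dim, activation='linear')(encoded)
        self.autoencoder = Model(inp, decoded)
        self.encoder = Model(inp, encoded)
        # mse as reconstruction loss; binary_crossentropy breaks the AE (see table)
        self.autoencoder.compile(optimizer='adam', loss='mse')

    def fit(self, X, epochs=20, batch_size=256, validation_data=None):
        # The AE is trained to reproduce its own input
        self.autoencoder.fit(X, X, epochs=epochs, batch_size=batch_size,
                             validation_data=validation_data)

    def transform(self, X):
        # Return the low-dimensional code that feeds the DNN
        return self.encoder.predict(X)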


DNN+AE.py

# DNN+AE.py
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from imbalance_process import split_dataset, balance_training_set
from autoencoder_model import Autoencoder  # autoencoder class used for dimensionality reduction


def normalize_and_map_labels(X_train, X_test, y_train, y_test):
    """
    Standardize the feature matrices and map the labels to binary:
    BENIGN -> 0, every other class -> 1.
    """
    # Keep only the numeric columns
    X_train_numeric = X_train.select_dtypes(include=[np.number])
    X_test_numeric = X_test.select_dtypes(include=[np.number])

    # Fit the scaler on the training set only, then apply it to both sets
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_numeric)
    X_test_scaled = scaler.transform(X_test_numeric)

    # Binary labels: BENIGN = 0, others = 1
    y_train_mapped = (y_train != "BENIGN").astype(int)
    y_test_mapped = (y_test != "BENIGN").astype(int)

    # Wrap the scaled arrays back into DataFrames
    X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train_numeric.columns)
    X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test_numeric.columns)

    print(f"Scaled X_train shape: {X_train_scaled_df.shape}")
    print(f"Scaled X_test shape: {X_test_scaled_df.shape}")

    return X_train_scaled_df, X_test_scaled_df, y_train_mapped, y_test_mapped


def train_dnn(X_train, X_test, y_train, y_test, odds):
    """Train and evaluate the DNN classifier; odds is the decision threshold."""
    input_dim = X_train.shape[1]

    # Build the DNN
    dnn_model = Sequential([
        Dense(128, input_dim=input_dim, activation='relu'),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    dnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Train the DNN. Note: the DNN always trains for 20 epochs; the epochs value
    # swept in __main__ only changes how long the autoencoder trains.
    dnn_model.fit(X_train, y_train, epochs=20, batch_size=256, validation_data=(X_test, y_test))

    # Predict with the given threshold and evaluate
    y_pred = (dnn_model.predict(X_test) > odds).astype(int)

    cm = confusion_matrix(y_test, y_pred)
    TN, FP, FN, TP = cm.ravel()

    print("Confusion matrix:")
    print(cm)
    print(f"TP: {TP}, TN: {TN}, FP: {FP}, FN: {FN}")

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    false_positive_rate = FP / (FP + TN) if (FP + TN) > 0 else 0

    print("Metrics:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 score: {f1:.4f}")
    print(f"False alarm rate: {false_positive_rate:.4f}")


if __name__ == '__main__':
    odds = 0.5
    epochs = 20
    for i in range(1, 6):
        # Step 1: split the dataset
        X_train, X_test, y_train, y_test = split_dataset()

        # Step 2: balance the training set with GSMOTE
        X_train_resampled, y_train_resampled = balance_training_set(X_train, y_train)

        # Step 3: standardize the features and binarize the labels
        X_train_scaled, X_test_scaled, y_train_mapped, y_test_mapped = normalize_and_map_labels(
            X_train_resampled, X_test, y_train_resampled, y_test)

        # Step 4: dimensionality reduction with the autoencoder
        input_dim = X_train_scaled.shape[1]
        ae_model = Autoencoder(input_dim=input_dim, encoding_dim=40)
        ae_model.fit(X_train_scaled, epochs=epochs, batch_size=256,
                     validation_data=(X_test_scaled, X_test_scaled))

        # Encode both sets with the trained AE
        X_train_reduced = ae_model.transform(X_train_scaled)
        X_test_reduced = ae_model.transform(X_test_scaled)

        print(f"Reduced training set shape: {X_train_reduced.shape}")
        print(f"Reduced test set shape: {X_test_reduced.shape}")

        # Step 5: train and evaluate the DNN on the reduced features
        print(f"Run {i}, prediction threshold: {odds}, epochs: {epochs}")
        train_dnn(X_train_reduced, X_test_reduced, y_train_mapped, y_test_mapped, odds)
        odds += 0.1
        epochs += 10

imbalance_process.py

from typing import Tuple

import numpy as np
import pandas as pd
from gsmote import GeometricSMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from preprocess import preprocessing, printline


def split_dataset() -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """Split the dataset into training and testing sets."""
    # Preprocess the dataset and get the label column name
    dataset, last_column_name = preprocessing()

    # Extract features and labels
    X = dataset.drop(columns=[last_column_name])
    y = dataset[last_column_name]

    # Stratified 80/20 split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

    print("Split into training and testing sets at an 8:2 ratio")
    print(f"X_train.shape : {X_train.shape}")
    print(f"y_train.shape : {y_train.shape}")
    print(f"X_test.shape : {X_test.shape}")
    print(f"y_test.shape : {y_test.shape}")
    printline()

    # Label distribution in each split
    print("Training set label distribution:")
    print(y_train.value_counts(normalize=True).to_string())
    print(y_train.value_counts().to_string())
    print("Testing set label distribution:")
    print(y_test.value_counts(normalize=True).to_string())
    print(y_test.value_counts().to_string())
    printline()

    # Optionally save the testing set to a CSV file
    # testing_set = pd.concat([X_test, y_test], axis=1)
    # testing_set.to_csv('Dataset/testing_set.csv', index=False)
    # print("Testing set saved to: Dataset/testing_set.csv")

    return X_train, X_test, y_train, y_test


def print_shape_and_distribution(X: pd.DataFrame, y: pd.Series, prefix: str) -> None:
    """Print the shape and label distribution of a dataset."""
    print(f"{prefix} X shape: {X.shape}")
    print(f"{prefix} y distribution: \n{pd.Series(y).value_counts()}")


def balance_training_set(X_train: pd.DataFrame, y_train: pd.Series) -> Tuple[pd.DataFrame, pd.Series]:
    """Balance the training set with GeometricSMOTE."""
    print_shape_and_distribution(X_train, y_train, "Before balancing")

    target_counts = {
        "DoS GoldenEye": 100000,
        "FTP-Patator": 100000,
        "SSH-Patator": 100000,
        "DoS slowloris": 100000,
        "DoS Slowhttptest": 100000,
        "Bot": 100000,
    }

    # Oversample the minority classes up to the target counts
    gsmote = GeometricSMOTE(random_state=42,
                            k_neighbors=5,
                            selection_strategy='combined',
                            sampling_strategy=target_counts)
    X_resampled, y_resampled = gsmote.fit_resample(X_train, y_train)

    print_shape_and_distribution(X_resampled, y_resampled, "After balancing")

    # Deduplicate column labels, then rebuild a DataFrame with the label column appended
    X_train_columns = pd.Index(X_train.columns).drop_duplicates()
    balanced_df = pd.DataFrame(X_resampled, columns=X_train_columns)
    balanced_df = pd.concat([balanced_df, pd.DataFrame(y_resampled)], axis=1)

    # Optionally save the balanced training set
    # balanced_save_path = os.path.join("Dataset", "balanced_training_set.csv")
    # balanced_df.to_csv(balanced_save_path, index=False)
    # print(f"Balanced training set saved to: {balanced_save_path}")

    return X_resampled, y_resampled


def normalize_data(X_train: pd.DataFrame, X_test: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Standardize the training and testing sets with the same scaler."""
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    return pd.DataFrame(X_train_scaled, columns=X_train.columns), pd.DataFrame(X_test_scaled, columns=X_test.columns)


def load_and_normalize_data(train_file_path: str, test_file_path: str) -> Tuple[
        pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """
    Load training and testing sets from the given CSV paths and standardize them.
    Returns the standardized X_train, X_test and the raw y_train, y_test.
    """
    train_df = pd.read_csv(train_file_path)
    test_df = pd.read_csv(test_file_path)

    # The last column of each file is the label
    y_train = train_df.iloc[:, -1]
    X_train = train_df.iloc[:, :-1]
    y_test = test_df.iloc[:, -1]
    X_test = test_df.iloc[:, :-1]

    X_train_normalized, X_test_normalized = normalize_data(X_train, X_test)
    printline()
    print("Normalization done")

    return X_train_normalized, X_test_normalized, y_train, y_test


def map_labels(X_train_normalized, X_test_normalized, y_train, y_test):
    """Map BENIGN to 0 and every other label to 1."""
    y_train_mapped = y_train.apply(lambda x: 0 if x == "BENIGN" else 1)
    y_test_mapped = y_test.apply(lambda x: 0 if x == "BENIGN" else 1)

    return X_train_normalized, X_test_normalized, y_train_mapped, y_test_mapped


if __name__ == '__main__':
    X_train, X_test, y_train, y_test = split_dataset()
    balance_training_set(X_train, y_train)
    # balance_dataset(X_train=X_train, y_train=y_train)
    # check_GeometricSMOTE_dataset(file_path=r"D:\Python Project\CIC-IDS2017\generate_data\GeometricSMOTE_data.csv")

preprocess.py

import numpy as np
import pandas as pd
from typing import Tuple

def printline() -> None:
    """Print a separator line."""
    print("-" * 50)

def check_missing_values(df: pd.DataFrame) -> None:
    """Check and print missing-value statistics."""
    missing_value_count = df.isna().sum()
    total_cells = np.prod(df.shape)  # np.product is deprecated in recent NumPy
    total_missing = missing_value_count.sum()
    print(f"{total_missing} NaN values in the dataset")
    print(f"Missing rate: {total_missing / total_cells * 100:.2f}%")
    printline()

def preprocessing() -> Tuple[pd.DataFrame, str]:
    """Preprocess the dataset and return the cleaned DataFrame and label column name."""
    # Load and merge the CSV files
    csv_files = [
        'MachineLearningCVE/Monday-WorkingHours.pcap_ISCX.csv',
        'MachineLearningCVE/Tuesday-WorkingHours.pcap_ISCX.csv',
        'MachineLearningCVE/Wednesday-workingHours.pcap_ISCX.csv',
        'MachineLearningCVE/Friday-WorkingHours-Morning.pcap_ISCX.csv',
        'MachineLearningCVE/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv',
        'MachineLearningCVE/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv'
    ]

    df_list = [pd.read_csv(file) for file in csv_files]
    df = pd.concat(df_list)

    print(f"Merged dataset shape: {df.shape}")
    printline()

    check_missing_values(df)

    # Replace inf values with NaN
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    print("Replaced Infinity and -Infinity with NaN")
    printline()

    # Replace NaN with the sentinel -999; those rows are dropped below
    df.fillna(-999, inplace=True)
    print("Replaced all NaN with -999")
    printline()

    # Drop columns that are 99% or more zeros
    zero_percentage = (df == 0).mean() * 100
    df.drop(columns=zero_percentage[zero_percentage >= 99].index, inplace=True)
    print(f"Remaining columns: {df.shape[1]}")
    printline()

    # Drop rows containing the -999 sentinel
    before_drop = df.shape[0]
    df = df[df.ne(-999).all(axis=1)]
    print(f"Rows before dropping: {before_drop}")
    print(f"Rows after dropping: {df.shape[0]}")
    print(f"Dropped {before_drop - df.shape[0]} rows in total")
    printline()

    # The last column is the label column
    last_column_name = df.columns[-1]

    # Count and print the label distribution
    label_counts = df[last_column_name].value_counts()
    total_samples = len(df)
    print(f"{total_samples} rows after preprocessing")
    printline()

    for label, count in label_counts.items():
        percentage = (count / total_samples) * 100
        print(f"{label:20} ---> {count:10} share = {percentage:.8f}%")

    printline()

    return df, last_column_name

if __name__ == "__main__":
    dataset, label_column = preprocessing()