Dataset Introduction

Data collection for the CIC-IDS2017 dataset ran for five days, starting at 9 a.m. on Monday, July 3, 2017, and ending at 5 p.m. on Friday, July 7, 2017. Monday contains only benign traffic. The attacks implemented in the dataset include brute-force FTP, brute-force SSH, DoS, Heartbleed, web attacks, infiltration, botnet, and DDoS. They were executed in the mornings and afternoons of Tuesday, Wednesday, Thursday, and Friday.

  • Monday: benign traffic only
  • Tuesday: attacks + normal activity
    • Morning: FTP-Patator
    • Afternoon: SSH-Patator
  • Wednesday: attacks + normal activity
    • DoS / DDoS, Heartbleed
  • Thursday: attacks + normal activity
    • Morning: Web Attack (Brute Force, XSS, Sql Injection)
    • Afternoon: Infiltration (Cool disk, Dropbox download)
  • Friday: attacks + normal activity
    • Morning: Botnet (ARES)
    • Afternoon: Port Scan, DDoS LOIT

The subsequent experiments use only the Monday, Tuesday, Wednesday, and Friday data and leave Thursday out: Thursday has many attack types with only a handful of samples each, which makes it cumbersome to work with.

Features

The dataset has 79 features in total; the last column, Label, is the class label, where BENIGN denotes normal traffic and every other value denotes anomalous traffic. All feature values are numeric (except the Label column), contain large numbers of zeros as well as negative values, and span very different ranges, so the data has to be normalized.
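As a quick sanity check, here is a minimal sketch (assuming the MachineLearningCVE folder layout used later in this post) that loads one day's CSV and confirms that all columns except Label are numeric and that zeros and negative values are common:

import pandas as pd

# Quick inspection of a single day's file (path assumed from the preprocessing code below)
df = pd.read_csv('MachineLearningCVE/Monday-WorkingHours.pcap_ISCX.csv')

print(df.shape)                           # rows x 79 columns
print(df.dtypes.value_counts())           # everything numeric except the Label column
print(df[df.columns[-1]].value_counts())  # label distribution for this file
print((df.select_dtypes('number') == 0).mean().sort_values(ascending=False).head())  # share of zeros per column
print((df.select_dtypes('number') < 0).mean().sort_values(ascending=False).head())   # share of negative values per column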

Attack Type Distribution

Type Count Percentage (%) Week Days
BENIGN 2273097 80.300366 Monday to Friday
DoS Hulk 231073 8.162981 Wednesday
PortScan 158930 5.614427 Friday
DDoS 128027 4.522735 Friday
DoS GoldenEye 10293 0.363615 Wednesday
FTP-Patator 7938 0.280421 Tuesday
SSH-Patator 5897 0.208320 Tuesday
DoS slowloris 5796 0.204752 Wednesday
DoS Slowhttptest 5499 0.194260 Wednesday
Bot 1966 0.069452 Friday
Web Attack – Brute Force 1507 0.053237 Thursday
Web Attack – XSS 652 0.023033 Thursday
Infiltration 36 0.001272 Thursday
Web Attack – Sql Injection 21 0.000742 Thursday
Heartbleed 11 0.000389 Wednesday

CIC-IDS2017 label values

Code for checking the distribution of the Label column

import pandas as pd
import os

# Folder containing the CIC-IDS2017 CSV files
folder_path = r'D:\Python Project\CIC-IDS2017\MachineLearningCVE'

# Collect all CSV files
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Read every CSV file and record which weekday each file belongs to
df_list = []
week_day_dict = {}

for file in csv_files:
    df = pd.read_csv(os.path.join(folder_path, file))
    df['filename'] = file  # add a filename column to each DataFrame
    df_list.append(df)

    # Extract the weekday from the file name
    if 'Monday' in file:
        week_day = 'Monday'
    elif 'Tuesday' in file:
        week_day = 'Tuesday'
    elif 'Wednesday' in file:
        week_day = 'Wednesday'
    elif 'Thursday' in file:
        week_day = 'Thursday'
    elif 'Friday' in file:
        week_day = 'Friday'
    elif 'Saturday' in file:
        week_day = 'Saturday'
    elif 'Sunday' in file:
        week_day = 'Sunday'
    else:
        week_day = 'Unknown'

    # Record the weekday for this file
    week_day_dict[file] = week_day

# Concatenate everything into one large DataFrame
big_df = pd.concat(df_list, ignore_index=True)

# The last column is now 'filename', so the label column is the second-to-last one
label_column = big_df.columns[-2]


# Map each label to the weekdays it appears on
def get_week_days_for_label(df, label_column, week_day_dict):
    label_week_days = {}

    for label in df[label_column].unique():
        if label == 'BENIGN':
            continue  # BENIGN does not need an entry
        week_days = df[df[label_column] == label]['filename'].apply(lambda x: week_day_dict[x]).unique()
        label_week_days[label] = ', '.join(week_days)

    return label_week_days


# Weekdays on which each label occurs
label_week_days = get_week_days_for_label(big_df, label_column, week_day_dict)

# Count and percentage of each label
label_counts = big_df[label_column].value_counts()
label_percentages = big_df[label_column].value_counts(normalize=True) * 100

# Assemble the result table
result = pd.DataFrame({
    'Count': label_counts,
    'Percentage': label_percentages,
    'Week Days': label_counts.index.map(label_week_days).fillna('BENIGN')  # weekday(s) on which each label appears
})

# Print the result
print(result)

Data Preprocessing

The preprocessing pipeline is as follows

  • Data loading: load the Monday, Tuesday, Wednesday, and Friday sub-datasets and merge them into one large DataFrame
  • Missing-value check and replacement
    • Check each column for missing values (NaN) and compute the overall proportion of missing values in the dataset
    • Find positive and negative infinity values and replace them with NaN
  • Fill missing values: replace every NaN with -999 so the affected rows can be dropped later (not with -1, because some genuine values in the dataset are -1 and those rows must not be removed by mistake).
  • Drop redundant columns
    • Drop columns in which 99% or more of the values are 0.
    • Drop columns in which more than 30% of the values are missing or infinite (i.e. filled with -999).
  • Drop rows with missing or infinite values: remove every row that contains -999.
  • Label distribution: count and print the number and proportion of each class in the last (label) column.
  • Normalization: normalize the entire DataFrame so it is ready to be split into training and test sets

Rows containing NaN and Infinity values

Order of Normalization and SMOTE

Do not normalize first and then apply SMOTE to balance the dataset.

For classification tasks, the usual order of normalization and SMOTE is:

  1. SMOTE first: oversample the training set with SMOTE to balance the classes. The synthetic samples are then generated from the original feature-value distribution.
  2. Normalize afterwards: once SMOTE is done, normalize the data. This ensures that all samples (original and synthetic) end up in the same normalized range.

If you normalize first and then apply SMOTE, the synthetic samples may not follow the normalized data distribution exactly, which leads to an inconsistent representation. It is therefore recommended to balance the samples with SMOTE first and then normalize; a minimal sketch of this order is given below.
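A minimal sketch of that order (hypothetical names; imbalanced-learn's plain SMOTE and scikit-learn's MinMaxScaler stand in here for the G-SMOTE and normalization steps used later in this post):

from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# X, y: preprocessed features and labels (assumed to exist already)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 1) Oversample the training set first, on the raw feature values
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

# 2) Normalize afterwards: fit the scaler on the balanced training set, then transform both sets
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_bal)
X_test_scaled = scaler.transform(X_test)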

Return values of def preprocessing()

return dataset, last_column_name

  • dataset: the preprocessed dataset, already normalized but not yet split into training and test sets
  • last_column_name: the name of the last column, Label; this return value is optional

Preprocessing code

import numpy as np
import pandas as pd

# Print a separator line
def printline():
    print("--------------------------------------------------\n")

def preprocessing():
    # Load the data
    df_monday = pd.read_csv('MachineLearningCVE/Monday-WorkingHours.pcap_ISCX.csv')
    df_tuesday = pd.read_csv('MachineLearningCVE/Tuesday-WorkingHours.pcap_ISCX.csv')
    df_wednesday = pd.read_csv('MachineLearningCVE/Wednesday-workingHours.pcap_ISCX.csv')
    df_friday_morning = pd.read_csv('MachineLearningCVE/Friday-WorkingHours-Morning.pcap_ISCX.csv')
    df_friday_afternoon_ddos = pd.read_csv('MachineLearningCVE/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv')
    df_friday_afternoon_portscan = pd.read_csv('MachineLearningCVE/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv')

    # Merge the sub-datasets
    data = [df_monday, df_tuesday, df_wednesday, df_friday_morning, df_friday_afternoon_ddos, df_friday_afternoon_portscan]
    df = pd.concat(data)

    # Shape of the merged dataset
    print(f"Merged dataset shape: {df.shape}")
    printline()

    # Check for missing values
    missing_value_count = df.isna().sum()
    total_cells = np.prod(df.shape)  # np.product is deprecated, use np.prod
    total_missing = missing_value_count.sum()
    print(f"The dataset has {total_cells} cells, {total_missing} of which are NaN")
    print(f"Missing rate: {total_missing / total_cells * 100}%")
    printline()

    # Replace positive and negative infinity with NaN
    df_clean = df.replace([np.inf, -np.inf], np.nan)
    print("Replaced Infinity and -Infinity with NaN")
    printline()

    # Replace all NaN with -999
    df_clean = df_clean.replace(np.nan, -999)
    print("Replaced all NaN with -999")
    printline()

    # Drop columns in which 99% or more of the values are 0
    print("Dropping columns with >= 99% zeros")
    for column in df_clean.columns:
        count = (df_clean[column] == 0).sum()
        percent_of_zeros = (count / df_clean.shape[0]) * 100
        if percent_of_zeros >= 99.0:
            print(f"Dropping column: {column}, zero percentage {percent_of_zeros}%")
            df_clean.drop(column, inplace=True, axis=1)

    print(f"Columns remaining after dropping: {df_clean.shape[1]}")
    printline()

    # Drop rows containing -999
    print("Dropping rows containing -999")
    print(f"Rows before dropping: {df_clean.shape[0]}")
    before_drop = df_clean.shape[0]
    df_clean = df_clean[(df_clean != -999).all(axis=1)]
    print(f"Rows after dropping: {df_clean.shape[0]}")
    print(f"Dropped {before_drop - df_clean.shape[0]} rows in total")
    printline()

    # Name of the last column (the label column)
    last_column_name = df_clean.columns[-1]

    # Count and proportion of each class in the label column
    label_counts = df_clean[last_column_name].value_counts()
    total_samples = len(df_clean)
    print(f"The preprocessed dataset contains {total_samples} rows")
    printline()

    # Print the label distribution
    for label, count in label_counts.items():
        percentage = (count / total_samples) * 100
        print(f"{label:28} ---> {count:10} proportion = {percentage:.8f}%")

    printline()

    # Normalize the data
    # dataset = min_max_normalize(df_clean)
    # print("Normalization finished")
    # printline()
    # Save the preprocessed data locally
    # dataset.to_csv('MachineLearningCVE/Dataset.csv', index=False)
    return df_clean, last_column_name

# Min-max normalization
def min_max_normalize(df):
    df_normalized = df.copy()
    for column in df_normalized.columns:
        if df_normalized[column].dtype != 'object':  # skip non-numeric columns
            min_val = df_normalized[column].min()
            max_val = df_normalized[column].max()
            if max_val != min_val:
                df_normalized[column] = (df_normalized[column] - min_val) / (max_val - min_val)
    return df_normalized

if __name__ == "__main__":
    dataset_normalized, last_column_name = preprocessing()
    print(f"The last column (label column) is: {last_column_name}")

Splitting into Training and Test Sets

The dataset is split into training and test sets at a ratio of 80:20. The result of the split (before balancing) is as follows; a minimal sketch of the split itself is given after the list.

  • X_train.shape : (1895400, 66) training features, 1895400 samples, 66 feature variables
  • y_train.shape : (1895400,) training labels, 1895400 samples, 1 label column
  • X_test.shape : (473850, 66) test features, 473850 samples, 66 feature variables
  • y_test.shape : (473850,) test labels, 473850 samples, 1 label column
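split_dataset() is imported from imbalance_process in the scripts below, but its body is not shown in this post. A minimal sketch of what it might look like, assuming it builds on preprocessing() and performs a stratified 80:20 split (which would explain the near-identical class proportions in the two tables below):

from sklearn.model_selection import train_test_split

from data_preprocessing import preprocessing


def split_dataset():
    # Load the preprocessed (merged, cleaned) dataset and the name of the label column
    dataset, last_column_name = preprocessing()

    # Separate features and labels
    X = dataset.drop(columns=[last_column_name])
    y = dataset[last_column_name]

    # Stratified 80:20 split so both sets keep the original class proportions
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    return X_train, X_test, y_train, y_test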

Training set

Note: because the Thursday data is not used, there are only 11 classes here instead of the original 15.

Type Count Proportion
BENIGN 1451928 0.766027
DoS Hulk 184099 0.097129
PortScan 127043 0.067027
DDoS 102420 0.054036
DoS GoldenEye 8234 0.004344
FTP-Patator 6348 0.003349
SSH-Patator 4718 0.002489
DoS slowloris 4637 0.002446
DoS Slowhttptest 4399 0.002321
Bot 1565 0.000826
Heartbleed 9 0.000005
Total attack samples 443472 0.233972

Test set

Type Count Proportion
BENIGN 362982 0.766027
DoS Hulk 46025 0.097130
PortScan 31761 0.067028
DDoS 25605 0.054036
DoS GoldenEye 2059 0.004345
FTP-Patator 1587 0.003349
SSH-Patator 1179 0.002488
DoS slowloris 1159 0.002446
DoS Slowhttptest 1100 0.002321
Bot 391 0.000825
Heartbleed 2 0.000004
Total attack samples 110868 0.233972

Balancing the Training Set

Code for balancing the dataset with G-SMOTE

import pandas as pd
from gsmote import GeometricSMOTE  # GeometricSMOTE from the geometric-smote package (assumed import)


def balance_training_set(X_train, y_train):
    """
    Balance the training set so that the classes have roughly comparable sample counts.

    Parameters:
        X_train (pandas.DataFrame): training features.
        y_train (pandas.Series): training labels.

    Returns:
        str: path of the saved balanced dataset.
    """
    # Check the shapes of the incoming X_train and y_train
    print(f"X_train shape before balancing: {X_train.shape}")
    print(f"Class distribution before balancing: \n{pd.Series(y_train).value_counts()}")

    # Target counts per class; adjust sampling_strategy as needed
    target_counts = {
        "DoS GoldenEye": 100000,
        "FTP-Patator": 100000,
        "SSH-Patator": 100000,
        "DoS slowloris": 100000,
        "DoS Slowhttptest": 80000,
        "Bot": 80000,
        "Heartbleed": 10000,
    }

    # Initialize GeometricSMOTE
    gsmote = GeometricSMOTE(random_state=42,
                            k_neighbors=5,
                            selection_strategy='combined',
                            sampling_strategy=target_counts)

    # Balance the training set with GeometricSMOTE
    X_resampled, y_resampled = gsmote.fit_resample(X_train, y_train)

    # Print the distribution after balancing
    print(f"X_train shape after balancing: {X_resampled.shape}")
    print(f"Class distribution after balancing: \n{pd.Series(y_resampled).value_counts()}")

    # Combine the resampled data into a new DataFrame
    df_resampled = pd.DataFrame(X_resampled, columns=X_train.columns)
    df_resampled['attack_cat'] = y_resampled

    # Shuffle the dataset
    df_resampled = df_resampled.sample(frac=1, random_state=42).reset_index(drop=True)

    print(f"df_resampled shape: {df_resampled.shape}")
    print(f"{df_resampled['attack_cat'].value_counts()}")

    # Save the resampled dataset to a CSV file
    balanced_dataset_path = r'.\balanced_training_set.csv'
    df_resampled.to_csv(balanced_dataset_path, index=False)
    return balanced_dataset_path
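For reference, a hypothetical usage sketch; it assumes split_dataset() returns the unbalanced 80:20 split and that the label column written by balance_training_set is named attack_cat, as in the code above:

import pandas as pd

from imbalance_process import split_dataset, balance_training_set  # assumed module, as imported later

# Hypothetical usage of balance_training_set()
X_train, X_test, y_train, y_test = split_dataset()       # unbalanced 80:20 split
balanced_path = balance_training_set(X_train, y_train)   # G-SMOTE applied to the training set only

# Reload the balanced training set and separate features from the label column
df_balanced = pd.read_csv(balanced_path)
y_train_balanced = df_balanced['attack_cat']
X_train_balanced = df_balanced.drop(columns=['attack_cat'])

print(y_train_balanced.value_counts())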

Model Evaluation Metrics

The metrics most commonly used to evaluate the performance of an intrusion detection system are:

  • Accuracy
  • Precision
  • Recall
  • F-Measure
  • FAR/FPR

    Common performance evaluation metrics
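For reference, a minimal sketch of how these metrics can be computed from a binary confusion matrix; FAR is taken here as FP / (FP + TN), which sklearn does not expose directly, while the other metrics could equally come from sklearn.metrics:

import numpy as np
from sklearn.metrics import confusion_matrix

def binary_metrics(y_true, y_pred):
    # Confusion matrix layout for labels [0, 1]:
    # [[TN, FP],
    #  [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0          # detection rate
    f1        = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    far       = fp / (fp + tn) if (fp + tn) else 0.0          # false alarm rate (FPR)

    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1": f1, "FAR": far}

# Example with the confusion matrix from the first AE+DNN run below
print(binary_metrics(
    np.array([0] * 362982 + [1] * 110868),
    np.array([0] * 358699 + [1] * 4283 + [0] * 1320 + [1] * 109548)))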

Classification with AE + DNN

The code is as follows

import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

from data_preprocessing import preprocessing, printline
from imbalance_process import split_dataset, balance_training_set
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense


def normalize_and_map_labels(X_train, X_test, y_train, y_test):
    """
    Standardize the datasets and map the labels to binary labels:
    BENIGN -> 0, all other classes -> 1.
    """
    # Drop any non-numeric columns from X_train and X_test
    X_train_numeric = X_train.select_dtypes(include=[np.number])
    X_test_numeric = X_test.select_dtypes(include=[np.number])

    # Initialize the StandardScaler
    scaler = StandardScaler()

    # Fit on the training set, then transform both sets
    X_train_scaled = scaler.fit_transform(X_train_numeric)
    X_test_scaled = scaler.transform(X_test_numeric)

    # Binary labels: BENIGN = 0, others = 1
    y_train_mapped = (y_train != "BENIGN").astype(int)
    y_test_mapped = (y_test != "BENIGN").astype(int)

    # Wrap the scaled arrays back into DataFrames
    X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train_numeric.columns)
    X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test_numeric.columns)

    # Sanity-check the processed data
    print(f"X_train shape after scaling: {X_train_scaled_df.shape}")
    print(f"X_test shape after scaling: {X_test_scaled_df.shape}")

    return X_train_scaled_df, X_test_scaled_df, y_train_mapped, y_test_mapped


def build_autoencoder(input_dim, encoding_dim):
    """
    Build and return a simple autoencoder model.
    """
    input_layer = Input(shape=(input_dim,))
    encoder = Dense(encoding_dim, activation="relu")(input_layer)
    decoder = Dense(input_dim, activation="sigmoid")(encoder)
    autoencoder = Model(inputs=input_layer, outputs=decoder)
    autoencoder.compile(optimizer='adam', loss='mse')
    return autoencoder


def train_ae_dnn(X_train, X_test, y_train, y_test, encoding_dim=30):
    """
    Reduce the dimensionality with an autoencoder, then classify with a simple DNN.
    """
    input_dim = X_train.shape[1]

    # Build the autoencoder
    autoencoder = build_autoencoder(input_dim, encoding_dim)

    # Train the autoencoder
    autoencoder.fit(X_train, X_train, epochs=50, batch_size=256, shuffle=True, validation_data=(X_test, X_test))

    # Encode the data with the trained encoder
    encoder_model = Model(inputs=autoencoder.input, outputs=autoencoder.layers[1].output)
    X_train_encoded = encoder_model.predict(X_train)
    X_test_encoded = encoder_model.predict(X_test)

    # Build the DNN model
    dnn_model = Sequential([
        Dense(128, input_dim=encoding_dim, activation='relu'),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    dnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Train the DNN model
    dnn_model.fit(X_train_encoded, y_train, epochs=80, batch_size=256, validation_data=(X_test_encoded, y_test))

    # Predict and evaluate (decision threshold 0.7)
    y_pred = (dnn_model.predict(X_test_encoded) > 0.7).astype(int)

    # Confusion matrix and evaluation metrics
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))


if __name__ == '__main__':
    X_train, X_test, y_train, y_test = split_dataset()

    # Step 2: balance the training set
    X_train_resampled, X_test, y_train_resampled, y_test = balance_training_set(X_train, X_test, y_train, y_test)

    # Step 3: scale the features and map the labels
    X_train_scaled, X_test_scaled, y_train_mapped, y_test_mapped = normalize_and_map_labels(X_train_resampled, X_test,
                                                                                            y_train_resampled, y_test)

    # Step 4: train and evaluate with AE + DNN
    train_ae_dnn(X_train_scaled, X_test_scaled, y_train_mapped, y_test_mapped)

First run

Configuration:

Results:

Confusion Matrix:
[[358699 4283]
[ 1320 109548]]
Accuracy: 0.9881755829903978
Precision: 0.962374045734466
Recall: 0.9880939495616409
F1 Score: 0.9750644195123254

First experiment results

Second run

Confusion Matrix:
[[358016 4966]
[ 1272 109596]]
Accuracy: 0.9868354964651261
Precision: 0.9566522930814755
Recall: 0.9885268968503085
F1 Score: 0.9723284389832764

Second experiment results

Third run

Configuration:

def build_autoencoder(input_dim, encoding_dim):
    """
    Build and return a simple autoencoder model.
    """
    input_layer = Input(shape=(input_dim,))
    encoder = Dense(encoding_dim, activation="relu")(input_layer)
    decoder = Dense(input_dim, activation="sigmoid")(encoder)
    autoencoder = Model(inputs=input_layer, outputs=decoder)
    autoencoder.compile(optimizer='adam', loss='mse')
    return autoencoder


def train_ae_dnn(X_train, X_test, y_train, y_test, encoding_dim=30):
    """
    Reduce the dimensionality with an autoencoder, then classify with a simple DNN.
    """
    input_dim = X_train.shape[1]

    # Build the autoencoder
    autoencoder = build_autoencoder(input_dim, encoding_dim)

    # Train the autoencoder (30 epochs in this run)
    autoencoder.fit(X_train, X_train, epochs=30, batch_size=256, shuffle=True, validation_data=(X_test, X_test))

    # Encode the data with the trained encoder
    encoder_model = Model(inputs=autoencoder.input, outputs=autoencoder.layers[1].output)
    X_train_encoded = encoder_model.predict(X_train)
    X_test_encoded = encoder_model.predict(X_test)

    # Build the DNN model
    dnn_model = Sequential([
        Dense(128, input_dim=encoding_dim, activation='relu'),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    dnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Train the DNN model (50 epochs in this run)
    dnn_model.fit(X_train_encoded, y_train, epochs=50, batch_size=256, validation_data=(X_test_encoded, y_test))

    # Predict and evaluate (decision threshold 0.8 in this run)
    y_pred = (dnn_model.predict(X_test_encoded) > 0.8).astype(int)

    # Confusion matrix and evaluation metrics
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))

Results:

Confusion Matrix:
[[360425 2557]
[ 6934 103934]]
Accuracy: 0.9799704547852696
Precision: 0.9759885811946549
Recall: 0.9374571562578923
F1 Score: 0.9563349113678291

Third run

DNN Only

The code is as follows

import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

from data_preprocessing import preprocessing, printline
from imbalance_process import split_dataset, balance_training_set
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


def normalize_and_map_labels(X_train, X_test, y_train, y_test):
    """
    Standardize the datasets and map the labels to binary labels:
    BENIGN -> 0, all other classes -> 1.
    """
    # Drop any non-numeric columns from X_train and X_test
    X_train_numeric = X_train.select_dtypes(include=[np.number])
    X_test_numeric = X_test.select_dtypes(include=[np.number])

    # Initialize the StandardScaler
    scaler = StandardScaler()

    # Fit on the training set, then transform both sets
    X_train_scaled = scaler.fit_transform(X_train_numeric)
    X_test_scaled = scaler.transform(X_test_numeric)

    # Binary labels: BENIGN = 0, others = 1
    y_train_mapped = (y_train != "BENIGN").astype(int)
    y_test_mapped = (y_test != "BENIGN").astype(int)

    # Wrap the scaled arrays back into DataFrames
    X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train_numeric.columns)
    X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test_numeric.columns)

    # Sanity-check the processed data
    print(f"X_train shape after scaling: {X_train_scaled_df.shape}")
    print(f"X_test shape after scaling: {X_test_scaled_df.shape}")

    return X_train_scaled_df, X_test_scaled_df, y_train_mapped, y_test_mapped


def train_dnn(X_train, X_test, y_train, y_test):
    """
    Classify directly with a DNN (no autoencoder).
    """
    input_dim = X_train.shape[1]

    # Build the DNN model
    dnn_model = Sequential([
        Dense(128, input_dim=input_dim, activation='relu'),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    dnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Train the DNN model
    dnn_model.fit(X_train, y_train, epochs=50, batch_size=256, validation_data=(X_test, y_test))

    # Predict and evaluate (decision threshold 0.5)
    y_pred = (dnn_model.predict(X_test) > 0.5).astype(int)

    # Confusion matrix and evaluation metrics
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))


if __name__ == '__main__':
    X_train, X_test, y_train, y_test = split_dataset()

    # Step 2: balance the training set
    X_train_resampled, X_test, y_train_resampled, y_test = balance_training_set(X_train, X_test, y_train, y_test)

    # Step 3: scale the features and map the labels
    X_train_scaled, X_test_scaled, y_train_mapped, y_test_mapped = normalize_and_map_labels(X_train_resampled, X_test,
                                                                                            y_train_resampled, y_test)

    # Step 4: train and evaluate directly with the DNN
    train_dnn(X_train_scaled, X_test_scaled, y_train_mapped, y_test_mapped)

Skip the AE: in this setup the autoencoder is a net negative. A plain DNN already reaches about 99%, while adding the AE drops it to about 97%. The next step is to see what else could be layered on instead.

DNN only

DNN only (2)

Why does nothing more than data preprocessing and balancing the dataset already give such an extreme recognition rate?

Unlike UNSW-NB15, this dataset does not contain both an attack-type column and a separate 0/1 label column. The near-100% accuracy obtained earlier on UNSW-NB15 after splitting the data was caused by data leakage: the attack_cat column had not been removed while label was the target, so the model could read 0 or 1 straight from attack_cat. CIC-IDS2017 has no such issue, so why is the recognition rate still this extreme? Is it related to balancing the dataset? Would the same result still be reached without G-SMOTE balancing? A quick ablation sketch follows.
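One quick, hypothetical ablation to answer the last question: skip the G-SMOTE step and train the same DNN on the raw 80:20 split, then compare the metrics with the balanced run (reusing the functions from the DNN-only script above):

# Hypothetical ablation: same pipeline as train_dnn above, but without G-SMOTE balancing
X_train, X_test, y_train, y_test = split_dataset()

# Scale the features and map labels to 0/1 directly on the unbalanced split
X_train_scaled, X_test_scaled, y_train_mapped, y_test_mapped = normalize_and_map_labels(
    X_train, X_test, y_train, y_test)

# Train and evaluate the plain DNN; compare these metrics against the balanced run
train_dnn(X_train_scaled, X_test_scaled, y_train_mapped, y_test_mapped)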

Multi-class Classification

Multi-class, first run

The multi-class results are also very good.

Multi-class results

Multi-class confusion matrix 1

Multi-class, second run

Multi-class results 2

Multi-class confusion matrix 2

Per-class Classification Results

  • BENIGN
    Accuracy: 0.9894
    Precision: 0.9989
    Recall: 0.9904
    F1 score: 0.9947
    FAR: 0.0034

  • Bot
    Accuracy: 0.2160
    Precision: 0.2172
    Recall: 0.9744
    F1 score: 0.3552
    FAR: 0.0029

  • DDoS
    Accuracy: 0.9597
    Precision: 0.9618
    Recall: 0.9977
    F1 score: 0.9794
    FAR: 0.0023

  • DoS GoldenEye
    Accuracy: 0.9518
    Precision: 0.9631
    Recall: 0.9879
    F1 score: 0.9753
    FAR: 0.0002

  • DoS Hulk
    Accuracy: 0.9715
    Precision: 0.9807
    Recall: 0.9904
    F1 score: 0.9855
    FAR: 0.0021

  • DoS Slowhttptest
    Accuracy: 0.8507
    Precision: 0.8665
    Recall: 0.9791
    F1 score: 0.9193
    FAR: 0.0004

  • DoS slowloris
    Accuracy: 0.8741
    Precision: 0.8878
    Recall: 0.9827
    F1 score: 0.9328
    FAR: 0.0003

  • FTP-Patator
    Accuracy: 0.9862
    Precision: 0.9931
    Recall: 0.9931
    F1 score: 0.9931
    FAR: 0.0000

  • Heartbleed
    Accuracy: 1.0000
    Precision: 1.0000
    Recall: 1.0000
    F1 score: 1.0000
    FAR: 0.0000

  • PortScan
    Accuracy: 0.9965
    Precision: 0.9993
    Recall: 0.9972
    F1 score: 0.9983
    FAR: 0.0000

  • SSH-Patator
    Accuracy: 0.9165
    Precision: 0.9275
    Recall: 0.9873
    F1 score: 0.9565
    FAR: 0.0002

  • Overall accuracy: 0.9912
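For reference, a minimal sketch of how the per-class Precision, Recall, F1, and FAR above can be derived one-vs-rest from a multi-class confusion matrix (hypothetical helper; the per-class Accuracy in the list is not reproduced here because its exact definition is not given):

import numpy as np

def per_class_metrics(cm, class_names):
    """One-vs-rest metrics for every class of a multi-class confusion matrix.

    cm[i, j] = number of samples of true class i predicted as class j.
    """
    cm = np.asarray(cm)
    total = cm.sum()
    results = {}
    for i, name in enumerate(class_names):
        tp = cm[i, i]
        fn = cm[i, :].sum() - tp
        fp = cm[:, i].sum() - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        far = fp / (fp + tn) if (fp + tn) else 0.0  # one-vs-rest false alarm rate
        results[name] = {"Precision": precision, "Recall": recall, "F1": f1, "FAR": far}
    # Overall accuracy is the trace of the confusion matrix over the total sample count
    results["Overall accuracy"] = np.trace(cm) / total
    return results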

Multi-class confusion matrix 3