In 2015, Microsoft provided the data science community with an unprecedented malware dataset, encouraging open-source progress on effective techniques for grouping malware variants into their respective families. Run as a Kaggle competition, it featured a very large (for the time) dataset of almost 40GB of compressed files containing disarmed malware samples and their corresponding disassembled ASM code.
At the time, my submitted solution used only a dozen heuristic features and a simple Random Forest as the model, enough to secure a top-30 score (since it was my first Kaggle competition I ended up overfitting the public set and dropped 29 more positions, but that is a different story :) ). Revisiting publicly available feature sets from top solutions (e.g. top-10 entries with almost perfect scores), I was curious to see which types of features they relied on the most.
To find out, I quickly trained a LightGBM model on one of these feature sets and plotted the feature importance:
index | feature | importance
---|---|---
293 | section_names_header | 1707 |
1367 | Offset.1 | 839 |
430 | VirtualAlloc | 594 |
284 | Entropy | 504 |
37 | DllEntryPoint | 483 |
81 | misc1_assume | 461 |
126 | ent_q_diff_diffs_12 | 427 |
189 | ent_q_diff_diffs_1_median | 426 |
1371 | dc_por | 414 |
1393 | string_len_counts_2 | 414 |
1272 | regs_esp | 400 |
22 | byte | 387 |
263 | ent_p_19 | 387 |
0 | Virtual | 386 |
287 | section_names_.edata | 377 |
237 | ent_q_diff_block_3_19 | 350 |
148 | ent_q_diff_block_0_8 | 342 |
991 | TB_00 | 342 |
1377 | db3_rdata | 339 |
19 | DATA | 339 |
107 | ent_q_diffs_19 | 318 |
1398 | string_len_counts_7 | 314 |
45 | void | 301 |
1387 | db3_NdNt | 297 |
1258 | regs_bh | 285 |
23 | word | 283 |
112 | ent_q_diffs_max | 282 |
290 | section_names_.rsrc | 280 |
1304 | asm_commands_jnb | 277 |
1296 | asm_commands_in | 274 |
165 | ent_q_diff_diffs_0_min | 244 |
1374 | dd_text | 239 |
1366 | FileSize | 235 |
135 | ent_q_diff_diffs_mean | 228 |
1196 | TB_cd | 222 |
1246 | TB_ff | 219 |
300 | Unknown_Sections_lines_por | 215 |
295 | Unknown_Sections | 213 |
426 | GetProcAddress | 211 |
1686 | contdll | 208 |
1331 | asm_commands_std | 208 |
1257 | regs_ax | 203 |
163 | ent_q_diff_diffs_0_median | 197 |
1381 | dd5 | 188 |
2 | loc | 181 |
79 | misc_visualc | 181 |
1322 | asm_commands_ror | 179 |
468 | FindFirstFileA | 177 |
5 | var | 174 |
59 | entry | 173 |
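A ranking like the one above can be produced with any model exposing per-feature importances. Below is a minimal sketch using scikit-learn's RandomForestClassifier in place of LightGBM; the data and feature names are synthetic placeholders, not the competition features.

```python
# Sketch: rank features by model importance, mirroring the table above.
# Synthetic stand-in data; a Random Forest replaces the LightGBM model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["section_names_header", "Entropy", "FileSize", "noise"]

# Synthetic dataset: only the first feature is informative, the rest are noise.
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Sort features by importance, highest first.
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, imp in ranked:
    print(f"{name:>22} | {imp:.3f}")
```

With LightGBM the equivalent would be `lightgbm.plot_importance(model)` on the trained booster.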
Going over the top features, it is clear that they would not be particularly resistant to adversarial modifications: adding fake imports, renaming section names, or appending extra padding bytes would likely upset the model's predictions and would be very easy to perform. Entropy-based features (e.g. ent_q_diff_diffs_12) should in theory be more robust, though this depends on which portions of the executable they are computed over.
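To make the entropy-based features concrete, here is an illustrative sketch (not the competition's exact feature code): Shannon entropy computed over fixed-size byte blocks, the kind of signal behind features like ent_q_diff_*. The block size and sample bytes are arbitrary choices for the example.

```python
# Sketch: per-block Shannon entropy of a byte sequence.
import math
from collections import Counter

def block_entropy(data: bytes, block_size: int = 256) -> list[float]:
    """Entropy in bits per byte for each consecutive block of `data`."""
    entropies = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        counts = Counter(block)
        n = len(block)
        entropies.append(-sum(c / n * math.log2(c / n)
                              for c in counts.values()))
    return entropies

low = bytes(512)               # constant padding: entropy 0 bits/byte
high = bytes(range(256)) * 2   # uniform byte distribution: entropy 8 bits/byte
print(block_entropy(low + high))
```

Padding attacks that append low-entropy bytes only dilute such features at the file's tail, which is why where the entropy is measured matters as much as how.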
Overall, it seems this dataset hasn't aged very well; it is therefore surprising to see recently published papers still using it to evaluate new detection approaches.
References
https://www.kaggle.com/code/x75a40890/ms-malware-big-2015-0-0067-score