Alejandro Mosquera López is an online safety expert and Kaggle Grandmaster working in cybersecurity. His main research interests are Trustworthy AI and NLP. ORCID: https://orcid.org/0000-0002-6020-3569

Wednesday, December 7, 2022

Revisiting the Microsoft Malware Classification Challenge (BIG 2015) in 2022

In 2015, Microsoft provided the data science community with an unprecedented malware dataset, encouraging open-source progress on effective techniques for grouping variants of malware files into their respective families. Formatted as a Kaggle competition, it featured a very large (for that time) dataset comprising almost 40GB of compressed files containing disarmed malware samples and their corresponding disassembled ASM code.

At that time, my submitted solution had only a dozen heuristic features and used a simple Random Forest as the model, enough to secure a top 30 score (since it was my first Kaggle competition I ended up overfitting the public set and dropped 29 more positions, but that is a different story :) ). Revisiting publicly available feature sets from top solutions (e.g. top 10 with almost perfect scores), I was curious to see what type of features they would rely on the most.

In order to do that I quickly trained a LightGBM model and plotted the feature importance:

Index  Feature                     Importance
  293  section_names_header              1707
 1367  Offset.1                           839
  430  VirtualAlloc                       594
  284  Entropy                            504
   37  DllEntryPoint                      483
   81  misc1_assume                       461
  126  ent_q_diff_diffs_12                427
  189  ent_q_diff_diffs_1_median          426
 1371  dc_por                             414
 1393  string_len_counts_2                414
 1272  regs_esp                           400
   22  byte                               387
  263  ent_p_19                           387
    0  Virtual                            386
  287  section_names_.edata               377
  237  ent_q_diff_block_3_19              350
  148  ent_q_diff_block_0_8               342
  991  TB_00                              342
 1377  db3_rdata                          339
   19  DATA                               339
  107  ent_q_diffs_19                     318
 1398  string_len_counts_7                314
   45  void                               301
 1387  db3_NdNt                           297
 1258  regs_bh                            285
   23  word                               283
  112  ent_q_diffs_max                    282
  290  section_names_.rsrc                280
 1304  asm_commands_jnb                   277
 1296  asm_commands_in                    274
  165  ent_q_diff_diffs_0_min             244
 1374  dd_text                            239
 1366  FileSize                           235
  135  ent_q_diff_diffs_mean              228
 1196  TB_cd                              222
 1246  TB_ff                              219
  300  Unknown_Sections_lines_por         215
  295  Unknown_Sections                   213
  426  GetProcAddress                     211
 1686  contdll                            208
 1331  asm_commands_std                   208
 1257  regs_ax                            203
  163  ent_q_diff_diffs_0_median          197
 1381  dd5                                188
    2  loc                                181
   58  misc_visualc                       181
 1322  asm_commands_ror                   179
  468  FindFirstFileA                     177
    5  var                                174
   59  entry                              173

Going over the top features, it is clear that they would not be particularly resistant to adversarial modifications: adding fake imports, renaming sections or appending extra padding bytes would be very easy to perform and would likely upset the model's predictions. Entropy-based features (e.g. ent_q_diff_diffs_12) should in theory be more robust, although that depends on which portions of the executable they are computed over.
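
To make the point about entropy concrete, here is a small sketch using only the Python standard library. It is an illustration under stated assumptions, not the feature definition used by the referenced notebook: the shannon_entropy and block_entropies helpers, the 4096-byte block size and the random stand-in payload are all hypothetical. It compares a whole-buffer entropy with per-block entropies before and after appending zero padding.

```python
import math
import random
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def block_entropies(data: bytes, block_size: int = 4096) -> list:
    """Entropy of each consecutive block_size-byte chunk of data."""
    return [
        shannon_entropy(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


# Stand-in for a packed or encrypted payload (assumption): seeded random bytes.
rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(16 * 4096))

# Simulated adversarial modification: append as many zero bytes again.
padded = original + b"\x00" * len(original)

global_before = shannon_entropy(original)
global_after = shannon_entropy(padded)
blocks_before = block_entropies(original)
blocks_after = block_entropies(padded)

print(f"whole-buffer entropy: {global_before:.3f} -> {global_after:.3f}")
print(f"blocks: {len(blocks_before)} -> {len(blocks_after)}")
```

In this sketch the whole-buffer entropy drops once padding is appended, while the entropies of the original blocks are unchanged and only new zero-entropy blocks appear at the end. Aggregates over all blocks (mean, median, quantiles) would still shift, which is why the robustness depends on which portions of the file a feature is computed over.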

Overall, it seems that this dataset hasn't aged very well, so it is surprising to still see recently published papers using it to evaluate new detection approaches.

References

https://www.kaggle.com/code/x75a40890/ms-malware-big-2015-0-0067-score
