In 2015, Microsoft provided the data science community with an unprecedented malware dataset, encouraging open-source progress on effective techniques for grouping malware variants into their respective families. Run as a Kaggle competition, it featured a very large (for the time) dataset of almost 40GB of compressed files containing disarmed malware samples and their corresponding disassembled ASM code.
At the time, my submitted solution used only a dozen heuristic features and a simple Random Forest as the model, enough to secure a top-30 score (since it was my first Kaggle competition I ended up overfitting the public set and dropped 29 more positions, but that is a different story :) ). Revisiting publicly available feature sets from top solutions (e.g. top-10 entries with almost perfect scores), I was curious to see which types of features they relied on the most.
To find out, I quickly trained a LightGBM model on one of these feature sets and plotted the feature importance:
index | feature | importance
---|---|---
293 | section_names_header | 1707 |
1367 | Offset.1 | 839 |
430 | VirtualAlloc | 594 |
284 | Entropy | 504 |
37 | DllEntryPoint | 483 |
81 | misc1_assume | 461 |
126 | ent_q_diff_diffs_12 | 427 |
189 | ent_q_diff_diffs_1_median | 426 |
1371 | dc_por | 414 |
1393 | string_len_counts_2 | 414 |
1272 | regs_esp | 400 |
22 | byte | 387 |
263 | ent_p_19 | 387 |
0 | Virtual | 386 |
287 | section_names_.edata | 377 |
237 | ent_q_diff_block_3_19 | 350 |
148 | ent_q_diff_block_0_8 | 342 |
991 | TB_00 | 342 |
1377 | db3_rdata | 339 |
19 | DATA | 339 |
107 | ent_q_diffs_19 | 318 |
1398 | string_len_counts_7 | 314 |
45 | void | 301 |
1387 | db3_NdNt | 297 |
1258 | regs_bh | 285 |
23 | word | 283 |
112 | ent_q_diffs_max | 282 |
290 | section_names_.rsrc | 280 |
1304 | asm_commands_jnb | 277 |
1296 | asm_commands_in | 274 |
165 | ent_q_diff_diffs_0_min | 244 |
1374 | dd_text | 239 |
1366 | FileSize | 235 |
135 | ent_q_diff_diffs_mean | 228 |
1196 | TB_cd | 222 |
1246 | TB_ff | 219 |
300 | Unknown_Sections_lines_por | 215 |
295 | Unknown_Sections | 213 |
426 | GetProcAddress | 211 |
1686 | contdll | 208 |
1331 | asm_commands_std | 208 |
1257 | regs_ax | 203 |
163 | ent_q_diff_diffs_0_median | 197 |
1381 | dd5 | 188 |
2 | loc | 181 |
79 | misc_visualc | 181 |
1322 | asm_commands_ror | 179 |
468 | FindFirstFileA | 177 |
5 | var | 174 |
59 | entry | 173 |
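A ranking like the one above can be produced with any model exposing per-feature importances. Below is a minimal sketch using scikit-learn's RandomForestClassifier in place of LightGBM; the data and feature names are synthetic placeholders, not the competition features.

```python
# Sketch: rank features by model importance, mirroring the table above.
# Synthetic stand-in data; a Random Forest replaces the LightGBM model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["section_names_header", "Entropy", "FileSize", "noise"]

# Synthetic dataset: only the first feature is informative, the rest are noise.
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Sort features by importance, highest first.
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, imp in ranked:
    print(f"{name:>22} | {imp:.3f}")
```

With LightGBM the equivalent would be `lightgbm.plot_importance(model)` on the trained booster.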
Going over the top features, it is clear that they would not be particularly resistant to adversarial modifications: adding fake imports, renaming section names, or appending extra padding bytes would likely upset the model's predictions and would be very easy to perform. Entropy-based features (e.g. ent_q_diff_diffs_12) should in theory be more robust, though this depends on which portions of the executable they are computed over.
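To make the entropy-based features concrete, here is an illustrative sketch (not the competition's exact feature code): Shannon entropy computed over fixed-size byte blocks, the kind of signal behind features like ent_q_diff_*. The block size and sample bytes are arbitrary choices for the example.

```python
# Sketch: per-block Shannon entropy of a byte sequence.
import math
from collections import Counter

def block_entropy(data: bytes, block_size: int = 256) -> list[float]:
    """Entropy in bits per byte for each consecutive block of `data`."""
    entropies = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        counts = Counter(block)
        n = len(block)
        entropies.append(-sum(c / n * math.log2(c / n)
                              for c in counts.values()))
    return entropies

low = bytes(512)               # constant padding: entropy 0 bits/byte
high = bytes(range(256)) * 2   # uniform byte distribution: entropy 8 bits/byte
print(block_entropy(low + high))
```

Padding attacks that append low-entropy bytes only dilute such features at the file's tail, which is why where the entropy is measured matters as much as how.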
Overall, it seems this dataset hasn't aged very well; it is therefore surprising to see recently published papers still using it to evaluate new detection approaches.
References
https://www.kaggle.com/code/x75a40890/ms-malware-big-2015-0-0067-score