## Tuesday, January 24, 2023

### The string similarity problem

For two strings A and B (in the ASCII [a-z] range), we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.

The reader is asked to calculate the sum of similarities of a string S with each of its suffixes. Reference (https://www.hackerrank.com/challenges/string-similarity/problem)

Example input:

sim(ababaa)  -> 11

sim(aa) -> 3

Considering the above problem statement we can propose a very naïve solution as follows:

def prefixes(s):

pre = ([s[0:k] for k in range(len(s)+1)])

return pre

def suffixes(s):

su = ([s[k:len(s)] for k in range(len(s))])

return su

def sim(a, b):

maxl = 0

a_pre = prefixes(a)

b_pre= prefixes(b)

for aa in a_pre:

if aa in b_pre:

if len(aa)>maxl:

maxl = len(aa)

return maxl

def simsufix(s):

su = suffixes(s)

res = 0

for k in su:

si = sim(s,k)

res= res+si

return res

While “correct”, the above solution has several complexity-related issues. It is highly unoptimized for both time(O(n^2)) and space complexity!

A substantial improvement can be achieved by avoiding the intermediate storage of the suffixes and prefixes and performing the comparisons via string indexing instead. However the time complexity is still nonoptimal for large strings:

def simsufix(s):

res=0

for p in range(len(s)):

cursum = 0

for k,l in zip(s,s[p:len(s)]):

if k==l:

cursum=cursum+1

else:

break

res=res+cursum

return res

Can we do better? A workaround could involve the construction of a Z Array:

For a string str[0..n-1], the Z array is of the same length as string. An element Z[i] of Z array stores the length of the longest substring starting from str[i] which is also a prefix of str[0..n-1].

The idea is to concatenate pattern and text, and create a string “P\$T” where P is pattern, \$ is a special character that should not be present in pattern and text, and T is text. Build the Z array for concatenated strings. In Z array, if Z value at any point is equal to pattern length, then pattern is present at that point.

E.g. for our test case “ababaa” lets concatenate the string with itself and calculate its Z array:

 Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 Z Text: a b a b a a \$ a b a b a a Z Values: 0 3 0 1 1 0 6 0 3 0 1 1

If we add the Z values below: 6 + 3 + 1 + 1 = 11!

Let’s reimplement our comparison function:

def simsufix(s):

concat = s + "#" + s

c=0

i=0

j=1

z=[]

while(j<len(concat)):

if(concat[i]==concat[j]):

c+=1

j+=1

i+=1

continue

else:

z.append(c)

c=0

j-=i

i=0

j+=1

z.append(len(s))

return sum(z)

This code had a major improvement in time complexity: from quadratic to O(m+n), where m is the length of the string and n is the length of the pattern to be searched. Likewise the space complexity is also reduced to O(n+m).

This is already looking better, but can it be optimized further? Some potential minor improvements could involve just considering the special case where the string to match is the original string itself and avoiding the dynamic growth of the Z array:

def simsufix(s):

concat = s + "#"

c=0

i=0

j=1

z =  * len(s)

cur = 0

while(j<len(concat)):

if(concat[i]==concat[j]):

c+=1

j+=1

i+=1

continue

else:

z[cur] = c

cur+=1

c=0

j-=i

i=0

j+=1

return sum(z) + len(s)

While the above is substantially better than the starting baseline, there is still room for improvements… which are left as an exercise to the reader.

## Wednesday, December 7, 2022

### Revisiting the Microsoft Malware Classification Challenge (BIG 2015) in 2022

In 2015, Microsoft provided the data science community with an unprecedented malware dataset and encouraging open-source progress on effective techniques for grouping variants of malware files into their respective families. Formatted as a Kaggle Competition, it featured a very large (for that time) dataset comprising of almost 40GB of compressed files containing disarmed malware samples and their corresponding disassembled ASM code.

At that time, my submitted solution had only a dozen of heuristic features and used a simple Random Forest as model, enough to secure a top 30 score (since it was my first Kaggle competition I ended up overfitting the public set and dropped 29 more positions, but that is a different story :) ). Revisiting publicly available feature sets from top solutions (e.g. top 10 with almost perfect scores) I was curious to see what type of features they would rely on the most.

In order to do that I quickly trained a LightGBM model and plotted the feature importance:

293 1367 section_names_header 1707 Offset.1 839 VirtualAlloc 594 Entropy 504 DllEntryPoint 483 misc1_assume 461 ent_q_diff_diffs_12 427 ent_q_diff_diffs_1_median 426 dc_por 414 string_len_counts_2 414 regs_esp 400 byte 387 ent_p_19 387 Virtual 386 section_names_.edata 377 ent_q_diff_block_3_19 350 ent_q_diff_block_0_8 342 TB_00 342 db3_rdata 339 DATA 339 ent_q_diffs_19 318 string_len_counts_7 314 void 301 db3_NdNt 297 regs_bh 285 word 283 ent_q_diffs_max 282 section_names_.rsrc 280 asm_commands_jnb 277 asm_commands_in 274 ent_q_diff_diffs_0_min 244 dd_text 239 FileSize 235 ent_q_diff_diffs_mean 228 TB_cd 222 TB_ff 219 Unknown_Sections_lines_por 215 Unknown_Sections 213 GetProcAddress 211 contdll 208 asm_commands_std 208 regs_ax 203 ent_q_diff_diffs_0_median 197 dd5 188 loc 181 misc_visualc 181 asm_commands_ror 179 FindFirstFileA 177 var 174 entry 173

Going over the top features it is clear that they would not be particularly resistant to adversarial modifications, e.g. adding fake imports, renaming section names or adding extra padding bytes would likely upset the model predictions and would be very easy to perform. Entropy-based features e.g. (ent_q_diff_diffs_12) should be in theory more robust, depending on which portions of the executable are computed though.

Overall, it seems that this dataset hasn't aged very well, however it is still surprising to see recently published papers using it to evaluate new detection approaches.

References

https://www.kaggle.com/code/x75a40890/ms-malware-big-2015-0-0067-score