Tuesday, January 24, 2023

The string similarity problem

For two strings A and B (in the ASCII [a-z] range), we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.

The task is to calculate the sum of similarities of a string S with each of its suffixes (reference: https://www.hackerrank.com/challenges/string-similarity/problem).


Examples:


sim(ababaa) -> 11   (the suffixes "ababaa", "babaa", "abaa", "baa", "aa", "a" contribute 6 + 0 + 3 + 0 + 1 + 1)

sim(aa) -> 3   (the suffixes "aa" and "a" contribute 2 + 1)


Considering the above problem statement, we can propose a very naïve solution as follows:



def prefixes(s):
    # All prefixes of s, including the empty string and s itself.
    return [s[:k] for k in range(len(s) + 1)]


def suffixes(s):
    # All suffixes of s, from s itself down to its last character.
    return [s[k:] for k in range(len(s))]


def sim(a, b):
    # Longest common prefix, computed (wastefully) by materializing
    # every prefix of a and testing it against every prefix of b.
    maxl = 0
    a_pre = prefixes(a)
    b_pre = prefixes(b)
    for aa in a_pre:
        if aa in b_pre and len(aa) > maxl:
            maxl = len(aa)
    return maxl


def simsufix(s):
    # Sum of similarities of s with each of its suffixes.
    res = 0
    for k in suffixes(s):
        res += sim(s, k)
    return res
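
Before optimizing, a quick sanity check against the examples above:


assert simsufix("ababaa") == 11
assert simsufix("aa") == 3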



While “correct”, the above solution has several complexity-related issues: it is highly unoptimized in both time and space. Materializing every prefix of every suffix alone costs O(n^3) characters across all the sim calls, so the overall running time is at least cubic, and each call holds O(n^2) characters of prefix lists in memory.


A substantial improvement can be achieved by not materializing the prefix and suffix lists at all and comparing characters pairwise instead. However, the time complexity is still quadratic, which remains suboptimal for large strings:


def simsufix(s):
    # O(n^2): compare s against each of its suffixes character by character.
    res = 0
    for p in range(len(s)):
        cursum = 0
        for k, l in zip(s, s[p:]):
            if k == l:
                cursum += 1
            else:
                break
        res += cursum
    return res


Can we do better? A faster approach involves the construction of a Z array:

For a string str[0..n-1], the Z array has the same length as the string. An element Z[i] of the Z array stores the length of the longest substring starting at str[i] which is also a prefix of str[0..n-1].


The idea is borrowed from pattern matching: concatenate the pattern and the text into a string “P$T”, where P is the pattern, $ is a special character present in neither the pattern nor the text, and T is the text. Build the Z array of the concatenated string; wherever a Z value equals the pattern length, the pattern occurs at that point. In our case P and T are both S itself, and each Z value in the T half is precisely the similarity of S with the suffix starting there.
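
To make the definition concrete, here is a minimal, deliberately quadratic sketch of a Z-array computation (the helper name is mine); it reproduces the table below:


def z_array_naive(s):
    # Z[i] = length of the longest common prefix of s and s[i:].
    # Z[0] is left at 0 here; some texts define it as len(s) instead.
    n = len(s)
    z = [0] * n
    for i in range(1, n):
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z


print(z_array_naive("ababaa$ababaa"))
# [0, 0, 3, 0, 1, 1, 0, 6, 0, 3, 0, 1, 1]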


E.g. for our test case “ababaa”, let's concatenate the string with itself (separated by “$”) and calculate the Z array of the result:



Index:      0   1   2   3   4   5   6   7   8   9  10  11  12
Z Text:     a   b   a   b   a   a   $   a   b   a   b   a   a
Z Values:   -   0   3   0   1   1   0   6   0   3   0   1   1
If we add the Z values to the right of the “$” separator: 6 + 0 + 3 + 0 + 1 + 1 = 11, exactly the expected result!


Let’s reimplement our comparison function:



def simsufix(s):
    # Simplified Z-style scan over s + "#" + s ("#" plays the role of
    # the "$" separator above): i walks the prefix, j walks the rest;
    # on a mismatch, record the match length and restart one position
    # after the previous candidate start.
    concat = s + "#" + s
    c = 0
    i = 0
    j = 1
    z = []
    while j < len(concat):
        if concat[i] == concat[j]:
            c += 1
            i += 1
            j += 1
            continue
        z.append(c)
        c = 0
        j -= i   # backtrack: next start is one past the previous one
        i = 0
        j += 1
    z.append(len(s))   # the trailing copy of s matches the prefix fully
    return sum(z)


This version is a major practical improvement: a single left-to-right pass over the concatenated string replaces the per-suffix slicing and prefix materialization, and the extra space drops to the O(n) concatenated string plus the Z list. Note, however, that this is not yet the true Z algorithm: on a mismatch the scan backtracks (j -= i), so for highly repetitive strings such as "aaa…a" the worst case remains quadratic. The full Z algorithm, which reuses previously computed Z values to skip re-comparisons, achieves O(n) time (and is sketched at the end of this post).
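
A hypothetical micro-benchmark makes the remaining worst case visible (absolute timings will vary by machine; simsufix here is the version just defined):


import random
import timeit

# Repetitive input: every mismatch backtracks, so the scan performs
# roughly n^2/2 character comparisons.
print(timeit.timeit(lambda: simsufix("a" * 3000), number=1))

# Random input: long prefix matches are rare, so the scan stays close
# to linear in practice.
rand = "".join(random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(3000))
print(timeit.timeit(lambda: simsufix(rand), number=1))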


This is already looking better, but can it be optimized further? Some minor improvements follow from the special case at hand: since the string to match is the original string itself, the second copy after the separator can be dropped entirely, and the dynamic growth of the Z list can be avoided by preallocating it:


def simsufix(s):
    # Same scan, but without the second copy of s: the "#" sentinel is
    # enough, and the full self-match is accounted for by adding len(s).
    concat = s + "#"
    c = 0
    i = 0
    j = 1
    z = [0] * len(s)   # preallocated: exactly one entry per start position
    cur = 0
    while j < len(concat):
        if concat[i] == concat[j]:
            c += 1
            i += 1
            j += 1
            continue
        z[cur] = c
        cur += 1
        c = 0
        j -= i
        i = 0
        j += 1
    return sum(z) + len(s)
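
The behaviour is unchanged, as a quick check confirms:


assert simsufix("ababaa") == 11
assert simsufix("aa") == 3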


While the above is substantially better than the starting baseline, there is still room for improvements… which are left as an exercise to the reader.
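
As a head start on that exercise, here is a minimal sketch of the classical linear-time Z-array construction (the Z-box technique; function and variable names are mine). It removes the backtracking entirely, and in this special case no separator or concatenation is needed at all, since Z[i] of s itself is exactly the similarity of s with the suffix starting at i:


def z_array(s):
    # Classical O(n) Z algorithm: maintain the rightmost window [l, r)
    # that matches a prefix of s, and reuse Z values computed inside it.
    n = len(s)
    z = [0] * n
    l, r = 0, 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])  # reuse without re-comparing
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                    # extend past the known window
        if i + z[i] > r:
            l, r = i, i + z[i]           # new rightmost window
    return z


def simsufix_linear(s):
    # The whole string matches itself with similarity len(s);
    # every other suffix contributes its Z value.
    return len(s) + sum(z_array(s))


assert simsufix_linear("ababaa") == 11
assert simsufix_linear("aa") == 3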


Wednesday, December 7, 2022

Revisiting the Microsoft Malware Classification Challenge (BIG 2015) in 2022

In 2015, Microsoft provided the data science community with an unprecedented malware dataset, encouraging open-source progress on effective techniques for grouping variants of malware files into their respective families. Formatted as a Kaggle competition, it featured a very large (for the time) dataset comprising almost 40 GB of compressed files containing disarmed malware samples and their corresponding disassembled ASM code.

At that time, my submitted solution had only a dozen heuristic features and used a simple Random Forest as the model, enough to secure a top-30 score (since it was my first Kaggle competition, I ended up overfitting the public leaderboard and dropped 29 more positions, but that is a different story :) ). Revisiting publicly available feature sets from top solutions (e.g. the top 10, with almost perfect scores), I was curious to see which types of features they relied on the most.

In order to do that, I quickly trained a LightGBM model and plotted the feature importances:

Index   Feature                        Importance
  293   section_names_header                 1707
 1367   Offset.1                              839
  430   VirtualAlloc                          594
  284   Entropy                               504
   37   DllEntryPoint                         483
   81   misc1_assume                          461
  126   ent_q_diff_diffs_12                   427
  189   ent_q_diff_diffs_1_median             426
 1371   dc_por                                414
 1393   string_len_counts_2                   414
 1272   regs_esp                              400
   22   byte                                  387
  263   ent_p_19                              387
    0   Virtual                               386
  287   section_names_.edata                  377
  237   ent_q_diff_block_3_19                 350
  148   ent_q_diff_block_0_8                  342
  991   TB_00                                 342
 1377   db3_rdata                             339
   19   DATA                                  339
  107   ent_q_diffs_19                        318
 1398   string_len_counts_7                   314
   45   void                                  301
 1387   db3_NdNt                              297
 1258   regs_bh                               285
   23   word                                  283
  112   ent_q_diffs_max                       282
  290   section_names_.rsrc                   280
 1304   asm_commands_jnb                      277
 1296   asm_commands_in                       274
  165   ent_q_diff_diffs_0_min                244
 1374   dd_text                               239
 1366   FileSize                              235
  135   ent_q_diff_diffs_mean                 228
 1196   TB_cd                                 222
 1246   TB_ff                                 219
  300   Unknown_Sections_lines_por            215
  295   Unknown_Sections                      213
  426   GetProcAddress                        211
 1686   contdll                               208
 1331   asm_commands_std                      208
 1257   regs_ax                               203
  163   ent_q_diff_diffs_0_median             197
 1381   dd5                                   188
    2   loc                                   181
   79   misc_visualc                          181
 1322   asm_commands_ror                      179
  468   FindFirstFileA                        177
    5   var                                   174
   59   entry                                 173
Going over the top features, it is clear that they would not be particularly resistant to adversarial modification: adding fake imports, renaming sections, or appending extra padding bytes would likely upset the model's predictions and would be very easy to perform. Entropy-based features (e.g. ent_q_diff_diffs_12) should in theory be more robust, although that depends on which portions of the executable they are computed over.

Overall, it seems that this dataset hasn't aged very well; it is therefore still surprising to see recently published papers using it to evaluate new detection approaches.

References

https://www.kaggle.com/code/x75a40890/ms-malware-big-2015-0-0067-score