Alejandro Mosquera López is an online safety expert and Kaggle Grandmaster working in cybersecurity. His main research interests are Trustworthy AI and NLP. ORCID iD: https://orcid.org/0000-0002-6020-3569

Tuesday, January 24, 2023

The string similarity problem

For two strings A and B (in the ASCII [a-z] range), we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.

The reader is asked to calculate the sum of similarities of a string S with each of its suffixes. Reference: https://www.hackerrank.com/challenges/string-similarity/problem



Examples:


sim(ababaa)  -> 11

sim(aa) -> 3
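
For the first example, the suffixes of "ababaa" and their common-prefix lengths with the full string are: "ababaa" -> 6, "babaa" -> 0, "abaa" -> 3, "baa" -> 0, "aa" -> 1 and "a" -> 1, which indeed sum to 11.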


Considering the above problem statement, we can propose a very naïve solution as follows:



def prefixes(s):
    # All prefixes of s, including the empty prefix
    return [s[0:k] for k in range(len(s) + 1)]


def suffixes(s):
    # All non-empty suffixes of s
    return [s[k:len(s)] for k in range(len(s))]


def sim(a, b):
    # Length of the longest common prefix of a and b,
    # found by intersecting their prefix lists
    maxl = 0
    a_pre = prefixes(a)
    b_pre = prefixes(b)
    for aa in a_pre:
        if aa in b_pre:
            if len(aa) > maxl:
                maxl = len(aa)
    return maxl


def simsufix(s):
    # Sum of the similarities of s with each of its suffixes
    su = suffixes(s)
    res = 0
    for k in su:
        si = sim(s, k)
        res = res + si
    return res
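
As a quick sanity check (assuming the functions above have been defined), the two example values can be reproduced directly:

assert simsufix("ababaa") == 11
assert simsufix("aa") == 3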



While “correct”, the above solution has several complexity-related issues: it materialises every prefix and every suffix, which already costs O(n^2) extra space, and the repeated list-membership checks over those prefix lists push the running time well beyond quadratic!


A substantial improvement can be achieved by avoiding the intermediate storage of all the suffixes and prefixes and performing the comparisons character by character instead. However, the time complexity is still not optimal for large strings:


def simsufix(s):
    # Sum of the similarities of s with each of its suffixes,
    # comparing characters directly instead of building prefix lists
    res = 0
    for p in range(len(s)):
        cursum = 0
        # Walk the string and the suffix starting at p in lockstep
        for k, l in zip(s, s[p:len(s)]):
            if k == l:
                cursum = cursum + 1
            else:
                break
        res = res + cursum
    return res


Can we do better? A better approach involves the construction of a Z array:

For a string str[0..n-1], the Z array has the same length as the string. The element Z[i] stores the length of the longest substring starting at str[i] that is also a prefix of str[0..n-1].


The classic application is pattern matching: concatenate the pattern and the text into a string “P$T”, where P is the pattern, $ is a special character present in neither the pattern nor the text, and T is the text, and build the Z array of this concatenation. Wherever a Z value equals the pattern length, the pattern occurs at that position.
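
As a point of reference, the Z values in the example below can be reproduced with a minimal, definition-faithful sketch (the helper name z_array_naive is purely illustrative and not part of the solutions in this post); the optimized, linear-time Z algorithm additionally maintains a so-called Z-box to avoid re-comparing characters:

def z_array_naive(t):
    # Z[i] = length of the longest common prefix of t and t[i:]
    # (Z[0] is left at 0 by convention here)
    n = len(t)
    z = [0] * n
    for i in range(1, n):
        while i + z[i] < n and t[z[i]] == t[i + z[i]]:
            z[i] += 1
    return z

# z_array_naive("ababaa$ababaa") -> [0, 0, 3, 0, 1, 1, 0, 6, 0, 3, 0, 1, 1]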


E.g. for our test case “ababaa”, let’s concatenate the string with itself (separated by “$”) and calculate the Z array of the result:



Index:    0  1  2  3  4  5  6  7  8  9  10  11  12
Text:     a  b  a  b  a  a  $  a  b  a  b   a   a
Z value:  -  0  3  0  1  1  0  6  0  3  0   1   1


If we add up the Z values to the right of the “$” separator: 6 + 0 + 3 + 0 + 1 + 1 = 11, which is exactly sim(ababaa)!
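
Using the z_array_naive sketch from above (again, a purely illustrative helper), the same sum can be checked programmatically:

z = z_array_naive("ababaa$ababaa")
# the separator sits at index 6, so the suffix Z values start at index 7
assert sum(z[7:]) == 11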


Let’s reimplement our similarity-sum function along these lines:



def simsufix(s):
    # Concatenate the string with itself, separated by a character
    # that cannot appear in s (the input is restricted to [a-z])
    concat = s + "#" + s
    c = 0   # length of the current prefix match
    i = 0   # position inside the prefix
    j = 1   # position inside the concatenated string
    z = []
    while j < len(concat):
        if concat[i] == concat[j]:
            # Extend the current match
            c += 1
            j += 1
            i += 1
            continue
        else:
            # Mismatch: record the Z value and restart at the next position
            z.append(c)
            c = 0
            j -= i
            i = 0
        j += 1
    # The match starting at the second copy of s always runs to the end of
    # concat and is never recorded in the loop, so account for it explicitly
    z.append(len(s))
    return sum(z)


The Z-array formulation brings a major improvement in time complexity: the full Z algorithm runs in O(m+n), where m is the length of the text and n is the length of the pattern (here both are the string itself). The simplified construction above omits the Z-box bookkeeping, so on highly repetitive strings its worst case is still quadratic, but it avoids the slicing and intermediate lists of the earlier versions and behaves close to linearly on typical inputs. Likewise, the space complexity is reduced to O(n+m) for the concatenated string and the Z array.


This is already looking better, but can it be optimized further? Some minor improvements involve exploiting the special case where the pattern to match is the string itself (so only one copy of the string needs to be scanned) and avoiding the dynamic growth of the Z array:


def simsufix(s):
    # Only the original string needs to be scanned; the trailing "#"
    # guarantees that every match run terminates with a mismatch
    concat = s + "#"
    c = 0
    i = 0
    j = 1
    z = [0] * len(s)   # pre-allocated: one Z value per scanned position
    cur = 0
    while j < len(concat):
        if concat[i] == concat[j]:
            c += 1
            j += 1
            i += 1
            continue
        else:
            z[cur] = c
            cur += 1
            c = 0
            j -= i
            i = 0
        j += 1
    # len(s) accounts for the trivial suffix (the string itself)
    return sum(z) + len(s)
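
As a final check (assuming the simsufix version above is the one in scope), the optimized function still reproduces the expected results:

print(simsufix("ababaa"))  # 11
print(simsufix("aa"))      # 3
print(simsufix("a" * 5))   # 15 = 5 + 4 + 3 + 2 + 1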


While the above is substantially better than the starting baseline, there is still room for improvements… which are left as an exercise to the reader.

