http://www.quora.com/Given-a-string-how-do-I-find-the-number-of-distinct-substrings-of-the-string
(used)
http://web.stanford.edu/class/cs97si/suffix-array.pdf
https://greasepalm.wordpress.com/2012/07/01/suffix-arrays-a-simple-tutorial/
http://www.quora.com/How-do-I-find-the-total-number-of-different-palindromes-of-length-k-in-a-given-string-using-suffix-array (used)
http://www.quora.com/If-two-strings-have-more-than-one-longest-common-subsequence-considering-length-what-is-the-algorithm-to-find-the-lexicographically-smallest-longest-common-subsequence-among-them
http://www.roman10.net/suffix-array-part-3-longest-common-substring-lcs/ (read imp)
question link
http://www.spoj.com/problems/DISUBSTR/
http://www.spoj.com/problems/SARRAY/ (this is for understanding the complexity of ur designed suffix array implimentation)
suffix array standard code( time complexity o(n log^2 n ) using bucket sort
https://gist.github.com/calmhandtitan/8119030
String hashing
can be used instead of suffix array and all (in some cases)
tutorial by lalit kundu
http://threads-iiith.quora.com/String-Hashing-for-competitive-programming
code of DISUBSTR
as per the instructions of quora (implementation of suffix array) , first link
time complexity( n^2 logn)
#include<stdio.h>
#include<string.h>
#include<algorithm>
#include<iostream>
typedef long long ll;
using namespace std;
int L;
int su[1005];
char arr[1005];
int mpp(int a,int b)
{
while(arr[a]==arr[b] && a<L && b<L){a++;b++;}
if(arr[a]==NULL || arr[b]==NULL)
{
a--;b--;
}
if(arr[a]!=arr[b])
{
return arr[a]<arr[b];
}
else// the one which will be shorter all be being equal in all respect will come first
{
return a>b;
}
}
int str(int l)
{
for(int i=0;i<=l;i++)
su[i]=i;
sort(su,su+l,mpp);
}
int main()
{
int t;
// freopen("kr.in","r",stdin);
scanf("%d",&t);
while(t--)
{
scanf("%s",arr);
ll l=strlen(arr);
L=l;
str(l);
ll s=l-su[0];
for(int i=0;i<l-1;i++)
{
ll p=l-su[i+1];
ll co=0;
ll a=su[i];
ll b=su[i+1];
while(arr[a]==arr[b])
{
co++;
a++;
b++;
}
s+=p-co;
}
printf("%lld\n",s);
}
}
code of DISUBSTR
as per the instructions of quora (implementation of suffix array) , first link
time complexity( n^2 logn)
#include<stdio.h>
#include<string.h>
#include<algorithm>
#include<iostream>
typedef long long ll;
using namespace std;
int L;
int su[1005];
char arr[1005];
int mpp(int a,int b)
{
while(arr[a]==arr[b] && a<L && b<L){a++;b++;}
if(arr[a]==NULL || arr[b]==NULL)
{
a--;b--;
}
if(arr[a]!=arr[b])
{
return arr[a]<arr[b];
}
else// the one which will be shorter all be being equal in all respect will come first
{
return a>b;
}
}
int str(int l)
{
for(int i=0;i<=l;i++)
su[i]=i;
sort(su,su+l,mpp);
}
int main()
{
int t;
// freopen("kr.in","r",stdin);
scanf("%d",&t);
while(t--)
{
scanf("%s",arr);
ll l=strlen(arr);
L=l;
str(l);
ll s=l-su[0];
for(int i=0;i<l-1;i++)
{
ll p=l-su[i+1];
ll co=0;
ll a=su[i];
ll b=su[i+1];
while(arr[a]==arr[b])
{
co++;
a++;
b++;
}
s+=p-co;
}
printf("%lld\n",s);
}
}
explaination
This is one of the problems in SPOJ (Sphere Online Judge (SPOJ))
The solution consists of constructing the suffix array and then finding the number of distinct substrings based on the Longest Common Prefixes.
One key observation here is that:
If you look through the prefixes of each suffix of a string, you have covered all substrings of that string.
Let us take an example: BANANA
Suffixes are:
0) BANANA
1) ANANA
2) NANA
3) ANA
4) NA
5) A
It would be a lot easier to go through the prefixes if we sort the above set of suffixes, as we can skip the repeated prefixes easily.
Sorted set of suffixes:
5) A
3) ANA
1) ANANA
0) BANANA
4) NA
2) NANA
From now on,
LCP = Longest Common Prefix of 2 strings.
Initialize
ans = length(first suffix) = length("A") = 1.
Now consider the consecutive pairs of suffixes, i.e, [A, ANA], [ANA, ANANA], [ANANA, BANANA], etc. from the above set of sorted suffixes.
We can see that,
LCP("A", "ANA") = "A".
All characters that are not part of the common prefix contribute to a distinct substring. In the above case, they are 'N' and 'A'. So they should be added toans.
So we have,
1 2 | ans += length("ANA") - LCP("A", "ANA")
ans = ans + 3 - 1 = ans + 2 = 3
|
Do the same for the next pair of consecutive suffixes: ["ANA", "ANANA"]
1 2 3 4 | LCP("ANA", "ANANA") = "ANA".
ans += length("ANANA") - length(LCP)
=> ans = ans + 5 - 3
=> ans = 3 + 2 = 5.
|
Similarly, we have:
1 2 | LCP("ANANA", "BANANA") = 0
ans = ans + length("BANANA") - 0 = 11
|
1 2 | LCP("BANANA", "NA") = 0
ans = ans + length("NA") - 0 = 13
|
1 2 | LCP("NA", "NANA") = 2
ans = ans + length("NANA") - 2 = 15
|
Hence the number of distinct substrings for the string "BANANA" = 15.
lcp computation for lcs (longest common string etc)
for (i = 0; i < len1 + len2 - 1; ++i) {
if ((ap[i] - cstr >= len1) && (ap[i+1] - cstr >= len1)) {
//both starts with suffix of second string
continue;
} else if ((ap[i] - cstr < len1) && (ap[i+1] - cstr < len1)) {
//both starts with suffix of first string
continue;
} else {
lcplen = lcp(ap[i], ap[i+1]);
if (lcplen > lcslen) {
lcslen = lcplen;
lcssufpos = i;
}
}
we generally merge two strings s1 and s2 and then form the suffix array and do the lcp computation ,
in then lcp computation we needed to ensure that , the consecutive suffixes that we are comparing dont belong to the same string . and then do the needful.
the above part of code explains how it is to be performed