package miscellaneous.longestCommonSubstring;
/*
* What is Suffix tree
*
* Lets take an example, str = "BANANA"
* If told to find substring in "BANANA" then answer is "ANA" as it is repeated twice.
* How to find that "ANA" is longest repeated in "BANANA".
* There can be many solutions, but using suffix array it is easy to solve.
*
* Add end Character say $ to "BANANA", so now our string becomes "BANANA$"
* Now divide the possible substrings either from start or from end, here lets consider from end.
*
* BANANA$ with index as 1234567 ie B is at index 1, A is at index 2 till $ is at index 7.
* 0. $ from BANANA($) -->7
* 1. A$ from BANAN(A$) -->6
* 2. NA$ from BANA(NA$) -->5
* 3. ANA$ from BAN(ANA$) -->4
* 4. NANA$ from BA(NANA$) -->3
* 5. ANANA$ from B(ANANA$) -->2
* 6. BANANA$ from (BANANA$) -->1
*
* Now initially our tree is empty.
* 1. First comes $
* .
* |
* $ (7) ---> 7 because $ is at index 7
*
* 2. Now comes A$
* . __$ (7)
* |
* A$ (6) ---> 6 because A starts at index 6
*
* 4. Now came NA$, check here is there any path available starting with 'N' as NA$ starts with 'N', answer is NO, so we have to start new path for it.
*
* . __$(7)
* / \
* (6) A$ NA$ (5)
*
* 5. Now came ANA$, check here is there any path available starting with 'A' as ANA$ starts with 'A', answer is YES i.e A$ exist,
* so we have to check, 2nd character of ANA$ and A$ as first character of both matched which is N and $ respectively.
* it is not matching, so in ANA$ and A$ we have only one character ie 'A' matching, keeping common string and stream uncommon paths.
*
* . __$(7)
* / \
* A NA$ (5)
* / \
* (4) NA$ $(6)
*
* 6. Similarly, now comes NANA$, yes we have path existing which start with 'N' i.e "NA$", also 2nd character is matching 'A' from NANA$ and NA$. but 3rd character
* is not matching i.e $ in NA$ and N in NANA$, So we have to break from that point
*
* . ____________$(7)
* / \
* A NA
* / \ / \
* (4) NA$ $(6) NA$(3) $(5)
*
* 7. Similarly, check for ANANA$ and BANANA$ (for which separate link is required as no path starts from B)
*
* . ____________$(7)
* / | \
* A BANANA$ NA
* / \ (1) / \
* NA $ (6) (3)NA$ $(5)
* / \
* (2) NA$ $(4)
*
*
* We can see in figure that, longest repeated substring will end at the internal node which is farthest from the root (i.e. deepest node in the tree),
* because length of substring is the path label length from root to that internal node
* Finding the deepest internal node in the tree. Depth is measured by the number of characters traversed from the root.
* ie ANA in our case.
*
* How it happens is,
* If w is a repeated substring of Suffix Tree T, it must be a prefix of at least two different suffixes.
* Because the prefix is the same each time, all those suffixes will be in the same subtree.
*
*
* Same result can be achieved using suffix arrays.
*
* banana
* -----------
*
* banana
* anana
* nana
* ana
* na
* a
*
* Now sort it,
* a
* ana
* anana
* banana
* na
* nana
*
*
*/
import java.util.Arrays;
public class LongestRepeatedSubstring {
// return the longest common prefix of s and t
public static String lcp(String s, String t) {
int n = Math.min(s.length(), t.length());
for (int i = 0; i < n; i++) {
if (s.charAt(i) != t.charAt(i))
return s.substring(0, i);
}
return s.substring(0, n);
}
// return the longest repeated string in s
public static String lrs(String s) {
// form the N suffixes
int N = s.length();
String[] suffixes = new String[N];
for (int i = 0; i < N; i++) {
suffixes[i] = s.substring(i, N);
}
// sort them
Arrays.sort(suffixes);
// find longest repeated substring by comparing adjacent sorted suffixes
String lrs = "";
for (int i = 0; i < N - 1; i++) {
String x = lcp(suffixes[i], suffixes[i+1]);
if (x.length() > lrs.length())
lrs = x;
}
return lrs;
}
// read in text, replacing all consecutive whitespace with a single space
// then compute longest repeated substring
public static void main(String[] args) {
String s = "banana";
s = s.replaceAll("\\s+", " ");
System.out.println("'" + lrs(s) + "'");
}
}
Questions on Stack, Queues, Linkedlist, Binary Trees, Sorting, Searching, Graphs etc with solution using Java Language.
Wednesday, 4 March 2015
Longest Repeated Substring
Labels:
miscellaneous
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment