package miscellaneous.longestCommonSubstring; /* * What is Suffix tree * * Lets take an example, str = "BANANA" * If told to find substring in "BANANA" then answer is "ANA" as it is repeated twice. * How to find that "ANA" is longest repeated in "BANANA". * There can be many solutions, but using suffix array it is easy to solve. * * Add end Character say $ to "BANANA", so now our string becomes "BANANA$" * Now divide the possible substrings either from start or from end, here lets consider from end. * * BANANA$ with index as 1234567 ie B is at index 1, A is at index 2 till $ is at index 7. * 0. $ from BANANA($) -->7 * 1. A$ from BANAN(A$) -->6 * 2. NA$ from BANA(NA$) -->5 * 3. ANA$ from BAN(ANA$) -->4 * 4. NANA$ from BA(NANA$) -->3 * 5. ANANA$ from B(ANANA$) -->2 * 6. BANANA$ from (BANANA$) -->1 * * Now initially our tree is empty. * 1. First comes $ * . * | * $ (7) ---> 7 because $ is at index 7 * * 2. Now comes A$ * . __$ (7) * | * A$ (6) ---> 6 because A starts at index 6 * * 4. Now came NA$, check here is there any path available starting with 'N' as NA$ starts with 'N', answer is NO, so we have to start new path for it. * * . __$(7) * / \ * (6) A$ NA$ (5) * * 5. Now came ANA$, check here is there any path available starting with 'A' as ANA$ starts with 'A', answer is YES i.e A$ exist, * so we have to check, 2nd character of ANA$ and A$ as first character of both matched which is N and $ respectively. * it is not matching, so in ANA$ and A$ we have only one character ie 'A' matching, keeping common string and stream uncommon paths. * * . __$(7) * / \ * A NA$ (5) * / \ * (4) NA$ $(6) * * 6. Similarly, now comes NANA$, yes we have path existing which start with 'N' i.e "NA$", also 2nd character is matching 'A' from NANA$ and NA$. but 3rd character * is not matching i.e $ in NA$ and N in NANA$, So we have to break from that point * * . ____________$(7) * / \ * A NA * / \ / \ * (4) NA$ $(6) NA$(3) $(5) * * 7. Similarly, check for ANANA$ and BANANA$ (for which separate link is required as no path starts from B) * * . ____________$(7) * / | \ * A BANANA$ NA * / \ (1) / \ * NA $ (6) (3)NA$ $(5) * / \ * (2) NA$ $(4) * * * We can see in figure that, longest repeated substring will end at the internal node which is farthest from the root (i.e. deepest node in the tree), * because length of substring is the path label length from root to that internal node * Finding the deepest internal node in the tree. Depth is measured by the number of characters traversed from the root. * ie ANA in our case. * * How it happens is, * If w is a repeated substring of Suffix Tree T, it must be a prefix of at least two different suffixes. * Because the prefix is the same each time, all those suffixes will be in the same subtree. * * * Same result can be achieved using suffix arrays. * * banana * ----------- * * banana * anana * nana * ana * na * a * * Now sort it, * a * ana * anana * banana * na * nana * * */ import java.util.Arrays; public class LongestRepeatedSubstring { // return the longest common prefix of s and t public static String lcp(String s, String t) { int n = Math.min(s.length(), t.length()); for (int i = 0; i < n; i++) { if (s.charAt(i) != t.charAt(i)) return s.substring(0, i); } return s.substring(0, n); } // return the longest repeated string in s public static String lrs(String s) { // form the N suffixes int N = s.length(); String[] suffixes = new String[N]; for (int i = 0; i < N; i++) { suffixes[i] = s.substring(i, N); } // sort them Arrays.sort(suffixes); // find longest repeated substring by comparing adjacent sorted suffixes String lrs = ""; for (int i = 0; i < N - 1; i++) { String x = lcp(suffixes[i], suffixes[i+1]); if (x.length() > lrs.length()) lrs = x; } return lrs; } // read in text, replacing all consecutive whitespace with a single space // then compute longest repeated substring public static void main(String[] args) { String s = "banana"; s = s.replaceAll("\\s+", " "); System.out.println("'" + lrs(s) + "'"); } }
Questions on Stack, Queues, Linkedlist, Binary Trees, Sorting, Searching, Graphs etc with solution using Java Language.
Wednesday, 4 March 2015
Longest Repeated Substring
Labels:
miscellaneous
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment