This blog will soon be merged with

JavaByPatel

which explains each solution in detail, to visit new blog, click JavaByPatel

Wednesday, 4 March 2015

Longest Repeated Substring


package miscellaneous.longestCommonSubstring;

/*
 * What is Suffix tree
 * 
 * Lets take an example, str = "BANANA"
 * If told to find substring in "BANANA" then answer is "ANA" as it is repeated twice.
 * How to find that "ANA" is longest repeated in "BANANA".
 * There can be many solutions, but using suffix array it is easy to solve. 
 * 
 * Add end Character say $ to "BANANA", so now our string becomes "BANANA$"
 * Now divide the possible substrings either from start or from end, here lets consider from end.
 * 
 * BANANA$ with index as 1234567 ie B is at index 1, A is at index 2 till $ is at index 7.
 * 0. $   from BANANA($) -->7
 * 1. A$  from BANAN(A$) -->6
 * 2. NA$  from BANA(NA$) -->5
 * 3. ANA$  from BAN(ANA$) -->4
 * 4. NANA$  from BA(NANA$) -->3
 * 5. ANANA$ from B(ANANA$) -->2
 * 6. BANANA$ from (BANANA$) -->1
 * 
 * Now initially our tree is empty.
 * 1. First comes $
 *      .
 *         |
 *         $ (7) ---> 7 because $ is at index 7   
 * 
 * 2. Now comes A$
 *         . __$ (7)
 *         |
 *         A$ (6) ---> 6 because A starts at index 6
 * 
 * 4. Now came NA$, check here is there any path available starting with 'N' as NA$ starts with 'N', answer is NO, so we have to start new path for it.
 * 
 *           . __$(7)
 *          / \
 *    (6) A$   NA$ (5)
 * 
 * 5. Now came ANA$, check here is there any path available starting with 'A' as ANA$ starts with 'A', answer is YES i.e A$ exist, 
 *    so we have to check, 2nd character of ANA$ and A$ as first character of both matched which is N and $ respectively. 
 *    it is not matching, so in ANA$ and A$ we have only one character ie 'A' matching, keeping common string and stream uncommon paths.
 * 
 *             . __$(7)
 *            / \
 *           A   NA$ (5)
 *         /  \
 *    (4) NA$  $(6) 
 *    
 * 6. Similarly, now comes NANA$, yes we have path existing which start with 'N' i.e "NA$", also 2nd character is matching 'A' from NANA$ and NA$. but 3rd character 
 *    is not matching i.e $ in NA$ and N in NANA$, So we have to break from that point
 * 
 *                 . ____________$(7)
 *            /          \
 *           A            NA 
 *          / \         /     \
 *  (4)  NA$   $(6)   NA$(3)  $(5)
 * 
 * 7. Similarly, check for ANANA$ and BANANA$ (for which separate link is required as no path starts from B)
 * 
 *                . ____________$(7)
 *         /      |            \  
 *        A      BANANA$        NA
 *      /  \       (1)          /  \
 *     NA   $ (6)          (3)NA$  $(5)
 *         /  \ 
 *  (2) NA$    $(4)
 *   
 *  
 *   We can see in figure that, longest repeated substring will end at the internal node which is farthest from the root (i.e. deepest node in the tree), 
 *   because length of substring is the path label length from root to that internal node
 *   Finding the deepest internal node in the tree. Depth is measured by the number of characters traversed from the root.
 *   ie ANA in our case.
 *   
 *   How it happens is,
 *   If w is a repeated substring of Suffix Tree T, it must be a prefix of at least two different suffixes.
 *   Because the prefix is the same each time, all those suffixes will be in the same subtree.
 *   
 *   
 *   Same result can be achieved using suffix arrays.
 *   
 *   banana
 *   -----------
 *   
 *   banana
 *    anana
 *     nana
 *      ana
 *       na
 *        a
 *        
 *   Now sort it,
 *   a
 *   ana
 *   anana
 *   banana
 *   na
 *   nana
 *   
 *        
 */  
import java.util.Arrays;

public class LongestRepeatedSubstring {

 // return the longest common prefix of s and t
 public static String lcp(String s, String t) {
  int n = Math.min(s.length(), t.length());
  for (int i = 0; i < n; i++) {
   if (s.charAt(i) != t.charAt(i))
    return s.substring(0, i);
  }
  return s.substring(0, n);
 }


 // return the longest repeated string in s
 public static String lrs(String s) {

  // form the N suffixes
  int N  = s.length();
  String[] suffixes = new String[N];
  for (int i = 0; i < N; i++) {
   suffixes[i] = s.substring(i, N);
  }

  // sort them
  Arrays.sort(suffixes);

  // find longest repeated substring by comparing adjacent sorted suffixes
  String lrs = "";
  for (int i = 0; i < N - 1; i++) {
   String x = lcp(suffixes[i], suffixes[i+1]);
   if (x.length() > lrs.length())
    lrs = x;
  }
  return lrs;
 }

 // read in text, replacing all consecutive whitespace with a single space
 // then compute longest repeated substring
 public static void main(String[] args) {
  String s = "banana";
  s = s.replaceAll("\\s+", " ");
  System.out.println("'" + lrs(s) + "'");
 }
}


No comments:

Post a Comment