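【Question title】: Finding the longest repeated substring
【Posted】: 2012-05-08 11:10:33
【Question】:

What is the best approach (performance-wise) to solve this problem? I was recommended to use suffix trees. Is that the best approach?

【Comments】:

Tags: algorithm pattern-recognition suffix-tree suffix-array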


【Answer 1】:

Take a look at this link: http://introcs.cs.princeton.edu/java/42sort/LRS.java.html

/*************************************************************************
 *  Compilation:  javac LRS.java
 *  Execution:    java LRS < file.txt
 *  Dependencies: StdIn.java StdOut.java
 *  
 *  Reads a text corpus from stdin, replaces all consecutive blocks of
 *  whitespace with a single space, and then computes the longest
 *  repeated substring in that corpus. Suffix sorts the corpus using
 *  the system sort, then finds the longest repeated substring among 
 *  consecutive suffixes in the sorted order.
 * 
 *  % java LRS < mobydick.txt
 *  ',- Such a funny, sporty, gamy, jesty, joky, hoky-poky lad, is the Ocean, oh! Th'
 * 
 *  % java LRS 
 *  aaaaaaaaa
 *  'aaaaaaaa'
 *
 *  % java LRS
 *  abcdefg
 *  ''
 *
 *************************************************************************/


import java.util.Arrays;

public class LRS {

    // return the longest common prefix of s and t
    public static String lcp(String s, String t) {
        int n = Math.min(s.length(), t.length());
        for (int i = 0; i < n; i++) {
            if (s.charAt(i) != t.charAt(i))
                return s.substring(0, i);
        }
        return s.substring(0, n);
    }


    // return the longest repeated string in s
    public static String lrs(String s) {

        // form the N suffixes
        int N  = s.length();
        String[] suffixes = new String[N];
        for (int i = 0; i < N; i++) {
            suffixes[i] = s.substring(i, N);
        }

        // sort them
        Arrays.sort(suffixes);

        // find longest repeated substring by comparing adjacent sorted suffixes
        String lrs = "";
        for (int i = 0; i < N - 1; i++) {
            String x = lcp(suffixes[i], suffixes[i+1]);
            if (x.length() > lrs.length())
                lrs = x;
        }
        return lrs;
    }



    // read in text, replacing all consecutive whitespace with a single space
    // then compute longest repeated substring
    public static void main(String[] args) {
        String s = StdIn.readAll();
        s = s.replaceAll("\\s+", " ");
        StdOut.println("'" + lrs(s) + "'");
    }
}

【Comments】:

  • @paramvir, this is not O(n^2), it is O(n^2 log(n))! He is sorting strings, which take O(n) time to compare, and quicksort is O(n log(n)), so overall we are looking at O(n^2 log(n)).
【Answer 2】:

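Also have a look at http://en.wikipedia.org/wiki/Suffix_array - suffix arrays are quite space-efficient, and there are some reasonably programmable algorithms for building them, such as "Simple Linear Work Suffix Array Construction" by Karkkainen and Sanders.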
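
To give a rough feel for the idea (this is not the Karkkainen-Sanders algorithm, which is considerably more involved), a suffix array can be built naively by sorting suffix start indices instead of suffix strings. The sketch below does that and then scans adjacent entries for the longest common prefix. The class and method names are made up for this example, and the comparison sort is still O(n^2 log(n)) in the worst case; the only gain over sorting substrings is that no suffix is copied.

```java
import java.util.Arrays;

// Illustrative sketch only: a naive suffix array built by sorting start indices.
final class NaiveSuffixArrayLrs {

    // Returns the start indices of all suffixes of s in lexicographic order.
    static Integer[] suffixArray(String s) {
        int n = s.length();
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> compareSuffixes(s, a, b));
        return sa;
    }

    // Compares the suffixes starting at a and b character by character, without copying.
    static int compareSuffixes(String s, int a, int b) {
        int n = s.length();
        while (a < n && b < n) {
            char ca = s.charAt(a), cb = s.charAt(b);
            if (ca != cb) return Character.compare(ca, cb);
            a++;
            b++;
        }
        // One suffix is a prefix of the other: the shorter one sorts first.
        return Integer.compare(n - a, n - b);
    }

    // Length of the common prefix of the suffixes starting at a and b.
    static int lcpLength(String s, int a, int b) {
        int len = 0;
        while (a + len < s.length() && b + len < s.length()
                && s.charAt(a + len) == s.charAt(b + len)) {
            len++;
        }
        return len;
    }

    // Longest repeated substring: the longest common prefix of two adjacent sorted suffixes.
    static String lrs(String s) {
        Integer[] sa = suffixArray(s);
        int bestStart = 0, bestLen = 0;
        for (int i = 0; i + 1 < sa.length; i++) {
            int len = lcpLength(s, sa[i], sa[i + 1]);
            if (len > bestLen) {
                bestLen = len;
                bestStart = sa[i];
            }
        }
        return s.substring(bestStart, bestStart + bestLen);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(suffixArray("banana")));
        System.out.println(lrs("banana"));
    }
}
```

Only the array of indices is stored, which is where the space saving over materialised suffix strings comes from.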

【Comments】:

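    【Answer 3】:

    Here is a simple implementation of longest repeated substring using the simplest form of suffix tree (strictly speaking an uncompressed suffix trie, since every character gets its own node). Suffix trees are easy to implement this way.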

    #include <iostream>
    #include <vector>
    #include <unordered_map>
    #include <string>
    using namespace std;
    
    class Node
    {
    public:
        char ch;
        unordered_map<char, Node*> children;
        vector<int> indexes; //store the indexes of the substring from where it starts
        Node(char c):ch(c){}
    };
    
    int maxLen = 0;
    string maxStr = "";
    
    void insertInSuffixTree(Node* root, string str, int index, string originalSuffix, int level=0)
    {
        root->indexes.push_back(index);
    
        // it is repeated and length is greater than maxLen
        // then store the substring
        if(root->indexes.size() > 1 && maxLen < level)
        {
            maxLen = level;
            maxStr = originalSuffix.substr(0, level);
        }
    
        if(str.empty()) return;
    
        Node* child;
        if(root->children.count(str[0]) == 0) {
            child = new Node(str[0]);
            root->children[str[0]] = child;
        } else {
            child = root->children[str[0]];
        }
    
        insertInSuffixTree(child, str.substr(1), index, originalSuffix, level+1);
    }
    
    int main()
    {
        string str = "banana"; //"abcabcaacb"; //"banana";  //"mississippi";
        Node* root = new  Node('@');
    
        //insert every suffix of str into the tree
        for(int i=0; i<str.size(); i++){
            string s = str.substr(i);
            insertInSuffixTree(root, s, i, s);
        }
    
        cout << maxLen << "->" << maxStr << endl;
    
        return 0;
    }
    
    /*
    s = "mississippi", return "issi"
    s = "banana", return "ana"
    s = "abcabcaacb", return "abca"
    s = "aababa", return "aba"
    */
    

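    【Comments】:

      【Answer 4】:

      The LRS problem is best solved with either a suffix tree or a suffix array. Both approaches have a best time complexity of O(n).

      Here is an O(n log(n)) solution to the LRS problem using a suffix array. My solution can be improved to O(n) if you have a linear-time suffix array construction algorithm (which is quite hard to implement). The code is taken from my library. If you want to learn more about how suffix arrays work, make sure to check out my tutorials.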

      /**
       * Finds the longest repeated substring(s) of a string.
       * 
       * Time complexity: O(nlogn), bounded by suffix array construction
       *
       * @author William Fiset, william.alexandre.fiset@gmail.com
       **/
      
      import java.util.*;
      
      public class LongestRepeatedSubstring {
      
        // Example usage
        public static void main(String[] args) {
      
          String str = "ABC$BCA$CAB";
          SuffixArray sa = new SuffixArray(str);
          System.out.printf("LRS(s) of %s is/are: %s\n", str, sa.lrs());
      
          str = "aaaaa";
          sa = new SuffixArray(str);
          System.out.printf("LRS(s) of %s is/are: %s\n", str, sa.lrs());
      
          str = "abcde";
          sa = new SuffixArray(str);
          System.out.printf("LRS(s) of %s is/are: %s\n", str, sa.lrs());
      
        }
      
      }
      
      class SuffixArray {
      
        // ALPHABET_SZ is the default alphabet size, this may need to be much larger
        int ALPHABET_SZ = 256, N;
        int[] T, lcp, sa, sa2, rank, tmp, c;
      
        public SuffixArray(String str) {    
          this(toIntArray(str));    
        }
      
        private static int[] toIntArray(String s) {   
          int[] text = new int[s.length()];   
          for(int i=0;i<s.length();i++)text[i] = s.charAt(i);   
          return text;    
        }
      
        // Designated constructor
        public SuffixArray(int[] text) {
          T = text;
          N = text.length;
          sa = new int[N];
          sa2 = new int[N];
          rank = new int[N];
          c = new int[Math.max(ALPHABET_SZ, N)];
          construct();
          kasai();
        }
      
        private void construct() {
          int i, p, r;
          for (i=0; i<N; ++i) c[rank[i] = T[i]]++;
          for (i=1; i<ALPHABET_SZ; ++i) c[i] += c[i-1];
          for (i=N-1; i>=0; --i) sa[--c[T[i]]] = i;
          for (p=1; p<N; p <<= 1) {
            for (r=0, i=N-p; i<N; ++i) sa2[r++] = i;
            for (i=0; i<N; ++i) if (sa[i] >= p) sa2[r++] = sa[i] - p;
            Arrays.fill(c, 0, ALPHABET_SZ, 0);
            for (i=0; i<N; ++i) c[rank[i]]++;
            for (i=1; i<ALPHABET_SZ; ++i) c[i] += c[i-1];
            for (i=N-1; i>=0; --i) sa[--c[rank[sa2[i]]]] = sa2[i];
            for (sa2[sa[0]] = r = 0, i=1; i<N; ++i) {
                if (!(rank[sa[i-1]] == rank[sa[i]] &&
                    sa[i-1]+p < N && sa[i]+p < N &&
                    rank[sa[i-1]+p] == rank[sa[i]+p])) r++;
                sa2[sa[i]] = r;
            }
            tmp = rank; rank = sa2; sa2 = tmp;
            if (r == N-1) break;
            ALPHABET_SZ = r + 1;
          }
        }
      
        // Use Kasai algorithm to build LCP array
        private void kasai() {
          lcp = new int[N];
          int [] inv = new int[N];
          for (int i = 0; i < N; i++) inv[sa[i]] = i;
          for (int i = 0, len = 0; i < N; i++) {
            if (inv[i] > 0) {
              int k = sa[inv[i]-1];
              while( (i + len < N) && (k + len < N) && T[i+len] == T[k+len] ) len++;
              lcp[inv[i]-1] = len;
              if (len > 0) len--;
            }
          }
        }
      
        // Finds the LRS(s) (Longest Repeated Substring) that occurs in a string.
        // Traditionally we are only interested in substrings that appear at
        // least twice, so this method returns an empty set if this is not the case.
        // @return an ordered set of longest repeated substrings
        public TreeSet <String> lrs() {
      
          int max_len = 0;
          TreeSet <String> lrss = new TreeSet<>();
      
          for (int i = 0; i < N; i++) {
            if (lcp[i] > 0 && lcp[i] >= max_len) {
      
              // We found a longer LRS
              if ( lcp[i] > max_len )
                lrss.clear();
      
              // Append substring to the list and update max
              max_len = lcp[i];
              lrss.add( new String(T, sa[i], max_len) );
      
            }
          }
      
          return lrss;
      
        }
      
        public void display() {
          System.out.printf("-----i-----SA-----LCP---Suffix\n");
          for(int i = 0; i < N; i++) {
            int suffixLen = N - sa[i];
            String suffix = new String(T, sa[i], suffixLen);
            System.out.printf("% 7d % 7d % 7d %s\n", i, sa[i],lcp[i], suffix );
          }
        }
      
      }
      

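      【Comments】:

      • Nice. The video is helpful too! To clarify, is this O(n log n) time complexity because building the suffix array requires sorting?
      • Also, what is the worst-case time complexity of the suffix-tree-based solution (ignoring construction of the suffix tree)?
      【Answer 5】: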
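      // Note: as written, this tracks runs of a single repeated character
      // (e.g. "ppppp") and returns the longest such run; it does not search
      // for repeated multi-character substrings in general.
      public class LongestSubString {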
      
          public static void main(String[] args) {
              String s = findMaxRepeatedString("ssssssssssss this is a ddddddd word with iiiiiiiiiis and loads of these are ppppppppppppps");
              System.out.println(s);
          }
      
          private static String findMaxRepeatedString(String s) {
              Processor p = new Processor();
              char[] c = s.toCharArray();
              for (char ch : c) {
                  p.process(ch);
              } 
              System.out.println(p.bigger());
              return new String(new char[p.bigger().count]).replace('\0', p.bigger().letter);
          }
      
          static class  CharSet {
              int count;
              Character letter;
              boolean isLastPush;
      
              boolean assign(char c) {
                  if (letter == null) {
                      count++;
                      letter = c;
                      isLastPush = true;
                      return true;
                  }
                  return false;
              }
      
              void reassign(char c) {
                  count = 1;
                  letter = c;
                  isLastPush = true;
              }
      
              boolean push(char c) {
                  if (isLastPush && letter == c) {
                      count++;
                      return true;
                  }
                  return false;
              }
      
              @Override
              public String toString() {
                  return "CharSet [count=" + count + ", letter=" + letter + "]";
              }
      
          }
      
          static class  Processor {
      
              Character previousLetter = null;
              CharSet set1 = new CharSet();
              CharSet set2 = new CharSet();
      
              void process(char c) {
                  if ((set1.assign(c)) || set1.push(c)) {
                      set2.isLastPush = false;
                  } else if ((set2.assign(c)) || set2.push(c)) {
                      set1.isLastPush = false;                
                  } else {
                      set1.isLastPush = set2.isLastPush = false;
                      smaller().reassign(c);
                  }
              }       
      
              CharSet smaller() {
                  return set1.count < set2.count ? set1 : set2;
              }
      
              CharSet bigger() {
                  return set1.count < set2.count ? set2 : set1;
              }
      
          }   
      }
      

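      【Comments】:

      • Could you elaborate on why this is the best approach?
      【Answer 6】:

      I had an interview in which I needed to solve this problem. This is my solution: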

      import java.util.Hashtable;

      public class FindLargestSubstring {
      
      public static void main(String[] args) {
          String test = "ATCGATCGA";
          System.out.println(hasRepeatedSubString(test));
      }
      
      // Tries candidate lengths from longest to shortest and returns the first
      // substring that is seen twice (the two occurrences may overlap).
      private static String hasRepeatedSubString(String string) {
          Hashtable<String, Integer> hashtable = new Hashtable<>();
          int length = string.length();
          // subLength >= 1 so that a single repeated character is also found
          for (int subLength = length - 1; subLength >= 1; subLength--) {
              for (int i = 0; i <= length - subLength; i++) {
                  String sub = string.substring(i, subLength + i);
                  if (hashtable.containsKey(sub)) {
                      return sub;
                  } else {
                      hashtable.put(sub, subLength);
                  }
              }
          }
          return "No repeated substring!";
      }
      }
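
      The loops above may examine every substring of every length. One possible refinement, which is not part of the original answer, relies on the observation that if some substring of length L occurs twice, then so does one of every shorter length, so the length can be binary-searched. The sketch below is a hedged illustration of that idea; `BinarySearchLrs` and its method names are invented for this example, and each probe still hashes full substrings, so it is not a linear-time method.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: binary search on the answer length plus a per-length hash set.
final class BinarySearchLrs {

    // Returns some substring of the given length that occurs at least twice
    // (occurrences may overlap), or null if there is none.
    static String repeatedOfLength(String s, int length) {
        Set<String> seen = new HashSet<>();
        for (int i = 0; i + length <= s.length(); i++) {
            String sub = s.substring(i, i + length);
            if (!seen.add(sub)) {
                return sub; // add() returns false when sub was already present
            }
        }
        return null;
    }

    // Returns a longest repeated substring of s, or "" if no character repeats.
    static String lrs(String s) {
        String best = "";
        int lo = 1, hi = s.length() - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            String found = repeatedOfLength(s, mid);
            if (found != null) {
                best = found;   // a repeat of length mid exists, try longer
                lo = mid + 1;
            } else {
                hi = mid - 1;   // no repeat of length mid, so none longer either
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(lrs("ATCGATCGA"));
    }
}
```

      Only one length's worth of substrings is held in memory at a time, and the number of lengths probed drops from n to about log(n).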
      

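      【Comments】:

        【Answer 7】:

        There are too many things that affect performance for us to answer this question with only the information you have given. (Operating system, language, memory issues, the code itself.)

        If you are just looking for a mathematical analysis of the algorithm's efficiency, you probably want to change the question.

        EDIT

        When I mentioned "memory issues" and "the code", I did not provide all the details. The length of the strings you will be analyzing is a big factor. Also, the code does not run in isolation; it has to sit inside a program to be useful. Which characteristics of that program affect how this algorithm is used and how it performs?

        Basically, you cannot performance-tune until you have a real situation to test. You can make very educated guesses about what is likely to perform best, but until you have real data and real code, you will never be certain.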
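
        A minimal sketch of what such a measurement might look like, shown in Java to match the other code on this page even though the asker works in C++. It assumes that simple wall-clock timing with `System.nanoTime` is good enough for a first look; a dedicated benchmarking harness would be more trustworthy. The 5000-character input matches the size mentioned in the comments, while the fixed seed, the four-letter alphabet and the naive `lrs` stand-in are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative timing sketch; all sizes and names here are assumptions.
final class LrsTiming {

    // Stand-in implementation under test: sort all suffixes, compare neighbours.
    static String lrs(String s) {
        int n = s.length();
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++) suffixes[i] = s.substring(i);
        Arrays.sort(suffixes);
        String best = "";
        for (int i = 0; i + 1 < n; i++) {
            String a = suffixes[i], b = suffixes[i + 1];
            int len = 0, max = Math.min(a.length(), b.length());
            while (len < max && a.charAt(len) == b.charAt(len)) len++;
            if (len > best.length()) best = a.substring(0, len);
        }
        return best;
    }

    // Builds a reproducible pseudo-random string over the given alphabet.
    static String randomText(int length, String alphabet, long seed) {
        Random rnd = new Random(seed);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(alphabet.charAt(rnd.nextInt(alphabet.length())));
        }
        return sb.toString();
    }

    // Runs lrs several times and returns the fastest observed time in nanoseconds.
    static long bestOfNanos(String text, int runs) {
        long best = Long.MAX_VALUE;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            String result = lrs(text);
            long elapsed = System.nanoTime() - start;
            if (result == null) throw new IllegalStateException(); // keeps the result in use
            best = Math.min(best, elapsed);
        }
        return best;
    }

    public static void main(String[] args) {
        String text = randomText(5000, "ACGT", 42L);
        System.out.printf("best of 10 runs: %.3f ms%n", bestOfNanos(text, 10) / 1e6);
    }
}
```

        Swapping a different implementation into `lrs` and feeding it real input instead of `randomText` is what turns a guess into a measurement.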

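        【Comments】:

        • I am writing in C++ on Windows 7 with 4 GB of RAM.
        • Thanks for the clarification. The maximum string length is 5000 characters, and this will be a worker thread that just reads the string and writes the result, so you can assume there is no other code in the program.
        • Why was this downvoted? Everyone should know that premature optimization is a waste of time. We can pick algorithms that are generally a good idea, but we cannot pick the "best" one without measuring a specific scenario.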