【问题标题】:Efficient string truncation algorithm, sequentially removing equal prefixes and suffixes高效的字符串截断算法,顺序去除相等的前缀和后缀
【发布时间】:2020-04-28 10:26:37
【问题描述】:

每次测试的时间限制:5 秒
每次测试的内存限制:512 MB

给你一个长度为n (n ≤ 5000) 的字符串s。你可以选择 此字符串的任何适当前缀也是其后缀并删除 选择的前缀或相应的后缀。然后你可以申请一个 对结果字符串的类似操作等等。是什么 最终字符串的最小长度,可以在之后实现 应用此类操作的最佳顺序?

输入
每个测试的第一行包含一个由小英文字母组成的字符串s

输出
输出一个整数——最终字符串的最小长度,在应用最优序列后可以达到 这样的操作。

示例 +-------+--------+----------------------------------+ | Input | Output | Explanation | +-------+--------+----------------------------------+ | caaca | 2 | caaca → ca|aca → aca → ac|a → ac | +-------+--------+----------------------------------+ | aabaa | 2 | aaba|a → a|aba → ab|a → ab | +-------+--------+----------------------------------+ | abc | 3 | No operations are possible | +-------+--------+----------------------------------+

这是我到目前为止所做的事情:

  1. O(n^2)

  2. 中计算给定字符串的所有子字符串的prefix function
  3. 检查在O(n^3)

  4. 中执行所有可能的操作组合的结果

我的解决方案在 n ≤ 2000 时通过了所有测试,但在 2000 n ≤ 5000 时超过了时间限制。这是它的代码:

#include <iostream>
#include <string>

using namespace std;

const int MAX_N = 5000;

int result; // 1 less than actual

// [x][y] corresponds to substring that starts at position `x` and ends at position `x + y` =>
// => corresponding substring length is `y + 1`
int lps[MAX_N][MAX_N]; // prefix function for the substring s[x..x+y]
bool checked[MAX_N][MAX_N]; // whether substring s[x..x+y] is processed by check function

// length is 1 less than actual
void check(int start, int length) {
    checked[start][length] = true;
    if (length < result) {
        if (length == 0) {
            cout << 1; // actual length = length + 1 = 0 + 1 = 1
            exit(0); // 1 is the minimum possible result
        }
        result = length;
    }
    // iteration over all proper prefixes that are also suffixes
    // i - current prefix length
    for (int i = lps[start][length]; i != 0; i = lps[start][i - 1]) {
        int newLength = length - i;
        int newStart = start + i;
        if (!checked[start][newLength])
            check(start, newLength);
        if (!checked[newStart][newLength])
            check(newStart, newLength);
    }
}

int main()
{
    string str;
    cin >> str;
    int n = str.length();
    // lps calculation runs in O(n^2)
    for (int l = 0; l < n; l++) {
        int subLength = n - l;
        lps[l][0] = 0;
        checked[l][0] = false;
        for (int i = 1; i < subLength; ++i) {
            int j = lps[l][i - 1];
            while (j > 0 && str[i + l] != str[j + l])
                j = lps[l][j - 1];
            if (str[i + l] == str[j + l])  j++;
            lps[l][i] = j;
            checked[l][i] = false;
        }
    }
    result = n - 1;
    // checking all possible operations combinations in O(n^3)
    check(0, n - 1);
    cout << result + 1;
}

问:有没有更有效的解决方案?

【问题讨论】:

  • 我认为代码审查堆栈交换会更好。无论如何,这个问题很好,很明确。
  • @ruohola 谢谢。我不是在寻找代码审查,而是在寻找更好的算法。
  • 顺便说一句,您确定 250 万个整数元素数组适合您的堆栈吗?
  • @ruohola 该数组位于文件范围内,因此它没有放在堆栈上,而是放在二进制文件的单独部分中。但是,是的,声明这样一个巨大的二维数组并不是一个好主意。一个小的向量对于缓存局部性会更好
  • 这是测试生成器超时:ideone.com/pDhxS6 这是 3.54 秒,420 MB:ideone.com/EIrhnR

标签: c++ string algorithm performance optimization


【解决方案1】:

这是获取日志因子的一种方法。如果我们可以到达子字符串s[i..j],让dp[i][j] 为真。那么:

dp[0][length(s)-1] ->
  true

dp[0][j] ->
  if s[0] != s[j+1]:
    false
  else:
    true if any dp[0][k]
      for j < k ≤ (j + longestMatchRight[0][j+1])

  (The longest match we can use is
   also bound by the current range.)

(Initialise left side similarly.)

现在从外部迭代:

for i = 1 to length(s)-2:
  for j = length(s)-2 to i:
    dp[i][j] ->
      // We removed on the right
      if s[i] != s[j+1]:
        false
      else:
        true if any dp[i][k]
          for j < k ≤ (j + longestMatchRight[i][j+1])

      // We removed on the left
      if s[i-1] != s[j]:
        true if dp[i][j]
      else:
        true if any dp[k][j]
          for (i - longestMatchLeft[i-1][j]) ≤ k < i

我们可以预先计算 O(n^2) 中每个起始对 (i, j) 的最长匹配和递归,

longest(i, j) -> 
  if s[i] == s[j]:
    return 1 + longest(i + 1, j + 1)
  else:
    return 0

这将允许我们检查从O(1) 中的索引ij 开始的子字符串匹配。 (我们需要左右方向。)

如何获取日志因子

我们可以想出一种方法来提出一种数据结构,使我们能够确定是否

any dp[i][k]
  for j < k ≤ (j + longestMatchRight[i][j+1])

(And similarly for the left side.)

O(log n) 中,考虑到我们已经看到了这些值。

这是带有段树的 C++ 代码(用于左右查询,所以 O(n^2 * log n)),其中包括 Bananon 的测试生成器。对于 5000 个“a”字符,它在 3.54 秒内运行,420 MB (https://ideone.com/EIrhnR)。为了减少内存,其中一个段树在单行上实现(我仍然需要研究对左侧查询执行相同操作以进一步减少内存。)

#include <iostream>
#include <string>
#include <ctime>
#include <random>
#include <algorithm>    // std::min

using namespace std;

const int MAX_N = 5000;

int seg[2 * MAX_N];
int segsL[MAX_N][2 * MAX_N];
int m[MAX_N][MAX_N][2];
int dp[MAX_N][MAX_N];
int best;

// Adapted from https://codeforces.com/blog/entry/18051
void update(int n, int p, int value) { // set value at position p
  for (seg[p += n] = value; p > 1; p >>= 1)
    seg[p >> 1] = seg[p] + seg[p ^ 1];
}
// Adapted from https://codeforces.com/blog/entry/18051
int query(int n, int l, int r) { // sum on interval [l, r)
  int res = 0;
  for (l += n, r += n; l < r; l >>= 1, r >>= 1) {
    if (l & 1) res += seg[l++];
    if (r & 1) res += seg[--r];
  }
  return res;
}
// Adapted from https://codeforces.com/blog/entry/18051
void updateL(int n, int i, int p, int value) { // set value at position p
  for (segsL[i][p += n] = value; p > 1; p >>= 1)
    segsL[i][p >> 1] = segsL[i][p] + segsL[i][p ^ 1];
}
// Adapted from https://codeforces.com/blog/entry/18051
int queryL(int n, int i, int l, int r) { // sum on interval [l, r)
  int res = 0;
  for (l += n, r += n; l < r; l >>= 1, r >>= 1) {
    if (l & 1) res += segsL[i][l++];
    if (r & 1) res += segsL[i][--r];
  }
  return res;
}

// Code by גלעד ברקן
void precalc(int n, string & s) {
  int i, j;
  for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
      // [longest match left, longest match right]
      m[i][j][0] = (s[i] == s[j]) & 1;
      m[i][j][1] = (s[i] == s[j]) & 1;
    }
  }

  for (i = n - 2; i >= 0; i--)
    for (j = n - 2; j >= 0; j--)
      m[i][j][1] = s[i] == s[j] ? 1 + m[i + 1][j + 1][1] : 0;

  for (i = 1; i < n; i++)
    for (j = 1; j < n; j++)
      m[i][j][0] = s[i] == s[j] ? 1 + m[i - 1][j - 1][0] : 0;
}

// Code by גלעד ברקן
void f(int n, string & s) {
  int i, j, k, longest;

  dp[0][n - 1] = 1;
  update(n, n - 1, 1);
  updateL(n, n - 1, 0, 1);

  // Right side initialisation
  for (j = n - 2; j >= 0; j--) {
    if (s[0] == s[j + 1]) {
      longest = std::min(j + 1, m[0][j + 1][1]);
      for (k = j + 1; k <= j + longest; k++)
        dp[0][j] |= dp[0][k];
      if (dp[0][j]) {
        update(n, j, 1);
        updateL(n, j, 0, 1);
        best = std::min(best, j + 1);
      }
    }
  }

  // Left side initialisation
  for (i = 1; i < n; i++) {
    if (s[i - 1] == s[n - 1]) {
      // We are bound by the current range
      longest = std::min(n - i, m[i - 1][n - 1][0]);
      for (k = i - 1; k >= i - longest; k--)
        dp[i][n - 1] |= dp[k][n - 1];
      if (dp[i][n - 1]) {
        updateL(n, n - 1, i, 1);
        best = std::min(best, n - i);
      }
    }
  }

  for (i = 1; i <= n - 2; i++) {
    for (int ii = 0; ii < MAX_N; ii++) {
      seg[ii * 2] = 0;
      seg[ii * 2 + 1] = 0;
    }
    update(n, n - 1, dp[i][n - 1]);
    for (j = n - 2; j >= i; j--) {
      // We removed on the right
      if (s[i] == s[j + 1]) {
        // We are bound by half the current range
        longest = std::min(j - i + 1, m[i][j + 1][1]);
        //for (k=j+1; k<=j+longest; k++)
        //dp[i][j] |= dp[i][k];
        if (query(n, j + 1, j + longest + 1)) {
          dp[i][j] = 1;
          update(n, j, 1);
          updateL(n, j, i, 1);
        }
      }
      // We removed on the left
      if (s[i - 1] == s[j]) {
        // We are bound by half the current range
        longest = std::min(j - i + 1, m[i - 1][j][0]);
        //for (k=i-1; k>=i-longest; k--)
        //dp[i][j] |= dp[k][j];
        if (queryL(n, j, i - longest, i)) {
          dp[i][j] = 1;
          updateL(n, j, i, 1);
          update(n, j, 1);
        }
      }
      if (dp[i][j])
        best = std::min(best, j - i + 1);
    }
  }
}

int so(string s) {
  for (int i = 0; i < MAX_N; i++) {
    seg[i * 2] = 0;
    seg[i * 2 + 1] = 0;
    for (int j = 0; j < MAX_N; j++) {
      segsL[i][j * 2] = 0;
      segsL[i][j * 2 + 1] = 0;
      m[i][j][0] = 0;
      m[i][j][1] = 0;
      dp[i][j] = 0;
    }
  }
  int n = s.length();
  best = n;
  precalc(n, s);
  f(n, s);
  return best;
}
// End code by גלעד ברקן

// Code by Bananon  =======================================================================

int result;

int lps[MAX_N][MAX_N];
bool checked[MAX_N][MAX_N];

void check(int start, int length) {
  checked[start][length] = true;
  if (length < result) {
    result = length;
  }
  for (int i = lps[start][length]; i != 0; i = lps[start][i - 1]) {
    int newLength = length - i;
    if (!checked[start][newLength])
      check(start, newLength);
    int newStart = start + i;
    if (!checked[newStart][newLength])
      check(newStart, newLength);
  }
}

int my(string str) {
  int n = str.length();
  for (int l = 0; l < n; l++) {
    int subLength = n - l;
    lps[l][0] = 0;
    checked[l][0] = false;
    for (int i = 1; i < subLength; ++i) {
      int j = lps[l][i - 1];
      while (j > 0 && str[i + l] != str[j + l])
        j = lps[l][j - 1];
      if (str[i + l] == str[j + l]) j++;
      lps[l][i] = j;
      checked[l][i] = false;
    }
  }
  result = n - 1;
  check(0, n - 1);
  return result + 1;
}

// generate =================================================================

bool rndBool() {
  return rand() % 2 == 0;
}

int rnd(int bound) {
  return rand() % bound;
}

void untrim(string & str) {
  int length = rnd(str.length());
  int prefixLength = rnd(str.length()) + 1;
  if (rndBool())
    str.append(str.substr(0, prefixLength));
  else {
    string newStr = str.substr(str.length() - prefixLength, prefixLength);
    newStr.append(str);
    str = newStr;
  }
}

void rndTest(int minTestLength, string s) {
  while (s.length() < minTestLength)
    untrim(s);
  int myAns = my(s);
  int soAns = so(s);
  cout << myAns << " " << soAns << '\n';
  if (soAns != myAns) {
    cout << s;
    exit(0);
  }
}

int main() {
  int minTestLength;
  cin >> minTestLength;
  string seed;
  cin >> seed;
  while (true)
    rndTest(minTestLength, seed);
}

这里的 JavaScript 代码(没有日志因子改进)表明重复有效。 (为了获得日志因子,我们将内部 k 循环替换为单个范围查询。)

debug = 1

function precalc(s){
  let m = new Array(s.length)
  for (let i=0; i<s.length; i++){
    m[i] = new Array(s.length)
    for (let j=0; j<s.length; j++){
      // [longest match left, longest match right]
      m[i][j] = [(s[i] == s[j]) & 1, (s[i] == s[j]) & 1]
    }
  }
  
  for (let i=s.length-2; i>=0; i--)
    for (let j=s.length-2; j>=0; j--)
      m[i][j][1] = s[i] == s[j] ? 1 + m[i+1][j+1][1] : 0

  for (let i=1; i<s.length; i++)
    for (let j=1; j<s.length; j++)
      m[i][j][0] = s[i] == s[j] ? 1 + m[i-1][j-1][0] : 0
  
  return m
}

function f(s){
  m = precalc(s)
  let n = s.length
  let min = s.length
  let dp = new Array(s.length)

  for (let i=0; i<s.length; i++)
    dp[i] = new Array(s.length).fill(0)

  dp[0][s.length-1] = 1
      
  // Right side initialisation
  for (let j=s.length-2; j>=0; j--){
    if (s[0] == s[j+1]){
      let longest = Math.min(j + 1, m[0][j+1][1])
      for (let k=j+1; k<=j+longest; k++)
        dp[0][j] |= dp[0][k]
      if (dp[0][j])
        min = Math.min(min, j + 1)
    }
  }

  // Left side initialisation
  for (let i=1; i<s.length; i++){
    if (s[i-1] == s[s.length-1]){
      let longest = Math.min(s.length - i, m[i-1][s.length-1][0])
      for (let k=i-1; k>=i-longest; k--)
        dp[i][s.length-1] |= dp[k][s.length-1]
      if (dp[i][s.length-1])
        min = Math.min(min, s.length - i)
    }
  }

  for (let i=1; i<=s.length-2; i++){
    for (let j=s.length-2; j>=i; j--){
      // We removed on the right
      if (s[i] == s[j+1]){
        // We are bound by half the current range
        let longest = Math.min(j - i + 1, m[i][j+1][1])
        for (let k=j+1; k<=j+longest; k++)
          dp[i][j] |= dp[i][k]
      }
      // We removed on the left
      if (s[i-1] == s[j]){
        // We are bound by half the current range
        let longest = Math.min(j - i + 1, m[i-1][j][0])
        for (let k=i-1; k>=i-longest; k--)
          dp[i][j] |= dp[k][j]
      }
      if (dp[i][j])
        min = Math.min(min, j - i + 1)
    }
  }

  if (debug){
    let str = ""
    for (let row of dp)
      str += row + "\n"
    console.log(str)
  }

  return min
}

function main(s){
  var strs = [
    "caaca",
    "bbabbbba",
    "baabbabaa",
    "bbabbba",
    "bbbabbbbba",
    "abbabaabbab",
    "abbabaabbaba",
    "aabaabaaabaab",
    "bbabbabbb"
  ]

  for (let s of strs){
    let t = new Date
    console.log(s)
    console.log(f(s))
    //console.log((new Date - t)/1000)
    console.log("")
  }
}

main()

【讨论】:

  • 评论不用于扩展讨论;这个对话是moved to chat
  • Shadowed i 从第 64 行开始第 99 行有点难以理解——这是故意的吗?对于第 98 行循环范围的其余部分,98 和 99 处的循环声明似乎将 i 留在 MAX_N 处? (C++ 版本)
  • @DavidC.Rankin i 仅适用于该四行循环的范围,但它可能看起来令人困惑。感谢您指出 - 我更改了它,尽管更改不会影响代码执行。
  • 我尝试了一种中间递归方法,显示出有希望,但是当相等的前缀/后缀很大时,确定哪条路径导致最小单词所需的递归分支变得非常不守规矩 - 很快.
  • @DavidC.Rankin 是的,我也试过了,但即使是对已经访问过的范围的检查似乎也证明太多了。
猜你喜欢
  • 1970-01-01
  • 2021-06-08
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
相关资源
最近更新 更多