Regex performance issues with possible backtracking? Answers

【Question title】: Regex performance issues with possible back tracking?
【Posted】: 2013-12-10 06:57:25
【Question description】:

I have the following input/output and regex code, which works correctly (for the input/output below).

-- Input --

keep this

      keep this too

     Bye
------ Remove Below ------
  remove all of this

-- Output --

keep this

      keep this too

     Bye

-- Code --

    String text = "keep this\n       \n"
            + "      keep this too\n      \n     Bye\n------ Remove Below ------\n  remove all of this\n";
    System.out.println(text);
    Pattern PATTERN = Pattern.compile("^(.*?)(-+)(.*?)Remove Below(.*?)(-+)(.*?)$",
             Pattern.DOTALL);
    Matcher m = PATTERN.matcher(text);
    if (m.find()) {
        // remove everything as expected (from above: input -> regex -> output)
        text =  ((m.group(1)).replaceAll("[\n]+$", "")).replaceAll("\\s+$", "");
        System.out.println(m.group(1));
        System.out.println(text);
    }

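OK, this works well. However, it is only a test against a defined input and output. When I get large files to parse that contain the character/pattern sequence below, I see that for a file of about 100k in size the code takes a while (4-5 seconds) to get through the find() method. In fact, sometimes I am not sure it will return at all... when I single-step through it as a debug test, the find() method hangs and my client disconnects.

Note: there is nothing in this file to match... but this is the pattern that is taxing my regex.

-- 100k file --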

junk here
more junk here
o o o (even more junk per the ellipses) 
-------------------------------------this is junk
junk here
more junk here
o o o (even more junk per the ellipses) 
-------------------------------------this is junk
junk here
more junk here
o o o (even more junk per the ellipses) 
-------------------------------------this is junk
junk here
more junk here
o o o (even more junk per the ellipses) 


this repeats from above to make up the 100k file.

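-- Question --

How can I optimize the regex above to handle the large-file pattern shown above, or is this kind of parsing time (4-6 seconds, or an outright hang) normal for a regex?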

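【Question discussion】:

Tags: java regex

【Solution 1】:

Since you are only interested in the text above the ------ Remove Below ------ line, you don't need to match everything. Just shorten the regex so that it matches only what you want, which avoids over-matching and backtracking.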

Pattern PATTERN = Pattern.compile("^(.*?)-+ *Remove Below *-+", Pattern.DOTALL);
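A minimal sketch of how the shortened pattern could slot into the question's code. The class name `TrimBelowMarker` and the helper `keepAbove` are illustrative names, not from the original post, and returning the input unchanged when nothing matches is an assumption about the desired behavior.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrimBelowMarker {
    // Shortened pattern from the answer: capture lazily up to the marker line.
    private static final Pattern PATTERN =
            Pattern.compile("^(.*?)-+ *Remove Below *-+", Pattern.DOTALL);

    // Hypothetical helper: returns the text above the marker with trailing
    // whitespace stripped, or the input unchanged if no marker is found.
    static String keepAbove(String text) {
        Matcher m = PATTERN.matcher(text);
        if (m.find()) {
            return m.group(1).replaceAll("\\s+$", "");
        }
        return text;
    }

    public static void main(String[] args) {
        String text = "keep this\n       \n"
                + "      keep this too\n      \n     Bye\n"
                + "------ Remove Below ------\n  remove all of this\n";
        System.out.println(keepAbove(text));
    }
}
```

Because nothing after the second run of hyphens has to match, a failed attempt ends as soon as the lazy group has walked the input once, instead of retrying every combination of the later groups.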

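【Discussion】:

Hi, thanks for the reply. This is exactly what I was looking for; it works like a charm and performs great. One question: why don't you have to include the trailing $?
We don't actually need to match any $ here, because we are only grabbing the text before the removal line.
Thanks again. One last question: what is the difference between the (.*?) in your solution and, say, (.*)? What is the real difference (non-greedy vs. greedy)? Can you elaborate?
With greedy matching, .* tries to grab as much as possible, so the -+ after it ends up matching only a single hyphen. With the non-greedy quantifier .*?, only the text before ------ is matched, and the -+ then matches the whole run of hyphens.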
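The difference described in the last comment can be seen by printing group 1 for both variants. This is a small illustration only; the class name `GreedyVsLazy`, the helper `firstGroup`, and the sample text are made up for the demo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Returns group(1) of the first match, or null if the pattern does not match.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Bye\n------ Remove Below ------\nrest";

        // Lazy: group 1 stops before the first hyphen, so -+ takes the whole run.
        System.out.println("[" + firstGroup("^(.*?)-+ *Remove Below *-+", text) + "]");

        // Greedy: group 1 swallows all but one hyphen of the first run,
        // because -+ only needs a single hyphen to succeed.
        System.out.println("[" + firstGroup("^(.*)-+ *Remove Below *-+", text) + "]");
    }
}
```

With the lazy form, group 1 is "Bye\n"; with the greedy form it is "Bye\n-----", so the captured text would still carry five stray hyphens.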

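【Solution 2】:

You are absolutely right, this is a backtracking nightmare!

When using wildcards, avoid leaving the engine many possible ways to match. Some strategies that may help:

If the number of '-' characters is known, test against the concrete string: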

^(.*?)(------ Remove Below ------)(.*)$

Or at least be a bit more specific:

^(.*?)-*-\s*Remove Below\s*--*(.*?)$

To be more precise:

^(.*?)(-+)([^-]*)Remove Below([^-]*)(-+)(.*?)$

Be greedy if you can:

^(.*)(-+)(.*?)Remove Below(.*?)(-+)(.*?)$

If you don't need a part, don't capture it in the match:

^(.*?)-+.*?Remove Below.*?-+.*?$

Of course, depending on the quality of your input, you can combine these ideas:

^(.*)------ Remove Below ------.*$

In your case, parse line by line and stop as soon as a line matches ^.*-+\s*Remove Below\s*-+.*$
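One way the line-by-line suggestion might look in Java is sketched below. The class name `LineByLineTrim` and the helper `keepAbove` are hypothetical, and splitting on "\n" assumes Unix line endings as in the question's sample input.

```java
import java.util.regex.Pattern;

public class LineByLineTrim {
    // Marker test from the answer, applied to a single line at a time.
    private static final Pattern MARKER =
            Pattern.compile("^.*-+\\s*Remove Below\\s*-+.*$");

    // Hypothetical helper: keeps every line above the marker line.
    // If no line matches, the input is returned unchanged.
    static String keepAbove(String text) {
        StringBuilder kept = new StringBuilder();
        boolean first = true;
        for (String line : text.split("\n", -1)) {
            if (MARKER.matcher(line).matches()) {
                break; // stop at the marker; nothing below it is examined
            }
            if (!first) {
                kept.append('\n');
            }
            kept.append(line);
            first = false;
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String text = "keep this\n     Bye\n------ Remove Below ------\n  remove all of this\n";
        System.out.println(keepAbove(text));
    }
}
```

Since the regex only ever sees one line, any backtracking is bounded by the length of that line rather than by the size of the whole file. For a large file, reading it with a `BufferedReader` and stopping at the marker would follow the same idea without loading everything first.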

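【Discussion】:

These are great best practices and suggestions. I will definitely give them a try. Thanks!

【Solution 3】:

If you are sure that the content to remove is at the end of the file, reverse your input string. This should help you a lot. Instead of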

Matcher m = PATTERN.matcher(text);

use

Matcher m = PATTERN.matcher(new StringBuilder(text).reverse());

Remember to reverse the pattern as well.
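A rough sketch of what the reversed approach could look like. The answer does not give the reversed pattern; the one below is an assumption, obtained by spelling the literal backwards (the hyphen runs and optional spaces read the same in both directions). The class name `ReverseTrim` and the helper `keepAbove` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReverseTrim {
    // Assumed reversed form of "-+ *Remove Below *-+": only the literal flips.
    private static final Pattern REVERSED =
            Pattern.compile("^.*?-+ *woleB evomeR *-+", Pattern.DOTALL);

    // Hypothetical helper: returns the text above the marker, or the input
    // unchanged if the marker is not found.
    static String keepAbove(String text) {
        StringBuilder reversed = new StringBuilder(text).reverse();
        Matcher m = REVERSED.matcher(reversed); // Matcher accepts any CharSequence
        if (!m.find()) {
            return text;
        }
        // Everything after the match in the reversed text is the part to keep.
        return new StringBuilder(reversed.substring(m.end())).reverse().toString();
    }

    public static void main(String[] args) {
        String text = "keep this\n     Bye\n------ Remove Below ------\n  remove all of this\n";
        System.out.print(keepAbove(text));
    }
}
```

When the marker sits near the end of the file, the lazy `.*?` only has to walk the short tail before it reaches the marker. Note that this variant cuts at the last marker in the original text, whereas the forward patterns cut at the first one.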

【Discussion】:

【Solution 4】:

You can use a third-party regex library. Here you have benchmarks.

【Discussion】: