Algorithm to insert string based on multiple regex rules: answers

【Question title】: Algorithm to insert string based on multiple regex rules
【Posted】: 2017-06-10 17:21:34
【Question description】:

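I am creating a Markdown-style markup interface. For example, when the user types **example string**, a regex is used to find the two occurrences of ** (which define bold text), and the actual plain text is changed to <b>**example string**</b> and rendered as HTML.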

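This is my idea for parsing the user input into HTML:

1. For each rule in the regex rules
2. For each occurrence of the start pattern (of the current regex rule)
3. Take all text after the end of the start pattern (call this the start substring)
4. For the first instance of the end pattern in the start substring
5. Take substring(start_match.start() + end_match.end()) from the text
6. Append it to the initially blank final text string
7. Cull the remaining text by substring(start_match.start() + end_match.end()) and feed it back into the text read at step 2.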

My code:

public static String process(String input_text) {
    String final_text = "";
    String current_text = input_text;

    for (MarkdownRule rule : _rules) {
        Pattern s_ptrn = rule.getStartPattern();    // Start pattern
        Pattern e_ptrn = rule.getEndPattern();      // End pattern

        /* For each occurrence of the start pattern */
        Matcher s_matcher = s_ptrn.matcher(current_text);
        while (s_matcher.find()) {
            int s_end = s_matcher.end();
            int s_start = s_matcher.start();

            /* Take all text after the end of start match */
            String working_text = current_text.substring(s_end); // ERROR HERE

            /* For first instance of end pattern in remaining text */
            Matcher e_matcher = e_ptrn.matcher(working_text);
            if (e_matcher.find()) {

                /* Take full substring from current text */
                int e_end = e_matcher.end();
                working_text = current_text.substring(s_start, s_end + e_end);

                /* Append to final text */
                working_text = new StringBuilder(working_text).insert(0, "<b>").append("</b>").toString();
                final_text = new StringBuilder(final_text).append(working_text).toString();

                /* Remove working text from current text */
                current_text = new StringBuilder(current_text).substring(s_start + e_end);
            }
        }
    }

    return final_text;
}

In theory this should work fine, but I get a StringIndexOutOfBoundsException on this line:

/* Take all text after the end of start match */
String working_text = current_text.substring(s_end);

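This happens when I use the input text **example**. I believe it works fine for the first occurrence of the start pattern (at indexes 0 and 1), but the string is then not culled properly, and the loop runs again on the plain text **, which gives the out-of-bounds error. (I can't guarantee this, though; it is only what my own testing suggests.)

Unfortunately, my troubleshooting has not been able to fix the error. Thanks in advance for your help.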
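The failure described above can be traced with a minimal sketch. It assumes a single rule whose start and end patterns are both `\*\*` (the question does not show the actual rule definitions), and the class name `StaleMatcherTrace` is made up for illustration. It repeats the question's index arithmetic for the input **example** and reports the index the second loop iteration would pass to substring.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StaleMatcherTrace {
    /** Returns {end index of the second start match, length of the culled text}. */
    static int[] trace() {
        Pattern bold = Pattern.compile("\\*\\*");   // assumed start/end pattern
        String currentText = "**example**";         // 11 characters
        Matcher sMatcher = bold.matcher(currentText);

        sMatcher.find();                            // first "**" at [0, 2)
        int sStart = sMatcher.start();
        int sEnd = sMatcher.end();

        Matcher eMatcher = bold.matcher(currentText.substring(sEnd));
        eMatcher.find();                            // closing "**" at [7, 9) of "example**"
        int eEnd = eMatcher.end();

        // Same cull as in the question: leaves "**" (length 2).
        currentText = currentText.substring(sStart + eEnd);

        sMatcher.find();                            // still scans the original 11-char string
        return new int[] { sMatcher.end(), currentText.length() };
    }

    public static void main(String[] args) {
        int[] r = trace();
        System.out.println("s_end = " + r[0] + ", current_text length = " + r[1]);
    }
}
```

Under these assumptions the second start match ends at index 11 while the culled text has length 2, so `current_text.substring(s_end)` is out of range.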

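【Question comments】:

Markdown already exists, so why invent your own?!?!
I want to handle it in real time, in a way similar to applications such as Typora. Also just to see if I can!
Not sure what real time has to do with the syntax and the parser... but note that Markdown parser writers have learned the hard way that regex-based parsers are not a long-term strategy. See github.com/jgm/CommonMark for a modern implementation.

Tags: java html regex algorithm markdown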

【Solution 1】:

You are changing (shrinking) current_text:

/* Remove working text from current text */
current_text = new StringBuilder(current_text).substring(s_start + e_end);

The matcher, however, holds the initial current_text string, and that string does not change no matter what you do to current_text afterwards.

/* For each occurrence of the start pattern */
Matcher s_matcher = s_ptrn.matcher(current_text);

You need to use a new matcher for the new string.
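One way to apply this advice is sketched below. It is not the asker's final code: `MarkdownRule` here is a hypothetical stand-in for the class in the question, the rule list is passed in as a parameter instead of read from `_rules`, and the `<b>` tag is hard-coded as in the question. The sketch creates a fresh start matcher on every pass, culls with `s.end() + e.end()` so the closing marker is consumed, and also keeps the text outside the matches, which the original loop dropped.

```java
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FreshMatcherSketch {
    /** Hypothetical stand-in for the question's MarkdownRule class. */
    static final class MarkdownRule {
        private final Pattern start;
        private final Pattern end;
        MarkdownRule(String start, String end) {
            this.start = Pattern.compile(start);
            this.end = Pattern.compile(end);
        }
        Pattern getStartPattern() { return start; }
        Pattern getEndPattern() { return end; }
    }

    static String process(String inputText, List<MarkdownRule> rules) {
        String currentText = inputText;
        for (MarkdownRule rule : rules) {
            StringBuilder out = new StringBuilder();
            while (true) {
                // New matcher on every pass, so its indexes always refer to currentText.
                Matcher s = rule.getStartPattern().matcher(currentText);
                if (!s.find()) break;
                Matcher e = rule.getEndPattern().matcher(currentText.substring(s.end()));
                if (!e.find()) break;
                int matchEnd = s.end() + e.end();   // end of the closing marker in currentText
                if (matchEnd == 0) break;           // guard against patterns that match nothing
                out.append(currentText, 0, s.start())          // text before the match
                   .append("<b>")
                   .append(currentText, s.start(), matchEnd)   // markers kept, as in the question
                   .append("</b>");
                currentText = currentText.substring(matchEnd); // cull the consumed text
            }
            out.append(currentText);        // unmatched remainder
            currentText = out.toString();   // feed the result to the next rule
        }
        return currentText;
    }

    public static void main(String[] args) {
        List<MarkdownRule> rules =
            Collections.singletonList(new MarkdownRule("\\*\\*", "\\*\\*"));
        System.out.println(process("a **example** b **two**", rules));
    }
}
```

Feeding each rule's output into the next rule is a design choice of this sketch; rules whose patterns could match the inserted tags would need extra care.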

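【Comments】:

Ah, so Matcher caches the string value rather than referencing the instance. That's interesting (and a little annoying). Thanks!
That is not quite correct. Matcher captures a reference to the String instance. Since String is an immutable object, you cannot change the value of a String, but you can refer to a new String. That is what happens when you change current_text: you are now referring to a new String, while the Matcher still refers to the old String.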
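The distinction in the last comment can be illustrated with a small sketch (class and variable names are made up for the example). Reassigning the variable does not affect an existing Matcher, while `Matcher.reset(CharSequence)` from the standard library points the same matcher at a new input; creating a new matcher, as the answer suggests, has the same effect.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherReferenceDemo {
    /** Returns {match end on the old input, match end after reset on the new input}. */
    static int[] demo() {
        String text = "**example**";
        Matcher m = Pattern.compile("\\*\\*").matcher(text);

        text = "**";        // rebinds the variable; the original String object is untouched
        m.find();           // first "**" of the original input, at [0, 2)
        m.find();           // second "**" of the original input, at [9, 11)
        int staleEnd = m.end();

        m.reset(text);      // same matcher, now scanning the new String
        m.find();           // "**" at [0, 2) of the new input
        int freshEnd = m.end();

        return new int[] { staleEnd, freshEnd };
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println("before reset: " + r[0] + ", after reset: " + r[1]);
    }
}
```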