Using regular expressions in Python to parse LaTeX code

【Question title】: Using regular expressions in Python to parse LaTeX code
【Posted】: 2015-08-25 03:22:06
【Question description】:

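I'm trying to write a Python script to tidy up my LaTeX code. I'd like to find instances where an environment is started but non-whitespace characters follow the declaration before the next newline. For example, I'd like to match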

\begin{theorem}[Weierstrass Approximation] \label{wapprox}

but not match

\begin{theorem}[Weierstrass Approximation] 
\label{wapprox}

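My goal is to insert (using re.sub) a newline between the end of the declaration and the first non-whitespace character. Loosely speaking, I'd like to find something like

(\begin{env}) ({text} | [text]) ({text2}|[text2]) ... ({textn}|[textn]) (\S)

and do a substitution. I tried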

expr = re.compile(r'\\(begin|end){1}({[^}]+}|\[[^\]]+\])+[^{\[]+$',re.M)

but this isn't quite working. The repeated group ends up capturing only the last {...} or [...] pair.
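The behavior described above (a repeated capture group keeping only its last iteration) can be reproduced with a small sketch. It reuses the argument part of the question's expression; the variable names are illustrative only and do not come from the original post.

```python
import re

line = r'\begin{theorem}[Weierstrass Approximation] \label{wapprox}'

# The question's repeated group: each iteration overwrites the previous capture.
repeated = re.match(r'\\(begin|end)({[^}]+}|\[[^\]]+\])+', line)
last_only = repeated.group(2)

# Wrapping the repetition in an outer group captures every argument at once.
wrapped = re.match(r'\\(begin|end)((?:{[^}]+}|\[[^\]]+\])+)', line)
all_args = wrapped.group(2)

print(last_only)  # [Weierstrass Approximation]
print(all_args)   # {theorem}[Weierstrass Approximation]
```

Moving the quantifier inside an enclosing group is the usual way to keep the whole run of arguments in one capture.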

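【Comments】:

A less convoluted solution might be to write a tokenizer/lexer for LaTeX that splits the input into tokens and copies them one by one into a second buffer. While copying, you can decide whether to insert extra whitespace or a newline. As you walk over the tokens, whenever you encounter a '\begin{(\w+)}' token, enter a state that guarantees a newline is inserted before the next non-whitespace token is copied. Trying to analyze a whole LaTeX document with regular expressions is likely to be brittle.
As always, don't use regular expressions to parse structured languages.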
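The tokenizer idea from the first comment could look roughly like the sketch below. It is only an illustration under strong assumptions: the token set is deliberately simplified (no LaTeX comments, verbatim blocks, or nested braces), and the names TOKEN_RE and break_after_env are hypothetical, not taken from the thread.

```python
import re

# Hypothetical, simplified token set; real LaTeX needs many more cases.
TOKEN_RE = re.compile(r'''
    (?P<env>\\(?:begin|end)\{\w+\*?\})    # \begin{name} or \end{name}
  | (?P<arg>\{[^{}]*\}|\[[^\[\]]*\])      # a flat {...} or [...] argument
  | (?P<newline>\n)
  | (?P<space>[^\S\n]+)                   # horizontal whitespace
  | (?P<text>\\?[^\s\\{}\[\]]+|.)         # a word, a command, or any single character
''', re.VERBOSE)


def break_after_env(source):
    """Copy tokens to a new buffer, breaking the line after a declaration."""
    out = []
    pending = ''        # horizontal whitespace seen right after a declaration
    after_decl = False  # True while we are just past \begin{...}[...]{...}
    for m in TOKEN_RE.finditer(source):
        kind, tok = m.lastgroup, m.group()
        if after_decl:
            if kind == 'arg' and not pending:
                out.append(tok)      # still part of the declaration
                continue
            if kind == 'space':
                pending += tok       # hold it until we know what follows
                continue
            if kind == 'newline':
                out.append(pending)  # the line already ends here: keep it as is
            else:
                out.append('\n')     # content on the same line: break before it
            pending = ''
            after_decl = False
        out.append(tok)
        if kind == 'env':
            after_decl = True
    out.append(pending)
    return ''.join(out)


print(break_after_env(r'\begin{theorem}[Weierstrass Approximation] \label{wapprox}'))
```

Keeping the state in a single flag makes it easy to add further rules later, which is the main advantage of this approach over one large regular expression.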

Tags: python parsing regex

【Solution 1】:

You can do this:

import re

s = r'''\begin{theorem}[Weierstrass Approximation] \label{wapprox}

but not match

\begin{theorem}[Weierstrass Approximation] 
\label{wapprox}'''

p = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')

print(p.sub(r'\1\n', s))

Pattern details:

(   # capture group 1
    \\
    (?:begin|end)
    # trick to emulate an atomic group
    (?=(  # the subpattern is enclosed in a lookahead and a capture group (2)
        (?:{[^}]*}|\[[^]]*])*
    ))  # a lookahead is naturally atomic
    \2  # backreference to capture group 2
)
[^\S\n]* # optional horizontal whitespace
(?=\S) # followed by a non-whitespace character
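As a sketch, the same breakdown can be compiled directly with re.VERBOSE, which lets the comments live inside the pattern. The sample strings and variable names below are illustrative assumptions, not part of the original answer.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments stay inline.
p = re.compile(r'''
    (                               # capture group 1: the whole declaration
        \\(?:begin|end)
        (?=(                        # lookahead + capture group 2 ...
            (?:{[^}]*}|\[[^]]*])*   # ... grabs every {...} and [...] argument
        ))
        \2                          # consume exactly what group 2 captured
    )
    [^\S\n]*                        # optional horizontal whitespace
    (?=\S)                          # must be followed by a non-whitespace character
''', re.VERBOSE)

same_line = r'\begin{theorem}[Weierstrass Approximation] \label{wapprox}'
split_line = '\\begin{theorem}[Weierstrass Approximation] \n\\label{wapprox}'

fixed = p.sub(r'\1\n', same_line)       # a newline is inserted before \label
untouched = p.sub(r'\1\n', split_line)  # already on two lines: no change

print(fixed)
print(untouched == split_line)  # True
```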

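Explanation: if you write a pattern like (\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S), you can't prevent matches in cases where a newline precedes the next token. Consider the following scenario:

On the first attempt, the group in (\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S) matches \begin{theorem}[Weierstrass Approximation] in:

\begin{theorem}[Weierstrass Approximation]
\label{wapprox}

But since (?=\S) fails (because the next character is a newline), the engine backtracks: the group gives back [Weierstrass Approximation] and now matches only \begin{theorem} in:

\begin{theorem}[Weierstrass Approximation]
\label{wapprox}

and (?=\S) now succeeds, because the next character is [.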
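The backtracking scenario above can be checked with a short sketch that runs both patterns on the two-line input. The names naive and atomic are labels chosen here for illustration.

```python
import re

text = '\\begin{theorem}[Weierstrass Approximation]\n\\label{wapprox}'

# Without the atomic trick, the argument group is allowed to give back a match.
naive = re.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*)[^\S\n]*(?=\S)')
# With the lookahead + backreference trick, the arguments are matched all-or-nothing.
atomic = re.compile(r'(\\(?:begin|end)(?=((?:{[^}]*}|\[[^]]*])*))\2)[^\S\n]*(?=\S)')

naive_match = naive.search(text)
print(naive_match.group(1))      # \begin{theorem}
print(naive.sub(r'\1\n', text))  # unwanted line break before [Weierstrass Approximation]
print(atomic.search(text))       # None
```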

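An atomic group is a non-capturing group that forbids backtracking into the subpattern it contains. The notation is (?>subpattern). The re module did not have this feature when this answer was written (Python 3.11 later added atomic groups and possessive quantifiers to re), but you can emulate it with the trick (?=(subpattern))\1.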
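To see the (?=(subpattern))\1 trick in isolation, here is a minimal sketch on a toy input; the pattern and the string 'aaab' are made up for illustration and are unrelated to LaTeX.

```python
import re

# Ordinary greedy quantifier: a+ can give back one "a" so that "ab" still matches.
backtracking = re.compile(r'a+ab')

# Emulated atomic group: the lookahead captures all the a's, \1 consumes them,
# and nothing can be given back afterwards, so "ab" can never match.
emulated_atomic = re.compile(r'(?=(a+))\1ab')

print(backtracking.search('aaab').group())  # aaab
print(emulated_atomic.search('aaab'))       # None
```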

Note that you can use the regex module (which has this feature) instead of re:

import regex

p = regex.compile(r'(\\(?:begin|end)(?>(?:{[^}]*}|\[[^]]*])*))[^\S\n]*(?=\S)')

or, with possessive quantifiers:

p = regex.compile(r'(\\(?:begin|end)(?:{[^}]*}|\[[^]]*])*+)[^\S\n]*+(?=\S)')

【Comments】: