A parser for regular expressions in PHP? Answers

【Question title】: A parser for regular expressions in PHP?
【Posted】: 2011-06-03 09:48:16
【Question】:

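I need to parse regular expressions into their components in PHP. I have no problem creating or executing regular expressions, but I want to display information about a regular expression (for example, list its capture groups, attach repetition characters to their targets, ...). The overall project is a WordPress plugin that gives information about the rewrite rules, which are regexes with substitution patterns and can be cryptic to read.

I have written a simple implementation myself, which seems to handle the simple regexes I throw at it and converts them to syntax trees. Before I extend this example to support more of the regex syntax, I would like to know whether there are other good implementations I can look at. The implementation language does not really matter. I assume most parsers are written to optimize matching speed, but that is not important to me and may even hinder clarity.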
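
To make "list the capture groups" concrete, here is a minimal sketch of a character-based scan. It is not the asker's implementation; `listCaptureGroups` is a hypothetical helper, and it deliberately ignores named groups and other `(?...)` constructs, so treat it as an illustration only.

```javascript
// Hypothetical helper: returns the capturing groups of a pattern in the
// order in which they are numbered, with the offset of each opening paren.
// Simplifications: every "(?" construct is treated as non-capturing, and a
// "]" directly after "[" is treated as closing the character class.
function listCaptureGroups(pattern) {
  const groups = [];
  let inClass = false; // true while inside [...]
  for (let i = 0; i < pattern.length; i++) {
    const c = pattern[i];
    if (c === '\\') {
      i++; // skip the escaped character
    } else if (inClass) {
      if (c === ']') inClass = false;
    } else if (c === '[') {
      inClass = true;
    } else if (c === '(' && pattern[i + 1] !== '?') {
      groups.push({ number: groups.length + 1, offset: i });
    }
  }
  return groups;
}

// A pattern in the style of a WordPress rewrite rule.
console.log(listCaptureGroups('^category/(.+?)/page/?([0-9]{1,})/?$'));
```

A scan like this is enough for numbering groups, but attaching quantifiers to their targets needs a real tree, which is what the answers below discuss.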

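【Comments】:

Have you tried using regexes? Oh no, you already have a dozen problems :O
@Ivo: Indeed, my first implementation was regex-based, but it got so complicated that I switched to a simple character-based loop.
You want to implement something like xenon.stanford.edu/~xusch/regexp/analyzer.html, right?
There is an old Perl package that might fit the bill: search.cpan.org/~gsullivan/YAPE-Regex-Explain-4.01/Explain.pm

Tags: php regex parsing abstract-syntax-tree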

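【Answer 1】:

Well, you can take a look at the implementation of the regex functions in PHP. Since PHP is an open source project, all of the source code and documentation is publicly available.

【Comments】:

Thanks, but the PCRE library (which PHP uses) is heavily optimized for speed and therefore not well suited to my needs.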

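【Answer 2】:

What you need is a grammar and a way to produce a parser for it. The easiest way to produce a parser is to code a recursive descent parser directly in your target language (e.g., in PHP). That gives you a clean parser shaped exactly like your grammar, which also makes it maintainable.

Once you have a grammar, my SO description of how to build recursive descent parsers and the additional theory details here provide a lot of detail on how to do this.

As for regexes, a simple grammar (maybe not the one you had in mind) is: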

REGEX =  ALTERNATIVES ;
ALTERNATIVES = TERM ( '|' TERM )* ;
TERM = '(' ALTERNATIVES ')' |  CHARACTER | SET | TERM ( '*' | '+' | '?' ) ;
SET = '~' ? '[' ( CHARACTER | CHARACTER '-' CHARACTER )* ']' ;
CHARACTER = 'A' | 'B' | ... | '0' ... '9' | ...  ;

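A recursive descent parser written in PHP to process that grammar should be on the order of a few hundred lines at most.

Given that as a starting point, you should be able to add the features of PHP regexes to it.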
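
As a rough illustration of the shape such a parser takes, here is a sketch in JavaScript (the language used for the code elsewhere in this thread; a PHP port would look much the same). Two things are my assumptions rather than part of the grammar above: the left-recursive `TERM ( '*' | '+' | '?' )` alternative is handled with a loop, and an extra SEQUENCE level is added so that `ab` parses as a concatenation. The function name and node shapes are made up for the example.

```javascript
// Sketch of a recursive descent parser for the simple grammar above.
// Each grammar rule becomes one function.
function parseRegex(src) {
  let pos = 0;
  const peek = () => src[pos]; // undefined at end of input
  const expect = (ch) => {
    if (src[pos] !== ch) throw new Error(`expected '${ch}' at ${pos}`);
    pos++;
  };

  // ALTERNATIVES = SEQUENCE ( '|' SEQUENCE )* ;
  function alternatives() {
    const options = [sequence()];
    while (peek() === '|') { pos++; options.push(sequence()); }
    return options.length === 1 ? options[0] : { type: 'alt', options };
  }

  // SEQUENCE = TERM* ;  (assumed extra rule for concatenation)
  function sequence() {
    const items = [];
    while (pos < src.length && peek() !== '|' && peek() !== ')') items.push(term());
    return items.length === 1 ? items[0] : { type: 'seq', items };
  }

  // TERM = ATOM ( '*' | '+' | '?' )* ;  (left recursion turned into a loop)
  function term() {
    let node = atom();
    while (pos < src.length && '*+?'.includes(peek())) {
      node = { type: 'repeat', op: src[pos++], child: node };
    }
    return node;
  }

  // ATOM = '(' ALTERNATIVES ')' | SET | CHARACTER ;
  function atom() {
    const c = peek();
    if (c === '(') {
      pos++;
      const inner = alternatives();
      expect(')');
      return { type: 'group', child: inner };
    }
    if (c === '[' || (c === '~' && src[pos + 1] === '[')) return set();
    if ('*+?'.includes(c)) throw new Error(`nothing to repeat at ${pos}`);
    pos++;
    return { type: 'char', value: c };
  }

  // SET = '~' ? '[' ( CHARACTER | CHARACTER '-' CHARACTER )* ']' ;
  function set() {
    const negated = peek() === '~';
    if (negated) pos++;
    expect('[');
    const members = [];
    while (pos < src.length && peek() !== ']') {
      const from = src[pos++];
      if (peek() === '-' && pos + 1 < src.length && src[pos + 1] !== ']') {
        pos++;
        members.push({ from, to: src[pos++] });
      } else {
        members.push({ from, to: from });
      }
    }
    expect(']');
    return { type: 'set', negated, members };
  }

  const tree = alternatives();
  if (pos < src.length) throw new Error(`unexpected '${peek()}' at ${pos}`);
  return tree;
}

console.log(JSON.stringify(parseRegex('a(b|c)*'), null, 2));
```

Escapes, anchors, and counted repetition are left out; each would be one more branch in `atom()` or `term()`.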

Happy parsing!

【Comments】:

【Answer 3】:

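You may be interested in a project I did last summer. It is a Javascript program that provides dynamic syntax highlighting of PCRE-compatible regular expressions:

See: Dynamic (?:Regex Highlighting)++ with Javascript!
and the associated tester page
and the GitHub project page

The code uses (Javascript) regexes to break a (PCRE) regex into its parts, and applies markup that lets the user hover over the individual components and see the matching brackets and capture group numbers.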
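
To give an idea of what "using a regex to break a regex into parts" can look like, here is a small sketch. It is not the code from the project above; `TOKEN_RULES` and `tokenize` are names invented for this example, and only a handful of constructs are recognized (named groups, inline flags, and POSIX classes are not).

```javascript
// Hypothetical tokenizer: one sticky regex built from named alternatives.
// Order matters: earlier rules win, and "literal" is the catch-all.
const TOKEN_RULES = [
  ['escape', /\\[\s\S]/],                              // backslash + one char
  ['charClass', /\[\^?\]?(?:\\[\s\S]|[^\]\\])*\]/],    // [...] with escapes
  ['groupOpen', /\((?:\?(?:[:=!]|<[=!]))?/],           // ( (?: (?= (?! (?<= (?<!
  ['groupClose', /\)/],
  ['quantifier', /(?:[*+?]|\{\d+(?:,\d*)?\})[?+]?/],   // * + ? {n,m} plus ?/+ suffix
  ['alternation', /\|/],
  ['anchor', /[\^$]/],
  ['any', /\./],
  ['literal', /[\s\S]/],
];
const TOKEN = new RegExp(
  TOKEN_RULES.map(([name, re]) => `(?<${name}>${re.source})`).join('|'),
  'y'
);

function tokenize(pattern) {
  const tokens = [];
  let groupNumber = 0;
  let m;
  TOKEN.lastIndex = 0;
  while (TOKEN.lastIndex < pattern.length && (m = TOKEN.exec(pattern)) !== null) {
    // Exactly one named alternative took part in the match.
    const type = Object.keys(m.groups).find((k) => m.groups[k] !== undefined);
    const token = { type, text: m[0], offset: m.index };
    // A bare "(" that is not followed by "?" opens a numbered capture group.
    if (type === 'groupOpen' && m[0] === '(' && pattern[TOKEN.lastIndex] !== '?') {
      token.group = ++groupNumber;
    }
    tokens.push(token);
  }
  return tokens;
}

console.log(tokenize('(\\w+), ?(.)'));
```

A flat token list like this is enough for highlighting and group numbering; matching brackets or attaching quantifiers to their targets still needs a stack or a tree on top of it.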

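(I wrote it using regexes because I didn't know any better! 8^)

【Comments】:

【Answer 4】:

I would try to translate an ActionScript 1/2 regular expression library to PHP. Early versions of Flash had no native regex support, so a few libraries were written in AS. Translating from one dynamic language to another should be much easier than trying to decipher C.

Here is a link that may be worth a look: http://www.jurjans.lv/flash/RegExp.html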

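【Comments】:

【Answer 5】:

I'm the creator of Debuggex, whose requirements are very similar to yours: optimize for the amount of information that can be shown.

Below is a heavily modified (for readability) snippet from the parser that Debuggex uses. It does not work as-is; it is meant to demonstrate how the code is organized. Most of the error handling has been removed, and so has a lot of logic that was straightforward but verbose.

Note that recursive descent is used. This is what you have done in your parser, except that yours is flattened into a single function. I used roughly this grammar: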

Regex -> Alt
Alt -> Cat ('|' Cat)*
Cat -> Empty | (Repeat)+
Repeat -> Base (('*' | '+' | '?' | CustomRepeatAmount) '?'?)?
Base -> '(' Alt ')' | Charset | Literal
Charset -> '[' (Char | Range | EscapeSeq)* ']'
Literal -> Char | EscapeSeq
CustomRepeatAmount -> '{' Number (',' Number)? '}'
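
The CustomRepeatAmount rule above is one of the pieces left as an exercise in the code further down. A minimal sketch of one way to read it follows. `parseCustomRepeatAmount` is a hypothetical standalone helper, not part of the Debuggex parser, and accepting the open-ended `{n,}` form is an assumption that goes slightly beyond the rule as written.

```javascript
// Hypothetical standalone helper.
// Reads a quantifier such as {3}, {2,5} or {2,} that starts at s[k] === '{'.
// Returns { min, max, end }, where end is the index just past '}', or null
// when the text at k is not a well-formed quantifier (or has max < min).
function parseCustomRepeatAmount(s, k) {
  if (s[k] !== '{') return null;
  let i = k + 1;

  // Number -> one or more decimal digits; returns null if none are present.
  function number() {
    const start = i;
    while (i < s.length && s[i] >= '0' && s[i] <= '9') i++;
    return i > start ? parseInt(s.slice(start, i), 10) : null;
  }

  const min = number();
  if (min === null) return null;
  let max = min; // {n} means exactly n times
  if (s[i] === ',') {
    i++;
    const upper = number();
    max = upper === null ? Infinity : upper; // assumption: {n,} is open-ended
  }
  if (s[i] !== '}' || max < min) return null;
  return { min, max, end: i + 1 };
}

console.log(parseCustomRepeatAmount('a{2,5}b', 1)); // { min: 2, max: 5, end: 6 }
```

Returning null instead of throwing lets the caller decide whether a stray `{` is an error or a literal character, which differs between regex flavors.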

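You will notice that a lot of my code just deals with the peculiarities of the javascript flavor of regexes. You can find more information about them at this reference. For PHP, this has all the information you need. I think you are well on your way with your parser; all that remains is implementing the rest of the operators and getting the edge cases right.

:) Enjoy: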

var Parser = function(s) {
  this.s = s; // This is the regex string.
  this.k = 0; // This is the index of the character being parsed.
  this.group = 1; // This is a counter for assigning to capturing groups.
};

// These are convenience methods to make reading and maintaining the code
// easier.
// Returns true if there is more string left, false otherwise.
Parser.prototype.more = function() {
  return this.k < this.s.length;
};
// Returns the char at the current index.
Parser.prototype.peek = function() { // exercise
};
// Returns the char at the current index, then advances the index.
Parser.prototype.next = function() { // exercise
};
// Ensures c is the char at the current index, then advances the index.
Parser.prototype.eat = function(c) { // exercise
};

// We use a recursive descent parser.
// This returns the root node of our tree.
Parser.prototype.parseRe = function() {
  // It has exactly one child.
  var tree = new ReTree(this.parseAlt());
  // We expect to be at the end of the string when we finish parsing.
  // If not, something went wrong.
  if (this.more()) {
    throw new Error();
  }
  return tree;
};

// This parses several subexpressions divided by |s, and returns a tree
// with the corresponding trees as children.
Parser.prototype.parseAlt = function() {
  var alts = [this.parseCat()];
  // Keep parsing as long as we have more pipes.
  while (this.more() && this.peek() === '|') {
    this.next();
    // Recursive descent happens here.
    alts.push(this.parseCat());
  }
  // Here, we allow an AltTree with a single child.
  // Alternatively, we can return the child if there is only one.
  return new AltTree(alts);
};

// This parses several concatenated repeat-subexpressions, and returns
// a tree with the corresponding trees as children.
Parser.prototype.parseCat = function() {
  var cats = [];
  // If we reach a pipe or close paren, we stop. This is because that
  // means we are in a subexpression, and the subexpression is over.
  while (this.more() && ')|'.indexOf(this.peek()) === -1) {
    // Recursive descent happens here.
    cats.push(this.parseRepeat());
  }
  // This is where we choose to handle the empty string case.
  // It's easiest to handle it here because of the implicit concatenation
  // operator in our grammar.
  return (cats.length >= 1) ? new CatTree(cats) : new EmptyTree();
};

// This parses a single repeat-subexpression, and returns a tree
// with the child that is being repeated.
Parser.prototype.parseRepeat = function() {
  // Recursive descent happens here.
  var repeat = this.parseBase();
  // If we reached the end after parsing the base expression, we just return
  // it. Likewise if we don't have a repeat operator that follows.
  if (!this.more() || '*?+{'.indexOf(this.peek()) === -1) {
    return repeat;
  }

  // These are properties that vary with the different repeat operators.
  // They aren't necessary for parsing, but are used to give meaning to
  // what was parsed.
  var min = 0; var max = Infinity; var greedy = true;
  if (this.peek() === '*') { // exercise
  } else if (this.peek() === '?') { // exercise
  } else if (this.peek() === '+') {
    // For +, we advance the index, and set the minimum to 1, because
    // a + means we repeat the previous subexpression between 1 and infinity
    // times.
    this.next(); min = 1;
  } else if (this.peek() === '{') { /* challenging exercise */ }

  if (this.more() && this.peek() === '?') {
    // By default (in Javascript at least), repetition is greedy. Appending
    // a ? to a repeat operator makes it reluctant.
    this.next(); greedy = false;
  }
  return new RepeatTree(repeat, {min:min, max:max, greedy:greedy});
};

// This parses a "base" subexpression. We defined this as being a
// literal, a character set, or a parenthesized subexpression.
Parser.prototype.parseBase = function() {
  var c = this.peek();
  // If any of these characters are spotted, something went wrong.
  // The ) should have been eaten by a previous call to parseBase().
  // The *, ?, or + should have been eaten by a previous call to parseRepeat().
  if (c === ')' || '*?+'.indexOf(c) !== -1) {
    throw new Error();
  }
  if (c === '(') {
    // Parse a parenthesized subexpression. This is either a lookahead,
    // a capturing group, or a non-capturing group.
    this.next(); // Eat the (.
    var ret = null;
    if (this.peek() === '?') { // exercise
      // Parse lookaheads and non-capturing groups.
    } else {
      // This is why the group counter exists. We use it to enumerate the
      // group appropriately.
      var group = this.group++;
      // Recursive descent happens here. Note that this calls parseAlt(),
      // which is what was initially called by parseRe(), creating
      // a mutual recursion. This is where the name recursive descent
      // comes from.
      ret = new MatchTree(this.parseAlt(), group);
    }
    // This MUST be a ) or something went wrong.
    this.eat(')');
    return ret;
  } else if (c === '[') {
    this.next(); // Eat the [.
    // Parse a charset. A CharsetTree has no children, but it does contain
    // (pseudo)chars and ranges, and possibly a negation flag. These are
    // collectively returned by parseCharset().
    // This piece can be structured differently depending on your
    // implementation of parseCharset()
    var opts = this.parseCharset();
    // This MUST be a ] or something went wrong.
    this.eat(']');
    return new CharsetTree(opts);
  } else {
    // Parse a literal. Like a CharsetTree, a LiteralTree doesn't have
    // children. Instead, it contains a single (pseudo)char.
    var literal = this.parseLiteral();
    return new LiteralTree(literal);
  }
};

// This parses the inside of a charset and returns all the information
// necessary to describe that charset. This includes the literals and
// ranges that are accepted, as well as whether the charset is negated.
Parser.prototype.parseCharset = function() {
  // challenging exercise
};

// This parses a single (pseudo)char and returns it for use in a LiteralTree.
Parser.prototype.parseLiteral = function() {
  var c = this.next();
  if (c === '.' || c === '^' || c === '$') {
    // These are special chars. Their meaning is different than their
    // literal symbol, so we set the 'special' flag.
    return new CharInfo(c, true);
  } else if (c === '\\') {
    // If we come across a \, we need to parse the escaped character.
    // Since parsing escaped characters is similar between literals and
    // charsets, we extracted it to a separate function. The reason we
    // pass a flag is because \b has different meanings inside charsets
    // vs outside them.
    return this.parseEscaped({inCharset: false});
  }
  // If neither case above was hit, we just return the exact char.
  return new CharInfo(c);
};

// This parses a single escaped (pseudo)char and returns it for use in
// either a LiteralTree or a CharsetTree.
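Parser.prototype.parseEscaped = function(opts) {
  // Here we instantiate some default options
  opts = opts || {};
  var inCharset = opts.inCharset || false;

  // Note: isDigit() and isHexa() used below are single-character
  // predicates that are not shown in this snippet.
  var c = this.peek();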
  // Here are a bunch of escape sequences that require reading further
  // into the string. They are all fairly similar.
  if (c === 'c') { // exercises
  } else if (c === '0') {
  } else if (isDigit(c)) {
  } else if (c === 'x') {
  } else if (c === 'u') {
    // Use this as an example for implementing the ones above.
    // A regex may be used for this portion, but I think this is clearer.
    // We make sure that there are exactly four hexadecimal digits after
    // the u. Modify this for the escape sequences that your regex flavor
    // uses.
    var r = '';
    this.next();
    for (var i = 0; i < 4; ++i) {
      c = this.peek();
      if (!isHexa(c)) {
        throw new Error();
      }
      r += c;
      this.next();
    }
    // Return a single CharInfo despite having read multiple characters.
    // This is why I used "pseudo" previously.
    return new CharInfo(String.fromCharCode(parseInt(r, 16)));
  } else { // No special parsing required after the first escaped char.
    this.next();
    if (inCharset && c === 'b') {
      // Within a charset, \b means backspace
      return new CharInfo('\b');
    } else if (!inCharset && (c === 'b' || c === 'B')) {
      // Outside a charset, \b is a word boundary (and \B is the complement
      // of that). We mark it as special since the character is not
      // to be taken literally.
      return new CharInfo('\\' + c, true);
    } else if (c === 'f') { // these are left as exercises
    } else if (c === 'n') {
    } else if (c === 'r') {
    } else if (c === 't') {
    } else if (c === 'v') {
    } else if ('dDsSwW'.indexOf(c) !== -1) {
    } else {
      // If we got to here, the character after \ should be taken literally,
      // so we don't mark it as special.
      return new CharInfo(c);
    }
  }
};

// This represents the smallest meaningful character unit, or pseudochar.
// For example, an escaped sequence with multiple physical characters is
// exactly one character when used in CharInfo.
var CharInfo = function(c, special) {
  this.c = c;
  this.special = special || false;
};

// Calling this will return the parse tree for the regex string s.
var parse = function(s) { return (new Parser(s)).parseRe(); };
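
The snippet leaves `peek()`, `next()` and `eat()` as exercises and uses several names it never defines (`isDigit`, `isHexa`, and the `*Tree` constructors). The following is one possible way to fill those in, written as a standalone sketch. None of it is taken from Debuggex: the error messages and the fields stored on each tree node are assumptions made for illustration.

```javascript
// Minimal stand-in for the Parser constructor shown above, repeated here
// so that this sketch runs on its own.
var Parser = function(s) {
  this.s = s; // the regex string
  this.k = 0; // index of the character being parsed
  this.group = 1;
};
Parser.prototype.more = function() {
  return this.k < this.s.length;
};
// Returns the char at the current index (undefined at end of input).
Parser.prototype.peek = function() {
  return this.s[this.k];
};
// Returns the char at the current index, then advances the index.
Parser.prototype.next = function() {
  if (!this.more()) {
    throw new Error('Unexpected end of pattern');
  }
  return this.s[this.k++];
};
// Ensures c is the char at the current index, then advances the index.
Parser.prototype.eat = function(c) {
  if (this.peek() !== c) {
    throw new Error("Expected '" + c + "' at index " + this.k);
  }
  this.k++;
};

// Single-character predicates used by parseEscaped().
var isDigit = function(c) { return /^[0-9]$/.test(c); };
var isHexa = function(c) { return /^[0-9a-fA-F]$/.test(c); };

// Assumed node shapes: plain objects that keep what the parser passes in.
var ReTree = function(child) { this.child = child; };
var AltTree = function(alts) { this.children = alts; };
var CatTree = function(cats) { this.children = cats; };
var EmptyTree = function() {};
var RepeatTree = function(child, opts) {
  this.child = child;
  this.min = opts.min;
  this.max = opts.max;
  this.greedy = opts.greedy;
};
var MatchTree = function(child, group) { this.child = child; this.group = group; };
var CharsetTree = function(opts) { this.opts = opts; };
var LiteralTree = function(charInfo) { this.charInfo = charInfo; };
```

With these in place, the remaining gaps in the snippet are the branches marked as exercises (the `*` and `?` operators, `{...}`, `(?...)` groups, charsets, and most escape sequences).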

【Comments】:

【Answer 6】:

The Perl module YAPE::Regex::Explain could probably be ported to PHP quite easily. Here is an example of its output:

C:\>perl -e "use YAPE::Regex::Explain;print YAPE::Regex::Explain->new(qr/['-]/)->explain;"
The regular expression:

(?-imsx:['-])

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ['-]                     any character of: ''', '-'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------



C:\>perl -e "use YAPE::Regex::Explain; print YAPE::Regex::Explain->new(qr/(\w+), ?(.)/)->explain;"
The regular expression:

(?-imsx:(\w+), ?(.))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  ,                        ','
----------------------------------------------------------------------
   ?                       ' ' (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    .                        any character except \n
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

C:\>

You can look at the source code and quickly see how it is implemented.

【Comments】: