【Question title】:Deciphering a Regex
【Posted】:2013-05-29 13:21:39
【Question】:

Could someone please help me understand this regular expression, which matches the src attribute of an img tag in HTML?

src=(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))


src=                               this is easy
(?:(['""])(?<src>(?:(?!\1).)*)     ?: is unknown (['""]) matches either single or double quotes, followed by a named group "src" that matches unknown strings
\1                                 unknown
|                                  "or"
(?<src>[^\s>]+))                   named group "src" matches one or more of line start or whitespace

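In short, what does ?: mean?

So (?:...) is a non-capturing version of regular parentheses. It matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after the match or referenced later in the pattern.

Thanks @mbratch

What does \1 mean?

Finally, does the exclamation mark have any special meaning here? (Negation?)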
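
To make the difference between the two kinds of parentheses concrete, here is a minimal sketch. It uses Python's `re` module as a stand-in for the .NET engine (an assumption; the question itself is about .NET), and the sample text `src=img.jpg` is made up for illustration.

```python
import re

text = "src=img.jpg"

# Capturing parentheses: the matched substring is stored as group 1.
capturing = re.match(r"src=(\S+)", text)

# Non-capturing parentheses: same grouping, but nothing is stored.
non_capturing = re.match(r"src=(?:\S+)", text)

print(capturing.groups())       # ('img.jpg',)
print(non_capturing.groups())   # ()
print(non_capturing.group(0))   # src=img.jpg  (the overall match is unaffected)
```

Both patterns match the same text; only the set of retrievable groups differs.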

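【Comments on the question】:

  • What exactly are you having trouble understanding?
  • Never mind. Parsing HTML with regex is wrong in itself.
  • Parsing the inside of a single tag with regex may actually be fine.
  • I am not parsing HTML with regex!

Tags: .net regex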


【Solution 1】:

This might help you understand the regex.

(?:(['""])((?:(?!\1).)*)\1|([^\s>]+))


【Comments】:

    【Solution 2】:

    For example, take src="img.jpg" as the text we are parsing.

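    In regex, \1 refers to the first capturing group. In this particular case, the first capturing group is (['""]). In our example, the (?:(['""])(?<src>(?:(?!\1).)*) part is the start of the non-capturing group and matches "img.jpg. In particular, (['""]) matches either quote character. Then (?!\1) is a negative lookahead for the quote character matched by the first group, so (?:(?!\1).) matches any character that is not that quote character, and (?<src>(?:(?!\1).)*) captures into the named group the sequence of characters up to the closing quote character. The \1 that follows then matches the closing quote character.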
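
The walk-through above can be tried out with a small sketch. It covers only the quoted branch and uses Python's `re` module rather than .NET (an assumption made for illustration), so the named group is spelled `(?P<src>...)` instead of `(?<src>...)`. The sample tags are invented.

```python
import re

# Quoted branch only. Python spells named groups (?P<name>...),
# unlike the .NET form (?<name>...) used in the question.
quoted = re.compile(r"""src=(['"])(?P<src>(?:(?!\1).)*)\1""")

m = quoted.search('<img src="img.jpg" alt="logo">')
print(m.group(1))      # the opening quote captured by group 1
print(m.group("src"))  # img.jpg

# The lookahead only rejects the quote that opened the value, so the
# other kind of quote may still appear inside it.
m2 = quoted.search("""src='say "hi".png'""")
print(m2.group("src"))  # say "hi".png
```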

    【Comments】:

      【Solution 3】:
      src=      # matches literal "src="
      (?:       # the ?: suppresses capturing. generally a good practice if capturing
                # is not explicitly necessary
        (['"])  # matches either ' or ", and captures what was matched in group 1
                # (because this is the first set of parentheses where capturing is not
                # suppressed)
        (?<src> # start another (named) capturing group with the name "src"
          (?:   # start non-capturing group
            (?!\1)
                # a negative lookahead, if its contents match, the lookahead causes the
                # pattern to fail
                # the \1 is a backreference and matches what was matched in capturing
                # group no. 1
          .)*   # match any character, end of non-capturing group, repeat
                # summary of this non-capturing group: for each character, check that
                # it is not the kind of quote we matched at the start. if it's not,
                # then consume it. repeat as long as possible.
      
        )       # end of capturing group "src"
        \1      # again a backreference to what was matched inside capturing group 1
                # i.e. match the same kind of quote that started the attribute value
      |         # or
        (?<src> # again a capturing group with the name "src"
          [^\s>]+
                # match as many non-space, non-> character as possible (at least one)
        )       # end of capturing group. this case treats unquoted attribute values.
      )         # end of non-capturing group (which was used to group the alternation)
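
The annotated layout above maps directly onto a free-spacing (verbose) pattern. The following is a sketch in Python, not the original .NET code. Two things are assumptions of this sketch: Python's `re` does not allow the same group name twice, so the two `src` groups are renamed `src_quoted` and `src_bare`, and `extract_src` is a hypothetical helper that merges them again.

```python
import re

IMG_SRC = re.compile(
    r"""
    src=
    (?:
        (['"])                         # group 1: the opening quote
        (?P<src_quoted>(?:(?!\1).)*)   # anything up to that same quote
        \1                             # the closing quote
      |
        (?P<src_bare>[^\s>]+)          # unquoted attribute value
    )
    """,
    re.VERBOSE,
)


def extract_src(tag):
    """Return the src value found in `tag`, or None (hypothetical helper)."""
    m = IMG_SRC.search(tag)
    if m is None:
        return None
    quoted = m.group("src_quoted")
    # Only one of the two named groups takes part in any given match.
    return quoted if quoted is not None else m.group("src_bare")


print(extract_src('<img src="a.png">'))  # a.png
print(extract_src("<img src='b.png'>"))  # b.png
print(extract_src("<img src=c.png>"))    # c.png
```

In .NET the duplicate name makes this merging step unnecessary, because both alternatives write into the same `src` group.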
      

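      Some further reading for you:

      If you want to brush up your regex knowledge a bit, I recommend reading through the whole tutorial. It is definitely worth your time.

      More resources to help you understand complex expressions:

      • Regex 101 generates an explanation from a regex. However, it uses PHP's PCRE engine, so it chokes on some .NET features, such as repeated named capturing groups (src in your case).
      • Debuggex lets you step through a regex and generates a flowchart. So far its regex flavor is even more limited (to JavaScript's ECMAScript flavor).
      • Regexper focuses on flowcharts. So far, though, it is also limited to JavaScript regex.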

      【Comments】:

        【Solution 4】:

        I used RegexBuddy and got this output:

        Match the characters “src=” literally «src=»
        Match the regular expression below «(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))»
           Match either the regular expression below (attempting the next alternative only if this one fails) «(['""])(?<src>(?:(?!\1).)*)\1»
              Match the regular expression below and capture its match into backreference number 1 «(['""])»
                 Match a single character present in the list “'"” «['""]»
              Match the regular expression below and capture its match into backreference with name “src” «(?<src>(?:(?!\1).)*)»
                 Match the regular expression below «(?:(?!\1).)*»
                    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
                    Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\1)»
                       Match the same text as most recently matched by capturing group number 1 «\1»
                    Match any single character that is not a line break character «.»
              Match the same text as most recently matched by capturing group number 1 «\1»
           Or match regular expression number 2 below (the entire group fails if this one fails to match) «(?<src>[^\s>]+)»
              Match the regular expression below and capture its match into backreference with name “src” «(?<src>[^\s>]+)»
                 Match a single character NOT present in the list below «[^\s>]+»
                    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
                    A whitespace character (spaces, tabs, line breaks, etc.) «\s»
                    The character “>” «>»
        

        This regex is pretty bad for what you describe: src=" is accepted as valid input (the unquoted alternative matches the lone quote).
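
A quick way to check this claim is the sketch below. It is written in Python rather than .NET (an assumption), so the two `src` groups are renamed `quoted` and `bare`, since Python's `re` rejects duplicate group names. The input tag is invented.

```python
import re

pattern = re.compile(
    r"""src=(?:(['"])(?P<quoted>(?:(?!\1).)*)\1|(?P<bare>[^\s>]+))"""
)

# A tag whose attribute value has an opening quote but no closing one.
m = pattern.search('<img src=">')

print(m.group(0))         # src="
print(m.group("quoted"))  # None: the quoted branch found no closing quote
print(m.group("bare"))    # the lone quote, taken by the unquoted branch
```

The quoted branch fails for lack of a closing quote, and the engine then falls back to the unquoted branch, which happily consumes the quote character itself.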

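        【Comments】:

          【Solution 5】:

          1> It first captures any one of ['""] in group 1, i.e. (['""])

          2> It then matches zero or more characters that are not the one captured in group 1, i.e. (?:(?!\1).)*

          3> It keeps doing step 2 until it reaches what was captured in group 1, i.e. \1

          The 3 steps above are similar in intent to (['""])[^\1]*\1, although in most regex flavors a backreference does not work inside a character class, which is why the pattern uses the lookahead instead.

          1> Otherwise (the second alternative), it matches all the non-whitespace, non-> characters after src=, i.e. [^\s>]+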
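
The character-class shorthand `[^\1]` mentioned in the steps above is best read as an analogy rather than a working pattern. The sketch below shows the difference in Python's `re` (an assumption; .NET may treat the escape differently in detail), where `\1` inside a character class is a numeric character escape and not a backreference. The sample tag is invented.

```python
import re

text = '<img src="a.png" alt="x">'

# Lookahead version: stops at the first quote that matches group 1.
lookahead = re.search(r"""src=(['"])((?:(?!\1).)*)\1""", text)

# Character-class version: in Python, [^\1] means "any character except
# U+0001", so the greedy * runs on to the last matching quote.
char_class = re.search(r"""src=(['"])([^\1]*)\1""", text)

print(lookahead.group(2))   # a.png
print(char_class.group(2))  # a.png" alt="x
```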


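          NOTE: I would use src=(['""]).*?\1

          .* is greedy: it matches as much as possible.

          .*? is lazy: it matches as little as possible.

          For example, consider the string hello hi world

          For the regex ^h.*l, the output would be hello hi worl

          For the regex ^h.*?l, the output would be hel

          【Comments】:

          • Could you please explain the *? part of your suggested regex?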
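
The greedy and lazy results above can be reproduced with a short sketch. It uses Python's `re` module (an assumption; the quantifier semantics shown here are the same in .NET), and the `tag` string is invented to show why laziness matters for the suggested pattern.

```python
import re

text = "hello hi world"
greedy = re.match(r"^h.*l", text).group(0)
lazy = re.match(r"^h.*?l", text).group(0)
print(greedy)  # hello hi worl
print(lazy)    # hel

# The same effect on the suggested pattern: the lazy form stops at the
# first closing quote, the greedy form runs on to the last one.
tag = '<img src="a.png" alt="x">'
lazy_src = re.search(r"""src=(['"]).*?\1""", tag).group(0)
greedy_src = re.search(r"""src=(['"]).*\1""", tag).group(0)
print(lazy_src)    # src="a.png"
print(greedy_src)  # src="a.png" alt="x"
```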