【Question title】: Extracting URLs from @font-face by searching within @font-face for replacement
【Posted】: 2014-02-18 23:44:56
【Question description】:

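I have a web service that rewrites the URLs in CSS files so that they can be served through a CDN.

The CSS files can contain URLs to images or fonts.

I currently use the following regex to match all URLs in a CSS file: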

(url\(\s*([\'\"]?+))((?!(https?\:|data\:|\.\.\/|\/))\S+)((\2)\s*\))

However, I now want to add support for custom fonts and need to target the URLs inside @font-face:

@font-face {
  font-family: 'FontAwesome';
  src: url("fonts/fontawesome-webfont.eot?v=4.0.3");
  src: url("fonts/fontawesome-webfont.eot?#iefix&v=4.0.3") format("embedded-opentype"), url("fonts/fontawesome-webfont.woff?v=4.0.3") format("woff"), url("fonts/fontawesome-webfont.ttf?v=4.0.3") format("truetype"), url("fonts/fontawesome-webfont.svg?v=4.0.3#fontawesomeregular") format("svg");
  font-weight: normal;
  font-style: normal;
}

I then came up with the following:

@font-face\s*\{.*(url\(\s*([\'\"]?+))((?!(https?\:|data\:|\.\.\/|\/))\S+)((\2)\s*\))\s*\}

The problem is that it matches everything, not just the URLs inside. I thought I could use a lookbehind like this:

(?<=@font-face\s*\{.*)(url\(\s*([\'\"]?+))((?!(https?\:|data\:|\.\.\/|\/))\S+)((\2)\s*\))(?<=-\s*\})

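Unfortunately, PCRE (which PHP uses) does not support variable-length repetition inside a lookbehind, so I am stuck.

I do not want to detect fonts by their extension, because some fonts have an .svg extension, which can conflict with images that also have an .svg extension.

In addition, I would also like to modify my original regex so that it matches all the other URLs that are not inside @font-face: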

.someclass {
  background: url('images/someimage.png') no-repeat;
}

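Since I can't use lookbehinds, how can I extract the URLs inside @font-face and the URLs outside @font-face?

【Question discussion】:

Do you only need to extract, or do you want to be able to do a replacement afterwards?
I want to do a preg_replace(). Sorry for the confusion. I will edit my question :)
Why do you exclude URLs that begin with "http"? Can you give an example of the kind of replacement you want to do?
Because those are fully qualified URLs. In those cases the author of the CSS file wants to point to a specific location, so we should not modify them. I only want to rewrite URLs that are relative or that contain only folders and a file name.
Consider using a PHP CSS parser, for example: github.com/sabberworm/PHP-CSS-Parser

Tags: php css regex

【Solution 1】:

You can use this: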

$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<quoted_content>
        (["']) (?>[^"'\\]++ | \\{2} | \\. | (?!\g{-1})["'] )*+ \g{-1}
    )
    (?<comment> /\* .*? \*/ )
    (?<url_skip> (?: https?: | data: ) [^"'\s)}]*+ )
    (?<other_content>
        (?> [^u}/"']++ | \g<quoted_content> | \g<comment>
          | \Bu | u(?!rl\s*+\() | /(?!\*) 
          | \g<url_start> \g<url_skip> ["']?+
        )++
    )
    (?<anchor> \G(?<!^) ["']?+ | @font-face \s*+ { )
    (?<url_start> url\( \s*+ ["']?+ )
)

\g<comment> (*SKIP)(*FAIL) |

\g<anchor> \g<other_content>?+ \g<url_start> \K [./]*+ 

( [^"'\s)}]*+ )    # url
~xs
LOD;

$result = preg_replace($pattern, 'http://cdn.test.com/fonts/$8', $data);
print_r($result);

Test string:

$data = <<<'LOD'
@font-face {
  font-family: 'FontAwesome';
  src: url("fonts/fontawesome-webfont.eot?v=4.0.3");
  src: url(fonts/fontawesome-webfont.eot?#iefix&v=4.0.3) format("embedded-opentype"),
     /*url("fonts/fontawesome-webfont.woff?v=4.0.3") format("woff"),*/
       url("http://domain.com/fonts/fontawesome-webfont.ttf?v=4.0.3") format("truetype"),
       url('fonts/fontawesome-webfont.svg?v=4.0.3#fontawesomeregular') format("svg");
  font-weight: normal;
  font-style: normal;
}
/*
@font-face {
  font-family: 'Font1';
  src: url("fonts/font1.eot");
} */
@font-face {
  font-family: 'Fon\'t2';
  src: url("fonts/font2.eot");
}
@font-face {
  font-family: 'Font3';
  src: url("../fonts/font3.eot");
}
LOD;

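The main idea:

To improve readability, the pattern is divided into named subpatterns. The (?(DEFINE)...) block doesn't match anything; it is only a definition section.

The main trick of this pattern is the use of the \G anchor, which means: the start of the string or contiguous to the previous match. I added the negative lookbehind (?<!^) after it to exclude the first case (the start of the string).

The <anchor> named subpattern is the most important one, because it only allows a match when @font-face { is found or immediately after the end of a URL (which is why you see ["']?+ there).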
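To see the "contiguous branch" idea in isolation, here is a minimal sketch. The input string and the literal list: are made up for illustration; only the \G(?<!^) construct comes from the pattern above.

```php
<?php
// Hypothetical input: only the letters chained right after "list:" should match.
$subject = 'list: a b c; x y';

// \G(?<!^)  continue exactly where the previous match ended (but not at offset 0)
// list:     or start a new chain at the literal "list:"
// \K        drop the anchor and the whitespace from the match result
$pattern = '~(?:\G(?<!^)|list:)\s*+\K[a-z]~';

preg_match_all($pattern, $subject, $m);
print_r($m[0]); // a, b, c
```

The chain breaks at the ";" because neither branch can match there, so "x" and "y" are never reached. In the real pattern the } of the @font-face block plays the same role.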

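<other_content> represents everything that is not a URL to rewrite, but it also matches the URLs that must be skipped (those that begin with "http:" or "data:"). The important detail of this subpattern is that it can never match the closing curly bracket of @font-face.

The only job of <url_start> is to match url( followed by optional whitespace and an optional opening quote.

\K discards everything matched so far from the match result.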
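A tiny sketch of what \K does, using a made-up CSS line and a made-up cdn/ prefix:

```php
<?php
// Hypothetical one-line stylesheet.
$css = '.a { background: url(img/a.png); }';

// "url(" has to be present for the match to succeed, but \K removes it
// from the match result, so $0 contains only the path.
$out = preg_replace('~url\(\K[^)]+~', 'cdn/$0', $css);

echo $out; // .a { background: url(cdn/img/a.png); }
```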

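([^"'\s)}]*+) matches the URL. Together with any leading ./ or ../ consumed by [./]*+, it is the only thing that remains in the match result.

Since neither <other_content> nor the URL subpattern can match a } (outside quoted or commented parts), you can be sure never to match anything outside an @font-face definition. A second consequence is that the pattern always fails after the last URL, so on the following attempts the "contiguous branch" fails until the next @font-face.

Another trick:

The main pattern begins with \g<comment> (*SKIP)(*FAIL) | in order to skip everything inside comments /*....*/. \g<comment> refers to the basic subpattern that describes what a comment looks like. (*SKIP) forbids retrying the substring matched before it (on its left, by \g<comment>) if the pattern fails on its right. (*FAIL) forces the pattern to fail. With this trick, comments are skipped and are never a match result (since the pattern fails).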
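The comment-skipping trick also works on its own. This is a simplified sketch with a made-up input and a deliberately naive url(...) branch, not the full pattern:

```php
<?php
// Hypothetical input: one url() inside a comment, one outside.
$css = 'a /* url(x.png) */ url(y.png)';

// Branch 1 consumes a whole comment, then (*SKIP)(*FAIL) throws it away and
// resumes the search after it. Branch 2 is the thing we actually want.
$pattern = '~/\*.*?\*/(*SKIP)(*FAIL)|url\([^)]*\)~s';

preg_match_all($pattern, $css, $m);
print_r($m[0]); // only url(y.png)
```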

Subpattern details:

quoted_content: used in <other_content> to avoid matching url( or /* inside quotes.

(["'])              # capture group: the opening quote
(?>                 # atomic group: all possible content between quotes
    [^"'\\]++       # all that is not a quote or a backslash
  |                 # OR
    \\{2}           # two backslashes: (two \ doesn't escape anything)
  |                 # OR
    \\.             # any escaped character
  |                 # OR
    (?!\g{-1})["']  # the other quote (this one that is not in the capture group)
)*+                 # repeat the atomic group zero or more times
\g{-1}              # backreference to the last capturing group
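If you want to try this subpattern by itself, a sketch like the following should do. The CSS snippet is made up, and a nowdoc is used so that the backslashes reach PCRE unchanged:

```php
<?php
// The quoted_content subpattern used as a standalone regex.
$pattern = <<<'REGEX'
~(["'])(?>[^"'\\]++|\\{2}|\\.|(?!\g{-1})["'])*+\g{-1}~s
REGEX;

// Hypothetical input with an escaped quote and a "foreign" quote inside a string.
$css = <<<'CSS'
font-family: 'Fon\'t2'; src: url("a'b.eot");
CSS;

preg_match_all($pattern, $css, $m);
print_r($m[0]); // 'Fon\'t2' and "a'b.eot"
```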

other_content: everything except the closing curly bracket and URLs that do not start with http: or data:

(?>                     # open an atomic group
    [^u}/"']++          # all characters that are not problematic!
  |
    \g<quoted_content>  # string inside quotes
  |
    \g<comment>         # string inside comments
  |
    \Bu                 # "u" not preceded by a word boundary
  |
    u(?!rl\s*+\()       # "u" not followed by "rl("  (not the start of an url definition)
  |                   
    /(?!\*)             # "/" not followed by "*" (not the start of a comment)
  |
    \g<url_start>       # match the url that begins with "http:"
    \g<url_skip> ["']?+ # until the possible quote
)++                     # repeat the atomic group one or more times

anchor:

\G(?<!^) ["']?+    # contiguous to a precedent match with a possible closing quote
|                  # OR
@font-face \s*+ {  # start of the @font-face definition

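Note:

You can improve the main pattern:

After the last URL of an @font-face, the regex engine tries the "contiguous branch" of <anchor> and matches all the characters up to the }, which makes the pattern fail. Then, on each of these same characters, the regex engine must try both branches of <anchor> (which will always fail until the }).

To avoid these useless attempts, you can change the main pattern to: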

\g<comment> (*SKIP)(*FAIL) |

\g<anchor> \g<other_content>?+
(?>
    \g<url_start> \K [./]*+  ([^"'\s)}]*+)
  | 
    } (*SKIP)(*FAIL)
)

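In this new scenario, the first character after the last URL is matched by the "contiguous branch", \g<other_content> matches all characters up to the }, \g<url_start> fails immediately, the } is matched, and (*SKIP)(*FAIL) makes the pattern fail and forbids retrying these characters.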
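Put together, the improved variant could look like the sketch below. The definitions are the same as in the first pattern; the URL is still capture group 8 because the new atomic group does not capture. The small stylesheet and the CDN prefix http://cdn.test.com/ (without the extra fonts/ segment) are only illustrative assumptions.

```php
<?php
$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<quoted_content>
        (["']) (?>[^"'\\]++ | \\{2} | \\. | (?!\g{-1})["'] )*+ \g{-1}
    )
    (?<comment> /\* .*? \*/ )
    (?<url_skip> (?: https?: | data: ) [^"'\s)}]*+ )
    (?<other_content>
        (?> [^u}/"']++ | \g<quoted_content> | \g<comment>
          | \Bu | u(?!rl\s*+\() | /(?!\*)
          | \g<url_start> \g<url_skip> ["']?+
        )++
    )
    (?<anchor> \G(?<!^) ["']?+ | @font-face \s*+ { )
    (?<url_start> url\( \s*+ ["']?+ )
)

\g<comment> (*SKIP)(*FAIL) |

\g<anchor> \g<other_content>?+
(?>
    \g<url_start> \K [./]*+ ( [^"'\s)}]*+ )    # url, capture group 8
  |
    } (*SKIP)(*FAIL)
)
~xs
LOD;

// Hypothetical stylesheet: one font URL to rewrite, one image URL to leave alone.
$data = <<<'CSS'
@font-face {
  font-family: 'Font3';
  src: url("../fonts/font3.eot");
}
.x { background: url('images/a.png'); }
CSS;

$result = preg_replace($pattern, 'http://cdn.test.com/$8', $data);
echo $result;
```

With this input, only the URL inside @font-face should be rewritten (its leading ../ is dropped along the way); the background image outside the block is not touched.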

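【Discussion】:

Awesome! Would you mind adding some comments to the regex? I'm a regex noob and would love to know how it works :)
Also, is there a way to invert this so that I can match URLs that are not inside @font-face?
I also noticed that if I add http:// to any of the URLs in the first font-face, the remaining URLs without http:// are not captured.
I just found out that if the font URL looks like: .../fonts/blah.eot or /fonts/blah.eot
Wow! Great explanation! I had never used subpatterns before. A quick question about the resulting URL matches: if they start with / or contain anything like /../ or ../../, is it possible to strip those?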

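【Solution 2】:

Disclaimer: You are probably better off using a library, because this is harder than you think. I also want to start this answer with how to match URLs that are not within @font-face {}. I also assume/define that the braces {} are balanced within @font-face {}.
Note: I am going to use "~" as the delimiter instead of "/", which saves me from escaping slashes later in my expressions. Also note that I will be posting online demos from regex101.com, where I use the g modifier. You should remove the g modifier and just use preg_match_all().
Let's use some regex-fu!

Part 1: Match URLs that are not within @font-face {}

1.1 Matching @font-face {}

Oh yes, this might sound "weird", but you will see why later :)
We need a recursive regex here: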

@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1

demo
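As a quick sanity check of the recursion, here is a sketch on a made-up one-line stylesheet. The second block contains nested braces on purpose:

```php
<?php
// Hypothetical input: two @font-face blocks (one with nested braces) and one other rule.
$css = '@font-face { a: b; } .x { c: d; } @font-face { e { f } g }';

// (?1) re-runs group 1, so nested { } pairs are consumed as a unit.
$pattern = '~@font-face\s*(\{(?:[^{}]+|(?1))*\})~';

preg_match_all($pattern, $css, $m);
print_r($m[0]); // the two @font-face blocks, nested braces included
```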

1.2 Escaping @font-face {}

We will use (*SKIP)(*FAIL) right after the previous regex, which makes the engine skip it. See this answer for how it works.

demo

1.3 Matching url()

We will use something like this:

url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|'|)           # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
(?!["']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url's with http, https or ftp)
(?:[^\\]|\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
\2               # Match what was matched in group 2
\s*              # Match optionally some whitespaces
\)               # Match )

Note that I am using \2 because this will be appended to the previous regex, which already has group 1.
Here is another usage of ("|')(?:[^\\]|\\.)*?\1.

demo
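Used on its own, the quote group is group 1, so the backreference becomes \1. A minimal sketch with a made-up stylesheet (nowdoc, so nothing needs PHP-level escaping):

```php
<?php
// The url() pattern as a standalone regex: the quote is group 1 here.
$pattern = <<<'REGEX'
~url\s*\(\s*("|'|)(?!["']?(?:https?://|ftp://))(?:[^\\]|\\.)*?\1\s*\)~s
REGEX;

// Hypothetical input: single-quoted, absolute (double-quoted) and unquoted URLs.
$css = <<<'CSS'
a { background: url('images/a.png'); }
b { background: url("http://example.com/b.png"); }
c { background: url( c.gif ); }
CSS;

preg_match_all($pattern, $css, $m);
print_r($m[0]); // url('images/a.png') and url( c.gif )
```

The ["']? inside the negative lookahead matters: when the engine backtracks and tries the empty alternative of the quote group, the lookahead still sees the quote followed by http:// and rejects the absolute URL.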

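1.4 Matching the value inside url()

You may have guessed that we need some lookaround-fu. The problem is the lookbehind, since it needs a fixed length. I have a workaround: let me introduce you to the \K escape sequence. It resets the start of the match to the current position in the token list (more info).
Well, let's put \K somewhere in our expression and use a lookahead. Our final regex will be: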

@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1
(*SKIP)(*FAIL)   # Skip it
|                # Or
url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|'|)           # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K               # Reset the match
(?!["']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url's with http, https or ftp)
(?:[^\\]|\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?=              # Lookahead
   \2            # Match what was matched in group 2
   \s*           # Match optionally some whitespaces
   \)            # Match )
)

demo

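1.5 Using the pattern in PHP

We need to escape a few things such as quotes and backslashes (\\\\ = \), and use the right function with the right modifiers: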

$regex = '~
@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1
(*SKIP)(*FAIL)   # Skip it
|                # Or
url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|\'|)          # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K               # Reset the match
(?!["\']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url\'s with http, https or ftp)
(?:[^\\\\]|\\\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?=              # Lookahead
   \2            # Match what was matched in group 2
   \s*           # Match optionally some whitespaces
   \)            # Match )
)
~xs';

$input = file_get_contents($css_file);
preg_match_all($regex, $input, $m);
echo '<pre>'. print_r($m[0], true) . '</pre>';

demo
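Since the match now contains only the URL value, the same pattern can probably be reused for a replacement as well. The following is a sketch, not a tested production rewrite: the CDN host cdn.example.com and the stylesheet are assumptions, and the pattern is written as a nowdoc so no PHP-level escaping is needed.

```php
<?php
$regex = <<<'REGEX'
~
@font-face\s*(\{(?:[^{}]+|(?1))*\})(*SKIP)(*FAIL)   # skip whole @font-face blocks
|
url\s*\(\s*("|'|)\K                                 # url( and optional quote, then reset
(?!["']?(?:https?://|ftp://))                       # leave absolute URLs alone
(?:[^\\]|\\.)*?                                     # the URL value
(?=\2\s*\))                                         # closing quote and )
~xs
REGEX;

// Hypothetical stylesheet.
$css = <<<'CSS'
@font-face { src: url("fonts/a.eot"); }
.x { background: url('images/a.png') no-repeat; }
.y { background: url(http://example.com/b.png); }
CSS;

// $0 is only the URL value, so prefixing it is enough.
$out = preg_replace($regex, 'http://cdn.example.com/$0', $css);
echo $out;
```

With this input, only images/a.png should get the prefix: the font URL is skipped together with its @font-face block, and the absolute URL is rejected by the negative lookahead.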

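Part 2: Match URLs that are within @font-face {}

2.1 A different approach

I want to do this part with 2 regexes, because matching URLs within @font-face {} while keeping track of the state of the braces {} in a recursive regex would be a pain.

Since we already have the pieces we need, we only have to apply them in some code:

Match all @font-face {} instances
Loop through these and match all url()'s

2.2 Putting it into code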

$results = array(); // Just an empty array;
$fontface_regex = '~
@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1
~xs';

$url_regex = '~
url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|\'|)          # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K               # Reset the match
(?!["\']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url\'s with http, https or ftp)
(?:[^\\\\]|\\\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?=              # Lookahead
   \1            # Match what was matched in group 1
   \s*           # Match optionally some whitespaces
   \)            # Match )
)
~xs';

$input = file_get_contents($css_file);

preg_match_all($fontface_regex, $input, $fontfaces); // Get all font-face instances
if(isset($fontfaces[0])){ // If there is a match then
    foreach($fontfaces[0] as $fontface){ // Foreach instance
        preg_match_all($url_regex, $fontface, $r); // Let's match the url's
        if(isset($r[0])){ // If there is a hit
            $results[] = $r[0]; // Then add it to the results array
        }
    }
}
echo '<pre>'. print_r($results, true) . '</pre>'; // Show the results

demo
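If the goal is a replacement rather than extraction, the same two-step idea can be sketched with preg_replace_callback(): the outer regex finds each @font-face block and the inner one rewrites the URL values in it. The CDN host cdn.example.com and the stylesheet below are assumptions for illustration.

```php
<?php
$fontface_regex = '~@font-face\s*(\{(?:[^{}]+|(?1))*\})~';

// Same url() pattern as above, compacted; the quote is group 1 here.
$url_regex = <<<'REGEX'
~url\s*\(\s*("|'|)\K(?!["']?(?:https?://|ftp://))(?:[^\\]|\\.)*?(?=\1\s*\))~s
REGEX;

// Hypothetical stylesheet.
$css = <<<'CSS'
@font-face { src: url("fonts/a.eot"), url("http://example.com/a.woff"); }
.x { background: url('images/a.png'); }
CSS;

$out = preg_replace_callback(
    $fontface_regex,
    function (array $m) use ($url_regex) {
        // $m[0] is one complete @font-face block; rewrite only the URLs inside it.
        return preg_replace($url_regex, 'http://cdn.example.com/$0', $m[0]);
    },
    $css
);
echo $out;
```

Here only fonts/a.eot should be prefixed; the absolute font URL and the image outside @font-face stay as they are.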


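【Discussion】:

Your explanation really helped me understand the more advanced parts of regex! :) Unfortunately I can't accept more than one answer, since I ended up using Casimir et Hippolyte's solution with a few modifications. Still, I gave you an upvote!
Nice work, +1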