How to exclude a word from a regular expression

【Question title】: How to exclude a word from regex
【Posted】: 2022-01-25 22:00:06
【Question】:

I have a working regular expression, but I want it to drop matches that contain a specific word.

/\<meta[^\>]+(http\-equiv[^\>]+?refresh[^\>]+?(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?|(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?http\-equiv[^\>]+?refresh[^\>]+?)\/?\>/is

It matches the following (http-equiv and url in either order):

<meta http-equiv="refresh" content="21;URL='http://example.com/'" />
<meta content="21;URL='http://example.com/'" http-equiv="refresh" />

I want to exclude any URL that contains ?PageSpeed=noscript

a. <meta content="21;URL='http://example.com/?PageSpeed=noscript'" http-equiv="refresh" />
b. <meta content="21;URL='http://example.com/segment?PageSpeed=noscript&var=value'" http-equiv="refresh" />

Any ideas are much appreciated. Thanks.

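【Comments】:

Is it a standard meta tag, or is it always in the semicolon format shown in your example?
It is the standard meta tag for redirecting a page. Essentially, the regex detects whether the page redirects somewhere, so the value in "content" must be non-negative. Finally, the URL must not contain ?PageSpeed=noscript
For this particular case I would just use str_contains, because it makes the exception easier to spot and to comment. That may not be the answer you are looking for, I understand.
You could use a negative lookahead (demo). As mentioned, using a parser is probably a better idea.
@ShivanandSharma Of course it is possible. See the updated answer.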
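
The str_contains suggestion from the comments can be sketched as a two-step check: the asker's original pattern finds the refresh tags, and a plain substring test then discards the PageSpeed case. This is only an illustration; the constant REFRESH_PATTERN and the helper findRefreshTags are hypothetical names that do not appear in the thread, and PHP 8.0 or later is assumed for str_contains.

```php
<?php
// The asker's original pattern, unchanged: it matches a refresh meta tag
// with http-equiv and url in either order.
const REFRESH_PATTERN = '/\<meta[^\>]+(http\-equiv[^\>]+?refresh[^\>]+?(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?|(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?http\-equiv[^\>]+?refresh[^\>]+?)\/?\>/is';

// Hypothetical helper (not from the thread): returns the matched refresh tags
// that do not contain the PageSpeed noscript marker.
function findRefreshTags(string $html): array
{
    preg_match_all(REFRESH_PATTERN, $html, $matches);

    return array_values(array_filter(
        $matches[0],
        // str_contains() is case-sensitive and requires PHP 8.0 or later.
        fn (string $tag): bool => !str_contains($tag, '?PageSpeed=noscript')
    ));
}
```

Keeping the exclusion outside the pattern leaves the regex untouched and makes the exception easy to find and comment, which is the trade-off the commenter describes. It does need a second step after the match, so it will not suit a tool that accepts only a single pattern.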

Tags: php regex

【Solution 1】:

You can use a DOM parser instead of a regular expression.

<?php

$meta = '<meta content="21;URL=\'http://example.com/\'" http-equiv="refresh" /> <meta content="21;URL=\'http://example.com/?PageSpeed=noscript\'" http-equiv="refresh" />';

$dom = new DOMDocument;
$dom->loadHTML($meta);
$noPageScripts = [];

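foreach ($dom->getElementsByTagName('meta') as $tag) {
  $content = $tag->getAttribute('content');
  // Extract the redirect target from the content attribute, e.g. 21;URL='http://example.com/'
  preg_match('/URL=["\']?([^"\'>]+)["\']?/i',$content,$matches);

  // Keep only refresh tags whose URL does not contain ?PageSpeed=noscript
  if(strcasecmp($tag->getAttribute('http-equiv'), 'refresh') === 0 && isset($matches[1]) && stripos($matches[1],'?PageSpeed=noscript') === false) {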
    $noPageScripts[] = [
      'originalTag' => $dom->saveHTML($tag),
      'url' => $matches[1]
    ];
  }
}

var_dump($noPageScripts);

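Here is a fiddle.

【Comments】:

I could use a DOM parser, but this is a malware scanner that works by regex matching against the page source, which is a technical limitation of my software. Still, I am glad your reply will help people who are looking for a DOM parser approach.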
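
For a scanner limited to a single pattern, the negative lookahead suggested in the question comments can be sketched as follows. The original pattern is kept as posted, with one lookahead added immediately after `<meta`; it rejects the tag if ?PageSpeed=noscript appears anywhere before the closing `>`. This is an untested-in-production sketch, not the commenter's actual demo (which is not reproduced in the thread), and the constant and function names are hypothetical.

```php
<?php
// Negative lookahead: the match fails if "?PageSpeed=noscript" occurs
// anywhere between "<meta" and the tag's closing ">".
const EXCLUDE_PAGESPEED = '(?![^>]*\?PageSpeed=noscript)';

// The asker's original pattern with the lookahead inserted after "<meta".
const REFRESH_NO_PAGESPEED = '/\<meta' . EXCLUDE_PAGESPEED
    . '[^\>]+(http\-equiv[^\>]+?refresh[^\>]+?(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?'
    . '|(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?http\-equiv[^\>]+?refresh[^\>]+?)\/?\>/is';

// Hypothetical helper: true when the source contains a refresh redirect
// whose tag does not mention ?PageSpeed=noscript.
function isRefreshRedirect(string $html): bool
{
    return preg_match(REFRESH_NO_PAGESPEED, $html) === 1;
}
```

Because the pattern already uses the `i` flag, the excluded word is matched case-insensitively. The `[^>]*` inside the lookahead keeps the check within the current tag, so a ?PageSpeed=noscript string elsewhere on the page does not suppress the match.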