How to replace glossary terms in HTML text with links?

【Question title】: How to replace glossary terms in HTML text with links?
【Posted】: 2012-02-20 09:43:32
【Question】:

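I want to run a str_replace or preg_replace that looks for certain words (found in $glossary_terms) within my $content and replaces them with links (like <a href="/glossary/initial/term">term</a>).

However, $content is full HTML, so my links and images are affected as well, which is not what I want.

An example of $content is: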

<div id="attachment_542" class="wp-caption alignleft" style="width: 135px"><a href="http://www.seriouslyfish.com/dev/wp-content/uploads/2011/12/Amazonas-English-1.jpg"><img class="size-thumbnail wp-image-542" title="Amazonas English" src="http://www.seriouslyfish.com/dev/wp-content/uploads/2011/12/Amazonas-English-1-288x381.jpg" alt="Amazonas English" width="125" height="165" /></a><p class="wp-caption-text">Amazonas Magazine - now in English!</p></div>
<p>Edited by Hans-Georg Evers, the magazine &#8216;Amazonas&#8217; has been widely-regarded as among the finest regular publications in the hobby since its launch in 2005, an impressive achievment considering it&#8217;s only been published in German to date. The long-awaited English version is just about to launch, and we think a subscription should be top of any serious fishkeeper&#8217;s Xmas list&#8230;</p>
<p>The magazine is published in a bi-monthly basis and the English version launches with the January/February 2012 issue with distributors already organised in the United States, Canada, the United Kingdom, South Africa, Australia, and New Zealand. There are also mobile apps availablen which allow digital subscribers to read on portable devices.</p>
<p>It&#8217;s fair to say that there currently exists no better publication for dedicated hobbyists with each issue featuring cutting-edge articles on fishes, invertebrates, aquatic plants, field trips to tropical destinations plus the latest in husbandry and breeding breakthroughs by expert aquarists, all accompanied by excellent photography throughout.</p>
<p>U.S. residents can subscribe to the printed edition for just $29 USD per year, which also includes a free digital subscription, with the same offer available to Canadian readers for $41 USD or overseas subscribers for $49 USD. Please see the <a href="http://www.amazonasmagazine.com/">Amazonas website</a> for further information and a sample digital issue!</p>
<p>Alternatively, subscribe directly to the print version <a href="https://www.amazonascustomerservice.com/subscribe/index2.php">here</a> or digital version <a href="https://www.amazonascustomerservice.com/subscribe/digital.php">here</a>. Just gonna add this to the end of the post so I can do some testing.</p>

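I came across this link, but I'm not sure whether that approach works with nested HTML.

Is there any way I can str_replace or preg_replace only the content contained within <p> tags, excluding any nested <a>, <img> or <h1/2/3/4/5> tags?

Thanks in advance,

【Comments】: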

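Possible duplicate of "str_replace within certain html tags only"
Possible duplicate? I referenced that thread and said, "not sure whether this approach works with nested HTML".
@dunc: Use $xpath->query("//text()[not(parent::a) and contains(., '$glossary_term')]") and you're all set. The // part takes care of the nesting.
@dunc - Apparently you didn't read the linked answer properly. The accepted answer uses DomDocument and XPath to do the job, and you are strongly advised not to even think about using str_replace or preg_replace.
On the contrary, I read the whole thread, especially the accepted answer. However, I had never come across such functions before, and it wasn't clear to me whether they would do exactly what I need. I also didn't think it prudent or appropriate to bump a question from July 2010.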

Tags: php

【Solution 1】:

The "by the book" solution goes like this:

<?php

$html = "<your HTML string>";
$glossary_terms = array('fishes', 'invertebrates', 'aquatic plants');

$dom = new DOMDocument;
$dom->loadHTML($html);

dom_link_glossary($dom, $glossary_terms);

echo $dom->saveHTML();

// wraps all occurrences of the glossary terms in links
function dom_link_glossary(&$document, &$glossary) {
  $xpath   = new DOMXPath($document);
  $urls    = array();
  $pattern = array();

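  // build a normalized lookup (case-insensitive, whitespace-agnostic)
  foreach ($glossary as $term) {
    $term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
    if ($term_norm === '') continue;  // an empty term would match at every word boundary
    // pass the delimiter so that terms like "wet/dry filter" are escaped correctly
    $pattern[] = preg_replace('/ /', '\\s+', preg_quote($term_norm, '/'));
    $urls[$term_norm] = '/glossary/initial/' . rawurlencode($term);
  }

  if (!$pattern) return;  // no usable terms, nothing to link

  $pattern  = '/\b(' . implode('|', $pattern) . ')\b/i';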
  $text_nodes = $xpath->query('//text()[not(ancestor::a)]');

  foreach($text_nodes as $original_node) {
    $text     = $original_node->nodeValue;
    $hitcount = preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    if ($hitcount == 0) continue;

    $offset   = 0;
    $parent   = $original_node->parentNode;
    $refnode  = $original_node->nextSibling;

    $parent->removeChild($original_node);

    foreach ($matches[0] as $i => $match) {
      $term_txt = $match[0];
      $term_pos = $match[1];
      $term_norm = preg_replace('/\s+/', ' ', strtoupper($term_txt));

      // insert any text before the term instance
      $prefix = substr($text, $offset, $term_pos - $offset);
      $parent->insertBefore($document->createTextNode($prefix), $refnode);

      // insert the actual term instance as a link
      $link = $document->createElement("a", $term_txt);
      $link->setAttribute("href", $urls[$term_norm]);
      $parent->insertBefore($link, $refnode);

      $offset = $term_pos + strlen($term_txt);

      if ($i == $hitcount - 1) {  // last match, append remaining text
        $suffix = substr($text, $offset);
        $parent->insertBefore($document->createTextNode($suffix), $refnode);
      }
    }
  }
}
?>

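dom_link_glossary() works as follows:

It normalizes the glossary terms (trim, uppercase, whitespace) and builds a lookup array plus a regex pattern that matches all terms.
It uses XPath to find all text nodes that are not already part of a link. Text nodes are returned regardless of their nesting depth (i.e. we don't need to recurse). I use \b to prevent partial matches.
For each text node that contains a term:
- The original text node is removed ($parent->removeChild()).
- New nodes are then created and inserted into the DOM: text nodes for anything before (or after) a glossary term, and element nodes (<a>) for the glossary terms themselves.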
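
To illustrate the second step, here is a minimal sketch of the same XPath query run on its own. The markup in $html is a made-up sample, not taken from the question. It shows that text nodes come back from any nesting depth, while anything below an <a> is left out.

```php
<?php
// Hypothetical sample markup; any HTML fragment would do.
$html = '<div><p>Keep <em>aquatic plants</em> and <a href="/x">fishes <b>here</b></a>.</p></div>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect every text node that has no <a> ancestor, at any nesting depth.
$found = array();
foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
  $found[] = $node->nodeValue;
}

print_r($found);
```

The text inside <em> is included although it is nested, and both "fishes " and "here" are skipped because they sit somewhere below the link.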

The solution preserves the original case and whitespace, so

term becomes <a href="/glossary/initial/term">term</a>
Term becomes <a href="/glossary/initial/term">Term</a>
Foo Bar becomes <a href="/glossary/initial/foo%20bar">Foo Bar</a>.
Extra spaces or line breaks in the HTML do not break the mechanism.
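
The matching rules can be checked in isolation. The following is a small sketch that rebuilds the pattern the same way the function does and runs it against a made-up sentence. The glossary and the text are assumptions chosen only to exercise case, whitespace and word boundaries.

```php
<?php
// Hypothetical glossary and text, chosen only to exercise the matching rules.
$glossary = array('foo bar', 'term');

$parts = array();
foreach ($glossary as $term) {
  $norm    = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
  // A single space in the term may be any run of whitespace in the text.
  $parts[] = str_replace(' ', '\\s+', preg_quote($norm, '/'));
}
$pattern = '/\b(' . implode('|', $parts) . ')\b/i';

$text = "A Term, a Foo\n   Bar, but not terms or determine.";
preg_match_all($pattern, $text, $matches);
$hits = $matches[0];

// Normalizing each hit the same way yields the key into the URL lookup.
$keys = array();
foreach ($hits as $hit) {
  $keys[] = preg_replace('/\s+/', ' ', strtoupper($hit));
}
```

Here $hits keeps the original spelling ("Term" and "Foo" followed by a line break and "Bar"), while $keys holds the normalized forms "TERM" and "FOO BAR". Because of \b, "terms" and "determine" are not matched.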

Note that using regex on the values of plain text nodes is perfectly fine. Using regex on complete HTML is not.
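
A short sketch of why that distinction matters. The fragment below is an assumption modelled on the image markup from the question, and the glossary URL is made up. A replacement on the raw HTML string rewrites the title attribute as well, whereas the DOM only exposes the visible text as a text node.

```php
<?php
// Hypothetical fragment modelled on the image markup from the question.
$html = '<p><img title="Amazonas English" src="a.jpg" /> Amazonas Magazine</p>';
$link = '<a href="/glossary/a/amazonas">Amazonas</a>';

// Naive replacement on the whole string: it cannot tell text from attributes.
$broken = str_replace('Amazonas', $link, $html);
echo $broken, "\n";

// The DOM view: attribute values are never returned as text nodes.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$texts = array();
foreach ($xpath->query('//text()') as $node) {
  $texts[] = $node->nodeValue;
}
print_r($texts);
```

The first output contains title="<a href=...", which is broken markup. The second contains only the text that follows the image.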

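I recommend keeping the glossary terms together with their respective URLs in an array instead of computing the URLs inside the function. That way you can have multiple terms point to the same URL.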
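
One possible shape for that, as a sketch only: the helper name build_glossary_lookup() and the URLs are hypothetical and not part of the answer above. It builds the same normalized lookup and pattern, but takes the URL from the array instead of computing it.

```php
<?php
// Hypothetical term => URL map; two spellings share one glossary page.
$glossary = array(
  'invertebrate'   => '/glossary/i/invertebrate',
  'invertebrates'  => '/glossary/i/invertebrate',
  'aquatic plants' => '/glossary/a/aquatic-plants',
);

// Hypothetical helper: returns array($pattern, $urls) built from such a map.
function build_glossary_lookup(array $glossary) {
  $urls  = array();
  $parts = array();
  foreach ($glossary as $term => $url) {
    $norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
    if ($norm === '') continue;  // skip empty terms
    $parts[]     = str_replace(' ', '\\s+', preg_quote($norm, '/'));
    $urls[$norm] = $url;
  }
  // No pattern at all if the map held no usable terms.
  $pattern = $parts ? '/\b(' . implode('|', $parts) . ')\b/i' : null;
  return array($pattern, $urls);
}

list($pattern, $urls) = build_glossary_lookup($glossary);
```

Inside dom_link_glossary(), the foreach over $glossary could then be swapped for a call like this, with the rest of the function unchanged.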

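【Comments】:

Sorry Tomalak - for some reason I didn't see this post until I logged in to write another question. I'll try this tonight, thanks very much.
Hi Tomalak. I've implemented the script, but it links every space: <a href=""></a>All<a href=""></a> <a href=""></a>species<a href=""></a> <a href=""></a>in<a href="">. Any ideas?
@dunc I have tested the function, and it definitely doesn't do that for me. -- If you look closely, it isn't actually linking the spaces. Looking into my crystal ball: could it be that your $glossary_terms contains an empty string?
That's exactly what I thought @Tomalak, but the full array used for the word list (well, a cut-down sample of the one I want to use, and I still can't see anything wrong with it) is here: pastebin.com/wNPby2U3
OK, scrap that. I was doing something odd that stopped your code from working - sorry. I spent about an hour trying to work out why your code worked with your glossary terms but not with mine - wet/dry filter seems to have been the problem! :) Fixed now, thanks very much for your help.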

【Solution 2】:

You could try this:

$content = preg_replace('/(<p\sclass=\"wp\-caption\-text\">)[^<]+(<\/p>)/i', '', $content);

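【Comments】:

Could you explain that line and what exactly it does? I'm not good with regex.
Well, as you can see, in the preg_replace function the first parameter is the <p> tag. Between the tags there is [^<]+, which means anything between the tags. You will also see / before <p> and /i after </p>, which define the start and end. Then in the second parameter there is an empty string to replace anything between the <p> tags (here you can set your own replacement string). I guess this should help.
@dunc: You should never use regex to process HTML. Especially when you're not good at regex.
Yes, I'd rather not, Tomalak - thanks for answering my original question. If you'd like to post it as an answer, I'll happily give you the "tick" :) Also, if you have any links about using $xpath->query, that would be really helpful.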
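
For reference, a quick sketch of what the one-liner from Solution 2 does when run. The $content below is a shortened, made-up stand-in for the sample in the question.

```php
<?php
// Shortened stand-in for the question's $content (the caption text is from the sample).
$content = '<a href="#">x</a>'
         . '<p class="wp-caption-text">Amazonas Magazine - now in English!</p>'
         . '<p>Body text</p>';

// The call from Solution 2, unchanged.
$result = preg_replace('/(<p\sclass=\"wp\-caption\-text\">)[^<]+(<\/p>)/i', '', $content);

echo $result;
```

Since the replacement is an empty string, the whole match is removed, including the opening and closing <p> tags. Using '$1$2' as the replacement would keep the tags and drop only the caption text. Either way, the call does not add any glossary links.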