How can I remove the enclosing tags around a piece of HTML? (Answers)

【Question title】: How can I remove the enclosing tags around a piece of HTML?
【Posted】: 2017-12-05 09:15:00
【Question】:

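I am using the customfilter module to create a custom Drupal filter for text written in AsciiDoc syntax. I enclose the text in [asciidoc][/asciidoc] tags, and when I run it through the asciidoctor command the output comes back wrapped in <div class="paragraph"><p> tags.

This is what I get when I use the [asciidoc] tags to format HTML links: first the AsciiDoc source, then the HTML that asciidoctor produces.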

On the markup side Drupal's contrib `markdown` filter has been somewhat iffy,
and so has the `bbcode` filter. Looking around for other more compact documenting
systems led me to the https://asciidoc.org[Asciidoc] utility and its more
advanced brother https://asciidoctor.org[Asciidoctor]. In combination with another
 Drupal module called https://drupal.org/project/customfilter[customfilter] which
makes it easy to create your own filters, I think I have hit on a combination
of modules which allow me as much freedom and fine control on my pages as I want.

<div class="paragraph">
<p>On the markup side Drupal&#8217;s contrib <code>markdown</code> filter has been somewhat iffy,
and so has the <code>bbcode</code> filter. Looking around for other more compact documenting
systems led me to the <a href="https://asciidoc.org">Asciidoc</a> utility and its more
advanced brother <a href="https://asciidoctor.org">Asciidoctor</a>. In combination with another
 Drupal module called <a href="https://drupal.org/project/customfilter">customfilter</a> which
makes it easy to create your own filters, I think I have hit on a combination
of modules which allow me as much freedom and fine control on my pages as I want.</p>
</div>

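Is there a PHP function that takes an HTML string and a set of enclosing tags and returns the inner HTML they contain? Or perhaps a regular expression that matches the part between the tags?

This is the desired output: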

On the markup side Drupal&#8217;s contrib <code>markdown</code> filter has been somewhat iffy,
and so has the <code>bbcode</code> filter. Looking around for other more compact documenting
systems led me to the <a href="https://asciidoc.org">Asciidoc</a> utility and its more
advanced brother <a href="https://asciidoctor.org">Asciidoctor</a>. In combination with another
 Drupal module called <a href="https://drupal.org/project/customfilter">customfilter</a> which
makes it easy to create your own filters, I think I have hit on a combination
of modules which allow me as much freedom and fine control on my pages as I want.

I have also asked a related question about whether asciidoc can be configured to avoid wrapping the output in <div class="paragraph"><p>...</p></div>: Does asciidoctor have a setting to remove the <paragraph> and <p> tags from the source it outputs?

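【Comments】:

strip-tags is the wrong tool for this purpose. The aim is to strip only the outer tags, because they create extra paragraphs. For example, if I prefer the AsciiDoc URL syntax and wrap a URL as [asciidoc]https://asciidoctor.org[Asciidoctor][/asciidoc], the filter creates a paragraph around the URL, which breaks the flow of the text. Something like strip-enclosing-tags would be preferable.

Tags: php regex html-content-extraction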

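【Solution 1】:

In pure PHP you can use DOMDocument, which I do not recommend here because it is slow and its errors are hard to track down. For the same reason I will not explain more about that object; here is just the link to the official site: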

PHP DomDocument

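Note: when dealing with large texts I personally do prefer DomDocument. For example, I once read a whole page and fetched all of its elements one by one, which is nearly impossible with a regular expression. In that case I used DomDocument.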
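For readers who do want to try the DOM route, the following is only a minimal sketch, assuming the input is a single fragment wrapped in <div class="paragraph"><p>. The helper name inner_html_of_wrapper() is made up for illustration and is not part of PHP, Drupal, or customfilter. The XML declaration prefix is a common workaround to make loadHTML() treat the input as UTF-8.

```php
<?php
// Hypothetical helper: returns the inner HTML of the first
// <div class="paragraph"><p> wrapper found in $html.
function inner_html_of_wrapper(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration hints that the fragment is UTF-8 encoded.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Find the <p> that is a direct child of <div class="paragraph">.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[@class="paragraph"]/p');
    if ($nodes === false || $nodes->length === 0) {
        return $html; // No wrapper found: hand the input back unchanged.
    }

    // Serialize each child of that <p> to rebuild its inner HTML.
    $inner = '';
    foreach ($nodes->item(0)->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}
```

Keep in mind that an HTML parser may restructure invalid markup, such as a <p> nested inside another <p>, so the result can differ from what a plain text match would return.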

Let's get back to your question. Your example shows that you are not parsing a large block of text, so I suggest using a regex.

preg_match_all( '/<p>(?P<content>.*?)<\/p>/s' ,$text, $ref );
var_dump($ref['content']);

The regex above gives you everything between the p tags.
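To see what that call returns, here is a small sketch run against made-up input that mimics the asciidoctor output from the question. The sample text is an assumption for illustration only.

```php
<?php
// Sample input, shortened and simplified from the output shown in the question.
$text = "<div class=\"paragraph\">\n"
      . "<p>First <code>markdown</code> paragraph,\nspanning two lines.</p>\n"
      . "</div>\n"
      . "<div class=\"paragraph\">\n<p>Second paragraph.</p>\n</div>";

// The s modifier lets . match newlines; .*? is lazy, so each match
// stops at the nearest closing </p>.
preg_match_all('/<p>(?P<content>.*?)<\/p>/s', $text, $ref);

// $ref['content'] holds one entry per <p>...</p> pair.
print_r($ref['content']);
```

Because the match is lazy, a <p> nested inside another <p> would end the match early at the inner closing tag, which is the situation the stricter pattern is meant to handle.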

You can play with it and build a new one like this:

preg_match_all('/<div class="paragraph">\s*<p>(?P<content>.*?)<\/p>\s*<\/div>/s', $text, $ref);

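It gives you everything between the opening <div class="paragraph"><p> and the closing </p></div>. As written, it matches only a div whose sole attribute is class="paragraph".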
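The same idea can be packaged as a small function. This is only a sketch, assuming the input contains exactly one <div class="paragraph"><p>...</p></div> wrapper and nothing else. The name strip_enclosing_tags() is hypothetical, borrowed from the wording in the question comments.

```php
<?php
// Hypothetical helper: strips one <div class="paragraph"><p>...</p></div> wrapper.
function strip_enclosing_tags(string $html): string
{
    // \A and \z anchor the pattern to the whole string, so the lazy .*? has to
    // run up to the final </p></div> pair rather than stopping at an inner </p>.
    $pattern = '/\A\s*<div class="paragraph">\s*<p>(?P<content>.*?)<\/p>\s*<\/div>\s*\z/s';
    if (preg_match($pattern, $html, $m) === 1) {
        return $m['content'];
    }
    return $html; // Not wrapped as expected: leave the input untouched.
}

// An inner <p>xxxx</p> does not end the match early.
$wrapped = "<div class=\"paragraph\">\n<p>Before <p>xxxx</p> after.</p>\n</div>";
echo strip_enclosing_tags($wrapped), "\n";
```

This prints Before <p>xxxx</p> after. If the input holds several wrapped paragraphs in a row, the anchored pattern would strip only the very first opening and the very last closing pair, so that case needs a different approach.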

See also the regex link below:

Regex Tutorial

Good luck.

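【Comments】:

Thanks for your help. I noticed that the second regex does not remove the first <p> from the output. How can that be fixed?
When I insert <p>xxxx</p> into the text, the match ends at the newly introduced </p>, when it should end at the </p> just before the closing div that pairs with the opening <div class="paragraph">. How can that be fixed as well?
You're welcome. Regarding your first comment: yes, the second regex also returns the p elements, but the first one is what you want. It gives you everything between the p tags. See also my new edit. Regarding your second comment, I don't quite understand what you mean; I hope my new edit helps.
@vfclists Try this: <div class="paragraph">\s<p>(?P<content>.*?)<\/*p>\s<\/*div>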
Thanks. This one did the trick!! It still works even with extra <p> and </p> tags inserted.