Extract innerHTML from multiple tags: answers

【Question title】: Extract innerHTML from multiple tags
【Posted】: 2014-10-27 03:35:44
【Question description】:

My task is to extract the inner HTML text from HTML links using Perl.

Here is an example:

<a href="www.stackoverflow.com">Regex Question</a>

I want to extract the string: Regex Question

Note that the inner text might be empty, like this. This example yields an empty string.

<a href="www.stackoverflow.com"></a>

The inner text might also be enclosed in multiple tags, like this.

<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>

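I have been trying to write a Perl regex for a while, without success. In particular, I don't know how to deal with the multiple tags.

【Question comments】:

Why use a regex instead of a parser?
Actually, what do you mean by "deal with" them? If they sit between the a-tags, they will be matched, right? Perl has some very good HTML parser modules available.

Tags: html regex perl text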

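【Solution 1】:

Use an HTML parser to parse HTML.

If you need to download content from the web, I recommend taking a look at Mojo::DOM and Mojo::UserAgent.

The following pulls every link whose href contains stackoverflow.com and prints the text inside it: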

use strict;
use warnings;

use Mojo::DOM;
use Data::Dump;

my $dom = Mojo::DOM->new(do {local $/; <DATA>});

for my $link ($dom->find('a[href*="stackoverflow.com"]')->each) {
    dd $link->all_text;
}

__DATA__
<html>
<body>
<a href="www.stackoverflow.com">Regex Question</a>
I want to extract the string: Regex Question

<a href="www.notme.com">Don't want this link</a>
Note that, the inner text might be empty like this. This example get an empty string.

<a href="www.stackoverflow.com"></a>
and the inner text might be enclosed with multiple tags like this.

<a href="www.stackoverflow.com"><b><h2>Regex Question with tags</h2></b></a>
</body>
</html>

Output:

"Regex Question"
""
"Regex Question with tags"

For a helpful 8-minute introductory video, check out Mojocast Episode 5.

【Comments】:

【Solution 2】:

<a[^>]*>(?:<[^>]*>)*([^<>]*)(?:<[^>]*>)*<\/a>

Try this. See the demo. Grab the capture or the match.

http://regex101.com/r/sU3fA2/1
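
As a rough illustration of how the pattern above could be applied in Perl, the following sketch runs it against the three sample links from the question. The helper name `link_texts` and the variable `$LINK_TEXT` are made up for this example and are not part of the answer; capture group 1 is assumed to be "the capture" the answer refers to.

```perl
use strict;
use warnings;

# Pattern from this answer; group 1 holds the text between the tags.
my $LINK_TEXT = qr{<a[^>]*>(?:<[^>]*>)*([^<>]*)(?:<[^>]*>)*<\/a>};

# Hypothetical helper: returns the captured text of every match in $html.
sub link_texts {
    my ($html) = @_;
    my @texts;
    while ($html =~ /$LINK_TEXT/g) {
        push @texts, $1;
    }
    return @texts;
}

my @samples = (
    '<a href="www.stackoverflow.com">Regex Question</a>',
    '<a href="www.stackoverflow.com"></a>',
    '<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>',
);

for my $html (@samples) {
    my ($text) = link_texts($html);
    print defined $text ? "[$text]\n" : "no match\n";
}
```

With these three inputs the sketch should print the text in brackets for the first and third link, and empty brackets for the empty link.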

【Comments】:

It works, except that it also matches the outer tags, e.g. <a><a><a><a><a>kbjhkb</a>
I don't know HTML tags very well, but it still doesn't match <a><b>hello</b>world</a>.
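
The two inputs mentioned in these comments can be checked with a small sketch like the one below. The helper `first_match` is a hypothetical name introduced only for this illustration; it wraps the pattern from this solution in an extra group so that both the whole match and the captured text are available.

```perl
use strict;
use warnings;

# Pattern from Solution 2, unchanged.
my $LINK_TEXT = qr{<a[^>]*>(?:<[^>]*>)*([^<>]*)(?:<[^>]*>)*<\/a>};

# Hypothetical helper: returns (whole match, captured text),
# or an empty list when the pattern does not match at all.
sub first_match {
    my ($html) = @_;
    return $html =~ /($LINK_TEXT)/ ? ($1, $2) : ();
}

for my $html ('<a><a><a><a><a>kbjhkb</a>', '<a><b>hello</b>world</a>') {
    my ($whole, $text) = first_match($html);
    if (defined $whole) {
        print "input: $html\n  match: $whole\n  text:  $text\n";
    }
    else {
        print "input: $html\n  no match\n";
    }
}
```

For the first input the whole match starts at the outermost `<a>`, because the leading run of tags is consumed by `(?:<[^>]*>)*`. For the second input there is no match, since the pattern only allows a single run of text between the tags.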

【Solution 3】:

Parsing HTML with regular expressions is a bad idea; you are not Chuck Norris. You can use the Mojo::DOM module, which makes this task very easy.

A sample:

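use strict;
use warnings;
use feature 'say';

use Mojo::DOM;

# Parse
my $dom = Mojo::DOM->new('<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>');

# Find: all_text also collects text from nested tags,
# whereas text only returns the link's own direct text.
say $dom->at('a')->all_text;
say for $dom->find('a')->map('all_text')->each;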

To install Mojo::DOM, just run the following command:

$ cpan Mojo::DOM

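【Comments】:

【Solution 4】:

You should use an HTML parser, but it can probably be done with a regex.
This finds pairs of opening and closing A-tags that contain no nested A-tags,
while allowing other tags to appear in the content.
If you want the content of a-tags with no other tags in it at all, the regex would be slightly different (not shown).

Since you are using Perl, this might work.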

 # =~ /(?s)<a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)(?<!\/>)((?:(?!(?><a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)|<\/a\s*>)).)*)<\/a\s*>/

 (?s)
 <a                            # Begin A-tag, must (should) contain attrib/val's
 (?>
      \s+                      # (?!\s) add this if you think malformed '<a  >' could slip by
      (?: " .*? " | ' .*? ' | [^>]*? )+
      >
 )
 (?<! /> )                     # Lookbehind, ensure this is not a closed A-tag '<a/>'
 (                             # (1 start), Capture Content between open/close A-tags
      (?:                           # Cluster, match content
           (?!                           # Negative assertion
                (?>
                     <a                            # Not Start A-tag
                     (?>
                          \s+  
                          (?: " .*? " | ' .*? ' | [^>]*? )+
                          >
                     )
                  |  </a \s* >                     #  and Not End A-tag
                )
           )
           .                             # Assert passed, consume a content character 
      )*                            # End Cluster, do 0 to many times
 )                             # (1 end)
 </a \s* >                     # End A-tag
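
A minimal usage sketch for the pattern above, assuming the one-line form is copied verbatim into `qr{}`. The helper names `inner_html_of_links` and `strip_tags` are hypothetical and exist only for this illustration. Because group 1 captures the inner HTML including any nested tags, the sketch strips tags in a second, deliberately crude step to get at the text the question asks for.

```perl
use strict;
use warnings;

# One-line pattern from this answer, copied verbatim; group 1 is the inner HTML.
my $A_TAG = qr{(?s)<a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)(?<!\/>)((?:(?!(?><a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)|<\/a\s*>)).)*)<\/a\s*>};

# Hypothetical helper: inner HTML of every matched <a ...>...</a> pair.
sub inner_html_of_links {
    my ($html) = @_;
    my @inner;
    while ($html =~ /$A_TAG/g) {
        push @inner, $1;
    }
    return @inner;
}

# Hypothetical helper: crude tag stripping, good enough for these samples only.
sub strip_tags {
    my ($fragment) = @_;
    (my $text = $fragment) =~ s/<[^>]*>//g;
    return $text;
}

my $html = join "\n",
    '<a href="www.stackoverflow.com">Regex Question</a>',
    '<a href="www.stackoverflow.com"></a>',
    '<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>';

for my $inner (inner_html_of_links($html)) {
    printf "inner=[%s] text=[%s]\n", $inner, strip_tags($inner);
}
```

Note that the pattern requires whitespace after `<a`, so a bare `<a>` without attributes is not treated as an opening tag by this sketch.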

【Comments】:

【Solution 5】:

How about

(?<=>)[^<>\/]*(?=<\/)

This will match the string: Regex Question

Example: http://regex101.com/r/sG4bZ1/1

【Comments】:

This one looks simple, but it does not match the empty string.