Retain HTML tags in output from HTML file

【Question title】: Retain HTML tags in output from HTML file
【Posted】: 2017-05-01 18:54:30
【Question】:

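I am trying to extract both the text and the tags from an HTML web page into a text file.

Here is the content of the input web page (as seen in view-source mode):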

<div class="moduleBody">In addition, <b>ABC provides</b> dual finishing and detailing <u>products</u>, including a system of cleaners, dressings, polishes, waxes and other products.</p><p></p><p>Safety and Graphics Business</p><p></p><p>The Safety and Graphics segment serves a range of markets for the safety, security and productivity of people, facilities and systems. Its <b>product offerings</b> include personal protection products, such as <u>respiratory, hearing, eye and fall protection</u> equipment;<div class="moreLink">

The following code works fine for extracting the text alone, but it strips <p>, </p>, <u>, </u>, <b>, </b> and the other HTML tags, which I want to retain.

use WWW::Mechanize;

use threads;

my $mech = WWW::Mechanize->new;

my $Lvalue = "";

$mech->get($link);
$mech->quiet(1);

my $p = HTML::TokeParser->new(\$mech->content);

while ( my $tag1 = $p->get_tag('div') ) {

    if ( $tag1->[1]{class} and $tag1->[1]{class} eq 'moduleBody' ) {

        $Lvalue = $p->get_trimmed_text("moreLink");
        $Lvalue =~ s/$find1/|/g;
        $Lvalue =~ s/$find2/|/g;

        print $fh "$ticker^|$Lvalue\n";
    }
}

The output of the above code is:

In addition, ABC provides dual finishing and detailing products, including a system of cleaners, dressings, polishes, waxes and other products. Safety and Graphics Business The Safety and Graphics segment serves a range of markets for the safety, security and productivity of people, facilities and systems. Its product offerings include personal protection products, such as respiratory, hearing, eye and fall protection equipment;

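In effect, my code removes the HTML tags that I want to keep. I suspect that get_trimmed_text needs to be adjusted so that it retains the p, /p, b and /b (and other HTML) tags. Can anyone help with the necessary changes to the code?

To state the requirement clearly: I am looking for a Perl function that extracts everything (text plus all HTML tags) located between <div class="moduleBody"> and <div class="moreLink"> on a web page (as in the sample input above). I am open to using functions other than get_trimmed_text.

Many thanks.

Answer to this question, for the general audience: the reply provided by @SinanÜnür works well. Thank you, @SinanÜnür! +1 and marked as the answer. For the benefit of the wider audience, note that Sinan Ünür's code works as-is as long as you keep the HTML content in the "my $html = <<HTML;" variable. If you are reading from a URL, you need to adjust the code slightly to include the following: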

use LWP::Simple;
my $url = "http://www.example.com/profile?item=66&class=XYZ";
my $html = get($url);
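
One caveat about the snippet above: LWP::Simple's get() returns undef when the request fails, so it may be worth checking the result before handing it to the parser. The following is only a minimal sketch; the helper name fetch_html is hypothetical and not part of the original post.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Hypothetical helper (not from the original post): fetch a page, or die
# with a readable message, since get() returns undef when the request fails.
sub fetch_html {
    my ($url) = @_;
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    return $html;
}

# Fetch only when a URL is passed on the command line, e.g.
#   perl fetch.pl 'http://www.example.com/profile?item=66&class=XYZ'
if (@ARGV) {
    my $html = fetch_html($ARGV[0]);
    printf "Fetched %d characters\n", length $html;
}
```

The returned string can then be passed by reference to the parser, in the same way as the $html variable in the accepted answer.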

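【Comments on the question】:

Well, I provided a self-contained example. You can adapt it to however you obtain the source HTML.

Tags: perl

【Solution 1】:

Answer updated after the question was updated.

I am looking for a Perl function that extracts everything (text plus all HTML tags) located between <div class="moduleBody"> and <div class="moreLink"> on a web page (as in the sample input above).

HTML::TokeParser is a stream parser: you ask it for tokens or for tags (which are a specific kind of token). So, with this module, you ask the parser to find the next div, check whether it has the right class and, if so, start accumulating the content of all subsequent tokens until you reach the <div class="moreLink"> start tag.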

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $html = <<HTML;
<div class="moduleBody">In addition, <b>ABC provides</b>
dual finishing and detailing <u>products</u>, including a system of cleaners,
dressings, polishes, waxes and other products.</p><p></p><p>Safety and Graphics
Business</p><p></p><p>The Safety and Graphics segment serves a range of markets
for the safety, security and productivity of people, facilities and systems.
Its <b>product offerings</b> include personal protection products, such as
<u>respiratory, hearing, eye and fall protection</u> equipment;<div
class="moreLink">
HTML

my $p = HTML::TokeParser::Simple->new(\$html);
my $start = { tag => 'div', class => 'moduleBody' };
my $end = { tag => 'div', class => 'moreLink' };

while ( defined(my $chunk = extract_html_between($p, $start, $end)) ) {
    print "[[[$chunk]]]\n"
}

sub extract_html_between {
    my $p = shift;
    my $start = shift;
    my $end = shift;

    my $chunk;
    while (my $tag = $p->get_tag($start->{tag})) {
        my $class = $tag->get_attr('class');
        next unless $class and $class eq $start->{class};

        $chunk = $tag->as_is; # only if you want the opening div
        CHUNK:
        while (my $token = $p->get_token) {
            if ( $token->is_start_tag($end->{tag}) ) {
                $class = $token->get_attr('class');
                last CHUNK if $class and $class eq $end->{class};
            }
            $chunk .= $token->as_is;
        }
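        return $chunk; # one chunk per call; the parser remembers its position
    }

    return; # no (further) start tag with the requested class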
}

Output:

[[[<div class="moduleBody">In addition, <b>ABC provides</b>
dual finishing and detailing <u>products</u>, including a system of cleaners,
dressings, polishes, waxes and other products.</p><p></p><p>Safety and Graphics
Business</p><p></p><p>The Safety and Graphics segment serves a range of markets
for the safety, security and productivity of people, facilities and systems.
Its <b>product offerings</b> include personal protection products, such as
<u>respiratory, hearing, eye and fall protection</u> equipment;]]]

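【Comments】:

【Solution 2】:

This is very odd code. You aren't using WWW::Mechanize for anything except fetching the web page, so you might as well use LWP::UserAgent directly. Also, if you want to extract part of an HTML resource and print it, HTML::TokeParser is not the right tool.

You don't even seem to have read the documentation, because $p->get_trimmed_text("moreLink") returns all the text up to the first occurrence of a <moreLink> element, which is not a valid HTML tag. What you passed is the value of a div element's class attribute, not a tag name.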
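
To illustrate the point about get_trimmed_text, here is a minimal sketch (not part of the original answer; the sample HTML and the helper name text_after_first_div are made up). It suggests that the argument is compared against tag names, so a class value such as "moreLink" never ends the scan, while a tag name such as "div" does.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample input, only for illustration.
my $html = '<div class="moduleBody">Keep <b>this</b> '
         . '<div class="moreLink">but not this</div></div>';

# Return the trimmed text that follows the first <div>, reading up to $end_tag.
sub text_after_first_div {
    my ($end_tag) = @_;
    my $p = HTML::TokeParser->new(\$html);
    $p->get_tag('div');    # position the parser just after the opening div
    return $p->get_trimmed_text($end_tag);
}

# "moreLink" is a class value, not a tag name, so nothing stops the scan early.
my $all  = text_after_first_div('moreLink');

# "div" is a tag name, so the scan stops at the next <div ...> start tag.
my $upto = text_after_first_div('div');

print "moreLink: $all\n";
print "div:      $upto\n";
```

In both calls the <b> markup is dropped, because get_trimmed_text only ever returns text. That is why keeping the tags requires collecting the raw tokens instead.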

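I would choose Mojolicious for this, because it fetches the page, builds a DOM, and stringifies the element you specify without needing any other modules.

I have written this, but I am currently unable to test it.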

use strict;
use warnings 'all';

use Mojo::UserAgent;

use constant URL => 'http://example.com/';

my $ua = Mojo::UserAgent->new;

my $txn = $ua->get(URL);

if ( my $err = $txn->error ) {
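    # code is undefined for connection errors, so skip undefined fields
    die join(' ', grep { defined } @{$err}{qw/ code message /}), "\n";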
}

print $txn->res->dom->at('div.moduleBody')->to_string;

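【Comments】:

Given the OP's HTML, it will print an extra <div class="moreLink"> </div></div> at the end. That is the benefit of using a stream parser instead of constructing a DOM: no matter how broken the HTML is, you can stop wherever you like.
@SinanÜnür: I agree. Before the OP's update, everything seemed to indicate that they just wanted a verbatim version of the HTML shown. I am still puzzled by the two substitutions. Either way, I am leaving this here for others who arrive with less arcane requirements.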