Perl Mojo::DOM to find and replace html blocks: answers

【Question title】: Perl Mojo::DOM to find and replace html blocks
【Posted】: 2014-07-26 16:59:25
【Question】:

Since everyone here suggests using the Perl module Mojo::DOM for this task, I would like to ask how to do it with that module.

I have this HTML code in a template:

some html content here top base
<!--block:first-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here bottom base

What I want to do (please don't suggest a templating module again) is first find the innermost block:

        <!--block:third-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->

then replace it with some HTML code, then find the second block:

<!--block:second-->
    some html content here 2 top
    <!--block:third-->
        some html content here 3a
        some html content here 3b
    <!--endblock-->
    some html content here 2 bottom
<!--endblock-->

then replace it with some HTML code, then find the third and outermost block:

<!--block:first-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->

【Question comments】:

Tags: html perl dom mojo

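【Solution 1】:

I wouldn't recommend Mojo::DOM for this task, as it is probably overkill, but... you can.

The real answer is the one I have already given in other questions: use an existing framework such as Template::Toolkit. It is powerful, well tested, and fast, because it can cache templates.

However, you want to roll your own templating solution. Any such solution should include parsing, validation, and execution phases. We will focus on just the first two, since you have shared no real information about the last.

Mojo::DOM won't perform any real magic here. Its benefit and strength is that it can parse HTML fully and easily, catching all the potential edge cases. It can only help with the parsing phase of templating, because validation is decided by your own rules. In fact, it basically serves as a drop-in replacement for split in my earlier solution, which is why it is probably too heavy a tool for the job.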
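To make the comparison with split concrete, here is a minimal sketch of the tokenizing step on its own. It reuses the delimiter pattern from parse_using_split; the helper name tokenize and the one-line sample template are only illustrative.

```perl
use strict;
use warnings;

# Tokenize a template on block markers. Because the pattern has two
# capture groups, split returns them between the pieces of HTML:
# (html, block, endblock, html, block, endblock, html, ...).
sub tokenize {
    my ($template) = @_;
    return split m{<!--\s*(?:block:(.*?)|(endblock))\s*-->}s, $template;
}

my @tokens = tokenize("top<!--block:first-->inner<!--endblock-->bottom");

# The first token is always plain HTML; the rest come in groups of three.
my $first_html = shift @tokens;
while (@tokens) {
    my ($block, $endblock, $html) = splice @tokens, 0, 3;
    my $kind = defined $block ? "start of '$block'" : "end of block";
    print "$kind, followed by: $html\n";
}
```

Exactly one of the two captures is defined in each group, which is what lets the parser tell a block start from an endblock without matching the marker a second time.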

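Because the modification isn't difficult, I have written up a full solution below. However, to make things more interesting, and to try to prove one of my main points, it's time to share some Benchmark testing between the 3 available solutions:

Mojo::DOM for parsing, as demonstrated below.
split for parsing, as I proposed in Match nested html comment blocks regex
recursive regex, as proposed by sln in Perl replace nested blocks regex

The following contains all three solutions: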

use strict;
use warnings;

use Benchmark qw(:all);

use Mojo::DOM;
use Data::Dump qw(dump dd);

my $content = do {local $/; <DATA>};

#dd parse_using_mojo($content);
#dd parse_using_split($content);
#dd parse_using_regex($content);

timethese(100_000, {
    'regex' => sub { parse_using_regex($content) },
    'mojo' => sub { parse_using_mojo($content) },
    'split' => sub { parse_using_split($content) },
});

sub parse_using_mojo {
    my $content = shift;

    my $dom = Mojo::DOM->new($content);

    # Resulting Data Structure
    my @data = ();

    # Keep track of levels of content
    # - This is a throwaway data structure to facilitate the building of nested content
    my @levels = ( \@data );

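    # Written against the 2014 Mojo::DOM API. As far as I know, later Mojolicious
    # releases renamed all_contents to descendant_nodes and node to type, so
    # check the documentation for your installed version.
    for my $html ($dom->all_contents->each) {
        if ($html->node eq 'comment') {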
            # Start of Block - Go up to new level
            if ($html =~ m{^<!--\s*block:(.*)-->$}s) {
                #print +('  ' x @levels) ."<$1>\n";  # For debugging
                my $hash = {
                    block   => $1,
                    content => [],
                };
                push @{$levels[-1]}, $hash;
                push @levels, $hash->{content};
                next;

            # End of Block - Go down level
            } elsif ($html =~ m{^<!--\s*endblock\s*-->$}) {
                die "Error: Unmatched endblock found before " . dump($html) if @levels == 1;
                pop @levels;
                #print +('  ' x @levels) . "</$levels[-1][-1]{block}>\n";  # For debugging
                next;
            }
        }

        push @{$levels[-1]}, '' if !@{$levels[-1]} || ref $levels[-1][-1];
        $levels[-1][-1] .= $html;
    }
    die "Error: Unmatched start block: $levels[-2][-1]{block}" if @levels > 1;

    return \@data;
}


sub parse_using_split {
    my $content = shift;

    # Tokenize Content
    my @tokens = split m{<!--\s*(?:block:(.*?)|(endblock))\s*-->}s, $content;

    # Resulting Data Structure
    my @data = (
        shift @tokens, # First element of split is always HTML
    );

    # Keep track of levels of content
    # - This is a throwaway data structure to facilitate the building of nested content
    my @levels = ( \@data );

    while (@tokens) {
        # Tokens come in groups of 3.  Two capture groups in split delimiter, followed by html.
        my ($block, $endblock, $html) = splice @tokens, 0, 3;

        # Start of Block - Go up to new level
        if (defined $block) {
            #print +('  ' x @levels) ."<$block>\n"; # For Debugging
            my $hash = {
                block    => $block,
                content  => [],
            };
            push @{$levels[-1]}, $hash;
            push @levels, $hash->{content};

        # End of Block - Go down level
        } elsif (defined $endblock) {
            die "Error: Unmatched endblock found before " . dump($html) if @levels == 1;
            pop @levels;
            #print +('  ' x @levels) . "</$levels[-1][-1]{block}>\n"; # For Debugging
        }

        # Append HTML content
        push @{$levels[-1]}, $html;
    }
    die "Error: Unmatched start block: $levels[-2][-1]{block}" if @levels > 1;

    return \@data;
}


sub parse_using_regex {
    my $content = shift;
    my $href = {};
    ParseCore( $href, $content );

    return $href;
}


sub ParseCore
{
    my ($aref, $core) = @_;

        # Set the error mode on/off here ..
    my $BailOnError = 1;
    my $IsError = 0;

    my ($k, $v);
    while ( $core =~ /(?is)(?:((?&content))|(?><!--block:(.*?)-->)((?&core)|)<!--endblock-->|(<!--(?:block:.*?|endblock)-->))(?(DEFINE)(?<core>(?>(?&content)|(?><!--block:.*?-->)(?:(?&core)|)<!--endblock-->)+)(?<content>(?>(?!<!--(?:block:.*?|endblock)-->).)+))/g )
    {
       if (defined $1)
       {
         # CONTENT
           $aref->{content} .= $1;
       }
       elsif (defined $2)
       {
         # CORE
           $k = $2; $v = $3;
           $aref->{$k} = {};
 #         $aref->{$k}->{content} = $v;
 #         $aref->{$k}->{match} = $&;

           my $curraref = $aref->{$k};
           my $ret = ParseCore($aref->{$k}, $v);
           if ( $BailOnError && $IsError ) {
               last;
           }
           if (defined $ret) {
               $curraref->{'#next'} = $ret;
           }
       }
       else
       {
         # ERRORS
           print "Unbalanced '$4' at position = ", $-[0];
           $IsError = 1;

           # Decide to continue here ..
           # If BailOnError is set, just unwind recursion. 
           # -------------------------------------------------
           if ( $BailOnError ) {
              last;
           }
       }
    }
    return $k;
}


__DATA__
some html content here top base
<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base

some html content here 6-8 top base
<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base
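
The structure returned by parse_using_split and parse_using_mojo is a nested array of plain strings and { block => ..., content => [...] } hashes. The answer leaves the execution phase out, so the following is only a sketch of one way it could look, walking the tree depth-first so that the innermost block is replaced first, as the question asks. The function render_blocks and the handler table are hypothetical and not part of the original code.

```perl
use strict;
use warnings;

# Render a parsed tree depth-first. $tree is an array ref of plain strings
# and { block => NAME, content => [...] } hashes, the shape built by
# parse_using_split. $handlers is hypothetical: block name => code ref that
# receives the already-rendered inner text and returns its replacement.
sub render_blocks {
    my ($tree, $handlers) = @_;
    my $out = '';
    for my $node (@$tree) {
        if (ref $node eq 'HASH') {
            # Recurse first, so inner blocks are replaced before outer ones.
            my $inner   = render_blocks($node->{content}, $handlers);
            my $handler = $handlers->{ $node->{block} };
            $out .= $handler ? $handler->($inner) : $inner;
        }
        else {
            $out .= $node // '';
        }
    }
    return $out;
}

my $tree = [
    "top ",
    { block => 'first', content => [
        "1 top ",
        { block => 'second', content => ["2 body"] },
        " 1 bottom",
    ] },
    " bottom",
];

print render_blocks($tree, {
    second => sub { "[second: $_[0]]" },
    first  => sub { "[first: $_[0]]" },
}), "\n";
```

A block with no handler is passed through unchanged, which is just one possible default; dying on an unknown block name would be an equally reasonable validation rule.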

Benchmark results for a simple template with 3 nested blocks:

Benchmark: timing 100000 iterations of mojo, regex, split...
      mojo: 50 wallclock secs (50.36 usr +  0.00 sys = 50.36 CPU) @ 1985.78/s (n=100000)
     regex: 14 wallclock secs (13.42 usr +  0.00 sys = 13.42 CPU) @ 7453.79/s (n=100000)
     split:  2 wallclock secs ( 2.70 usr +  0.00 sys =  2.70 CPU) @ 37050.76/s (n=100000)

Normalized to regex at 100%, mojo takes 375% of the time and split takes 20%.

For the more complicated template included in the code above:

Benchmark: timing 100000 iterations of mojo, regex, split...
      mojo: 237 wallclock secs (236.61 usr +  0.02 sys = 236.62 CPU) @ 422.61/s (n=100000)
     regex: 46 wallclock secs (47.25 usr +  0.00 sys = 47.25 CPU) @ 2116.31/s (n=100000)
     split:  7 wallclock secs ( 6.65 usr +  0.00 sys =  6.65 CPU) @ 15046.64/s (n=100000)

Normalized to regex at 100%, mojo takes 501% of the time and split takes 14% (7 times faster).
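
The normalized figures come straight from the CPU seconds reported by Benchmark. As a quick check, this small sketch recomputes them with regex as the baseline; the helper name relative_percent is just for illustration.

```perl
use strict;
use warnings;

# CPU seconds per 100,000 iterations, copied from the two runs above.
my %cpu = (
    simple  => { mojo => 50.36,  regex => 13.42, split => 2.70 },
    complex => { mojo => 236.62, regex => 47.25, split => 6.65 },
);

# Express a timing as a percentage of a baseline timing.
sub relative_percent {
    my ($seconds, $baseline) = @_;
    return 100 * $seconds / $baseline;
}

for my $run (qw(simple complex)) {
    my $t = $cpu{$run};
    printf "%-8s mojo %.0f%%  regex 100%%  split %.0f%%  (split is %.1fx faster than regex)\n",
        $run,
        relative_percent($t->{mojo},  $t->{regex}),
        relative_percent($t->{split}, $t->{regex}),
        $t->{regex} / $t->{split};
}
```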

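Does speed matter?

As demonstrated above, there is no doubt that my split solution is faster than any of the other solutions so far. This should not be a surprise: it is an extremely simple tool, and therefore fast.

In truth, though, speed doesn't matter.

Why not? Because whatever data structure you build from parsing and validating the template can be cached and reloaded each time you want to execute the template, until the template changes.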
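
The caching idea can be sketched roughly as follows, assuming the core Storable module for serialization and a file modification time check to detect template changes. The function name load_parsed_template and the ".cache" file naming are inventions for this sketch, not something the original answer specifies.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Return the parsed structure for $template_file, reparsing only when the
# template is newer than its cache file. $parser is any code ref that turns
# template text into a data structure (for example \&parse_using_split).
# The "$template_file.cache" naming is an arbitrary choice for this sketch.
sub load_parsed_template {
    my ($template_file, $parser) = @_;
    my $cache_file = "$template_file.cache";

    my $template_mtime = (stat $template_file)[9];
    die "Cannot stat $template_file: $!" unless defined $template_mtime;

    # mtime has one-second resolution, so an edit made in the same second
    # the cache was written would go unnoticed by this simple check.
    if (-e $cache_file && (stat $cache_file)[9] >= $template_mtime) {
        return retrieve($cache_file);    # cache is current
    }

    open my $fh, '<', $template_file or die "Cannot open $template_file: $!";
    my $content = do { local $/; <$fh> };
    close $fh;

    my $data = $parser->($content);
    store($data, $cache_file);
    return $data;
}
```

With something like this in place, the parser runs once per template change rather than once per request, which is why the parsing benchmark above has so little bearing on the final choice.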

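Final decision

Since speed is irrelevant once you cache, what you should focus on instead is how readable the code is, how fragile it is, how easily it can be extended and debugged, and so on.

As much as I appreciate a well-crafted regex, they tend to be fragile. Putting all of your parsing and validation logic into a single line of code is just asking for trouble.

That leaves either the split solution or mojo.

If you cache as I described, you can choose either one without concern. The code I provided for each is basically the same with slight variations, so it comes down to personal preference. Keep in mind, though, that split is roughly 20-35 times faster for the initial parse than the arguably more maintainable code that uses an actual HTML parser.

Good luck choosing your final method. I still pray you'll go with TT one day, but pick your own poison :)

【Discussion】:

@sln I know your preference, but here is the benchmarking code in case you're curious.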