Replace specific inline CSS with HTML counterpart in Perl

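【Question title】: Replace specific inline CSS with HTML counterpart in Perl
【Posted】: 2009-11-10 04:54:48
【Question】:

This is my first time using Stack Overflow, so please let me know if I've done something wrong.

I'm currently trying to write a "scraper", for lack of a better term, that will pull in HTML and replace certain inline CSS styles with their HTML counterparts. For example, I have this HTML: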

<p style="text-align:center"><span style="font-weight:bold;font-style:italic;">Some random text here. What's here doesn't matter so much as what needs to happen around it.</span></p>

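I want to be able to replace font-weight:bold with the corresponding bold tag, font-style:italic with the corresponding italic tag, and text-align:center with <center>. Afterwards, I'll use a regex to strip all non-basic HTML tags and any attributes. KISS definitely applies here.

I've read this question: Convert CSS Style Attributes to HTML Attributes using Perl, along with a few others about using HTML::TreeBuilder and other modules (such as HTML::TokeParser), but so far I keep tripping over myself.

I'm new to Perl, but not new to coding in general. The logic is the same.

Here's what I have so far: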

#!/usr/bin/perl
use warnings;
use strict;

use HTML::TreeBuilder;

my $newcont = ""; #Has to be set to something? I've seen other scripts where it doesn't...this is confusing.
my $html = <<HTML;
<p style="text-align:center"><span style="font-weight:bold;font-style:italic;">Some random text here. What's here doesn't matter so much as what needs to happen around it.</span> And sometimes not all the text is styled the same.</p>
HTML

my $tb = HTML::TreeBuilder->new_from_content($html);
my @spans = $tb->look_down(_tag => q{span}) or die qq{look_down for tag failed: $!\n};

for my $span (@spans){
    #What next?? A print gives HASH, not really workable. Split doesn't seem to work...I've never felt like such a noobie coder before.
}

print $tb->as_HTML;

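Hopefully someone can help me out, tell me what I might be doing wrong, and so on. I'm really curious what other ways there might be to do this, or whether it has been done before.

Also, if anyone could suggest which tags I should use on this question, that would be great. The only one I know for sure applies is perl.

【Comments】:

Why don't you just use a simple search and replace in Perl: perl -pi -e 's/find/replace/g' file_name
You can run it on the command line 3 times for the 3 replacements.
@John - Because the problem is more complex than a simple search-and-replace regex.
That was my first instinct, but how would you wrap the new HTML tags around the content? The finished HTML /should/ look like this: <center>Some random text here. What's here doesn't matter so much as what needs to happen around it. And sometimes not all the text is styled the same.</center>
What you really need is a good DOM parser. HTML::DOM seems somewhat immature.

Tags: perl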

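【Solution 1】:

From the HTML::Element documentation, it looks like look_down() returns a list of HTML::Element objects. Perl objects are usually references to hashes (although they don't have to be), which is why you get HASH when you print $span.

Anyway, inside your for loop you should be able to call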

 $span->method()

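where method is any of HTML::Element's methods. For your purposes, the all_attr(), as_text(), and replace_with() methods look promising.

I tried to link to each method, but SO doesn't like CPAN's crufty anchor links, so for convenience here's a quick link to the main documentation page: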

https://metacpan.org/pod/HTML::Element
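
As a rough illustration of the methods mentioned above, here is a minimal sketch of what the body of that for loop might look like. It assumes HTML::TreeBuilder is installed; the sample HTML, the regex test on the style string, and the choice of a b element as the replacement are assumptions made for the example, not something taken from the question. It uses all_external_attr(), the variant of all_attr() that leaves out internal keys such as _tag and _parent.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Sample input invented for this sketch.
my $html = '<p><span style="font-weight:bold">Some text</span> and more.</p>';
my $tb   = HTML::TreeBuilder->new_from_content($html);

for my $span ($tb->look_down(_tag => 'span')) {
    # Attribute name/value pairs, without HTML::Element's internal keys.
    my %attr  = $span->all_external_attr;
    my $style = $attr{style} // '';

    print "style: $style\n";
    print "text:  ", $span->as_text, "\n";

    # Illustrative rule only: swap a bold span for a <b> element.
    if ($style =~ /font-weight\s*:\s*bold/) {
        my $bold = HTML::Element->new('b');
        $bold->push_content($span->detach_content);
        # replace_with() returns the element that was replaced; free it.
        $span->replace_with($bold)->delete;
    }
}

# Serialize just the paragraph, with every end tag written out.
my $result = $tb->look_down(_tag => 'p')->as_HTML(undef, undef, {});
print $result, "\n";

$tb->delete;
```

Matching the style string with a regex is only a shortcut here; a real CSS parser is more robust once the style attribute holds several properties.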

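【Discussion】:

You're right, it only links to the one page, but I think I get it. I'll take a look, thanks.
"Perl objects are internally just hashes..." is not correct. Perl objects are blessed references. bless {}, $class works just as well as bless [], $class or bless do{ \(my $o = "") }, $class.
OK, I give. Edited accordingly.
Should I edit my original question with the new code I've come up with, or is there a better way? Adding it in a comment wouldn't work well, and it would probably get eaten by the system.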
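
The point raised in the comments above, that an object is any blessed reference rather than specifically a hash, can be checked with core Perl alone. This is a small sketch; the class name Demo and the hash keys are hypothetical and exist only for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Scalar::Util qw(blessed reftype);

# "Demo" is a made-up class name; no methods are needed for this check.
my %objects = (
    hash_ref   => bless({}, 'Demo'),
    array_ref  => bless([], 'Demo'),
    scalar_ref => bless(do { \(my $o = "") }, 'Demo'),
);

for my $kind (sort keys %objects) {
    my $obj = $objects{$kind};
    # blessed() reports the class; reftype() reports the underlying storage.
    printf "%-10s class=%s storage=%s printed=%s\n",
        $kind, blessed($obj), reftype($obj), "$obj";
}
```

Printing the object itself gives the class name followed by the storage type and an address, which is why a hash-based HTML::Element shows up with HASH in it.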

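【Solution 2】:

Mike,
The problem is that in Perl you can't see an element's type in the debugger, because the object system is just a wrapper around the standard types. So looking at the documentation and/or the code is the only way to find the relevant properties and methods. About Objects gives you more detail on this.
Each $span will be an HTML::Element object; Ben's answer covers that part. I'm guessing you'll just change some attributes within the tree and save the tree to a new file.

【Discussion】:

Thanks. I sort of guessed that was why I couldn't just print $span. That's a good article.

【Solution 3】:

You're definitely on the right track using HTML::TreeBuilder; for parsing the CSS, I just found CSS::DOM. It's a very interesting module that gives you easy access to the properties.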

#!/usr/bin/perl
use warnings;
use strict;

use HTML::TreeBuilder;
use CSS::DOM::Style;

my $html = <<HTML;
<p style="text-align:center"><span style="font-weight:bold;font-style:italic;">Some random text here. What's here doesn't matter so much as what needs to happen around it.</span> And sometimes not all the text is styled the same.</p>
HTML

my $tb = HTML::TreeBuilder->new_from_content($html);


my @replacements = (
    { property => 'font-style', value => 'italic', replacement => 'em' },
    { property => 'font-weight', value => 'bold', replacement => 'strong' },
    { property => 'text-align', value => 'center', replacement => 'center' },
);

# build a sensible list of tag names (or just use sub { 1 })
my @nodes = $tb->look_down(sub { $_[0]->tag =~ /^(p|span)$/ });

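for my $el (@nodes) {
    # Skip elements that carry no inline style at all.
    if ($el->attr('style')) {
        # Parse the inline declaration block into a CSS::DOM::Style object.
        my $st = CSS::DOM::Style::parse($el->attr('style'));
        if ($st) {
            foreach my $h (@replacements) {
                if ($st->getPropertyValue($h->{property}) eq $h->{value}) {
                    # Drop the matched property, then wrap the element's
                    # current content in the replacement tag.
                    $st->removeProperty($h->{property});
                    my $new = HTML::Element->new($h->{replacement});
                    foreach my $inner ($el->detach_content) {
                        $new->push_content($inner);
                    }
                    $el->push_content($new);
                }
            }
            # Keep whatever CSS is left; passing undef deletes the attribute.
            $el->attr('style', $st->cssText ? $st->cssText : undef);
        }
    }
}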

print $tb->as_HTML(undef, "\t");

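【Discussion】:

I had initially dismissed CSS::DOM because the CPAN page I read made it sound geared more toward external CSS than inline CSS (or even internal CSS at the top of the page). I'll test your code as soon as I've installed CSS::DOM. Thanks!
Awesome! It worked, and I ran a few regexes to clean up the stray span that was still there: Some random text here. What&#39;s here doesn&#39;t matter so much as what needs to happen around it. And sometimes not all the text is styled the same. Now I just need to figure out how to get the p tag working too, and we'll be golden.
Check it now. There was a problem with the way I was using "detach_content". Also, see how to build a list of all the nodes that are allowed to be parsed.
Awesome! I'll post my slightly tweaked version as a new answer so you can see what I did with it. This is exactly what I needed! Thank you, thank you, thank you. Also, I don't know if you noticed, but as_HTML seems to strip the closing p tag. I fixed it by adding an empty hashref ({}) as the third argument (per the HTML::TreeBuilder docs).
Scratch that. It won't let me answer my own question. :P Here's the code: pastebin.com/f75bfd1a5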
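
The as_HTML behavior mentioned in the comments above can be seen in isolation with a short sketch. It assumes HTML::TreeBuilder is installed; the two-paragraph input is invented for the example. The third argument to as_HTML is a hashref naming the tags whose end tags may be omitted, so an empty hashref means none are treated as optional.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Sample input invented for this sketch.
my $tb = HTML::TreeBuilder->new_from_content(
    '<p>First paragraph.</p><p>Second paragraph.</p>'
);
my $body = $tb->look_down(_tag => 'body');

# Default call: end tags listed as optional (p among them) may be left out.
my $default = $body->as_HTML;

# Empty hashref as third argument: no end tag is considered optional.
my $explicit = $body->as_HTML(undef, undef, {});

print "default:  $default\n";
print "explicit: $explicit\n";

# Count the closing p tags in the explicit form.
my $closing_tags = () = $explicit =~ m{</p>}g;
print "closing p tags in explicit form: $closing_tags\n";

$tb->delete;
```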