【Question title】: How does one -- in Perl -- stream a list of URLs from a file into an array to then recursively acquire all of their HTML data in a single file?
【Posted】: 2014-04-05 07:01:20
【Question】:

Another laborious title... sorry... Anyway, I have a file called mash.txt that contains a bunch of URLs like this:

http://www...

.

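So, at this point, I'd like to feed these URLs into an array (possibly without having to declare anything along the way), then recursively pull the HTML data from each one and append it all to the same file, which I guess will have to be created. Anyway, thanks in advance.

Actually, to be fully forthcoming about the design, I'd like to pull only the values (value) under the option tags in each HTML page into this document, so I don't end up with so much junk. That is, each of these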

http://www...

will produce something like

<!DOCTYPE html>
<HTML>
   <HEAD>
      <TITLE>
         DATA! 
      </TITLE>
   </HEAD>
<BODY>
.
.
.

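All I want is the value name under the option tag, as it appears in the HTML of each URL in mash.txt.

【Comments】:

Can you give an example of this mysterious value tag? Are you implying that each HTML source has only one of these tags? Also, in the spirit of SO, what have you written so far, and where are you stuck? Please show that you have done some work on this.

Tags: perl file stream append html-tree

【Solution 1】: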

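The following fetches the HTML content of every URL in mash.txt, retrieves the value of every option element, and stores each value as a key of a single hash, so duplicates are dropped. The resulting list of values is then passed to input.template, and the processed output is written to output.html: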

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
use Template;

my %values;
my $input_file     = 'mash.txt';
my $input_template = 'input.template';
my $output_file    = 'output.html';

# create a new lwp user agent object (our browser).
my $ua = LWP::UserAgent->new( );

# open the input file (mash.txt) for reading.
open my $fh, '<', $input_file or die "cannot open '$input_file': $!";

# iterate through each line (url) in the input file.
while ( my $url = <$fh> )
{
    # strip the trailing newline and skip blank lines.
    chomp $url;
    next unless length $url;

    # get the html contents from url. It returns a handy response object.
    my $response = $ua->get( $url );

    # if we successfully got the html contents from url.
    if ( $response->is_success ) 
    {
        # create a new html tree builder object (our html parser) from the html content.
        my $tb = HTML::TreeBuilder->new_from_content( $response->decoded_content );

        # collect the value attribute of every option element as hash keys, so duplicates collapse automatically.
        # look_down returns a list of option element objects; map extracts each value attribute, and grep drops options that have none.
        $values{$_} = undef for grep { defined $_ } map { $_->attr( 'value' ) } $tb->look_down( _tag => 'option' );

        # free the parse tree now that we are done with it.
        $tb->delete;
    }
    # else we failed to get the html contents from url.
    else 
    {
        # warn of failure before next iteration (next url).
        warn "could not get '$url': " . $response->status_line;
    }
}

# close the input file since we have finished with it.
close $fh;

# create a new template object (our output processor).
my $tp = Template->new( ) || die Template->error( );

# process the input template (input.template), passing in the unique values (sorted, since hash key order is not stable), and write the result to the output file (output.html).
$tp->process( $input_template, { values => [ sort keys %values ] }, $output_file ) || die $tp->error( );

__END__

input.template might look something like:

<ul>
[% FOREACH value IN values %]
    <li>[% value %]</li>
[% END %]
</ul>
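
The question asked for the URLs in an array first, whereas the script reads mash.txt one line at a time. If an array really is wanted, a minimal sketch using only core Perl (5.14 or later) could look like the following. The helper name read_urls, the temporary file, and the example.com addresses are assumptions made for illustration; none of them come from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# read_urls is a hypothetical helper: it returns the non-blank lines of a file as a list.
sub read_urls {
    my ( $path ) = @_;
    open my $fh, '<', $path or die "cannot open '$path': $!";
    # read every line at once and strip the trailing newlines.
    chomp( my @lines = <$fh> );
    close $fh;
    # trim surrounding whitespace, then drop lines that are empty.
    return grep { length $_ } map { s/^\s+|\s+$//gr } @lines;
}

# demo input: a throwaway file with placeholder urls and one blank line.
my ( $tmp_fh, $tmp_path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "http://example.com/a\n\nhttp://example.com/b\n";
close $tmp_fh;

my @urls = read_urls( $tmp_path );
print scalar( @urls ), " urls: @urls\n";
```

Reading everything into an array is fine for a short list. For a very large file the line-by-line loop uses less memory.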

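【Comments】:

For starters, I'd sure love it if your code had comments. I don't mind having my hand held, figuratively speaking (in this particular case).
Hmm... this is nice... how would I filter duplicates with your approach?
Could you explain in more detail what you mean by filtering duplicates? Do you want to remove duplicate values?
Yes, just in case. I don't think there will be any, but you never know.
I've updated my post, adjusted to use a hash instead of an array, which effectively removes duplicates on the fly.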
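
To illustrate the hash-based deduplication mentioned in the last comment, here is a small standalone sketch. It differs from the answer in one respect: it keeps values in first-seen order, whereas the keys of a plain hash come back in no particular order. The sample data and the helper name unique_in_order are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical sample data standing in for option values scraped from several pages.
my @scraped = ( 'red', 'green', 'red', 'blue', 'green' );

# unique_in_order is an illustrative helper, not part of the answer above.
# %seen counts each value; grep keeps a value only the first time it is seen.
sub unique_in_order {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

my @unique = unique_in_order( @scraped );
print join( ', ', @unique ), "\n";    # prints "red, green, blue"
```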