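[Posted]: 2018-10-16 10:41:03
[Question]:
I ran into some errors while trying to debug the following code.
Note that it fetches data from about 6,000 fields at http://europa.eu/youth/volunteering/evs-organisation#open
After parsing each page, check whether the next › link at the bottom exists.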
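The check for a next link could be sketched as follows. This is only an illustration: the helper name next_page_href, the pager-next class in the XPath, and the two inline HTML snippets are assumptions, not taken from the real page markup.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: returns the href of the "next" pager link, or undef
# when the page has none. The XPath is an assumption about the markup.
sub next_page_href {
    my ( $html ) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content( $html );
    my ( $href ) = $tree->findvalues( q#//li[contains(@class,"pager-next")]/a/@href# );
    $tree->delete;    # free the parse tree
    return $href;
}

# Made-up snippets standing in for a middle page and the last page.
my $with_next = q#<ul><li class="pager-next"><a href="/youth/volunteering/evs-organisation_en?page=1">next</a></li></ul>#;
my $last_page = q#<ul><li class="pager-previous"><a href="/youth/volunteering/evs-organisation_en?page=4">previous</a></li></ul>#;

print defined next_page_href( $with_next ) ? "more pages\n" : "last page\n";
print defined next_page_href( $last_page ) ? "more pages\n" : "last page\n";
```

Stopping when the link is missing avoids having to know the page count in advance.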
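view-source is a browser-side command. It tells the browser to output the response as plain text instead of rendering it according to its actual content type (HTML in this case). You do not need to include view-source in the URL.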
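If a URL was copied from the browser with that prefix still attached, it can be removed before the request is made. A minimal sketch, assuming a hypothetical helper named strip_view_source:

```perl
use strict;
use warnings;

# Hypothetical helper: drop a leading "view-source:" so that the URL
# can be passed to an HTTP client such as LWP::UserAgent.
sub strip_view_source {
    my ( $url ) = @_;
    $url =~ s/\Aview-source://i;
    return $url;
}

print strip_view_source( 'view-source:https://europa.eu/youth/volunteering/evs-organisation_en' ), "\n";
```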
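Here is a script that extracts the data from each block and cleans it up a little. The browse function is generic: it takes an input reference containing the URL and the XPaths of the parent and its children, and builds the output reference from them. This is only one approach, and it does not yet navigate through every page.
In the rough script I tested, I used //span[@class="ey_badge"] to get the total number of results and then computed the maximum page: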
my $page_max = $results / 21;
$page_max = int( $page_max ) == $page_max ? $page_max-- : int( $page_max ) ;
The errors:
martin@linux-3645:~/dev/perl> perl eu.pl
syntax error at eu.pl line 81, near "our "
Global symbol "$iterator_organizations" requires explicit package name at eu.pl line 81.
Can't use global @_ in "my" at eu.pl line 84, near "= @_"
Missing right curly or square bracket at eu.pl line 197, at end of line
Execution of eu.pl aborted due to compilation errors.
martin@linux-3645:~/dev/perl> ^C
martin@linux-3645:~/dev/perl>
The code:
use strict;
use warnings FATAL => qw#all#;
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use Data::Dumper;

# Handlers that post-process the string matched by an XPath.
my $handler_relurl      = sub { q#https://europa.eu# . $_[0] };        # make a relative URL absolute
my $handler_trim        = sub { $_[0] =~ s#^\s*(.+?)\s*$#$1#r };       # strip surrounding whitespace
my $handler_val         = sub { $_[0] =~ s#^[^:]+:\s*##r };            # drop the leading "label:" part
my $handler_split       = sub { [ split $_[0], $_[1] ] };              # split into an array reference
my $handler_split_colon = sub { $handler_split->( qr#; #, $_[0] ) };
my $handler_split_comma = sub { $handler_split->( qr#, #, $_[0] ) };

# One parent XPath per block; each child maps a key to an XPath and optional handlers.
my $conf = {
    url      => q#https://europa.eu/youth/volunteering/evs-organisation_en#,
    parent   => q#//div[@class="vp ey_block block-is-flex"]#,
    children => {
        internal_url => [ q#//a/@href#, [ $handler_relurl ] ],
        external_url => [ q#//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, [ $handler_trim ] ],
        title        => [ q#//h4# ],
        topics       => [ q#//div[@class="org_cord"]#, [ $handler_val, $handler_split_colon ] ],
        location     => [ q#//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, [ $handler_trim ] ],
        hand         => [ q#//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, [ $handler_trim, $handler_split_comma ] ],
        pic_number   => [ q#//p[contains(.,'PIC no')]#, [ $handler_val ] ],
    }
};

print Dumper browse( $conf );

sub browse {
    my $conf = shift;
    my $ref  = [ ];
    my $lwp_useragent = LWP::UserAgent->new( agent => q#IE 6#, timeout => 10 );
    my $response = $lwp_useragent->get( $conf->{url} );
    die $response->status_line unless $response->is_success;
    my $content = $response->decoded_content;
    my $html_treebuilder_xpath = HTML::TreeBuilder::XPath->new_from_content( $content );
    my @nodes = $html_treebuilder_xpath->findnodes( $conf->{parent} );
    for my $node ( @nodes ) {
        push @$ref, { };    # one hash per matched block
        while ( my ( $key, $val ) = each %{ $conf->{children} } ) {
            my $xpath    = $val->[0];
            my $handlers = $val->[1] // [ ];
            # Evaluate the child XPath relative to the block; skip keys with no match.
            $val = ( $node->findvalues( qq#.$xpath# ) )[0] // next;
            $val = $_->( $val ) for @$handlers;    # apply the handlers in order
            $ref->[-1]->{$key} = $val;
        }
    }
    return $ref;
}
Example of one record in the output:
{
    'internal_url' => 'https://europa.eu/youth/volunteering/organisation/948417016_en',
    'external_url' => 'http://www.apd.ge',
    'location' => 'Tbilisi, Georgia',
    'title' => '"Academy for Peace and Development" Union',
    'topics' => [
        'Access for disadvantaged',
        'Youth (Participation, Youth Work, Youth Policy)',
        'Intercultural/intergenerational education and (lifelong)learning'
    ],
    'pic_number' => '948417016',
    'hand' => [
        'Receiving',
        'Sending'
    ]
}
our $iterator_organizations = sub {
    my ( $browser, $parent ) = @_;
    my $url = q#https://europa.eu/youth/volunteering/evs-organisation_en#;
    my $nodes = $browser->nodes( url => $url );
    my $iterator = sub {
        return shift @$nodes;
    };
    return ( $iterator, 1 );
our $iterator_organizations_b = sub {
    my ( $browser, $parent ) = @_;
    my $url = q#https://europa.eu/youth/volunteering/evs-organisation_en#;
    my $uri = URI->new( $url );
    my $xpath = q#//div[@class="vp ey_block block-is-flex"]#;
    my $nodes = [ ];
    my $page = 0;
    my $results = $parent->{results};
    my $page_max = $results / 21;
    $page_max = int($page_max) == $page_max ? $page_max-- : int($page_max);
    my $iterator_uri = sub {
        $uri->query_form( page => $page++ );
        return $page > 2 ? undef : $uri ; # $page_max;
    };
    my $iterator_node = sub {
        unless ( @$nodes ) {
            my $uri = $iterator_uri->( ) // return undef;
            my $options = $page == 1 ? { tree => $parent->{_node} } : { url => $uri->as_string };
            $nodes = $browser->nodes( %$options, xpath => $xpath );
        }
        return shift @$nodes;
    };
    return ( $iterator_node, 0 );
};

our $iterator_organization = sub {
    my ( $browser, $parent ) = @_;
    my $url = $parent->{internal_url};
    my $nodes = $browser->nodes( url => $url );
    my $iterator = sub {
        return shift @$nodes;
    };
    return ( $iterator, 1 );
};

sub organizations {
    my ( $self, $options ) = ( shift, { @_ } );
    my $map = [
        $Massweb::Browser::Europa::iterator_organizations,
        results => q#.//span[@class="ey_badge"]#,
        organizations => [
            $Massweb::Browser::Europa::iterator_organizations_b,
            internal_url => [ q#.//a/@href#, $Massweb::Browser::Europa::handler_url ],
            external_url => [ q#.//i[@class="fa fa-external-link fa-lg"]/parent::p//a/@href#, $Massweb::Browser::handler_trim ],
            title => q#.//h4#,
            topics => [ q#.//div[@class="org_cord"]#, $Massweb::Browser::handler_val, $Massweb::Browser::handler_list_colon ],
            location => [ q#.//i[@class="fa fa-location-arrow fa-lg"]/parent::p#, $Massweb::Browser::handler_trim ],
            hand => [ q#.//i[@class="fa fa-hand-o-right fa-lg"]/parent::p#, $Massweb::Browser::handler_trim, $Massweb::Browser::handler_list_comma ],
            pic_number => [ q#.//p[contains(.,'PIC no')]#, $Massweb::Browser::handler_val ],
            recruiting => [ q#boolean(.//i[@class="fa fa-user-times fa-lg"])#, $Massweb::Browser::handler_bool_rev ],
            _ => \&organization,
        ],
    ];
    my $organizations = $self->browse( map => $map );
    return $organizations;
}

sub organization {
    my ( $self, $options ) = ( shift, { @_ } );
    my $map = [
        sub { $Massweb::Browser::Europa::iterator_organization->( $_[0], $options ) },
        #title => q#.//h1#,
        description => q#.//div[@class="ey_vp_detail_page"]/p#,
    ];
    my $organization = $self->browse( map => $map );
    return $organization;
}
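One possible reading of the "Missing right curly or square bracket" message: in the code as posted, the anonymous sub assigned to $iterator_organizations does not appear to be closed with }; before the next our statement begins. The sketch below shows the same pattern with both assignments terminated. The names $iterator_first and $iterator_second and the sample list are hypothetical and only stand in for the iterators in the question.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an iterator factory: returns a closure that
# hands out one item per call, plus a flag.
our $iterator_first = sub {
    my ( $items ) = @_;
    my $iterator = sub { return shift @$items };
    return ( $iterator, 1 );
};    # closing brace and semicolon end the assignment

our $iterator_second = sub {
    my ( $items ) = @_;
    my $iterator = sub { return shift @$items };
    return ( $iterator, 0 );
};

my ( $next, $flag ) = $iterator_first->( [ qw( a b c ) ] );
while ( defined( my $item = $next->() ) ) {
    print "$item\n";
}
```

Running perl -c on the file after each change is a quick way to see whether the remaining messages were follow-on errors from the first one.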
[Discussion]:
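- Your $page_max = int( $page_max ) == $page_max ? $page_max-- : int( $page_max ) is strange. It says: "If the variable is an integer, decrement it and then reassign its original value; otherwise drop its fractional part."
- Hello dear Borodin, thank you very much for the tip. Since I am a Perl beginner, I am trying to learn here. I think I need to rewrite the code and make it simpler.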
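The behaviour of that expression can be checked with a small script. The sketch below compares it with a ceil-based version, assuming that pages are numbered from 0 and that each page holds 21 results, as the code in the question suggests. The sub names last_page_index and last_page_index_original are hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper: index of the last page when pages are numbered from 0.
sub last_page_index {
    my ( $results, $per_page ) = @_;
    return 0 if $results <= 0;
    return ceil( $results / $per_page ) - 1;
}

# The expression from the question, wrapped in a sub for comparison.
# The post-decrement yields the old value, which is then assigned back,
# so the decrement has no lasting effect.
sub last_page_index_original {
    my ( $results, $per_page ) = @_;
    my $page_max = $results / $per_page;
    $page_max = int( $page_max ) == $page_max ? $page_max-- : int( $page_max );
    return $page_max;
}

printf "%d results: original=%d, ceil-based=%d\n",
    $_, last_page_index_original( $_, 21 ), last_page_index( $_, 21 )
    for 41, 42, 43;
```

For 42 results (an exact multiple of 21) the original expression yields 2 while the ceil-based version yields 1. For 41 and 43 results both versions agree.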
Tags: apache perl lwp-useragent