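【Question title】: recursive web crawler perl
【Posted】: 2012-12-16 08:59:07
【Question】:

I am trying to write a minimal web crawler. The goal is to discover new URLs from a seed and then crawl those new URLs in turn. The code is as follows: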

use strict;
use warnings;
use Carp;
use Data::Dumper;
use WWW::Mechanize;

my $url = "http://foobar.com"; # example
my %links;

my $mech = WWW::Mechanize->new(autocheck => 1);
$mech->get($url);
my @cr_fronteir = $mech->find_all_links();

foreach my $links (@cr_fronteir) {
    if ( $links->[0] =~ m/^http/xms ) {
        $links{$links->[0]} = $links->[1];
    }
}

I am stuck here: how do I go on to crawl the links in %links, and how do I add a depth limit to prevent overflow? Suggestions are welcome.

【Comments】:

    Tags: perl recursion web-scraping web-crawler


    【Answer 1】:

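    The Mojolicious web framework offers some interesting features that are useful for web crawlers:

    • No dependencies other than Perl v5.10 or later
    • URL parser
    • DOM tree parser
    • Asynchronous HTTP/HTTPS client (allows concurrent requests without fork() overhead)

    Here is an example that recursively crawls the local Apache documentation and displays page titles and extracted links. It uses 4 parallel connections, goes no deeper than 3 path levels, and visits each extracted link only once: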

    #!/usr/bin/env perl
    use 5.010;
    use open qw(:locale);
    use strict;
    use utf8;
    use warnings qw(all);
    
    use Mojo::UserAgent;
    
    # FIFO queue
    my @urls = (Mojo::URL->new('http://localhost/manual/'));
    
    # User agent following up to 5 redirects
    my $ua = Mojo::UserAgent->new(max_redirects => 5);
    
    # Track accessed URLs
    my %uniq;
    
    my $active = 0;
    
    sub parse {
        my ($tx) = @_;
    
        # Request URL
        my $url = $tx->req->url;
    
        say "\n$url";
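        # at() returns undef when the page has no <title> element
        my $title = $tx->res->dom->at('html title');
        say $title ? $title->text : '(no title)';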
    
        # Extract and enqueue URLs
        for my $e ($tx->res->dom('a[href]')->each) {
    
            # Validate href attribute
            my $link = Mojo::URL->new($e->{href});
            next if 'Mojo::URL' ne ref $link;
    
            # "normalize" link
            $link = $link->to_abs($tx->req->url)->fragment(undef);
            next unless $link->protocol =~ /^https?$/x;
    
            # Don't go deeper than /a/b/c
            next if @{$link->path->parts} > 3;
    
            # Access every link only once
            next if ++$uniq{$link->to_string} > 1;
    
            # Don't visit other hosts
            next if $link->host ne $url->host;
    
            push @urls, $link;
            say " -> $link";
        }
    
        return;
    }
    
    sub get_callback {
        my (undef, $tx) = @_;
    
        # Parse only OK HTML responses
        $tx->res->code == 200
            and
        $tx->res->headers->content_type =~ m{^text/html\b}ix
            and
        parse($tx);
    
        # Deactivate
        --$active;
    
        return;
    }
    
    Mojo::IOLoop->recurring(
        0 => sub {
    
            # Keep up to 4 parallel crawlers sharing the same user agent
            for ($active .. 4 - 1) {
    
                # Dequeue or halt if there are no active crawlers anymore
                return ($active or Mojo::IOLoop->stop)
                    unless my $url = shift @urls;
    
                # Fetch non-blocking just by adding
                # a callback and marking as active
                ++$active;
                $ua->get($url => \&get_callback);
            }
        }
    );
    
    # Start event loop if necessary
    Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
    

    For more web scraping tips and tricks, read the article I Don’t Need No Stinking API: Web Scraping For Fun and Profit.

    【Comments】:

      【Answer 2】:

      You can't do recursion without making it a function.

      use strict;
      use warnings;
      use Carp; #unused, but I guess yours was a sample
      use Data::Dumper;
      use WWW::Mechanize;
      
      my %links;
      my $mech = WWW::Mechanize->new(autocheck => 1);
      
      sub crawl {
          my $url = shift;
          my $depth = shift // 0; # "shift or 0" would leave $depth undefined
          #this seems like a good place to assign some form of callback, so you can
          # generalize this function
      
          return if $depth > 10; #change as needed
      
          $mech->get($url);
          my @cr_fronteir = $mech->find_all_links();
      
          #not so sure what you're trying to do; before, $links in the
          # foreach overrides the global %links
          #perhaps you meant this...?
          foreach my $link (@cr_fronteir) {
              if ($link->[0] =~ m/^http/xms) {
                  $links{$link->[0]} = $link->[1];
      
                  #be nice to servers - try not to overload them
                  sleep 3;
                  #recursion!
                  crawl( $link->[0], $depth + 1 );
              }
          }
      }
      
      crawl("http://foobar.com", 0);
      

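      I don't have Perl installed on this partition, so this is prone to syntax errors and other mischief, but it can serve as a basis.

      As the first comment in the function says: instead of hard-coding the mapping functionality, you can generalize your function for greater glory by passing it a callback and calling that callback for every link you crawl.

      【Comments】:

        【Answer 3】:

        Some code: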
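
A minimal sketch of that callback idea, not the answerer's own code. To keep it runnable without network access, the hypothetical `%pages` hash stands in for the pages that `$mech->get` and `$mech->find_all_links` would return; the `$fetch_links` and `$on_page` parameters and the `$seen` hash are likewise assumptions added for illustration.

```perl
use strict;
use warnings;

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# In a real crawler this lookup would be $mech->get($url) followed by
# $mech->find_all_links().
my %pages = (
    'http://foobar.com/'  => [ 'http://foobar.com/a', 'http://foobar.com/b' ],
    'http://foobar.com/a' => [ 'http://foobar.com/b', 'http://foobar.com/c' ],
    'http://foobar.com/b' => ['http://foobar.com/'],
    'http://foobar.com/c' => ['http://foobar.com/d'],
);

# Recursively crawl from $url, calling $on_page->($url, $depth) once per page.
sub crawl {
    my ( $url, $fetch_links, $on_page, $max_depth, $depth, $seen ) = @_;
    $depth //= 0;
    $seen  //= {};

    return if $depth > $max_depth;    # bound the recursion
    return if $seen->{$url}++;        # visit each URL only once

    $on_page->( $url, $depth );
    crawl( $_, $fetch_links, $on_page, $max_depth, $depth + 1, $seen )
        for $fetch_links->($url);
    return;
}

my @visited;
crawl(
    'http://foobar.com/',
    sub { @{ $pages{ $_[0] } // [] } },      # stand-in fetcher
    sub { push @visited, "$_[1] $_[0]" },    # callback: record depth and URL
    2,                                       # maximum depth
);
print "$_\n" for @visited;
```

Because the recursion is depth-first, a page can be recorded at a greater depth than its shortest distance from the seed: here `/b` is first reached through `/a`, at depth 2, even though the seed links to it directly.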

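        use strict;
        use warnings;
        use WWW::Mechanize;

        # autocheck off, so a failed request does not abort the whole crawl
        my $mech  = WWW::Mechanize->new( autocheck => 0 );
        my @links = ('http://foobar.com');     # queue, seeded with the question's example URL
        my %links = map { $_ => 1 } @links;    # URLs that have already been queued

        while ( scalar @links ) {
            my $link = shift @links;
            process_link($link);
        }

        sub process_link {
            my $link = shift;

            $mech->get($link);
            return unless $mech->success && $mech->is_html;
            foreach my $page_link ( $mech->find_all_links() ) {
                # key on the absolute URL string, not on the link object
                my $url = $page_link->url_abs->as_string;
                next if $links{$url};
                $links{$url} = 1;
                push @links, $url;
            }
        }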
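
The queue above has no depth limit, which the question also asked about. One way to add it, sketched here under assumptions, is to queue `[url, depth]` pairs instead of bare URLs. The hypothetical `%pages` hash replaces real HTTP fetches so the example runs offline, and `crawl_queue` and its parameters are illustrative names, not part of WWW::Mechanize.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for real HTTP fetches.
my %pages = (
    'http://foobar.com/'  => [ 'http://foobar.com/a', 'http://foobar.com/b' ],
    'http://foobar.com/a' => [ 'http://foobar.com/b', 'http://foobar.com/c' ],
    'http://foobar.com/b' => ['http://foobar.com/'],
    'http://foobar.com/c' => ['http://foobar.com/d'],
);

# Breadth-first crawl; returns a hash ref of url => depth of first discovery.
sub crawl_queue {
    my ( $seed, $fetch_links, $max_depth ) = @_;
    my @queue = ( [ $seed, 0 ] );    # FIFO of [url, depth] pairs
    my %seen  = ( $seed => 0 );

    while ( my $item = shift @queue ) {
        my ( $url, $depth ) = @$item;
        next if $depth >= $max_depth;    # do not expand pages at the limit
        for my $link ( $fetch_links->($url) ) {
            next if exists $seen{$link};    # queue each URL only once
            $seen{$link} = $depth + 1;
            push @queue, [ $link, $depth + 1 ];
        }
    }
    return \%seen;
}

my $seen = crawl_queue( 'http://foobar.com/', sub { @{ $pages{ $_[0] } // [] } }, 2 );
printf "%d %s\n", $seen->{$_}, $_ for sort keys %$seen;
```

Since the queue is first-in, first-out, every URL is discovered at its shortest distance from the seed, and the loop cannot overflow the call stack the way deep recursion can.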
        

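        P.S. The /m and /s modifiers are unnecessary in your code (and so is /x).

        【Comments】:

        • The /m, /s and /x flags: various Perl style guides recommend putting these flags on every regular expression. /ms changes some regex behavior that is unfriendly to newcomers, and /x is very useful ;-) I also always annotate my regexes with these three flags, whether or not they are directly needed.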
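
For readers unsure what the three flags change, here is a small illustration. The sample string `$text` is made up for the example; the patterns only loosely follow the question's `m/^http/xms`.

```perl
use strict;
use warnings;

my $text = "see\nhttp://foobar.com";

# Without /m, ^ matches only at the very start of the string.
my $plain = $text =~ /^http/ ? 1 : 0;

# With /m, ^ also matches right after each embedded newline.
my $multi = $text =~ /^http/m ? 1 : 0;

# Without /s, . does not match a newline; with /s it does.
my $dot   = $text =~ /see.http/  ? 1 : 0;
my $dot_s = $text =~ /see.http/s ? 1 : 0;

# /x ignores unescaped whitespace and allows comments inside the pattern.
my $verbose = 'http://foobar.com' =~ m{
    ^ https?    # scheme, with an optional "s"
    ://         # separator
}x ? 1 : 0;

print "plain=$plain multi=$multi dot=$dot dot_s=$dot_s verbose=$verbose\n";
```

For a single-line URL such as the ones in the question, `/m` and `/s` make no difference to `^http`, which is the point of the P.S. above; they only matter once the string can contain newlines.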