【Question title】:I'd like to scrape the iTunes top X RSS feed and insert into a DB
【Posted】:2009-02-23 07:19:33
【Question description】:

I'd prefer to do it with some bash shell scripting, maybe some PHP or Perl and a MySQL database. Thoughts?

【Comments】:

    Tags: shell rss scripting screen-scraping itunes


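    【Solution 1】:

    Here is a solution using Perl, with the help of (of course!) a bunch of modules.

    It uses SQLite so you can run it easily (the definition of the (simplistic) DB is at the end of the script). It also uses Perl hashes and simple SQL statements instead of proper objects and an ORM layer. I found it easier to parse the XML directly than to use an RSS module (I tried XML::Feed), because you need access to specific tags (name, preview...).

    You can use it as a basis to add more features, more fields in the DB, a table for genres... At least this way you have something you can expand on (and maybe you can then publish the result as open source).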

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use XML::Twig;                 # to parse the RSS
    use DBIx::Simple;              # DB interaction made easy
    use Getopt::Std;               # always need options for a script
    use PerlIO::gzip;              # itunes sends a gzip-ed file
    use LWP::Simple 'getstore';    # to get the RSS
    
    my %opt;
    getopts( 'vc:', \%opt);
    
    # could also be an option, but I guess it won't change that much
    my @URLs= ( 
                'http://ax.itunes.apple.com/WebObjects/MZStoreServices.woa/ws/RSS/topsongs/limit=10/xml',
              );
    
    # while debugging, it's nice to use a cached copy of the feed instead of hitting the server on every run
    if( $opt{c}) { @URLs= ($opt{c}); }
    
    # I like using SQLite when developing,
    # replace with MySQL connect parameters if needed (see DBD::mysql for the exact syntax)
    my @connect= ("dbi:SQLite:dbname=itunes.db","","", { RaiseError => 1, AutoCommit => 0 }) ;
    
    my $NS_PREFIX='im';
    
    # a global, could be passed around, but would make the code a bit more verbose
    my $db = DBIx::Simple->connect(@connect) or die "cannot connect to DB: $DBI::errstr";
    
    foreach my $url (@URLs)
      { add_feed( $url); }
    
    $db->disconnect;
    
    warn "done\n" if( $opt{v});
    
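    sub add_feed 
      { my( $source)= @_;
    
        # itunes sends gzipped RSS, so we need to unzip it
        my $feed_file= $source;
        unless( -f $source)             # not a local cache file (-c option), so download it
          { $feed_file= "$0.rss.gz";    # very crude, should use File::Temp instead
            my $status= getstore( $source, $feed_file);
            die "cannot fetch $source: HTTP status $status\n" unless $status =~ /^2\d\d$/;
          }
        # a cache file must be the gzipped feed as downloaded, since it is read through :gzip too
        open( my $in_feed, '<:gzip', $feed_file) or die "cannot open $feed_file: $!";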
    
        XML::Twig->new( twig_handlers => { 'feed/title' => sub { warn "adding feed ", $_->text if $opt{v}; },
                                              entry       => \&entry,
                                           },
                          map_xmlns => { 'http://phobos.apple.com/rss' => $NS_PREFIX },
                      )
                 ->parse( $in_feed);
    
        close $in_feed;
      }
    
    sub entry
      { my( $t, $entry)= @_;
    
        # get the data
        my %song= map { $_ => $entry->field( "$NS_PREFIX:$_") } qw( name artist price);
        if( my $preview= $entry->first_child( 'link[@title="Preview"]') )
          { $song{preview}= $preview->att( 'href'); }
    
        # $db->begin_work;
    
        # store it
        if( ($db->query( 'SELECT count(*) FROM song WHERE name=?', $song{name})->flat)[0])
          { warn "  skipping $song{name}, already stored\n" if $opt{v};
          }
        else
          {
            warn "  adding $song{name}\n" if $opt{v};
            if( my $artist_id= ($db->query( 'SELECT id from ARTIST where name=?', $song{artist})->flat)[0])
              { warn "  existing artist $song{artist} ($artist_id)\n" if $opt{v};
                $song{artist}= $artist_id; 
              }
            else
              { warn "  creating new artist $song{artist}\n" if $opt{v};
    
                $db->query( 'INSERT INTO artist (name) VALUES (??)', $song{artist});
    
                # should be $db->last_insert_id but that's not available in DBD::SQLite at the moment
                $song{artist}= $db->func('last_insert_rowid');
              }
    
            $db->query( 'INSERT INTO song ( name, artist, price, preview) VALUES (??)', 
                                  @song{qw( name  artist  price  preview)});
            $db->commit;
          }
        $t->purge; # keeps memory usage lower, probably not needed for small RSS files
      }
    
    __END__
    =head1 NAME
    
      itunes2db - loads itunes RSS feeds to a DB
    
    =head1 OPTIONS
    
      -c <file>  uses a cache instead of the list of URLs
      -v         verbose
    
    =head1 DB schema
    
      create table song ( id INTEGER PRIMARY KEY, name TEXT, artist INT, price TEXT, preview TEXT);
      create table artist (id INTEGER PRIMARY KEY, name TEXT);
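
    One SQLite detail is worth checking when adapting the schema: the script finds artists by their `id`, and SQLite only fills that column in automatically when it is declared exactly `INTEGER PRIMARY KEY` (an alias for the rowid). A small check using Python's `sqlite3` module, with hypothetical table names, shows the difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical tables that differ only in how the id column is declared.
conn.execute("CREATE TABLE artist_int (id INT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE artist_integer (id INTEGER PRIMARY KEY, name TEXT)")

# Insert a row into each table without supplying an id.
for table in ("artist_int", "artist_integer"):
    conn.execute(f"INSERT INTO {table} (name) VALUES (?)", ("Some Artist",))

# INT PRIMARY KEY: id is an ordinary column and stays NULL.
int_id = conn.execute("SELECT id FROM artist_int").fetchone()[0]

# INTEGER PRIMARY KEY: id aliases the rowid and gets a value.
integer_id = conn.execute("SELECT id FROM artist_integer").fetchone()[0]

print(int_id, integer_id)
```

    With a NULL `id`, a lookup such as `SELECT id FROM artist WHERE name=?` returns nothing usable, so every song would create a new artist row.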
    

    【Comments】:

      【Solution 2】:

      As far as I can tell it isn't actively maintained, but Scriptella might help. It uses very simple XML scripts and runs on Java.

      Example of how to suck RSS into a database:

      <!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
      <etl>
          <connection id="in" driver="xpath" url="http://snippets.dzone.com/rss"/>
          <connection id="out" driver="text" url="rss.txt"/>
          <connection id="db" driver="hsqldb" url="jdbc:hsqldb:db/rss" user="sa" classpath="hsqldb.jar"/>
          <script connection-id="db">
             CREATE TABLE Rss (
                 ID Integer,
                 Title VARCHAR(255),
                 Description VARCHAR(255),   
                 Link VARCHAR(255)
      
             )
          </script>
          <query connection-id="in">
              /rss/channel/item
              <script connection-id="out">
                  Title: $title
                  Description: [
                  ${description.substring(0, 20)}...
                  ]
                  Link: $link
                  ----------------------------------
              </script>
              <script connection-id="db">
                  INSERT INTO Rss (ID, Title, Description, Link) 
                  VALUES (?rownum, ?title, ?description, ?link);
              </script>
          </query>
      </etl>
      

      【Comments】:

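        【Solution 3】:

        Well, I'm not really sure what sort of answer you're looking for, but I don't think you need any shell scripting at all. Both PHP and Perl are perfectly capable of downloading the RSS feed and inserting the data into MySQL. Set the PHP or Perl script up to run every X hours/days/whatever with a cron job and you're done.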
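
        A minimal sketch of that parse-and-insert flow, written in Python rather than PHP or Perl so that it runs with the standard library alone. It assumes an Atom feed and uses SQLite as a stand-in for MySQL. `SAMPLE_FEED`, `store_feed`, and the `entry` table are hypothetical names, and the download step is left out (the sample string stands in for the fetched feed). Entries that are already stored are ignored, so a cron job can re-run it safely.

```python
import sqlite3
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample standing in for a downloaded feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Top Songs</title>
  <entry><id>urn:song:1</id><title>First Song</title></entry>
  <entry><id>urn:song:2</id><title>Second Song</title></entry>
</feed>"""


def store_feed(conn, feed_xml):
    """Insert each Atom entry once; return how many rows were new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entry (guid TEXT PRIMARY KEY, title TEXT)"
    )
    added = 0
    for entry in ET.fromstring(feed_xml).iter(ATOM + "entry"):
        guid = entry.findtext(ATOM + "id")
        title = entry.findtext(ATOM + "title")
        # INSERT OR IGNORE skips rows whose guid is already stored,
        # so repeated runs do not create duplicates.
        cur = conn.execute(
            "INSERT OR IGNORE INTO entry (guid, title) VALUES (?, ?)",
            (guid, title),
        )
        added += cur.rowcount
    conn.commit()
    return added


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(store_feed(conn, SAMPLE_FEED))  # first run stores the entries
    print(store_feed(conn, SAMPLE_FEED))  # second run finds nothing new
```

        Scheduled from cron, a line such as `0 */6 * * * python3 /path/to/feed2db.py` (path hypothetical) would run it every six hours.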

        There's not really much else to tell you, given how vague your question is.

        【Comments】:

          【Solution 4】:

          I'm scraping Stack Overflow's feed to do some additional filtering, using PHP's DOMDocument and then DOM methods to access what I want. I'd suggest looking into that.
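
          A rough equivalent of that DOM-based filtering, sketched in Python with `xml.dom.minidom` instead of PHP's DOMDocument. The sample feed, the `filter_entries` helper, and the keyword match on titles are assumptions for illustration, not the setup described above.

```python
from xml.dom import minidom

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical sample standing in for a downloaded Atom feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>How do I parse RSS in PHP?</title><link href="http://example.com/q/1"/></entry>
  <entry><title>Perl one-liner question</title><link href="http://example.com/q/2"/></entry>
  <entry><title>PHP DOMDocument and namespaces</title><link href="http://example.com/q/3"/></entry>
</feed>"""


def text_of(node):
    """Concatenate the text nodes directly under a DOM element."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def filter_entries(feed_xml, keyword):
    """Return (title, link) pairs for entries whose title contains keyword."""
    doc = minidom.parseString(feed_xml)
    matches = []
    for entry in doc.getElementsByTagNameNS(ATOM_NS, "entry"):
        title = text_of(entry.getElementsByTagNameNS(ATOM_NS, "title")[0])
        link = entry.getElementsByTagNameNS(ATOM_NS, "link")[0].getAttribute("href")
        # Case-insensitive substring match on the entry title.
        if keyword.lower() in title.lower():
            matches.append((title, link))
    return matches


if __name__ == "__main__":
    for title, link in filter_entries(SAMPLE_FEED, "php"):
        print(title, link)
```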

          【Comments】:
