【Title】: Scrape and generate RSS feed
【Posted】: 2009-02-17 16:19:08
【Question】:

I use Simple HTML DOM to scrape a page for the latest news, and then generate an RSS feed with this PHP class.

This is what I have now:

<?php

 // This is a minimum example of using the class
 include("FeedWriter.php");
 include('simple_html_dom.php');

 $html = file_get_html('http://www.website.com');

 // Collect the scraped articles; initialised so the second loop is safe when nothing matches
 $articles = array();

 foreach($html->find('td[width="380"] p table') as $article) {
     $item['title'] = $article->find('span.title', 0)->innertext;
     $item['description'] = $article->find('.ingress', 0)->innertext;
     $item['link'] = $article->find('.lesMer', 0)->href;
     $item['pubDate'] = $article->find('span.presseDato', 0)->plaintext;
     $articles[] = $item;
 }


//Creating an instance of FeedWriter class. 
$TestFeed = new FeedWriter(RSS2);


 //Use wrapper functions for common channel elements

 $TestFeed->setTitle('Testing & Checking the RSS writer class');
 $TestFeed->setLink('http://www.ajaxray.com/projects/rss');
 $TestFeed->setDescription('This is test of creating a RSS 2.0 feed Universal Feed Writer');

  //Image title and link must match with the 'title' and 'link' channel elements for valid RSS 2.0

  $TestFeed->setImage('Testing the RSS writer class','http://www.ajaxray.com/projects/rss','http://www.rightbrainsolution.com/images/logo.gif');


foreach($articles as $row) {

    //Create an empty FeedItem
    $newItem = $TestFeed->createNewItem();

    //Add elements to the feed item    
    $newItem->setTitle($row['title']);
    $newItem->setLink($row['link']);
    $newItem->setDate($row['pubDate']);
    $newItem->setDescription($row['description']);

    //Now add the feed item
    $TestFeed->addItem($newItem);
}

  //OK. Everything is done. Now generate the feed.
  $TestFeed->genarateFeed();

?>

How can I make this code simpler? Right now there are two foreach statements; how can I combine them?

Because the scraped news is in Norwegian, I need to apply html_entity_decode() to the title. I tried it here, but I couldn't get it to work:

foreach($html->find('td[width="380"] p table') as $article) {
    $item['title'] = html_entity_decode($article->find('span.title', 0)->innertext, ENT_NOQUOTES, 'UTF-8');
    $item['description'] = "<img src='" . $article->find('img[width="100"]', 0)->src . "'><p>" . $article->find('.ingress', 0)->innertext . "</p>";
    $item['link'] = $article->find('.lesMer', 0)->href;
    $item['pubDate'] = unix2rssdate(strtotime($article->find('span.presseDato', 0)->plaintext));
    $articles[] = $item;
}

Thanks :)

【Tags】: php foreach rss screen-scraping


    【Answer 1】:

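    It looks like you loop through $html to build an array of articles, and then loop through that array to add each one to the feed. You can skip the second loop entirely by adding items to the feed as they are found. To do that, you need to move the FeedWriter constructor up a little in the execution flow.

    I would also add a few functions to improve readability, which may help maintainability in the long run. Encapsulating feed creation, item parsing and so on should make things easier if you ever need to plug in a different provider class for the feed, change the parsing rules, etc. The code below could be improved further (for example, html_entity_decode is on a separate line from the $item['title'] assignment), but you get the idea.

    What problem are you having with html_entity_decode? Do you have sample input/output?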

    <?php
    
     // This is a minimum example of using the class
     include("FeedWriter.php");
     include('simple_html_dom.php');
    
     // Create new instance of a feed
     $TestFeed = create_new_feed();
    
     $html = file_get_html('http://www.website.com');
    
     // Loop through html pulling feed items out
     foreach($html->find('td[width="380"] p table') as $article) 
     {
        // Get a parsed item
        $item = get_item_from_article($article);
    
        // Get the item formatted for feed
        $formatted_item = create_feed_item($TestFeed, $item);
    
        //Now add the feed item
        $TestFeed->addItem($formatted_item);
     }
    
     //OK. Everything is done. Now generate the feed.
     //(The method is spelled genarateFeed() in the FeedWriter version used in the question.)
     $TestFeed->genarateFeed();
    
    
    // HELPER FUNCTIONS
    
    /**
     * Create new feed - encapsulated in method here to allow
     * for change in feed class etc
     */
    function create_new_feed()
    {
         //Creating an instance of FeedWriter class. 
         $TestFeed = new FeedWriter(RSS2);
    
         //Use wrapper functions for common channel elements
         $TestFeed->setTitle('Testing & Checking the RSS writer class');
         $TestFeed->setLink('http://www.ajaxray.com/projects/rss');
         $TestFeed->setDescription('This is test of creating a RSS 2.0 feed Universal Feed Writer');
    
         //Image title and link must match with the 'title' and 'link' channel elements for valid RSS 2.0
         $TestFeed->setImage('Testing the RSS writer class','http://www.ajaxray.com/projects/rss','http://www.rightbrainsolution.com/images/logo.gif');
    
         return $TestFeed;
    }
    
    
    /**
     * Take in html article segment, and convert to usable $item
     */
    function get_item_from_article($article)
    {
        $item['title'] = $article->find('span.title', 0)->innertext;
        $item['title'] = html_entity_decode($item['title'], ENT_NOQUOTES, 'UTF-8');
    
        $item['description'] = $article->find('.ingress', 0)->innertext;
        $item['link'] = $article->find('.lesMer', 0)->href;     
        $item['pubDate'] = $article->find('span.presseDato', 0)->plaintext;     
    
        return $item;
    }
    
    
    /**
     * Given an $item with feed data, create a
     * feed item
     */
    function create_feed_item($TestFeed, $item)
    {
        //Create an empty FeedItem
        $newItem = $TestFeed->createNewItem();
    
        //Add elements to the feed item    
        $newItem->setTitle($item['title']);
        $newItem->setLink($item['link']);
        $newItem->setDate($item['pubDate']);
        $newItem->setDescription($item['description']);
    
        return $newItem;
    }
    ?>
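
Regarding the html_entity_decode question above, here is a minimal sketch of what the call does with Norwegian named entities. The helper name decode_title() and the sample string are made up for illustration; they are not from the scraped site. If the result still looks wrong in a feed reader, one possible cause (a guess, since no sample output was posted) is that the feed is served with a charset other than UTF-8.

```php
<?php
// Hypothetical helper (not part of FeedWriter or Simple HTML DOM):
// decode HTML entities in a scraped title into UTF-8 text.
function decode_title($raw)
{
    // ENT_NOQUOTES leaves &quot; and &#039; untouched; named entities such
    // as &aring; are still decoded. The third argument is the output charset.
    return html_entity_decode($raw, ENT_NOQUOTES, 'UTF-8');
}

// Made-up sample input containing Norwegian letters as entities.
echo decode_title('Bl&aring;b&aelig;r og r&oslash;mme'), "\n";
```

Inside get_item_from_article() this would simply wrap the innertext lookup for the title.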
    


      【Answer 2】:

      For a simple combination of the two loops, you can build the feed as you parse through the HTML:

      <?php
      include("FeedWriter.php");
      include('simple_html_dom.php');
      
      $html = file_get_html('http://www.website.com');
      
      //Creating an instance of FeedWriter class. 
      $TestFeed = new FeedWriter(RSS2);
      $TestFeed->setTitle('Testing & Checking the RSS writer class');
      $TestFeed->setLink('http://www.ajaxray.com/projects/rss');
      $TestFeed->setDescription(
        'This is test of creating a RSS 2.0 feed Universal Feed Writer');
      
      $TestFeed->setImage('Testing the RSS writer class',
                          'http://www.ajaxray.com/projects/rss',
                          'http://www.rightbrainsolution.com/images/logo.gif');
      
      //parse through the HTML and build up the RSS feed as we go along
      foreach($html->find('td[width="380"] p table') as $article) {
        //Create an empty FeedItem
        $newItem = $TestFeed->createNewItem();
      
        //Look up and add elements to the feed item   
        $newItem->setTitle($article->find('span.title', 0)->innertext);
        $newItem->setDescription($article->find('.ingress', 0)->innertext);
        $newItem->setLink($article->find('.lesMer', 0)->href);     
        $newItem->setDate($article->find('span.presseDato', 0)->plaintext);     
      
        //Now add the feed item
        $TestFeed->addItem($newItem);
      }
      
      $TestFeed->genarateFeed();
      ?>
      

      What problem are you having with html_entity_decode? A link to a page where it doesn't work might help.
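
The second snippet in the question passes the scraped date through unix2rssdate(), a function that is not shown. As a rough sketch using only PHP built-ins, the same conversion could look like this. The helper name to_rss_date() and the sample date format "17.02.2009" are assumptions; the real page may use a format that strtotime() cannot parse.

```php
<?php
// Fix the timezone so the formatted output is predictable (assumption: UTC is acceptable).
date_default_timezone_set('UTC');

// Hypothetical helper: turn a scraped date string into an RSS-style date,
// or return null when strtotime() cannot parse the input.
function to_rss_date($scraped)
{
    $timestamp = strtotime(trim($scraped));
    if ($timestamp === false) {
        return null;
    }
    // DATE_RSS is PHP's built-in format constant for RSS dates.
    return date(DATE_RSS, $timestamp);
}

// Made-up sample input in day.month.year form.
echo to_rss_date('17.02.2009'), "\n";
```

Returning null for unparseable input lets the caller decide whether to skip the date or the whole item.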


        【Answer 3】:

        "How can I make this code simpler?"

        I know this isn't what you asked, but do you know about Yahoo! Pipes (http://pipes.yahoo.com/pipes/)?

        【Comments】:

        • For anyone else who finds this old question: Yahoo Pipes appears to have shut down in mid-2015.
        【Answer 4】:

        Maybe you could just use something like Feedity (http://feedity.com), which already solves the problem of generating RSS feeds from any web page.
