【Question title】: Screen-scraping JavaScript in PHP
【Posted】: 2014-04-09 08:52:47
【Question】:

I can successfully scrape all the items on this page with the following script:

$html = file_get_contents($list_url);
$doc = new DOMDocument();
libxml_use_internal_errors(TRUE);

    if(!empty($html))
    {
        $doc->loadHTML($html);
        libxml_clear_errors(); // remove errors for yucky html
        $xpath = new DOMXPath($doc);

        /* FIND LINK TO PRODUCT PAGE */

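        /* Initialise every array that is filled below */
        $product_urls = array();
        $photo_url = array();
        $was_price = array();
        $now_price = array();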

        $row = $xpath->query($product_location);

        /* Create an array containing products */
        if ($row->length > 0)
        {            
            foreach ($row as $location)
            {
                $product_urls[] = $product_url_root . $location->getAttribute('href');
            }
        }
        else { echo "product location is wrong<br>";}

        $imgs = $xpath->query($photo_location);

        /* Create an array containing the image links */
        if ($imgs->length > 0)
        {            
            foreach ($imgs as $img)
            {
                $photo_url[] = $photo_url_root . $img->getAttribute('src');
            }
        }
        else { echo "photo location is wrong<br>";}

        $was = $xpath->query($was_price_location);

        /* Create an array containing the was price */
        if ($was->length > 0)
        {
            foreach ($was as $price)
            {
                $stripped = preg_replace("/[^0-9,.]/", "", $price->nodeValue);
                $was_price[] = "&pound;".$stripped;
            }
        }
        else { echo "was price location is wrong<br>";}

        $now = $xpath->query($now_price_location);

        /* Create an array containing the sale price */
        if ($now->length > 0)
        {
            foreach ($now as $price)
            {
                $stripped = preg_replace("/[^0-9,.]/", "", $price->nodeValue);
                $stripped = number_format((float)$stripped, 2, '.', '');
                $now_price[] = "&pound;".$stripped;
            }
        }
        else { echo "now price location is wrong<br>";}

        $result = array();

        /* Create an associative array containing all the above values */
        foreach ($product_urls as $i => $product_url)
        {
            $result[] = array(
                'product_url' => $product_url,
                'shop_name' => $shop_name,
                'photo_url' => $photo_url[$i],
                'was_price' => $was_price[$i],
                'now_price' => $now_price[$i]
            );
        }
    }

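However, if I want to get the second page, or if I view 100 items per page, there is a problem: file_get_contents($list_url) always returns the first page with 24 values.

I think the page change is handled by an AJAX request (although I can't find any evidence of this in the source). Is there a way to scrape exactly what I see on screen?

I have seen PhantomJS discussed in previous answers, but given that I'm using PHP, I'm not sure whether it is appropriate here.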

【Comments】:

    Tags: javascript php ajax web-scraping


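    【Answer 1】:

    This happens because the tags in the links are generated by a JS script. Turn off JavaScript for that site and check the links it outputs.

    For example, the second page is http://www.hm.com/gb/subdepartment/sale?page=1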
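Building on that observation, here is a minimal sketch of generating one list URL per page by varying the `page` query parameter. The helper name `build_page_urls` and the page count of 3 are made up for illustration, and the assumption that numbering starts at 0 is only inferred from the second page being `page=1`; the site itself does not confirm it.

```php
<?php
// Hypothetical helper: builds one list URL per page by setting the
// `page` query parameter. Assumes pages are numbered from 0, which is
// inferred from the second page being ?page=1.
function build_page_urls($base_url, $page_count)
{
    $urls = array();
    // Append with '&' if the base URL already carries a query string.
    $separator = (strpos($base_url, '?') === false) ? '?' : '&';
    for ($page = 0; $page < $page_count; $page++) {
        $urls[] = $base_url . $separator . http_build_query(array('page' => $page));
    }
    return $urls;
}

$page_urls = build_page_urls('http://www.hm.com/gb/subdepartment/sale', 3);

// Each URL can then be fed to the existing scraping code, for example:
// foreach ($page_urls as $list_url) { $html = file_get_contents($list_url); /* ... */ }
```

Each generated URL can take the place of `$list_url` in the script from the question, so the XPath logic stays unchanged.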

    【Comments】:

    • You are an absolute hero.
    【Answer 2】:
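    // file_get_html() comes from the Simple HTML DOM Parser library
    include 'simple_html_dom.php';

    // Create DOM from URL or file
    $file = file_get_html('http://stackoverflow.com/');
    
    // Find your links and print each href
    foreach ($file->find('a') as $yourElement) {
        echo $yourElement->href . '<br>';
    }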
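
For comparison, the same link extraction can be sketched with PHP's bundled DOM extension, which the script in the question already uses, so no extra library is needed. The function name `extract_links` and the sample markup are hypothetical and serve only as an illustration.

```php
<?php
// Hypothetical helper: returns the href of every <a> element in an HTML
// string, using the built-in DOMDocument and DOMXPath classes.
function extract_links($html)
{
    $links = array();
    if (trim($html) === '') {
        return $links; // nothing to parse
    }

    // Collect libxml warnings internally so messy markup does not emit notices.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select only anchors that actually carry an href attribute.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Sample markup, made up for illustration.
$sample = '<html><body><a href="/one">One</a><a>no href</a><a href="/two">Two</a></body></html>';
$links = extract_links($sample);

foreach ($links as $href) {
    echo $href . "\n";
}
```

To run it against a live page, pass the result of `file_get_contents($url)` to `extract_links()` instead of the sample string.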
    

    【Comments】:
