Unable to scrape content using html dom parser from a particular website

【Question title】: Unable to scrape content using html dom parser from a particular website
【Posted】: 2017-02-21 07:35:50
【Question】:

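I have been trying to scrape content from websites and have succeeded on some of them. However, my code fails to scrape content from Flipkart.com. I am using the Simple HTML DOM parser, and this is my code: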

<?php
include 'simple_html_dom.php';
$scrape_url = 'https://www.flipkart.com/lenovo-f309-2-tb-external-hard-disk-drive/p/itmehwha6zkhkgfw';
$html = file_get_html($scrape_url);
foreach ($html->find('h1._3eAQiD') as $title_s) {
    echo $title_s->plaintext;
}
foreach ($html->find('div.hGSR34') as $ratings_s) {
    echo $ratings_s->plaintext;
}
?>

This code gives an empty result. Can someone tell me what the problem is? Is there any other way to scrape content from this website?

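【Comments】:

The content may be throttled. Or you may be expecting some JS-loaded content to be there. It would help us if you could narrow it down.
I think the content is loaded by JS. Is there any way to scrape the content with PHP?
You can run it through phantomjs first. If you want to go crazy, there are also some PHP selenium libraries.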
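
One way to narrow this down, as the first comment asks, is to check whether the class name appears anywhere in the raw HTML the server returns. If it does not, the element is probably injected by JavaScript (or the request was blocked), and no server-side HTML parser will find it. A minimal sketch, assuming the page source is already in a string; the helper name `raw_html_has_class` and the sample markup are made up for illustration:

```php
<?php
// Hypothetical helper: reports whether any class attribute in the raw HTML
// contains $class_name as a whole, whitespace-delimited token.
function raw_html_has_class(string $html, string $class_name): bool
{
    $pattern = '/class\s*=\s*(["\'])(?:[^"\']*\s)?'
        . preg_quote($class_name, '/')
        . '(?:\s[^"\']*)?\1/';
    return preg_match($pattern, $html) === 1;
}

// Static sample standing in for a downloaded page (not real Flipkart markup).
$sample = '<div class="hGSR34 extra">4.3</div><div id="root"></div>';
var_dump(raw_html_has_class($sample, 'hGSR34'));  // bool(true)
var_dump(raw_html_has_class($sample, '_3eAQiD')); // bool(false)
```

This is only a heuristic for quick triage: a `false` result suggests moving to a headless browser such as the phantomjs route mentioned above, while a `true` result suggests the selector or parser is the problem.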

Tags: php simple-html-dom

【Solution 1】:

This code works for me.

<?php
// Fetch the page and print every <div> whose class attribute starts with "hGSR34".
get_content_by_class(curl('https://www.flipkart.com/lenovo-f309-2-tb-external-hard-disk-drive/p/itmehwha6zkhkgfw'), 'hGSR34');

// Download a URL with cURL and return the response body (false on failure).
function curl($url) {
    $ch = curl_init();  // Initialise cURL
    //curl_setopt($ch, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);    // 0 = wait indefinitely for the connection
    curl_setopt($ch, CURLOPT_URL, $url);            // URL to request
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return the response instead of printing it
    $data = curl_exec($ch); // Execute the request
    curl_close($ch);        // Close the handle
    return $data;
}

// Print the HTML of every <div> whose class attribute starts with $container_class_name.
function get_content_by_class($html, $container_class_name) {
    // Lazy match up to the first closing </div>, so nested <div>s are cut short.
    $pattern = '#<\s*?div class="' . preg_quote($container_class_name, '#') . '\b[^>]*>(.*?)</div\b[^>]*>#s';
    preg_match_all($pattern, (string) $html, $matches);

    // $matches always holds two sub-arrays, so test the full-match list instead.
    if (empty($matches[0])) {
        echo 'no matches found';
        echo '<br>';
        return;
    }

    foreach ($matches as $match) {
        // Escape the markup so the browser shows it as text.
        print_r(array_map('htmlspecialchars', $match));
    }
    //return $matches;
}

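【Comments】:

The curl function fetches the page's HTML and returns it; the get_content_by_class function then extracts the HTML of the matching content by class name.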
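
Matching HTML with a regular expression stops at the first closing `</div>`, so nested markup gets truncated. As an alternative (not the answer author's method), the same lookup can be sketched with PHP's built-in DOM extension, `DOMDocument` and `DOMXPath`. The helper name `get_text_by_class` and the sample markup are made up for illustration, and the class name is assumed to contain no quotes or spaces:

```php
<?php
// Hypothetical helper: returns the trimmed text of every element whose
// class attribute contains $class_name as a whole token.
function get_text_by_class(string $html, string $class_name): array
{
    if ($html === '') {
        return []; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Common XPath 1.0 idiom for matching one class among several.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class_name
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Static sample standing in for a downloaded page (not real Flipkart markup).
$sample = '<div class="hGSR34 big">4.3 <span>stars</span></div>';
print_r(get_text_by_class($sample, 'hGSR34'));
```

In practice the string returned by the `curl()` function above would be passed in place of `$sample`. Like the regex version, this only sees what the server sends, so content rendered by JavaScript in the browser will still be missing.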