【Question title】: Get contents (URL) and run preg_match_all, then print out results as specified
【Posted】: 2014-08-05 23:16:51
【Question】:

I'm trying to search for apartments on craigslist.

Code:

$city = 'saltlakecity';
$rooms = '';
$query = '';
$sdate ='';
$url = 'http://'.$city.'.craigslist.org/search/apa?bedrooms='.$rooms.'&query='.$query.'&sale_date='.$sdate.'';
$base_url = parse_url($url, PHP_URL_HOST);
$resultspage = file_get_contents($url);

// use DOMDocument and DOMXpath
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($resultspage);
libxml_clear_errors();
$xpath = new DOMXpath($dom);

$data = array();
$rows = $xpath->query('//p[@class="row"]'); // get all rows
foreach($rows as $entries) { // loop each row
$entry = array();
$entry['title'] = $xpath->query('./span[@class="txt"]/span[@class="pl"]/a', $entries)->item(0)->nodeValue;
$entry['link'] = 'http://' . $base_url . $xpath->query('./a[@class="i"]', $entries)->item(0)->getAttribute('href');
$entry['price'] = $xpath->query('./span[@class="txt"]/span[@class="l2"]/span[1]', $entries)->item(0)->nodeValue;
$location = $xpath->query('./span[@class="txt"]/span[@class="l2"]/span[2]', $entries)->item(0)->nodeValue;
$loc = str_replace(array('(', ')'), '', $location);
$entry['location'] = $loc;
$entry['seller'] = $xpath->query('./span[@class="txt"]/span[@class="l2"]/a', $entries)->item(0)->nodeValue;

$url2 = $entry['link'];
$listingpage = file_get_contents($url2);
$dom2 = new DOMDocument();
libxml_use_internal_errors(true);
$dom2->loadHTML($listingpage);
libxml_clear_errors();
$xpath2 = new DOMXpath($dom2);
$entry['address'] = $xpath2->query('./div[@class="mapAndAttrs"]/div[3]')->item(0)->nodeValue;

$text_node = $xpath->query('./span[@class="txt"]/span[@class="l2"]/span[1]/following-sibling::text()[1]', $entries)->item(0)->nodeValue;
// remove "/" and "-", split on spaces, drop empty strings (leaves two values: bedrooms and size)
$text_node = array_filter(explode(' ', str_replace(array('/', '-'), '', $text_node)));
$entry['bedrooms'] = array_shift($text_node); // bedroom
$entry['dimensions'] = array_shift($text_node); // dimensions

$data[] = $entry; // after gathering necessary items, assign inside
}

echo '<pre>';
print_r($data);

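**Update: I'm now trying to scrape the links I've already scraped, to get the address of each property**

What I'm trying to accomplish is to run preg_match, find the title, URL, number of bedrooms, city, and price, and then print them out. However, if I simply output `$matches`, the page prints "Array". If I use the code above, the page loads blank (white).

Could someone check my code and tell me what I may be doing wrong here? Thanks!

【Question discussion】:

    Tags: php html xpath web-scraping domdocument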


    【Solution 1】:

    I'd humbly suggest using the proper tool for the job (an HTML parser), namely DOMDocument and DOMXpath, instead of regular expressions. Example:

    $city = 'saltlakecity';
    $url = "http://".$city.".craigslist.org/search/apa/?bedrooms=2&hasPic=1&query=";
    $base_url = parse_url($url, PHP_URL_HOST);
    $resultspage = file_get_contents($url);
    
    // use DOMDocument and DOMXpath
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($resultspage);
    libxml_clear_errors();
    $xpath = new DOMXpath($dom);
    
    $data = array();
    $rows = $xpath->query('//p[@class="row"]'); // get all rows
    foreach($rows as $entries) { // loop each row
        $entry = array();
        $entry['title'] = $xpath->query('./span[@class="txt"]/span[@class="pl"]/a', $entries)->item(0)->nodeValue;
        $entry['link'] = 'http://' . $base_url . $xpath->query('./a[@class="i"]', $entries)->item(0)->getAttribute('href');
        $entry['price'] = $xpath->query('./span[@class="txt"]/span[@class="l2"]/span[1]', $entries)->item(0)->nodeValue;
        $text_node = $xpath->query('./span[@class="txt"]/span[@class="l2"]/span[1]/following-sibling::text()[1]', $entries)->item(0)->nodeValue;
        // remove "/" and "-", split on spaces, drop empty strings (leaves two values: bedrooms and size)
        $text_node = array_filter(explode(' ', str_replace(array('/', '-'), '', $text_node)));
        $entry['bedrooms'] = array_shift($text_node); // bedroom
        $entry['dimensions'] = array_shift($text_node); // dimensions
    
        $address = @$xpath->query('./span[@class="txt"]/span[@class="l2"]/span[@class="pnr"]/small', $entries)->item(0)->nodeValue;
        $address = str_replace(array('(', ')'), '', $address);
        $entry['address'] = $address;
    
        $data[] = $entry; // after gathering necessary items, assign inside
    }
    
    echo '<pre>';
    print_r($data);
    

    This should output:

    Array
    (
        [0] => Array
            (
                [title] => Beautiful Spacious Sandy Home for rent
                [link] => http://saltlakecity.craigslist.org/apa/4605359897.html
                [price] => $2050
                [bedrooms] => 6br
                [dimensions] => 3710ft²
                [address] =>  10251 Snow Iris Way, Sandy
            )
        and many more ...
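
Since the live page markup may differ from what the answer was written against, here is a minimal offline sketch of the same relative-XPath technique. The inline HTML is only an assumed approximation of one result row (the title, price, and link values are made up for illustration), so treat it as a way to experiment with the queries rather than as the actual craigslist structure.

```php
<?php
// Hypothetical sample row; NOT fetched from craigslist, values are invented.
$html = <<<'HTML'
<html><body>
<p class="row">
  <a class="i" href="/apa/0000000000.html"></a>
  <span class="txt">
    <span class="pl"><a href="/apa/0000000000.html">Sample two-bedroom listing</a></span>
    <span class="l2"><span class="price">$950</span> / 2br - 900ft - <span class="pnr"><small> (Sandy)</small></span></span>
  </span>
</p>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$data = array();
foreach ($xpath->query('//p[@class="row"]') as $row) {
    $l2 = './span[@class="txt"]/span[@class="l2"]';
    $entry = array();
    // Passing $row as the second argument makes "./" relative to that row.
    $entry['title'] = $xpath->query('./span[@class="txt"]/span[@class="pl"]/a', $row)->item(0)->nodeValue;
    $entry['link']  = $xpath->query('./a[@class="i"]', $row)->item(0)->getAttribute('href');
    $entry['price'] = $xpath->query($l2 . '/span[1]', $row)->item(0)->nodeValue;

    // The text right after the price span looks like " / 2br - 900ft - ".
    $text  = $xpath->query($l2 . '/span[1]/following-sibling::text()[1]', $row)->item(0)->nodeValue;
    $parts = array_values(array_filter(explode(' ', str_replace(array('/', '-'), '', $text))));
    $entry['bedrooms']   = isset($parts[0]) ? $parts[0] : null;
    $entry['dimensions'] = isset($parts[1]) ? $parts[1] : null;

    $data[] = $entry;
}

print_r($data);
```

Using `isset()` on the split parts means a row that lists only bedrooms (or neither value) yields `null` instead of a notice.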
    

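    【Comments】:

    • @user3259138 I revised some of the code. If you're dealing with HTML elements, always consider using an HTML parser (simple_html_dom works too); it's the recommended approach for this kind of situation.
    • So I actually have a follow-up question for you.
    • I want to take entry['link'] and build another URL to scrape, like "http://'.$city.'.craigslist.org'.$entry['link']", and then grab the actual address shown just below the map. I tried setting up a whole new DOMDocument and DOMXpath and everything, but I can't quite figure out how to get the address and make it part of the existing "data" array. Suggestions?
    • @user3259138 Just create another DOMDocument and DOMXpath instance with that URL inside the loop. From there, apply the same logic as in the answer (XPath queries) to get the data you need. Don't expect it to be fast, of course, since you have a lot of rows.
    • OK, the code is updated now. I'm currently getting a "Notice: Trying to get property of non-object" error. Any suggestions on how to properly scrape entry['link'] for the address and add the scraped address to the existing data array?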
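
One likely cause of that notice, sketched below: `item(0)` returns `null` when a query matches nothing, and a query starting with `./div` that is run without a context node only looks at direct children of the document itself, so it cannot find a `div` nested inside `body`. The helper names `first_text()` and `xpath_for_html()` are hypothetical, and the inline page is an assumed stand-in built around the `mapAndAttrs` / `div[3]` path from the updated question; the real listing markup may well be different.

```php
<?php
// Hypothetical helper: trimmed text of the first match, or null if nothing matches.
function first_text(DOMXPath $xpath, $query, $context = null)
{
    $nodes = $context === null ? $xpath->query($query) : $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null; // avoids "Trying to get property of non-object"
    }
    return trim($nodes->item(0)->nodeValue);
}

// Hypothetical helper: parse an HTML string and return an XPath object for it.
function xpath_for_html($html)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    return new DOMXPath($dom);
}

// Assumed stand-in for one listing page. Inside the real loop this would be
// something like file_get_contents($entry['link']).
$listingpage = '<html><body><div class="mapAndAttrs">'
    . '<div>map</div><div>attributes</div><div>123 Example St, Sandy</div>'
    . '</div></body></html>';

$entry  = array('title' => 'Sample listing');
$xpath2 = xpath_for_html($listingpage);

// "//" searches the whole document, unlike "./div" with no context node.
$entry['address'] = first_text($xpath2, '//div[@class="mapAndAttrs"]/div[3]');

// A query that matches nothing now yields null instead of a notice.
$none = first_text($xpath2, '//div[@class="does-not-exist"]');

var_dump($entry, $none);
```

Because `$entry['address']` is assigned before `$data[] = $entry;` in the loop, the address ends up in the same array element as the title, link, and price.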