Get Contents (URL) and Run Pregmatch all, then Print out results as specified

【Question title】: Get Contents (URL) and Run Pregmatch all, then Print out results as specified
【Posted】: 2014-08-05 23:16:51
【Question】:

I am trying to scrape Craigslist for apartments.

Code:

$city = 'saltlakecity';
$rooms = '';
$query = '';
$sdate ='';
$url = 'http://'.$city.'.craigslist.org/search/apa?bedrooms='.$rooms.'&query='.$query.'&sale_date='.$sdate.'';
$base_url = parse_url($url, PHP_URL_HOST);
$resultspage = file_get_contents($url);

// use DOMDocument and DOMXpath
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($resultspage);
libxml_clear_errors();
$xpath = new DOMXpath($dom);

$data = array();
$rows = $xpath->query('//p[@class="row"]'); // get all rows
foreach($rows as $entries) { // loop each row
    $entry = array();
    $entry['title'] = $xpath->query('./span[@class="txt"]/span[@class="pl"]/a', $entries)->item(0)->nodeValue;
    $entry['link'] = 'http://' . $base_url . $xpath->query('./a[@class="i"]', $entries)->item(0)->getAttribute('href');
    $entry['price'] = $xpath->query('./span[@class="txt"]/span[@class="l2"]/span[1]', $entries)->item(0)->nodeValue;
    $location = $xpath->query('./span[@class="txt"]/span[@class="l2"]/span[2]', $entries)->item(0)->nodeValue;
    $loc = str_replace(array('(', ')'), '', $location);
    $entry['location'] = $loc;
    $entry['seller'] = $xpath->query('./span[@class="txt"]/span[@class="l2"]/a', $entries)->item(0)->nodeValue;

    $url2 = $entry['link'];
    $listingpage = file_get_contents($url2);
    $dom2 = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom2->loadHTML($listingpage);
    libxml_clear_errors();
    $xpath2 = new DOMXpath($dom2);
    $entry['address'] = $xpath2->query('./div[@class="mapAndAttrs"]/div[3]')->item(0)->nodeValue;

    $text_node = $xpath->query('./span[@class="txt"]/span[@class="l2"]/span[1]/following-sibling::text()[1]', $entries)->item(0)->nodeValue;
    // remove "/" and "-" | explode by space | filter out empty strings (two values remain: bedrooms and size)
    $text_node = array_filter(explode(' ', str_replace(array('/', '-'), '', $text_node)));
    $entry['bedrooms'] = array_shift($text_node); // bedroom
    $entry['dimensions'] = array_shift($text_node); // dimensions

    $data[] = $entry; // after gathering necessary items, assign inside
}

echo '<pre>';
print_r($data);

**Update: I am now trying to scrape the links I have already scraped, in order to get each property's address.**

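What I want to accomplish is to run a preg_match_all, find the title, URL, number of bedrooms, city, and price, and then print them out. However, if I simply output "$matches", the page prints "Array". If I use the code above, the page loads blank (white).

Could someone check my code and tell me what I might be doing wrong here? Thanks!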
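
For readers puzzled by the "Array" output mentioned above: preg_match_all fills its third argument with a nested array, so echoing it directly only prints the word "Array" (plus a notice or warning, depending on the PHP version). Below is a minimal sketch of inspecting the matches instead. The markup and the pattern are made up for illustration and are not Craigslist's real HTML.

```php
<?php
// Toy markup, assumed for illustration only.
$html = '<p class="row"><a href="/apa/1.html">Home for rent</a> '
      . '<span class="price">$2050</span></p>';

// Hypothetical pattern that captures the text inside the price span.
preg_match_all('/<span class="price">([^<]+)<\/span>/', $html, $matches);

// $matches[0] holds the full matches, $matches[1] the first capture group.
// "echo $matches;" would only print "Array", so dump or loop over it instead.
print_r($matches);

foreach ($matches[1] as $price) {
    echo $price, PHP_EOL;
}
```

This only illustrates why the array has to be dumped or iterated; it is not a recommendation to parse the listing HTML with regular expressions.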

【Comments】:

Tags: php html xpath web-scraping domdocument

【Solution 1】:

I would humbly suggest using the proper tool for the job, an HTML parser (DOMDocument and DOMXpath), instead of regular expressions. Example:

$city = 'saltlakecity';
$url = "http://".$city.".craigslist.org/search/apa/?bedrooms=2&hasPic=1&query=";
$base_url = parse_url($url, PHP_URL_HOST);
$resultspage = file_get_contents($url);

// use DOMDocument and DOMXpath
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($resultspage);
libxml_clear_errors();
$xpath = new DOMXpath($dom);

$data = array();
$rows = $xpath->query('//p[@class="row"]'); // get all rows
foreach($rows as $entries) { // loop each row
    $entry = array();
    $entry['title'] = $xpath->query('./span[@class="txt"]/span[@class="pl"]/a', $entries)->item(0)->nodeValue;
    $entry['link'] = 'http://' . $base_url . $xpath->query('./a[@class="i"]', $entries)->item(0)->getAttribute('href');
    $entry['price'] = $xpath->query('./span[@class="txt"]/span[@class="l2"]/span[1]', $entries)->item(0)->nodeValue;
    $text_node = $xpath->query('./span[@class="txt"]/span[@class="l2"]/span[1]/following-sibling::text()[1]', $entries)->item(0)->nodeValue;
    // remove "/" and "-" | explode by space | filter out empty strings (two values remain: bedrooms and size)
    $text_node = array_filter(explode(' ', str_replace(array('/', '-'), '', $text_node)));
    $entry['bedrooms'] = array_shift($text_node); // bedroom
    $entry['dimensions'] = array_shift($text_node); // dimensions

    $address = @$xpath->query('./span[@class="txt"]/span[@class="l2"]/span[@class="pnr"]/small', $entries)->item(0)->nodeValue;
    $address = str_replace(array('(', ')'), '', $address);
    $entry['address'] = $address;

    $data[] = $entry; // after gathering necessary items, assign inside
}

echo '<pre>';
print_r($data);

This should output:

Array
(
    [0] => Array
        (
            [title] => Beautiful Spacious Sandy Home for rent
            [link] => http://saltlakecity.craigslist.org/apa/4605359897.html
            [price] => $2050
            [bedrooms] => 6br
            [dimensions] => 3710ft²
            [address] =>  10251 Snow Iris Way, Sandy
        )
    and many more ...

【Discussion】:

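@user3259138 I modified the code a bit. If you are dealing with HTML elements, always consider using an HTML parser (simple_html_dom works too); it is the recommended approach for cases like this.
So I actually have a question that I'd like you to expand on.
I want to take entry['link'] and then construct another URL to scrape, like "http://'.$city.'.craigslist.org'.$entry['link']", and then grab the actual address shown right below the map. I tried setting up a whole new DOMDocument and DOMXpath and everything, but I can't quite figure out how to get the address and make it part of the existing "data" array. Suggestions?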
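@user3259138 Just create another DOMDocument and DOMXpath instance inside the loop, using that URL. From there, apply the same logic as in the answer (XPath queries) to get the data you need. Of course, don't expect this to be fast, since you have a lot of rows.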
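
A rough sketch of what that comment describes, for anyone following along. To keep it runnable offline, the `$pages` array stands in for `file_get_contents($entry['link'])`. The helper `xpathFor()`, the example.org URL, and the sample markup are made up for illustration. The `mapAndAttrs` class and the `div[3]` position are taken from the question's code and may not match the live site.

```php
<?php
// Hypothetical helper: build a DOMXPath for an HTML string.
function xpathFor(string $html): DOMXPath
{
    $dom = new DOMDocument();
    // Collect markup warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return new DOMXPath($dom);
}

// Stand-in for the downloaded listing pages (assumed markup).
$pages = array(
    'http://example.org/apa/1.html' =>
        '<html><body><div class="mapAndAttrs"><div>map</div><div>attrs</div>'
        . '<div>10251 Snow Iris Way, Sandy</div></div></body></html>',
);

// Stand-in for the rows already scraped from the results page.
$entries = array(
    array('title' => 'Beautiful Spacious Sandy Home for rent',
          'link'  => 'http://example.org/apa/1.html'),
);

$data = array();
foreach ($entries as $entry) {
    // Second parser instance, one per listing page.
    $listingXpath = xpathFor($pages[$entry['link']]);
    // "//" searches the whole listing document; no context node is needed.
    $node = $listingXpath->query('//div[@class="mapAndAttrs"]/div[3]')->item(0);
    // Store null when the address block is missing instead of failing.
    $entry['address'] = $node ? trim($node->nodeValue) : null;
    $data[] = $entry; // the address ends up in the same array as the other fields
}

print_r($data);
```

In the real loop, the lookup in `$pages` would be replaced by the actual download, which means one extra HTTP request per row. That is why it will be slow.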
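OK, the code is updated now. I'm currently getting a "Notice: Trying to get property of non-object" error. Any suggestions on how to correctly scrape entry['link'] for the address and add the scraped address to the existing data array?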
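
One likely cause of that notice, sketched under assumptions: when a query matches nothing, `->item(0)` returns null, and reading `->nodeValue` on null triggers the notice. According to the PHP manual, a query without a context node is evaluated relative to the root element, so a path starting with `./div` only looks at direct children of `<html>`. The sample markup and the helper `firstText()` below are hypothetical.

```php
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML(
    '<html><body><div class="mapAndAttrs"><div>a</div><div>b</div>'
    . '<div>123 Main St</div></div></body></html>'
);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Relative path, no context node: looks for <div> children of <html>, finds none.
$relative = $xpath->query('./div[@class="mapAndAttrs"]/div[3]');
// "//" searches all descendants, so the nested block is found.
$absolute = $xpath->query('//div[@class="mapAndAttrs"]/div[3]');

// Hypothetical helper: text of the first match, or a default when nothing matches.
function firstText(DOMXPath $xpath, string $expr, ?DOMNode $context = null, string $default = ''): string
{
    $nodes = $xpath->query($expr, $context);
    if ($nodes === false || $nodes->length === 0) {
        return $default; // avoids reading ->nodeValue on null
    }
    return trim($nodes->item(0)->nodeValue);
}

echo 'relative matches: ', $relative->length, PHP_EOL;
echo 'absolute matches: ', $absolute->length, PHP_EOL;
echo firstText($xpath, '//div[@class="mapAndAttrs"]/div[3]', null, 'n/a'), PHP_EOL;
```

Checking the length (or using a helper like this) before touching `->nodeValue` also covers listings that have no address block at all.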