Optimize remote page retrieving and parsing

【Question title】: Optimize remote page retrieving and parsing
【Posted】: 2016-08-05 21:10:13
【Question】:

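I'm using PHP to retrieve a remote page, extract some links from it, then visit each link and parse it.
The whole run takes about 12 seconds, which is too long, so I need to optimize the code somehow.
My code looks like this: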

$result = get_web_page('THE_WEB_PAGE');

preg_match_all('/<a data\-a=".*" href="(.*)">/', $result['content'], $matches);

$index = 0;

// The pattern has a single capture group, so the hrefs are in $matches[1]
foreach ($matches[1] as $lnk) {
    $result = get_web_page($lnk);

    preg_match('/<span id="tests">(.*)<\/span>/', $result['content'], $match);

    $re[$index]['test'] = $match[1];

    preg_match('/<span id="tests2">(.*)<\/span>/', $result['content'], $match);

    $re[$index]['test2'] = $match[1];

    preg_match('/<span id="tests3">(.*)<\/span>/', $result['content'], $match);

    $re[$index]['test3'] = $match[1];
    ++$index;
}

I have a few more preg_match calls inside the loop.
How can I optimize my code?

Edit:

I've changed the code to use xpath instead of regex, but it got slower.

Edit 2:

Here is my full code:

<?php
$begin = microtime(TRUE);
$result = get_web_page('WEB_PAGE');

$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);

// Get the links
$matches = $xpath->evaluate('//li[@class = "lasts"]/a[@class = "lnk"]/@href | //li[@class=""]/a[ @class = "lnk"]/@href');
if ($matches === FALSE) {
    echo 'error';
    exit();
}
$links = [];
foreach ($matches as $match) {
    $links[] = 'WEB_PAGE'.$match->value;
}

$index = 0;

// For each link
foreach ($links as $link) {
    echo (string)($index).' loop '.(string)(microtime(TRUE)-$begin).'<br>';
    $result = get_web_page($link);

    $dom = new DOMDocument();
    $dom->loadHTML($result['content']);
    $xpath = new DOMXPath($dom);

    $match = $xpath->evaluate('concat(//span[@id = "header"]/span[@id = "sub_header"]/text(), //span[@id = "header"]/span[@id = "sub_header"]/following-sibling::text()[1])');
    if ($match === FALSE) {
        exit();
    }
    $data[$index]['name'] = $match;

    $matches = $xpath->evaluate('//li[starts-with(@class, "active")]/a/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['types'][] = $match->data;
    }

    $matches = $xpath->evaluate('//span[@title = "this is a title" and @class = "info"]/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['info'][] = $match->data;
    }

    $matches = $xpath->evaluate('//span[@title = "this is another title" and @class = "name"]/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['names'][] = $match->data;
    }

    ++$index;
}

?>

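【Comments】:

You're asking for trouble when you parse HTML with regex. (See @Tim van Osch's answer) stackoverflow.com/questions/1732348/…
stackoverflow.com/questions/3577641/…
With greedy quantifiers right from the start, how would you ever get the expected results?
@revo What do you mean? I am getting the expected results...
Well, unless you show some examples of what $result['content'] can hold, I doubt it.

Tags: php regex parsing optimization xpath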

【Solution 1】:

As others have mentioned, use a parser instead (i.e. DOMDocument) and combine it with xpath queries. Consider the following example:

<?php

# set up some dummy data
$data = <<<DATA
<div>
    <a class='link'>Some link</a>
    <a class='link' id='otherid'>Some link 2</a>
</div>
DATA;

$dom = new DOMDocument();
$dom->loadHTML($data);

$xpath = new DOMXPath($dom);

# all links
$links = $xpath->query("//a[@class = 'link']");
print_r($links);

# special id link
$special = $xpath->query("//a[@id = 'otherid']");

# and so on
$textlinks = $xpath->query("//a[starts-with(text(), 'Some')]");
?>
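
print_r on a DOMNodeList shows little of its contents, so here is a minimal sketch of reading values out of the nodes a query returns. The markup has the same shape as the dummy data above; the href attributes and the $found array are added purely for illustration.

```php
<?php
// Dummy markup, same shape as the example above; the href values are hypothetical.
$html = <<<HTML
<div>
    <a class='link' href='/first'>Some link</a>
    <a class='link' id='otherid' href='/second'>Some link 2</a>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$found = [];
foreach ($xpath->query("//a[@class = 'link']") as $a) {
    // Each $a is a DOMElement: textContent is the link text,
    // getAttribute() reads a single attribute.
    $found[] = [
        'text' => $a->textContent,
        'href' => $a->getAttribute('href'),
    ];
}

print_r($found);
```

The query result can be iterated directly with foreach, so there is no need to copy it into an array first unless you want to keep the values after the document is discarded.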

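【Discussion】:

Following your suggestion I changed the code to use xpath instead of regex, but it got slower.
@Lior: Then you need to be more specific with your xpath queries, i.e. /div/span/p/a instead of //a. I would pick the more robust solution even if it is a bit slower (1-2 seconds).
The problem is that it runs inside the foreach loop over the links I retrieved, so every iteration adds to the total. 0 loop 1.66981506348 1 loop 2.49688410759 2 loop 3.00950098038 3 loop 3.5253970623 4 loop 4.01076102257 5 loop 4.67162799835 6 loop 5.2378718853 7 loop 5.74008488655 8 loop 6.26041197777 9 loop 6.78747105598 10 loop 7.47332000732 11 loop 8.03243994713 12 loop 8.50538802147 13 loop 9.37472701073 14 loop 11.5049209595 15 loop 12.2112920284 ... 40 loop 30.2815680504 41 loop 31.1307020187
@Lior: It probably doesn't need to run in a loop. Post your full code in the question.
Please show some of the HTML output; the queries can probably be combined or made relative.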
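
As a rough illustration of the "more specific / relative" advice in the comments above: the sketch below locates the container nodes once and then evaluates short relative paths against each of them, instead of repeating a document-wide // search for every field. The markup, class names and the $rows array are hypothetical, and whether this is measurably faster depends on the real pages, so time it before and after.

```php
<?php
// Hypothetical markup: each <li class="item"> holds one record.
$html = <<<HTML
<ul id="list">
    <li class="item"><span class="name">Alpha</span><span class="info">first</span></li>
    <li class="item"><span class="name">Beta</span><span class="info">second</span></li>
</ul>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rows = [];
// One anchored search to find the container nodes...
foreach ($xpath->query("//ul[@id = 'list']/li[@class = 'item']") as $li) {
    // ...then relative paths evaluated against $li only.
    // The second argument of evaluate()/query() is the context node.
    $rows[] = [
        'name' => $xpath->evaluate("string(span[@class = 'name'])", $li),
        'info' => $xpath->evaluate("string(span[@class = 'info'])", $li),
    ];
}

print_r($rows);
```

Wrapping the path in string() makes evaluate() return a plain PHP string, which avoids a second loop over a node list when only one value per field is expected.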

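【Solution 2】:

Consider using a DOM framework for PHP. This should be faster.

Use PHP's DOMDocument together with xpath queries:
http://php.net/manual/en/class.domdocument.php

See Jan's answer for more explanation.

According to the comments, the following approach also works but is less preferable.
For example:
http://simplehtmldom.sourceforge.net/

An example that gets all a tags on a page: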

<?php
  include_once('simple_html_dom.php');

  $url = "http://your_url/";
  $html = new simple_html_dom();
  $html->load_file($url);

  foreach($html->find("a") as $link)
  {
    // do something with the link
  }
?>
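
For comparison, here is a minimal sketch of the same "all a tags" task using only the built-in DOMDocument, with no external library. The inline markup stands in for a fetched page and the $hrefs array is an illustrative name, not part of the original answer.

```php
<?php
// Inline sample markup stands in for a fetched page (assumption for illustration).
$html = '<html><body><p><a href="/a">A</a> and <a href="/b">B</a></p></body></html>';

// Real-world HTML is often malformed; collect libxml warnings internally
// instead of letting loadHTML() print them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    // do something with the link; here we just collect its href
    $hrefs[] = $link->getAttribute('href');
}

print_r($hrefs);
```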

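【Discussion】:

There is no need for an external library.
Note that simple_html_dom is not that simple: its source code makes heavy use of regular expressions.
...and it eats your memory exponentially.
Your answer shows another way to do it and doesn't deserve to be downvoted; it is good to know about as well.
FYI: +1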