【发布时间】:2016-08-05 21:10:13
【问题描述】:
我正在使用 PHP 检索一个远程页面,从该页面获取一些链接并访问每个链接并对其进行解析。
我花了大约 12 秒,这太多了,我需要以某种方式优化代码。
我的代码是这样的:
$result = get_web_page('THE_WEB_PAGE');
preg_match_all('/<a data\-a=".*" href="(.*)">/', $result['content'], $matches);
foreach ($matches[2] as $lnk) {
$result = get_web_page($lnk);
preg_match('/<span id="tests">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test'] = $match[1];
preg_match('/<span id="tests2">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test2'] = $match[1];
preg_match('/<span id="tests3">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test3'] = $match[1];
++$index;
}
我在循环中还有一些 preg_match 调用。
如何优化我的代码?
编辑:
我已将代码更改为使用 xpath 而不是正则表达式,但它变得更慢了。
编辑2:
这是我的完整代码:
<?php
$begin = microtime(TRUE);
$result = get_web_page('WEB_PAGE');
$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);
// Get the links
$matches = $xpath->evaluate('//li[@class = "lasts"]/a[@class = "lnk"]/@href | //li[@class=""]/a[ @class = "lnk"]/@href');
if ($matches === FALSE) {
echo 'error';
exit();
}
foreach ($matches as $match) {
$links[] = 'WEB_PAGE'.$match->value;
}
$index = 0;
// For each link
foreach ($links as $link) {
echo (string)($index).' loop '.(string)(microtime(TRUE)-$begin).'<br>';
$result = get_web_page($link);
$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);
$match = $xpath->evaluate('concat(//span[@id = "header"]/span[@id = "sub_header"]/text(), //span[@id = "header"]/span[@id = "sub_header"]/following-sibling::text()[1])');
if ($matches === FALSE) {
exit();
}
$data[$index]['name'] = $match;
$matches = $xpath->evaluate('//li[starts-with(@class, "active")]/a/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['types'][] = $match->data;
}
$matches = $xpath->evaluate('//span[@title = "this is a title" and @class = "info"]/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['info'][] = $match->data;
}
$matches = $xpath->evaluate('//span[@title = "this is another title" and @class = "name"]/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['names'][] = $match->data;
}
++$index;
}
?>
【问题讨论】:
-
使用正则解析HTML时自找麻烦。 (参考@Tim van Osch 的回答)stackoverflow.com/questions/1732348/…
-
一开始就使用贪婪量词,你将如何获得预期的结果?
-
@revo 你是什么意思?我得到了预期的结果...
-
好吧,除非你举一些
$result['content']可以容纳的例子,否则我怀疑。
标签: php regex parsing optimization xpath