Answers to: PHP DOM html issue with a certain tag

【Question title】: PHP DOM html issue with a certain tag
【Posted】: 2018-01-07 02:06:43
【Question description】:

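Guys, I usually find my answers on the web and on Stack Overflow, but this time I can't solve my problem. I'm using PHP DOM to parse a website and extract some data from it, but for some reason every query I try returns fewer items than there are on the page.

I tried "PHP Simple HTML DOM", "PHP Advanced HTML DOM", and the native PHP DOM... but in every case I still get only 14 article tags.

http://www.emol.com/movil/nacional/

This site has 28 elements tagged "article", but I always get 14 (or fewer).

I tried the classic find (in both the Simple and the Advanced library) with every possible combination, and with the native DOM I tried both XPath queries and getElementsByTagName.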

$xpath->query('//article');
$xpath->query('//*[@id="listNews"]/article[6]'); // even this doesn't work
$html->find('article:not(.sec_mas_vistas_emol), article'); // returns 14

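So my guess was that the problem is the way I load the URL... so I tried the classic "file_get_html", curl, and some custom functions... and they all give the same result. Stranger still, if I use an online XPath tester, paste in all the HTML, and run $xpath->query('//article'), it finds all of them. These are my last two tests: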

// Way 1
$html = file_get_html('http://www.emol.com/movil/nacional/');
$lidata = $html->find('article');

// Way 2
$url = 'http://www.emol.com/movil/nacional';
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$e = curl_exec($ch);
$dom = new DOMDocument;
@$dom->loadHTML($e); // tried loadHTMLFile and libxml_use_internal_errors too
$xpath = new DOMXPath($dom);
$xpath->query('//article');

Any suggestions about what the problem might be and how to fix it? This is actually my first time using PHP DOM, so I may well be missing something.

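【Question discussion】:

Only 14 article elements exist at the link you provided.
I agree with @marcell. There are only 14 articles on that page.
No, there are 30. Check the inspector, they are easy to find => screenshot
No. If you view the source of that page, there are 14 articles. That is what you get when you fetch the page from PHP, and that is why you only get 14 articles. I tried it myself.
You can go ahead and use a headless browser to fetch the dynamic data. The casperjs library has a PHP wrapper; CasperJS is a navigation scripting and testing utility for the PhantomJS (WebKit) and SlimerJS (Gecko) headless browsers, written in JavaScript. See here.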

Tags: php html domdocument

【Solution 1】:

Maybe my comment above and this example can help you move forward.

Using the phpcasperjs wrapper:

<?php

require_once 'vendor/autoload.php';

use Browser\Casper;

// Open the page in a headless browser and wait 5000 ms so its scripts can run.
$casper = new Casper();
$casper->start('http://www.emol.com/movil/nacional/');
$casper->wait(5000);
$output = $casper->getOutput();
$casper->run();
// Parse the HTML the browser ended up with, not the raw server response.
$html = $casper->getHtml();
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// List every <article> element with a running count.
$cnt = 1;
foreach ($xpath->query('//article') as $article) {
    print $cnt . ' - ' . $article->nodeName . ' - ' . $article->getAttribute('id') . "\n";
    $cnt += 1;
}

Using file_get_contents, as you tried before:

<?php

$html = file_get_contents('http://www.emol.com/movil/nacional/');
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$cnt = 1;
foreach ($xpath->query('//article') as $article) {
    print $cnt . ' - ' . $article->nodeName . ' - ' . $article->getAttribute('id') . "\n";
    $cnt += 1;
}

The count is 30 (with phpcasperjs) versus 14 (with file_get_contents).
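
The difference is consistent with what the comments describe: PHP's DOM parser only sees the markup the server sends and never runs the page's scripts. Here is a minimal sketch of that effect. The inline HTML is a made-up stand-in for the real page (the ids `static-1`, `static-2` and `dynamic-1` are assumptions for illustration), and `libxml_use_internal_errors` is used instead of `@` to silence the warnings libxml may raise for HTML5 tags such as `article`.

```php
<?php
// Hypothetical page: two <article> elements in the markup, plus a script
// that would append a third one when run in a real browser.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<body>
<div id="listNews">
  <article id="static-1">First</article>
  <article id="static-2">Second</article>
</div>
<script>
  var extra = document.createElement('article');
  extra.id = 'dynamic-1';
  document.getElementById('listNews').appendChild(extra);
</script>
</body>
</html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// DOMDocument does not execute the <script>, so only the static
// articles are present in the parsed tree.
$xpath = new DOMXPath($dom);
$articles = $xpath->query('//article');

$ids = [];
foreach ($articles as $article) {
    $ids[] = $article->getAttribute('id');
}

print count($ids) . " articles: " . implode(', ', $ids) . "\n";
```

A browser's inspector would show three articles for this markup, while the script above reports only the two static ones, which mirrors the 30 versus 14 split on the real page.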

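【Discussion】:

Thanks a lot, but I'm not sure it will work in the environment where I want to deploy it. Either way, it's a good guide for me to keep going.
You're welcome. Also note that before you try the script above you have to install phantomjs and casperjs: npm install -g phantomjs, npm install -g casperjs.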
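
Before installing a headless browser, it may be worth confirming that the missing elements really are absent from the raw response rather than being dropped by the parser. The sketch below compares a plain-text count of opening `<article` tags with the number of elements the DOM finds. The helper name `countArticleTags` and the sample string are assumptions for illustration; in practice `$html` would be the string returned by `file_get_contents` or `curl_exec`.

```php
<?php
/**
 * Hypothetical helper: count <article> tags in the raw text and in the
 * parsed DOM. Expects a non-empty HTML string.
 */
function countArticleTags(string $html): array
{
    // Plain-text count of opening tags; this does not depend on the parser.
    $raw = preg_match_all('/<article\b/i', $html);

    // Count the same elements after parsing, with warnings collected quietly.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $parsed = $dom->getElementsByTagName('article')->length;

    return ['raw' => $raw, 'parsed' => $parsed];
}

// Made-up sample standing in for a fetched page.
$sample = '<html><body>'
    . '<article id="a">A</article>'
    . '<article id="b">B</article>'
    . '</body></html>';

$counts = countArticleTags($sample);
print 'raw: ' . $counts['raw'] . ', parsed: ' . $counts['parsed'] . "\n";
```

If both numbers match and are still lower than what the browser's inspector shows, the extra elements are most likely added by JavaScript after the page loads, and a headless browser like the one in the answer is a reasonable next step.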