PHP Simple HTML DOM Parser: find text inside tags that have no class or id

【Question title】: PHP Simple HTML DOM Parser, find text inside tags that have no class nor id
【Posted】: 2013-06-15 01:30:40
【Question description】:

I have this page: http://www.statistics.com/index.php?page=glossary&term_id=703

Specifically, I am interested in this part:

<b>Additive Error:</b>
<p> Additive error is the error that is added to the true value and does not 
depend on the true value itself. In other words, the result of the measurement is 
considered as a sum of the true value and the additive error:   </p>

I am trying to get the text between the <p> and </p> tags with this:

include('simple_html_dom.php');
$url = 'http://www.statistics.com/index.php?page=glossary&term_id=703';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
$html = new simple_html_dom();
$html->load($curl_scraped_page);

foreach ( $html->find('b') as $e ) {
    echo $e->innertext . '<br>';
}

It gives me:

Additive Error:
Browse Other Glossary Entries

I tried changing the foreach to: foreach ( $html->find('b p') as $e ) {

and then to: foreach ( $html->find('/b p') as $e ) {

Both versions just give me a blank page. What am I doing wrong? Thanks.

【Question discussion】:

Tags: php html dom html-parsing

【Solution 1】:

Why not use PHP's built-in DOM extension and XPath?

libxml_use_internal_errors(true);  // <- you might need this if the page has markup errors
$dom = new DOMDocument();
$dom->loadHTML($curl_scraped_page);
$xpath = new DOMXPath($dom);
print $xpath->evaluate('string(//p[preceding::b]/text())');
//                             ^
//  this will get you text content from <p> tags preceded by <b> tags

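If there are multiple <p> tags preceded by a <b> tag and you only want the first one, adjust the XPath query to:

string((//p[preceding::b]/text())[1])

To get them all as a DOMNodeList, omit the string() function: //p[preceding::b]/text(). You can then iterate over the list and read each node's textContent property.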
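The query variants above can be tried on a small inline document. This is only a sketch: the markup in $page is a hypothetical stand-in for the fetched page, and the real page may be structured differently.

```php
<?php
// Hypothetical stand-in for the fetched page; the real markup may differ.
$page = <<<HTML
<html><body>
<b>Additive Error:</b>
<p>Additive error is added to the true value.</p>
<b>Browse Other Glossary Entries</b>
<p>Second paragraph.</p>
</body></html>
HTML;

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);

// string() yields the string value of the first node of the node-set,
// here the first text node of the first <p> that has a <b> before it.
$first = $xpath->evaluate('string((//p[preceding::b]/text())[1])');

// Without string(), evaluate() returns a DOMNodeList of all matching text nodes.
$all = [];
foreach ($xpath->evaluate('//p[preceding::b]/text()') as $node) {
    $all[] = trim($node->textContent);
}

echo $first, PHP_EOL;
echo implode(' | ', $all), PHP_EOL;
```

With this sample markup both paragraphs match the node-list query, which is why the [1] form is the safer choice when only one paragraph is wanted.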

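【Discussion】:

Oh my God, you saved my life! Thank you very much... thanks again.
Hey, I have one more question. I want to parse some other pages, but I read that we cannot create a new object before deleting the previous one. My question is: how do I delete that $dom before creating a simple_html_dom object? Thanks.
By assigning a new object to the variable, e.g. $dom = new DomDocument() ... but why use "simple_html_dom" at all instead of using DomDocument directly?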
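As a rough illustration of that reply, the sketch below parses several pages in a row and simply creates a fresh DOMDocument for each one; PHP releases the previous object once nothing refers to it. The helper name firstParagraphAfterBold and the sample markup are hypothetical, and simple_html_dom is not used here.

```php
<?php
// Hypothetical helper: returns the text of the first <p> that has a <b> before it.
function firstParagraphAfterBold(string $html): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();   // a fresh object for every page
    $dom->loadHTML($html);
    libxml_clear_errors();      // discard warnings collected while parsing
    $xpath = new DOMXPath($dom);
    return trim($xpath->evaluate('string((//p[preceding::b])[1])'));
}

// Hypothetical page bodies; in practice these would come from curl_exec().
$pages = [
    '<html><body><b>Term A:</b><p>Definition A.</p></body></html>',
    '<html><body><b>Term B:</b><p>Definition B.</p></body></html>',
];

$results = [];
foreach ($pages as $page) {
    $results[] = firstParagraphAfterBold($page);
}
print_r($results);
```

Keeping the parsing inside a function means each DOMDocument goes out of scope when the call returns, so no explicit cleanup step is needed in this sketch.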

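【Solution 2】:

If you want everything inside the b or p tags, you can simply do foreach ($html->find('b,p') as $e) { ... }.

【Discussion】:

No, I only want the text inside the p tag above, just that one. What should I do?
If you only want that one, I suspect you may be out of luck. I would help you, but I don't know how.
Yes, you're right, I'm stuck. :( I have been working on this for a long time, but my code keeps failing. Do you think it is possible?
Probably, but I don't know how. With One Trick Pony's (excellent) solution you can look for p tags that are preceded by a b tag, but you always run the risk of getting more than one paragraph back.
Yes, I just looked at it. Thank God, and thank you too. :) Luckily, each link with a different term_id has only one b tag followed by one p tag. So it should be fine, right?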
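One way to reduce the risk of getting several paragraphs back is to ask for the first <p> sibling of the <b> rather than for every <p> that comes after it. The sketch below assumes the <b> and <p> are siblings, as in the snippet from the question; the markup in $page is hypothetical. It also shows why a descendant selector such as 'b p' finds nothing: the <p> is not nested inside the <b>.

```php
<?php
// Hypothetical markup modeled on the question's snippet, plus an unrelated later paragraph.
$page = <<<HTML
<html><body><div>
<b>Additive Error:</b>
<p>Additive error is added to the true value.</p>
<p>Unrelated footer paragraph.</p>
</div></body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);

// Descendant lookup, the XPath counterpart of the CSS selector 'b p': no match here.
$nested = $xpath->query('//b//p')->length;

// First <p> sibling after the first <b>; later paragraphs are ignored.
$nodes = $xpath->query('(//b)[1]/following-sibling::p[1]');
$text = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : '';

// For comparison: preceding::b matches every paragraph that comes after the <b>.
$broad = $xpath->query('//p[preceding::b]')->length;

echo $nested, ' ', $broad, ' ', $text, PHP_EOL;
```

If the real page wraps the two tags in different parent elements, the sibling axis will not match and the preceding::b form with a [1] index is the fallback.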

【Solution 3】:

Try this:

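<?php
$dom = new DOMDocument();
// @ suppresses the warnings raised by malformed markup on the page
@$dom->loadHTMLFile('http://www.statistics.com/index.php?page=glossary&term_id=703');
$xpath = new DOMXPath($dom);

$mytext = '';
// Use the first <p> found inside a <font> element; skip <font> elements without one
foreach ($xpath->query('//font') as $font) {
    $p = $xpath->query('.//p', $font)->item(0);
    if ($p !== null) {
        $mytext = $p->nodeValue;
        break;
    }
}

echo $mytext;
?>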

【Discussion】:

I only want the text inside that p tag above, just that one. What should I do?