First you need to clean up the HTML with tidy ($html_str in the example):
$tidy_config = array(
"indent" => true,
"output-xml" => true,
"output-xhtml" => false,
"drop-empty-paras" => false,
"hide-comments" => true,
"numeric-entities" => true,
"doctype" => "omit",
"char-encoding" => "utf8",
"repeated-attributes" => "keep-last"
);
$xml_str = tidy_repair_string($html_str, $tidy_config);
Then you can load the XML ($xml_str) into a DOMDocument:
// loadXML() is an instance method; calling it statically throws an Error in PHP 8.
$doc = new DOMDocument();
$doc->loadXML($xml_str);
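If the string is not well-formed XML, loadXML() returns false and emits warnings. The following is a minimal sketch of checking for that case, assuming a deliberately malformed sample string in place of real tidy output; the variable names $loaded and $messages are only for illustration.

```php
<?php
// Hypothetical malformed input (the <h1> is never closed), used only to trigger a failure.
$xml_str = '<html><body><h1>Unclosed</body></html>';

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadXML($xml_str);

$messages = array();
if ($loaded === false) {
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
    }
    libxml_clear_errors();
}

print($loaded ? "loaded\n" : "failed: " . implode('; ', $messages) . "\n");
```

With tidy's "output-xml" option the string should normally be well-formed, so this check mainly guards against empty or truncated input.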
Finally you can use Horia Dragomir's method:
$list = $doc->getElementsByTagName("h1");
for ($i = 0; $i < $list->length; $i++) {
print($list->item($i)->nodeValue . "<br/>\n");
}
Or you can use XPath for more complex queries on the DOMDocument (see http://www.php.net/manual/en/class.domxpath.php):
$xpath = new DOMXPath($doc);
$list = $xpath->evaluate("//h1");
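As a rough illustration of a "more complex" query, the sketch below selects only the h1 elements that carry a particular class attribute and then iterates over the result. The sample markup and the class name "title" are assumptions made up for this example, standing in for the tidy output.

```php
<?php
// Hypothetical sample input standing in for the tidy output ($xml_str).
$xml_str = '<html><body>'
    . '<h1 class="title">First</h1>'
    . '<h1>Second</h1>'
    . '<div><h1 class="title">Third</h1></div>'
    . '</body></html>';

$doc = new DOMDocument();
$doc->loadXML($xml_str);

$xpath = new DOMXPath($doc);

// Select every <h1> whose class attribute is exactly "title".
$list = $xpath->query('//h1[@class="title"]');

// DOMNodeList is traversable, so foreach works as well as item($i).
$titles = array();
foreach ($list as $node) {
    $titles[] = $node->nodeValue;
}

print(implode("\n", $titles) . "\n");
// prints:
// First
// Third
```

query() is used here instead of evaluate() because the expression always yields a node list; evaluate() may also return a typed scalar for expressions such as count(//h1).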