Error with the PHP simple_html_dom parser: answers

【Question title】: Error with php simple_html_dom parser
【Posted】: 2018-01-04 11:43:28
【Question】:

Hello PHP experts,

I ran into an error while using the simple_html_dom class.

The HTML string I want to parse looks like this:

<!DOCTYPE html>
<html lang="en">
<head>
<title>Y-shaped ZnO Nanobelts Driven from Twinned</title>

<meta name="site" content="Reports"/>

<meta name="description" content="Description with twinned planes {11&#"/>

<meta name="image" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a"/>


...


</body>
</html>

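I tried to fetch the meta tag named image with find("meta[name=image]"), but it did not work.

I looked into the cause and found that it is the '&#' character sequence in the middle of this line of the HTML above: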

<meta name="description" content="Description with twinned planes {11&#"/>

Reading the content attribute of that meta tag returns:

 Description with twinned planes {11&#"/>   <meta name="image" ....

So in this case, what should I do to make simple_html_dom parse the HTML correctly?

Failing that, is there another library that can parse this HTML correctly?

【Comments】:

Shouldn't {11&# be written as {11&amp;#? Isn't that the problem?
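
If the unterminated character reference '&#' is indeed what confuses the parser, one possible workaround is to escape stray ampersands before parsing. The following is only a minimal sketch: escapeStrayAmpersands is a hypothetical helper (it is not part of simple_html_dom), and the regular expression assumes that a valid reference is either a named entity such as &amp; or a numeric one such as &#123; or &#x7B;, terminated by a semicolon.

```php
<?php
// Hypothetical helper, not part of simple_html_dom: escape every "&" that
// does not start a well-formed, semicolon-terminated character reference.
function escapeStrayAmpersands(string $html): string
{
    $escaped = preg_replace(
        '/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/',
        '&amp;',
        $html
    );

    // preg_replace returns null on a regex error; fall back to the input.
    return $escaped ?? $html;
}

// The problematic line from the question.
$meta = '<meta name="description" content="Description with twinned planes {11&#"/>';

echo escapeStrayAmpersands($meta), "\n";
```

For the line from the question this turns {11&# into {11&amp;#. The escaped string could then be handed to simple_html_dom as usual; whether that fully resolves the parsing problem has not been verified here.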

Tags: php html dom html-parsing

【Solution 1】:

Try this code, which uses PHP's DOMDocument.

You can use getElementsByTagName to get the meta elements and getAttribute to read an attribute value.

$hml = '<!DOCTYPE html>
<html lang="en">
<head>
<title>Y-shaped ZnO Nanobelts Driven from Twinned</title>

<meta name="site" content="Reports"/>

<meta name="description" content="Description with twinned planes {11&#"/>

<meta name="image" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a"/>
</head>
<body>

</body>
</html>';

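// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($hml);

// Walk every <meta> element and print the content of the one named "image".
$metas = $dom->getElementsByTagName('meta');

foreach ($metas as $meta) {
    if ($meta->getAttribute('name') === 'image') {
        echo $meta->getAttribute('content');
    }
}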

Output:

https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a

Note: if you are loading the content from a page, use $dom->loadHTMLFile("your_pagename.html"); instead of $dom->loadHTML($hml);
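
As a variation on the loop above, the same lookup can be written as a single XPath query with DOMXPath, which ships with PHP's DOM extension. This is a sketch rather than part of the original answer: the variable names are illustrative and the HTML is a shortened copy of the string from the question.

```php
<?php
// Shortened copy of the HTML from the question, including the stray "&#".
$html = '<!DOCTYPE html><html lang="en"><head>
<meta name="description" content="Description with twinned planes {11&#"/>
<meta name="image" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a"/>
</head><body></body></html>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// string(...) yields the attribute value, or "" when no element matches.
$image = $xpath->evaluate('string(//meta[@name="image"]/@content)');

echo $image, "\n";
```

The XPath form avoids iterating over every meta element by hand, which may be convenient when several different attributes need to be looked up.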

【Comments】: