Using Goutte to extract a namespaced attribute value: answers

【Question title】: Using Goutte to extract a namespaced attribute value
【Posted】: 2018-06-22 10:20:47
【Question】:

I am trying to read the attributes of a web page's <html> element to get the language declared by the site owner.

On 99% of the sites I checked, this information is written as <html lang="XX"> or <html lang="XX-YY">, but on one particular site it is written as <html xml:lang="XX">, and that last case is giving me a headache.

I have tried:

$scraper_client = new \Goutte\Client();
$scraper_crawler = $scraper_client->request('GET', $link);
$response = $scraper_client->getResponse();

// Four spellings of the attribute name, each passed to extract() as a list.
var_dump($scraper_crawler->filter('html')->extract(['xml:lang']));
var_dump($scraper_crawler->filter('html')->extract(['xml|lang']));
var_dump($scraper_crawler->filter('html')->extract(['xml::lang']));
var_dump($scraper_crawler->filter('html')->extract(['@[xml:lang]']));

But none of them seem to work. Has anyone already done something similar? Thanks in advance. S.

EDIT

To complete the question, here is a link to a page with the xml:lang attribute that is causing the problem:

http://www.ilgiornale.it/news/politica/silvio-berlusconi-centrodestra-oggi-pi-forte-passato-1482545.html

【Question comments】:

Tags: php goutte

【Solution 1】:

I don't know why, but it looks as if Goutte strips this attribute.

I was only able to get the value with a regular expression:

$scraper_client = new \Goutte\Client();
$scraper_crawler = $scraper_client->request('GET', $link);
$response = $scraper_client->getResponse();

// Match against the raw response body rather than the Response object itself.
if (preg_match('/xml:lang=["\'](.*?)["\']/', $response->getContent(), $matches)) {
    var_dump($matches[1]); // the declared language, e.g. "XX"
} else {
    echo 'not found';
}
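
The regex idea above can be extended to cover both spellings from the question (lang and xml:lang) in one place. The following is only a minimal sketch, not part of Goutte: the helper name extractDeclaredLanguage is hypothetical, it works on a plain HTML string, and it assumes the opening <html> tag contains no literal ">" inside an attribute value.

```php
<?php
// Hypothetical helper (not a Goutte API): returns the language declared on the
// <html> tag, preferring lang over xml:lang, or null if neither is present.
function extractDeclaredLanguage(string $html): ?string
{
    // Isolate the opening <html ...> tag so attributes elsewhere are ignored.
    // Assumption: no attribute value in this tag contains a literal ">".
    if (!preg_match('/<html\b[^>]*>/i', $html, $tag)) {
        return null;
    }

    // Try the plain attribute first, then the namespaced one.
    foreach (['lang', 'xml:lang'] as $name) {
        // The lookbehind keeps "lang" from matching inside "xml:lang" or "data-lang".
        $pattern = '/(?<![\w:-])' . preg_quote($name, '/') . '\s*=\s*(["\'])(.*?)\1/i';
        if (preg_match($pattern, $tag[0], $m)) {
            return $m[2];
        }
    }

    return null;
}
```

With the variables from the snippet above, it could be called as extractDeclaredLanguage($response->getContent()). Because it never touches the parsed DOM, it does not depend on how the crawler treats the xml: prefix.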

【Discussion】:

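Thanks for your help; unfortunately, none of them work. The first returns an array with the whole page in [0] and null in [1]; the other two always return null :)
Have you tried it without the lang namespace? Only the xml attribute.
Yes, thanks, same result. I know I had to use XPath filtering when trying to read the og:XXX meta tags, and in the end I figured out how to do that... but with this one I really don't understand how to do it :)
I was only able to get the xml:lang value with a regular expression. I know it isn't what you asked for, but I haven't found another way, so I'd rather at least share this. See my updated answer.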