Why does my simple-html-dom allow e.g. 'ä' for Wikipedia but not for Wikisource? Answers

【Question title】: Why does my simple-html-dom allow e.g. 'ä' for Wikipedia but not for Wikisource?
【Posted】: 2012-07-18 10:21:02
【Question】:

My problem is that the following script works for some IRIs but not for others. My question is why it behaves this way and how to fix it. I suspect a character-set problem, but that is only a guess, since it works with Wikipedia.

<?php
include('C:\xampp\htdocs\php\simple_html_dom.php');
$html = file_get_html('http://de.wikisource.org/wiki/Am_B%C3%A4chle');
//Titel
foreach($html->find('span#ws-title') as $f)
echo $f->plaintext;

//1   http://de.wikisource.org/wiki/7._August_1929           OK
//2   http://de.wikisource.org/wiki/%E2%80%99s_ist_Krieg!    -
//3   http://de.wikisource.org/wiki/Am_B%C3%A4chle           -
//4   http://de.wikipedia.org/wiki/Guillaume-Aff%C3%A4re     OK
//5   http://de.wikisource.org/wiki/Solidit%C3%A4t           -
?>

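These 5 IRIs are the examples. The last 3 IRIs contain %C3%A4, which is an "ä", but only the one from Wikipedia works. The 2nd IRI contains %E2%80%99, which is a "’", and it does not work.

The first IRI, from Wikisource, does work, though. The same holds for every Wikisource IRI that does not contain ä, ö, and so on.

When it does not work, I get the following warning:

Warning: file_get_contents(http://de.wikisource.org/wiki/Solidit%C3%A4t): failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in C:\xampp\htdocs\php\simple_html_dom.php on line 70

Fatal error: Call to a member function find() on a non-object in C:\xampp\htdocs\php\frage.php on line 5

The function in simple_html_dom.php that contains line 70 looks like this: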

//65    function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT)
//66    {
//67    // We DO force the tags to be terminated.
//68    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $defaultBRText);
//69    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
//70    $contents = file_get_contents($url, $use_include_path, $context, $offset);
//71    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
//72    //    $contents = retrieve_url_contents($url);
//73    if (empty($contents))
//74    {
//75        return false;
//76    }
//77    // The second parameter can force the selectors to all be lowercase.
//78    $dom->load($contents, $lowercase, $stripRN);
//79    return $dom;
//80    }

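Is there any way to make the script work for every IRI in Wikipedia or Wikisource? (I know there is not always a span#ws-title; that is not my question.)

【Comments】:

FYI: these are not IRIs, they are just normal URLs with encoded characters.
OK, but aren't IRIs URLs that allow internationalized web addressing? Isn't that the case when you use ä, ö, ß, which are common in German?
My understanding is that an IRI has actual UTF-8 characters in it (i.e. http://example.com/fööbär), rather than encoded, ASCII-only characters (i.e. http://example.com/f%F6%F6b%E4r). (Not an expert, so hopefully someone with more knowledge can chime in.)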
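
The difference between the two percent-encoded forms discussed in these comments can be checked with `rawurlencode`, which simply percent-encodes whatever bytes the string holds. This is a small sketch with the bytes written out explicitly, so the result does not depend on how the source file is saved; the variable names are illustrative only.

```php
<?php
// "ä" is the two bytes C3 A4 in UTF-8, but the single byte E4 in ISO-8859-1.
$utf8   = "Am_B\xC3\xA4chle";
$latin1 = "Am_B\xE4chle";

echo rawurlencode($utf8), "\n";   // Am_B%C3%A4chle
echo rawurlencode($latin1), "\n"; // Am_B%E4chle
```

The Wikisource and Wikipedia URLs in the question use the UTF-8 form (%C3%A4), whereas the %F6 and %E4 escapes in the comment above correspond to ISO-8859-1 bytes.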

Tags: php parsing character-encoding wikipedia simple-html-dom

【Solution 1】:

Awesome question! :)

They seem to filter by user agent; try something like

<?php
ini_set("user_agent", "Descriptive user agent string");
file_get_contents("http://de.wikisource.org/wiki/".urlencode("Am_Bächle"));
?>

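You can skip the urlencode part; I only used it to test whether the encoding was correct.

Note that Wikisource apparently does not like web pages being parsed automatically. Still, there may be an API available for wikibots and the like; ask them or search the community pages. An API will be easier to handle anyway.

【Comments】:

Thank you very much for the quick and helpful answer. It works for me :)
I will ask the community about parsing, but I don't understand why I can parse some pages and not others... I'll ask them :)
You shouldn't emulate a browser by spoofing its user agent. You should use a descriptive user agent instead.
@svick: You shouldn't do this at all. Wikisource bans certain user agents for a reason. I only used a different user agent to show that it is possible. That is why I wrote the last paragraph of my answer.

【Solution 2】:

The problem has nothing to do with characters or encoding. You get the 403 because of the Wikimedia User-Agent policy, which says:

Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice.

That is what you should do: set the User-Agent header to something that identifies your application and lets you be contacted if something goes wrong.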
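
One way to set such a header is a stream context. This is a minimal sketch, assuming the `$context` parameter of `file_get_html` is handed through to `file_get_contents` as in the function quoted in the question. `build_wikimedia_context` is a hypothetical helper name, and the application name, URL and e-mail address in the user-agent string are placeholders to replace with your own.

```php
<?php
// Hypothetical helper: builds a stream context whose HTTP requests carry
// a descriptive User-Agent string.
function build_wikimedia_context($userAgent)
{
    return stream_context_create(array(
        'http' => array(
            'user_agent' => $userAgent,
            'timeout'    => 10, // seconds; an arbitrary choice
        ),
    ));
}

// Placeholder identity: application name plus a way to reach the author.
$userAgent = 'MyTitleScraper/0.1 (http://example.com/contact; me@example.com)';
$context   = build_wikimedia_context($userAgent);

// With simple_html_dom.php included, the context is the third argument:
// $html = file_get_html('http://de.wikisource.org/wiki/Am_B%C3%A4chle', false, $context);
// if ($html === false) { die('Request failed'); } // avoids calling find() on a non-object
// foreach ($html->find('span#ws-title') as $f) { echo $f->plaintext; }
```

Checking the return value for `false` before calling `find()` also avoids the fatal error from the question when a request is refused.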

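Having said that, accessing the pages directly is probably the worst way to get the data you want. You should use the API instead or, if you want to access a large number of pages, the database dumps.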
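
As a rough sketch of the API route, the request URL for a page's rendered HTML could be built as below. It assumes the standard MediaWiki `api.php` endpoint with `action=parse`; `build_parse_url` is a hypothetical helper, and the user-agent string in the comments is a placeholder.

```php
<?php
// Hypothetical helper: builds a MediaWiki API URL asking for the parsed
// HTML of one page as JSON. http_build_query() percent-encodes the title.
function build_parse_url($host, $title)
{
    $params = array(
        'action' => 'parse',
        'page'   => $title,
        'prop'   => 'text',
        'format' => 'json',
    );
    // The explicit '&' separator keeps the output independent of php.ini settings.
    return 'https://' . $host . '/w/api.php?' . http_build_query($params, '', '&');
}

$url = build_parse_url('de.wikisource.org', "Am_B\xC3\xA4chle"); // UTF-8 bytes for "ä"
echo $url, "\n";

// Fetching would then look roughly like this (not run here, needs network access):
// ini_set('user_agent', 'MyTitleScraper/0.1 (me@example.com)');
// $data = json_decode(file_get_contents($url), true);
// $html = $data['parse']['text']['*']; // layout of the default JSON format
```

The returned HTML can still be passed to simple-html-dom for the `span#ws-title` lookup if needed.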

【Comments】:

Thanks! I just downloaded the files I needed :)