Robotparser doesn't seem to parse correctly

【Question title】: Robotparser doesn't seem to parse correctly
【Posted】: 2013-03-11 16:58:42
【Question】:

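I'm writing a crawler, and as part of it I'm implementing robots.txt handling with the standard library's robotparser.

robotparser doesn't seem to parse correctly. I'm debugging my crawler against Google's robots.txt.

(The following examples are from IPython.)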

In [1]: import robotparser

In [2]: x = robotparser.RobotFileParser()

In [3]: x.set_url("http://www.google.com/robots.txt")

In [4]: x.read()

In [5]: x.can_fetch("My_Crawler", "/catalogs") # This should return False, since it's on Disallow
Out[5]: False

In [6]: x.can_fetch("My_Crawler", "/catalogs/p?") # This should return True, since it's Allowed
Out[6]: False

In [7]: x.can_fetch("My_Crawler", "http://www.google.com/catalogs/p?")
Out[7]: False

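This is interesting, because sometimes it seems to "work" and sometimes it seems to fail. I also tried the robots.txt files from Facebook and Stack Overflow. Is this a bug in the robotparser module? Or am I doing something wrong here? If so, what?

I was wondering whether this bug is related in any way.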

【Comments】:

I'm also using Python 2.7.3 on a Linux machine (Arch Linux).

Tags: python python-2.7 web-crawler robots.txt

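【Answer 1】:

This isn't a bug, but rather a difference in interpretation. According to the draft robots.txt specification (which was never approved, nor is it likely to be):

To evaluate if access to a URL is allowed, a robot must attempt to match the paths in Allow and Disallow lines against the URL, in the order they occur in the record. The first match found is used. If no match is found, the default assumption is that the URL is allowed.

(Section 3.2.2, The Allow and Disallow Lines)

Using that interpretation, "/catalogs/p?" should be denied, because there is a "Disallow: /catalogs" directive earlier in the record.

At some point, Google started interpreting robots.txt differently from that specification. Their method appears to be: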

Check for Allow. If it matches, crawl the page.
Check for Disallow. If it matches, don't crawl.
Otherwise, crawl.

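The problem is that there is no formal agreement on how to interpret robots.txt. I've seen crawlers that use the Google method and others that use the draft standard from 1996. When I was operating a crawler, I got nastygrams from webmasters when I used the Google interpretation, because I crawled pages they thought shouldn't be crawled, and I got nastygrams from others when I used the other interpretation, because pages they thought should be indexed weren't.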
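To make the difference concrete, here is a minimal sketch of the two readings described above. It is not robotparser's or Google's actual code: the `RULES` list is a hypothetical two-line excerpt, the function names are made up for illustration, and matching is a plain prefix test with no wildcard or URL normalization.

```python
# Hypothetical excerpt of a record, in file order: (kind, path prefix).
RULES = [
    ("disallow", "/catalogs"),
    ("allow", "/catalogs/p?"),
]


def first_match_allowed(path, rules):
    """1996 draft reading: the first matching rule in file order decides."""
    for kind, prefix in rules:
        if path.startswith(prefix):
            return kind == "allow"
    return True  # no rule matched, so the URL is allowed by default


def allow_first_allowed(path, rules):
    """Reading attributed to Google in this answer: any Allow match wins."""
    if any(kind == "allow" and path.startswith(prefix) for kind, prefix in rules):
        return True
    if any(kind == "disallow" and path.startswith(prefix) for kind, prefix in rules):
        return False
    return True


for path in ("/catalogs", "/catalogs/p?", "/other"):
    print(path, first_match_allowed(path, RULES), allow_first_allowed(path, RULES))
```

The two functions agree on "/catalogs" and on unrelated paths, and disagree only on "/catalogs/p?", which is exactly the case from the question.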

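【Comments】:

【Answer 2】:

After a few Google searches I didn't find anything about the robotparser issue. I ended up with something else: I found a module called reppy, ran some tests on it, and it seems very robust. It can be installed via pip: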

pip install reppy

Here are a few examples using reppy (in IPython), again with Google's robots.txt:

In [1]: import reppy

In [2]: x = reppy.fetch("http://google.com/robots.txt")

In [3]: x.atts
Out[3]: 
{'agents': {'*': <reppy.agent at 0x1fd9610>},
 'sitemaps': ['http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml',
  'http://www.google.com/hostednews/sitemap_index.xml',
  'http://www.google.com/sitemaps_webmasters.xml',
  'http://www.google.com/ventures/sitemap_ventures.xml',
  'http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml',
  'http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml',
  'http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
  'http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml']}

In [4]: x.allowed("/catalogs/about", "My_crawler") # Should return True, since it's allowed.
Out[4]: True

In [5]: x.allowed("/catalogs", "My_crawler") # Should return False, since it's not allowed.
Out[5]: False

In [7]: x.allowed("/catalogs/p?", "My_crawler") # Should return True, since it's allowed.
Out[7]: True

In [8]: x.refresh() # Refresh robots.txt, perhaps a magic change?

In [9]: x.ttl
Out[9]: 3721.3556718826294

In [10]: # It also has a x.disallowed function. The contrary of x.allowed

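【Comments】:

【Answer 3】:

Interesting question. I looked at the source (I only have the Python 2.4 source available, but I bet it hasn't changed), and the code normalizes the URL being tested by executing: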

urllib.quote(urlparse.urlparse(urllib.unquote(url))[2])

This is the source of your problem:

>>> urllib.quote(urlparse.urlparse(urllib.unquote("/foo"))[2]) 
'/foo'
>>> urllib.quote(urlparse.urlparse(urllib.unquote("/foo?"))[2]) 
'/foo'

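So it's either a bug in the Python library, or Google is breaking the robots.txt specification by including a "?" character in a rule (which is a bit unusual).

[Just in case it isn't clear, I'll say it again in a different way. The code above is used by the robotparser library as part of checking the URL. So when the URL ends in a "?", that character is dropped. When you checked /catalogs/p?, the test actually executed was for /catalogs/p. Hence your surprising result.]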
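For readers on Python 3, the same effect can be reproduced with `urllib.parse`. This is a sketch that ports only the expression quoted above; it is not the current standard-library implementation, and `normalize` is a name introduced here for illustration.

```python
from urllib.parse import quote, unquote, urlparse


def normalize(url):
    """Python 3 port of the quoted expression: keep only the path component."""
    # Index 2 of the parse result is the path; the query string,
    # including a bare trailing "?", is not part of it.
    return quote(urlparse(unquote(url))[2])


for url in ("/foo", "/foo?", "/catalogs/p", "/catalogs/p?"):
    print(repr(url), "->", repr(normalize(url)))
```

Because "/catalogs/p" and "/catalogs/p?" normalize to the same string, a check based on this expression cannot tell them apart, so a rule that ends in "?" can never be matched specifically.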

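I'd suggest filing a bug with the Python folks (you can post a link here as part of the explanation) [edit: thanks]. Then use the other library you found...

【Comments】:

Thanks! You're right. I ran the same check on the library I found, and unfortunately it does the same thing. It works better than robotparser, but it has the same problem. I reported the bug -> bugs.python.org/issue17403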

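【Answer 4】:

About a week ago we merged a commit that contained the bug causing this problem. We've just pushed version 0.2.2 to pip and to master in the repo, including a regression test for this issue.

Version 0.2 includes a slight interface change: you must now create a RobotsCache object, which has the exact interface that reppy originally had. This is mostly to make caching explicit and to make it possible to have different caches within the same process. But behold, it now works again!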

from reppy.cache import RobotsCache
cache = RobotsCache()
cache.allowed('http://www.google.com/catalogs', 'foo')
cache.allowed('http://www.google.com/catalogs/p', 'foo')
cache.allowed('http://www.google.com/catalogs/p?', 'foo')

【Comments】:

Thanks! That's awesome! +10 to Reppy: I filed a quick issue and it was fixed in under 24 hours! Thanks again!