Parsing robot.txt using Java and identifying whether a URL is allowed

【Question title】: Parsing robot.txt using java and identify whether an url is allowed
【Posted】: 2013-10-12 10:07:26
【Question description】:

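I am currently using jsoup in an application to parse and analyze web pages. But I want to make sure that I adhere to the robots.txt rules and only visit pages that are allowed.

I am pretty sure that jsoup is not made for this; it is all about web scraping and parsing. So I plan to have a function/module that reads the robots.txt of the domain/site and identifies whether the URL I am about to visit is allowed.

I did some research and found the following. But I am not sure about these, so if anyone has done a similar project involving robots.txt parsing, please share your thoughts and ideas.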

http://sourceforge.net/projects/jrobotx/

https://code.google.com/p/crawler-commons/

http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12

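【Question comments】:

What exactly is the question? Parsing robots.txt seems a bit out of scope for Jsoup. Jsoup is for parsing web pages, as you said yourself.
Thanks, yes, I am using jsoup to parse the pages, but the requirement is to parse only the URLs that are allowed (not restricted) in robots.txt. For this validation JSoup seems not to be the best fit, or not capable. So what I need to know is how to validate against robots.txt before doing the actual parsing.
OK, that's fine. I was looking for a small project that uses jsoup, so that I could do it myself.
@alkis Do you have any ideas?

Tags: java web-scraping jsoup crawler4j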

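【Solution 1】:

A late answer, in case you, or someone else, are still looking for a way to do this. I am using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified example from the code I use: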

// `httpclient` is an existing Apache HttpClient instance (its creation is not shown in the original).
String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
// Key for the rules cache: scheme://host[:port]
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
                + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// In real code this map should be a longer-lived field so that each host's
// robots.txt is fetched only once; it is declared inline here for brevity.
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        // no robots.txt on this host: everything is allowed
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // consume entity to deallocate connection
        EntityUtils.consumeQuietly(response.getEntity());
    } else {
        // any other status is treated as a robots.txt body and handed to the parser
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);

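Obviously this is not related to Jsoup in any way; it just checks whether a given URL is allowed to be crawled for a certain USER_AGENT. For fetching the robots.txt I use Apache HttpClient in version 4.2.1, but this could be replaced by java.net classes as well.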
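The java.net replacement mentioned above could look roughly like the following. This is a minimal sketch, not the answerer's code: the class `RobotsFetcher`, its method names, and the timeout values are all assumptions made for illustration. It mirrors the behaviour of the snippet above by treating a 404 as "no robots.txt".

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper; not part of crawler-commons or Jsoup.
final class RobotsFetcher {

    // Builds "scheme://host[:port]/robots.txt" for the site that serves pageUrl.
    static String robotsUrlFor(String pageUrl) {
        try {
            URL u = new URL(pageUrl);
            String port = u.getPort() > -1 ? ":" + u.getPort() : "";
            return u.getProtocol() + "://" + u.getHost() + port + "/robots.txt";
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + pageUrl, e);
        }
    }

    // Returns the raw robots.txt bytes, or null when the server answers 404.
    // Other error statuses surface as an IOException from getInputStream().
    static byte[] fetch(String robotsUrl, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(robotsUrl).openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setConnectTimeout(5000); // assumed timeouts, tune as needed
        conn.setReadTimeout(5000);
        try {
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_FOUND) {
                return null;
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

With this in place, a null result would map to the ALLOW_ALL rules, and a non-null byte array could be passed to parseContent instead of the IOUtils.toByteArray(...) call.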

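Please note that this code only checks for allowance or disallowance and does not consider other robots.txt features like "Crawl-delay". But as crawler-commons provides this feature as well, it can easily be added to the code above.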
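Once a crawl delay is known for a host, it still has to be enforced between requests. The following is a small sketch of one way to do that, assuming a hypothetical `CrawlDelayThrottle` class that is not part of crawler-commons. How the delay is read from the parsed rules depends on the crawler-commons version in use, so it is passed in here as a plain number of milliseconds.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-host throttle; the class and method names are illustrative.
final class CrawlDelayThrottle {

    // Earliest time (epoch millis) at which the next request to a host may start.
    private final Map<String, Long> nextAllowedAt = new HashMap<>();

    // Reserves the next request slot for hostId and returns how many milliseconds
    // the caller should wait before sending that request.
    long reserve(String hostId, long crawlDelayMillis, long nowMillis) {
        long earliest = nextAllowedAt.getOrDefault(hostId, nowMillis);
        long start = Math.max(earliest, nowMillis);
        nextAllowedAt.put(hostId, start + Math.max(0, crawlDelayMillis));
        return start - nowMillis;
    }
}
```

A caller would typically invoke `reserve(hostId, delay, System.currentTimeMillis())`, sleep for the returned duration, and then issue the request. The same hostId string used as the rules cache key can be reused as the throttle key.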

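【Comments】:

【Solution 2】:

The above didn't work for me. I managed to put this together. It is the first time I have done Java in 4 years, so I am sure this can be improved.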

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;

// The members below go inside your crawler class.

// Not shown in the original answer; assumed to be the directive name.
private static final String DISALLOW = "Disallow";

public static boolean robotSafe(URL url) 
{
    String strHost = url.getHost();

    // note: the scheme is hard-coded to http and a non-default port is ignored
    String strRobot = "http://" + strHost + "/robots.txt";
    URL urlRobot;
    try { urlRobot = new URL(strRobot);
    } catch (MalformedURLException e) {
        // something weird is happening, so don't trust it
        return false;
    }

    String strCommands;
    try (InputStream urlRobotStream = urlRobot.openStream())
    {
        // read the whole file; read() returns -1 at end of stream,
        // possibly on the very first call if the file is empty
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] b = new byte[1000];
        int numRead;
        while ((numRead = urlRobotStream.read(b)) != -1) {
            buffer.write(b, 0, numRead);
        }
        strCommands = buffer.toString("UTF-8");
    } 
    catch (IOException e) 
    {
        return true; // if there is no robots.txt file, it is OK to search
    }

    if (strCommands.contains(DISALLOW)) // if there are no "disallow" values, then they are not blocking anything.
    {
        String[] split = strCommands.split("\n");
        ArrayList<RobotRule> robotRules = new ArrayList<>();
        String mostRecentUserAgent = null;
        for (int i = 0; i < split.length; i++) 
        {
            String line = split[i].trim();
            if (line.toLowerCase().startsWith("user-agent")) 
            {
                int start = line.indexOf(":") + 1;
                int end   = line.length();
                mostRecentUserAgent = line.substring(start, end).trim();
            }
            else if (line.startsWith(DISALLOW)) {
                if (mostRecentUserAgent != null) {
                    RobotRule r = new RobotRule();
                    r.userAgent = mostRecentUserAgent;
                    int start = line.indexOf(":") + 1;
                    int end   = line.length();
                    r.rule = line.substring(start, end).trim();
                    robotRules.add(r);
                }
            }
        }

        // note: the user agent is recorded but not checked here, so every
        // Disallow line is applied regardless of which crawler it targets
        String path = url.getPath();
        for (RobotRule robotRule : robotRules)
        {
            if (robotRule.rule.length() == 0) return true; // allows everything if BLANK
            if ("/".equals(robotRule.rule)) return false;  // allows nothing if /

            // plain prefix match; wildcards are not interpreted
            if (path.startsWith(robotRule.rule)) return false;
        }
    }
    return true;
}

You will need this helper class:

/**
 *
 * @author Namhost.com
 */
public class RobotRule 
{
    public String userAgent;
    public String rule;

    RobotRule() {

    }

    @Override public String toString() 
    {
        StringBuilder result = new StringBuilder();
        String NEW_LINE = System.getProperty("line.separator");
        result.append(this.getClass().getName() + " Object {" + NEW_LINE);
        result.append("   userAgent: " + this.userAgent + NEW_LINE);
        result.append("   rule: " + this.rule + NEW_LINE);
        result.append("}");
        return result.toString();
    }    
}
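
The method above stores a user agent with each rule but never consults it when matching. As a rough sketch of how rules could be grouped per user agent, reading the file line by line, the following uses a hypothetical `RobotsGroups` class (the name and both methods are assumptions, not part of the original answer). It is deliberately simplified: it ignores Allow lines and wildcards, and picks the group that names the bot if one exists, otherwise the `*` group.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; simplified on purpose (no Allow lines, no wildcards).
final class RobotsGroups {

    // Collects the Disallow prefixes that apply to botName: those of the groups
    // naming the bot if any exist, otherwise those of the "*" group.
    static List<String> disallowedFor(String robotsTxt, String botName) {
        List<String> specific = new ArrayList<>();
        List<String> wildcard = new ArrayList<>();
        boolean sawSpecific = false;
        boolean inSpecific = false;
        boolean inWildcard = false;
        boolean lastWasAgent = false;
        String bot = botName.toLowerCase(Locale.ROOT);
        try (BufferedReader reader = new BufferedReader(new StringReader(robotsTxt))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int hash = line.indexOf('#');
                if (hash >= 0) line = line.substring(0, hash); // strip comments
                int colon = line.indexOf(':');
                if (colon < 0) continue; // blank or malformed line
                String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                String value = line.substring(colon + 1).trim();
                if (field.equals("user-agent")) {
                    // consecutive User-agent lines share one group; otherwise start a new one
                    if (!lastWasAgent) { inSpecific = false; inWildcard = false; }
                    String agent = value.toLowerCase(Locale.ROOT);
                    if (agent.equals("*")) {
                        inWildcard = true;
                    } else if (!agent.isEmpty() && bot.contains(agent)) {
                        inSpecific = true;
                        sawSpecific = true;
                    }
                    lastWasAgent = true;
                } else {
                    lastWasAgent = false;
                    if (field.equals("disallow") && !value.isEmpty()) {
                        if (inSpecific) specific.add(value);
                        if (inWildcard) wildcard.add(value);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected from a StringReader
        }
        return sawSpecific ? specific : wildcard;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

For example, given a file with a `*` group that disallows /private/ and a WhateverBot group that disallows /tmp/, a crawler named WhateverBot would be blocked only from /tmp/, while any other crawler would be blocked only from /private/.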

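【Comments】:

Does this support wildcard matching?
readLine() should be used for text files made up of lines, rather than reading character by character. I doubt this code really works.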
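
On the wildcard question: Solution 2 compares plain path prefixes, so a `*` or `$` in a rule is treated as a literal character. One possible way to support them is to translate each rule into a regular expression. The sketch below assumes the common convention that `*` matches any run of characters and a trailing `$` anchors the end of the path; the class `RobotsPattern` and its methods are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper for robots.txt path rules with '*' and trailing '$'.
final class RobotsPattern {

    // Converts a rule into a regex that must match the whole path.
    // Without a trailing '$' the rule behaves as a prefix match.
    static Pattern compile(String rule) {
        boolean anchoredEnd = rule.endsWith("$");
        String body = anchoredEnd ? rule.substring(0, rule.length() - 1) : rule;
        String[] parts = body.split("\\*", -1); // keep empty trailing parts
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) regex.append(".*");         // each '*' becomes "any characters"
            regex.append(Pattern.quote(parts[i])); // everything else is literal
        }
        if (!anchoredEnd) regex.append(".*");
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    static boolean matches(String rule, String path) {
        return compile(rule).matcher(path).matches();
    }
}
```

In a real crawler the compiled patterns would be cached per rule rather than rebuilt for every URL.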