【Question title】: Resolve HtmlCleaner issue of getting HTTP response code 403
【Posted】: 2013-09-06 07:02:09
【Question】:

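I am using HtmlCleaner to fetch data from a website, but I keep getting this error:

Server returned HTTP response code: 403 for URL: http://www.groupon.com/browse/chicago?z=skip

I am not sure what I am doing wrong, because I have used the same code before and it worked fine. Can anyone help me?

Here is the code: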

public ArrayList ParseGrouponDeals(ArrayList arrayList) {
    try {
        CleanerProperties props = new CleanerProperties();

        props.setTranslateSpecialEntities(true);
        props.setTransResCharsToNCR(true);
        props.setOmitComments(true);

        TagNode root = new HtmlCleaner(props).clean(new URL("http://www.groupon.com/browse/chicago?z=skip"));

        //Get the Wrapper.
        Object[] objects = root.evaluateXPath("//*[@id=\"browse-deals\"]");
        TagNode dealWrapper = (TagNode) objects[0];

        //Get the children
        TagNode[] todayDeals = dealWrapper.getElementsByAttValue("class", "deal-list-tile grid_5_third", true, true);
        System.out.println("++++ Groupon Deal Today: " + todayDeals.length + " deals");
        for (int i = 0; i < todayDeals.length; i++) {
            String link = String.format("http://www.groupon.com%s", todayDeals[i].findElementByAttValue("class", "deal-permalink", true, true).getAttributeByName("href").toString());
            arrayList.add(link);
        }
        return arrayList;
    } catch (Exception e) {
        System.out.println("Error parsing Groupon:" + e.getMessage());
        e.printStackTrace();
    }
    return null;
}

【Comments】:

    Tags: java url httpresponse http-status-code-403 htmlcleaner


    【Answer 1】:

    For me, adding a "User-Agent" request header solved the problem; use it as in this snippet:

            final URL urlSB = new URL("http://www.groupon.com/browse/chicago?z=skip");
            final URLConnection urlConnection = urlSB.openConnection();
            urlConnection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0");
            urlConnection.connect();
            final HtmlCleaner cleaner = new HtmlCleaner();
            final CleanerProperties props = cleaner.getProperties();
            props.setNamespacesAware(false);
            final TagNode tagNodeRoot = cleaner.clean(urlConnection.getInputStream());
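
The snippet above changes only one request header, which suggests (the thread does not confirm it) that the server rejects the default `Java/<version>` User-Agent that `URLConnection` sends when none is set. Below is a minimal sketch of that effect using only the JDK. The local server, the class name `UserAgentDemo`, and the helpers `fetchStatus` and `startPickyServer` are hypothetical stand-ins for illustration; they are not part of HtmlCleaner and do not reproduce the real site's rules.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class UserAgentDemo {

    // Browser-like value copied from the answer; any string the target accepts would do.
    static final String BROWSER_UA =
            "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0";

    /** Sends a GET and returns the status code; a null userAgent keeps the JDK default. */
    static int fetchStatus(URL url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        if (userAgent != null) {
            // Must be set before the request is sent.
            conn.setRequestProperty("User-Agent", userAgent);
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    /** Starts a local stand-in server (assumption) that answers 403 to "Java/..." agents. */
    static HttpServer startPickyServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean blocked = ua == null || ua.startsWith("Java/");
            byte[] body = (blocked ? "forbidden" : "<html><body>ok</body></html>")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(blocked ? 403 : 200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startPickyServer();
        try {
            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            System.out.println("default agent: " + fetchStatus(url, null));
            System.out.println("browser agent: " + fetchStatus(url, BROWSER_UA));
        } finally {
            server.stop(0);
        }
    }
}
```

If the real site filters requests in a similar way, opening the connection yourself with an explicit User-Agent and passing its input stream to `cleaner.clean(...)`, as the answer does, would avoid the 403 that `clean(new URL(...))` runs into.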
    

    【Comments】:
