Scrape information from Web Pages with Java? Answers

【Question title】: Scrape information from Web Pages with Java?
【Posted】: 2018-06-18 17:15:03
【Question】:

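I'm trying to extract data from a web page. For example, say I want to fetch information from chess.org.il.

I know the player's ID is 25022, which means I can request http://www.chess.org.il/Players/Player.aspx?Id=25022

On that page I can see that this player's FIDE ID = 2821109.
From there, I can request this page:
http://ratings.fide.com/card.phtml?event=2821109

From it I can see that stdRating=1602.

How can I get the "stdRating" output from a given "localID" input in Java?

(localID, fideID and stdRating are helper names I use to clarify the question)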

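【Comments】:

Welcome to Stack Overflow. Beyond your verbal description of what you are doing, it would help to see some code showing what you have already tried. See minimal reproducible example for tips on creating a code sample. Also read How to Ask for tips on improving your question.
You need to parse those parameters out of the pages returned for those requests. A tool like JSoup (jsoup.org) works well for this.

Tags: java http web-scraping parameters response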

【Solution 1】:

You can try univocity-html-parser, which is very easy to use and avoids a lot of spaghetti code.

For example, to get the standard rating you can use this code:

public static void main(String... args) {
    UrlReaderProvider url = new UrlReaderProvider("http://ratings.fide.com/card.phtml?event={EVENT}");
    url.getRequest().setUrlParameter("EVENT", 2821109);

    HtmlElement doc = HtmlParser.parseTree(url);

    String rating = doc.query()
            .match("small").withText("std.")
            .match("br").getFollowingText()
            .getValue();

    System.out.println(rating);
}

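This produces the value 1602.

But getting data by querying individual nodes and trying to stitch all the pieces together is not exactly easy.

I expanded the code to illustrate how the parser can collect more information into records. Here I created records for the player and her ranking details, which are available in the table on the second page. It took me less than 1 hour to get this done: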

public static void main(String... args) {
    UrlReaderProvider url = new UrlReaderProvider("http://www.chess.org.il/Players/Player.aspx?Id={PLAYER_ID}");
    url.getRequest().setUrlParameter("PLAYER_ID", 25022);

    HtmlEntityList entities = new HtmlEntityList();
    HtmlEntitySettings player = entities.configureEntity("player");
    player.addField("id").match("b").withExactText("מספר שחקן").getFollowingText().transform(s -> s.replaceAll(": ", ""));
    player.addField("name").match("h1").followedImmediatelyBy("b").withExactText("מספר שחקן").getText();
    player.addField("date_of_birth").match("b").withExactText("תאריך לידה:").getFollowingText();
    player.addField("fide_id").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getText();

    HtmlLinkFollower playerCard = player.addField("fide_card_url").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getAttribute("href").followLink();
    playerCard.addField("rating_std").match("small").withText("std.").match("br").getFollowingText();
    playerCard.addField("rating_rapid").match("small").withExactText("rapid").match("br").getFollowingText();
    playerCard.addField("rating_blitz").match("small").withExactText("blitz").match("br").getFollowingText();
    playerCard.setNesting(Nesting.REPLACE_JOIN);

    HtmlEntitySettings ratings = playerCard.addEntity("ratings");
    configureRatingsBetween(ratings, "World Rank", "National Rank ISR", "world");
    configureRatingsBetween(ratings, "National Rank ISR", "Continent Rank Europe", "country");
    configureRatingsBetween(ratings, "Continent Rank Europe", "Rating Chart", "continent");

    Results<HtmlParserResult> results = new HtmlParser(entities).parse(url);
    HtmlParserResult playerData = results.get("player");
    String[] playerFields = playerData.getHeaders();

    for(HtmlRecord playerRecord : playerData.iterateRecords()){
        for(int i = 0; i < playerFields.length; i++){
            System.out.print(playerFields[i] + ": " + playerRecord.getString(playerFields[i]) +"; ");
        }
        System.out.println();

        HtmlParserResult ratingData = playerRecord.getLinkedEntityData().get("ratings");
        for(HtmlRecord ratingRecord : ratingData.iterateRecords()){
            System.out.print(" * " + ratingRecord.getString("rank_type") + ": ");
            System.out.println(ratingRecord.fillFieldMap(new LinkedHashMap<>(), "all_players", "active_players", "female", "u16", "female_u16"));
        }
    }
}

private static void configureRatingsBetween(HtmlEntitySettings ratings, String startingHeader, String endingHeader, String rankType) {
    Group group = ratings.newGroup()
            .startAt("table").match("b").withExactText(startingHeader)
            .endAt("b").withExactText(endingHeader);

    group.addField("rank_type", rankType);

    group.addField("all_players").match("tr").withText("World (all", "National (all", "Rank (all").match("td", 2).getText();
    group.addField("active_players").match("tr").followedImmediatelyBy("tr").withText("Female (active players):").match("td", 2).getText();
    group.addField("female").match("tr").withText("Female (active players):").match("td", 2).getText();
    group.addField("u16").match("tr").withText("U-16 Rank (active players):").match("td", 2).getText();
    group.addField("female_u16").match("tr").withText("Female U-16 Rank (active players):").match("td", 2).getText();
}

The output will be:

id: 25022; name: יעל כהן; date_of_birth: 02/02/2003; fide_id: 2821109; rating_std: 1602; rating_rapid: 1422; rating_blitz: 1526; 
 * world: {all_players=195907, active_players=94013, female=5490, u16=3824, female_u16=586}
 * country: {all_players=1595, active_players=1024, female=44, u16=51, female_u16=3}
 * continent: {all_players=139963, active_players=71160, female=3757, u16=2582, female_u16=372}

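Hope this helps.

Disclosure: I'm the author of this library. It's commercial closed source, but it can save you a lot of development time.

【Comments】: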

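【Solution 2】:

As @Alex R pointed out, you'll need a web scraping library.
The one he recommended, JSoup, is quite robust and very commonly used for this task in Java, at least in my experience.

You first need to construct a document that fetches your page, e.g.: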

int localID = 25022; //your player's ID.
Document doc = Jsoup.connect("http://www.chess.org.il/Players/Player.aspx?Id=" + localID).get();

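From this Document object you can fetch a lot of information, such as the FIDE ID you asked for. Unfortunately, the web page you linked is not very easy to scrape, and you will basically need to go through every link on the page to find the relevant one, e.g.: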

Elements fidelinks = doc.select("a[href*=fide.com]");

This Elements object should give you a list of all links that point to anything containing the text fide.com, but you probably only want the first one, e.g.:

Element fideurl = doc.selectFirst("a[href*=fide.com]");

From there on, I don't want to write all the code for you, but hopefully this answer serves as a good starting point!

You can get the ID by itself by calling the text() method on the Element object, but you can also get the link itself by calling Element.attr("href").
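The href itself also carries the ID as the event query parameter, so it can be read from the URL instead of from the link text. Here is a minimal sketch of that idea using only java.net.URI. The class and method names (FideLinks, eventId) are made up for illustration, and it assumes the link has the same shape as the one in the question.

```java
import java.net.URI;
import java.util.Optional;

public class FideLinks {
    /** Returns the value of the "event" query parameter of the given link, if present. */
    public static Optional<String> eventId(String href) {
        String query = URI.create(href).getQuery();
        if (query == null) {
            return Optional.empty(); // link has no query string at all
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("event")) {
                return Optional.of(pair.substring(eq + 1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The href would normally come from fideurl.attr("href").
        String href = "http://ratings.fide.com/card.phtml?event=2821109";
        System.out.println(eventId(href).orElse("not found")); // prints 2821109
    }
}
```

Returning an Optional keeps the "link found but no ID in it" case explicit, rather than failing later in Integer.parseInt.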

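The CSS selector you can use to get the other value is div#main-col table.contentpaneopen tbody tr td table tbody tr td table tbody tr:nth-of-type(4) td table tbody tr td:first-of-type, which should give you specifically the standard rating, at least with standard CSS, so it should work in jsoup as well.

【Comments】: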
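To show how the two requests chain together (localID to FIDE link to standard rating), here is a rough, dependency-free sketch. It is not the jsoup approach described above: it swaps in regular expressions so it runs without any library, and it takes the page fetcher as a function so it can be tried offline. The class name RatingLookup, the fetch parameter, and both markup patterns are assumptions. In particular, the `<small>std.</small><br>1602` shape is only inferred from the queries in the other answer, and the real pages may differ.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RatingLookup {
    // Assumed markup: an anchor whose href points at the FIDE card page.
    private static final Pattern FIDE_LINK =
            Pattern.compile("href=\"(http://ratings\\.fide\\.com/card\\.phtml\\?event=\\d+)\"");
    // Assumed markup: <small>std.</small><br>1602
    private static final Pattern STD_RATING =
            Pattern.compile("<small>\\s*std\\.\\s*</small>\\s*<br\\s*/?>\\s*(\\d+)");

    /** Chains both lookups; fetch maps a URL to that page's HTML. */
    public static Optional<Integer> stdRating(int localId, Function<String, String> fetch) {
        String playerPage = fetch.apply("http://www.chess.org.il/Players/Player.aspx?Id=" + localId);
        Matcher link = FIDE_LINK.matcher(playerPage);
        if (!link.find()) {
            return Optional.empty(); // no FIDE link on the player page
        }
        String cardPage = fetch.apply(link.group(1));
        Matcher rating = STD_RATING.matcher(cardPage);
        return rating.find() ? Optional.of(Integer.parseInt(rating.group(1))) : Optional.empty();
    }

    public static void main(String[] args) {
        // Canned pages with hypothetical markup, so the sketch runs without network access.
        Map<String, String> pages = Map.of(
                "http://www.chess.org.il/Players/Player.aspx?Id=25022",
                "<a href=\"http://ratings.fide.com/card.phtml?event=2821109\">2821109</a>",
                "http://ratings.fide.com/card.phtml?event=2821109",
                "<small>std.</small><br>1602");
        System.out.println(stdRating(25022, url -> pages.getOrDefault(url, "")));
    }
}
```

With jsoup on the classpath, the fetch function could be replaced by something like url -> Jsoup.connect(url).get().html(), although selecting elements as described above is likely to be more robust than matching raw HTML.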

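I use: Element fideurl = doc.selectFirst("a[href*=fide.com]"); int fideID = Integer.parseInt(fideurl.text()); and it works, but for the second parameter: Element stdRatingurl = doc2.selectFirst("a[small*=std.]"); or: Element stdRatingurl = doc2.selectFirst("table.contentpaneopen.tr.td.table.tr.td.table.tr.td.Federation"); (for another parameter) it doesn't work... can you help?
They really didn't make that element easy to find; give me a moment.
I've added a solution to the question. If it solves your problem, please don't forget to mark the answer as accepted.
If you want to get the stdRatingURL from the FIDE ID, you can call fideurl.attr("href"), which should give you the link the anchor element points to.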