你可以试试univocity-html-parser,它非常好用,避免了很多意大利面条式的代码。
例如,要获得标准评级,您可以使用以下代码:
public static void main(String... args) {
UrlReaderProvider url = new UrlReaderProvider("http://ratings.fide.com/card.phtml?event={EVENT}");
url.getRequest().setUrlParameter("EVENT", 2821109);
HtmlElement doc = HtmlParser.parseTree(url);
String rating = doc.query()
.match("small").withText("std.")
.match("br").getFollowingText()
.getValue();
System.out.println(rating);
}
产生值1602。
但是通过查询单个节点并尝试将所有部分拼接在一起来获取数据并不容易。
我扩展了代码以说明如何使用解析器将更多信息放入记录中。在这里,我为玩家和她的排名详细信息创建了记录,这些记录可在第二页的表格中找到。我用了不到 1 小时就完成了这项工作:
public static void main(String... args) {
UrlReaderProvider url = new UrlReaderProvider("http://www.chess.org.il/Players/Player.aspx?Id={PLAYER_ID}");
url.getRequest().setUrlParameter("PLAYER_ID", 25022);
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings player = entities.configureEntity("player");
player.addField("id").match("b").withExactText("מספר שחקן").getFollowingText().transform(s -> s.replaceAll(": ", ""));
player.addField("name").match("h1").followedImmediatelyBy("b").withExactText("מספר שחקן").getText();
player.addField("date_of_birth").match("b").withExactText("תאריך לידה:").getFollowingText();
player.addField("fide_id").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getText();
HtmlLinkFollower playerCard = player.addField("fide_card_url").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getAttribute("href").followLink();
playerCard.addField("rating_std").match("small").withText("std.").match("br").getFollowingText();
playerCard.addField("rating_rapid").match("small").withExactText("rapid").match("br").getFollowingText();
playerCard.addField("rating_blitz").match("small").withExactText("blitz").match("br").getFollowingText();
playerCard.setNesting(Nesting.REPLACE_JOIN);
HtmlEntitySettings ratings = playerCard.addEntity("ratings");
configureRatingsBetween(ratings, "World Rank", "National Rank ISR", "world");
configureRatingsBetween(ratings, "National Rank ISR", "Continent Rank Europe", "country");
configureRatingsBetween(ratings, "Continent Rank Europe", "Rating Chart", "continent");
Results<HtmlParserResult> results = new HtmlParser(entities).parse(url);
HtmlParserResult playerData = results.get("player");
String[] playerFields = playerData.getHeaders();
for(HtmlRecord playerRecord : playerData.iterateRecords()){
for(int i = 0; i < playerFields.length; i++){
System.out.print(playerFields[i] + ": " + playerRecord.getString(playerFields[i]) +"; ");
}
System.out.println();
HtmlParserResult ratingData = playerRecord.getLinkedEntityData().get("ratings");
for(HtmlRecord ratingRecord : ratingData.iterateRecords()){
System.out.print(" * " + ratingRecord.getString("rank_type") + ": ");
System.out.println(ratingRecord.fillFieldMap(new LinkedHashMap<>(), "all_players", "active_players", "female", "u16", "female_u16"));
}
}
}
private static void configureRatingsBetween(HtmlEntitySettings ratings, String startingHeader, String endingHeader, String rankType) {
Group group = ratings.newGroup()
.startAt("table").match("b").withExactText(startingHeader)
.endAt("b").withExactText(endingHeader);
group.addField("rank_type", rankType);
group.addField("all_players").match("tr").withText("World (all", "National (all", "Rank (all").match("td", 2).getText();
group.addField("active_players").match("tr").followedImmediatelyBy("tr").withText("Female (active players):").match("td", 2).getText();
group.addField("female").match("tr").withText("Female (active players):").match("td", 2).getText();
group.addField("u16").match("tr").withText("U-16 Rank (active players):").match("td", 2).getText();
group.addField("female_u16").match("tr").withText("Female U-16 Rank (active players):").match("td", 2).getText();
}
输出将是:
id: 25022; name: יעל כהן; date_of_birth: 02/02/2003; fide_id: 2821109; rating_std: 1602; rating_rapid: 1422; rating_blitz: 1526;
* world: {all_players=195907, active_players=94013, female=5490, u16=3824, female_u16=586}
* country: {all_players=1595, active_players=1024, female=44, u16=51, female_u16=3}
* continent: {all_players=139963, active_players=71160, female=3757, u16=2582, female_u16=372}
希望对你有帮助
披露:我是这个库的作者。它是商业封闭源代码,但可以为您节省大量开发时间。