【Question title】:Scrape webpages that are loaded using JavaScript function [closed]
【Posted】:2018-02-22 05:32:59
【Question】:
I am currently trying to scrape this site using jsoup.
My code so far:
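import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("http://www.world-food.ru/ru-RU/about/exhibitor-list.aspx").get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to parse if the request failed
        }
        // Exhibitor links are selected by their class attribute.
        Elements list = doc.getElementsByClass("name showframe");
        for (Element link : list) {
            System.out.println(link.html() + " \n" + link.absUrl("href"));
        }
    }
}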
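My problem is that the code above only scrapes the first of 71 pages; the remaining pages are loaded by calling a JavaScript function.
How can I scrape the other pages with jsoup?
【Comments】:
Tags: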
java
web-scraping
jsoup
html-parsing
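【Solution 1】:
The JavaScript function in question simply sends a POST request to the same URL with a different __EVENTARGUMENT, which is the page number.
You can easily fetch the other pages by mimicking this behavior:
Imports: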
import org.jsoup.*;
import org.jsoup.Connection.Response;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import static java.net.URLEncoder.encode;
Code:
public static void main(String[] args) {
    String url = "http://www.world-food.ru/ru-RU/about/exhibitor-list.aspx";
    String userAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0";
    try {
        // Initial GET: collects the session cookies and the __VIEWSTATE hidden field.
        Response response = Jsoup.connect(url).execute();
        Document document = response.parse();
        String viewState = encode(document.getElementById("__VIEWSTATE").attr("value"), "UTF-8");
        String eventTarget = encode("p$lt$ctl12$pageplaceholder$p$lt$ctl01$UniPager$pagerElem", "UTF-8");
        // One POST per page; __EVENTARGUMENT carries the page number (1 to 71).
        for (int i = 1; i < 72; ++i) {
            document = Jsoup.connect(url).userAgent(userAgent)
                    .requestBody(
                            String.format(
                                    "__EVENTTARGET=%s"
                                            + "&__EVENTARGUMENT=%d"
                                            + "&__VIEWSTATE=%s",
                                    eventTarget, i, viewState))
                    .cookies(response.cookies())
                    .post();
            Elements list = document.getElementsByClass("name showframe");
            for (int x = 0; x < list.size(); x++) {
                System.out.println(list.get(x).html() + " \n" + list.get(x).absUrl("href"));
            }
        }
    } catch (Exception ex) {
        // TODO Handle exceptions
        ex.printStackTrace();
    }
}
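As a side note, the request body can also be assembled from a map of fields instead of a format string, which makes it harder to forget encoding a value and easier to add further hidden fields if the page turns out to need them. The following is only a minimal sketch using the JDK alone (Java 10+ for the `Charset` overload of `URLEncoder.encode`); the class name `PostbackBody` and the methods `encodeForm` and `forPage` are made up for illustration and are not part of jsoup or of the answer above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class PostbackBody {

    // Joins form fields into an application/x-www-form-urlencoded string,
    // percent-encoding every name and value. (Hypothetical helper.)
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Builds the body for one pager postback; takes raw (unencoded) values.
    static String forPage(String eventTarget, int page, String viewState) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps insertion order
        fields.put("__EVENTTARGET", eventTarget);
        fields.put("__EVENTARGUMENT", Integer.toString(page));
        fields.put("__VIEWSTATE", viewState);
        return encodeForm(fields);
    }

    public static void main(String[] args) {
        String target = "p$lt$ctl12$pageplaceholder$p$lt$ctl01$UniPager$pagerElem";
        // "dummy/view+state==" stands in for the real __VIEWSTATE value.
        System.out.println(forPage(target, 2, "dummy/view+state=="));
    }
}
```

The resulting string could then be passed to `requestBody(...)` in place of the `String.format(...)` call. Note that `forPage` expects the raw values, so they should not be run through `encode` beforehand.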
【Solution 2】:
So, I finally got this working...
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import static java.net.URLEncoder.encode;

public class Main {
    public static void main(String[] args) throws IOException {
        final String colNames = "Name;Country;Address;Phone;Fax;Site;Descr";
        final String url = "http://www.world-food.ru/ru-RU/about/exhibitor-list.aspx";
        String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36";
        StringBuilder builder = new StringBuilder();
        builder.append(colNames).append('\n');

        // Initial GET: collects the session cookies and the __VIEWSTATE hidden field.
        Connection.Response response = Jsoup.connect(url).execute();
        Document document = response.parse();
        String viewState = encode(document.getElementById("__VIEWSTATE").attr("value"), "UTF-8");
        String eventTarget = encode("p$lt$ctl12$pageplaceholder$p$lt$ctl01$UniPager$pagerElem", "UTF-8");

        // One POST per page of the exhibitor list (pages 1 to 71).
        for (int i = 1; i <= 71; i++) {
            document = Jsoup.connect(url).userAgent(userAgent)
                    .requestBody(
                            String.format(
                                    "__EVENTTARGET=%s" + "&__EVENTARGUMENT=%d" + "&__VIEWSTATE=%s",
                                    eventTarget, i, viewState))
                    .cookies(response.cookies())
                    .post();
            Elements subList = document.getElementsByClass("name showframe");
            for (Element link : subList) {
                // Each exhibitor links to a detail page holding the remaining columns.
                Document compPage = Jsoup.connect(link.absUrl("href")).post();
                Elements compPageData = compPage.getElementsByClass("name");
                // Semicolons inside values are replaced so they cannot split a column.
                builder.append(link.html().replace(';', ',')).append(';');
                builder.append(link.siblingElements().first().text()).append(';');
                for (Element label : compPageData) {
                    builder.append(label.siblingElements().text().replace(';', ',')).append(';');
                }
                builder.append('\n');
            }
        }

        // try-with-resources closes the file even if the write fails.
        try (PrintWriter pw = new PrintWriter(new File("Participants.csv"), "windows-1251")) {
            pw.write(builder.toString());
        }
    }
}
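The code above keeps the CSV intact by replacing every `;` inside a value with `,`, which alters the scraped text. A different approach, not the one used in this answer, is to quote such fields the way RFC 4180 style CSV does, so the original text survives. This is only a rough sketch under that assumption; `CsvField`, `escape` and `joinRow` are hypothetical names, and whether the program that opens the file honours quoted fields with a `;` delimiter would need checking.

```java
public class CsvField {

    // Wraps a value in double quotes when it contains the delimiter, a quote,
    // or a line break; embedded quotes are doubled. Other values pass through.
    static String escape(String value, char delimiter) {
        boolean needsQuotes = value.indexOf(delimiter) >= 0
                || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Joins already-scraped values into one delimited row (no trailing newline).
    static String joinRow(char delimiter, String... values) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                row.append(delimiter);
            }
            row.append(escape(values[i], delimiter));
        }
        return row.toString();
    }

    public static void main(String[] args) {
        // Sample values are invented for the demonstration.
        System.out.println(joinRow(';', "ACME; Ltd", "Russia", "says \"hi\""));
    }
}
```

With something like this, the `replace(';', ',')` calls could be dropped and each row built with `joinRow(';', ...)` instead.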