【Question title】: Scrape webpages that are loaded using JavaScript function [closed]
【Posted】: 2018-02-22 05:32:59
【Question】:

I am currently trying to scrape this site using jsoup.

My code so far:

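import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("http://www.world-food.ru/ru-RU/about/exhibitor-list.aspx").get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to parse if the request failed
        }

        Elements list = doc.getElementsByClass("name showframe");

        // Print each exhibitor name followed by its absolute link.
        for (int i = 0; i < list.size() ; i++) {
            System.out.println(list.get(i).html() + " \n" + list.get(i).absUrl("href"));
        }
    }
}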

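My problem is that the code above only scrapes the first of 71 pages; the other pages are loaded by calling a JavaScript function.

How can I scrape the other pages using jsoup?

【Discussion】:

    Tags: java web-scraping jsoup html-parsing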


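    【Solution 1】:

    The JavaScript function in question just sends a POST request to the same URL with a different __EVENTARGUMENT, which is the page number.
    You can easily fetch the other pages by mimicking this behaviour:

    Imports: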

    import org.jsoup.*;
    import org.jsoup.Connection.Response;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    
    import static java.net.URLEncoder.encode;    
    

    Code:

    public static void main(String[] args){
        String url = "http://www.world-food.ru/ru-RU/about/exhibitor-list.aspx";
        String userAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0";
        try {
            Response  response = Jsoup.connect(url).execute();
            Document document = response.parse();
    
            String viewState = encode(document.getElementById("__VIEWSTATE").attr("value"), "UTF-8");
            String eventTarget = encode("p$lt$ctl12$pageplaceholder$p$lt$ctl01$UniPager$pagerElem", "UTF-8");
    
            for(int i = 1; i < 72; ++i) {
                document = Jsoup.connect(url).userAgent(userAgent)
                    .requestBody(
                            String.format(
                                    "__EVENTTARGET=%s"
                                    + "&__EVENTARGUMENT=%d"
                                    + "&__VIEWSTATE=%s",
                                    eventTarget, i, viewState ))
                    .cookies(response.cookies())
                    .post();
    
                Elements list = document.getElementsByClass("name showframe");
    
                for (int x = 0; x < list.size() ; x++) {
                    System.out.println(list.get(x).html() + " \n" + list.get(x).absUrl("href"));
                }
            }
        } catch (Exception ex) {
            // TODO Handle exceptions
            ex.printStackTrace();
        }
    }
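
The request body built inside the loop above can also be factored into a small helper. The following is only a sketch using the standard library: the class name `PostbackBody` and the methods `formEncode` and `pageRequestBody` are hypothetical names introduced for illustration, and the view-state string in `main` is a placeholder rather than a real value from the site. Unlike the answer's code, the helper takes raw (not yet encoded) values and encodes each one exactly once. It assumes Java 10 or later for the `URLEncoder.encode(String, Charset)` overload.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class PostbackBody {

    /** URL-encodes every key and value and joins the pairs as key=value&key=value. */
    static String formEncode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** Builds the body for one pager postback; all arguments are raw, unencoded values. */
    static String pageRequestBody(String eventTarget, int page, String viewState) {
        Map<String, String> fields = new LinkedHashMap<>(); // preserves insertion order
        fields.put("__EVENTTARGET", eventTarget);
        fields.put("__EVENTARGUMENT", Integer.toString(page));
        fields.put("__VIEWSTATE", viewState);
        return formEncode(fields);
    }

    public static void main(String[] args) {
        String target = "p$lt$ctl12$pageplaceholder$p$lt$ctl01$UniPager$pagerElem";
        // Placeholder: the real value comes from the __VIEWSTATE input of the first response.
        String viewState = "dummy+state/==";
        System.out.println(pageRequestBody(target, 2, viewState));
    }
}
```

The returned string could then be passed to `requestBody(...)` in place of the `String.format(...)` call, provided the view state and event target are handed over unencoded so they are not encoded twice.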
    

    【Comments】:

    • Awesome... thank you very much =)
    【Solution 2】:

    So, I finally got this working...

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    import java.io.File;
    import java.io.IOException;
    import java.io.PrintWriter;
    import static java.net.URLEncoder.encode;
    
    public class Main {
        public static void main(String[] args) throws IOException {
    
            PrintWriter pw = new PrintWriter(new File("Participants.csv"), "windows-1251");
            final String colNames = "Name;Country;Address;Phone;Fax;Site;Descr";
            StringBuilder builder = new StringBuilder();
            builder.append(colNames + "\n");
    
            final String url = "http://www.world-food.ru/ru-RU/about/exhibitor-list.aspx";
            String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36";
    
            Connection.Response response = Jsoup.connect(url).execute();
            Document document = response.parse();
            String viewState = encode(document.getElementById("__VIEWSTATE").attr("value"), "UTF-8");
            String eventTarget = encode("p$lt$ctl12$pageplaceholder$p$lt$ctl01$UniPager$pagerElem" , "UTF-8");
    
            for (int i = 1; i <=71 ; i++) {
                document = Jsoup.connect(url).userAgent(userAgent)
                        .requestBody(
                                String.format(
                                        "__EVENTTARGET=%s" + "&__EVENTARGUMENT=%d" + "&__VIEWSTATE=%s",
                                        eventTarget, i , viewState ))
                        .cookies(response.cookies())
                        .post();
    
                Elements subList = document.getElementsByClass("name showframe");
                for (int j = 0; j < subList.size(); j++) {
                    Document compPage = Jsoup.connect(subList.get(j).absUrl("href")).post();
                    Elements compPageData = compPage.getElementsByClass("name");
    
                    builder.append(subList.get(j).html().replace(';',',') + ";");
    //                System.out.println(subList.get(j).html().replace(';',',') + ";");
                    builder.append(subList.get(j).siblingElements().first().text()+ ";");
    //                System.out.println(subList.get(j).siblingElements().first().text()+ ";");
    
                    for (int k = 0; k <compPageData.size() ; k++) {
                        builder.append(compPageData.get(k).siblingElements().text().replace(';', ',') + ";");
    //                    System.out.print(compPageData.get(k).siblingElements().text().replace(';', ',') + ";");
                    }
                    builder.append('\n');
                }
            }
            pw.write(builder.toString());
            pw.close();
        }
    }
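
The code above keeps the file parseable by replacing every `;` inside a field with `,`, which alters the scraped text. As a possible alternative, fields can be quoted instead. The sketch below is an assumption-laden illustration, not part of the original answer: `CsvField`, `escape`, and `row` are hypothetical names, the sample values are made up, and the quoting follows the common CSV convention of wrapping a field in double quotes and doubling any embedded quotes. Note that it writes no trailing delimiter, unlike the loop above.

```java
final class CsvField {

    /** Quotes a field only if it contains the delimiter, a double quote, or a line break. */
    static String escape(String value, char delimiter) {
        boolean needsQuotes = value.indexOf(delimiter) >= 0
                || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return value;
        }
        // Embedded double quotes are doubled, then the whole field is wrapped in quotes.
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /** Joins the escaped fields with the delimiter and terminates the row with a newline. */
    static String row(char delimiter, String... fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                line.append(delimiter);
            }
            line.append(escape(fields[i], delimiter));
        }
        return line.append('\n').toString();
    }

    public static void main(String[] args) {
        // Made-up sample values, only to show the quoting.
        System.out.print(row(';', "ACME; Ltd", "Russia", "He said \"hi\""));
    }
}
```

Whether the consuming spreadsheet or parser honours this quoting should be checked before relying on it.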
    
