【Question Title】: Crawl through JavaScript redirect
【Posted】: 2017-04-06 18:08:18
【Question】:

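I am writing a spider in Java and am having trouble handling URL redirects. So far I have run into two kinds of URL redirect. The first kind uses an HTTP 3xx response code, which I can handle by following this answer.

In the second kind, the server returns HTTP response code 200 and the page contains only some JavaScript, like this: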

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script>
function detectmob() { 
    var u=(document.URL);
    if( navigator.userAgent.match(/Android/i) || some other browser...){
        window.location.href="web/mobile/index.php";
    } else {
        window.location.href="web/desktop/index.php";
    }
}

detectmob();
</script>
</head>
<body></body></html>

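If the original URL is http://example.com, a desktop web browser with JavaScript enabled is automatically redirected to http://example.com/web/desktop/index.php

However, my spider calls HttpURLConnection#getResponseCode() and treats HTTP response code 200 as a sign that it has reached the final URL. When it receives a 3xx response code, it reads the Location field with URLConnection#getHeaderField(). Here is a snippet of my spider code: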

public String getFinalUrl(String originalUrl) {
        try {
            URLConnection con = new URL(originalUrl).openConnection();
            HttpURLConnection hCon = (HttpURLConnection) con;
            hCon.setInstanceFollowRedirects(false);
            if(hCon.getResponseCode() == HttpURLConnection.HTTP_MOVED_PERM 
                    || hCon.getResponseCode() == HttpURLConnection.HTTP_MOVED_TEMP) {
                System.out.println("redirected url: " + con.getHeaderField("Location"));
                return getFinalUrl(con.getHeaderField("Location"));
            }
        } catch (IOException ex) {
            System.err.println(ex.toString());
        }

        return originalUrl;
    }

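So fetching the page above yields HTTP response code 200, and my spider assumes there are no further redirects and starts parsing a page whose body text is empty.

I searched Google for this problem, and apparently javax.script is somewhat related, but I don't know how to make it work. How can I program my spider so that it gets the correct URL?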

【Question Comments】:

    Tags: javascript java web-crawler url-redirection


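    【Solution 1】:

    Here is a solution that uses Apache HttpClient to handle response-code redirects, Jsoup to extract the JavaScript from the HTML, and regular expressions to pull the redirect target out of the several ways a redirect can be written in JavaScript.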

    package com.yourpackage;
    
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.StringWriter;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.HttpClientBuilder;
    import org.jsoup.Jsoup;
    import org.jsoup.helper.StringUtil;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    
    import com.google.common.base.Joiner;
    import com.google.common.net.HttpHeaders;
    
    public class CrawlHelper {
    
      /**
       * Get end contents of a urlString. Status code is not checked here because
       * org.apache.http.client.HttpClient effectively handles the 301 redirects.
       * 
       * Javascript is extracted using Jsoup, and checked for references to
       * "window.location.replace".
       * 
       * @param urlString Url. "http" will be prepended if https or http not already there.
       * @return Result after all redirects, including javascript.
       * @throws IOException
       */
      public String getResult(final String urlString) throws IOException {
        String html = getTextFromUrl(urlString);
        Document doc = Jsoup.parse(html);
        for (Element script : doc.select("script")) {
          // Only follow scripts in which a redirect target was actually found;
          // otherwise the page would be fetched a second time for no reason.
          if (StringUtil.isBlank(getTargetLocationFromScript(script.html()))) {
            continue;
          }
          return getTextFromUrl(getTargetLocationFromScript(urlString, script.html()));
        }
        return html;
      }
    
      /**
       * 
       * @param urlString Will be prepended if the target location doesn't start with "http".
       * @param js Javascript to scan.
       * @return Target that matches window.location.replace or window.location.href assignments.
       * @throws IOException
       */
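      String getTargetLocationFromScript(String urlString, String js) throws IOException {
        String potentialURL = getTargetLocationFromScript(js);
        // No redirect found, or the target is already absolute: return it unchanged.
        if (StringUtil.isBlank(potentialURL) || potentialURL.indexOf("http") == 0) {
          return potentialURL;
        }
        // Relative target: make sure exactly one "/" separates host and path,
        // so "web/mobile/index.php" does not fuse with the host name.
        String separator = potentialURL.startsWith("/") || urlString.endsWith("/") ? "" : "/";
        return Joiner.on("").join(urlString, separator, potentialURL);
      }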
    
      String getTargetLocationFromScript(String js) throws IOException {
        int i = js.indexOf("window.location.replace");
        if (i > -1) {
          return getTargetLocationFromLocationReplace(js);
        }
        i = js.indexOf("window.location.href");    
        if (i > -1) {
          return getTargetLocationFromHrefAssign(js);
        }
        return "";
      }
    
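      private String getTargetLocationFromHrefAssign(String js) {
        // [^"]+ stops at the first closing quote. A greedy .+ would run on to the
        // last quote in the script, because getTextFromUrl drops the line breaks.
        return findTargetFrom("window\\.location\\.href\\s*=\\s*\"([^\"]+)\"", js);
      }
    
      private String getTargetLocationFromLocationReplace(String js) throws IOException {
        return findTargetFrom("window\\.location\\.replace\\(\\s*\"([^\"]+)\"\\s*\\)", js);
      }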
    
      private String findTargetFrom(String regex, String js) {
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(js);
        while (m.find()) {
          String potentialURL = m.group(1);
          if (!StringUtil.isBlank(potentialURL)) {
            return potentialURL;
          }
        }
        return "";
      }
    
      private String getTextFromUrl(String urlString) throws IOException {
        if (StringUtil.isBlank(urlString)) {
          throw new IOException("Supplied URL value is empty.");
        }
        String httpUrlString = prependHTTPifNecessary(urlString);
        HttpClient client = HttpClientBuilder.create().build();
        HttpGet request = new HttpGet(httpUrlString);
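        // HttpHeaders.USER_AGENT is the header name ("User-Agent"), not a browser string.
        // The value below is a placeholder user agent; adjust it as needed.
        request.addHeader(HttpHeaders.USER_AGENT, "Mozilla/5.0 (compatible; CrawlHelper)");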
        HttpResponse response = client.execute(request);
        try (BufferedReader rd =
            new BufferedReader(new InputStreamReader(response.getEntity().getContent()))) {
          StringWriter result = new StringWriter();
          String line = "";
          while ((line = rd.readLine()) != null) {
            result.append(line);
          }
          return result.toString();
        }
      }
    
      private String prependHTTPifNecessary(String urlString) throws IOException {
        if (urlString.indexOf("http") != 0) {
          return Joiner.on("://").join("http", urlString);
        }
        return validateURL(urlString);
      }
    
      private String validateURL(String urlString) throws IOException {
        try {
          new URL(urlString);
        } catch (MalformedURLException mue) {
          throw new IOException(mue);
        }
        return urlString;
      }
    }
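
    The class above builds the redirect URL by string concatenation, which only works when the page URL is a bare host. As a rough sketch of an alternative, relative targets such as "web/desktop/index.php" can be resolved with java.net.URI from the standard library. The class name RedirectResolver and its resolve method are hypothetical, and the sketch assumes the page URL is absolute and includes its scheme:

```java
import java.net.URI;

// Hypothetical helper, not part of the answer's CrawlHelper class.
final class RedirectResolver {

  // Resolves a redirect target found in a script against the URL of the page
  // that contained it. Assumes pageUrl is absolute, e.g. "http://example.com".
  static String resolve(String pageUrl, String target) {
    URI base = URI.create(pageUrl);
    // A base with an empty path ("http://example.com") is first normalized
    // to "http://example.com/" so relative paths attach to the root.
    if (base.getPath() == null || base.getPath().isEmpty()) {
      base = base.resolve("/");
    }
    // Absolute targets are returned as-is; relative ones are merged with base.
    return base.resolve(target).toString();
  }

  public static void main(String[] args) {
    System.out.println(resolve("http://example.com", "web/desktop/index.php"));
    System.out.println(resolve("http://example.com/a/b.html", "/s/index.html"));
  }
}
```

    With this approach a target like "/s/index.html" replaces the whole path of the page URL, and a target without a leading slash is resolved relative to the page's directory, which is how a browser treats window.location.href assignments.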
    

    TDD... modify or extend these tests to cover other scenarios:

    package com.yourpackage;
    
    import java.io.IOException;
    
    import org.junit.Assert;
    import org.junit.Test;
    
    public class CrawlHelperTest {
    
      @Test
      public void testRegex() throws IOException {
        String targetLoc = 
        new CrawlHelper().getTargetLocationFromScript("somesite.com", "function goHome() { window.location.replace(\"/s/index.html\")}");
        Assert.assertEquals("somesite.com/s/index.html", targetLoc);
        targetLoc = 
            new CrawlHelper().getTargetLocationFromScript("window.location.href=\"web/mobile/index.php\";");
        Assert.assertEquals("web/mobile/index.php", targetLoc);
      }
    
      @Test
      public void testCrawl() throws IOException {
        Assert.assertTrue(new CrawlHelper().getResult("somesite.com").indexOf("someExpectedContent") > -1);
      }
    
    }
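
    As one possible extension for other scenarios, the sketch below uses a single pattern that accepts single or double quotes and also recognizes location.assign(...) and a plain window.location = "..." assignment. It uses only java.util.regex. The class name JsRedirectExtractor and the method firstTarget are hypothetical, and the pattern is an assumption about which redirect forms are worth covering, not an exhaustive list:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the answer's CrawlHelper class.
final class JsRedirectExtractor {

  // Branch 1: [window.]location[.href] = "target"   (groups 1-2)
  // Branch 2: [window.]location.assign|replace("target")   (groups 3-4)
  // The back-references \1 and \3 require the closing quote to match the opening one.
  private static final Pattern REDIRECT = Pattern.compile(
      "\\b(?:window\\.)?location(?:\\.href)?\\s*=\\s*([\"'])(.+?)\\1"
      + "|\\b(?:window\\.)?location\\.(?:assign|replace)\\(\\s*([\"'])(.+?)\\3\\s*\\)");

  // Returns the first redirect target found in the script text, if any.
  static Optional<String> firstTarget(String js) {
    Matcher m = REDIRECT.matcher(js);
    if (!m.find()) {
      return Optional.empty();
    }
    return Optional.of(m.group(2) != null ? m.group(2) : m.group(4));
  }

  public static void main(String[] args) {
    System.out.println(firstTarget("window.location.href=\"web/mobile/index.php\";"));
    System.out.println(firstTarget("function goHome() { location.replace('/s/index.html') }"));
  }
}
```

    A static scan like this cannot evaluate the user-agent check in the question's page, so it simply returns the first target that appears in the script (the mobile URL in that example). Choosing the desktop branch would need either extra logic or a real JavaScript engine.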
    

    【Comments】:
