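The following is for learning and discussion only. Do not use it for any other purpose; you bear the consequences if you do.
I. Technologies used
This crawler is a small example I wrote about half a month ago while learning crawling techniques. It is fairly simple, and I am summarizing it here before I forget the details. The main external JARs are HttpClient 4.3.4 and HtmlParser 2.1. The IDE is IntelliJ IDEA 13.1, and dependencies are managed with Maven. If you are not used to IntelliJ, you can create the project in Eclipse instead.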
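II. Crawler basics
1. What is a web crawler? (The basic principle)
Taking the term apart: "web" refers to the Internet, which resembles a spider's web, and the crawler is like a spider that moves around on it, collecting data that is then processed.
The encyclopedia definition: a web crawler (also called a web spider or web robot, and more often a "web chaser" in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.
Basic principle: a traditional crawler starts from the URLs of one or more seed pages and collects the URLs found on them. While fetching pages, it keeps extracting new URLs from the current page and adding them to a queue until a stop condition is met, as the flowchart shows. A focused crawler has a more complex workflow: it uses a page-analysis algorithm to filter out links unrelated to the topic and puts the useful ones into the queue of URLs waiting to be fetched. It then picks the next URL from the queue according to a search strategy and repeats the process until a stop condition is reached.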
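The loop described above can be sketched without any network access. The following is a minimal illustration, not the crawler built later in this article: the class name CrawlLoopSketch, the in-memory map that stands in for the web, and the example.com URLs are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Sketch of the basic crawl loop over a hypothetical in-memory "web". */
class CrawlLoopSketch {

    /**
     * Starts from the seed URLs and keeps fetching until the queue is empty
     * or maxPages pages have been fetched (the stop condition).
     * Returns the URLs in the order they were fetched.
     */
    static List<String> crawl(Map<String, List<String>> web, List<String> seeds, int maxPages) {
        Queue<String> toVisit = new ArrayDeque<>();   // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();           // URLs already queued or fetched
        for (String seed : seeds) {
            if (seen.add(seed)) {
                toVisit.add(seed);
            }
        }

        List<String> fetched = new ArrayList<>();
        while (!toVisit.isEmpty() && fetched.size() < maxPages) {
            String url = toVisit.remove();
            fetched.add(url);                         // stands in for "download the page"
            // The map lookup stands in for "extract the links on the page".
            for (String link : web.getOrDefault(url, Collections.<String>emptyList())) {
                if (seen.add(link)) {                 // enqueue each URL only once
                    toVisit.add(link);
                }
            }
        }
        return fetched;
    }

    /** Hypothetical three-page site used only for illustration. */
    static Map<String, List<String>> sampleWeb() {
        Map<String, List<String>> web = new HashMap<>();
        web.put("http://example.com/", Arrays.asList("http://example.com/a", "http://example.com/b"));
        web.put("http://example.com/a", Arrays.asList("http://example.com/b", "http://example.com/"));
        web.put("http://example.com/b", Collections.<String>emptyList());
        return web;
    }
}
```

With the sample site, crawl(sampleWeb(), Arrays.asList("http://example.com/"), 10) fetches the home page, then /a, then /b. Lowering maxPages to 2 stops after /a. The seen set is what keeps the back-link from /a to the home page from being queued again.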
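2. What are the common crawling strategies?
Page-fetching strategies fall into three groups: depth-first, breadth-first, and best-first. Depth-first often leaves the crawler trapped in deep link chains, so breadth-first and best-first are the common choices today.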
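2.1 Breadth-first
Breadth-first traversal is a traversal strategy for connected graphs. It starts from a vertex V0 and radiates outward, visiting the vertices closest to V0 first, hence the name.
The basic idea: visit V0, then every unvisited neighbor of V0, then the unvisited neighbors of those vertices, and so on until every reachable vertex has been visited.
As shown in the figure below: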
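To make the "radiating outward" idea concrete, here is a small sketch that groups the vertices by their distance from V0. The class name BreadthFirstSketch and the sample graph are assumptions for illustration; they are not taken from the original figure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of breadth-first traversal that reports vertices layer by layer. */
class BreadthFirstSketch {

    /** Returns the vertices grouped by distance from start: layer 0, layer 1, ... */
    static List<List<String>> layers(Map<String, List<String>> graph, String start) {
        List<List<String>> result = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        visited.add(start);

        List<String> current = new ArrayList<>();
        current.add(start);
        while (!current.isEmpty()) {
            result.add(current);
            List<String> next = new ArrayList<>();
            for (String vertex : current) {
                for (String neighbor : graph.getOrDefault(vertex, Collections.<String>emptyList())) {
                    if (visited.add(neighbor)) {      // first time this vertex is reached
                        next.add(neighbor);
                    }
                }
            }
            current = next;                           // move one layer further from start
        }
        return result;
    }

    /** Hypothetical graph: V0 -> V1, V2; V1 -> V3; V2 -> V3, V4. */
    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("V0", Arrays.asList("V1", "V2"));
        graph.put("V1", Arrays.asList("V3"));
        graph.put("V2", Arrays.asList("V3", "V4"));
        return graph;
    }
}
```

For the sample graph, layers(sampleGraph(), "V0") yields three layers: [V0], then [V1, V2], then [V3, V4]. V3 is reachable from both V1 and V2 but appears only once, in the layer where it is first reached.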
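2.2 Depth-first
The following uses one directed graph and one undirected graph as examples:
Difference between breadth-first and depth-first:
Breadth-first traversal proceeds layer by layer: it visits every node on one layer before moving down to the next. Depth-first traversal visits every node on one branch before it turns to the nodes on another branch.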
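The difference in visiting order is easy to see by running both traversals over the same tree. This is a sketch under stated assumptions: the class name TraversalOrderSketch and the six-node tree are made up for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Sketch comparing depth-first and breadth-first visiting order. */
class TraversalOrderSketch {

    /** Depth-first: follow one branch to its end before starting the next. */
    static List<String> depthFirst(Map<String, List<String>> tree, String root) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            if (!visited.add(node)) {
                continue;                             // already visited via another path
            }
            order.add(node);
            List<String> children = tree.getOrDefault(node, Collections.<String>emptyList());
            // Push in reverse so that the first child ends up on top of the stack.
            for (int i = children.size() - 1; i >= 0; i--) {
                stack.push(children.get(i));
            }
        }
        return order;
    }

    /** Breadth-first: finish one layer before moving to the next. */
    static List<String> breadthFirst(Map<String, List<String>> tree, String root) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Queue<String> queue = new ArrayDeque<>();
        queue.add(root);
        visited.add(root);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            order.add(node);
            for (String child : tree.getOrDefault(node, Collections.<String>emptyList())) {
                if (visited.add(child)) {
                    queue.add(child);
                }
            }
        }
        return order;
    }

    /** Hypothetical tree: A -> B, C; B -> D, E; C -> F. */
    static Map<String, List<String>> sampleTree() {
        Map<String, List<String>> tree = new HashMap<>();
        tree.put("A", Arrays.asList("B", "C"));
        tree.put("B", Arrays.asList("D", "E"));
        tree.put("C", Arrays.asList("F"));
        return tree;
    }
}
```

On the sample tree, depthFirst visits A, B, D, E, C, F (the whole B branch before C), while breadthFirst visits A, B, C, D, E, F (layer by layer). The only structural difference between the two methods is the container: a stack for depth-first, a queue for breadth-first.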
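2.3 Best-first search
A best-first strategy uses a page-analysis algorithm to predict how similar each candidate URL is to the target pages, or how relevant it is to the topic, and fetches the one or few URLs with the best score. It visits only the pages that the algorithm predicts to be "useful". This strategy suits crawling deep-web (hidden web) data, where only content that meets the requirements is wanted.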
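A best-first frontier can be sketched with a priority queue ordered by a relevance score. Everything specific here is an assumption for illustration: the class name BestFirstSketch, the page names, and the scores, which are supplied as a ready-made map. A real crawler would compute them with a page-analysis algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

/** Sketch of best-first crawling with hypothetical, precomputed relevance scores. */
class BestFirstSketch {

    /**
     * Always fetches the queued URL with the highest score next and ignores
     * links whose score is below minScore. Returns URLs in fetch order.
     */
    static List<String> crawl(Map<String, List<String>> web, Map<String, Double> score,
                              String seed, double minScore, int maxPages) {
        // Highest score first; URLs without a score count as 0.0.
        PriorityQueue<String> frontier = new PriorityQueue<>(
                (a, b) -> Double.compare(score.getOrDefault(b, 0.0), score.getOrDefault(a, 0.0)));
        Set<String> seen = new HashSet<>();
        List<String> fetched = new ArrayList<>();

        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && fetched.size() < maxPages) {
            String url = frontier.poll();
            fetched.add(url);
            for (String link : web.getOrDefault(url, Collections.<String>emptyList())) {
                // Only links predicted to be "useful" enter the frontier.
                if (score.getOrDefault(link, 0.0) >= minScore && seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return fetched;
    }

    /** Hypothetical link structure: S -> A, B, C; B -> D; C -> E. */
    static Map<String, List<String>> sampleWeb() {
        Map<String, List<String>> web = new HashMap<>();
        web.put("S", Arrays.asList("A", "B", "C"));
        web.put("B", Arrays.asList("D"));
        web.put("C", Arrays.asList("E"));
        return web;
    }

    /** Hypothetical relevance scores in the range 0..1. */
    static Map<String, Double> sampleScores() {
        Map<String, Double> score = new HashMap<>();
        score.put("S", 1.0);
        score.put("A", 0.2);
        score.put("B", 0.9);
        score.put("C", 0.5);
        score.put("D", 0.7);
        score.put("E", 0.8);
        return score;
    }
}
```

With a threshold of 0.4, crawl(sampleWeb(), sampleScores(), "S", 0.4, 10) fetches S, B, D, C, E. Page A is never fetched because its score is below the threshold, and D is fetched before C because it scores higher even though it lies one level deeper.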
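3. Structure of the example crawler
The example in this article fetches news pages. On news sites, the important and recent items are usually placed on the home page, and the importance of a page generally drops with each level it sits deeper in the site. A breadth-first algorithm is therefore the better fit. The figure below shows the structure of the pages this article will crawl: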
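III. Breadth-first crawler example
1. Requirement: fetch Fudan University news (only 100 pages)
Only 100 pages are fetched, and every URL must start with http://news.fudan.edu.cn.
2. Implementation
Add the external JARs with Maven: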
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.3.4</version>
</dependency>
<dependency>
    <groupId>org.htmlparser</groupId>
    <artifactId>htmlparser</artifactId>
    <version>2.1</version>
</dependency>
Program entry point:
package com.amos.crawl;

import java.util.Set;

/**
 * Created by amosli on 14-7-10.
 */
public class MyCrawler {

    /**
     * Initializes the URL queue with the seed URLs.
     *
     * @param seeds the seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (String seed : seeds) {
            LinkQueue.addUnvisitedUrl(seed);
        }
    }

    public void crawling(String[] seeds) {
        // Filter that accepts only links starting with http://news.fudan.edu.cn
        LinkFilter filter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                return url.startsWith("http://news.fudan.edu.cn");
            }
        };

        // Initialize the URL queue
        initCrawlerWithSeeds(seeds);

        int count = 0;
        // Loop while links remain to be fetched and fewer than 100 pages have been visited
        while (!LinkQueue.isUnvisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() < 100) {
            System.out.println("count:" + (++count));

            // Take the URL at the head of the queue
            String visitURL = (String) LinkQueue.unVisitedUrlDeQueue();

            // Download the page
            DownLoadFile downloader = new DownLoadFile();
            downloader.downloadFile(visitURL);

            // Record the URL as visited
            LinkQueue.addVisitedUrl(visitURL);

            // Extract the URLs from the downloaded page
            Set<String> links = HtmlParserTool.extractLinks(visitURL, filter);

            // Enqueue the new, unvisited URLs
            for (String link : links) {
                System.out.println("link:" + link);
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }

    public static void main(String[] args) {
        // Program entry point
        MyCrawler myCrawler = new MyCrawler();
        myCrawler.crawling(new String[]{"http://news.fudan.edu.cn/news/"});
    }
}
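MyCrawler relies on a LinkQueue class that is not shown in this section. The following is only a sketch of what a class with the same method names might look like, so that the loop above is easier to follow. The name LinkQueueSketch and its internals are assumptions, not the author's implementation.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical stand-in for LinkQueue: a FIFO of pending URLs plus a visited set. */
class LinkQueueSketch {

    private static final Set<String> visited = new HashSet<>();       // URLs already fetched
    private static final Queue<String> unvisited = new ArrayDeque<>(); // URLs waiting, in FIFO order

    /** Enqueues a URL unless it is blank, already visited, or already waiting. */
    static void addUnvisitedUrl(String url) {
        if (url == null || url.trim().isEmpty()) {
            return;
        }
        // contains() on the queue is linear; acceptable for a 100-page crawl.
        if (!visited.contains(url) && !unvisited.contains(url)) {
            unvisited.add(url);
        }
    }

    /** Removes and returns the URL at the head of the queue, or null if it is empty. */
    static String unVisitedUrlDeQueue() {
        return unvisited.poll();
    }

    /** Records a URL as visited. */
    static void addVisitedUrl(String url) {
        visited.add(url);
    }

    static boolean isUnvisitedUrlsEmpty() {
        return unvisited.isEmpty();
    }

    static int getVisitedUrlNum() {
        return visited.size();
    }
}
```

The first-in, first-out queue is what makes the crawl breadth-first: links found on the seed page are all fetched before any link found on those pages. The duplicate check in addUnvisitedUrl keeps pages that link to each other from being fetched repeatedly.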
Utility class: Tools.java
package com.amos.tool;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;
import java.net.UnknownHostException;
import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpEntityEnclosingRequest;
import org.apache.http.HttpRequest;
import org.apache.http.client.CookieStore;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ConnectTimeoutException;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLContextBuilder;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.protocol.HttpContext;

/**
 * Created by amosli on 14-6-25.
 */
public class Tools {

    private static final String USER_AGENT =
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36";

    /**
     * Writes the response body to a local file.
     *
     * @param httpEntity the response entity to save
     * @param filename   the target file name inside Configuration.FILEDIR
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {
        // try-with-resources closes both streams even if copying fails
        try (InputStream inputStream = httpEntity.getContent();
             FileOutputStream fileOutputStream = new FileOutputStream(outputFile(filename))) {
            byte[] bytes = new byte[1024];
            int length;
            while ((length = inputStream.read(bytes)) != -1) {
                fileOutputStream.write(bytes, 0, length);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Writes a byte array to a local file.
     *
     * @param bytes    the content to save
     * @param filename the target file name inside Configuration.FILEDIR
     */
    public static void saveToLocalByBytes(byte[] bytes, String filename) {
        try (FileOutputStream fileOutputStream = new FileOutputStream(outputFile(filename))) {
            fileOutputStream.write(bytes);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /** Resolves a file inside Configuration.FILEDIR, creating the directory if needed. */
    private static File outputFile(String filename) {
        File dir = new File(Configuration.FILEDIR);
        if (!dir.isDirectory()) {
            dir.mkdirs();
        }
        return new File(dir, filename);
    }

    /**
     * Prints to standard output.
     *
     * @param string the text to print
     */
    public static void println(String string) {
        System.out.println("string:" + string);
    }

    /**
     * Prints to standard error.
     *
     * @param string the text to print
     */
    public static void printlnerr(String string) {
        System.err.println("string:" + string);
    }

    /**
     * Creates a client that uses SSL and a request retry handler.
     */
    public static CloseableHttpClient createSSLClientDefault() {
        return buildClient(null);
    }

    /**
     * Same as createSSLClientDefault(), but with a cookie store.
     *
     * @param cookieStore the cookie store the client should use
     */
    public static CloseableHttpClient createSSLClientDefaultWithCookie(CookieStore cookieStore) {
        return buildClient(cookieStore);
    }

    /** Shared client setup; cookieStore may be null. */
    private static CloseableHttpClient buildClient(CookieStore cookieStore) {
        try {
            // Trust every certificate. This disables certificate validation.
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                    return true;
                }
            }).build();
            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext);

            // Request timeout settings: connection-request timeout and connect timeout
            // are both 20 seconds; circular redirects are not allowed.
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectionRequestTimeout(20000)
                    .setConnectTimeout(20000)
                    .setCircularRedirectsAllowed(false)
                    .build();

            HttpClientBuilder builder = HttpClients.custom()
                    .setSSLSocketFactory(sslsf)
                    .setUserAgent(USER_AGENT)
                    .setMaxConnPerRoute(25)
                    .setMaxConnTotal(256)
                    .setRetryHandler(createRetryHandler())
                    .setRedirectStrategy(new SelfRedirectStrategy())
                    .setDefaultRequestConfig(requestConfig);
            if (cookieStore != null) {
                builder.setDefaultCookieStore(cookieStore);
            }
            return builder.build();
        } catch (KeyManagementException | NoSuchAlgorithmException | KeyStoreException e) {
            e.printStackTrace();
        }
        // Fall back to a default client if the SSL setup fails
        return HttpClients.createDefault();
    }

    /** Retry handler: a failed request is attempted at most 5 times in total. */
    private static HttpRequestRetryHandler createRetryHandler() {
        return new HttpRequestRetryHandler() {
            @Override
            public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
                if (executionCount >= 5) {
                    // Do not retry once the maximum number of attempts is reached
                    return false;
                }
                if (exception instanceof InterruptedIOException) {
                    // Timeout
                    return false;
                }
                if (exception instanceof UnknownHostException) {
                    // Unknown host
                    return false;
                }
                if (exception instanceof ConnectTimeoutException) {
                    // Connect timeout
                    return false;
                }
                if (exception instanceof SSLException) {
                    // SSL handshake exception
                    return false;
                }
                HttpClientContext clientContext = HttpClientContext.adapt(context);
                HttpRequest request = clientContext.getRequest();
                // Requests without a body are treated as idempotent and may be retried
                return !(request instanceof HttpEntityEnclosingRequest);
            }
        };
    }
}