A Node.js-Based Crawler Tool: Node Crawler

Node Crawler aims to be the best crawler tool for Node.js, although it is currently no longer maintained.

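Let's scrape the article information from the tech section of the Guanghe Xinzhi (光合新知) blog.
Visit http://dev.guanghe.tv/category/tech/, right-click, and view the page source; the article information and related markup are visible there.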

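Because each article is an <li> element, we extract the publication time, link, and title of every article from all the <li> elements in the page source.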

Crawler code:

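As a rough illustration of the approach described above (one <li> per article, from which the time, link, and title are read), here is a minimal sketch rather than the original listing. Everything in it is an assumption: the sample markup, the `date` class, and the names `extractArticles` and `crawl` are hypothetical. Regular expressions stand in for Cheerio selectors only so that the extraction step runs without installing anything, and the `crawl` function assumes the crawler 1.x API (`callback(error, res, done)` and `c.queue`); other versions of the package may differ.

```javascript
// Hypothetical markup: the real page structure is not reproduced in this post.
const BASE_URL = 'http://dev.guanghe.tv/category/tech/';
const SAMPLE_HTML = `
<ul>
  <li><span class="date">2015-12-20</span> <a href="/2015/12/example-post.html">Example post</a></li>
  <li><span class="date">2015-12-18</span> <a href="/2015/12/another-post.html">Another <em>post</em></a></li>
  <li>Not an article</li>
</ul>`;

const LI_RE = /<li\b[^>]*>([\s\S]*?)<\/li>/gi;            // every <li>...</li> block
const LINK_RE = /<a\b[^>]*\bhref="([^"]*)"[^>]*>([\s\S]*?)<\/a>/i; // first link inside it
const DATE_RE = /\b(\d{4}-\d{2}-\d{2})\b/;                // assumed YYYY-MM-DD date format

// Return { time, url, title } for every <li> that contains a link.
function extractArticles(html, baseUrl) {
  const articles = [];
  for (const match of html.matchAll(LI_RE)) {
    const inner = match[1];
    const link = LINK_RE.exec(inner);
    if (!link) continue; // skip <li> items that are not articles
    const date = DATE_RE.exec(inner);
    articles.push({
      time: date ? date[1] : null,
      url: new URL(link[1], baseUrl).href,              // resolve relative links
      title: link[2].replace(/<[^>]+>/g, '').trim(),    // drop nested tags
    });
  }
  return articles;
}

// Not called here: needs network access and the crawler package.
// Assumes the crawler 1.x API (callback(error, res, done), c.queue).
function crawl(pageUrl) {
  const Crawler = require('crawler'); // loaded lazily so the demo below runs without it
  const c = new Crawler({
    maxConnections: 10,
    callback(error, res, done) {
      if (error) {
        console.error(error);
      } else {
        for (const a of extractArticles(String(res.body), pageUrl)) {
          console.log(a.time, a.url, a.title);
        }
      }
      done(); // tell the crawler this page is finished
    },
  });
  c.queue(pageUrl);
}

// Offline demo on the sample markup.
for (const a of extractArticles(SAMPLE_HTML, BASE_URL)) {
  console.log(a.time, a.url, a.title);
}
```

To try it against the live page, one would call `crawl(BASE_URL)` after installing the crawler package. In practice the Cheerio instance that the package exposes is likely a more robust way to select the <li> elements than regular expressions.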

Run npm install to install the crawler module, then run the program with node app.js.
You will get output like the following (only part of it is shown):

2015
//dev.guanghe.tv/category/tech//2015/12/Getting-Started-With-React-And-JSX.html
JSX入门指导

2015
//dev.guanghe.tv/category/tech//2015/12/ReactJS-For-Stupid-People.html
懒人教程

2015
//dev.guanghe.tv/category/tech//2015/12/iOSCustomProblem.html
iOS开发常见问题

2015
//dev.guanghe.tv/category/tech//2015/12/iOSXcodeDebug.html
Debug技巧