Memory leaks while web scraping: answers

【Question title】: Memory leaks while web scraping
【Posted】: 2017-08-09 16:33:42
【Question description】:

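I am trying to index all the movies, series, and so on from this website: www.newpct1.com. For each media item I want to save its title, torrent file URL and file size. To do this I am using NodeJS with the modules Cheerio (to extract HTML content with a jQuery-like syntax) and request (to make the requests). The code is as follows: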

const cheerio = require('cheerio');
const request = require('request');


console.log('\"Site\",\"Title\",\"Size\",\"URL\"');
const baseURL = 'http://newpct1.com/';
const sites = ['documentales/pg/', 'peliculas/pg/', 'series/pg/', 'varios/pg/'];
for (let i = 0; i < sites.length; i++) {
  let site = sites[i].split('/')[0];
  for (let j = 1; true; j++) { // Infinite loop
    let siteURL = baseURL + sites[i] + j;
    // getMediaURLs
    // -------------------------------------------------------------------------
    request(siteURL, (err, resp, body) => {
      if (!err) {
        let $ = cheerio.load(body);
        let lis = $('li', 'ul.pelilist');
        // If exists media
        if (lis.length) {
          $('a', lis).each((k, elem) => {
            let mediaURL = $(elem).attr('href');
            // getMediaAttrs
            //------------------------------------------------------------------
            request(mediaURL, (err, resp, body) => {
              if (!err) {
                let $ = cheerio.load(body);
                let title = $('strong', 'h1').text();
                let size = $('.imp').eq(1).text().split(':')[1];
                let torrent = $('a.btn-torrent').attr('href');
                console.log('\"%s\",\"%s\",\"%s\",\"%s\"', site, title, size,
                  torrent);
              }
            });
            //------------------------------------------------------------------
          });
        }
      }
    });
    // -------------------------------------------------------------------------
  }
}

The problem with this code is that it never finishes executing and throws this error (memory leak):

<--- Last few GCs --->

   22242 ms: Mark-sweep 1372.4 (1439.0) -> 1370.7 (1439.0) MB, 1088.7 / 0.0 ms [allocation failure] [GC in old space requested].
   23345 ms: Mark-sweep 1370.7 (1439.0) -> 1370.7 (1439.0) MB, 1103.0 / 0.0 ms [allocation failure] [GC in old space requested].
   24447 ms: Mark-sweep 1370.7 (1439.0) -> 1370.6 (1418.0) MB, 1102.1 / 0.0 ms [last resort gc].
   25527 ms: Mark-sweep 1370.6 (1418.0) -> 1370.6 (1418.0) MB, 1079.5 / 0.0 ms [last resort gc].


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x272c0e23fa99 <JS Object>
    1: httpify [/home/marco/node_modules/caseless/index.js:~50] [pc=0x3f51b4a2c2c5] (this=0x1e65c39fbdb9 <JS Function module.exports (SharedFunctionInfo 0x1e65c39fb581)>,resp=0x2906174cf6a9 <a Request with map 0x2efe262dbef9>,headers=0x11e0242443f1 <an Object with map 0x2efe26206829>)
    2: init [/home/marco/node_modules/request/request.js:~144] [pc=0x3f51b4a3ee1d] (this=0x2906174cf6a9 <a Requ...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [node]
 2: 0x10d3f9c [node]
 3: v8::Utils::ReportApiFailure(char const*, char const*) [node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [node]
 5: v8::internal::Handle<v8::internal::JSFunction> v8::internal::Factory::New<v8::internal::JSFunction>(v8::internal::Handle<v8::internal::Map>, v8::internal::AllocationSpace) [node]
 6: v8::internal::Factory::NewFunction(v8::internal::Handle<v8::internal::Map>, v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Handle<v8::internal::Context>, v8::internal::PretenureFlag) [node]
 7: v8::internal::Factory::NewFunctionFromSharedFunctionInfo(v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Handle<v8::internal::Context>, v8::internal::PretenureFlag) [node]
 8: v8::internal::Runtime_NewClosure_Tenured(int, v8::internal::Object**, v8::internal::Isolate*) [node]
 9: 0x3f51b47060c7

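I tried running it on a machine with more RAM (16 GB), but it threw the same error.

I also took a heap snapshot, but I cannot see where the problem is. The screenshot is here: https://drive.google.com/open?id=0B5Ysugq64wdLSHdHVHctUXZaNGM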

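【Question comments】:

I believe you are making an infinite number of requests there, and requests take up space.
Looks like it. If I break out of the two loops after the first iteration, it works fine. I don't know how to limit these requests, or how to run some of them, wait until they finish, and then continue with the rest.
There is no break in the code, so "until there are no more pages" never happens.
Infinite loops will, by design, cause memory leaks if you don't end them properly. That is where you should start investigating.
@k0pernikus Their content is free to access (according to the usage policy and legal notice). I will also control the number of requests per minute.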
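
A small sketch of what the comments above describe, assuming nothing beyond Node's built-in setImmediate: a synchronous loop only queues asynchronous work, and none of the queued callbacks can run until the loop returns. The function name queueWithoutYielding and the bound n are made up for illustration. The question's loop has no bound, so its queue of pending requests keeps growing until the heap is exhausted.

```javascript
// Bounded stand-in for the question's `for (let j = 1; true; j++)` loop.
// setImmediate plays the role of request(): it only schedules work for later.
function queueWithoutYielding(n) {
  let queued = 0;
  let completed = 0;

  for (let j = 1; j <= n; j++) {
    queued++;
    setImmediate(() => { completed++; });
  }

  // Still inside the same synchronous run: no callback has had a chance to fire.
  return { queued: queued, completedDuringLoop: completed };
}

console.log(queueWithoutYielding(1000)); // { queued: 1000, completedDuringLoop: 0 }
```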

Tags: javascript node.js request cheerio

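【Solution 1】:

You can try starting node with the --expose-gc flag and forcing a GC by calling $ = null; global.gc(); before/after the console.log call. Then test that variant.

If the problem persists, we can try changing the algorithm and optimizing memory usage.

Very useful references: https://github.com/cheeriojs/cheerio/issues/830 https://github.com/cheeriojs/cheerio/issues/263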
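
To make the suggestion above concrete, here is a minimal sketch, assuming the script is started as `node --expose-gc scrape.js` (scrape.js is a placeholder file name). global.gc only exists when that flag is passed, so the helper checks for it first. forceGcIfAvailable and heapUsedMB are illustrative names, not part of any library.

```javascript
// Run with: node --expose-gc scrape.js   (scrape.js is a placeholder name)

// Triggers a garbage collection if node was started with --expose-gc.
// Returns true when a collection was requested, false otherwise.
function forceGcIfAvailable() {
  if (typeof global.gc === 'function') {
    global.gc();
    return true;
  }
  return false;
}

// Current V8 heap usage in megabytes, for before/after comparisons.
function heapUsedMB() {
  return process.memoryUsage().heapUsed / (1024 * 1024);
}

const before = heapUsedMB();
const ran = forceGcIfAvailable();
const after = heapUsedMB();
console.log('gc ran: %s, heap before: %s MB, after: %s MB',
  ran, before.toFixed(1), after.toFixed(1));
```

In the question's code, the call would go inside the inner request callback, after setting $ = null. As the discussion below reports, this slows the growth but does not remove its cause.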

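【Discussion】:

The error still occurs, but memory seems to grow more slowly. I am going to read the references.
@MarcoCanora OK, tell us what you find and we will try to solve the problem. One approach is to avoid nested requests and run them separately: in a first phase, extract the media URLs from the site and free the memory, then process the media URLs.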
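
One possible shape for that second phase, as a hedged sketch rather than the commenter's actual code: once the media URLs sit in a plain array, they can be processed with a fixed number of requests in flight. runLimited is a hypothetical helper written for this example, and the worker here is a fake; in the scraper, the worker would be the function that calls request(mediaURL, ...) and parses the body.

```javascript
// Runs worker(item, cb) for every item, with at most `limit` calls in flight.
// Calls done(results) once, with results in the same order as items.
function runLimited(items, limit, worker, done) {
  const results = new Array(items.length);
  let next = 0;      // index of the next item to start
  let active = 0;    // workers currently in flight
  let finished = 0;  // workers that have called back

  if (items.length === 0) return done(results);

  function launch() {
    while (active < limit && next < items.length) {
      const i = next++;
      active++;
      worker(items[i], (err, value) => {
        active--;
        finished++;
        results[i] = err ? null : value; // failed items are recorded as null
        if (finished === items.length) done(results);
        else launch(); // a slot is free: start the next item, if any
      });
    }
  }

  launch();
}

// Usage with a fake worker and placeholder URLs.
const mediaURLs = ['url-1', 'url-2', 'url-3'];
runLimited(mediaURLs, 2,
  (url, cb) => setImmediate(() => cb(null, url.toUpperCase())),
  (results) => console.log(results));
```

Because only `limit` request objects exist at any moment, memory use stays bounded no matter how long the URL list is.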

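【Solution 2】:

A general idea of how to get rid of the infinite loop: start one request per site, and whenever a request completes, request the next page of that site.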

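for (let i = 0; i < sites.length; i++) {
  let site = sites[i].split('/')[0];
  let siteURL = baseURL + sites[i];
  // Start at page 1, matching the page numbering of the loop in the question
  scrapSite(siteURL, 1);
}

// Requests page `idx` of a site and, only after it has completed, the next one
function scrapSite(siteURL, idx) {
    request(siteURL + idx, (err, resp, body) => {
        if (!err) {
            ...
            scrapMedia();

            // pageExists: whether this page still contained media entries
            if (pageExists) {
                scrapSite(siteURL, idx + 1);
            }
        }
    });
}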
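
The outline above leaves the parsing step and pageExists open. A minimal runnable sketch of the same recursion, under stated assumptions: fetchPage(url, cb) is a hypothetical stand-in for the request + cheerio step and calls cb(err, mediaURLs), and an empty list is taken to mean the page does not exist. The names scrapeSite, onPage and onDone, and the fake page data, are made up for this example.

```javascript
// Fetches page `idx`, hands its media URLs to onPage, then requests the
// next page. Stops on an error or on the first page without media entries.
function scrapeSite(fetchPage, siteURL, idx, onPage, onDone) {
  fetchPage(siteURL + idx, (err, mediaURLs) => {
    if (err || mediaURLs.length === 0) {
      return onDone(idx - 1); // number of the last page that had content
    }
    onPage(mediaURLs);
    // Only one request per site is in flight at any time.
    scrapeSite(fetchPage, siteURL, idx + 1, onPage, onDone);
  });
}

// Fake site with three pages of results (assumption: no network involved).
const fakePages = {
  'peliculas/pg/1': ['a', 'b'],
  'peliculas/pg/2': ['c'],
  'peliculas/pg/3': ['d']
};
const fakeFetch = (url, cb) => setImmediate(() => cb(null, fakePages[url] || []));

scrapeSite(fakeFetch, 'peliculas/pg/', 1,
  (urls) => console.log('page with', urls.length, 'items'),
  (lastPage) => console.log('done, last page:', lastPage));
```

Since the recursive call happens inside an asynchronous callback, each step starts on a fresh stack, so the recursion depth does not grow with the number of pages.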

【Discussion】:

Thanks (the only thing leaking was my thinking, haha). BTW, an elegant solution (recursion).