Process out of memory when using Node and jsdom to spider a site (answers)

【Question title】: Process out of memory when using Node and jsdom to spider site
【Posted】: 2017-02-14 12:15:11
【Question】:

I'm trying to extract a string from a bunch of HTML pages whose URLs are stored in an array. I have the following code:

const jsdom = require('jsdom')
desc('Import pages');
task('handleSpots', [], function (params) {

  allSpots.forEach(function(spotUrl){
    handleSpot(spotUrl)
  })
});

function handleSpot (href) {
  jsdom.env(
    href,
    ["http://code.jquery.com/jquery.js"],
    function (err, window) {
      if (err) {
        console.log(host+href+" "+err)
        return
      }
      const data = {url: host+href}
      data['name'] = window.$("h1.wanna-item-title-title a").text()
      console.log(data['name'])
      window.close()
    }
  );
}

There are about 600 URLs in the allSpots array. When I run this, I get a bunch of errors like:

/the_hook/index.html Error: read ECONNRESET

This happens for a number of the URLs, some names are printed, and eventually I get this error:

<--- Last few GCs --->

80660 ms: Scavenge 1355.3 (1460.0) -> 1355.3 (1460.0) MB, 2.3 / 0 ms (+ 1.4 ms in 1 steps since last GC) [allocation failure] [incremental marking delaying mark-sweep].
82149 ms: Mark-sweep 1355.3 (1460.0) -> 1354.8 (1460.0) MB, 1488.7 / 0 ms (+ 2.8 ms in 2 steps since start of marking, biggest step 1.4 ms) [last resort gc].
83657 ms: Mark-sweep 1354.8 (1460.0) -> 1354.6 (1460.0) MB, 1508.2 / 0 ms [last resort gc].


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x38f1b4237339 <JS Object>
    1: create [native v8natives.js:~755] [pc=0x22e6902f1923] (this=0x38f1b4236b61 <JS Function Object (SharedFunctionInfo 0x38f1b4236ad1)>,an=0x1590d58f6941 <an Object with map 0x1b19e3c1e251>,aD=0x38f1b4204131 <undefined>)
    2: arguments adaptor frame: 1->2
    3: createImpl [/Users/craig/Programming/node_wannasurf_importer/node_modules/jsdom/lib/jsdom/living/generated/Text.js:~90] [pc=0x22e...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Abort trap: 6

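This only happens when there are more than 125 items in the allSpots array. With fewer than that, everything works fine.

I'm new to Node, but I assume JavaScript is trying to fetch too many of these pages at once and eventually runs out of memory. Ideally I could write something that processes 100, waits until they finish, and then moves on to the next 100.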

I tried this: async.eachLimit(allSpots, 100, handleSpot), but it only processes the first 100 and then stops.

I also tried: async.eachSeries(allSpots, handleSpot), but that only processes the first URL and then stops.
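
A likely explanation for both stalls, stated as an assumption because the post does not show how completion was signalled: async's eachLimit and eachSeries pass each iteratee a completion callback as its second argument and only start another item once that callback has been called. handleSpot(href) as written never calls it, so the first batch (100 items, or 1 for eachSeries) starts and nothing further is scheduled. The sketch below uses a small stand-in named eachLimitSketch (hypothetical, written only to mirror that callback contract, not the library's own code) so that it runs without dependencies.

```javascript
// Minimal stand-in for the callback contract of async.eachLimit.
// Hypothetical helper for illustration; not the async library's implementation.
function eachLimitSketch(items, limit, iteratee, done) {
  let next = 0;         // index of the next item to start
  let running = 0;      // iteratees currently in flight
  let finished = false; // set once done() has been called

  function launch() {
    while (!finished && running < limit && next < items.length) {
      const item = items[next++];
      running++;
      iteratee(item, function (err) {
        running--;
        if (finished) return;
        if (err) { finished = true; return done(err); }
        if (next >= items.length && running === 0) {
          finished = true;
          return done(null);
        }
        launch(); // a slot is free again, so start the next item
      });
    }
  }

  if (items.length === 0) return done(null);
  launch();
}

// Iteratee shaped like handleSpot in the question: it ignores the callback.
function withoutCallback(started) {
  return function (item) { started.push(item); };
}

// Same work, but signalling completion so the next item can start.
function withCallback(started) {
  return function (item, callback) {
    started.push(item);
    setImmediate(callback); // stands in for "request finished"
  };
}

const urls = Array.from({ length: 10 }, (_, i) => '/spot-' + i);

const stalled = [];
eachLimitSketch(urls, 3, withoutCallback(stalled), function () {});
console.log(stalled.length); // only the first batch of 3 was started

const completed = [];
const allDone = new Promise(function (resolve) {
  eachLimitSketch(urls, 3, withCallback(completed), resolve);
});
allDone.then(function () { console.log(completed.length); });
```

Under that assumption, the fix on the question's side would be for handleSpot to accept a second callback parameter and call it in both the error branch and after window.close().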

I'm at my wits' end, so I'd really appreciate any advice anyone can give me. Thanks,

Craig

【Question comments】:

Tags: javascript node.js async.js jsdom

【Solution 1】:

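I decided to drop jsdom and replace it with cheerio and https, which gives me more control over the request process. I then worked out how to request the URLs one after another (by starting the next request from the response's on('end') handler) and kicked off several of those chains from a loop, so the number of loop iterations is the number of concurrent requests.

The code is as follows: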

const https = require('https');
const cheerio = require('cheerio')

desc('Import pages');
task('handleSpots', [], function (params) {
  var totalLoop = 10;
  for( var i = 0; i < totalLoop; i++ ) {
    handleSpotAndNext()
  }
});

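function handleSpotAndNext() {
  // Stop this worker once the queue is empty (also covers fewer URLs than workers).
  if (!allSpots.length) {
    return;
  }
  // Keep the URL local so concurrent workers do not overwrite each other's value.
  const spot = allSpots.pop();
  https.get(spot, function (res) {
    let chunks = '';
    res.on('data', function (d) {
      chunks += d;
    });
    res.on('end', function () {
      console.log(spotData(chunks, spot));
      handleSpotAndNext();
    });
  }).on('error', function (err) {
    // Log the failure (e.g. ECONNRESET) and let this worker move on.
    console.log(spot + ' ' + err);
    handleSpotAndNext();
  });
}

function spotData(spotHtml, url) {
  const $ = cheerio.load(spotHtml);
  const data = {url: url};
  data['name'] = $("h1.wanna-item-title-title a").text();
  return data;
}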

This is what I came up with, but if you see this and can think of a more elegant solution, I'd be glad to hear from you.
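
One possible tidier variant, sketched under assumptions rather than offered as the definitive approach: a fixed number of lanes pull URLs from a shared cursor until the list is exhausted, and a failed request is recorded instead of ending its lane. runWithLimit and fetchPage are hypothetical names, and the demo uses a timer-based stand-in rather than a real HTTP request; in practice the worker would wrap https.get and the cheerio parsing in a Promise.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight.
// Hypothetical helper; resolves with one result per item, in input order.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // shared cursor; safe because JS runs one callback at a time

  async function lane() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index]) };
      } catch (err) {
        // Record the failure (e.g. ECONNRESET) and keep this lane alive.
        results[index] = { ok: false, error: err.message };
      }
    }
  }

  const lanes = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) lanes.push(lane());
  await Promise.all(lanes);
  return results;
}

// Stand-in for a real page fetch (assumption: replace with https.get + cheerio).
let inFlight = 0;
let peak = 0;
function fetchPage(url) {
  inFlight++;
  peak = Math.max(peak, inFlight);
  return new Promise(function (resolve, reject) {
    setTimeout(function () {
      inFlight--;
      if (url.endsWith('-7')) reject(new Error('read ECONNRESET'));
      else resolve({ url: url, name: 'Spot ' + url });
    }, 5);
  });
}

const spotUrls = Array.from({ length: 25 }, (_, i) => '/spot-' + i);
const finished = runWithLimit(spotUrls, 10, fetchPage).then(function (results) {
  const failures = results.filter(function (r) { return !r.ok; });
  console.log('done:', results.length, 'failed:', failures.length, 'peak:', peak);
  return results;
});
```

Compared with popping from a shared array inside nested callbacks, this keeps the concurrency limit in one place and returns all results together, so failed URLs can be retried afterwards.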

【Comments】: