nodeJS - huge string files malfunction

【Question title】: nodeJS - huge string files malfunction
【Posted】: 2016-02-14 04:01:25
【Question】:

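I've run into a very strange problem with my nodeJS code. The code basically serializes a JSON object to a file that is relatively large but not huge, about 150 MB. The problem is that when I try to load this file, something truly nondeterministic happens: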

lapsio@linux-qzuq /d/g/GreenStorage> node
> k1=fs.readFileSync('../etc/md5index/green-Documents.extindex');0
0
> k1.length
157839101
> k2=fs.readFileSync('../etc/md5index/green-Documents.extindex');0
0
> k2.lengFATAL ERROR: invalid array length Allocation failed - process out of memory
fish: “node” terminated by signal SIGABRT (Abort)

Second attempt:

> k1=fs.readFileSync('../etc/md5index/green-Documents.extindex');0
0
> k2=fs.readFileSync('../etc/md5index/green-Documents.extindex');0
0
> k1.length
157839101
> k2.length
157839101
> k1==k2
false
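
A side note on the `k1==k2` check above: `==` on two Buffer objects compares object identity, so it yields `false` even when both hold exactly the same bytes, and by itself it does not show that the two reads differ. A minimal sketch of a content comparison follows; the short literal buffers are stand-ins for the two file reads, and `Buffer.from` assumes a newer Node release than the v0.12.7 used in this question.

```javascript
'use strict';

// Two separate Buffers with identical bytes (stand-ins for k1 and k2).
const k1 = Buffer.from('{"a":1}', 'utf8');
const k2 = Buffer.from('{"a":1}', 'utf8');

// Identity comparison: false, because these are two distinct objects.
const sameObject = (k1 == k2);

// Content comparison: true when every byte matches.
const sameBytes = k1.equals(k2);

// Buffer.compare returns 0 for equal contents, -1 or 1 otherwise.
const order = Buffer.compare(k1, k2);

console.log(sameObject, sameBytes, order); // false true 0
```

If `equals` reports `false` for two reads of the same unchanged file, that would be real evidence of corruption; the `==` result is not.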

Judging by the response time, at this step the file is of course already cached in RAM, so it's not a storage issue. My actual application:

try{
  var ind = JSON.parse(args.legacyconvert?bfile:content),
      ostr = String(args.legacyconvert?bfile:content),
      str = JSON.stringify(ind,null,2);

  for (var i = 0, l = str.length ; i < l ; i++)
    if (str[i]!=ostr[i]){
      console.error('Soft bug occured - it\'s serious bug and probably classifies as node bug or linux memcache bug. Should be reported');
      throw ('Original string and reparsed don\'t match at '+i+' byte - system string conversion malfunction - abtorting')
    }

  return ind;
} catch (e) {
  console.error('Could not read index - aborting',p,e);
  process.exit(11);
}

Result:

lapsio@linux-qzuq /d/g/G/D/c/fsmonitor> sudo ./reload.js -e ../../../../etc/md5index/*.extindex
Reading index... ( ../../../../etc/md5index/green-Documents.extindex )
Soft bug occured - it's serious bug and probably classifies as node bug or linux memcache bug. Should be reported
Could not read index - aborting ../../../../etc/md5index/green-Documents.extindex Original string and reparsed don't match at 116655242 byte - system string conversion malfunction - abtorting
lapsio@linux-qzuq /d/g/G/D/c/fsmonitor> sudo ./reload.js -e ../../../../etc/md5index/*.extindex
Reading index... ( ../../../../etc/md5index/green-Documents.extindex )
Soft bug occured - it's serious bug and probably classifies as node bug or linux memcache bug. Should be reported
Could not read index - aborting ../../../../etc/md5index/green-Documents.extindex Original string and reparsed don't match at 39584906 byte - system string conversion malfunction - abtorting

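Each time it reports a mismatch at a different, random byte. There is also about a 50% chance that the file is corrupted after saving. Sometimes it can't even be parsed, because it runs into strange non-ASCII characters, e.g. [SyntaxError: Unexpected token 䀠]. This is node from the OpenSUSE repository. I've tried it on many machines. The bug is relatively hard to reproduce because it occurs very randomly, but once it has appeared for the first time it keeps showing up more or less constantly until reboot.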

lapsio@linux-qzuq /d/g/GreenStorage> node -v
v0.12.7

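The PC has 16 GB of RAM and node doesn't even reach 10% of it, so I'm sure it's not an out-of-memory problem. It also doesn't seem to be filesystem-related, because md5sum and other hash generators always return valid checksums. Only node fails. I don't know what to think of it. Does it really classify as a bug?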

【Comments】:

I can't confirm a direct correlation between the two events, but I think allowing node to use more memory helped with this problem (node --max_old_space_size=4096 ./file.js)
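
To check whether a flag such as `--max_old_space_size` actually changed anything, the heap limit can be printed from inside the process. This is only a sketch: `heapReport` is a made-up helper name, and the built-in `v8` module it relies on is available in newer Node releases, not in the v0.12.7 from the question.

```javascript
'use strict';
const v8 = require('v8');

// Hypothetical helper: summarize the V8 heap limit and current usage in MiB.
function heapReport() {
  const toMiB = (bytes) => Math.round(bytes / (1024 * 1024));
  const usage = process.memoryUsage();
  return {
    heapLimitMiB: toMiB(v8.getHeapStatistics().heap_size_limit),
    heapUsedMiB: toMiB(usage.heapUsed),
    rssMiB: toMiB(usage.rss)
  };
}

const report = heapReport();
console.log(report);
```

Running the script with and without the flag and comparing `heapLimitMiB` shows whether the setting took effect. The V8 heap limit is separate from the machine's total RAM, so a process can run out of heap while the system still has plenty of free memory.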

Tags: node.js

【Solution 1】:

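Your code shows that you are slurping a large JSON file and then parsing it. That means you need room for both the raw file and the resulting parsed object. This may be partly to blame for your unpredictable memory exhaustion problems.

Most people who handle files of the size you mention try to use a streaming, or incremental, parsing method. That way the raw data can flow through your program without all of it having to be present at the same time.

You might want to check out this streaming JSON parser. It may let you get through this chunk of data successfully. https://github.com/dominictarr/JSONStream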

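A second possibility is to (ab-)use the second parameter to JSON.parse(). Called reviver, it is a function that is called for every value in the JSON text, including each object. You could respond to each call to that function by writing the object out to a file (or maybe a dbms) in some way, and then returning an empty result. That way JSON.parse won't need to keep every object it encounters. You'll have to fiddle around with this to get it to work correctly. With this strategy you'll still slurp the large input file, but you'll stream the output.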
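
The reviver idea can be sketched roughly as follows. The record shape (objects with string `path` and `md5` fields) and the names `isRecord` and `parseAndDrain` are assumptions made up for this illustration; the real index format is not shown in the question. Note that the reviver is invoked for every key/value pair, not only for objects, so the sketch has to recognize records by their shape.

```javascript
'use strict';

// Hypothetical record shape: an object with string "path" and "md5" fields.
function isRecord(value) {
  return value !== null && typeof value === 'object' &&
    typeof value.path === 'string' && typeof value.md5 === 'string';
}

// Parse `text`, hand every record to `onRecord`, and drop it from the result.
function parseAndDrain(text, onRecord) {
  return JSON.parse(text, function reviver(key, value) {
    if (isRecord(value)) {
      onRecord(value);   // e.g. append to a file or insert into a database
      return undefined;  // returning undefined removes it from the parsed tree
    }
    return value;        // everything else is kept unchanged
  });
}

// Small stand-in for the large index file.
const text = JSON.stringify([
  { path: 'a.txt', md5: 'd41d8cd98f00b204e9800998ecf8427e' },
  { path: 'b.txt', md5: '0cc175b9c0f1b6a831c399e269772661' }
]);

const seen = [];
const rest = parseAndDrain(text, (rec) => seen.push(rec.path));
console.log(seen); // [ 'a.txt', 'b.txt' ]
```

The complete input string still has to fit in memory, and the engine may build each record before the reviver sees it, so how much memory this actually saves is something to measure rather than assume.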

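Another possibility is to do your best to split your single JSON document into a series of records, that is, smaller documents. (It seems a data set of this size could reasonably be split that way.)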
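
One common way to split a document into records is to write one small JSON document per line and parse the lines as chunks arrive. The sketch below assumes that newline-delimited layout, which is not the format used in the question; `createLineParser` is a made-up helper name. `StringDecoder` is used so that a multi-byte character split across two chunks is not mangled.

```javascript
'use strict';
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: feed it raw chunks, get one parsed record per line.
function createLineParser(onRecord) {
  const decoder = new StringDecoder('utf8');
  let pending = ''; // text received so far that has no newline yet

  function consume(text) {
    pending += text;
    let nl;
    while ((nl = pending.indexOf('\n')) !== -1) {
      const line = pending.slice(0, nl);
      pending = pending.slice(nl + 1);
      if (line.trim() !== '') onRecord(JSON.parse(line));
    }
  }

  return {
    write(chunk) { consume(decoder.write(chunk)); },
    end() {
      consume(decoder.end());
      // A final record without a trailing newline is still parsed.
      if (pending.trim() !== '') onRecord(JSON.parse(pending));
      pending = '';
    }
  };
}

// Simulate a stream by feeding a buffer in 5-byte chunks.
const records = [];
const parser = createLineParser((rec) => records.push(rec));
const data = Buffer.from('{"path":"a.txt"}\n{"path":"é.txt"}\n', 'utf8');
for (let i = 0; i < data.length; i += 5) {
  parser.write(data.subarray(i, i + 5));
}
parser.end();
console.log(records.length); // 2
```

With a real file, the same parser could be driven from `fs.createReadStream(file)` by calling `parser.write(chunk)` on each `'data'` event and `parser.end()` on `'end'`, so only one chunk and one record need to be in memory at a time.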

【Comments】:

It's still strange that it can't simply load the file into a string.

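【Solution 2】:

I highly suspect this is due to the file size. It sounds like a loading problem.

See this post: Max recommended size of external JSON object in JavaScript

I'd recommend looking into SQL rather than JSON; it is much better suited to managing data sets of this size.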

【Comments】: