Neo4j Batch Inserter is very slow and creates huge database files: answers

[Question title]: Neo4j Batch Inserter is very slow and creates huge database files
[Posted]: 2014-08-18 17:00:40
[Question description]:

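I am trying to insert a relatively small graph (2M relationships, a few hundred thousand nodes) from a CSV file into Neo4j 2.0.3. Each line in the file is one relationship. I am using the BatchInserter API.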

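To test my code, I used a subset of the input file. When the subset contains 500 relationships, the insertion runs quickly (a few seconds, including JVM startup). When it contains 1000 relationships, the import takes 20 minutes and the resulting database is 130 GB! Even stranger, the result (in both time and space) is exactly the same with 5000 relationships. 99% of those 20 minutes is spent writing gigabytes to disk.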

I don't understand what is happening here. I have tried configuring the inserter with various settings, following the recommendations from the official documentation.

Files
  .asCharSource(new File("/path/to/input.csv"), Charsets.UTF_8)
  .readLines(new LineProcessor<Void>() {

    BatchInserter inserter = BatchInserters.inserter(
      "/path/to/db", 
      new HashMap<String, String>() {{
        put("dump_configuration","false");
        put("cache_type","none");
        put("use_memory_mapped_buffers","true");
        put("neostore.nodestore.db.mapped_memory","500M");
        put("neostore.relationshipstore.db.mapped_memory","1G");
        put("neostore.propertystore.db.mapped_memory","500M");
        put("neostore.propertystore.db.strings.mapped_memory","500M");
      }}
    );
    RelationshipType relationshipType = 
      DynamicRelationshipType.withName("relationshipType");
    Set<Long> createdNodes = new HashSet<>();

    @Override public boolean processLine(String line) throws IOException {
        String[] components = line.split("\\|");
        long sourceId = Long.parseLong(components[1]);
        long targetId = Long.parseLong(components[3]);

        if (!createdNodes.contains(sourceId)) {
           createdNodes.add(sourceId);
           inserter.createNode(sourceId, new HashMap<>());
        }
        if (!createdNodes.contains(targetId)) {
            createdNodes.add(targetId);
            inserter.createNode(targetId, new HashMap<>());
        }
        inserter.createRelationship(
            sourceId, targetId, relationshipType, new HashMap<>());

        return true;
    }

    @Override public Void getResult() {
        inserter.shutdown();
        return null;
    }

});

[Question comments]:

Tags: neo4j

[Solution 1]:

I stumbled on the solution by accident while messing around with my code.

It turns out that if I call createNode without specifying a node ID, everything works fine.

I was specifying node IDs because the API allows it, and it was convenient to have the node IDs match the IDs in the input file.
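If the IDs from the input file are still needed later, one option is to let the inserter assign its own IDs and keep a lookup table from file ID to assigned ID. The following is only a minimal sketch of that idea. `NodeCreator` and `SequentialNodeCreator` are hypothetical stand-ins so that the example runs without Neo4j on the classpath; in real code the call would presumably be the `createNode` variant that takes no ID and returns the one it assigned. The `externalId` property name is also just an illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExternalIdMapping {

    /** Hypothetical stand-in for the single inserter call used here. */
    interface NodeCreator {
        long createNode(Map<String, Object> properties);
    }

    /** Toy creator that hands out dense, sequential IDs starting at 0. */
    static class SequentialNodeCreator implements NodeCreator {
        final List<Map<String, Object>> nodes = new ArrayList<>();
        private long nextId = 0;

        @Override
        public long createNode(Map<String, Object> properties) {
            nodes.add(properties);
            return nextId++;
        }
    }

    private final NodeCreator creator;
    // Maps the ID found in the input file to the ID assigned by the creator.
    private final Map<Long, Long> externalToInternal = new HashMap<>();

    ExternalIdMapping(NodeCreator creator) {
        this.creator = creator;
    }

    /** Returns the assigned ID for a file ID, creating the node on first sight. */
    long nodeFor(long externalId) {
        Long existing = externalToInternal.get(externalId);
        if (existing != null) {
            return existing;
        }
        Map<String, Object> properties = new HashMap<>();
        properties.put("externalId", externalId); // keep the file ID as a property
        long internalId = creator.createNode(properties);
        externalToInternal.put(externalId, internalId);
        return internalId;
    }

    public static void main(String[] args) {
        ExternalIdMapping mapping = new ExternalIdMapping(new SequentialNodeCreator());
        long small = mapping.nodeFor(1234L);          // 4-digit file ID
        long large = mapping.nodeFor(123456789012L);  // 12-digit file ID
        long again = mapping.nodeFor(1234L);          // already known, no new node
        System.out.println(small + " " + large + " " + again); // prints "0 1 0"
    }
}
```

With this approach the assigned IDs stay dense no matter how large the IDs in the file are, and relationships would be created between `nodeFor(sourceId)` and `nodeFor(targetId)`.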

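My guess at the root cause: nodes are probably stored in a contiguous array indexed by their ID. Most of the IDs in my input file are small (4 digits), but some can be 12 digits long. So when I try to insert one of those, Neo4j writes an array gigabytes long to disk just to place that node at the end. Maybe someone can confirm this. Surprisingly, this behavior does not seem to be mentioned in the Neo4j API documentation for this method.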
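To put rough numbers on this guess: if the node store really is a flat file of fixed-size records indexed by ID, its size depends only on the highest ID used, not on how many nodes exist. That would also explain why 1000 and 5000 relationships produce the same result. The sketch below is back-of-the-envelope arithmetic only. The 14-byte record size is an assumption, and the second ID is a made-up value chosen to land near the 130 GB I observed; neither comes from the documentation.

```java
public class SparseIdStoreSize {

    // Assumption: size of one node record in bytes. Adjust for the actual store format.
    static final int ASSUMED_NODE_RECORD_BYTES = 14;

    /**
     * Estimated file size if records 0..highestNodeId must all exist on disk.
     * Throws ArithmeticException on overflow instead of silently wrapping.
     */
    static long estimatedStoreBytes(long highestNodeId, int recordBytes) {
        long recordCount = Math.addExact(highestNodeId, 1L);
        return Math.multiplyExact(recordCount, (long) recordBytes);
    }

    public static void main(String[] args) {
        // A typical 4-digit ID and a hypothetical 10-digit ID.
        long[] highestIds = {9_999L, 9_300_000_000L};
        for (long id : highestIds) {
            long bytes = estimatedStoreBytes(id, ASSUMED_NODE_RECORD_BYTES);
            System.out.println("highest ID " + id + " -> " + bytes + " bytes");
        }
    }
}
```

Under these assumptions, a highest ID of 9999 needs about 140 KB, while a single ID around 9.3 billion already needs about 130 GB, even if only a handful of nodes are actually created.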

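[Comments]:

Confirmed. Node IDs are indexes into fixed-size records in a file. You should not make them up yourself...
Thanks for the confirmation. The documentation for createNode should reflect this.
That was it. As soon as I removed the ID override, it started working. The README should not show this method.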