Memory issue with graph databases: answers

【Question title】: Memory issue with graph databases
【Posted】: 2013-11-11 22:07:35
【Question】:

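I am trying to benchmark three different graph databases: Titan, OrientDB and Neo4j. I want to measure the execution time of database creation. As a test case I use this dataset: http://snap.stanford.edu/data/web-flickr.html . Although the data is stored on local disk rather than in memory, I noticed that the load consumes a lot of memory, and unfortunately Eclipse crashes after a while. Why is this happening?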

Here are some code snippets. Titan graph creation:

public long createGraphDB(String datasetRoot, TitanGraph titanGraph) {
    long duration;
    long startTime = System.nanoTime();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = titanGraph.addVertex(null);
                srcVertex.setProperty( "nodeId", parts[0] );
                Vertex dstVertex = titanGraph.addVertex(null);
                dstVertex.setProperty( "nodeId", parts[1] );
                Edge edge = titanGraph.addEdge(null, srcVertex, dstVertex, "similar");
                titanGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch(IOException ioe) {
        ioe.printStackTrace();
    }
    catch( Exception e ) {    
        titanGraph.rollback();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}

OrientDB graph creation:

public long createGraphDB(String datasetRoot, OrientGraph orientGraph) {
    long duration;
    long startTime = System.nanoTime();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;    
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = orientGraph.addVertex(null);
                srcVertex.setProperty( "nodeId", parts[0] );
                Vertex dstVertex = orientGraph.addVertex(null);
                dstVertex.setProperty( "nodeId", parts[1] );
                Edge edge = orientGraph.addEdge(null, srcVertex, dstVertex, "similar");
                orientGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch(IOException ioe) {
        ioe.printStackTrace();
    }
    catch( Exception e ) {    
        orientGraph.rollback();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}

Neo4j graph creation:

public long createDB(String datasetRoot, GraphDatabaseService neo4jGraph) {
    long duration;
    long startTime = System.nanoTime(); 
    Transaction tx = neo4jGraph.beginTx();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Node srcNode = neo4jGraph.createNode();
                srcNode.setProperty("nodeId", parts[0]);
                Node dstNode = neo4jGraph.createNode();
                dstNode.setProperty("nodeId", parts[1]);
                Relationship relationship = srcNode.createRelationshipTo(dstNode, RelTypes.SIMILAR);
            }
            lineCounter++;
        }
        tx.success();
        reader.close();
    } 
    catch (IOException e) {
        e.printStackTrace();
    }
    finally {
        tx.finish();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}

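Edit: I tried the BatchGraph solution, and it seems to run forever. It ran all night yesterday and never finished, so I had to stop it. Is there anything wrong with my code?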

TitanGraph graph = TitanFactory.open("data/titan");
BatchGraph<TitanGraph> batchGraph = new BatchGraph<TitanGraph>(graph, VertexIDType.STRING, 1000);
try {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("data/flickrEdges.txt")));
    String line;
    int lineCounter = 1;
    while((line = reader.readLine()) != null) {
        if(lineCounter > 4) {
            String[] parts = line.split(" ");
            Vertex srcVertex = batchGraph.getVertex(parts[0]);
            if(srcVertex == null) {
                srcVertex = batchGraph.addVertex(parts[0]);
            }
            Vertex dstVertex = batchGraph.getVertex(parts[1]);
            if(dstVertex == null) {
                dstVertex = batchGraph.addVertex(parts[1]);
            }
            Edge edge = batchGraph.addEdge(null, srcVertex, dstVertex, "similar");
            batchGraph.commit();
        }
        lineCounter++;
    }
    reader.close();
}
catch(IOException ioe) {
    ioe.printStackTrace();
}

【Comments】:

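Can you share the stack trace? Which JVM memory parameters are you using? To print the implicit settings, use -XX:+PrintFlagsFinal.
I'll repeat @StefanArmbruster's question about your JVM memory parameters. Are you also running all of these loads at once? Note that BatchGraph will want to use memory to cache vertices, so give it as much as you can. You are calling batchGraph.commit() on every line, which is a problem. BatchGraph handles commits for you periodically, based on the batch size you pass to the constructor (currently 1000; you could probably go larger). Make sure to call shutdown() on the graph at the end of the load to clean up the final transaction.
For Titan, you also need to set storage.batch-loading to true. When you do, Titan ignores locks and eliminates some reads. By the way, what you are doing here is not trivial. Loading a very large graph dataset into any graph database requires a strategy specific to that database to get the best load performance. It may also be worth adding some logging to your code to track how the load is progressing.
@StefanArmbruster These are my parameters: -Xms64m -Xmx3200m.
@stephenmallette Nice, this works great. I wonder whether the Titan project has something like OrientGraphNoTx, because that one uses hardly any memory at all.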
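
The logging suggested in the comments above could look roughly like the sketch below. It is not code from any of the commenters: the class name LoadProgress and its tick() method are made up for illustration. It relies only on java.lang.Runtime, and reports the heap currently in use next to the ceiling set by -Xmx, which makes it easier to see whether a load is creeping toward an OutOfMemoryError.

```java
import java.util.Locale;

/** Hypothetical helper: counts loaded lines and reports heap usage at intervals. */
final class LoadProgress {
    private final long every; // report once per this many lines
    private long count = 0;

    LoadProgress(long every) {
        if (every <= 0) {
            throw new IllegalArgumentException("every must be positive");
        }
        this.every = every;
    }

    /** Call once per loaded line. Returns a message on every 'every'-th call, otherwise null. */
    String tick() {
        count++;
        if (count % every != 0) {
            return null;
        }
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024); // the -Xmx ceiling, if one is set
        return String.format(Locale.ROOT,
                "loaded %d lines, heap %d MB used of %d MB max", count, usedMb, maxMb);
    }
}
```

Inside the read loop this would be used as `String msg = progress.tick(); if (msg != null) System.out.println(msg);`, with progress created once before the loop, for example as `new LoadProgress(100000)`.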

Tags: neo4j out-of-memory graph-databases orientdb titan

【Solution 1】:

This answer covers only the Neo4j part.

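You are basically running the complete import in a single transaction. A transaction is built up in memory and only then committed to disk. Depending on the size of the data being imported, this may be the cause of the OOME (OutOfMemoryError). To work around it, I see three options:

1) Use the Neo4j batch inserter. This is a non-transactional way to build a Neo4j datastore. Since the other two snippets above do not use transactions, I guess the batch inserter is the best way to produce comparable results.

2) Adjust your JVM's memory parameters.

3) Split up the transaction size. A typically good choice is to bundle 10k - 100k atomic operations into one transaction.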
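
Option 3 can be sketched as follows. This is only an illustration of the counting logic under stated assumptions: TxGraph is a hypothetical stand-in interface, not a Neo4j type, and ChunkedLoader is an invented name. With the Neo4j API used in the question, the commit() call would presumably correspond to finishing the current transaction (tx.success(); tx.finish();) and opening a new one with beginTx().

```java
/** Hypothetical stand-in for a transactional graph; NOT a real Neo4j, Titan or OrientDB type. */
interface TxGraph {
    void addEdge(String srcId, String dstId); // one unit of work in this sketch
    void commit();                            // flush everything added since the last commit
}

final class ChunkedLoader {
    /**
     * Adds every edge and commits after each full batch, plus once more
     * for a trailing partial batch. Returns the number of commits issued.
     */
    static int load(TxGraph graph, Iterable<String[]> edges, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int pending = 0; // operations added since the last commit
        int commits = 0;
        for (String[] edge : edges) {
            graph.addEdge(edge[0], edge[1]);
            pending++;
            if (pending == batchSize) {
                graph.commit();
                commits++;
                pending = 0;
            }
        }
        if (pending > 0) { // do not lose the final partial batch
            graph.commit();
            commits++;
        }
        return commits;
    }
}
```

With 25 edges and a batch size of 10, this issues three commits: after the 10th edge, after the 20th, and one for the remaining 5. The memory held by an open transaction is then bounded by the batch size rather than by the size of the whole dataset.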

Side note: take a look at https://github.com/jexp/batch-import, which lets you run an import directly from CSV files without any Java coding.

【Comments】:

Good to hear. I have too little experience with the other databases to give qualified advice on them.

【Solution 2】:

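Since you are comparing several databases, I suggest generalizing your code to Blueprints. The Flickr dataset looks like a size well suited to something like the BatchGraph graph wrapper. With BatchGraph you can tune the commit size and focus on the code that manages the load. That way you can have one simple class that loads all the different graphs (and you could easily extend your tests to other Blueprints-enabled graphs).

@Stefan makes a good point about memory... you will likely need to raise the -Xmx setting on your JVM to handle that data. Each graph handles memory differently (even though they all persist to disk), and if you are loading all three at once in the same JVM, I'd bet there is some contention.

If you plan to go bigger than the Flickr dataset you referenced, BatchGraph may not be the right choice. BatchGraph is generally good up to a few hundred million graph elements. Once you start talking about graphs larger than that, you may want to forget some of what I said about trying to stay non-graph-specific. For each graph you want to test, you will likely want to use the best tool for the job. For Neo4j that means Neo4jBatchGraph (at least you are still using Blueprints, if that matters to you); for Titan it means Faunus or a custom-written parallel batch loader; and for OrientDB, OrientBatchGraph.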
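
The "one simple class that loads all the different graphs" idea from the first paragraph of this answer might be structured like the sketch below. It deliberately does not use the Blueprints API: EdgeStore is a hypothetical interface standing in for whatever per-database adapter you write (for instance, one wrapping a BatchGraph), and EdgeListLoader is an invented name. The four header lines to skip follow the question's own code (lineCounter > 4); splitting on any whitespace is an assumption about the file format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical per-database adapter; each graph database would get its own implementation. */
interface EdgeStore {
    void addEdge(String srcId, String dstId);
    void shutdown(); // flush the final transaction and release resources
}

final class EdgeListLoader {
    /**
     * Reads an edge list, skips the first headerLines lines, and hands each
     * "src dst" pair to the store. Returns the number of edges loaded.
     */
    static long load(BufferedReader reader, EdgeStore store, int headerLines) {
        long edges = 0;
        long lineNo = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (lineNo <= headerLines) {
                    continue; // header or comment lines at the top of the file
                }
                String[] parts = line.trim().split("\\s+");
                if (parts.length < 2) {
                    continue; // blank or malformed line
                }
                store.addEdge(parts[0], parts[1]);
                edges++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            store.shutdown(); // always runs, as advised in the question's comments
        }
        return edges;
    }
}
```

Timing each database then reduces to wrapping one call to EdgeListLoader.load(...) in System.nanoTime() measurements, with the same parsing code exercised for every store.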

【Comments】:

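The Flickr dataset is just a test case. I want to run this code with much bigger datasets, and I think BatchGraph is not suitable for that job. What should I use in that case? Is there something like transactions in Titan or OrientDB? Any code example would be very helpful.
Updated my answer to reflect larger graphs. Unfortunately I don't know of any examples, but apart from using BatchGraph, the loaders I've built don't look much different from your code. Obviously, a parallel batch loader for Titan requires you to be able to split the load into several separate processes, so the code there is somewhat different as well. The gpars (gpars.codehaus.org) library might be useful to you.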

【Solution 3】:

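With OrientDB you can optimize this import in two ways:

using the custom extensions, and
avoiding transactions entirely

So open the graph with OrientGraphNoTx instead of OrientGraph, then try this snippet: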

OrientVertex srcVertex = orientGraph.addVertex(null, "nodeId", parts[0] );
OrientVertex dstVertex = orientGraph.addVertex(null, "nodeId", parts[1] );
Edge edge = orientGraph.addEdge(null, srcVertex, dstVertex, "similar");

There is no need to call .commit().

【Comments】:

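Hey luca... is OrientBatchGraph no longer recommended?
I tried this solution. I think it is a bit faster, but the memory problem remains.
OrientBatchGraph can create massive transactions, but it needs more memory. Remember to follow the performance guide: github.com/orientechnologies/orientdb/wiki/…
@Lvca This works great. I have two problems. It seems that OrientVertex srcVertex = orientGraph.getVertex(parts[0]); does not work: it fails to find a vertex that has already been added, so I end up with many duplicates. Also, it seems I cannot retrieve the edges after the graph has been created. Let me know if you need the code I use to create the Orient graph database.
Could you post the code to the official support community group? groups.google.com/group/orient-database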
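
One way to avoid the duplicate vertices mentioned in the comments above, whatever the reason getVertex misses them, is to remember on the client side which nodeId values have already been inserted. The sketch below is an assumption-laden illustration rather than an OrientDB feature: VertexCache is an invented class, and the map it keeps grows with the number of distinct vertices, so it trades heap for correctness.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical client-side cache: creates each vertex once per nodeId and reuses it afterwards. */
final class VertexCache<V> {
    private final Map<String, V> byNodeId = new HashMap<>();
    private final Function<String, V> creator; // how to create a vertex for a new nodeId

    VertexCache(Function<String, V> creator) {
        this.creator = creator;
    }

    /** Returns the vertex already created for nodeId, or creates and remembers a new one. */
    V getOrCreate(String nodeId) {
        return byNodeId.computeIfAbsent(nodeId, creator);
    }

    /** Number of distinct vertices created so far. */
    int size() {
        return byNodeId.size();
    }
}
```

With the snippet from this answer, the cache could be built as `new VertexCache<OrientVertex>(id -> orientGraph.addVertex(null, "nodeId", id))`, and the two addVertex calls in the loop replaced by `cache.getOrCreate(parts[0])` and `cache.getOrCreate(parts[1])`.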