[Posted]: 2013-11-11 22:07:35
[Question]:
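I am trying to benchmark three different graph databases: Titan, OrientDB and Neo4j. I want to measure the execution time of database creation. As a test case I use this dataset: http://snap.stanford.edu/data/web-flickr.html . Although the data is stored locally on disk rather than in memory, I noticed that the load consumes a lot of memory, and unfortunately after a while Eclipse crashes. Why does this happen?
Here are some code snippets. Titan graph creation: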
public long createGraphDB(String datasetRoot, TitanGraph titanGraph) {
    long duration;
    long startTime = System.nanoTime();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = titanGraph.addVertex(null);
                srcVertex.setProperty( "nodeId", parts[0] );
                Vertex dstVertex = titanGraph.addVertex(null);
                dstVertex.setProperty( "nodeId", parts[1] );
                Edge edge = titanGraph.addEdge(null, srcVertex, dstVertex, "similar");
                titanGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch(IOException ioe) {
        ioe.printStackTrace();
    }
    catch( Exception e ) {
        titanGraph.rollback();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}
OrientDB graph creation:
public long createGraphDB(String datasetRoot, OrientGraph orientGraph) {
    long duration;
    long startTime = System.nanoTime();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = orientGraph.addVertex(null);
                srcVertex.setProperty( "nodeId", parts[0] );
                Vertex dstVertex = orientGraph.addVertex(null);
                dstVertex.setProperty( "nodeId", parts[1] );
                Edge edge = orientGraph.addEdge(null, srcVertex, dstVertex, "similar");
                orientGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch(IOException ioe) {
        ioe.printStackTrace();
    }
    catch( Exception e ) {
        orientGraph.rollback();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}
Neo4j graph creation:
public long createDB(String datasetRoot, GraphDatabaseService neo4jGraph) {
    long duration;
    long startTime = System.nanoTime();
    Transaction tx = neo4jGraph.beginTx();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Node srcNode = neo4jGraph.createNode();
                srcNode.setProperty("nodeId", parts[0]);
                Node dstNode = neo4jGraph.createNode();
                dstNode.setProperty("nodeId", parts[1]);
                Relationship relationship = srcNode.createRelationshipTo(dstNode, RelTypes.SIMILAR);
            }
            lineCounter++;
        }
        tx.success();
        reader.close();
    }
    catch (IOException e) {
        e.printStackTrace();
    }
    finally {
        tx.finish();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}
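EDIT: I tried the BatchGraph solution and it seems to run forever. It ran all night yesterday and never finished, so I had to stop it. Is there something wrong with my code?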
TitanGraph graph = TitanFactory.open("data/titan");
BatchGraph<TitanGraph> batchGraph = new BatchGraph<TitanGraph>(graph, VertexIDType.STRING, 1000);
try {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("data/flickrEdges.txt")));
    String line;
    int lineCounter = 1;
    while((line = reader.readLine()) != null) {
        if(lineCounter > 4) {
            String[] parts = line.split(" ");
            Vertex srcVertex = batchGraph.getVertex(parts[0]);
            if(srcVertex == null) {
                srcVertex = batchGraph.addVertex(parts[0]);
            }
            Vertex dstVertex = batchGraph.getVertex(parts[1]);
            if(dstVertex == null) {
                dstVertex = batchGraph.addVertex(parts[1]);
            }
            Edge edge = batchGraph.addEdge(null, srcVertex, dstVertex, "similar");
            batchGraph.commit();
        }
        lineCounter++;
    }
    reader.close();
}
catch(IOException ioe) {
    ioe.printStackTrace();
}
[Comments]:
-
Can you share the stack trace? Which JVM memory parameters are you using? To print the implicit settings, use -XX:+PrintFlagsFinal.
-
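I will repeat @StefanArmbruster's question about your JVM memory parameters. Also, are you running these loads all at once? Note that BatchGraph wants memory to cache vertices, so give it as much as you can. You are calling batchGraph.commit() on every line, which is a problem. BatchGraph handles commits for you periodically, based on the batch size you pass to the constructor (currently 1000 ... you could probably go larger). Make sure you call shutdown() on the graph at the end of the load to clean up the final transaction.
-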
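For Titan, you will also want to set storage.batch-loading to true. When you do that, Titan ignores locks and eliminates some reads. By the way, what you are doing here is not trivial. Loading a very large graph dataset into any graph database requires a strategy specific to that database to get the best load performance. It may also be worth adding some logging to your code to track how the load is progressing.
-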
@StefanArmbruster These are my parameters: -Xms64m -Xmx3200m.
-
@stephenmallette Nice. That worked very well. I wonder whether the Titan project has something like OrientGraphNoTx, since that does not seem to need any memory at all.
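The two points raised in the comments (reuse a vertex when its id was already seen, and commit once per batch instead of once per line) can be sketched without any graph library. This is only an illustration under assumptions: EdgeSink, CountingSink and BatchedEdgeLoader are made-up names, not Titan or Blueprints types, and the loader splits on any whitespace rather than on a single space. With the real BatchGraph wrapper, the periodic commit is done for you, as the comment above says.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a transactional graph (NOT a Titan/Blueprints type). */
interface EdgeSink {
    long addVertex(String nodeId);               // returns an internal vertex handle
    void addEdge(long src, long dst, String label);
    void commit();
}

/** Trivial in-memory sink used only to exercise the loader. */
final class CountingSink implements EdgeSink {
    long nextId = 0;
    int commits = 0;
    public long addVertex(String nodeId) { return nextId++; }
    public void addEdge(long src, long dst, String label) { /* no-op */ }
    public void commit() { commits++; }
}

final class BatchedEdgeLoader {
    /** Counters returned to the caller. */
    static final class Stats { int vertices; int edges; int commits; }

    static Stats load(BufferedReader reader, EdgeSink sink, int headerLines, int batchSize) {
        Map<String, Long> seen = new HashMap<>();   // nodeId -> vertex handle; this cache costs heap too
        Stats stats = new Stats();
        int pending = 0;                            // edges added since the last commit
        int lineNo = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (lineNo <= headerLines || line.trim().isEmpty()) continue;
                String[] parts = line.trim().split("\\s+");
                if (parts.length < 2) continue;     // skip malformed lines
                long src = vertexFor(parts[0], seen, sink, stats);
                long dst = vertexFor(parts[1], seen, sink, stats);
                sink.addEdge(src, dst, "similar");
                stats.edges++;
                if (++pending == batchSize) {       // commit once per batch, not per line
                    sink.commit();
                    stats.commits++;
                    pending = 0;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (pending > 0) {                          // flush the final partial batch
            sink.commit();
            stats.commits++;
        }
        return stats;
    }

    /** Creates a vertex only the first time an id is seen. */
    private static long vertexFor(String id, Map<String, Long> seen, EdgeSink sink, Stats stats) {
        Long handle = seen.get(id);
        if (handle == null) {
            handle = sink.addVertex(id);
            seen.put(id, handle);
            stats.vertices++;
        }
        return handle;
    }

    public static void main(String[] args) {
        String data = "# h1\n# h2\n# h3\n# h4\n1 2\n2 3\n1 3\n";
        Stats s = load(new BufferedReader(new StringReader(data)), new CountingSink(), 4, 2);
        System.out.println("vertices=" + s.vertices + " edges=" + s.edges + " commits=" + s.commits);
    }
}
```

Compared with the snippets in the question, each node id produces one vertex instead of one per edge endpoint, and the number of commits drops from one per edge to roughly edges / batchSize.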
Tags: neo4j out-of-memory graph-databases orientdb titan
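On the JVM memory question from the comments: besides -XX:+PrintFlagsFinal, the running program can report its own heap limit through the standard Runtime API. A minimal sketch follows; the class name HeapReport is made up for this example, and the same call could be logged every so many lines during the load to track progress.

```java
/** Hypothetical helper that reports heap usage of the current JVM. */
final class HeapReport {
    private static final long MIB = 1024L * 1024L;

    /** Upper bound the JVM will try to use for the heap (reflects -Xmx if set). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently in use: allocated heap minus the free part of it. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    static String summary() {
        return "heap used=" + usedHeapMiB() + " MiB, max=" + maxHeapMiB() + " MiB";
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

If the reported maximum is far below the -Xmx value you expected, the flags are probably not reaching the JVM that actually runs the load.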