【Question title】: Easiest way to import a simple csv file to a graph with OrientDB ETL
【Posted】: 2015-06-21 17:47:06
【Question】:

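I want to import a very simple CSV file describing a directed graph into OrientDB. Specifically, the file is the roadNet-PA dataset from the SNAP collection (https://snap.stanford.edu/data/roadNet-PA.html). The first lines of the file are: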

# Directed graph (each unordered pair of nodes is saved once)
# Pennsylvania road network
# Nodes: 1088092 Edges: 3083796
# FromNodeId    ToNodeId
0       1
0       6309
0       6353
1       0
6353    0
6353    6354

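There is only one type of vertex (a road intersection), and the edges carry no information (I guess OrientDB lightweight edges are the best option). Note also that the vertex IDs are separated by tabs.

I tried to create a simple ETL configuration to import the file, without success. Here is the ETL: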

{
  "config": {
    "log": "debug"
  },
  "source" : {
    "file": { "path": "/tmp/roadNet-PA.csv" }
  },
  "extractor": { "row": {} },
  "transformers": [
    { "csv": { "separator": "   ", "skipFrom": 1, "skipTo": 4 } },
    { "vertex": { "class": "Intersection" } },
    { "edge": { "class": "Road" } }
  ],
  "loader": {
    "orientdb": {
       "dbURL": "remote:localhost/roads",
       "dbType": "graph",
       "classes": [
         {"name": "Intersection", "extends": "V"},
         {"name": "Road", "extends": "E"}
       ], "indexes": [
         {"class":"Intersection", "fields":["id:integer"], "type":"UNIQUE" }
       ]
    }
  }
} 

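The ETL runs, but it does not import the file as I expected. I suspect the problem is in the transformers. My idea is to read the CSV line by line and create an edge connecting the two vertices on each line, but I am not sure how to express that in the ETL file. Any ideas?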

    Tags: graph import etl orientdb nosql


    【Answer 1】:

    Try this:

    {
      "config": {
        "log": "debug"
      },
      "source" : {
        "file": { "path": "/tmp/roadNet-PA.csv" }
      },
      "extractor": { "row": {} },
      "transformers": [
        { "csv": { "separator": "\t", "skipFrom": 1, "skipTo": 4,
                   "columnsOnFirstLine": false, 
                   "columns":["id", "to"] } },
        { "vertex": { "class": "Intersection" } },
        { "merge": { "joinFieldName":"id", "lookup":"Intersection.id" } },
        { "edge": {
           "class": "Road",
           "joinFieldName": "to",
           "lookup": "Intersection.id",
           "unresolvedLinkAction": "CREATE"
          }
        }
      ],
      "loader": {
        "orientdb": {
           "dbURL": "remote:localhost/roads",
           "dbType": "graph",
           "wal": false,
           "batchCommit": 1000,
           "tx": true,
           "txUseLog": false,
           "useLightweightEdges" : true,
           "classes": [
             {"name": "Intersection", "extends": "V"},
             {"name": "Road", "extends": "E"}
           ], "indexes": [
             {"class":"Intersection", "fields":["id:integer"], "type":"UNIQUE" }
           ]
        }
      }
    } 
    

    To speed up loading, I suggest shutting down the server and running the ETL import with "plocal:" instead of "remote:". Replace the existing dbURL with:

           "dbURL": "plocal:/orientdb/databases/roads",
    

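    【Comments】:

    • Thanks for the answer. I am not sure whether I am doing something wrong, but I ran into two errors. First, the skipFrom and skipTo settings do not work: the first lines are still passed to the transformers. I removed those lines by hand and then hit a second problem: OrientVertex cannot be cast to ODocument. Here is the log: pastebin.com/i6QGRcUV
    • Try moving the merge before the vertex.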
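
The first comment above mentions removing the header lines by hand. For a file with about three million rows, that step can be scripted. The following is a minimal sketch in Python, not part of OrientDB ETL: the function name `strip_comment_lines` and the output path are illustrative assumptions. It drops every line that starts with `#` (and any blank line) and copies the rest unchanged.

```python
def strip_comment_lines(src: str, dst: str, prefix: str = "#") -> int:
    """Copy src to dst without comment or blank lines; return the number of rows kept."""
    kept = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            # Skip SNAP header lines such as "# Nodes: ... Edges: ..." and empty lines.
            if line.startswith(prefix) or not line.strip():
                continue
            fout.write(line)
            kept += 1
    return kept
```

Called as, for example, `strip_comment_lines("/tmp/roadNet-PA.csv", "/tmp/roads.csv")`, it leaves a file containing only tab-separated data rows, so the csv transformer no longer needs skipFrom and skipTo.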
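    【Answer 2】:

    It finally worked. As Luca suggested, I moved the merge before the vertex line. I also renamed the "id" field to "from" to avoid the error "property key is reserved for all elements id". Here is the snippet: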

    {
      "config": {
        "log": "debug"
      },
      "source" : {
        "file": { "path": "/tmp/roads.csv" }
      },
      "extractor": { "row": {} },
      "transformers": [
        { "csv": { "separator": "\t",
                   "columnsOnFirstLine": false, 
                   "columns":["from", "to"] } },
        { "merge": { "joinFieldName":"from", "lookup":"Intersection.from" } },
        { "vertex": { "class": "Intersection" } },
        { "edge": {
           "class": "Road",
           "joinFieldName": "to",
           "lookup": "Intersection.from",
           "unresolvedLinkAction": "CREATE"
          }
        }
      ],
      "loader": {
        "orientdb": {
           "dbURL": "remote:localhost/roads",
           "dbType": "graph",
           "wal": false,
           "batchCommit": 1000,
           "tx": true,
           "txUseLog": false,
           "useLightweightEdges" : true,
           "classes": [
             {"name": "Intersection", "extends": "V"},
             {"name": "Road", "extends": "E"}
           ], "indexes": [
             {"class":"Intersection", "fields":["from:integer"], "type":"UNIQUE" }
           ]
        }
      }
    } 
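
To make the intent of the merge, vertex, and edge transformers easier to follow, here is a rough in-memory model in Python. It is only a sketch of the semantics as I understand them, not OrientDB code: the function name `load_edges` is hypothetical, a plain dict stands in for the UNIQUE index on `Intersection.from`, and a list stands in for the `Road` edges.

```python
def load_edges(rows):
    """Model the merge -> vertex -> edge pipeline over (from, to) rows."""
    vertices = {}  # stands in for the UNIQUE index on Intersection.from
    edges = []     # stands in for the Road edges

    for src, dst in rows:
        # merge + vertex: reuse the vertex already indexed under `from`, else create it.
        vertices.setdefault(src, {"from": src})
        # edge with unresolvedLinkAction = CREATE: create the target vertex if it is missing.
        vertices.setdefault(dst, {"from": dst})
        edges.append((src, dst))
    return vertices, edges


# The six sample rows quoted in the question.
rows = [(0, 1), (0, 6309), (0, 6353), (1, 0), (6353, 0), (6353, 6354)]
vertices, edges = load_edges(rows)
print(len(vertices), len(edges))  # 5 6
```

Each intersection appears once no matter how many rows mention it, which is why the merge has to run before the vertex transformer: without it, every row would try to create a new `from` vertex and collide with the unique index.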
    
