[Question title]: Creating a metabolic pathway in Neo4j
[Posted]: 2018-04-05 22:13:12
[Question]:

I am trying to create the glycolysis pathway shown in the image at the bottom of this question in Neo4j, using this data:

glycolysis_bioentities.csv

name
α-D-glucose
glucose 6-phosphate
fructose 6-phosphate
"fructose 1,6-bisphosphate"
dihydroxyacetone phosphate
D-glyceraldehyde 3-phosphate
"1,3-bisphosphoglycerate"
3-phosphoglycerate
2-phosphoglycerate
phosphoenolpyruvate
pyruvate
hexokinase
glucose-6-phosphatase
phosphoglucose isomerase
phosphofructokinase
"fructose-bisphosphate aldolase, class I"
triosephosphate isomerase (TIM)
glyceraldehyde-3-phosphate dehydrogenase
phosphoglycerate kinase
phosphoglycerate mutase
enolase
pyruvate kinase

glycolysis_relations.csv

source,relation,target
α-D-glucose,substrate_of,hexokinase
hexokinase,yields,glucose 6-phosphate
glucose 6-phosphate,substrate_of,glucose-6-phosphatase
glucose-6-phosphatase,yields,α-D-glucose
glucose 6-phosphate,substrate_of,phosphoglucose isomerase
phosphoglucose isomerase,yields,fructose 6-phosphate
fructose 6-phosphate,substrate_of,phosphofructokinase
phosphofructokinase,yields,"fructose 1,6-bisphosphate"
"fructose 1,6-bisphosphate",substrate_of,"fructose-bisphosphate aldolase, class I"
"fructose-bisphosphate aldolase, class I",yields,D-glyceraldehyde 3-phosphate
D-glyceraldehyde 3-phosphate,substrate_of,glyceraldehyde-3-phosphate dehydrogenase
D-glyceraldehyde 3-phosphate,substrate_of,triosephosphate isomerase (TIM)
triosephosphate isomerase (TIM),yields,dihydroxyacetone phosphate
glyceraldehyde-3-phosphate dehydrogenase,yields,"1,3-bisphosphoglycerate"
"1,3-bisphosphoglycerate",substrate_of,phosphoglycerate kinase
phosphoglycerate kinase,yields,3-phosphoglycerate
3-phosphoglycerate,substrate_of,phosphoglycerate mutase
phosphoglycerate mutase,yields,2-phosphoglycerate
2-phosphoglycerate,substrate_of,enolase
enolase,yields,phosphoenolpyruvate
phosphoenolpyruvate,substrate_of,pyruvate kinase
pyruvate kinase,yields,pyruvate

This is what I have so far,

...using this Cypher (passed to cypher-shell):

LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
MERGE (s:Glycolysis {source: row.source})
MERGE (r:Glycolysis {relation: row.relation})
MERGE (t:Glycolysis {target: row.target})
FOREACH (x in case row.relation when "substrate_of" then [1] else [] end |
  MERGE (s)-[r:substrate_of]->(t)
)
FOREACH (x in case row.relation when "yields" then [1] else [] end |
  MERGE (s)-[r:yields]->(t)
  );

I would like to create a fully connected pathway, with captions on all nodes. Suggestions?

[Comments]:

Tags: neo4j cypher


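[Answer 1]:

[UPDATED]

There are several problems, and some possible improvements:

  1. The second MERGE should be removed, since it creates orphaned nodes. Relationship types should not be turned into Glycolysis nodes, and such nodes would never be connected to any other node.
  2. The first and third MERGE clauses must use the same property name (say, name) for the source and target nodes; otherwise the same chemical can end up with 2 nodes (with different property keys). That is why you ended up with nodes that lack some of the expected connections.
  3. The APOC procedure apoc.cypher.doIt can be used to somewhat simplify MERGEing relationships with dynamic names.
  4. glycolysis_bioentities.csv is not needed for this use case.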
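Points 1 and 2 can be illustrated with a small Python sketch. It is only a toy model of node MERGE, not Neo4j itself: the helper merge_node is hypothetical, and it assumes a node is reused whenever one with the same label and the same property values already exists.

```python
# Toy model of Cypher's node MERGE (hypothetical helper, not a Neo4j API).
# Nodes here carry only the merged properties, so an exact-match key suffices.
def merge_node(nodes, label, props):
    key = (label, tuple(sorted(props.items())))
    if key not in nodes:
        nodes[key] = len(nodes)  # create a node and give it the next id
    return nodes[key]

# First two rows of glycolysis_relations.csv.
rows = [
    ("α-D-glucose", "substrate_of", "hexokinase"),
    ("hexokinase", "yields", "glucose 6-phosphate"),
]

# The question's query: three MERGEs with three different property keys.
broken = {}
for source, relation, target in rows:
    merge_node(broken, "Glycolysis", {"source": source})
    merge_node(broken, "Glycolysis", {"relation": relation})
    merge_node(broken, "Glycolysis", {"target": target})

# The corrected query: one shared key (name) and no relation nodes.
fixed = {}
for source, relation, target in rows:
    merge_node(fixed, "Glycolysis", {"name": source})
    merge_node(fixed, "Glycolysis", {"name": target})

print(len(broken))  # 6: hexokinase exists twice, plus 2 orphaned relation nodes
print(len(fixed))   # 3: one node per distinct entity
```

In the broken variant hexokinase exists once as {target: ...} and once as {source: ...}, so the two reactions never meet at a shared node.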

With the above changes, you end up with something like this, which produces a connected graph that matches your input data:

LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
MERGE (s:Glycolysis {name: row.source})
MERGE (t:Glycolysis {name: row.target})
WITH s, t, row
CALL apoc.cypher.doIt(
  'MERGE (s)-[r:' + row.relation + ']->(t)',
  {s:s, t:t}) YIELD value
RETURN 1;

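[Comments]:

  • Excellent, thank you very much! I really appreciate the explanation, the hints, and the code! :-D
  • UPDATE: I noticed that rerunning your code with the APOC procedure adds extra relationships on each iteration; my version (the separate "answer" below), which does not use APOC, does not do this. I'll try to look into it later.
  • Yes, you're right: if my original query is run twice, it creates duplicate relationships. I have updated my answer with a query that does not create duplicates (but it is uglier than before).
  • To update others following this thread: @cybersam's original answer was the same as this one: markhneedham.com/blog/2016/10/30/… [... WITH s, t, row CALL apoc.create.relationship(s, row.relation, {}, t) YIELD rel RETURN COUNT(*) ...]. However, that code adds extra (duplicate) relationships on every iteration (rerun) of the code block, i.e. every rerun of the .cypher script. Thanks again to @cybersam for the updated solution! :-)
  • I looked into this some more: github.com/neo4j-contrib/neo4j-apoc-procedures/issues/271 It appears there is a fix for your original answer. If you replace the corresponding line of your original code with the following (note the added {} argument, which makes five arguments in total), it seems to work as expected: CALL apoc.merge.relationship(s, row.relation, {}, {}, t) YIELD rel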
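[Answer 2]:

@cybersam's answer is excellent and provides the most elegant solution (thank you again!); please upvote the accepted answer.

Since others may be interested in this question/answer/topic, I want to mention that my code (based on this SO thread, How to specify relationship type in CSV?, and modified following the hints from @cybersam) now works, and to show the result:

Solution 1 (my original post, updated):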

LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
MERGE (s:Glycolysis {name:row.source})
MERGE (t:Glycolysis {name:row.target})
FOREACH (x in case row.relation when "substrate_of" then [1] else [] end |
  MERGE (s)-[r:substrate_of]->(t)
)
FOREACH (x in case row.relation when "yields" then [1] else [] end |
  MERGE (s)-[r:yields]->(t)
  );

Solution 2 (cybersam's, updated):

LOAD CSV WITH HEADERS FROM "file:/glycolysis_relations.csv" AS row
MERGE (s:Metabolism:Glycolysis {name: row.source})
MERGE (t:Metabolism:Glycolysis {name: row.target})
WITH s, t, row
  // "Bug" -- additional duplicate relations with each iteration of this statement/script:
  // CALL apoc.create.relationship(s, row.relation, {}, t) YIELD rel
  // Solution: 
  // https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/271
  // https://stackoverflow.com/questions/47808421/neo4j-load-csv-to-create-dynamic-relationship-types
  CALL apoc.merge.relationship(s, row.relation, {}, {}, t) YIELD rel
RETURN COUNT(*);

Both solutions generate the same graph, shown below. :-D
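The difference on reruns between apoc.create.relationship and apoc.merge.relationship, described in the comments on the first answer, can be sketched in Python. This is a toy model under the assumption that "create" always adds a relationship while "merge" reuses an identical one; the function names load_with_create and load_with_merge are made up for the illustration.

```python
# First two rows of glycolysis_relations.csv.
rows = [
    ("α-D-glucose", "substrate_of", "hexokinase"),
    ("hexokinase", "yields", "glucose 6-phosphate"),
]

def load_with_create(rels, rows):
    # Models create semantics: every call appends a new relationship.
    for source, relation, target in rows:
        rels.append((source, relation, target))

def load_with_merge(rels, rows):
    # Models merge semantics: an identical relationship is reused, not added.
    for source, relation, target in rows:
        if (source, relation, target) not in rels:
            rels.append((source, relation, target))

created, merged = [], []
for _ in range(3):  # run the "script" three times
    load_with_create(created, rows)
    load_with_merge(merged, rows)

print(len(created))  # 6: two new relationships per run
print(len(merged))   # 2: reruns leave the graph unchanged
```

This is why a MERGE-based load script can be rerun safely, whereas a CREATE-based one needs the graph to be cleared first.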

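[Comments]:

  • A nice solution that can be replicated for other pathways. You could then have intersecting graphs and lots of interesting opportunities. For example, glycolysis produces ATP, while other pathways consume it. You are working with substrates and products; adding the enzymes and their kinetics (Km, etc.) would make it possible to determine rates and limiting steps. I wish I'd had these tools when I was still in the lab!
  • Exactly! This is part of my larger plan; besides other metabolic pathways, you could also add cell signaling pathways (related, e.g. regulation), plus other data extracted from various sources (e.g. PubMed). In the example above, isozymes of hexokinase are associated with T2D, insulin resistance, hemolytic anemia, and cancer. Elsewhere, variants/amounts of various biological entities relate to human disease, intervention, modeling...
  • You can use not only PubMed but also OMIM and the whole Entrez data mart, as you seem to be implying. For those who are not familiar with it, here is a good example of a federated query that draws on many Entrez resources: ncbi.nlm.nih.gov/search/?term=fructose
[Answer 3]: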

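If I may, I would like to post one more follow-up answer. My reason is that there is currently very little material on reconstructing metabolic pathways in Neo4j, and what follows should round out the StackOverflow title/topic, "Creating a metabolic pathway in Neo4j".

As with my glycolysis pathway above, I recreated the TCA (citric acid cycle | Krebs cycle) pathway in Neo4j:

[TCA cycle image source: https://metabolicpathways.stanford.edu/]

One issue that came up while creating the TCA pathway graph is that one of the nodes (the enzyme "aconitase") is used twice, so during graph creation MERGE collapsed the shared aconitase node into a single entity, resulting in this layout,

...rather than this one, as desired,

My solution to that problem was to create the "TCA graph" with a node property that temporarily tags the affected source and target nodes differently (the tags are removed later, once the graph has been created correctly).

I also added a :Metabolism label, so that I can select either an individual pathway (:Glycolysis | :TCA) or the complete metabolic network (:Metabolism), as needed.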

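Finally, I needed to connect the two pathways (:Glycolysis | :TCA) through their common node, pyruvate, which I was able to do with an APOC procedure (here, appended to the end of my glycolysis.cql (Cypher) script).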

Here are my CSV data files, the *.cql Cypher scripts, the script execution, and the resulting graph.

glycolysis.csv:

source,relation,target
α-D-glucose,substrate_of,hexokinase
hexokinase,yields,glucose 6-phosphate
glucose 6-phosphate,substrate_of,glucose-6-phosphatase
glucose-6-phosphatase,yields,α-D-glucose
glucose 6-phosphate,substrate_of,phosphoglucose isomerase
phosphoglucose isomerase,yields,fructose 6-phosphate
fructose 6-phosphate,substrate_of,phosphofructokinase
phosphofructokinase,yields,"fructose 1,6-bisphosphate"
"fructose 1,6-bisphosphate",substrate_of,"fructose-bisphosphate aldolase, class I"
"fructose-bisphosphate aldolase, class I",yields,D-glyceraldehyde 3-phosphate
D-glyceraldehyde 3-phosphate,substrate_of,glyceraldehyde-3-phosphate dehydrogenase
D-glyceraldehyde 3-phosphate,substrate_of,triosephosphate isomerase (TIM)
triosephosphate isomerase (TIM),yields,dihydroxyacetone phosphate
glyceraldehyde-3-phosphate dehydrogenase,yields,"1,3-bisphosphoglycerate"
"1,3-bisphosphoglycerate",substrate_of,phosphoglycerate kinase
phosphoglycerate kinase,yields,3-phosphoglycerate
3-phosphoglycerate,substrate_of,phosphoglycerate mutase
phosphoglycerate mutase,yields,2-phosphoglycerate
2-phosphoglycerate,substrate_of,enolase
enolase,yields,phosphoenolpyruvate
phosphoenolpyruvate,substrate_of,pyruvate kinase
pyruvate kinase,yields,pyruvate

tca.csv:

source,relation,target,tag1,tag2
pyruvate,substrate_of,pyruvate dehydrogenase,,
pyruvate dehydrogenase,yields,acetyl CoA,,
acetyl CoA,substrate_of,citrate synthase,,
oxaloacetate,substrate_of,citrate synthase,,
citrate synthase,yields,citrate,,
citrate,substrate_of,aconitase,,1
aconitase,yields,cis-aconitate,1,
cis-aconitate,substrate_of,aconitase,,2
aconitase,yields,isocitrate,2,
isocitrate,substrate_of,isocitrate dehydrogenase,,
isocitrate dehydrogenase,yields,α-ketoglutarate,,
α-ketoglutarate,substrate_of,α-ketoglutarate dehydrogenase,,
α-ketoglutarate dehydrogenase,yields,succinyl-CoA,,
succinyl-CoA,substrate_of,succinyl-CoA synthetase,,
succinyl-CoA synthetase,yields,succinate,,
succinate,substrate_of,succinate dehydrogenase,,
succinate dehydrogenase,yields,fumarate,,
fumarate,substrate_of,fumarase,,
fumarase,yields,S-malate,,
S-malate,substrate_of,malate dehydrogenase,,
malate dehydrogenase,yields,oxaloacetate,,

The "tag1" and "tag2" columns in "tca.csv" are used to uniquely identify those source and target nodes while they are being created by the "tca.cql" script:

tca.cql:

// CREATE INDICES:
CREATE INDEX ON :Metabolism(name);
CREATE INDEX ON :TCA(name);

// CREATE GRAPH:
// USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/tca.csv" AS row
MERGE (s:Metabolism:TCA {name: row.source, tag:COALESCE(row.tag1, '')})
MERGE (t:Metabolism:TCA {name: row.target, tag:COALESCE(row.tag2, '')})
WITH s, t, row
  CALL apoc.merge.relationship(s, row.relation, {}, {}, t) YIELD rel
  REMOVE s.tag, t.tag
RETURN COUNT(*);
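The effect of the tag1/tag2 columns can be sketched in Python using the four aconitase rows from tca.csv. This is a toy model, not Neo4j: the function build is hypothetical, and it assumes MERGE treats two nodes as the same only when both name and tag match.

```python
# The four aconitase rows from tca.csv: source, relation, target, tag1, tag2.
rows = [
    ("citrate", "substrate_of", "aconitase", "", "1"),
    ("aconitase", "yields", "cis-aconitate", "1", ""),
    ("cis-aconitate", "substrate_of", "aconitase", "", "2"),
    ("aconitase", "yields", "isocitrate", "2", ""),
]

def build(rows, use_tags):
    # A node is identified by (name, tag); with use_tags=False the tag is
    # always empty, which models merging on name alone.
    nodes, edges = set(), []
    for source, relation, target, tag1, tag2 in rows:
        s = (source, tag1 if use_tags else "")
        t = (target, tag2 if use_tags else "")
        nodes.update([s, t])
        edges.append((s, relation, t))
    return nodes, edges

untagged_nodes, untagged_edges = build(rows, use_tags=False)
tagged_nodes, tagged_edges = build(rows, use_tags=True)

print(len(untagged_nodes))  # 4: a single aconitase node shared by both steps
print(len(tagged_nodes))    # 5: aconitase/1 and aconitase/2 stay separate
print(sorted(n for n in tagged_nodes if n[0] == "aconitase"))
```

With the tags, citrate reaches cis-aconitate through one aconitase node and cis-aconitate reaches isocitrate through the other, which gives the linear layout instead of a hub.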

glycolysis.cql:

// CREATE INDICES:
CREATE INDEX ON :Metabolism(name);
CREATE INDEX ON :Glycolysis(name);

// CREATE GRAPH:
//USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/glycolysis.csv" AS row
MERGE (s:Metabolism:Glycolysis {name: row.source})
MERGE (t:Metabolism:Glycolysis {name: row.target})
WITH s, t, row
  CALL apoc.merge.relationship(s, row.relation, {}, {}, t) YIELD rel
RETURN COUNT(*);

// MERGE COMMON NODE (GLYCOLYSIS: PYRUVATE; TCA: PYRUVATE):
// As presented, run "tca.cql" first, then "glycolysis.cql"

MATCH (g:Glycolysis), (t:TCA) WHERE g.name = t.name
CALL apoc.refactor.mergeNodes([g,t]) YIELD node
  RETURN node;
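What the final merge step does to labels can be sketched in Python. This is a toy model only: merge_by_name is a hypothetical helper, the node lists are a small excerpt of the two pathways, and the rewiring of relationships onto the merged node is not modeled.

```python
# A few nodes from each pathway; pyruvate occurs in both.
glycolysis = ["phosphoenolpyruvate", "pyruvate kinase", "pyruvate"]
tca = ["pyruvate", "pyruvate dehydrogenase", "acetyl CoA"]

nodes = [{"name": n, "labels": {"Metabolism", "Glycolysis"}} for n in glycolysis]
nodes += [{"name": n, "labels": {"Metabolism", "TCA"}} for n in tca]

def merge_by_name(nodes):
    # Nodes sharing a name collapse into one node carrying the union of labels.
    merged = {}
    for node in nodes:
        if node["name"] in merged:
            merged[node["name"]]["labels"] |= node["labels"]
        else:
            merged[node["name"]] = {"name": node["name"], "labels": set(node["labels"])}
    return list(merged.values())

result = merge_by_name(nodes)
pyruvate = next(n for n in result if n["name"] == "pyruvate")

print(len(nodes), len(result))     # 6 5
print(sorted(pyruvate["labels"]))  # ['Glycolysis', 'Metabolism', 'TCA']
```

The combined label set on pyruvate is what lets a query on :Metabolism, :Glycolysis, or :TCA reach the same node.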

Script execution:

$ cat tca.cql |  cypher-shell -u *** -p ***
  COUNT(*)
  21

$ cat glycolysis.cql |  cypher-shell -u *** -p ***
  COUNT(*)
  22
  node
  (:Metabolism:TCA:Glycolysis {name: "pyruvate"})

$ 

Neo4j graph (:Metabolism view):

[Comments]:
