【Posted】: 2020-09-04 08:42:30
【Question】:
My data looks like this:
id1,id2,similarity
CHEMBL1,CHEMBL1,1
CHEMBL2,CHEMBL1,0.18
CHEMBL3,CHEMBL1,0.56
CHEMBL4,CHEMBL1,0.64
CHEMBL5,CHEMBL1,0.12
CHEMBL1,CHEMBL2,0.18
CHEMBL2,CHEMBL2,1
CHEMBL3,CHEMBL2,0.26
CHEMBL4,CHEMBL2,0.78
CHEMBL5,CHEMBL2,0.33
CHEMBL1,CHEMBL3,0.56
CHEMBL2,CHEMBL3,0.26
CHEMBL3,CHEMBL3,1
CHEMBL4,CHEMBL3,0.04
CHEMBL5,CHEMBL3,0.85
CHEMBL1,CHEMBL4,0.64
CHEMBL2,CHEMBL4,0.78
CHEMBL3,CHEMBL4,0.04
CHEMBL4,CHEMBL4,1
CHEMBL5,CHEMBL4,0.49
CHEMBL1,CHEMBL5,12
CHEMBL2,CHEMBL5,0.33
CHEMBL3,CHEMBL5,0.85
CHEMBL4,CHEMBL5,0.49
CHEMBL5,CHEMBL5,1
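The whole file has about 197 million rows (10 GB). My goal is to compare the distributions of column 3 for each compound in column 1. After a lot of refactoring, I managed to arrive at this code: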
import pandas as pd
from scipy.stats import ks_2samp
import re

with open('example.csv', 'r') as f, open('Metrics.tsv', 'a') as f_out:
    f_out.write('compound_1' + '\t' + 'compound_2' + '\t' + 'Similarity' + '\t' + 'KS Distance' + '\n')
    df = pd.read_csv(f, delimiter = ',', lineterminator = '\n', header = None)
    d = {}
    l_id1 = []
    l_id2 = []
    l_sim = []
    uniq_comps = df.iloc[:, 0].unique().tolist()
    for i in uniq_comps:
        d[i] = []
    for j in range(df.shape[0]):
        d[df.iloc[j, 0]].append(df.iloc[j, 2])
        l_id1.append(df.iloc[j, 0])
        l_id2.append(df.iloc[j, 1])
        l_sim.append(df.iloc[j, 2])
    for k in range(len(l_id1)):
        sim = round(l_sim[k]*100, 0)/100
        ks = re.findall(r"statistic=(.*)\,.*$", str(ks_2samp(d[l_id1[k]], d[l_id2[k]])))
        f_out.write(l_id1[k] + '\t' + l_id2[k] + '\t' + str(sim) + '\t' + str(''.join(ks)) + '\n')
It runs but, as expected, it is very slow. Does anyone have an idea how to make it faster? The output I want looks like this:
compound_1,compound_2,Similarity,KS Distance
CHEMBL1,CHEMBL1,1.0,0.0
CHEMBL2,CHEMBL1,0.18,0.4
CHEMBL3,CHEMBL1,0.56,0.2
CHEMBL4,CHEMBL1,0.64,0.2
CHEMBL5,CHEMBL1,0.12,0.4
CHEMBL1,CHEMBL2,0.18,0.4
CHEMBL2,CHEMBL2,1.0,0.0
CHEMBL3,CHEMBL2,0.26,0.2
CHEMBL4,CHEMBL2,0.78,0.4
CHEMBL5,CHEMBL2,0.33,0.2
CHEMBL1,CHEMBL3,0.56,0.2
CHEMBL2,CHEMBL3,0.26,0.2
CHEMBL3,CHEMBL3,1.0,0.0
CHEMBL4,CHEMBL3,0.04,0.2
CHEMBL5,CHEMBL3,0.85,0.2
CHEMBL1,CHEMBL4,0.64,0.2
CHEMBL2,CHEMBL4,0.78,0.4
CHEMBL3,CHEMBL4,0.04,0.2
CHEMBL4,CHEMBL4,1.0,0.0
CHEMBL5,CHEMBL4,0.49,0.2
CHEMBL1,CHEMBL5,12.0,0.4
CHEMBL2,CHEMBL5,0.33,0.2
CHEMBL3,CHEMBL5,0.85,0.2
CHEMBL4,CHEMBL5,0.49,0.2
CHEMBL5,CHEMBL5,1.0,0.0
Given the size of the data, would it be wiser to run this in PySpark? If so, how could a similar result be achieved?
【Comments】:
-
I'm voting to close this question because it should be asked on codereview.stackexchange.com
-
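Could you post a few lines of each file so we can see the format? To reduce the data volume, one option is to build histograms or ECDFs to limit the size of what is held in memory.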
-
@jlandercy Is the image of my data not visible?
-
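An image of data is a very poor way to communicate data. You should copy and paste reusable code so that your question meets SO standards. You can also read minimal reproducible example to learn more. And yes, knowing the structure of the file is of interest, since you are running a regular expression on it.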
-
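@MarcinOrlowski While this may be on-topic on CR, in the future, please don't use the existence of the Code Review site as a reason to close a question. Evaluate the request and use a reason such as needs focus (as I have done here), primarily opinion-based, etc. Then you can mention to the OP that it can be posted on Code Review if it is on-topic. Please see Does being on-topic at another Stack Exchange site automatically make a question off-topic for Stack Overflow?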
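The histogram/ECDF idea raised in the comments could look roughly like the sketch below. It is an illustration, not a tested solution for the real file: the function names, the bin count of 100, and the chunk size are all assumptions, similarities are assumed to lie in [0, 1] (out-of-range values such as the 12 in the sample are clipped into the last bin), and the result is a binned approximation of the KS distance rather than the exact statistic. The file is streamed with the chunksize option of pandas.read_csv, so only one fixed-size histogram per compound is kept in memory.

```python
import io
from collections import defaultdict

import numpy as np
import pandas as pd

N_BINS = 100                                # assumption: 0.01-wide bins
EDGES = np.linspace(0.0, 1.0, N_BINS + 1)   # assumption: values in [0, 1]


def build_histograms(csv_source, chunksize=1_000_000):
    """Stream the CSV and accumulate one fixed-bin histogram per id1 compound."""
    hists = defaultdict(lambda: np.zeros(N_BINS, dtype=np.int64))
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        # Clip so that out-of-range values land in the edge bins (assumption).
        sims = chunk["similarity"].clip(0.0, 1.0)
        for cid, vals in sims.groupby(chunk["id1"]):
            hists[cid] += np.histogram(vals.to_numpy(), bins=EDGES)[0]
    return dict(hists)


def ks_from_hists(h1, h2):
    """Approximate KS distance: largest gap between the two binned ECDFs."""
    cdf1 = np.cumsum(h1) / h1.sum()
    cdf2 = np.cumsum(h2) / h2.sum()
    return float(np.max(np.abs(cdf1 - cdf2)))


# Tiny demo on an in-memory CSV with made-up compound names and values.
demo_csv = io.StringIO(
    "id1,id2,similarity\n"
    "A,A,0.05\n"
    "A,B,0.15\n"
    "B,A,0.85\n"
    "B,B,0.95\n"
)
demo_hists = build_histograms(demo_csv, chunksize=2)
print(ks_from_hists(demo_hists["A"], demo_hists["B"]))
```

Because each compound is reduced to N_BINS counters, comparing two compounds costs a fixed amount of work regardless of how many rows they had, at the price of losing resolution below the bin width.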
Tags: python pandas csv pyspark kolmogorov-smirnov