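Some suggestions:
Use defaultdict(list) instead of creating the inner lists yourself or using dict.setdefault().
dict.setdefault() creates the default value on every call, even when the key already exists, which burns time. defaultdict(list) does not; it is optimized to build a list only when a key is missing: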
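from collections import defaultdict

def readEvalFileAsDictInverse(evalFile):
    # map each ID (column 1) to the list of values found in column 2
    evalIDs = defaultdict(list)
    with open(evalFile, "r") as eval_file:  # do not shadow the builtin eval()
        for row in eval_file:
            ids = row.rstrip("\n").split("\t")
            evalIDs[ids[0]].append(ids[1])
    return evalIDs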
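To see the difference for yourself, here is a minimal sketch that counts how often the default value gets built. The helper make_list and the calls counter are made up for this illustration; they are not part of any library.

```python
from collections import defaultdict

# hypothetical counter, only here to make the list constructions visible
calls = {"setdefault": 0, "defaultdict": 0}

def make_list(label):
    """Build an empty list and record that it was built."""
    calls[label] += 1
    return []

keys = ["a", "b", "a", "a", "b"]

# setdefault: the default argument is evaluated on every call
plain = {}
for k in keys:
    plain.setdefault(k, make_list("setdefault")).append(1)

# defaultdict: the factory only runs when the key is missing
dd = defaultdict(lambda: make_list("defaultdict"))
for k in keys:
    dd[k].append(1)

print(calls)  # {'setdefault': 5, 'defaultdict': 2}
```

Both dictionaries end up with the same content; only the number of throwaway lists differs.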
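If your keys are valid file names, you might want to look into awk, which should give you much better performance than doing this in Python.
Something along the lines of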
awk -F $'\t' '{print > $1}' file1
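will create your split files much faster, and you can simply use the latter part of the code below to read each file back (assuming your keys are valid file names) and build your lists. (Attribution: here) - you would need to collect the created files with os.walk or similar means. Each line inside those files is still tab-separated and still contains the ID in front.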
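Reading the awk output back could look roughly like this. It is only a sketch: load_split_files and split_dir are names I made up, and it assumes you ran awk inside a dedicated directory that contains nothing but the split files (otherwise the input file itself would be picked up as well).

```python
import os

def load_split_files(split_dir):
    """Return {key: [second column of every line]} for all files below split_dir."""
    data = {}
    for root, _dirs, names in os.walk(split_dir):
        for name in names:
            values = []
            with open(os.path.join(root, name)) as fh:
                for line in fh:
                    # lines are still "ID<TAB>value<TAB>...", so skip the ID
                    fields = line.rstrip("\n").split("\t")
                    if len(fields) > 1:
                        values.append(fields[1])
            data[name] = values  # the file name is the key
    return data
```

Calling load_split_files("some_dir") then gives you the same kind of dictionary as the pure Python version, with the file names as keys.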
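If your keys are not file names in their own right, consider storing your different lines in different files and only keep a dictionary of key,filename around.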
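One way to get such a key,filename dictionary without touching the keys at all is to hand out generated names. This is just a sketch; assign_filenames and the "part_" prefix are my own invention, pick whatever naming scheme suits you.

```python
def assign_filenames(keys, prefix="part_"):
    """Map each distinct key to a generated, always-valid file name."""
    mapping = {}
    for key in keys:
        if key not in mapping:
            # number the files in order of first appearance
            mapping[key] = f"{prefix}{len(mapping):06d}.txt"
    return mapping

print(assign_filenames(["a/b", "c:d", "a/b"]))
# {'a/b': 'part_000000.txt', 'c:d': 'part_000001.txt'}
```

The keys can then contain slashes, colons or anything else, because they never end up in a path.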
After splitting your data, load the files as lists again:
Create a test file:
with open("file.txt", "w") as w:
    w.write("""
1\ttata\ti
2\tyipp\ti
3\turks\ti
1\tTTtata\ti
2\tYYyipp\ti
3\tUUurks\ti
1\ttttttttata\ti
2\tyyyyyyyipp\ti
3\tuuuuuuurks\ti
""")
Code:
# e.g. https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
def make_filename(k):
    """In case your keys contain non-filename-characters, make it a valid name"""
    return k  # assuming k is a valid file name else modify it

evalFile = "file.txt"
files = {}
with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        # strip the newline so this also works with only 2 columns
        key, value, *rest = line.rstrip("\n").split("\t")  # omit ,*rest if you only have 2 values
        fn = files.setdefault(key, make_filename(key))

        # this will open and close files _a lot_ you might want to keep file handles
        # instead in your dict - but that depends on the key/data/lines ratio in
        # your data - if you have few keys, file handles ought to be better, if
        # you have many it does not matter
        # (mode "a" appends: remove old output files before running this again)
        with open(fn, "a") as f:
            f.write(value + "\n")

# create your list data from your files:
data = {}
for key, fn in files.items():
    with open(fn) as r:
        data[key] = [x.strip() for x in r]

print(data)
Output:
# for my data: loaded from files called '1', '2' and '3'
{'1': ['tata', 'TTtata', 'tttttttata'],
'2': ['yipp', 'YYyipp', 'yyyyyyyipp'],
'3': ['urks', 'UUurks', 'uuuuuuurks']}
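If you only have a few keys, the variant with open file handles mentioned in the code comment could be sketched like this. split_with_open_handles and out_dir are hypothetical names, the keys are again assumed to be valid file names, and out_dir is assumed to exist already. Keep in mind that most systems limit the number of files a process may have open at once, so this is not a good fit for a very large number of keys.

```python
import os
from contextlib import ExitStack

def split_with_open_handles(src_path, out_dir):
    """Write column 2 of each row to out_dir/<key>; return {key: path}."""
    paths = {}
    handles = {}
    # ExitStack closes every handle registered on it when the block ends
    with ExitStack() as stack, open(src_path) as src:
        for line in src:
            if not line.strip():
                continue
            key, value, *rest = line.rstrip("\n").split("\t")
            if key not in handles:
                # assumption: key is usable as a file name
                paths[key] = os.path.join(out_dir, key)
                handles[key] = stack.enter_context(open(paths[key], "w"))
            handles[key].write(value + "\n")
    return paths
```

Each file is opened exactly once here, and the returned dictionary can be fed into the same loading loop as above to build the lists.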