【Posted】: 2021-05-13 05:23:45
【Question】:
Given two files, file1.txt
abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432
and file2.txt
foo bar \t hello world
abc def \t good morning
xyz \t 456
the task is to extract the rows whose first column appears in both files and produce:
abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world
I can do this in Python:
from io import StringIO

# Sample data; "\t" inside these literals is a real tab character.
file1 = """abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432"""

file2 = """foo bar \t hello world
abc def \t good morning
xyz \t 456"""

map1, map2 = {}, {}

# Map the first column of each line to the rest of that line.
with StringIO(file1) as fin1:
    for line in fin1:
        one, two = line.strip().split('\t')
        map1[one] = two

with StringIO(file2) as fin2:
    for line in fin2:
        one, two = line.strip().split('\t')
        map2[one] = two

# Print the rows whose first column occurs in both files.
for k in set(map1).intersection(map2):
    print('\t'.join([k, map1[k], map2[k]]))
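The real files have billions of lines. Is there a faster solution that does not load everything into memory and keep a hashmap/dictionary?
Maybe using unix/bash commands? Would pre-sorting the files help?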
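To make the pre-sorting idea concrete, here is a minimal sketch of a streaming merge join. It is only an illustration under assumptions that the question does not confirm: both inputs are already sorted by their first tab-separated column in plain code-point order (for example by an external sort run in the C locale), and each key occurs at most once per file. The names `split_key` and `merge_join` are hypothetical. Only one line from each input is held in memory at a time.

```python
import io


def split_key(line):
    # Split a line into (key, rest) at the first tab; drop only the newline.
    key, _, rest = line.rstrip('\n').partition('\t')
    return key, rest


def merge_join(sorted1, sorted2):
    """Yield joined rows from two line iterators sorted by key.

    Assumption: keys are unique within each input and both inputs are
    sorted in the same order that Python uses to compare str values.
    """
    it1, it2 = iter(sorted1), iter(sorted2)
    line1, line2 = next(it1, None), next(it2, None)
    while line1 is not None and line2 is not None:
        k1, v1 = split_key(line1)
        k2, v2 = split_key(line2)
        if k1 == k2:
            # Key present in both inputs: emit the joined row, advance both.
            yield '\t'.join([k1, v1, v2])
            line1, line2 = next(it1, None), next(it2, None)
        elif k1 < k2:
            # Key only on the left so far: advance the left input.
            line1 = next(it1, None)
        else:
            # Key only on the right so far: advance the right input.
            line2 = next(it2, None)


# The sample data from the question, pre-sorted by first column.
sorted_file1 = (
    "abc def \t 123 456\n"
    "bar bar \t 432\n"
    "foo bar \t 789 123\n"
    "jkl mno \t 987 654\n"
)
sorted_file2 = (
    "abc def \t good morning\n"
    "foo bar \t hello world\n"
    "xyz \t 456\n"
)

for row in merge_join(io.StringIO(sorted_file1), io.StringIO(sorted_file2)):
    print(row)
```

With real files, the two `io.StringIO` objects would be replaced by open file handles. Whether this beats the dictionary approach overall depends on the cost of sorting the inputs first.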
【Discussion】:
Tags: python shell csv dictionary hashmap