How to get the pivot lines from two tab-separated files? Answers

【Question title】: How to get the pivot lines from two tab-separated files?
【Posted】: 2021-05-13 05:23:45
【Question】:

Given two files, file1.txt:

abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432

and file2.txt:

foo bar \t hello world
abc def \t good morning
xyz \t 456

the task is to extract the lines whose first column appears in both files and produce:

abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world

I can do this in Python:

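from io import StringIO

# In-memory stand-ins for the two files; "\t" in a regular string literal is a real tab.
file1 = """abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432"""

file2 = """foo bar \t hello world
abc def \t good morning
xyz \t 456"""

map1, map2 = {}, {}

# Build a first-column -> second-column map for each file.
# Iterating over the file objects means the same loops work with open(...).
with StringIO(file1) as fin1:
    for line in fin1:
        one, two = line.strip().split('\t')
        map1[one] = two

with StringIO(file2) as fin2:
    for line in fin2:
        one, two = line.strip().split('\t')
        map2[one] = two

# Print the rows whose key appears in both maps, in sorted key order.
for k in sorted(map1.keys() & map2.keys()):
    print('\t'.join([k, map1[k], map2[k]]))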

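The actual files have billions of lines. Is there a faster solution that does not load everything into memory and keep hash maps/dictionaries?

Maybe using unix/bash commands? Would pre-sorting the files help?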

【Comments】:

Tags: python shell csv dictionary hashmap

【Solution 1】:

You can try this awk:

awk '{key = $1 FS $2} FNR==NR {sub(/^([^[:blank:]]+[[:blank:]]+){2}/, ""); map[key] = $0; next} key in map {print $0, map[key]}' file2.txt file1.txt

abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world

A more readable version:

awk '{
   key = $1 FS $2
}
FNR == NR {
   sub(/^([^[:blank:]]+[[:blank:]]+){2}/, "")
   map[key] = $0
   next
}
key in map {
   print $0, map[key]
}' file2.txt file1.txt

This loads only the data from file2 into memory and processes the records of file1 line by line.
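
For comparison with the Python in the question, the same idea (keep one file in a dictionary, stream the other) can be sketched as below. This is only an illustration, not a translation of the awk above: the function name hash_join is made up for this example, the sample data uses a bare tab between columns (no surrounding spaces), and the key is everything before the first tab.

```python
import io


def hash_join(small, large, sep="\t"):
    """Yield joined lines for keys that occur in both inputs.

    `small` is read completely into a dict; `large` is streamed line by
    line, so memory use depends only on the size of `small`.
    (hash_join is a hypothetical helper name for this sketch.)
    """
    lookup = {}
    for line in small:
        key, _, value = line.rstrip("\n").partition(sep)
        lookup[key] = value

    for line in large:
        line = line.rstrip("\n")
        key, _, _ = line.partition(sep)
        if key in lookup:
            yield sep.join([line, lookup[key]])


# Sample data from the question, assuming a plain tab as the separator.
file1 = io.StringIO(
    "abc def\t123 456\njkl mno\t987 654\nfoo bar\t789 123\nbar bar\t432\n"
)
file2 = io.StringIO("foo bar\thello world\nabc def\tgood morning\nxyz\t456\n")

for row in hash_join(small=file2, large=file1):
    print(row)
```

With real files, the StringIO objects would be replaced by open(...) handles, with the smaller file passed as small. If neither file fits in memory, this approach does not apply and a sort-based join is the usual alternative.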

【Comments】:

【Solution 2】:

The join command can be awkward to use at times, but here it is simple:

join -t $'\t' <(sort file1.txt) <(sort file2.txt)

This uses bash's ANSI-C quoting to specify the tab delimiter, and process substitutions to treat program output as files.

To inspect the output, pipe the above into cat -A, which shows tabs as ^I:

abc def^I123 456^Igood morning$
foo bar^I789 123^Ihello world$
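
To show why pre-sorting helps, here is a rough Python sketch of a merge join over two inputs that are already sorted by key. It is a simplification and not a description of how join itself is implemented: it assumes each key occurs at most once per file, that both inputs are sorted by Python's string ordering, and that columns are separated by a bare tab. The names split_key and merge_join are invented for this example.

```python
def split_key(line, sep="\t"):
    """Split a line into (key, rest) at the first separator."""
    key, _, rest = line.rstrip("\n").partition(sep)
    return key, rest


def merge_join(left, right, sep="\t"):
    """Join two iterables of lines that are sorted by a unique key.

    Only the current line of each input is held in memory.
    """
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        lk, lv = split_key(l, sep)
        rk, rv = split_key(r, sep)
        if lk == rk:
            yield sep.join([lk, lv, rv])
            l, r = next(left, None), next(right, None)
        elif lk < rk:
            l = next(left, None)   # left key is behind, advance left
        else:
            r = next(right, None)  # right key is behind, advance right


# Sample data from the question, assuming a plain tab as the separator.
lines1 = ["abc def\t123 456\n", "jkl mno\t987 654\n",
          "foo bar\t789 123\n", "bar bar\t432\n"]
lines2 = ["foo bar\thello world\n", "abc def\tgood morning\n", "xyz\t456\n"]


def by_key(line):
    return split_key(line)[0]


# sorted() stands in for an external sort here; it would not scale to huge files.
for row in merge_join(sorted(lines1, key=by_key), sorted(lines2, key=by_key)):
    print(row)
```

Once both files are sorted, the join is a single sequential pass over each, so the cost is dominated by the sort, which tools such as sort can do on disk without holding the whole file in memory.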

【Comments】:

【Solution 3】:

Using Miller (https://github.com/johnkerl/miller) and its join verb:

mlr --tsv --implicit-csv-header --headerless-csv-output join -j 1 --rp 2 -f file1.txt file2.txt >output.tsv

The output will be (this is only a preview; the actual output is tab-separated):

| foo bar | 789 123 | hello world  |
| abc def | 123 456 | good morning |

【Comments】: