Answers to: Faster solution to compare files in bash

【Question title】: Faster solution to compare files in bash
【Posted】: 2017-03-01 01:21:05
【Question description】:

file1:

chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468
chr1    14969   15038   NR_024540_1_r_WASH7P_69
chr1    15795   15947   NR_024540_2_r_WASH7P_152
chr1    16606   16765   NR_024540_3_r_WASH7P_15
chr1    16857   17055   NR_024540_4_r_WASH7P_198

and file2:

NR_024540 11

I need to find the lines of file1 that match a key from file2, and print the whole file1 line plus the second column of file2.

So the output is:

chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1    14969   15038   NR_024540_1_r_WASH7P_69 11
chr1    15795   15947   NR_024540_2_r_WASH7P_152 11
chr1    16606   16765   NR_024540_3_r_WASH7P_15 11
chr1    16857   17055   NR_024540_4_r_WASH7P_198 11

My solution in bash is slow:

#!/bin/bash

while read line; do

c=$(echo $line | awk '{print $1}')
d=$(echo $line | awk '{print $2}')

grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' >> output


 done < file2

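I would prefer any FASTER bash or awk solution. The output can be modified, but it must keep all the information (the column order may differ).

Edit:

Following @chepner's suggestion, this currently looks like the fastest solution: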

#!/bin/bash

while read -r c d; do

grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' 

done < file2 > output

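【Question comments】:

Are file1 and/or file2 sorted in some way?
Storing the data in a proper database that supports querying would be the fastest.
Yes, they are sorted with sort -k 1V,2 -k 2n,2, but re-sorting them in whatever order this task needs is no problem.
@Geroge: Can these identifiers appear anywhere in file1, or only in the last column?
@Geroge, sqlite.org might be a good place to start when looking for a lightweight indexed database that can do lookups faster than scanning a text file.

Tags: linux bash awk sed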

【Solution 1】:

In a single Awk command:

awk 'FNR==NR{map[$1]=$2; next}{ for (i in map) if($0 ~ i){$(NF+1)=map[i]; print; next}}' file2 file1

chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11

A more readable multi-line version:

FNR==NR {
    # map the values from 'file2' into the hash-map 'map'
    map[$1]=$2
    next
}
# On 'file1' do
{
    # Iterate through the array map
    for (i in map){
        # If there is a direct regex match on the line with the 
        # element from the hash-map, print it and append the 
        # hash-mapped value at last
        if($0 ~ i){
            $(NF+1)=map[i]
            print
            next
        }
    }
}
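
One consequence of the `$0 ~ i` test in the loop above is that each key is treated as a regular expression and may match anywhere in the line. The following is a minimal sketch of that behaviour, not part of the original answer; the key `NR_02454` is a hypothetical value chosen because it is only a prefix of the real identifier.

```shell
line='chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468'

# Regex test, as in the loop: the key may match anywhere in the line.
partial=$(printf '%s\n' "$line" |
  awk -v k='NR_02454' '{ r = ($0 ~ k) ? "match" : "no match"; print r }')

# Exact test: rebuild the identifier from the first two "_"-separated
# pieces of column 4 and compare it with the key as a plain string.
exact=$(printf '%s\n' "$line" |
  awk -v k='NR_02454' '{
    split($4, v, "_")
    r = ((v[1] "_" v[2]) == k) ? "match" : "no match"
    print r
  }')

printf 'regex: %s\nexact: %s\n' "$partial" "$exact"
```

Whether the partial match is wanted depends on the data; if the keys must match the identifier exactly, a string comparison or an array lookup avoids the surprise.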

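【Comments】:

split($4,v,"_"); key=v[1]"_"v[2]; if(key in map) would be better than for (i in map) if(match($0,i))
Thanks a lot, but it is still slow: I have about 600K lines in file1. A nice awk solution, though.
@Geroge: A minimal awk is about all you can use here.
^1 Nice one. Add a next statement after print; looping over the array is rather slow, and (var in array) can be used instead.
@Inian: Oops! Yes, it is unclear whether the OP wants an exact match or a partial match. Sorry for the inconvenience.

【Solution 2】:

Another solution using join and sed, assuming file1 and file2 are sorted: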

join <(sed -r 's/[^ _]+_[^_]+/& &/' file1) file2 -1 4 -2 1 -o "1.1 1.2 1.3 1.5 2.2" > output
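
To see what the sed step feeds into join, here is a small sketch run on one sample line. It assumes space-separated columns (with tabs, the bracket expression `[^ _]` would also need to exclude the tab character) and uses `-E`, which GNU sed accepts as an equivalent of `-r`.

```shell
# The first "word_word" run on the line is duplicated, so the bare
# identifier becomes a column of its own that join can use as the key.
out=$(printf 'chr1 14361 14829 NR_024540_0_r_WASH7P_69\n' |
  sed -E 's/[^ _]+_[^_]+/& &/')

printf '%s\n' "$out"
```

After this step column 4 holds the join key and column 5 holds the original name, which is why the `-o` list above picks fields 1.1, 1.2, 1.3 and 1.5.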

If the order of the output does not matter, use awk:

awk 'FNR==NR{d[$1]=$2; next}
    {split($4,v,"_"); key=v[1]"_"v[2]; if(key in d) print $0, d[key]}
' file2 file1

This gives:

chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
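
The awk version above relies on rebuilding the lookup key from the fourth column. As a minimal sketch of just that step, run on a single name taken from the question:

```shell
# Split the name on "_" and glue the first two pieces back together;
# that reproduces the identifier format used in the first column of file2.
key=$(printf '%s\n' 'NR_024540_0_r_DDX11L1,WASH7P_468' |
  awk '{ split($1, v, "_"); print v[1] "_" v[2] }')

printf '%s\n' "$key"
```

Because the key is then looked up with `key in d`, each line of file1 costs one hash lookup instead of a loop over every entry of file2.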

【Comments】:

Or even this: awk 'FNR==NR{map[$1]=$2; next}{k=$4"_"$5 }(k in map){print $0,map[k]}' file2 FS='[ _]+' file1
@Jose Great, the awk solution works like a charm and is very, very fast! Thanks!

【Solution 3】:

Try this -

 cat file2
NR_024540 11
NR_024541 12

 cat file11
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468
chr1    14361   14829   NR_024542_0_r_DDX11L1,WASH7P_468
chr1    14969   15038   NR_024540_1_r_WASH7P_69
chr1    15795   15947   NR_024540_2_r_WASH7P_152
chr1    16606   16765   NR_024540_3_r_WASH7P_15
chr1    16857   17055   NR_024540_4_r_WASH7P_198
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468
chr1    14969   15038   NR_024540_1_r_WASH7P_69
chr1    15795   15947   NR_024540_2_r_WASH7P_152
chr1    16606   16765   NR_024540_3_r_WASH7P_15


awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file11
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1    14969   15038   NR_024540_1_r_WASH7P_69 11
chr1    15795   15947   NR_024540_2_r_WASH7P_152 11
chr1    16606   16765   NR_024540_3_r_WASH7P_15 11
chr1    16857   17055   NR_024540_4_r_WASH7P_198 11
chr1    14361   14829   NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1    14969   15038   NR_024540_1_r_WASH7P_69 11
chr1    15795   15947   NR_024540_2_r_WASH7P_152 11
chr1    16606   16765   NR_024540_3_r_WASH7P_15 11

Performance (tested with 55,000 records):

time awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file1 > output1

real    0m0.16s
user    0m0.14s
sys     0m0.01s
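
To try a measurement of similar size, one option is to generate synthetic input. The sketch below is an assumption about how such a test could be set up, not the data used for the timing above: the coordinates and the `TEST` gene name are made up, and every generated line carries the key `NR_024540`.

```shell
tmp=$(mktemp -d)

# 55000 synthetic file1 lines (hypothetical values) and a one-line file2.
awk 'BEGIN {
  for (i = 0; i < 55000; i++)
    printf "chr1\t%d\t%d\tNR_024540_%d_r_TEST_1\n", i, i + 100, i
}' > "$tmp/file1"
printf 'NR_024540 11\n' > "$tmp/file2"

# Same command as in the answer; count the matched lines as a sanity check.
matched=$(awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' \
  "$tmp/file2" "$tmp/file1" | wc -l | tr -d ' ')

printf 'matched lines: %s\n' "$matched"
rm -rf "$tmp"
```

In an interactive bash session the awk command can be prefixed with `time` to get figures comparable to the ones above; the result will depend on the machine and the awk implementation.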

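【Comments】:

Everyone here has deleted their own answer at least once over a difference of just two seconds... A 15-minute difference is huge, even if it is a coincidence. It happens to all of us.
@AkshayHegde - Check my updated answer and let me know your comments.
@VIPINKUMAR: ^1 Good now; (var in array) is the right method.
@AkshayHegde - I tried all the solutions that SO members provided for this question, and I found that this one cuts the time by at least 60%.
@VIPINKUMAR: Also try this one and see how long it takes: awk 'FNR==NR{map[$1]=$2; next}{k=$4"_"$5 }(k in map){print $0,map[k]}' file2 FS='[ _]+' file1

【Solution 4】:

You are starting many external programs unnecessarily. Let read split each incoming line of file2 for you instead of calling awk twice. There is also no need to run grep; awk can do the filtering itself.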

while read -r c d; do
    awk -v field="$c" -v line="$d" -v OFS='\t' '$0 ~ field {print $1,$2,$3,$4"_"line}' file1
done < file2 > output
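
The splitting that `read -r c d` performs can be seen in isolation with a small sketch. The sample line has the same format as file2; the here-document is only there to feed read one line without a pipeline.

```shell
# Sample line in the same format as file2.
sample='NR_024540 11'

# read splits on whitespace (the default IFS): the first word goes to c,
# the remainder of the line goes to d. No external program is started.
read -r c d <<EOF
$sample
EOF

printf 'key=%s value=%s\n' "$c" "$d"
```

This replaces the two `echo | awk` pipelines of the original script, which each started extra processes for every line of file2.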

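【Comments】:

Oops: the file should be the last argument to awk; fixed.
Thanks a lot. Good to know how to use read -r c d :) While playing around, I found that your "read -r c d ..." combined with my grep is the fastest solution.

【Solution 5】:

If the searched string always has the same length (length("NR_024540")==9):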

awk 'NR==FNR{a[$1]=$2;next} (i=substr($4,1,9)) && (i in a){print $0, a[i]}' file2 file1

Explanation:

NR==FNR {                         # process file2
    a[$1]=$2                      # hash record using $1 as the key
    next                          # skip to next record
} 
(i=substr($4,1,9)) && (i in a) {  # read the first 9 bytes of $4 to i and search in a
    print $0, a[i]                # output if found
}
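
Since this approach depends on every key being exactly 9 characters long, it may be worth checking file2 before relying on it. The following is a hedged sketch of such a check, not part of the original answer; the second key, `NR_1234567`, is a hypothetical 10-character value added to show a failing case.

```shell
tmp=$(mktemp -d)

# One key of the expected length and one hypothetical key that is too long.
printf 'NR_024540 11\nNR_1234567 12\n' > "$tmp/file2"

# Count the keys whose length differs from 9; n + 0 prints 0 when none do.
bad=$(awk 'length($1) != 9 { n++ } END { print n + 0 }' "$tmp/file2")

printf 'keys with unexpected length: %s\n' "$bad"
rm -rf "$tmp"
```

If the count is not zero, deriving the key by splitting on "_" is likely the safer choice than a fixed-width substr.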

【Comments】:

【Solution 6】:

awk -F '[[:blank:]_]+' '
   FNR==NR { a[$2]=$3 ;next }
   { if ( $5 in a ) $0 = $0 " " a[$5] }
   7
   ' file2 file1

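Comments:

_ is used as an additional field separator so that the names are easier to compare across the two files (only the numeric part is used).
The 7 is just for fun; it is simply a non-zero value, which makes awk print the line.
No field is modified (NF+1, ...), so the original formatting is kept and only the referenced number is appended.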
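
To make the field numbering concrete, here is a small sketch that prints how one tab-separated sample line is split. The separator is written as `[ \t_]+`, which covers the same space and tab characters as `[[:blank:]_]+`; that spelling is a substitution made for this sketch, for awk builds that might not support POSIX character classes.

```shell
# Split on runs of spaces, tabs and underscores, then show the field count
# and the two fields that make up the identifier.
fields=$(printf 'chr1\t14361\t14829\tNR_024540_0_r_WASH7P_69\n' |
  awk -F '[ \t_]+' '{ print NF, $4, $5 }')

printf '%s\n' "$fields"
```

With this separator the numeric part of the identifier lands in $5 for file1 lines and in $2 for file2 lines, which is what the program above compares.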

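A smaller one-liner (optimized for code size), which assumes that file1 contains no empty lines. If the separator is only a space, [:blank:] can be replaced with a space character.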

awk -F '[[:blank:]_]+' 'NF==3{a[$2]=$3;next}$0=$0" "a[$5]' file2 file1

【Comments】:

【Solution 7】:

No need for awk or sed. This assumes file2 has only one line:

n="`cut -f 2 file2`" ; while read x ; do echo "$x $n" ; done < file1

【Comments】: