Finding common elements from one file in a column of another file and output the entire row of the latter

【Question title】: Finding common elements from one file in a column of another file and output the entire row of the latter
【Posted】: 2015-04-22 22:35:57
【Question】:

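I need to take every item in a list (List.txt), find the rows of another file (Data.txt) where one column contains that item, and write those rows to a third file (output.txt).

Data.txt (tab-delimited)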

some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E   4 Btm;Lol 16 2
T   3 Whizz 13 3

List.txt

Gee
Whiz
Lol

The ideal output.txt would look like

some_data more_data other_data here yet_more_data etc
A B 2 Gee;Whiz;Hello 13 12
A B 2 Gee;Whizz;Hi 56 32
E   4 Btm;Lol 16 2

So I tried a shell script

for ids in List.txt 
do
grep $ids Data.txt >> output.txt
done

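Except that, in the script above, I typed out everything in List.txt in place of the file name (cut and paste, actually).

Unfortunately, it gave me an output.txt that includes the last line, presumably because "Whizz" contains "Whiz".

I also tried cat Data.txt | egrep -F "List.txt", which resulted in grep: conflicting matchers specified; I guess I was too naive. The actual files: List.txt contains a sorted list of 985 words, and Data.txt has 115576 lines and 17 columns.

Some help/guidance would be much appreciated.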

【Question comments】:

Look for a tutorial on the linux/unix join utility. Worst case, man join or info join. Good luck.

Tags: unix grep

【Solution 1】:

Try something like this:

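# Read List.txt one entry per line ("for ids in List.txt" would loop once, over the file name itself)
# TAB stands for a literal tab character
while IFS= read -r ids
do
  grep "[TAB;]$ids[TAB;]" Data.txt >> output.txt
done < List.txt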

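But it has two drawbacks:

Data.txt is scanned multiple times (once per list entry).
The same line can be output more than once.

If that is a problem, try the two-step version: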

sed -e 's/.*/[TAB;]&[TAB;]/' List.txt > List_mod.txt
grep -f List_mod.txt Data.txt > output.txt

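Note: a TAB character can be entered on the command line by pressing Ctrl-V followed by the Tab key, and in an editor with the Tab key. Check that your editor does not convert tabs into a run of spaces.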
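
As a variation on the two-step version, the sketch below avoids typing a literal tab by letting printf produce it. It is only an illustration under a few assumptions: the sample files are recreated from the question, the variable name tab is arbitrary, and the list entries contain no regular-expression metacharacters (grep -f treats each line of List_mod.txt as a pattern).

```shell
#!/bin/sh
# Sample inputs mirroring the question (Data.txt is tab-delimited).
printf 'some_data\tmore_data\tother_data\there\tyet_more_data\tetc\n' > Data.txt
printf 'A\tB\t2\tGee;Whiz;Hello\t13\t12\n' >> Data.txt
printf 'A\tB\t2\tGee;Whizz;Hi\t56\t32\n' >> Data.txt
printf 'E\t\t4\tBtm;Lol\t16\t2\n' >> Data.txt
printf 'T\t\t3\tWhizz\t13\t3\n' >> Data.txt
printf 'Gee\nWhiz\nLol\n' > List.txt

# Keep a real tab character in a variable instead of typing it.
tab=$(printf '\t')

# Step 1: wrap every list entry as [<tab>;]entry[<tab>;]
sed "s/.*/[${tab};]&[${tab};]/" List.txt > List_mod.txt

# Step 2: a single pass over Data.txt; each matching line is printed once.
grep -f List_mod.txt Data.txt > output.txt
```

On the sample data this keeps the three rows containing Gee, Whiz or Lol as a whole word and drops the Whizz-only row. The header line is not kept, because it matches none of the patterns.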

【Comments】:

【Solution 2】:

The UNIX tool for general text processing is "awk":

awk '
NR==FNR { list[$0]; next }
{
    for (word in list) {
        if ($0 ~ "[\t;]" word "[\t;]") {
            print
            next
        }
    }
}
' List.txt Data.txt > output.txt
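
A possible variation, sketched here under stated assumptions rather than taken from the answer: if the words are known to sit in one column, that column can be split on ";" and each piece looked up in the array directly. This does one exact lookup per piece instead of one regular-expression test per list word per line, and the comparison is literal. The sketch assumes the words are in column 4, as in the sample, and that line 1 of Data.txt is a header to be kept; the array name parts is arbitrary.

```shell
#!/bin/sh
# Sample inputs mirroring the question (Data.txt is tab-delimited).
printf 'some_data\tmore_data\tother_data\there\tyet_more_data\tetc\n' > Data.txt
printf 'A\tB\t2\tGee;Whiz;Hello\t13\t12\n' >> Data.txt
printf 'A\tB\t2\tGee;Whizz;Hi\t56\t32\n' >> Data.txt
printf 'E\t\t4\tBtm;Lol\t16\t2\n' >> Data.txt
printf 'T\t\t3\tWhizz\t13\t3\n' >> Data.txt
printf 'Gee\nWhiz\nLol\n' > List.txt

awk -F '\t' '
NR == FNR { list[$0]; next }      # first file: remember every list word
FNR == 1  { print; next }         # assumption: line 1 of Data.txt is a header
{
    n = split($4, parts, ";")     # assumption: column 4 holds the ;-separated words
    for (i = 1; i <= n; i++) {
        if (parts[i] in list) {   # exact string lookup, no regex involved
            print
            next                  # print each row at most once
        }
    }
}
' List.txt Data.txt > output.txt
```

On the sample data this writes the header plus the three matching rows, which is the output.txt shown in the question. For the real 17-column file, the 4 in $4 would need to be changed to whichever column holds the words.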

【Comments】: