Comparing 2 files with a searchable key: answers

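【Question title】: comparing 2 files with a searchable key
【Posted】: 2017-09-17 00:56:34
【Question】:

Good morning,

I have been trying to adapt the solution to a similar question, "comparing 2 files using awk", but I can't seem to wrap my head around it. Looking for some help.

I have 2 files to compare. The mockup contents of file1 and file2 are as follows:

file1: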

50       0004312805201        06740         2310821                                                                                                                                
50      0004986504201        00845         2310837                                                                                                                                
50      0003913155201        47679         2310762                                                                                                                                
50      0004997395201        2035          2311180                                                                                                                                
50      0001147242201        15000         23108723                                                                                                                                
50      0005771878201        13545         I3840000

file2:

0003913155 A

0005771878 A

0004312805 A

0000000015 B

0000000012 B

1111111111 E

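I need to take a substring of field 2 from file1 to produce a 10-character searchable key, and then find the matching value in field 1 of file2.

If a match is found, print the entire file1 line with field 2 from file2 appended as a new field.

If there is no match, print the entire file1 line with the string "NO" appended as a new field. The output should preferably be redirected to a file.

Sample output is shown below.

Output: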

50       0004312805201        06740         2310821 A                                                                                                                               
50      0004986504201        00845         2310837 NO                                                                                                                             
50      0003913155201        47679         2310762 A                                                                                                                               
50      0004997395201        2035          2311180 NO                                                                                                                             
50      0001147242201        15000         23108723 NO                                                                                                                               
50      0005771878201        13545         I3840000 A

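How would you suggest I tackle this with awk or GNU awk? I am having trouble preparing the searchable key substring and using it to build an array in awk/GNU awk.

Any help would be appreciated. I'm spinning my wheels at this point.

Thanks.

【Comments】:

"Produce a 10 character length searchable key": can't the key simply start at the beginning of the field?
Possible duplicate of "using awk to match a column in log file and print the entire line"

Tags: awk gawk

【Solution 1】: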

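awk '
     # first file (file2): map each key in field 1 to its flag in field 2
     FNR==NR { a[$1]=$2; next }
     # second file (file1): look up the first 10 characters of field 2
     { s=substr($2,1,10); print $0, (s in a ? a[s] : "No") }
    ' file2 file1 > your_output_file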
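
The key that the second rule looks up can be inspected on its own. This is only a small sketch using one sample line from the question; the output file name key.out is an arbitrary choice for illustration.

```shell
# Feed one sample file1 line to awk and keep characters 1 through 10 of field 2.
echo '50      0004312805201        06740         2310821' |
    awk '{ print substr($2, 1, 10) }' > key.out

# Show the extracted key.
cat key.out
```

The printed value is the string that is then tested with `in` against the array built from file2.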

Input:

$ cat file1
50 0004312805201 06740 2310821
50 0004986504201 00845 2310837
50 0003913155201 47679 2310762
50 0004997395201 2035 2311180
50 0001147242201 15000 23108723
50 0005771878201 13545 I3840000 

$ cat file2
0003913155 A
0005771878 A
0004312805 A
0000000015 B
0000000012 B
1111111111 E

Output

$ awk 'FNR==NR{a[$1]=$2;next}{s=substr($2,1,10);print $0, (s in a ? a[s] : "No") }' file2 file1
50 0004312805201 06740 2310821 A
50 0004986504201 00845 2310837 No
50 0003913155201 47679 2310762 A
50 0004997395201 2035 2311180 No
50 0001147242201 15000 23108723 No
50 0005771878201 13545 I3840000  A
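
For experimenting, the following is a self-contained sketch of the same approach, assuming a shortened copy of the question's sample data written to the current directory (run it in a scratch directory). Two things differ from the answer above and are assumptions rather than part of it: the literal is "NO" in upper case, as the question requested, and `if (NF)` skips the blank lines that appear in the question's file2. The output name out.txt is arbitrary.

```shell
# Shortened sample of file1 from the question (column spacing preserved).
cat > file1 <<'EOF'
50      0004312805201        06740         2310821
50      0004986504201        00845         2310837
50      0003913155201        47679         2310762
EOF

# Shortened sample of file2, including a blank line as in the question.
cat > file2 <<'EOF'
0003913155 A

0004312805 A
1111111111 E
EOF

awk '
    # file2: remember key -> flag, ignoring lines that have no fields
    FNR==NR { if (NF) a[$1] = $2; next }
    # file1: build the 10-character key and append the flag or "NO"
    { key = substr($2, 1, 10); print $0, (key in a ? a[key] : "NO") }
' file2 file1 > out.txt

cat out.txt
```

Because only `print $0, ...` is used and no field is assigned, the original spacing of each file1 line is left untouched and the new field is appended after a single space.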

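【Comments】:

I'll review the feedback provided and reply to the comments tomorrow. Thanks to everyone who responded.

【Solution 2】:

I am not sure what the OP means by "produce a 10 character length searchable key value". I interpreted it as: the value in field 1 of file2 must be a substring of field 2 in file1.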

$ cat tst.awk
/^[0-9]/ && NR==FNR { a[$1]=$2; next }   # read values from file2 in array
/^[0-9]/{
   f=0;
   for (i in a){                         # loop over field 1 of file2
      if (index($2, i)){                 # if i can be found in field 2 of file1
         print $0, a[i];                 # print $0 with $2 from file2
         f++;
         break;
      }
   }
}
/^[0-9]/ && !f{ print $0, "NO" }         # if no match, print "NO" line

Input

$ cat file1
50 0004312805201 06740 2310821
50 0004986504201 00845 2310837
50 0003913155201 47679 2310762
50 0004997395201 2035 2311180
50 0001147242201 15000 23108723
50 0005771878201 13545 I3840000

and

$ cat file2
0003913155 A

0005771878 A

0004312805 A

0000000015 B

0000000012 B

1111111111 E

Calling tst.awk produces the output:

$ awk -f tst.awk file2 file1
50 0004312805201 06740 2310821 A
50 0004986504201 00845 2310837 NO
50 0003913155201 47679 2310762 A
50 0004997395201 2035 2311180 NO
50 0001147242201 15000 23108723 NO
50 0005771878201 13545 I3840000 A

Or, as a one-liner:

$ awk '/^[0-9]/ && NR==FNR { a[$1]=$2; next } /^[0-9]/{f=0;for (i in a){if (index($2, i)){print $0, a[i];f++;break;}}}/^[0-9]/ && !f{ print $0, "NO" }' file2 file1
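
The two interpretations give the same result on the question's data, but they can diverge. The sketch below uses hypothetical one-line inputs (demo1 and demo2, invented here so that the key sits in the middle of field 2, not at its start) to show the difference between a prefix lookup with `substr` and a search anywhere in the field with `index`. The file names prefix.out and anywhere.out are arbitrary.

```shell
# Hypothetical inputs: the key 0004312805 starts at character 4 of field 2.
printf '%s\n' '50 9990004312805 06740 2310821' > demo1
printf '%s\n' '0004312805 A' > demo2

# Prefix lookup: only the first 10 characters of field 2 are compared.
awk 'FNR==NR { a[$1] = $2; next }
     { k = substr($2, 1, 10); print (k in a ? a[k] : "NO") }' demo2 demo1 > prefix.out

# Substring search: the key may appear anywhere in field 2.
awk 'FNR==NR { a[$1] = $2; next }
     {
         r = "NO"
         for (i in a)
             if (index($2, i)) { r = a[i]; break }
         print r
     }' demo2 demo1 > anywhere.out

echo "prefix=$(cat prefix.out) anywhere=$(cat anywhere.out)"
```

Besides the difference in matching, the `index` variant loops over every key for each file1 line, while the `substr` variant does a single array lookup per line, which matters when file2 is large.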

【Comments】: