【发布时间】:2021-06-23 03:50:40
【问题描述】:
我正在处理以多列格式排列的CSV日志的后处理,顺序如下:第一列对应行号(ID),第二列包含其人口(POP,样本属于这个 ID),第三列(dG)代表这个 ID 的一些固有值(它总是负数):
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500
27, 1, -6.6500
28, 5, -6.5700
29, 3, -6.5500
30, 2, -6.4600
31, 2, -6.4500
32, 1, -6.3000
33, 7, -6.2900
34, 1, -6.2100
35, 1, -6.2000
36, 3, -6.1800
37, 1, -6.1700
38, 4, -6.1300
39, 1, -6.1000
40, 2, -6.0600
41, 3, -6.0600
42, 8, -6.0200
43, 2, -6.0100
44, 1, -6.0100
45, 1, -5.9800
46, 2, -5.9700
47, 1, -5.9300
48, 6, -5.8800
49, 4, -5.8300
50, 4, -5.8000
51, 2, -5.7800
52, 3, -5.7200
53, 1, -5.6600
54, 1, -5.6500
55, 4, -5.6400
56, 2, -5.6300
57, 1, -5.5700
58, 1, -5.5600
59, 1, -5.5200
60, 1, -5.5000
61, 3, -5.4200
62, 4, -5.3600
63, 1, -5.3100
64, 5, -5.2500
65, 5, -5.1600
66, 1, -5.1100
67, 1, -5.0300
68, 1, -4.9700
69, 1, -4.7700
70, 2, -4.6600
为了减少行数,我过滤了这个 CSV,目的是在第二列 (POP) 中搜索编号最高的行,使用以下 AWK 表达式:
# search CSV for the line with the highest POP and save all lines before it, while keeping minimal number of the lines (3) in the case if this line is found at the beginning of CSV.
awk -v min_lines=3 -F ", " 'a < $2 {for(idx=0; idx < i; idx++) {print arr[idx]} print $0; a=int($2); i=0; printed=NR} a > $2 && NR > 1 {arr[i]=$0; i++}END{if(printed <= min_lines) {for(idx = 0; idx <= min_lines - printed; idx++){print arr[idx]}}}' input.csv > output.csv
从而获得以下减少的输出 CSV,由于搜索字符串(具有最高 POP)位于第 26 行,因此仍然有很多行:
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500
如何通过修改我的 AWK 表达式(或将其传递给其他内容)来进一步自定义我的过滤器,以便仅考虑第三列负值 dG 与第一行(哪个值最负)?例如,仅考虑与第一行相比 dG 差异不超过 20% 的行,同时保持所有休息条件相同:
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
【问题讨论】: