【Question title】: use awk to delete all lines by uniq on certain column if there are more than 2 values in other column equal to given value
【Posted】: 2022-10-24 02:29:02
【Question description】:

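I have a large ASCII file with 6 columns. The number of lines in the file is a multiple of 24 (the fourth column is a date in %Y%m%d%H%M format: 24 lines --> 1 day), and each group of 24 lines represents one unique measurement station (columns 1, 2, 5 and 6 have the same values across those 24 lines).

Here is a cut-down example of 2x24 lines, i.e. 2 different stations: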

1_200061208 0 0.000000 202202150000 36.680573 15.094369
1_200061208 0 0.000000 202202150100 36.680573 15.094369
1_200061208 0 -99999 202202150200 36.680573 15.094369
1_200061208 0 0.000000 202202150300 36.680573 15.094369
1_200061208 0 0.000000 202202150400 36.680573 15.094369
1_200061208 0 0.000000 202202150500 36.680573 15.094369
1_200061208 0 0.000000 202202150600 36.680573 15.094369
1_200061208 0 0.000000 202202150700 36.680573 15.094369
1_200061208 0 -99999 202202150800 36.680573 15.094369
1_200061208 0 0.000000 202202150900 36.680573 15.094369
1_200061208 0 0.000000 202202151000 36.680573 15.094369
1_200061208 0 0.000000 202202151100 36.680573 15.094369
1_200061208 0 0.000000 202202151200 36.680573 15.094369
1_200061208 0 0.000000 202202151300 36.680573 15.094369
1_200061208 0 0.000000 202202151400 36.680573 15.094369
1_200061208 0 0.000000 202202151500 36.680573 15.094369
1_200061208 0 0.000000 202202151600 36.680573 15.094369
1_200061208 0 0.000000 202202151700 36.680573 15.094369
1_200061208 0 0.000000 202202151800 36.680573 15.094369
1_200061208 0 0.000000 202202151900 36.680573 15.094369
1_200061208 0 0.000000 202202152000 36.680573 15.094369
1_200061208 0 0.000000 202202152100 36.680573 15.094369
1_200061208 0 0.000000 202202152200 36.680573 15.094369
1_200061208 0 0.000000 202202152300 36.680573 15.094369
1_200061190 0 0.000000 202202150000 36.728195 14.993018
1_200061190 0 0.000000 202202150100 36.728195 14.993018
1_200061190 0 0.000000 202202150200 36.728195 14.993018
1_200061190 0 0.000000 202202150300 36.728195 14.993018
1_200061190 0 0.000000 202202150400 36.728195 14.993018
1_200061190 0 0.000000 202202150500 36.728195 14.993018
1_200061190 0 0.000000 202202150600 36.728195 14.993018
1_200061190 0 0.000000 202202150700 36.728195 14.993018
1_200061190 0 0.000000 202202150800 36.728195 14.993018
1_200061190 0 0.000000 202202150900 36.728195 14.993018
1_200061190 0 0.000000 202202151000 36.728195 14.993018
1_200061190 0 0.000000 202202151100 36.728195 14.993018
1_200061190 0 0.000000 202202151200 36.728195 14.993018
1_200061190 0 0.000000 202202151300 36.728195 14.993018
1_200061190 0 0.000000 202202151400 36.728195 14.993018
1_200061190 0 -99999 202202151500 36.728195 14.993018
1_200061190 0 0.000000 202202151600 36.728195 14.993018
1_200061190 0 0.000000 202202151700 36.728195 14.993018
1_200061190 0 0.000000 202202151800 36.728195 14.993018
1_200061190 0 0.000000 202202151900 36.728195 14.993018
1_200061190 0 0.000000 202202152000 36.728195 14.993018
1_200061190 0 0.000000 202202152100 36.728195 14.993018
1_200061190 0 0.000000 202202152200 36.728195 14.993018
1_200061190 0 0.000000 202202152300 36.728195 14.993018

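My goal is to check whether, in the third column, -99999 occurs more than once per day (24 lines) for the same station (columns 1, 2, 5, 6); if it does, I want to delete all 24 lines (in other words, I want to delete that station's entire day of measurements).

The expected output is the same file without the 24xn lines that satisfy my check.

For the example given, the expected output is: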

1_200061190 0 0.000000 202202150000 36.728195 14.993018
1_200061190 0 0.000000 202202150100 36.728195 14.993018
1_200061190 0 0.000000 202202150200 36.728195 14.993018
1_200061190 0 0.000000 202202150300 36.728195 14.993018
1_200061190 0 0.000000 202202150400 36.728195 14.993018
1_200061190 0 0.000000 202202150500 36.728195 14.993018
1_200061190 0 0.000000 202202150600 36.728195 14.993018
1_200061190 0 0.000000 202202150700 36.728195 14.993018
1_200061190 0 0.000000 202202150800 36.728195 14.993018
1_200061190 0 0.000000 202202150900 36.728195 14.993018
1_200061190 0 0.000000 202202151000 36.728195 14.993018
1_200061190 0 0.000000 202202151100 36.728195 14.993018
1_200061190 0 0.000000 202202151200 36.728195 14.993018
1_200061190 0 0.000000 202202151300 36.728195 14.993018
1_200061190 0 0.000000 202202151400 36.728195 14.993018
1_200061190 0 -99999 202202151500 36.728195 14.993018
1_200061190 0 0.000000 202202151600 36.728195 14.993018
1_200061190 0 0.000000 202202151700 36.728195 14.993018
1_200061190 0 0.000000 202202151800 36.728195 14.993018
1_200061190 0 0.000000 202202151900 36.728195 14.993018
1_200061190 0 0.000000 202202152000 36.728195 14.993018
1_200061190 0 0.000000 202202152100 36.728195 14.993018
1_200061190 0 0.000000 202202152200 36.728195 14.993018
1_200061190 0 0.000000 202202152300 36.728195 14.993018

Please give me the code.

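【Comments】:

  • What have you tried? Where are you stuck? Please see How to Ask and the tour.
  • I used an associative array, but I can only print the keys whose occurrence count is greater than 1: awk '($3 =="-99999") {a[$1 FS $2 FS $5 FS $6]++} END {for (i in a) {if (a[i] >1) print i,a[i] }}' filename.txt
  • If that big block of text is your sample input, what is the expected output? Don't add information in comments, where it can't be formatted and may be missed; edit your question to include all relevant information.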

Tags: shell unix awk


【Solution 1】:

One awk idea that uses 2 passes over the input file:

awk '
FNR==NR { if ($3 == "-99999")             # 1st pass: collect count of "-99999" instances
             a[$1 FS $2 FS $5 FS $6]++
          next
        }

 a[$1 FS $2 FS $5 FS $6]+0 <= 1           # 2nd pass: print current line if "-99999" count <= 1; 
                                          # "+0" ==> force non-existent array entry to be processed as a numeric having value of "0"
' filename.txt filename.txt
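
The key above identifies a station only, so if the file holds several days for the same station, the -99999 count accumulates across all of them. A minimal sketch of a day-aware variant, assuming the first 8 characters of column 4 (the %Y%m%d part) identify the day. The sample data, the `bad` array name and the 3-line "days" are made up for illustration and are not part of the original answer:

```shell
# Hypothetical sample: one station, two "days" of 3 lines each (shortened from 24).
sample=$(mktemp)
cat > "$sample" <<'EOF'
S1 0 -99999 202202150000 36.68 15.09
S1 0 -99999 202202150100 36.68 15.09
S1 0 0.000000 202202150200 36.68 15.09
S1 0 -99999 202202160000 36.68 15.09
S1 0 0.000000 202202160100 36.68 15.09
S1 0 0.000000 202202160200 36.68 15.09
EOF

# Same two-pass idea, but the key also carries the day (first 8 chars of $4).
result=$(awk '
{ key = $1 FS $2 FS $5 FS $6 FS substr($4, 1, 8) }
FNR == NR { if ($3 == "-99999") bad[key]++; next }   # 1st pass: count per station and day
bad[key] + 0 <= 1                                    # 2nd pass: keep station-days with at most one -99999
' "$sample" "$sample")
rm -f "$sample"

printf '%s\n' "$result"
```

With this sample, the first day (two -99999 values) is dropped and the second day (one -99999 value) is kept. If the real file only ever contains one day per station, the original key is sufficient.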

This produces:

1_200061190 0 0.000000 202202150000 36.728195 14.993018
1_200061190 0 0.000000 202202150100 36.728195 14.993018
1_200061190 0 0.000000 202202150200 36.728195 14.993018
1_200061190 0 0.000000 202202150300 36.728195 14.993018
1_200061190 0 0.000000 202202150400 36.728195 14.993018
1_200061190 0 0.000000 202202150500 36.728195 14.993018
1_200061190 0 0.000000 202202150600 36.728195 14.993018
1_200061190 0 0.000000 202202150700 36.728195 14.993018
1_200061190 0 0.000000 202202150800 36.728195 14.993018
1_200061190 0 0.000000 202202150900 36.728195 14.993018
1_200061190 0 0.000000 202202151000 36.728195 14.993018
1_200061190 0 0.000000 202202151100 36.728195 14.993018
1_200061190 0 0.000000 202202151200 36.728195 14.993018
1_200061190 0 0.000000 202202151300 36.728195 14.993018
1_200061190 0 0.000000 202202151400 36.728195 14.993018
1_200061190 0 -99999 202202151500 36.728195 14.993018
1_200061190 0 0.000000 202202151600 36.728195 14.993018
1_200061190 0 0.000000 202202151700 36.728195 14.993018
1_200061190 0 0.000000 202202151800 36.728195 14.993018
1_200061190 0 0.000000 202202151900 36.728195 14.993018
1_200061190 0 0.000000 202202152000 36.728195 14.993018
1_200061190 0 0.000000 202202152100 36.728195 14.993018
1_200061190 0 0.000000 202202152200 36.728195 14.993018
1_200061190 0 0.000000 202202152300 36.728195 14.993018

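【Comments】:

  • Thanks a lot @markp-fuso! It works like a charm. Now I am going to study your answer carefully...
【Solution 2】:

Another awk idea that requires a single pass through the input file: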

awk '

function print_block() {                 # dump lines from array to stdout
    if (count+0 <= 1)                    # if count <= 1 ...
       for (i=1;i<=lineno;i++)           # loop through array ...
           print lines[i]                # printing array entries to stdout
    delete lines                         # delete array entries
    count=lineno=0                       # reset counters
}
    { key=$1 FS $2 FS $5 FS $6

      if (key != prevkey) {              # if looking at a new key then ...
         print_block()                   # dump previous block of lines to stdout
         prevkey=key
      }

      if ($3 == "-99999")                # keep count of times we see "-99999"
         count++

      if (count <= 1)                    # if count <= 1 then ...
         lines[++lineno]=$0              # save current line in array
    }

END { print_block() }                    # flush last block of lines to stdout
' filename.txt

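Notes:

  • lines for a given key (aka station) are saved in an array until all 24 lines have been read (or until the -99999 count exceeds 1), then ...
  • if the -99999 count is <= 1, the lines are dumped (from the array) to stdout
  • but if the -99999 count is > 1, the lines (in the array) are discarded
  • memory usage is limited to what is needed to hold at most 24 lines in the array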
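
One portability note on the buffering approach: `delete lines` clears a whole array in one statement, which gawk, mawk and BWK awk accept, but older POSIX editions only defined `delete array[index]`. A minimal sketch of the same buffer-and-flush idea using `split("", lines)`, the traditional portable way to empty an array. The function name `flush_block`, the variable `n` and the 3-line sample blocks are made up for illustration:

```shell
# Hypothetical 3-line blocks standing in for the 24-line stations.
result=$(printf '%s\n' \
  'A 0 -99999 202202150000 1 1' \
  'A 0 -99999 202202150100 1 1' \
  'A 0 0.000000 202202150200 1 1' \
  'B 0 0.000000 202202150000 2 2' \
  'B 0 -99999 202202150100 2 2' \
  'B 0 0.000000 202202150200 2 2' |
awk '
function flush_block(   i) {              # "i" is a local variable
    if (count <= 1)                       # at most one -99999: print the buffered block
        for (i = 1; i <= n; i++) print lines[i]
    split("", lines)                      # portable way to empty an array
    count = n = 0                         # reset counters for the next block
}
{ key = $1 FS $2 FS $5 FS $6 }
key != prevkey { flush_block(); prevkey = key }
$3 == "-99999" { count++ }
{ lines[++n] = $0 }
END { flush_block() }                     # flush the last block
')

printf '%s\n' "$result"
```

Here station A (two -99999 values) is discarded and station B (one -99999 value) is printed. Unlike the code above, this sketch keeps buffering after the count exceeds 1, which is simpler but holds a few more lines in memory.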

This produces:

1_200061190 0 0.000000 202202150000 36.728195 14.993018
1_200061190 0 0.000000 202202150100 36.728195 14.993018
1_200061190 0 0.000000 202202150200 36.728195 14.993018
1_200061190 0 0.000000 202202150300 36.728195 14.993018
1_200061190 0 0.000000 202202150400 36.728195 14.993018
1_200061190 0 0.000000 202202150500 36.728195 14.993018
1_200061190 0 0.000000 202202150600 36.728195 14.993018
1_200061190 0 0.000000 202202150700 36.728195 14.993018
1_200061190 0 0.000000 202202150800 36.728195 14.993018
1_200061190 0 0.000000 202202150900 36.728195 14.993018
1_200061190 0 0.000000 202202151000 36.728195 14.993018
1_200061190 0 0.000000 202202151100 36.728195 14.993018
1_200061190 0 0.000000 202202151200 36.728195 14.993018
1_200061190 0 0.000000 202202151300 36.728195 14.993018
1_200061190 0 0.000000 202202151400 36.728195 14.993018
1_200061190 0 -99999 202202151500 36.728195 14.993018
1_200061190 0 0.000000 202202151600 36.728195 14.993018
1_200061190 0 0.000000 202202151700 36.728195 14.993018
1_200061190 0 0.000000 202202151800 36.728195 14.993018
1_200061190 0 0.000000 202202151900 36.728195 14.993018
1_200061190 0 0.000000 202202152000 36.728195 14.993018
1_200061190 0 0.000000 202202152100 36.728195 14.993018
1_200061190 0 0.000000 202202152200 36.728195 14.993018
1_200061190 0 0.000000 202202152300 36.728195 14.993018

【Comments】:

【Solution 3】:
    $ cat tst.awk
    { key = $1 FS $2 FS $5 FS $6 }
    key != prev {
        prt()
        prev = key
    }
    $3 == -99999 { cnt++ }
    { rec = rec $0 ORS }
    END { prt() }
    
    function prt() {
        if ( cnt < 2 ) {
            printf "%s", rec
        }
        rec = cnt = ""
    }
    

    $ awk -f tst.awk file
    1_200061190 0 0.000000 202202150000 36.728195 14.993018
    1_200061190 0 0.000000 202202150100 36.728195 14.993018
    1_200061190 0 0.000000 202202150200 36.728195 14.993018
    1_200061190 0 0.000000 202202150300 36.728195 14.993018
    1_200061190 0 0.000000 202202150400 36.728195 14.993018
    1_200061190 0 0.000000 202202150500 36.728195 14.993018
    1_200061190 0 0.000000 202202150600 36.728195 14.993018
    1_200061190 0 0.000000 202202150700 36.728195 14.993018
    1_200061190 0 0.000000 202202150800 36.728195 14.993018
    1_200061190 0 0.000000 202202150900 36.728195 14.993018
    1_200061190 0 0.000000 202202151000 36.728195 14.993018
    1_200061190 0 0.000000 202202151100 36.728195 14.993018
    1_200061190 0 0.000000 202202151200 36.728195 14.993018
    1_200061190 0 0.000000 202202151300 36.728195 14.993018
    1_200061190 0 0.000000 202202151400 36.728195 14.993018
    1_200061190 0 -99999 202202151500 36.728195 14.993018
    1_200061190 0 0.000000 202202151600 36.728195 14.993018
    1_200061190 0 0.000000 202202151700 36.728195 14.993018
    1_200061190 0 0.000000 202202151800 36.728195 14.993018
    1_200061190 0 0.000000 202202151900 36.728195 14.993018
    1_200061190 0 0.000000 202202152000 36.728195 14.993018
    1_200061190 0 0.000000 202202152100 36.728195 14.993018
    1_200061190 0 0.000000 202202152200 36.728195 14.993018
    1_200061190 0 0.000000 202202152300 36.728195 14.993018
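
The scripts above all detect a new block by a change in columns 1, 2, 5 and 6. If the file really consists of strict 24-line blocks, as the question states, a block can also be delimited by line count alone, which keeps separate days of the same station apart. A minimal sketch under that assumption; the `size` variable and the sample lines are made up for illustration, and `size=2` stands in for 24:

```shell
# Hypothetical input: one station, two "days" of 2 lines each.
result=$(printf '%s\n' \
  'A 0 -99999 202202150000 1 1' \
  'A 0 -99999 202202150100 1 1' \
  'A 0 0.000000 202202160000 1 1' \
  'A 0 -99999 202202160100 1 1' |
awk -v size=2 '
$3 == "-99999" { cnt++ }                  # count missing values in the current block
{ rec = rec $0 ORS }                      # buffer the block as one string
NR % size == 0 {                          # end of a block of "size" lines
    if (cnt < 2) printf "%s", rec         # keep blocks with at most one -99999
    rec = ""; cnt = 0
}
')

printf '%s\n' "$result"
```

For the real data this would be run with `-v size=24`. It silently misbehaves if any block is shorter or longer than `size`, so it only fits input whose line count per station-day is guaranteed.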
    

【Comments】:
