Remove all lines except the last which start with the same string: answers

【Question title】: Remove all lines except the last which start with the same string
【Posted】: 2015-10-23 18:06:59
【Question description】:

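I am using awk to process a file and filter it down to the specific lines of interest. From the resulting output, I want to remove every line that starts with the same string as a later line, keeping only the last such line.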

Here is a sample of the generated output:

this is a line
duplicate remove me
duplicate this should go too
another unrelated line
duplicate but keep me
example remove this line
example but keep this one
more unrelated text

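Lines 2 and 3 should be removed because they start with duplicate, as does line 5. Line 5 should be kept because it is the last line that starts with duplicate.

The same goes for line 6: it starts with example, as does line 7. Line 7 should be kept because it is the last line that starts with example.

Given the example above, I want to produce the following output: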

this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text

How can I do this?

I tried the following, but it does not work correctly:

awk -f initialProcessing.awk largeFile | awk '{currentMatch=$1; line=$0; getline; nextMatch=$1; if (currentMatch != nextMatch) {print line}}' -

【Question comments】:

Your example is unclear.

Tags: bash shell unix awk

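【Solution 1】:

Why not read the file from the end to the beginning and print only the first line that contains duplicate? That way you don't have to worry about what has or hasn't been printed, holding on to lines while waiting for the next one, and so on.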

tac file | awk '/duplicate/ {if (f) next; f=1}1' | tac

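This sets a flag f the first time duplicate is seen. From the second occurrence on, the flag causes the line to be skipped.

If you want to make it generic, so that for each first word only the last line is printed, use the array approach: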

tac file | awk '!seen[$1]++' | tac

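This keeps track of the first words seen so far. They are stored in the array seen[], so !seen[$1]++ is true the first time a given $1 appears; from the second time on it evaluates to false and the line is not printed. Because the input is reversed by tac, the "first" occurrence awk sees is the last one in the original file.

Test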

$ tac a | awk '!seen[$1]++' | tac
this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text
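
For readers unfamiliar with the idiom, the condition !seen[$1]++ can be spelled out in a longer form. The following is only a sketch that should behave like the one-liner above; the function name keep_last_by_first_word is made up for illustration, and it assumes tac is available (it ships with GNU coreutils).

```shell
# Hypothetical helper: reads lines on stdin and keeps only the last line for each first word.
keep_last_by_first_word() {
    tac | awk '{
        if (seen[$1] == 0) print   # reading bottom-up: first time this first word shows up
        seen[$1]++                 # remember it, so lines higher up with the same word are skipped
    }' | tac
}

# Example run on the sample input from the question
printf '%s\n' \
    'this is a line' \
    'duplicate remove me' \
    'duplicate this should go too' \
    'another unrelated line' \
    'duplicate but keep me' \
    'example remove this line' \
    'example but keep this one' \
    'more unrelated text' | keep_last_by_first_word
```

Reading from stdin lets the function sit at the end of the existing awk pipeline instead of needing an intermediate file.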

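【Comments】:

It is not just duplicate; it is any repeated string, see the example.
@bkmoney Oh, thanks for the comment. Fixed by making it more generic.
This is exactly what I was after. Thanks.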

【Solution 2】:

You can use an (associative) array to always keep the last occurrence:

awk '{last[$1]=$0;} END{for (i in last) print last[i];}' file
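
One caveat: for (i in last) visits the array elements in an unspecified order, so the output of this command is not guaranteed to follow the order of the input. A minimal sketch of a variant that keeps the input order without tac is shown below. It assumes the whole input fits in memory, and the names keep_last_in_order, line, key and lastpos are chosen here for illustration only.

```shell
# Hypothetical helper: reads lines on stdin, prints the last line for each first word, in input order.
keep_last_in_order() {
    awk '
        # Buffer every line and remember the line number of the latest occurrence of its first word
        { line[NR] = $0; key[NR] = $1; lastpos[$1] = NR }
        END {
            for (n = 1; n <= NR; n++)
                if (lastpos[key[n]] == n)   # this line is the final occurrence of its first word
                    print line[n]
        }
    '
}

# Example run on the sample input from the question
printf '%s\n' \
    'this is a line' \
    'duplicate remove me' \
    'duplicate this should go too' \
    'another unrelated line' \
    'duplicate but keep me' \
    'example remove this line' \
    'example but keep this one' \
    'more unrelated text' | keep_last_in_order
```

The trade-off against the tac approach is memory: every line is held until END, whereas tac file | awk '!seen[$1]++' | tac only stores the first words in awk.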

【Comments】: