How to delete lines from TXT or CSV with specific pattern

【Question title】: How to delete lines from TXT or CSV with specific pattern
【Posted】: 2016-12-13 16:44:34
【Question description】:

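I have a txt file in the format shown below.

The goal is to delete the lines that start with "Subtotal Group 1", "Subtotal Group 2" or "Grand Total" (these strings are always at the beginning of the line), but only when the rest of the line is empty or padded with spaces.

This can be done with awk or sed in a single pass, but I am currently using 3 separate steps (one per string). A more generic syntax would be great. Thanks, everyone.

My txt file looks like this: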

Some Generic Headers at the beginning of the file
=======================================================================
Group 1
=======================================================================
6.00   500 First Line Text                                      1685.52
1.00   502 Second Line Text                                      280.98
       530 Other Line text                                       157.32
_________________________________________________________________________
Subtotal Group 1
Subtotal Group 1
Subtotal Group 1
Subtotal Group 1                                                2123.82
Subtotal Group 1
Subtotal Group 1

========================================================================
GROUP 2
========================================================================

7.00   701 First Line Text                                        53.63
       711 Second Line text                                       97.85
7.00   740 Third Line text                                       157.32
       741 Any Line text                                         157.32
       742 Any Line text                                          18.04
       801 Last Line text                                        128.63
_______________________________________________________________________
Subtotal Group 2
Subtotal Group 2
Subtotal Group 2
Subtotal Group 2
Subtotal Group 2                                                 612.79
Subtotal Group 2
_______________________________________________________________________
Grand total
Grand total
Grand total
Grand total
Grand total
Grand total
Grand total                                                      1511.03

The output I want to achieve is:

Some Generic Headers at the beginning of the file
=======================================================================
Group 1
=======================================================================
6.00   500 First Line Text                                      1685.52
1.00   502 Second Line Text                                      280.98
       530 Other Line text                                       157.32
_______________________________________________________________________
Subtotal Group 1                                                2123.82

=======================================================================
GROUP 2
=======================================================================

7.00   701 First Line Text                                        53.63
       711 Second Line text                                       97.85
7.00   740 Third Line text                                       157.32
       741 Any Line text                                         157.32
       742 Any Line text                                          18.04
       801 Last Line text                                        128.63
_______________________________________________________________________
Subtotal Group 2                                                 612.79
_______________________________________________________________________
Grand total                                                     1511.03

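【Question discussion】:

What are Field1,... , numbers? Do they start with anything other than Subtotal or Grand Total?
@David You're right, that is confusing. I'm going to edit the question. Thanks.
@EdMorton I have a CSV (a few days ago you helped me a lot to format it and convert it into a readable, aligned txt). I now have an almost printable txt, and the last thing to fix is deleting the extra useless lines. More efficient coding could probably have been used earlier, but I'm not very good at working out all the steps in a single script, so I'm going step by step. Thanks, Ed!
PS: If I have to delete the post or rewrite it to avoid being "off topic", I can do that.
@EdMorton I totally agree, I have just rephrased it. Sorry for the confusion, Ed!

Tags: linux bash csv awk sed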

【Solution 1】:

This is the job grep was invented to do:

$ grep -Ev '^(Subtotal Group [0-9]+|Grand total)[[:blank:]]*$' file
Some Generic Headers at the beginning of the file
=======================================================================
Group 1
=======================================================================
6.00   500 First Line Text                                      1685.52
1.00   502 Second Line Text                                      280.98
       530 Other Line text                                       157.32
_________________________________________________________________________
Subtotal Group 1                                                2123.82

========================================================================
GROUP 2
========================================================================

7.00   701 First Line Text                                        53.63
       711 Second Line text                                       97.85
7.00   740 Third Line text                                       157.32
       741 Any Line text                                         157.32
       742 Any Line text                                          18.04
       801 Last Line text                                        128.63
_______________________________________________________________________
Subtotal Group 2                                                 612.79
_______________________________________________________________________
Grand total                                                      1511.03

If you prefer, you can use the same regular expression in awk or sed:

awk '!/^(Subtotal Group [0-9]+|Grand total)[[:blank:]]*$/' file
sed -E '/^(Subtotal Group [0-9]+|Grand total)[[:blank:]]*$/d' file
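The question also asks for a more generic syntax. One possible way to get there, as a minimal sketch only, is to build the alternation from a list of prefixes. The function name drop_blank_totals is hypothetical and not part of the original answer, and each prefix argument is assumed to be a valid ERE fragment.

```shell
# Hypothetical helper (not from the original answer): delete lines that
# consist only of one of the given prefixes plus optional trailing blanks.
# Each prefix argument is used as an ERE fragment.
drop_blank_totals() {
    file=$1
    shift
    # Join the remaining arguments with "|" to build the alternation.
    pattern=$(printf '%s|' "$@")
    pattern=${pattern%|}
    grep -Ev "^(${pattern})[[:blank:]]*\$" "$file"
}
```

Called as drop_blank_totals file 'Subtotal Group [0-9]+' 'Grand total', this should behave like the grep command above; further prefixes can be appended as extra arguments.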

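【Discussion】:

I think you are one of those people who read code as if it were their native language. A very talented programmer, 5 stars!
I hope not, since my native language is spoken rather than written (when it is written, it is spelled phonetically, which makes it highly variable and hard to read!) :-).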

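【Solution 2】:

If your good lines always end with a digit and your Any Text lines do not, you can use:

sed -n '/^.*[0-9]$/p' file

-n suppresses automatic printing of the pattern space, so only lines ending in [0-9] are output. Given your example file, the output is: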

Subtotal                                         2123.82
Total                                             625.80
Any Word                                         9999.99
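With the sample shown in the question, the ends-in-a-digit test has two side effects worth keeping in mind: header and separator lines do not end in a digit and would be dropped, and a bare "Subtotal Group 1" line with no trailing padding does end in a digit and would be kept. A rough sketch of a narrower variant follows. It is not from the original answer, the function name keep_amount_totals is hypothetical, and it assumes that amounts always contain a decimal point, as they do in the sample.

```shell
# Hypothetical helper (not from the original answer): drop a subtotal/total
# line unless its last whitespace-separated field looks like an amount.
# Assumption: amounts always contain a decimal point, as in the sample.
keep_amount_totals() {
    awk '
        /^(Subtotal Group [0-9]+|Grand total)/ && $NF !~ /^[0-9]+\.[0-9]+$/ { next }
        { print }
    ' "$@"
}
```

Run as keep_amount_totals file, this leaves every line that does not start with one of the two prefixes untouched.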

【Discussion】:

【Solution 3】:

You can do it like this:

grep -v -P "^(Subtotal Group \d+|Grand total)[,\s]*$" inputfile > outputfile

Edited according to the comments. Second edit: adapted to the new specification.
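The -P option is not available in every grep implementation, so where it is missing the same filter can probably be written as an ERE instead. This is a sketch under that assumption, not part of the original answer, and the function name drop_empty_totals_ere is hypothetical.

```shell
# Hypothetical helper (not from the original answer): the same filter as a
# POSIX ERE, so it does not depend on grep -P.
# Trailing commas and whitespace count as "empty", covering the CSV form.
drop_empty_totals_ere() {
    grep -Ev '^(Subtotal Group [0-9]+|Grand total)[,[:space:]]*$' "$@"
}
```

For example, drop_empty_totals_ere inputfile > outputfile mirrors the command above.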

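【Discussion】:

No need for cat and |, just use: grep -v ",,,$" infile > outfile

【Solution 4】:

It is not very clear from the question whether the goal is to keep the total/subtotal lines or to delete them.

It is also unclear whether the "#*" comments are an actual part of the input file or merely descriptive.

Fortunately, both are minor details. Doing this with perl is fairly simple: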

$ perl -n -e 'print if /^(Subtotal|Grand Total),(,| |#.*)*/' inputfile
Subtotal,,,                     #This is unuseful --> To be removed
Subtotal,,,                     #This is unuseful --> To be removed
Subtotal,,,125.40               #This is a good line
Subtotal,,,                     #This is unuseful --> To be removed
Grand Total,,,                  #This is unuseful --> To be removed
Grand Total,,,125.40            #This is a good line

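This assumes you want to keep the total and subtotal lines and delete all other lines.

Conversely, to delete the total/subtotal lines and keep the others, replace the if keyword with unless.

If the comments are not actually in the input file itself, the pattern only needs a slight adjustment: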

perl -n -e 'print if /^(Subtotal|Grand Total),(,| )*/' inputfile

This also ignores any extra whitespace. If you want whitespace to be significant, it becomes:

perl -n -e 'print if /^(Subtotal|Grand Total),(,)*/' inputfile

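As I said, even though your question is not 100% clear, the unclear parts are only minor details. perl handles all the possibilities easily.

As shown in the examples, perl prints the edited inputfile on standard output. To replace inputfile with the edited content, just add the -i option to the command (before the -e option).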

【Discussion】:

【Solution 5】:

One more attempt, this time an awk solution...

awk -F, '{for(i=2;i<=NF;i++){if($i~/[0-9.-]+/){print $0;next}}}' falzone
Subtotal,,,125.40               
Grand Total,,,125.40            
Any other text,,,9999.99

Or, for the non-csv version:

grep '[0-9.-]' falzone2
Subtotal                                         2123.82
Total                                             625.80
Any Word                                         9999.99

【Discussion】: