【Question title】: How to eliminate duplicates among AWK search patterns using AWK and/or SED
【Posted】: 2022-01-31 08:01:41
【Question description】:

I have the following file named x.txt (excerpt only):

exMap( "0Ba|Mtm|Mtm","Variable Expenses","Accounting & Legal" )
exMap( "gn C[hu]|gn C[hu]|ent Ca","Variable Expenses","Bank – Charges" )
exMap( "t m|e Fee|^Deb|A\/C|hly pr|ged Ov|^Visa","Fixed Expenses","Bank – Charges" )
exMap( "ATM|ATM|ATM|ATM|ATM|ATM|ATM|ATM|ATM|ATM|^Fix C|R US","Variable Expenses","Bank – Withdrawals" )
exMap( "Acci","Variable Expenses","Business – ACC" )
exMap( "use St$|Pgg","Variable Expenses","Business – Miscellaneous" )
exMap( "utd$|^Ellm|^Ellm|a Cy|^Stihl|^Stihl|a Mow","Variable Expenses","Business – Repairs & Maintenance" )
exMap( "Nzp","Fixed Expenses","Business Services – Mail" )
exMap( "0Ki|^K S","Fixed Expenses","Business – Storage" )
exMap( "2 C|2 C|40D|Tor|Tor|Tor|e of B|^Jay|^Luk|ll J|ty AP|le Au|eemi|eemi|eemi|^N[Zz] S|^N[Zz] S|s The J|P[Bb] T|P[Bb] T|P[Bb] T|mTo|deme|deme|deme|deme|deme|deme","Variable Expenses","Capex" )
exMap( "90C","Variable Expenses","Christian to Crewcut" )
exMap( "10E","Variable Expenses","Christian to Edorne" )
exMap( "0.0Ch|0.0Ch","Variable Expenses","Edorne to Christian" )
exMap( "lyt|-J|lyt|y Ha|NZA|^NZ T|^NZ T|^NZ T|^NZ T|^NZ T|^NZ T|^NZ T|acle Boo","Variable Expenses","Education" )
exMap( "weat|Aca|weat|Aca|weat|^A A|^A A|^A A|^A A|^A A|^A A|^A A|weat","Fixed Expenses","Education" )
exMap( "ntac|0Tr|ntac","Fixed Expenses","Electricity" )

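This file is comma-delimited and consists of three columns. The first column contains awk regex search patterns. Some of them are duplicated, e.g. |Mtm| in the first line or |ATM| in line 4. Is there a clever way to eliminate the duplicates throughout the file, using awk and/or sed, while keeping the pipe structure intact?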

The desired output for the first and fourth lines would be:

exMap( "0Ba|Mtm","Variable Expenses","Accounting & Legal" )
exMap( "ATM|^Fix C|R US","Variable Expenses","Bank – Withdrawals" )

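【Question comments】:

  • Try writing something yourself and, if it doesn't work, show us specifically what you did so that we can help. You start it, then we help; we don't write it for you. Show us the actual code you tried, then describe what happened and what was incorrect, and we can help you from there. If you try it yourself first, you will probably get very close to the answer.

Tags: awk sed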


【Solution 1】:

Using sed

$ cat rem_dupes.sed
s/\(|[^|]*|\?\"\?\)\1\+/\1/g
s/\(|\?[^|]*|\"\?\)\1\+/\1/g
s/\(\([a-z][^|]*\)|[^"]*\)|\2/\1/g
s/\(\([a-z0-9][^|]*\)|\?[^|]*\)|\1/\1/
s/\(\([a-z]*|\)[^|]*|\)\2/\1/
$ sed -f rem_dupes.sed input_file
exMap( "0Ba|Mtm","Variable Expenses","Accounting & Legal" )
exMap( "gn C[hu]|ent Ca","Variable Expenses","Bank – Charges" )
exMap( "t m|e Fee|^Deb|A\/C|hly pr|ged Ov|^Visa","Fixed Expenses","Bank – Charges" )
exMap( "ATM|^Fix C|R US","Variable Expenses","Bank – Withdrawals" )
exMap( "Acci","Variable Expenses","Business – ACC" )
exMap( "use St$|Pgg","Variable Expenses","Business – Miscellaneous" )
exMap( "utd$|^Ellm|a Cy|^Stihl|a Mow","Variable Expenses","Business – Repairs & Maintenance" )
exMap( "Nzp","Fixed Expenses","Business Services – Mail" )
exMap( "0Ki|^K S","Fixed Expenses","Business – Storage" )
exMap( "2 C|40D|Tor|e of B|^Jay|^Luk|ll J|ty AP|le Au|eemi|^N[Zz] S|s The J|P[Bb] T|mTo|deme","Variable Expenses","Capex" )
exMap( "90C","Variable Expenses","Christian to Crewcut" )
exMap( "10E","Variable Expenses","Christian to Edorne" )
exMap( "0.0Ch","Variable Expenses","Edorne to Christian" )
exMap( "lyt|-J|y Ha|NZA|^NZ Tcle Boo","Variable Expenses","Education" )
exMap( "weat|Aca|^A A","Fixed Expenses","Education" )
exMap( "ntac|0Tr","Fixed Expenses","Electricity" )

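s/\(|[^|]*|\?\"\?\)\1\+/\1/g - Using grouping and back-references, capture into group \1 a pipe | followed by everything up to the next pipe, then a second pipe that may or may not be present (\?) and a double quote " that may or may not be present; the first pipe must be there. \1\+ then matches one or more immediate repetitions of exactly the captured text. If such a duplicate run is found, it is dropped and only the original match is kept, returned in the replacement as back-reference \1.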
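To see this first rule in isolation, here is a minimal sketch that applies it to the first row of x.txt. It assumes GNU sed, since \? and \+ are GNU extensions to basic regular expressions; the function name first_rule is made up for this illustration and is not part of the answer.

```shell
# Hypothetical helper: applies only the first rule of rem_dupes.sed to stdin.
# Assumes GNU sed (\? and \+ are GNU extensions in basic regular expressions).
first_rule() {
  sed 's/\(|[^|]*|\?\"\?\)\1\+/\1/g'
}

# Group \1 captures "|Mtm"; \1\+ then consumes the second "|Mtm",
# and the replacement keeps a single copy.
printf '%s\n' 'exMap( "0Ba|Mtm|Mtm","Variable Expenses","Accounting & Legal" )' | first_rule
```

For this row the result should be the same as the first line of the output listed above, with a single Mtm left in the pattern list.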

s/\(|\?[^|]*|\"\?\)\1\+/\1/g - As above, using grouping and back-references, but this time the leading pipe | is optional and the second pipe must be present.

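s/\(\([a-z][^|]*\)|[^"]*\)|\2/\1/g - Nested grouping is used here to match a specific pattern inside the outer group. The inner group \2 captures one pattern starting with a lowercase letter; if the same text shows up again later in the list after a pipe, that later copy is left out of the replacement \1. This lets us match duplicates that appear in a staggered sequence, e.g. lyt|-J|lyt|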

s/\(\([a-z0-9][^|]*\)|\?[^|]*\)|\1/\1/ - As above, but this one also targets staggered duplicates that begin with a digit.

s/\(\([a-z]*|\)[^|]*|\)\2/\1/ - This cleans up the last remaining staggered duplicates, again using nested grouping.
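Since the question also allows awk, here is a hedged alternative sketch that does not rely on back-references: it splits the first quoted field on | and keeps only the first occurrence of each pattern. This is not the method from the answer above. The function name dedupe_patterns is made up, and the sketch assumes that the pattern list is the first double-quoted field on each line and that no pattern contains a literal | or " of its own. It may be worth comparing on the Education row, where the output listed above reads ^NZ Tcle Boo although ^NZ T|acle Boo would be expected.

```shell
# Hypothetical helper: removes repeated |-separated patterns from the first
# double-quoted field of each line, keeping the first occurrence of each.
# Assumption: patterns never contain a literal | or " themselves.
dedupe_patterns() {
  awk '
  {
    start = index($0, "\"")               # position of the opening quote
    if (start == 0) { print; next }       # no quoted field: pass line through
    rest = substr($0, start + 1)          # text after the opening quote
    len = index(rest, "\"") - 1           # length of the pattern list
    if (len < 0) { print; next }          # no closing quote: pass line through

    n = split(substr(rest, 1, len), part, "[|]")
    split("", seen)                       # portable way to clear an array
    out = ""; kept = 0
    for (i = 1; i <= n; i++)
      if (!(part[i] in seen)) {           # first time this pattern is seen
        seen[part[i]] = 1
        out = out (kept++ ? "|" : "") part[i]
      }

    # prefix up to and including the opening quote, the deduplicated list,
    # then the closing quote and the remaining columns unchanged
    print substr($0, 1, start) out substr(rest, len + 1)
  }' "$@"
}
```

It could be run as dedupe_patterns x.txt, or fed from a pipe. Because duplicates are tracked in an array rather than matched positionally, adjacent and staggered repeats are handled by the same loop, and the original order of the patterns is preserved.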

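【Comments】:

  • Thanks HatLess, this works! I'm trying to wrap my head around your code, since I'm not very familiar with sed beyond the basic tutorials. Would you mind explaining what each line does?
  • @ChristianHick Please check the edit.
  • Thanks HatLess.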