Answers: awk parsing with a fixed number of fields

【Question title】: awk parsing with fixed number of fields
【Posted】: 2013-04-12 12:06:31
【Question description】:

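I'm looking for a way to parse records with awk where a record can itself contain \n characters. Fields are separated by |. The catch is that the end of a record can only be determined by reaching a certain number of fields. How can this be done in awk?

Example: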

2013-03-24 15:49:40.575175 EST|aaa|tsi|p1753|th2056569632|172.30.10.212|56809|2013-03-24 15:49:32 AFT|10354453|con2326|cmd7|seg-1||dx318412|x10354453|sx1|LOG: |00000|statement: SET DATESTYLE = "ISO"; Select * 
from bb 
where cc='1'||||||SET DATESTYLE = "ISO"; Select * from bb where cc='1'|0||postgres.c|1447|
2013-04-10 12:45:48.277080 EST|aa|tsi|p22814|th1093698336|172.30.0.186|3304|2013-04-10 12:44:29 AFT|10400046|con67|cmd5|seg-1||dx341|x10400046|sx1|LOG: |00000|statement: create table xx as (select r.xx,sum(r."XX"),c.dd from region_RR r, cat_CC c
where r.aa=c.vv
group by 1)||||||create table xx as (select r.xx,sum(r."XX"),c.dd from region_RR r, cat_CC c
where r.aa=c.vv
group by 1)
|0||postgres.c|1447|

Each entry above is a single record that contains several \n characters. I need to parse it with awk and extract, for example, the 5th field.

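【Question discussion】:

I assume there will be multiple records per file? Can you provide a larger sample? I wouldn't use awk for this problem. Is there any particular reason for the awk constraint?
As @MattH said, please provide input with at least 2 records. Is there perhaps an inherent record separator, such as a blank line?
Please post the expected output for the given input, and confirm that the input shown has the same format as your actual file.

Tags: bash awk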

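【Solution 1】:

Taking inspiration from sudo_O's answer above... Set the variable FIELD_TO_PRINT to the position of the field of interest, and set another variable, FIELDS_PER_RECORD, to the number of fields that make up a record. Tested with GNU awk on Ubuntu.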

awk   -v FIELDS_PER_RECORD=10 -v FIELD_TO_PRINT=5 'BEGIN{FS="|"; RS="\0"}\
{for (i=1; i<=NF; ++i) {if (i % FIELDS_PER_RECORD == FIELD_TO_PRINT) {print $i} }}' file_name.txt
th2056569632
x10354453
SET DATESTYLE = "ISO"; Select * from bb where cc='1'
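
To see how the choice of stride affects the result, here is a minimal Python sketch of the same modulo idea, run on made-up data rather than the OP's log. It assumes every record ends with a trailing `|`, so that splitting the whole input on `|` yields exactly as many pieces per record as the record has separators. The function name `nth_field_by_stride` and the sample text are hypothetical.

```python
def nth_field_by_stride(text, seps_per_record, field_to_print, sep="|"):
    """Split the whole input on sep and keep every field whose 1-based
    position modulo the stride equals field_to_print (same test as the awk loop).

    Assumes 1 <= field_to_print < seps_per_record.
    """
    fields = text.split(sep)
    return [
        value
        for position, value in enumerate(fields, start=1)
        if position % seps_per_record == field_to_print
    ]


# Hypothetical data: 4 separators per record; field 3 of record 1 spans two lines.
sample = "a1|b1|c1\nmore|d1|\na2|b2|c2|d2|\n"

print(nth_field_by_stride(sample, 4, 2))  # ['b1', 'b2']
```

With this layout the stride is the number of separators per record, not the number of visible fields plus one. Also note that field 1 of every record after the first would carry the preceding newline, because the trailing empty field of one record is glued to the start of the next.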

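【Discussion】:

Nice, I didn't know awk could do that.
The input the OP gave is a single record and they only want one field per record; this prints 3 fields for the given record.
@sudo_O, it seems to me the OP is suggesting an implicit record length based on a fixed number of fields, so I wrote an answer on that assumption.
That's fine, I was just pointing out that FIELDS_PER_RECORD is incorrect. For the OP's case you want FIELDS_PER_RECORD=29.
It works for me: [gpadmin@gpblade1 data]$ cat /tmp/aa.txt | awk -v FIELDS_PER_RECORD=29 -v FIELD_TO_PRINT=5 ' BEGIN{FS="|"; RS="\0"} {for (i=1; i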

【Solution 2】:

Admittedly this isn't what you asked for, but for comparison, here is how I would do it in Python:

from io import StringIO

def records_from_file(f, separator='|', field_count=30):
  """Yield lists of field_count fields, joining lines until a record is complete."""
  record = []
  for line in f:
    fields = line.split(separator)
    if record:
      # A field continues across the line break:
      # merge the last existing field with the first new one
      record[-1] += fields[0]
      # Append the remaining fields
      record.extend(fields[1:])
    else:
      record.extend(fields)
    if len(record) > field_count:
      raise Exception("Concatenating records overflowed number of fields", record)
    elif len(record) == field_count:
      yield record
      record = []

sample = """2013-03-24 15:49:40.575175 EST|aaa|tsi|p1753|th2056569632|172.30.10.212|56809|2013-03-24 15:49:32 AFT|10354453|con2326|cmd7|seg-1||dx318412|x10354453|sx1|LOG: |00000|statement: SET DATESTYLE = "ISO"; Select * 
from bb 
where cc='1'||||||SET DATESTYLE = "ISO"; Select * from bb where cc='1'|0||postgres.c|1447|
2013-04-10 12:45:48.277080 EST|aa|tsi|p22814|th1093698336|172.30.0.186|3304|2013-04-10 12:44:29 AFT|10400046|con67|cmd5|seg-1||dx341|x10400046|sx1|LOG: |00000|statement: create table xx as (select r.xx,sum(r."XX"),c.dd from region_RR r, cat_CC c
where r.aa=c.vv
group by 1)||||||create table xx as (select r.xx,sum(r."XX"),c.dd from region_RR r, cat_CC c
where r.aa=c.vv
group by 1)
|0||postgres.c|1447|"""

for record in records_from_file(StringIO(sample)):
  # Index 4 is the 5th field
  print(record[4])

Yields:

th2056569632
th1093698336

【Discussion】:

【Solution 3】:

For a single record in a file, you can set the record separator to the null character, RS='\0', so that the input file is read as one complete record:

$ awk '{print $5}' FS='|' RS='\0' file
th2056569632

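For multiple records, you can use the date as the record separator (unless the records are already separated by blank lines, which would make things simpler, or unless you need this field in the output):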

$ awk 'NR>1{print $5}' FS='|' RS='(^|[^|])[0-9]{4}-[0-9]{2}-[0-9]{2} ' file
th2056569632
th1093698336
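
For comparison, the date-as-separator idea can be sketched in Python: find every line that begins with a date, treat each such position as the start of a record, and take the fifth |-separated field of each chunk. This assumes that a date never starts a line inside a field. The function name `fifth_fields` and the shortened sample text are made up for illustration.

```python
import re

# A record is assumed to start at any line beginning with YYYY-MM-DD and a space.
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} ", re.MULTILINE)


def fifth_fields(text, sep="|"):
    """Return field 5 of every record, where a record starts at a dated line."""
    starts = [match.start() for match in RECORD_START.finditer(text)]
    ends = starts[1:] + [len(text)]
    records = [text[begin:end] for begin, end in zip(starts, ends)]
    # Index 4 is the 5th field; a record with fewer fields raises IndexError.
    return [record.split(sep)[4] for record in records]


# Hypothetical, shortened log lines with an embedded newline in field 6.
sample = (
    "2013-03-24 15:49:40 EST|aaa|tsi|p1|th111|select *\nfrom bb|end|\n"
    "2013-04-10 12:45:48 EST|aa|tsi|p2|th222|create table xx\ngroup by 1|end|\n"
)

print(fifth_fields(sample))  # ['th111', 'th222']
```

Unlike the awk command, which consumes the date as part of the separator, this sketch keeps the date in the first field of each record.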

Would the simpler grep -o 'th[0-9]*' file be suitable here?
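
One caveat with that pattern, shown as a small Python sketch on made-up text: `th[0-9]*` also matches a bare `th` with no digits, so any word containing `th` would be reported too. Requiring at least one digit plus word boundaries is stricter. Whether that is strict enough depends on what else appears in the log.

```python
import re

# Hypothetical text mixing a thread id with ordinary words.
text = "p1753|th2056569632|the other|x10354453"

loose = re.findall(r"th[0-9]*", text)       # same pattern as the grep command
strict = re.findall(r"\bth[0-9]+\b", text)  # at least one digit, whole token

print(loose)   # ['th2056569632', 'th', 'th']
print(strict)  # ['th2056569632']
```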

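【Discussion】:

I think the OP is implying that there may be multiple records.
Unfortunately, it doesn't work on CentOS or Solaris either. -bash-3.00$ awk 'NR>1{print $5}' FS='|' RS='(^|[^|])[0-9]{4}-[0-9]{2}-[0-9]{2} ' /tmp/aa.txt 0 -bash-3.00$ CentOS: [gpadmin@gpblade1 data]$ awk 'NR>1{print $5}' FS='|' RS='(^|[^|])[0-9]{4}-[0-9]{2}-[0-9]{2} ' /tmp/aa.txt [gpadmin@gpblade1 data]$ I have also attached the file aa.txt.
@martinnovoty /usr/bin/awk on Solaris is old, broken awk; use /usr/xpg4/bin/awk instead. Better still, install a recent version of gawk on your system if you can.
-bash-3.00$ /usr/xpg4/bin/awk 'NR>1{print $5}' FS='|' RS='(^|[^|])[0-9]{4}-[0-9]{2}-[0-9]{2} ' /tmp/aa.txt 1447 -bash-3.00$
:-)) OK, one last point: I'm actually trying to extract the SQL statements from the log. Anyway, thank you very much. Sometimes I wonder how you manage to respond so quickly. It looks like there is some AI behind it... thanks a lot!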