awk: extracting data that spans several lines (answers)

【Question title】: awk: extracting data which is on several lines
【Posted】: 2017-12-10 11:36:20
【Question】:

I have a file that looks like this:

/translation="MDGVTQQNAALVQEATTAAASLEEQARNLTAAVAAFDLGDKQTV
                 LITPRAAVPALKRPALKASLPASSSHGNWETF"
                 /product="Methyl-accepting chemotaxis protein I (serine
                 chemoreceptor protein)"
 CDS             complement(471..590)
                 /db_xref="SEED:fig|1240086.14.peg.2"
                 /translation="MHQYQSAILAKICRYGGIEKPEITPASVYKLDSHWRYVI"
                 /product="hypothetical protein"
 CDS             717..2354
                 /db_xref="SEED:fig|1240086.14.peg.3"
                 /translation="MGFFVVLWGGASGFSLYSLKQVTTLLHDNSTQGRTYTYLVYGND
                 QYFRSVTRMARVMDYSQFSDAAIASLEEQAQQLTKAVEVFHLGSEYQTAAS
                 RTRPAGNMALKRPALSGMAPALPPARTASDEGSWEKF"
                 /product="Methyl-accepting chemotaxis protein I (serine
                 chemoreceptor protein)"
                 /product="macromolecule metabolism; macromolecule
                 degradation; degradation of proteins, peptides,
                 glycopeptides"

I need to extract the text between the quotes that follow "/product=", so I need this:

Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides

I have to use awk, so I wrote this:

awk '/\/product/ {split($0, a, "\""); printf a[2] "\n"}'

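But this only picks up the text that is on the same line as "/product", and sometimes the value runs over two or three lines. I don't know how to get the whole text between the quotes. Can anyone help?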

【Comments】:

Tags: bash awk

【Solution 1】:

awk to the rescue! This requires multi-character RS support (gawk).

$ awk -v RS='/| CDS' -F'"' '/^product/{gsub("\n +"," "); print $2}' file


Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides

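Explanation: set up the record structure (a record starts at "/" or " CDS"), find the relevant records (those starting with product), trim the extra whitespace, and print the text between the two quotes (the second field, since the field separator is set to the double quote).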
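Since the one-liner above depends on gawk's multi-character RS, here is a minimal line-oriented sketch that should also run on a plain POSIX awk. It is not the answerer's method: it swaps the record-separator trick for a small state machine, and the sample data, the variable names (sample, result, buf, inside), and the assumption that a value never contains a double quote are mine.

```shell
# Shortened sample in the same layout as the question (illustrative data).
sample='                 /translation="MHQYQ"
                 /product="Methyl-accepting chemotaxis protein I (serine
                 chemoreceptor protein)"
 CDS             717..2354
                 /product="hypothetical protein"'

result=$(printf '%s\n' "$sample" | awk '
    /\/product="/ {                 # a product value starts on this line
        sub(/^.*\/product="/, "")   # drop everything up to the opening quote
        buf = ""; inside = 1
    }
    inside {
        sub(/^[ \t]+/, "")          # strip the continuation indent
        buf = (buf == "" ? $0 : buf " " $0)
        if (buf ~ /"$/) {           # the closing quote ends the value
            sub(/"$/, "", buf)
            print buf
            inside = 0
        }
    }')

printf '%s\n' "$result"
```

On this sample it prints the two product values, each joined onto one line.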

【Discussion】:

【Solution 2】:

Awk solution:

awk -v RS='"' '!(NR%2) && f{ gsub(/[[:space:]]+/," "); print }
               /\/[[:alnum:]_-]+=$/{ f=(/product=/? 1:0) }' file

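-v RS='"' - treat the double quote " as the record separator
!(NR%2) - on every even-numbered record
gsub(/[[:space:]]+/," ") - collapse each run of whitespace into a single space
f=(/product=/? 1:0) - set the flag f to 1 on a record ending in /product=, and back to 0 on a record ending in any other key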
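To make the even/odd idea concrete, here is a small sketch on a made-up one-line input (the /note and /product keys and the variable names flagged and keep are illustrative only). With the double quote as record separator, odd-numbered records are the text outside quotes and even-numbered records are the quoted values, so the key seen in the preceding odd record decides whether the next value is printed.

```shell
# Records: 1 '/note='  2 'skip me'  3 ' /product='  4 'keep me'  5 trailing newline
flagged=$(printf '/note="skip me" /product="keep me"\n' | awk -v RS='"' '
    !(NR % 2) && keep { print NR ": " $0 }      # even record: text inside quotes
    NR % 2 { keep = ($0 ~ /product=$/) }        # odd record: remember the key
')

printf '%s\n' "$flagged"    # prints: 4: keep me
```

A single-character RS is standard awk, so this part of the approach does not need gawk.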

Output:

Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides

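【Discussion】:

What does /\/[[:alnum:]_-]+=$/ mean lol? And I don't understand why you only want to process the even lines.
@janedoe, what does your "mean lol" mean?
I meant "what does this mean" :)
@janedoe, /\/[[:alnum:]_-]+=$/ matches the key/parameter that precedes a double-quoted value, i.e. the line CDS complement(471..590) /db_xref="SEED:fig|1240086.14.peg.3" will be split into two adjacent parts
As for the even lines: the record separator is changed to " by RS='"', so even-numbered records are inside double quotes and odd-numbered records are outside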

【Solution 3】:

This can be done with GNU grep; with -z the matches are printed separated by NUL (\0) bytes

grep -Pzo '/product="\K[^"]*'  | tr -s '\0\t\n' '\n '

Or with perl, replacing each run of whitespace (spaces, newlines or tabs) with a single space and separating the results with newlines

perl -0777 -ne 'print s/\s+/ /gr."\n" for /\/product="\K[^"]*/g'
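To see what the tr stage of the grep pipeline does, here is a small sketch that fakes the NUL-separated matches with printf (the two values and the variable name lines are made up). The second set is written out in full (newline, space, space) as an assumption on my part, so that it does not rely on tr padding a shorter set.

```shell
# Two NUL-terminated "matches"; the first one spans a line break with indentation.
lines=$(printf 'first\n      value\000second value\000' | tr -s '\000\t\n' '\n  ')

printf '%s\n' "$lines"
# first value
# second value
```

Each NUL becomes a newline, tabs and embedded newlines become spaces, and -s squeezes the resulting runs of spaces.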

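【Discussion】:

@Sundeep, thanks for the catch; indeed, on its own the grep output is only usable when piped to xargs -0 or to another process that can split on NUL bytes
Added | tr to do the same as the perl version

【Solution 4】:

Using GNU awk for multi-char RS and RT: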

$ gawk -v RS='/product="[^"]+"' -F'"' 'RT{$0=RT; gsub(/\s+/," "); print $2}' file
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
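RT is specific to gawk. As a rough portable analogue (a different technique, not the answer's), the sketch below reads the whole input into one string and walks it with match(), RSTART and RLENGTH, which are standard awk. The shortened sample and the names records, matches, text and value are assumptions for illustration.

```shell
# Shortened sample in the same layout as the question.
records='                 /product="Methyl-accepting chemotaxis protein I (serine
                 chemoreceptor protein)"
 CDS             717..2354
                 /db_xref="SEED:fig|1240086.14.peg.3"
                 /product="hypothetical protein"'

matches=$(printf '%s\n' "$records" | awk '
    { text = text $0 "\n" }                       # collect the whole input
    END {
        while (match(text, /\/product="[^"]+"/)) {
            # skip the 10 characters of /product=" and drop the closing quote
            value = substr(text, RSTART + 10, RLENGTH - 11)
            gsub(/[ \t\n]+/, " ", value)          # fold line breaks and indentation
            print value
            text = substr(text, RSTART + RLENGTH) # continue after this match
        }
    }')

printf '%s\n' "$matches"
```

Holding the whole file in memory is fine for annotation files of modest size; for very large inputs the streaming gawk version above is the better fit.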

【Discussion】:

【Solution 5】:

Assuming the file is named file.txt

echo $(cat file.txt ) | sed 's/\//\n/g' | grep product | sed 's/product="//g;s/".*//'

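Explanation:

Combine all lines into a single line

echo $(cat file.txt)

Replace "/" with a newline

echo $(cat file.txt) | sed 's/\//\n/g'

Keep the lines containing product

echo $(cat file.txt) | sed 's/\//\n/g' | grep product

Remove product=" and everything from the closing double quote onwards

echo $(cat file.txt) | sed 's/\//\n/g' | grep product | sed 's/product="//g;s/".*//'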
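The first step works because the command substitution is left unquoted: the shell splits the file's contents on whitespace and echo joins the words back with single spaces. A minimal sketch with a throwaway file (the file contents and the variable names tmp and joined are made up):

```shell
tmp=$(mktemp)
printf '/product="one\n                 two"\n' > "$tmp"

joined=$(echo $(cat "$tmp"))   # unquoted on purpose: word splitting folds the whitespace
printf '%s\n' "$joined"        # prints: /product="one two"

rm -f "$tmp"
```

Two caveats: unquoted expansion is also subject to pathname expansion, so characters such as * or ? in the data could be replaced by matching file names, and splitting on every "/" in the next step would also cut a product value that itself contains a slash.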

【Discussion】: