Transforming text with 'sed' or 'awk': answers

[Question title]: Transforming text with 'sed' or 'awk'
[Posted]: 2012-03-08 18:22:31
[Question]:

I have a very large input set that looks like this:

Label: foo, Other text: text description...
   <insert label> Item: item description...
   <insert label> Item: item description...
Label: bar, Other text:...
   <insert label> Item:...
Label: baz, Other text:...
   <insert label> Item:...
   <insert label> Item:...
   <insert label> Item:...
...

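I want to transform it by extracting the label name (for example "foo") and replacing the "<insert label>" placeholder on the following lines with the actual label: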

Label: foo, Other text: text description...
   foo Item: item description...
   foo Item: item description...
Label: bar, Other text:...
   bar Item:...
Label: baz, Other text:...
   baz Item:...
   baz Item:...
   baz Item:...
...

Can this be done with sed, awk, or another Unix tool? If so, how?

[Question comments]:

Tags: sed awk transform

[Solution 1]:

Here is my label.awk file:

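# On a label line, remember the label with its trailing comma removed.
/^Label:/ {
    label = $2
    sub(/,$/, "", label)
}

# On an item line, replace the placeholder with the remembered label.
/<insert label>/ {
    sub(/<insert label>/, label)
}

# Print every line, modified or not.
1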

Invocation:

awk -f label.awk data.txt
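
To try the script without creating any files, the same program can be passed inline. This is only a minimal sketch: the sample data is abridged from the question, and the variable names `input` and `result` are illustrative rather than part of the original answer.

```shell
# Abridged sample from the question; variable names are illustrative.
input='Label: foo, Other text: text description...
   <insert label> Item: item description...
Label: bar, Other text:...
   <insert label> Item:...'

# Same logic as label.awk, passed inline instead of via -f.
result=$(printf '%s\n' "$input" | awk '
    # Remember the label, minus the trailing comma.
    /^Label:/        { label = $2; sub(/,$/, "", label) }
    # Substitute the remembered label into item lines.
    /<insert label>/ { sub(/<insert label>/, label) }
    # Print every line.
    1
')
printf '%s\n' "$result"
```

Because sub() edits the line in place, the three-space indent of the item lines is left as it was.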

[Comments]:

[Solution 2]:

A solution using sed:

Contents of script.sed:

## When line beginning with the 'label' string.
/^Label/ {
    ## Save content to 'hold space'.
    h   

    ## Get the string after the label (removing all other characters)
    s/^[^ ]*\([^,]*\).*$/\1/

    ## Save it in 'hold space' and get the original content
    ## of the line (exchange contents).
    x   

    ## Print and read next line.
    b   
}
###--- Commented this wrong behaviour ---###    
#--- G
#--- s/<[^>]*>\(.*\)\n\(.*\)$/\2\1/

###--- And fixed with this ---###
## When line contains '<insert label>'
/<insert label>/ {
    ## Append the label name to the line.
    G   

    ## And substitute the '<insert label>' string with it.
    s/<insert label>\(.*\)\n\(.*\)$/\2\1/
}

Contents of infile:

Label: foo, Other text: text description...
   <insert label> Item: item description...
   <insert label> Item: item description...
Label: bar, Other text:...
   <insert label> Item:...
Label: baz, Other text:...
   <insert label> Item:...
   <insert label> Item:...
   <insert label> Item:...

Run it like this:

sed -f script.sed infile

Result:

Label: foo, Other text: text description...
    foo Item: item description...
    foo Item: item description...
Label: bar, Other text:...
    bar Item:...
Label: baz, Other text:...
    baz Item:...
    baz Item:...
    baz Item:...
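
Note that the item lines in this result are indented by four spaces, while the input uses three. This appears to be because the group `\([^,]*\)` also captures the space after "Label:". If the original indent matters, one possible variant (a sketch, not the answer author's script) consumes that space before the capture group. The sample data is abridged and the variable names are illustrative.

```shell
# Abridged sample from the question; variable names are illustrative.
input='Label: foo, Other text: text description...
   <insert label> Item: item description...
Label: bar, Other text:...
   <insert label> Item:...'

result=$(printf '%s\n' "$input" | sed '
/^Label/ {
    # Keep the full line in the hold space.
    h
    # Consume the space after "Label:" so only the bare name is captured.
    s/^[^ ]* \([^,]*\).*$/\1/
    # Swap: the label goes to the hold space, the original line comes back.
    x
    # Skip the rest of the script; the line is printed as usual.
    b
}
/<insert label>/ {
    # Append a newline plus the saved label, then splice the label in.
    G
    s/<insert label>\(.*\)\n\(.*\)$/\2\1/
}
')
printf '%s\n' "$result"
```

The only change from script.sed is the literal space between `^[^ ]*` and the capture group.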

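[Comments]:

I got an error: sed: 2: script.sed: invalid command code I. Am I using a different version of sed?
@Manish: Yes. That is a GNU extension that makes the match case-insensitive. I have changed the script to match the exact word, case included.
It works now, but not if the file contains lines without "<insert label>". I changed your last line to /<insert label>/!s/\n.*//;s/<insert label>\(.*\)\n\(.*\)$/\2\1/ to handle that. (Also, let's match "<insert label>" specifically; the file may contain other such "tags".)
Instead of changing the last line, change the last two lines to: /<insert label>/{G;s/<insert label>\(.*\)\n\(.*\)$/\2\1/}
Great! Thanks, everyone. All the answers worked. Sadly, I can only accept one, so this is the one I chose.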
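
The effect of the address guard discussed in these comments can be checked with a small experiment. The sketch below runs the script body twice on a hypothetical three-line sample: once with `G` and `s` applied to every non-label line, and once with both commands restricted to lines containing the placeholder. The data and variable names are assumptions made for illustration.

```shell
# Sample with one line that is neither a label nor a placeholder (hypothetical data).
input='Label: foo, Other text:...
   <insert label> Item:...
   Note: no placeholder here'

# Unguarded: G and s run on every non-label line.
unguarded=$(printf '%s\n' "$input" | sed '
/^Label/ {
    h
    s/^[^ ]*\([^,]*\).*$/\1/
    x
    b
}
G
s/<insert label>\(.*\)\n\(.*\)$/\2\1/
')

# Guarded: G and s run only on lines containing the placeholder.
guarded=$(printf '%s\n' "$input" | sed '
/^Label/ {
    h
    s/^[^ ]*\([^,]*\).*$/\1/
    x
    b
}
/<insert label>/ {
    G
    s/<insert label>\(.*\)\n\(.*\)$/\2\1/
}
')

printf 'unguarded:\n%s\n\nguarded:\n%s\n' "$unguarded" "$guarded"
```

In the unguarded run, the "Note" line picks up the appended label as an extra output line, because `G` has already added it and the substitution finds nothing to replace. In the guarded run, the line passes through unchanged.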

[Solution 3]:

You can use awk like this:

awk '$1=="Label:" {label=$2; sub(/,$/, "", label);} 
     $1=="<insert" && $2=="label>" {$1=" "; $2=label;}
     {print $0;}' file
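
One side effect worth knowing: assigning to a field makes awk rebuild $0 with the output field separator (a single space by default), so runs of whitespace in the rest of the line collapse. The sketch below contrasts the two approaches on one hypothetical line; the sample text, the fixed label passed with `-v`, and the variable names are assumptions made for illustration.

```shell
# Hypothetical line with extra internal spacing.
line='   <insert label> Item:   two   words'

# Field assignment, as in the answer: awk rebuilds $0 using OFS (a single space).
rebuilt=$(printf '%s\n' "$line" | awk -v label=foo '
    $1=="<insert" && $2=="label>" { $1=" "; $2=label }
    { print }')

# In-place substitution: the rest of the line is left untouched.
kept=$(printf '%s\n' "$line" | awk -v label=foo '
    { sub(/<insert label>/, label) }
    { print }')

# Brackets make the leading whitespace visible.
printf '[%s]\n[%s]\n' "$rebuilt" "$kept"
```

If the exact spacing of the input has to survive, the sub() form is likely the safer choice. If normalized spacing is acceptable, the field-based form works as posted.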

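[Comments]:

If you want to anchor the pattern, it is better to use sub rather than gsub. You don't need line continuations inside single quotes.
@glennjackman: Thank you very much for the suggestion and the edit. I appreciate it.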