Unix print pattern between the Strings: answers

【Question title】: Unix print pattern between the Strings
【Posted】: 2015-02-10 16:49:14
【Question】:

I have a file with the content shown below. Each START/STOP pair delimits one block.

START
X | 123
Y | abc
Z | +=-
STOP
START
X | 456
Z | +%$
STOP
START
X | 789
Y | ghi
Z | !@#
STOP

For each block, I want to print the values of X and Y in the following format:

123 ~~ abc
456 ~~ 
789 ~~ ghi

If START/STOP occurred only once, sed -n '/START/,/STOP/p' would have helped. Since the block repeats, I need your help.

【Comments】:

Tags: unix awk sed pattern-matching

【Solution 1】:

Based on my own solution to How to select lines between two marker patterns which may occur multiple times with awk/sed:

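awk -v OFS=" ~~ " '
       /START/{flag=1;next}                                   # block begins: start collecting
       /STOP/{flag=0; print first, second; first=second=""}   # block ends: print, then reset
       flag && $1=="X" {first=$3}                             # remember the X value
       flag && $1=="Y" {second=$3}                            # remember the Y value
       ' file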

Test

$ awk -v OFS=" ~~ " '/START/{flag=1;next}/STOP/{flag=0; print first, second; first=second=""} flag && $1=="X" {first=$3} flag && $1=="Y" {second=$3}' a
123 ~~ abc
456 ~~ 
789 ~~ ghi

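【Comments】:

【Solution 2】:

Sed is always the wrong choice for any problem that involves processing multiple lines. All of sed's arcane constructs for doing so became obsolete in the mid-1970s, when awk was invented.

Whenever the input contains name-value pairs, I find it useful to create an array that maps each name to its value and then access the array by name. In this case, using GNU awk for the multi-character RS and for deleting a whole array: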

$ cat tst.awk
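BEGIN {
    RS = "\nSTOP\n"          # one record per START...STOP block (multi-char RS: GNU awk)
    OFS=" ~~ "
}
{
    delete n2v               # forget the values of the previous block
    for (i=2;i<=NF;i+=3) {   # after START, fields come in triples: name, "|", value
        n2v[$i] = $(i+2)
    }
    print n2v["X"], n2v["Y"]
}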

$ gawk -f tst.awk file
123 ~~ abc
456 ~~ 
789 ~~ ghi
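
The answer above relies on two GNU awk features. For readers without gawk, the same name-to-value array idea can be sketched line by line instead of block by block. This is an illustrative variant under stated assumptions, not the answerer's code: the file name sample.txt and the function name print_xy are made up for the example, and split("", n2v) stands in for delete n2v as a widely portable way to empty an array.

```shell
#!/bin/sh
# Sample input copied from the question; the file name sample.txt is an assumption.
cat > sample.txt <<'EOF'
START
X | 123
Y | abc
Z | +=-
STOP
START
X | 456
Z | +%$
STOP
START
X | 789
Y | ghi
Z | !@#
STOP
EOF

# print_xy is a hypothetical helper name, not part of the original answer.
print_xy() {
    awk -v OFS=" ~~ " '
        $1 == "START" { split("", n2v); next }            # empty the array portably
        $1 == "STOP"  { print n2v["X"], n2v["Y"]; next }  # end of block: emit X and Y
        $2 == "|"     { n2v[$1] = $3 }                    # a "name | value" line
    ' "$1"
}

print_xy sample.txt
```

With the sample data this should print the same three lines as the gawk version. Unlike the RS-based script, it compares whole fields ($1 == "START") rather than splitting on a record separator, so it does not depend on the file ending with a newline after the last STOP.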

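【Comments】:

I like the idea of storing the values in an array, +1, and a moral +1 for also adding an explanation :)!
Haha, funny that you ended up apologizing for explaining your answer ;) Yes, it really is useful to read

【Solution 3】:

Because I like brain teasers (and not because this sort of thing is practical in sed), a possible sed solution is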

sed -n '/START/,/STOP/ { //!H; // { g; /^$/! { s/.*\nX | \([^\n]*\).*/\1 ~~/; ta; s/.*/~~/; :a G; s/\n.*Y | \([^\n]*\).*/ \1/; s/\n.*//; p; s/.*//; h } } }'

It works as follows:

/START/,/STOP/ {                        # between two start and stop lines
  //! H                                 # assemble the lines in the hold buffer
                                        # note that // repeats the previously
                                        # matched pattern, so // matches the
                                        # start and end lines, //! all others.

  // {                                  # At the end
    g                                   # That is: When it is one of the
    /^$/! {                             # boundary lines and the hold buffer
                                        # is not empty

      s/.*\nX | \([^\n]*\).*/\1 ~~/     # isolate the X value, append ~~

      ta                                # if there is no X value, just use ~~
      s/.*/~~/
      :a 

      G                                 # append the hold buffer to that
      s/\n.*Y | \([^\n]*\).*/ \1/       # and isolate the Y value so that
                                        # the pattern space contains X ~~ Y

      s/\n.*//                          # Cutting off everything after a newline
                                        # is important if there is no Y value
                                        # and the previous substitution did
                                        # nothing

      p                                 # print the result

      s/.*//                            # and make sure the hold buffer is
      h                                 # empty for the next block.
    }
  }
}

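【Comments】:

What can I say. I got several answers. Thanks, everyone. With the sample data, Wintermute's solution takes 0m0.151s, Ed Morton's takes 0m0.160s, and fedorqui's takes 0m0.163s. Thanks again, everyone
IMHO, execution speed is never going to be the deciding factor between the sed and awk solutions. Just try modifying one of them to, for example, print a debug statement for every line read, or print a count at the end of how many times "Y" was found, or ....
There's the l command. :P Seriously, you'll want to use one of the awk solutions. I don't agree that awk is always better (mainly because it has no backreferences), but here there is no contest. I mean, look at this one and then look at @fedorqui's solution. One of them is human-readable, and the other one is mine. You don't want to introduce unmaintainable code for 7% of runtime. I wrote this for fun.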
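
To make the maintainability point from the comments concrete, here is a sketch of one of the suggested modifications: Solution 1 extended to report at the end how many times "Y" was found. It is an illustration only. The counter name ycount, the function name print_xy_with_count, the file name sample.txt, and the wording of the summary line are all assumptions made for the example.

```shell
#!/bin/sh
# Sample input copied from the question; the file name sample.txt is an assumption.
cat > sample.txt <<'EOF'
START
X | 123
Y | abc
Z | +=-
STOP
START
X | 456
Z | +%$
STOP
START
X | 789
Y | ghi
Z | !@#
STOP
EOF

# print_xy_with_count is a hypothetical helper name wrapping Solution 1.
print_xy_with_count() {
    awk -v OFS=" ~~ " '
        /START/ { flag = 1; next }
        /STOP/  { flag = 0; print first, second; first = second = "" }
        flag && $1 == "X" { first = $3 }
        flag && $1 == "Y" { second = $3; ycount++ }        # the only changed rule
        END { print "Y seen " (ycount + 0) " times" }      # +0 prints 0 if Y never occurs
    ' "$1"
}

print_xy_with_count sample.txt
```

The change amounts to one extra statement and one END rule, which is the kind of small edit the comment argues is hard to make in the sed version.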