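How does shell select content within the keyword range? Answers

【Question title】: How does shell select content within the keyword range?
【Posted】: 2021-06-11 23:13:03
【Question description】:

I have an HTML file that contains a large number of <section>...</section> blocks, in the following format.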

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<section>
<div>
<header><h2>This is a title (RfQVthHm)</h2></header>
More HTML codes...
</div>
</section>

<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>

<section>
<div>
<header><h2>This is a title (vxzbXEGq)</h2></header>
More HTML codes...
</div>
</section>

</body>
</html>

I need to extract the second <section>...</section> block.

This is the expected output.

<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>

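I noticed that I could first find the string UaHaZWvm (together with the 2 lines before it) and then take everything up to the next </section>.

OP's attempt (mentioned in the comments): grep -o "hi.*bye" file

Can this be done with awk, sed, or grep?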

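【Question discussion】:

Please add your attempts to your question in the form of code; this is highly encouraged, thank you.
@RavinderSingh13 Sorry, I couldn't find a workable solution by searching the web, so I asked here. I had read the grep documentation and found that grep -o "hi.*bye" files.html can be used to get the content in a given range, but it doesn't really work.
@Lorraine1996. You can use awk in paragraph mode and extract the section in which (UaHaZWvm) appears.
@Lorraine1996, please add the code you have tried to your question (to avoid negative votes on it). We are all here to learn, and there is no right or wrong, so please show your code in the question as your effort, thank you.
@CarlosPascual Sorry, I will look at the awk documentation. I'll update here if I make progress.

Tags: shell awk sed grep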

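【Solution 1】:

Posting my own solution as an update, in the hope that it is useful to others.

This approach combines grep with sed or awk: the -B option sets where the output starts, the -A option prints the lines that follow (10000 lines is usually enough), and sed or awk then cuts the output at the closing keyword.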

awk

cat test.html | grep 'UaHaZWvm' -B2 -A10000 | awk 'NR==1,/<\/section>/'

sed

cat test.html | grep 'UaHaZWvm' -B2 -A10000 | sed -n '1,/<\/section>/p'
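
The same range idea can also be sketched with sed alone, without guessing a line count. This is only an illustration, not part of the answer above: the sample file and the helper name section_with_key are hypothetical, and the key is assumed to contain no `/` or regex metacharacters because it is pasted into a sed pattern.

```shell
#!/bin/sh
# Hypothetical sample file standing in for test.html.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<body>

<section>
<div>
<header><h2>This is a title (RfQVthHm)</h2></header>
</div>
</section>

<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
</div>
</section>

</body>
EOF

# section_with_key KEY FILE (hypothetical helper)
# On each <section> line, keep appending lines (N) to the pattern space
# until it contains </section>, then print it only if it contains KEY.
section_with_key() {
  sed -n -e '/<section>/{' -e ':a' -e 'N' -e '/<\/section>/!ba' \
         -e "/$1/p" -e '}' "$2"
}

section_with_key UaHaZWvm "$sample"
```

Because the whole section is collected before the key is tested, the key may sit on any line of the section, not only two lines below <section>.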

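【Discussion】:

Don't do that: it has a UUOC (useless use of cat), 3 commands, a hard-coded 2 for -B, and a guess that 10,000 will usually be enough. Just use the concise, robust, efficient awk -v RS= -v ORS='\n\n' '/UaHaZWvm/' file and it will simply work.

【Solution 2】:

Since you are working with HTML, it is much easier and better to use a format-aware tool such as xmllint, or another program that lets you extract part of a document with an XPath expression: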

$ xmllint --html --xpath '//section[2]' input.html 2>/dev/null
<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>

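(xmllint reports a lot of errors about the tags; I suspect it doesn't really support HTML5? In any case, that is why standard error is redirected above.)

An alternative uses hxselect from the W3C's HTML-XML-utils collection. It uses a CSS selector rather than XPath to specify what to extract from the document: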

hxselect 'section:nth-child(2)' < input.html

【Discussion】:

【Solution 3】:

Use awk in paragraph mode (an empty RS makes every blank-line-separated block a single record):

awk -v RS= -v ORS='\n\n' '/UaHaZWvm/' file
<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>

【Discussion】:

【Solution 4】:

gawk '/<section>/,/<\/section>/{ s=s $0; }
      /<\/section>/{ i++; print i, s; s=""; }
      END{ if(s!="") print i,s}' some.html

This prints all the sections, for example:

1 <section><div><header><h2>This is a title (RfQVthHm)</h2></header>More HTML codes...</div></section>
2 <section><div><header><h2>This is a title (UaHaZWvm)</h2></header>More HTML codes...</div></section>
3 <section><div><header><h2>This is a title (vxzbXEGq)</h2></header>More HTML codes...</div></section>

This works using range patterns; see the Patterns section of the gawk or awk man page.

Returning only the second one should be easy...

Edit (based on Ed M.'s comments):

gawk '/<section>/{ i=(i<0?-i:i); i++; }
      /<\/section>/{ i=-i; }
      { a[i]=a[i] $0 }
      END{ print a[2] }' some.html

With grep, you can run grep 'UaHaZWvm' -B2 -A3 some.html, which outputs:

<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>
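
The -B2 -A3 counts only fit sections that are exactly six lines long with the key on the third line. As a hedged sketch of a variant that does not rely on line counts, awk can buffer each section and print it only when it contains the key. The sample file and the function name extract_by_key are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sample file standing in for some.html; the sections
# deliberately have different lengths.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<section>
<div>
<header><h2>This is a title (RfQVthHm)</h2></header>
</div>
</section>
<section>
<div>
<p>extra line before the header</p>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
More HTML codes...
</div>
</section>
EOF

# extract_by_key KEY FILE (hypothetical helper)
extract_by_key() {
  awk -v key="$1" '
    /<section>/   { buf = ""; inside = 1 }   # start a fresh buffer
    inside        { buf = buf $0 ORS }       # collect the lines of this section
    /<\/section>/ {                          # section complete: print it if it holds the key
      if (inside && index(buf, key)) printf "%s", buf
      inside = 0
    }
  ' "$2"
}

extract_by_key UaHaZWvm "$sample"
```

index() does a literal substring search, so the key is not interpreted as a regular expression.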

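【Discussion】:

【Solution 5】:

It is not clear from your question whether you are trying to print the second section (whatever it contains) or the section that contains UaHaZWvm (wherever it appears), so here are solutions for both:

To print the second section (record 3 in paragraph mode, because the block from <!DOCTYPE html> to <body> is record 1):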

$ awk -v RS= -v ORS='\n\n' 'NR==3' file
<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>
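
To see how paragraph mode numbers the records, a small sketch like the following can help. The shortened sample file and the helper name list_records are hypothetical; the point is only that the block before the first blank line counts as record 1.

```shell
#!/bin/sh
# Hypothetical, shortened sample with the same layout as the question's file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<!DOCTYPE html>
<html>
<body>

<section>
<h2>First (RfQVthHm)</h2>
</section>

<section>
<h2>Second (UaHaZWvm)</h2>
</section>

</body>
</html>
EOF

# list_records FILE (hypothetical helper): print each record number
# followed by the first whitespace-separated field of that record.
list_records() {
  awk -v RS= '{ print NR ": " $1 }' "$1"
}

list_records "$sample"
# 1: <!DOCTYPE
# 2: <section>
# 3: <section>
# 4: </body>
```

The second section is therefore record 3, and the numbering shifts if the file gains or loses blank-line-separated blocks before it.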

To print any section that contains UaHaZWvm:

$ awk -v RS= -v ORS='\n\n' '/UaHaZWvm/' file
<section>
<div>
<header><h2>This is a title (UaHaZWvm)</h2></header>
More HTML codes...
</div>
</section>

【Discussion】:

【Solution 6】:

With the samples you have shown, please try the following. Written and tested in GNU awk; it should work in any awk.

awk '
/^<\/section>/{
  if(found1==2 && found2==1){
    print val ORS $0
    exit
  }
  found2++
}
/<section>/{
  found1++
}
found1==2{
  val=(val?val ORS:"")$0
}
'  Input_file

Explanation: a detailed explanation of the above.

awk '                             ##Starting awk program from here.
/^<\/section>/{                   ##Checking condition if line starts from </section> here.
  if(found1==2 && found2==1){     ##Checking condition if found1 is 2 AND found2 is 1 then do following.
    print val ORS $0              ##printing val and the current </section> line, which is not in val yet.
    exit                          ##exiting from program from here.
  }
  found2++                        ##Increasing found2 with 1 here.
}
/<section>/{                      ##Checking condition if line has <section> then do following.
  found1++                        ##Increasing found1 with 1 here.
}
found1==2{                        ##Checking if found1 is 2 then do following.
  val=(val?val ORS:"")$0          ##Creating val and keep adding lines into it.
}
'  Input_file                     ##Mentioning Input_file name here.

【Discussion】: