Parse HTML Using AWK: Answers

【Question title】: Parse HTML Using AWK
【Posted】: 2021-06-27 17:30:24
【Question description】:

I have the following HTML structure and want to extract data from it using awk.

<body>
<div>...</div>
<div>...</div>
<div class="body-content">
    <div>...</div>
    <div class="product-list" class="container">
        <div class="w3-row" id="product-list-row">
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product A</div>
                    <div class="product-price">100,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product B</div>
                    <div class="product-price">200,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product C</div>
                    <div class="product-price">300,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product D</div>
                    <div class="product-price">400,56</div>
                </div>
            </div>
        </div>
    </div>
</div>
</body>

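The result I want is as follows.

I am experimenting with the following awk script (I know that selecting product-price twice makes no sense; I was about to modify the script):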

awk -F '<[^>]+>' 'found { sub(/^[[:space:]]*/,";"); print title $0; found=0 } /<div class="product-price">/ { title=$2 } /<div class="product-price">/  { found=1 }'

But it gives me this result:

100,56                </div>
200,56                </div>
300,56                </div>
400,56                </div>

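I have never used awk before, so I can't figure out what is going wrong here or how to modify the code above. How would you do this?

【Discussion】:

Can you use a tool that understands XML instead, e.g. xmlstarlet?
Awk is an excellent tool for many kinds of text searching, but it is not suited to hierarchical structures such as HTML. You would be better off with a tool designed for the job. @Ed Morton's suggestion, xmlstarlet, is a good choice for use from the shell. Alternatively, if you know any scripting language (Perl, Python, Ruby, Javascript, etc.), most of them have installable libraries for HTML parsing.
Actually, GNU awk also has an XML library - see gawkextlib.sourceforge.net/xml/xml.html.
See also: stackoverflow.com/a/1732454/7552
@EdMorton Yes, although the last time I checked, installing gawk add-ons was not as simple as using cpanm, pip, gem, npm, etc.

Tags: html shell awk

【Solution 1】:

The result of a quick Google search for xmlstarlet print div contents, followed by a few seconds of trial and error: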

$ xmlstarlet sel -t -m "//*[@class='product-price']" -v "." -n file
100,56
200,56
300,56
400,56

As for an explanation - ask Google :-).
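For readers who want the explanation anyway: in the command above, `sel` runs a query, `-t` starts an output template, `-m` matches every node selected by the XPath expression, `-v "."` prints the text value of each match, and `-n` adds a newline. The same kind of query can be sketched in Python with the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The `SAMPLE` fragment and the `select_prices` name are illustrative assumptions, not part of the original answer, and like xmlstarlet this approach needs well-formed input.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modelled on the markup in the question.
SAMPLE = """<body>
<div class="product-cell">
    <div class="product-title">Product A</div>
    <div class="product-price">100,56</div>
</div>
<div class="product-cell">
    <div class="product-title">Product B</div>
    <div class="product-price">200,56</div>
</div>
</body>"""


def select_prices(xml_text):
    """Return the text of every element whose class attribute is exactly 'product-price'."""
    root = ET.fromstring(xml_text)
    # './/*' visits all descendants; the predicate keeps exact attribute matches only.
    return [node.text for node in root.findall(".//*[@class='product-price']")]


if __name__ == "__main__":
    for price in select_prices(SAMPLE):
        print(price)
```

Because the parser is strict, input that is not well-formed raises `ET.ParseError` instead of silently producing partial output.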

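【Discussion】:

I just installed xmlstarlet and tried to test it, but unfortunately the server gave me malformed HTML. I'll still upvote your answer!
That is more likely to be a problem for an awk script than for an XML-aware tool. That is why you should use an XML-aware tool.

【Solution 2】:

If anyone is looking for a Python solution, I suggest Python's beautifulsoup library; the following was written and tested in Python 3.8. I am posting it as a separate answer to keep it apart from my previous one.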

#!/usr/bin/env python3
# Import the HTML parsing library.
from bs4 import BeautifulSoup

# Read the whole input file; the with block closes it automatically.
with open('Input_file', 'r') as f:
    contents = f.read()

# Parse the contents using the lxml parser.
soup = BeautifulSoup(contents, 'lxml')

# Select only the div elements whose class is product-price.
vals = soup.find_all("div", {"class": "product-price"})

# Print the text inside each matching tag.
for val in vals:
    print(val.text)

Notes:

BeautifulSoup must be installed for your Python, along with lxml, using pip3 or pip depending on your system.
Input_file is the file from which the program reads all its data.
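If installing third-party packages is not an option, a rough standard-library variant is possible with `html.parser`, which tolerates malformed HTML. The sketch below also pairs each title with its price. The `title;price` output format is only a guess based on the script in the question, and `ProductParser`, `parse_products`, and `SAMPLE` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collect (title, price) pairs from product-title / product-price divs."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished (title, price) pairs
        self._field = None   # "title" or "price" while inside such a div
        self._title = None   # last title seen, waiting for its price

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "product-title" in classes:
            self._field = "title"
        elif tag == "div" and "product-price" in classes:
            self._field = "price"

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "title":
            self._title = text
        else:
            self.rows.append((self._title, text))

    def handle_endtag(self, tag):
        # The target divs hold only text, so any closing tag ends the field.
        self._field = None


def parse_products(html_text):
    """Return a list of (title, price) tuples found in html_text."""
    parser = ProductParser()
    parser.feed(html_text)
    return parser.rows


# Hypothetical sample; the stray <br> shows that unclosed tags are tolerated.
SAMPLE = """<div class="product-cell"><br>
    <div class="product-title">Product A</div>
    <div class="product-price">100,56</div>
</div>
<div class="product-cell">
    <div class="product-title">Product B</div>
    <div class="product-price">200,56</div>
</div>"""

if __name__ == "__main__":
    for title, price in parse_products(SAMPLE):
        print(f"{title};{price}")
```

The parser assumes that the title and price divs contain only text, as in the sample from the question; nested markup inside them would need extra state tracking.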

【Discussion】:

【Solution 3】:

With your shown samples/attempts, please try the following awk code.

awk -F"[><]" '{gsub(/\r/,"")} /^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{print $3}' Input_file

Explanation: a detailed walk-through of the command above. It is for explanation only; to actually run it, use the code above.

awk -F"[><]" '      ## Start the awk program and set the field separator to > or <.
{gsub(/\r/,"")}     ## Remove carriage returns (Control-M) from the line.
/^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{ ## Match lines that start with whitespace
                    ## followed by <div class="product-price"> up to the closing div tag.
  print $3          ## Print the 3rd field, which is the price.
}
' Input_file        ## Name of the input file.

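Following Ed's suggestion in the comments, the regex was changed to /^[ \t]+<div[ \t]+class. Also, experts always advise using xmlstarlet or another XML-aware tool, if anyone has one available on their system.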
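To see why the price ends up in `$3`, it can help to reproduce the field splitting outside awk. The Python sketch below mimics `-F"[><]"` on one line of the sample input; the `split_fields` helper and the `LINE` value are made up for illustration. Note that awk fields are 1-based, so Python index 2 corresponds to `$3`.

```python
import re

# One line of the sample input, with a trailing carriage return added.
LINE = '                    <div class="product-price">100,56</div>\r'


def split_fields(raw_line):
    """Mimic awk -F"[><]": drop carriage returns, then split on < or >."""
    cleaned = raw_line.replace("\r", "")
    return re.split(r"[><]", cleaned)


fields = split_fields(LINE)
# fields[0] is the leading whitespace             (awk $1)
# fields[1] is the tag: div class="product-price" (awk $2)
# fields[2] is the text between the tags          (awk $3)

if __name__ == "__main__":
    for number, value in enumerate(fields, start=1):
        print(f"${number} = {value!r}")
```

This only works because each price sits on a single line with its opening and closing tags; the same markup reflowed across lines would break the split, which is the usual argument for a real parser.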

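【Discussion】:

@RavinderSingh13, nice catch! The file contains Control-M characters.
Control-Ms in the input would not cause Ravinder's original script to produce no output; it works fine either way, because it does nothing with the character at the end of each line.
Reading the tea leaves - Control-Ms are not the problem.
@Javiator If you didn't make a mistake copying/pasting the script, and your actual input really looks like the sample you provided, then my best guess is that a) what follows div is not a blank or, more likely, b) the awk you are using doesn't understand character classes. Try changing /^[[:space:]]+<div class to /^[ \t]+<div[ \t]+class.
@EdMorton, by changing /^[[:space:]]+<div class to /^[ \t]+<div[ \t]+class, I now get the desired output! Thanks!

【Solution 4】:

It baffles me that people try again and again to parse HTML not with an HTML parser but with a tool that doesn't understand HTML at all, regex in particular!
With an HTML parser like xidel it's simple: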

$ xidel -s "<url> or input.html" -e '//div[@class="product-price"]'

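【Discussion】:

【Solution 5】:

How would you do this?

If possible, use a tool designed for processing HTML, which GNU AWK is not.

If you are allowed to install software, use hxselect; it processes standard input and understands (a subset of) CSS selectors, so in this case something like: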

hxselect -i -c -s '\n' div.product-price < file.html

should give you the desired result (disclaimer: I am not able to test it)

【Discussion】: