How to extract href and the linked text or tag out of an <a> tag in the shell? (Answers)

【Question title】: How to extract href and the linked text or tag out of an <a> tag in the shell?
【Posted】: 2021-12-02 05:27:46
【Question】:

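I have a lot of HTML files with a lot of different content, and I always extract specific parts of them with a command-line tool called pup. The extract sometimes contains tags that look like this: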

<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

... or like this:

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>

... or even like this:

<a class="someclasses"
    href="mailto:this.is.an@email.com" js-class>
    email

</a>

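What I want to do is ...

... extract the href value and the anchor text (the text between <a ...> and </a>).
... put the two extracts on separate lines, but in reverse order: first the text, then the href value.
... put the three characters "=> " (the third is a space) in front of each href value.

So the result looks like this: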

Visit Duck Duck Go!
=> https://www.duckduckgo.com

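If everything is on one line, as in the first example, I can get what I want with a few chained sed commands and some regular expressions, by creating groups/patterns and swapping the order in which they are printed. But if the anchor tag is spread over multiple lines, I don't know how to get what I want. I tried to reach my goal with sed alone, but had no luck. Yesterday I read other people's similar questions, which said that sed is not suited to working across line breaks. Is that true? Can awk do this? Is there another tool I could use?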

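【Question comments】:

Don't Parse XML/HTML With Regex. I suggest using an XML/HTML parser (xmlstarlet, xmllint ...).
I have a feeling pup should be able to do this; if not, you can always use pup to convert to JSON and then use something like jq to extract it robustly.

Tags: regex bash shell awk sed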

【Answer 1】:

I'll assume that the output of pup is well-formed XML, like this:

<root>
<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class="x">
Visit Duck Duck Go!
</a>

<a class="someclasses"
    href="mailto:this.is.an@email.com" js-class="x">
  email
</a>
</root>

This means you need a root element, such as the root tag in this example, and every attribute must have a value, which is why I changed js-class to js-class="x".
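The preparation described above could be scripted roughly as follows. This is only a sketch under assumptions: the function name to_xml is hypothetical, it patches just the one valueless attribute seen in the question (js-class), and it does not make arbitrary HTML well-formed.

```shell
# Hypothetical helper: read an HTML fragment on stdin, wrap it in <root>,
# and give the valueless js-class attribute a dummy value.
to_xml() {
    printf '<root>\n'
    # Only a bare js-class followed by whitespace or ">" is rewritten;
    # an existing js-class="..." is left alone.
    sed -E 's/([[:space:]])js-class([[:space:]>])/\1js-class="x"\2/g'
    printf '</root>\n'
}

# Demonstration on one of the anchors from the question.
to_xml <<'EOF'
<a class="someclasses"
    href="mailto:this.is.an@email.com" js-class>
  email
</a>
EOF
```

Redirecting the function's output to a file (for example input.xml) gives a document that an XML tool can read.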

The xmlstarlet command that extracts what you want is:

xmlstarlet sel -t -m "//a" -v "normalize-space()" -n -o "== " -v "@href" -n input.xml

The output for the input above is:

anchor text
== https://www.stackoverflow.com
Visit Duck Duck Go!
== https://www.duckduckgo.com
email
== mailto:this.is.an@email.com

Since xmlstarlet cannot output >, as far as I know, you may want to turn the == string into => by adding a filter at the end of the command, like this:

xmlstarlet sel -t -m "//a" -v "normalize-space()" -n -o "== " -v "@href" -n input.xml | 
sed 's/==/=>/'

This gives the final result:

anchor text
=> https://www.stackoverflow.com
Visit Duck Duck Go!
=> https://www.duckduckgo.com
email
=> mailto:this.is.an@email.com

But once again: do not use regex and sed to process HTML files.

【Comments】:

【Answer 2】:

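Using GNU awk for multi-char RS, the third arg to match(), and the \s/\S shorthand. Note that the script appends a literal "!" to every anchor text; drop the "!" from the print statement to get the text unchanged, as the question asks:

$ cat tst.awk
# Treat each closing </a> as the record separator, so one record holds one anchor.
BEGIN { RS="</a>" }
# a[1] = href value, a[2] = anchor text with surrounding whitespace trimmed.
match($0,/<a[^>]+href="([^"]+).*>\s*(\S.*\S)/,a) {
    print a[2] "!" ORS "=> " a[1]
}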

For example, given this input file:

$ cat file
The extract contains sometimes tags which can look like this:

<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

... or like this:

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>

... or even like this:

<a class="someclasses"
    href="mailto:this.is.an@email.com" js-class>
    email

</a>

$ awk -f tst.awk file
anchor text!
=> https://www.stackoverflow.com
Visit Duck Duck Go!!
=> https://www.duckduckgo.com
email!
=> mailto:this.is.an@email.com
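
For awk implementations without the three-argument match(), the same idea can be sketched with RSTART/RLENGTH and substr(). This is an illustrative variant, not the answerer's code: the function name extract_links is hypothetical, it prints the anchor text without the extra "!", and it still assumes an awk that accepts a multi-character RS (gawk and mawk do; POSIX leaves this unspecified). Like any regex approach, it only handles simple anchors with a double-quoted href and no nested tags.

```shell
# Hypothetical helper: print each anchor text, then "=> " and the href.
# Reads the files given as arguments, or stdin if there are none.
extract_links() {
    awk '
    BEGIN { RS = "</a>" }                             # one record per anchor
    {
        if (!match($0, /<a[ \t\n][^>]*>/)) next       # locate the opening <a ...> tag
        tag  = substr($0, RSTART, RLENGTH)
        text = substr($0, RSTART + RLENGTH)           # everything after the opening tag
        if (!match(tag, /href="[^"]*"/)) next         # skip anchors without a quoted href
        href = substr(tag, RSTART + 6, RLENGTH - 7)   # strip href=" and the closing quote
        gsub(/[ \t\n]+/, " ", text)                   # collapse whitespace runs
        sub(/^ /, "", text); sub(/ $/, "", text)      # trim both ends
        print text
        print "=> " href
    }' "$@"
}

# Demonstration on two anchors from the question.
extract_links <<'EOF'
<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

<a class="someclasses"
    href="mailto:this.is.an@email.com" js-class>
    email

</a>
EOF
```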

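【Comments】:

【Answer 3】:

"If everything is on one line, as in the first example, I can get what I want with a few chained sed commands and some RegEx, by creating groups/patterns and swapping the order in which they are printed. But if the anchor tag is spread over multiple lines, I don't know how to get what I want."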

If you need to keep what you already have, consider removing the newlines before the actual processing, for example with tr (translate or delete characters).
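
A minimal sketch of that suggestion, assuming GNU sed (for \n in the replacement text): tr joins everything into one line, a first sed puts each anchor on its own line again, and a second sed swaps text and href. The function name flatten_and_extract and the regular expression are illustrative only; anchors containing nested tags or single-quoted href values are silently skipped, which is one reason the parser-based answers are more robust.

```shell
# Hypothetical helper: read HTML on stdin, print text and "=> href" per anchor.
flatten_and_extract() {
    # 1. Replace every newline with a space so each anchor is on one line.
    tr '\n' ' ' |
    # 2. Break the single line after each closing </a> (GNU sed: \n = newline).
    sed -E 's/<\/a>/&\n/g' |
    # 3. Capture href (\1) and trimmed anchor text (\2), print them swapped.
    sed -nE 's/.*<a[[:space:]][^>]*href="([^"]*)"[^>]*>[[:space:]]*([^<]*[^<[:space:]])[[:space:]]*<\/a>.*/\2\n=> \1/p'
}

# Demonstration on the multi-line anchor from the question.
flatten_and_extract <<'EOF'
<a class="someclasses"
    href="mailto:this.is.an@email.com" js-class>
    email

</a>
EOF
```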

【Comments】:

This should be a comment, not an answer.

【Answer 4】:

You can parse HTML fragments with xmllint and an XPath expression:

frag=$(cat <<EOF
<div>
<a class="someclasses"
    href="mailto:this.is.an@email.com" js-class>
    email

</a>
<a class="someclasses"
    href="http://example.com">
    URL

</a>
<a class="someclasses"
    href="http://example.com/2">
    URL 2

</a>
</div>
EOF
)


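# This loop expects xmllint to print one node per line:
# href="..." attribute lines and anchor text lines.
while read -r line; do
    # %%=* strips from the first "=", so URLs containing "=" are still recognized.
    if [ "${line%%=*}" = 'href' ]; then
        url=$(tr -d '"' <<<"${line#*=}")   # remember the URL without its quotes
    elif [ -n "$line" ]; then
        echo "$line"                       # anchor text, then the remembered URL
        echo "=> $url"
    fi
done < <(echo "$frag" | xmllint --recover --html --xpath "//a/text()| //a/@href" -)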

Result:

email
=> mailto:this.is.an@email.com
URL
=> http://example.com
URL 2
=> http://example.com/2

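xmllint can also be used to parse HTML files directly.

【Comments】:

【Answer 5】:

You can try this bash script, although it may not be as efficient as the tools mentioned in the comments.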

$ cat input_file
<a class="someclasses"
    href="mailto:this.is.an@email.com" js-class>
    email

</a>

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>

<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

#!/usr/bin/env bash
# Requires bash 4+ for mapfile.

f=input_file

# Anchor texts: strip the tags from one-line anchors, keep only tag-free lines,
# then drop blank lines and trim surrounding whitespace.
mapfile -t first < <(sed -En 's/<.*>(.*)<.*>/\1/g;/<|>/!p' "$f" |
    sed -E '/^[[:space:]]*$/d; s/^[[:space:]]+//; s/[[:space:]]+$//')
# href values, in document order.
mapfile -t second < <(sed -En 's|.*href="(.[^ ]*)".*|\1|p' "$f")

# Print each text followed by its href, pairing the two arrays by index.
for i in "${!first[@]}"; do
    printf '%s\n=> %s\n' "${first[i]}" "${second[i]}"
done

Output

email
=> mailto:this.is.an@email.com
Visit Duck Duck Go!
=> https://www.duckduckgo.com
anchor text
=> https://www.stackoverflow.com

【Comments】: