How to remove `<a href="file://a>`keep this text`</a>` using sed or perl? Answers

【Question title】: How to remove `<a href="file://a>`keep this text`</a>` using sed or perl?
【Posted】: 2021-12-19 01:55:11
【Question description】:

How can I use sed or perl to remove every <a href="file://???">keep this text</a> wrapper, but leave other <a></a> pairs and their </a> tags alone?
Input:

    <p><a class="a" href="file://any" id="b">keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p>

Expected output:

    <p>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>

I have this regex, but it is too greedy and removes every </a>:

gsed -E -i 's/<a*href="file:[^>]*>(.+?)<\/a>/\1>/g' file.xhtml

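【Question comments】:

Consider updating the question with a more representative data set; in particular, you mention "remove all", which implies you may want to remove multiple entries, so a sample showing multiple entries would help. Also, do you want to remove all file: entries or only some of them?

Tags: regex bash macos sed grep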

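【Solution 1】:

Assumptions:

OP does not have access to HTML-centric tools
remove the <a href="file:...">...some_text...</a> wrapper, leaving only ...some_text...
apply this only to file: entries
the input data has no newlines/linefeeds in the middle of a file: entry

Sample data with several file: entries interspersed among some other (nonsense) entries: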

$ cat sample.html
<p><a href="https:/google.com">some text</a><a href="file://any" >keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p><a href="file://anyother" >keep this text,too</a>, last test</p>

One sed idea to remove the wrapper from every file: entry:

sed -E 's|<a[^<>]+file:[^>]+>([^<]+)</a>|\1|g' "${infile}"

Note: some of the [^..] entries may be overkill, but the key goal is to short-circuit sed's default greedy matching...

This leaves:

<p><a href="https:/google.com">some text</a>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>keep this text,too, last test</p>
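
To make the greedy-matching point concrete, here is a minimal sketch comparing the answer's `[^<]+` capture with a plain `(.+)` capture. The input line and the variable names `line`, `greedy`, and `bounded` are made up for illustration; only the second pattern comes from the answer.

```shell
# Hypothetical one-line input: one file: link followed by one http: link.
line='<p><a href="file://a">one</a>, <a href="http://example.com">two</a></p>'

# (.+) is greedy: it runs on to the LAST </a> on the line.
greedy=$(printf '%s\n' "$line" | sed -E 's|<a[^<>]+file:[^>]+>(.+)</a>|\1|g')

# ([^<]+) cannot cross a "<", so the match stops at the first </a>.
bounded=$(printf '%s\n' "$line" | sed -E 's|<a[^<>]+file:[^>]+>([^<]+)</a>|\1|g')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

With the greedy capture the first `</a>` is left behind and the closing tag of the second link is consumed instead; the bounded capture leaves the second link intact. The trade-off is that link text containing nested markup such as `<b>` will not match at all.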

【Comments】:

Your code works for me. Thanks! It works with both macOS sed and GNU sed, and it is the shortest so far.

【Solution 2】:

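One way:

sed -E 's,<a[^>]*?href="file://[^>]*>([^<]*)</a>,\1,g'

<a[^>]*?href="file://[^>]*> matches <a + any number of non-> characters (intended as non-greedy), followed by href="file:// + any number of non-> characters, followed by >
([^<]*) matches and captures any number of non-< characters
</a> matches the literal </a>

Everything matched is replaced by the capture in \1, and the trailing g applies the substitution to every occurrence on each line.

Example: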

$ cat data
<p><a class="a" href="file://any" id="b">keep this text</a>, <a id="file:ex" href="http://example.com/abc">example.com/abc</a>, more text</p>
<p><a href="file://any" class="f">keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p>

$ sed -E 's,<a[^>]*?href="file://[^>]*>([^<]*)</a>,\1,g' < data
<p>keep this text, <a id="file:ex" href="http://example.com/abc">example.com/abc</a>, more text</p>
<p>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>
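
The `id="file:ex"` attribute in the sample data is a useful edge case: only the `href` value should decide whether a link is unwrapped. The sketch below compares a pattern that looks for `file:` anywhere in the opening tag (as in Solution 1) with one anchored on `href="file://`. The variable names are hypothetical, and the `?` after `[^>]*` is dropped on the assumption that the sed in use has no lazy quantifiers in `-E` mode; that should not change the result, because `[^>]` cannot leave the tag.

```shell
# A link whose id contains "file:" but whose href is an ordinary http URL.
line='<a id="file:ex" href="http://example.com/abc">example.com/abc</a>'

# Looks for "file:" anywhere inside the opening tag, so the id triggers it.
anywhere=$(printf '%s\n' "$line" | sed -E 's|<a[^<>]+file:[^>]+>([^<]+)</a>|\1|g')

# Requires the href attribute itself to start with file://, so no match.
href_only=$(printf '%s\n' "$line" | sed -E 's,<a[^>]*href="file://[^>]*>([^<]*)</a>,\1,g')

printf 'anywhere:  %s\n' "$anywhere"
printf 'href_only: %s\n' "$href_only"
```

The first command strips the wrapper from a link that should have been kept, while the second leaves the line unchanged. Which behaviour is acceptable depends on whether such attributes can occur in the real input.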

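【Comments】:

Thanks, this one works too, though only with GNU sed.
@user3464412 You're welcome - however, it should work with any POSIX sed, unless I'm mistaken. I tried sed -E --posix and the result was the same.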

【Solution 3】:

To cover the case where an <a> tag contains multi-line content, how about a perl solution:

perl -0777 -i -pe 's#<a.+?href="?file.+?>(.+?)</a>#$1#gs' file.xhtml

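The -0777 option tells perl to read the whole file at once.
The -i option enables in-place editing.
The s switch at the end of the s operator makes the dot match any character, including a newline.
The regex .+? is the non-greedy version of .+ and is used to get the shortest match.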

【Comments】: