A regular expression to exclude a specific string '[[' via sed: answers

【Question title】: A regular expression to exclude a specific string '[[' via sed
【Posted】: 2015-09-28 12:03:55
【Question】:

I need to use sed to extract the string between '[[' and ']]' in a file, response.txt:

x-content-type-options: nosniff
x-server-response-time: 63
x-dropbox-request-id: 84e52618f83eda15cb6d96eb4f601f45
pragma: no-cache
cache-control: no-cache
x-dropbox-http-protocol: None
x-frame-options: SAMEORIGIN

{"has_more": false, "cursor": "AAEynx2q5KMgkcOwL2dKZ4MCYxNTtsdA950A5kYOdjWFln_RYuAokMnJCOb85B7idOHjycS8LJye3BhWfezTkkoprVxhgMNni_Bg04A-JO9fLmqIGO3CYInBQPmNUXL57S32ECWwA-CYu1CiLi5ujTDz", "entries": [["/test", {"rev": "b1e9026cf6f4", "thumb_exists": false, "path": "/TEST", "is_dir": true, "icon": "folder", "read_only": false, "modifier": null, "bytes": 0, "modified": "Fri, 22 May 2015 05:53:27 +0000", "size": "0 bytes", "root": "dropbox", "revision": 45545}], ["/TEST/test-file-01", {"rev": "b1ed026cf6f4", "thumb_exists": false, "path": "/test/test-file-01", "is_dir": true, "icon": "folder", "read_only": false, "modifier": null, "bytes": 0, "modified": "Fri, 22 May 2015 06:15:33 +0000", "size": "0 bytes", "root": "dropbox", "revision": 45549}]], "reset": true}

I want to use the sed command to extract the string, so that the result is:

[["/test", {"rev": "b1e9026cf6f4", "thumb_exists": false, "path": "/TEST", "is_dir": true, "icon": "folder", "read_only": false, "modifier": null, "bytes": 0, "modified": "Fri, 22 May 2015 05:53:27 +0000", "size": "0 bytes", "root": "dropbox", "revision": 45545}], ["/TEST/test-file-01", {"rev": "b1ed026cf6f4", "thumb_exists": false, "path": "/test/test-file-01", "is_dir": true, "icon": "folder", "read_only": false, "modifier": null, "bytes": 0, "modified": "Fri, 22 May 2015 06:15:33 +0000", "size": "0 bytes", "root": "dropbox", "revision": 45549}]]

I ran this command in the terminal:

$ sed -n 's/.*"entries": *\(\[\[.*\]\]\)/\1/p' /tmp/response.txt

and got this result:

[["/test", {"rev": "b1e9026cf6f4", "thumb_exists": false, "path": "/TEST", "is_dir": true, "icon": "folder", "read_only": false, "modifier": null, "bytes": 0, "modified": "Fri, 22 May 2015 05:53:27 +0000", "size": "0 bytes", "root": "dropbox", "revision": 45545}], ["/TEST/test-file-01", {"rev": "b1ed026cf6f4", "thumb_exists": false, "path": "/test/test-file-01", "is_dir": true, "icon": "folder", "read_only": false, "modifier": null, "bytes": 0, "modified": "Fri, 22 May 2015 06:15:33 +0000", "size": "0 bytes", "root": "dropbox", "revision": 45549}]], "reset": true}

Then I ran this command in the terminal:

$ sed -n 's/.*"entries": *\(\[\[(?!\]\].)*\]\]\)/\1/p' /tmp/response.txt

It returns nothing.

It seems I wrote the regular expression incorrectly. What can I do? Thanks!

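【Comments】:

I can't see the difference between the two commands you typed in the terminal. Are you saying that running it twice does not produce the same result?
@RenaudPacalet Sorry, I have updated the second command!

Tags: regex bash sed regex-negation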

【Answer 1】:

Avoid parsing JSON with regular expressions. Use a proper parser.

If you have jq installed:

awk -v RS="" "END {print}" response.txt | jq -c '.["entries"]'

[["/test",{"revision":45545,"root":"dropbox","size":"0 bytes","modified":"Fri, 22 May 2015 05:53:27 +0000","rev":"b1e9026cf6f4","thumb_exists":false,"path":"/TEST","is_dir":true,"icon":"folder","read_only":false,"modifier":null,"bytes":0}],["/TEST/test-file-01",{"revision":45549,"root":"dropbox","size":"0 bytes","modified":"Fri, 22 May 2015 06:15:33 +0000","rev":"b1ed026cf6f4","thumb_exists":false,"path":"/test/test-file-01","is_dir":true,"icon":"folder","read_only":false,"modifier":null,"bytes":0}]]

Or Ruby:

ruby -rjson -e '
    data = (File.readlines(ARGV.shift))[-1]
    json = JSON.parse(data)
    puts JSON.generate(json["entries"])
' response.txt

[["/test",{"rev":"b1e9026cf6f4","thumb_exists":false,"path":"/TEST","is_dir":true,"icon":"folder","read_only":false,"modifier":null,"bytes":0,"modified":"Fri, 22 May 2015 05:53:27 +0000","size":"0 bytes","root":"dropbox","revision":45545}],["/TEST/test-file-01",{"rev":"b1ed026cf6f4","thumb_exists":false,"path":"/test/test-file-01","is_dir":true,"icon":"folder","read_only":false,"modifier":null,"bytes":0,"modified":"Fri, 22 May 2015 06:15:33 +0000","size":"0 bytes","root":"dropbox","revision":45549}]]

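Or any language of your choice that has a JSON parser.

【Comments】:

Thanks for your answer. I need to store the string (JSON) in a variable in a shell script, so I wanted to use the sed command to get the string.

【Answer 2】: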

sed recognizes POSIX regular expressions, which do not include lookaround assertions such as (?!.
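The following minimal sketch illustrates why the lookahead attempt printed nothing. It assumes GNU sed and uses a shortened, made-up sample line; the variable names `line` and `out` are hypothetical. In a basic regular expression the characters of `(?!` are matched literally, so the pattern can only match a line that actually contains that text.

```shell
#!/bin/sh
# Hypothetical, shortened sample line (not the original response).
line='{"entries": [["/a"]], "reset": true}'

# In a BRE, "(", "?", "!" and ")" are ordinary characters, so "(?!" is
# searched for literally instead of acting as a negative lookahead.
out=$(printf '%s\n' "$line" | sed -n 's/.*"entries": *\(\[\[(?!\]\].)*\]\]\)/\1/p')

# The line contains no literal "(?!", so the substitution never fires
# and -n suppresses all output.
printf 'output: [%s]\n' "$out"
```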

Fortunately, a regular expression for this simple case is easy to write (although, as usual, it is not so easy to read):

sed -n 's/.*"entries": *\(\[\[\(]\?[^]]\)*]]\)/\1/p' /tmp/response.txt

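However, greedy matching was not what caused the problem in your original attempt. The problem was that you did not discard the rest of the line after the match. What you want is: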

sed -n 's/.*"entries": *\(\[\[\(]\?[^]]\)*]]\).*/\1/p' /tmp/response.txt
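As a rough illustration of the difference the trailing `.*` makes, the sketch below runs both commands on a shortened, made-up line. The sample data and the variable names are assumptions, and GNU sed is assumed because `\?` is a GNU extension to basic regular expressions.

```shell
#!/bin/sh
# Hypothetical, shortened sample line (not the real Dropbox response).
line='{"has_more": false, "entries": [["/a", {"n": 1}], ["/b", {"n": 2}]], "reset": true}'

# Without a trailing .*, only the matched part is replaced; the tail of the
# line after ]] is left in place.
without_tail=$(printf '%s\n' "$line" |
  sed -n 's/.*"entries": *\(\[\[\(]\?[^]]\)*]]\)/\1/p')

# With a trailing .*, the rest of the line is part of the match and is dropped.
with_tail=$(printf '%s\n' "$line" |
  sed -n 's/.*"entries": *\(\[\[\(]\?[^]]\)*]]\).*/\1/p')

printf '%s\n' "$without_tail"
printf '%s\n' "$with_tail"
```

The first line printed still ends in `, "reset": true}`, while the second ends at the closing `]]`.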

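Because sed uses "basic" POSIX regular expressions (BRE), you end up writing a lot of backslashes. I have tried to remove at least some of them by relying on the fact that ] is not special in a regular expression unless it is closing a character class. Overall, though, I think your needs would be better served by grep, which has a POSIX-standard option for "extended" (normal) regular expressions (ERE), as well as an option to print only the matching string: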

grep -oE '"entries": \[\[(]?[^]])*]]' /tmp/response.txt | cut -d ' ' -f2-

(The cut at the end removes the "entries": prefix.)
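Since the asker wants the result in a shell variable, here is a minimal sketch that wraps the grep pipeline in command substitution. The temporary file stands in for /tmp/response.txt, and its contents and the variable names are made up for illustration. It assumes a grep that supports `-o`, which GNU and BSD grep do although POSIX does not require it.

```shell
#!/bin/sh
# Hypothetical stand-in for /tmp/response.txt:
# one header line, a blank line, then a single JSON line.
tmpfile=$(mktemp)
printf '%s\n' 'pragma: no-cache' '' \
  '{"has_more": false, "entries": [["/a", {"n": 1}], ["/b", {"n": 2}]], "reset": true}' \
  > "$tmpfile"

# -o prints only the matched text; -E selects extended regular expressions.
# cut then drops the first space-separated field, i.e. the "entries": prefix.
entries=$(grep -oE '"entries": \[\[(]?[^]])*]]' "$tmpfile" | cut -d ' ' -f2-)

rm -f "$tmpfile"
printf '%s\n' "$entries"
```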

Explanation of the regular expression

The regular expression (in ERE form) consists of:

\[\[         match [[
(
  ]?         possibly a single ]
  [^]]       anything but a ]
)*           repeated as many times as necessary
]]           match ]]

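The repeated group matches either a ] followed by anything but a ], or anything other than a ]. In effect, it is (almost) the negation of ]].

(It is not a complete negation, because it will not match a single ] at the end of the string. That does not matter here, because we insist that the group be followed by the closing ]], so it can never reach the end of the string.)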
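To make the effect of the repeated group concrete, the sketch below compares it with a plain greedy `.*` on a made-up line that contains two separate [[...]] runs. The sample text and variable names are assumptions, and a grep with `-o` support is assumed.

```shell
#!/bin/sh
# Hypothetical line with two separate [[...]] runs; the first one also
# contains a single ] in the middle.
line='x [[a]b]] y [[c]] z'

# Greedy .* runs on to the last ]] on the line, so both runs are merged.
greedy=$(printf '%s\n' "$line" | grep -oE '\[\[.*]]')

# The repeated group cannot step over ]], so each match ends at the first ]]
# that follows its opening [[. A lone ] inside the run is still accepted.
negated=$(printf '%s\n' "$line" | grep -oE '\[\[(]?[^]])*]]')

printf 'greedy:\n%s\n' "$greedy"
printf 'negated:\n%s\n' "$negated"
```

On the sample data from the question the two patterns give the same result, because the line contains only one ]]; the difference appears only when a later ]] follows on the same line.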

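【Comments】:

Thanks for your help. The key point for me: ] is not special in a regular expression unless it is closing a character class. I learned that from you :)

【Answer 3】:

This might work for you (GNU sed):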

sed '/\n/!{s/\[\[/\n&/g;s/\]\]/&\n/g};/^\[\[/P;D' file

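If the pattern space does not contain a newline, prepend a newline to every [[ string and append a newline to every ]] string. If the pattern space begins with [[, print up to the following newline (or the end of the pattern space). Delete up to the next newline (or the end of the pattern space) and repeat until the pattern space is empty.

Note that this only prints the strings between newlines that begin and end with the required strings ([[ and ]]).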
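A small sketch of how this behaves on input with more than one [[...]] run per line. The two input lines and the variable name `found` are made up, and GNU sed is assumed, since `\n` in the replacement text is not guaranteed by POSIX.

```shell
#!/bin/sh
# Hypothetical input: one line with two [[...]] runs, one line with none.
# Assumes GNU sed (\n in the replacement text is relied upon here).
found=$(printf '%s\n' 'x [[a]] y [[b]] z' 'no brackets here' |
  sed '/\n/!{s/\[\[/\n&/g;s/\]\]/&\n/g};/^\[\[/P;D')

# Each [[...]] run is printed on its own line; everything else is deleted
# by D without ever being printed.
printf '%s\n' "$found"
```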

【Comments】:

【Answer 4】:

Try:

sed -n 's/.*"entries": *\(\[\[.*\]\]\).*/\1/p'

instead (note the .* at the end of the pattern).

【Comments】: