Extracting the contents between two different strings using bash or perl: answers

【Question title】: Extracting the contents between two different strings using bash or perl
【Posted】: 2015-01-23 10:02:18
【Question description】:

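I have looked through other posts on Stack Overflow but could not get my code to work, so I am posting a new question.

Below is the content of the file temp.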

 <?xml version="1.0" encoding="UTF-8"?>
 <env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/"><env:Body><dp:response xmlns:dp="http://www.datapower.com/schemas/management"><dp:timestamp>2015-01-
 22T13:38:04Z</dp:timestamp><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response></env:Body></env:Envelope>

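This file contains the base64-encoded contents of two files named test.txt and test1.txt. I want to extract the base64-encoded content of each file and write it to separate files, test.txt and test1.txt.

To do this, I have to remove the XML tags surrounding the base64 content. I am trying the commands below to achieve this. However, they are not working as expected.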

sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's@<dp:file name="temporary://test.txt">@@g'|perl -p -e 's@</dp:file>@@g' > test.txt

sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's@<dp:file name="temporary://test1.txt">@@g'|perl -p -e 's@</dp:file></dp:response></env:Body></env:Envelope>@@g' > test1.txt

The command below:

sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's@<dp:file name="temporary://test.txt">@@g'|perl -p -e 's@</dp:file>@@g'

produces the output:

 XJzLXJlc3VsdHMtYWN0aW9uX18i

<dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:response>   </env:Body></env:Envelope>

However, I expect only the first line, XJzLXJlc3VsdHMtYWN0aW9uX18i, in the output. Where am I going wrong?

When I run the command below, I get the expected output:

sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's@<dp:file name="temporary://test1.txt">@@g'|perl -p -e 's@</dp:file></dp:response></env:Body></env:Envelope>@@g'

It produces the string below:

lc3VsdHMtYWN0aW9uX18i

I can then easily redirect it to the file test1.txt.
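The difference between the two runs above is consistent with how sed range addresses behave: `/start/,/end/p` selects whole lines, never part of a line. A minimal reproduction, assuming both dp:file elements sit on one line (AAA and BBB are placeholder payloads, not data from the question, and plain `>` is used instead of `\>` to keep the patterns portable):

```shell
# Both <dp:file> elements on a single line; AAA and BBB are placeholder payloads.
line='<dp:file name="temporary://test.txt">AAA</dp:file><dp:file name="temporary://test1.txt">BBB</dp:file>'

# The range opens on the line matching the first pattern and prints that line
# in full, so the second element comes along with the first.
range_out=$(printf '%s\n' "$line" | sed -n '/test\.txt">/,/<\/dp:file>/p')
printf '%s\n' "$range_out"

# Stripping only the opening tag of test.txt and every </dp:file> closing tag
# leaves the opening tag and payload of test1.txt behind.
stripped=$(printf '%s\n' "$range_out" |
    sed -e 's@<dp:file name="temporary://test\.txt">@@g' -e 's@</dp:file>@@g')
printf '%s\n' "$stripped"
```

The test1.txt command only looks correct because its substitutions happen to remove everything that follows the last payload on the line.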

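Update

I have edited the question to update the content of the source file. The source file does not contain any line breaks. In that case the current solutions do not work; I tried them and they failed. wc -l temp must output 1.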

OS: Solaris 10 Shell: bash

【Comments】:

So you don't want this lc3VsdHMtYWN0aW9uX18i either?
Right, I don't want anything except XJzLXJlc3VsdHMtYWN0aW9uX18i
This should work: awk 'match($0,/dp:file name="([^"]+)">([^<]+)</,a){print a[1] > a[2]}' file
I have updated my question to make my requirement clearer
awk: syntax error near line 1 awk: bailing out near line 1 I get the above error if I use your code.
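The syntax error reported above is what one would expect from an awk that lacks the three-argument form of match(), which is a GNU awk extension. A sketch limited to POSIX awk features (index, substr, length), assuming an awk that accepts -v, such as nawk or /usr/xpg4/bin/awk on Solaris; the function name extract_with_awk is made up for illustration:

```shell
# Hypothetical helper: read XML on stdin and print the content of the
# <dp:file> element whose name is temporary://$1.
extract_with_awk() {
    awk -v name="$1" '
    {
        open_tag = "<dp:file name=\"temporary://" name "\">"
        start = index($0, open_tag)          # literal search, no regex escaping needed
        if (start == 0) next                 # element not on this line
        rest = substr($0, start + length(open_tag))
        stop = index(rest, "</dp:file>")
        if (stop > 0) print substr(rest, 1, stop - 1)
    }'
}

line='<dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file>'
printf '%s\n' "$line" | extract_with_awk test.txt
printf '%s\n' "$line" | extract_with_awk test1.txt
```

Because index() searches for a literal string, the dot in test.txt cannot accidentally match another character.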

Tags: perl shell awk sed grep

【Solution 1】:

sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\1 -> \2_p' temp

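I added \1 -> to show the link between the file name and its content; to get the content only, just remove that part
POSIX version (with GNU sed, use --posix)
Assumes the base64-encoded content is on the same line as the surrounding tags (not spread over several lines, in which case some modification is needed)

Thanks to JID for the full explanation below

How it works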

sed -n

-n means no automatic printing, so sed produces no output unless it is explicitly told to print

's_

This starts a substitution on the regex that follows, with _ as the delimiter that separates the regex from the replacement.

<dp:file name=

Plain text

"\([^"]*\)"

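The parentheses form a capture group and must be escaped unless the -r option is used (-r is not available in POSIX sed). Everything inside the parentheses is captured. [^"]* means any character that is not a quote, 0 or more times. So in effect this captures whatever is between the two quotes.

>\([^<]*\)

Again a capture group, this time capturing everything between > and the next <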

.*

Everything else on the line

_\1 -> \2

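This is the replacement: everything matched by the preceding regex is replaced with the first capture group, then ->, then the second capture group.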

_p

Means print the line (only when the substitution was made)
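The pieces above can be tried on a single sample line. A minimal sketch, assuming one dp:file element per line and using the first payload from the question:

```shell
# One element on the line, as the answer assumes.
line='<dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file>'

# \1 is the name attribute, \2 is the text up to the next "<".
pair=$(printf '%s\n' "$line" |
    sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\1 -> \2_p')
printf '%s\n' "$pair"

# Dropping "\1 -> " from the replacement keeps only the content.
content=$(printf '%s\n' "$line" |
    sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\2_p')
printf '%s\n' "$content"
```

The pattern is not anchored, so any text that precedes `<dp:file` on the same line is left in the output; a leading `.*` in the regex would consume it.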

Resources

http://unixhelp.ed.ac.uk/CGI/man-cgi?sed

http://www.grymoire.com/Unix/Sed.html

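【Comments】:

Could you please explain this code to me? What exactly does this regex do?
I edited the answer to add an explanation, hope you don't mind :) I also changed </dp:file> to .* so that the rest of the line is replaced as well :)
@JID I agree, very proactive of you to explain and improve the code
Thanks for the explanation. But this command only prints two lines. As per my question, I have to send the base64-encoded strings to files. The base64 string associated with test.txt must go to the file test.txt, and the string associated with test1.txt must go to the file test1.txt
OK, I achieved this by tweaking the command a little... I am using the following to get the base64 content of the file 'test.txt': sed -n 's_<dp:file name="temporary://test.txt">\([^<]*\).*_\1_p' temp and the following to get the base64 content of test1.txt: sed -n 's_<dp:file name="temporary://test1.txt">\([^<]*\).*_\1_p' temp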
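The two per-file commands above can be folded into a loop that writes each payload to a file of the same name. This is a sketch, not the poster's code. It assumes both elements are on one line, names contain no `_` and no regex-special characters other than the dot, and the sample XML is abbreviated from the question; the helper name extract_file is hypothetical.

```shell
# Hypothetical helper: print the content of the <dp:file> element named
# temporary://$1 found in file $2. The leading and trailing .* discard the
# rest of the line. The dot in a name such as test.txt acts as a regex
# wildcard here, which is acceptable for this sketch.
extract_file() {
    sed -n 's_.*<dp:file name="temporary://'"$1"'">\([^<]*\)</dp:file>.*_\1_p' "$2"
}

workdir=$(mktemp -d)
printf '%s\n' '<dp:response><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response>' > "$workdir/temp"

# One output file per element name.
for name in test.txt test1.txt; do
    extract_file "$name" "$workdir/temp" > "$workdir/$name"
done

first=$(cat "$workdir/test.txt")
second=$(cat "$workdir/test1.txt")
printf '%s\n%s\n' "$first" "$second"
rm -rf "$workdir"
```

Matching the closing `</dp:file>` explicitly and adding `.*` on both sides means the result does not depend on where the element sits on the line.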

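【Solution 2】:

/usr/xpg4/bin/sed works well here.

If the file contains only 1 line, /usr/bin/sed does not work as expected.

The command below works for a file that contains only a single line.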

/usr/xpg4/bin/sed -n 's_<env:Envelope\(.*\)<dp:file name="temporary://BackUpDir/backupmanifest.xml">\([^>]*\)</dp:file>\(.*\)_\2_p' securebackup.xml 2>/dev/null

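Without 2>/dev/null, this sed command prints the warning sed: Missing newline at end of file.

This is because of the following:

The default Solaris sed ignores a last line that has no terminating newline, so as not to break existing scripts, because the original Unix implementation required every line to be terminated by a newline.

GNU sed is more lenient, and the POSIX implementation accepts such a line but prints a warning.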
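One way to sidestep the missing-newline behavior, whichever sed is in use, is to supply the newline before sed reads the data. The sketch below is not part of the answer. It assumes a file shaped like the one in the question (an XML declaration line followed by one unterminated line, abbreviated here), and whether /usr/bin/sed then copes with very long lines is not verified.

```shell
workdir=$(mktemp -d)

# Two lines, the second without a terminating newline, so wc -l reports 1.
printf '%s\n%s' '<?xml version="1.0" encoding="UTF-8"?>' \
    '<dp:response><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response>' \
    > "$workdir/temp"
newlines=$(wc -l < "$workdir/temp")
echo "newline count: $newlines"

# The command group appends the missing newline on the fly, so sed sees a
# properly terminated last line and has nothing to warn about.
content=$( { cat "$workdir/temp"; echo; } |
    sed -n 's_.*<dp:file name="temporary://test1.txt">\([^<]*\)</dp:file>.*_\1_p' )
printf '%s\n' "$content"
rm -rf "$workdir"
```

The original file is left unmodified; only the stream handed to sed gains the trailing newline.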

【Comments】: