Converting wget stdout and stderr to CSV for analysis

【Question title】: Converting wget stdout and stderr to CSV for analysis
【Posted】: 2021-12-31 00:21:39
【Question】:

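I'm using wget to sanity-check a set of websites while they go through a major upgrade that includes migrating databases. The old and new versions are called V1 and V2. The sites are built on a heavily modified version of the Wagtail CMS.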

One thing I need to do is confirm that old URLs from V1 redirect correctly to the new URLs in V2. My approach is a bash script that does the following:

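1. Use wget to fully mirror the user-facing V1 site
2. Build the set of V1 URLs from the resulting on-disk file structure using du, sed and grep
3. Filter out URLs that do not need a redirect in V2
4. Use wget to fetch those URLs sequentially from the V2 site, i.e. replace the original sitedomain.com with stagingsitedomain.com but keep the rest of the URL unchanged
5. Convert the resulting stdout/stderr to CSV format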

With the CSV, I will analyze the resulting stdout/stderr to determine whether any URLs are not redirecting correctly.

I'm on the last step, step 5. Here is sample stdout/stderr from fetching three files:

--2021-12-28 17:58:42--  https://stagingsitedomain.com/documents/5/document5.pdf
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2021-12-28 17:58:42 ERROR 404: Not Found.

--2021-12-28 17:58:42--  https://stagingsitedomain.com/documents/5
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /documents/5/ [following]
--2021-12-28 17:58:42--  https://stagingsitedomain.com/documents/5/
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 302 Found
Location: /en/documents/5/ [following]
--2021-12-28 17:58:42--  https://stagingsitedomain.com/en/documents/5/
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2021-12-28 17:58:43 ERROR 404: Not Found.

--2021-12-28 17:58:43--  https://stagingsitedomain.com/documents/9/document9.pdf
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2021-12-28 17:58:43 ERROR 404: Not Found.

I'd like to convert it to CSV format like this:

https://stagingsitedomain.com/documents/5/document5.pdf,2021-12-28 17:58:42,https://stagingsitedomain.com/documents/5/document5.pdf
https://stagingsitedomain.com/documents/5/document5.pdf,2021-12-28 17:58:42,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5/document5.pdf,2021-12-28 17:58:42,HTTP request sent; awaiting response... 404 Not Found
https://stagingsitedomain.com/documents/5/document5.pdf,2021-12-28 17:58:42,ERROR 404: Not Found.
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,https://stagingsitedomain.com/documents/5
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,HTTP request sent; awaiting response... 301 Moved Permanently
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,Location: /documents/5/ [following]
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,https://stagingsitedomain.com/documents/5/
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,HTTP request sent; awaiting response... 302 Found
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,Location: /en/documents/5/ [following]
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,https://stagingsitedomain.com/en/documents/5/
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,HTTP request sent; awaiting response... 404 Not Found
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:43,ERROR 404: Not Found.
https://stagingsitedomain.com/documents/9/document9.pdf,2021-12-28 17:58:43,https://stagingsitedomain.com/documents/9/document9.pdf
https://stagingsitedomain.com/documents/9/document9.pdf,2021-12-28 17:58:43,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/9/document9.pdf,2021-12-28 17:58:43,HTTP request sent; awaiting response... 404 Not Found
https://stagingsitedomain.com/documents/9/document9.pdf,2021-12-28 17:58:43,ERROR 404: Not Found.

In pseudo-logic terms, the key steps are:

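Remove commas from the initial output and replace them with ;
Identify the separate stdout/stderr block for the GET of each of the three original files, based on the double newlines
For each original file's stdout/stderr block, the first column is the originally requested URL, repeated on every row; this is also the text that follows the datetime on the first line of that block
The second column is the most recent datetime in the stdout/stderr, so a line that has no datetime inherits it from the line above
The third column is the remaining text of each line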

Columns 1 and 3 are essential; column 2 is nice to have.

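I've tried various configurations of sed with multiple levels, but doing multi-line substitution while also using groups in the replacement is proving really difficult. My most recent attempt, which only tries to parse the first line of a single file's stdout/stderr, is: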

$ cat wget.txt | sed -E ":a;N;$!ba;s/\n\n--[0-9-\s]{19}--  (https?:\/\/.*?)\n(.*?)\n\n/\n\n\1,\2\n\n/"
sed: -e expression #1, char 74: Invalid range end

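In the end I suspect sed may not be the best tool for this; maybe awk? Either way, this seems very hard given my current knowledge of these tools, so any help is much appreciated.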
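On the sed error itself: the "Invalid range end" message most likely comes from the bracket expression `[0-9-\s]`. In a POSIX bracket expression a hyphen that is neither first nor last is read as a range operator, and `\s` is not a whitespace class there. Two further points would probably bite even after that is fixed: `.*?` is not a lazy quantifier in POSIX ERE, and inside double quotes the shell can expand `$!` before sed ever sees it. The following is a minimal sketch of a working single-line match only, not a full solution; `header_url` is a hypothetical helper name.

```shell
# header_url (hypothetical name): read wget header lines on stdin and
# print only the URL that follows the "--YYYY-MM-DD HH:MM:SS--" prefix.
header_url() {
    # Single quotes keep the shell from touching the sed script.
    # The hyphen is placed last inside the brackets so it is literal,
    # and a literal space stands in for \s.
    sed -E 's/^--[0-9: -]{19}--  (https?:\/\/[^ ]+)$/\1/'
}

# Sample header line in the same shape as the log in the question.
line='--2021-12-28 17:58:42--  https://stagingsitedomain.com/documents/5'
printf '%s\n' "$line" | header_url
# prints https://stagingsitedomain.com/documents/5
```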

【Comments】:

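Why do you need CSV? Text is text.
Simply adding quotes around the records seems to be roughly what you need. Try awk RS='\n\n' '{ sub/--[0-9-: ]+-- /, ""); sub(/\n/, ""); print "\047" $0 "\047" }' wget.txt >wget.csv
Have you also considered curl?
@tripleee CSV is useful for analysis in Excel: for example, I can filter for URLs that resolve to 404 rather than 200, and count how many items resolve to each code. Thanks for the suggested awk command; unfortunately it does not work in GNU Awk 5.1.1, API: 3.1 (GNU MPFR 4.1.0, GNU MP 6.2.1).
@konsolebox I haven't, only because I'm using wget for the mirroring. Would it make this much easier?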
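The Excel-style filtering mentioned in these comments (which original URLs end in 404 rather than 200) can also be approximated on the command line. The sketch below assumes the three-column CSV layout from the question (original URL, datetime, message) and keeps the last HTTP status seen for each column-1 URL; `final_status_per_url` is a hypothetical helper name, not something from the thread.

```shell
# final_status_per_url (hypothetical name): CSV rows on stdin,
# one "original URL,last status code" row per URL on stdout.
final_status_per_url() {
    awk -F, '
    # Only the response lines carry a status code.
    $3 ~ /awaiting response\.\.\. [0-9][0-9][0-9]/ {
        code = $3
        sub(/.*awaiting response\.\.\. /, "", code)   # drop the text before the code
        code = substr(code, 1, 3)                     # keep the three digits
        if (!($1 in last)) order[++n] = $1            # remember first-seen order
        last[$1] = code                               # later responses overwrite earlier ones
    }
    END { for (i = 1; i <= n; i++) print order[i] "," last[order[i]] }
    '
}
```

With the CSV saved as wget.csv, `final_status_per_url < wget.csv` would list each original URL once with the status of its final response, so a redirect chain that ends in 404 shows up as 404.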

Tags: awk sed wget

【Solution 1】:

I would use GNU AWK as follows to retrieve the key columns. Let the content of file.txt be

--2021-12-28 17:58:42--  https://stagingsitedomain.com/documents/5/document5.pdf
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2021-12-28 17:58:42 ERROR 404: Not Found.

--2021-12-28 17:58:42--  https://stagingsitedomain.com/documents/5
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /documents/5/ [following]
--2021-12-28 17:58:42--  https://stagingsitedomain.com/documents/5/
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 302 Found
Location: /en/documents/5/ [following]
--2021-12-28 17:58:42--  https://stagingsitedomain.com/en/documents/5/
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2021-12-28 17:58:43 ERROR 404: Not Found.

--2021-12-28 17:58:43--  https://stagingsitedomain.com/documents/9/document9.pdf
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2021-12-28 17:58:43 ERROR 404: Not Found.

then

awk '/^--[0-9]{4}-[0-9]{2}-[0-9]{2}/{link=$0=$NF}$0{gsub(",",";");print link "," $0}' file.txt

outputs

https://stagingsitedomain.com/documents/5/document5.pdf,https://stagingsitedomain.com/documents/5/document5.pdf
https://stagingsitedomain.com/documents/5/document5.pdf,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5/document5.pdf,HTTP request sent; awaiting response... 404 Not Found
https://stagingsitedomain.com/documents/5/document5.pdf,2021-12-28 17:58:42 ERROR 404: Not Found.
https://stagingsitedomain.com/documents/5,https://stagingsitedomain.com/documents/5
https://stagingsitedomain.com/documents/5,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5,HTTP request sent; awaiting response... 301 Moved Permanently
https://stagingsitedomain.com/documents/5,Location: /documents/5/ [following]
https://stagingsitedomain.com/documents/5/,https://stagingsitedomain.com/documents/5/
https://stagingsitedomain.com/documents/5/,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5/,HTTP request sent; awaiting response... 302 Found
https://stagingsitedomain.com/documents/5/,Location: /en/documents/5/ [following]
https://stagingsitedomain.com/en/documents/5/,https://stagingsitedomain.com/en/documents/5/
https://stagingsitedomain.com/en/documents/5/,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/en/documents/5/,HTTP request sent; awaiting response... 404 Not Found
https://stagingsitedomain.com/en/documents/5/,2021-12-28 17:58:43 ERROR 404: Not Found.
https://stagingsitedomain.com/documents/9/document9.pdf,https://stagingsitedomain.com/documents/9/document9.pdf
https://stagingsitedomain.com/documents/9/document9.pdf,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/9/document9.pdf,HTTP request sent; awaiting response... 404 Not Found
https://stagingsitedomain.com/documents/9/document9.pdf,2021-12-28 17:58:43 ERROR 404: Not Found.

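Explanation: if a line starts with -- followed by a date, set the link variable to the last field (URLs cannot contain spaces, so splitting on whitespace should not cut one in the wrong place) and replace the whole line ($0) with that last field. For every line, replace each , with ; and then print link, a comma, and the line. Because link changes only on lines that start with --date, it always holds the URL from the most recent such line. I use $0 as the condition so that empty lines are skipped.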

(tested in gawk 4.2.1)

【Comments】:

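Is there a way to end up with https://stagingsitedomain.com/documents/5 in the first column of lines 5-17? This matters because that is the single URL from the V1 site, and the goal is to determine whether that particular URL resolves successfully on the V2 site. So I need to be able to tie https://stagingsitedomain.com/documents/5 on line 5 all the way through to the 404 error on lines 16 and 17.
@ChristopherBrooks If you need to remove the trailing / then add sub(/\/$/,"",link); before print. Similarly, you can remove /en/ if the following always holds: there is at most one /en/, it should always be removed, and it is always /en/ rather than e.g. /fr/ or any other language code.
That did not remove the trailing /; instead, all of these URLs now appear in column 1 of output lines 5-17: https://stagingsitedomain.com/documents/5 https://stagingsitedomain.com/documents/5/ https://stagingsitedomain.com/en/documents/5/. Column 1 of lines 5-17 should contain only https://stagingsitedomain.com/documents/5. The URL in column 1 should change only after a \n\n in the input, e.g. between lines 4 and 6 of the input.
This matters because the redirects I'm following do not obey conventions like "add a trailing /" or "add /en/"; they may redirect to entirely different URLs.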
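The requirement in the last two comments (column 1 changes only after a blank line, whatever the redirects look like) can be sketched in awk as follows. This is not the answer's one-liner but a variant of it, written under the assumption that wget separates top-level requests with an empty line, as in the sample log. It also adds the datetime column from the question. `wget_log_to_csv` is a hypothetical helper name, and the patterns avoid interval expressions so that awks other than gawk have a better chance of accepting them.

```shell
# wget_log_to_csv (hypothetical name): wget log on stdin, CSV on stdout.
wget_log_to_csv() {
    awk '
    # A blank line ends one top-level request; forget its URL.
    /^$/ { first = ""; next }

    # Commas would split a field, so swap them for semicolons.
    { gsub(/,/, ";") }

    # Header line: "--YYYY-MM-DD HH:MM:SS--  URL" (a request or a followed redirect).
    /^--[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9:]+--/ {
        stamp = substr($0, 3, 19)
        if (first == "") first = $NF   # only the first URL of a block fills column 1
        print first "," stamp "," $NF
        next
    }

    # Lines that start with their own timestamp, such as the ERROR line.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] / {
        stamp = substr($0, 1, 19)
        print first "," stamp "," substr($0, 21)
        next
    }

    # Every other line inherits the most recent timestamp.
    { print first "," stamp "," $0 }
    '
}
```

Run as `wget_log_to_csv < file.txt > wget.csv`. Because `first` is reset only on an empty line, the redirect headers for /documents/5/ and /en/documents/5/ land in column 3 while column 1 stays https://stagingsitedomain.com/documents/5 until the next blank line.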