【Question Title】: Converting wget stdout and stderr to CSV for analysis
【Posted】: 2021-12-31 00:21:39
【Question】:

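I am using wget to sanity-check a set of websites that are undergoing a major upgrade, including database migrations and so on. The old and new versions are called V1 and V2 respectively. The sites are built on a heavily modified version of the Wagtail CMS.

One thing I need to do is confirm that old URLs from V1 are correctly redirected to new URLs in V2. My approach is to write a bash script that:

  1. Uses wget to completely mirror the user-facing V1 site
  2. Builds the set of V1 URLs from the resulting on-disk file structure using du, sed and grep
  3. Filters out URLs that do not need to be redirected in V2
  4. Uses wget to fetch those URLs in sequence from the V2 site, i.e. with the original sitedomain.com replaced by stagingsitedomain.com but the rest of the URL unchanged
  5. Converts the resulting stdout/stderr to CSV

With the CSV, I will analyse the resulting stdout/stderr to determine whether any URLs are not redirected correctly.

I am at the last step, step 5. Below is sample stdout/stderr from fetching three files: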

--2021-12-28 17:58:42--  https://stagingsitedomain.com/documents/5/document5.pdf
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2021-12-28 17:58:42 ERROR 404: Not Found.

--2021-12-28 17:58:42--  https://stagingsitedomain.com/documents/5
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /documents/5/ [following]
--2021-12-28 17:58:42--  https://stagingsitedomain.com/documents/5/
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 302 Found
Location: /en/documents/5/ [following]
--2021-12-28 17:58:42--  https://stagingsitedomain.com/en/documents/5/
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2021-12-28 17:58:43 ERROR 404: Not Found.

--2021-12-28 17:58:43--  https://stagingsitedomain.com/documents/9/document9.pdf
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2021-12-28 17:58:43 ERROR 404: Not Found.

I would like to convert this to CSV in the following format:

https://stagingsitedomain.com/documents/5/document5.pdf,2021-12-28 17:58:42,https://stagingsitedomain.com/documents/5/document5.pdf
https://stagingsitedomain.com/documents/5/document5.pdf,2021-12-28 17:58:42,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5/document5.pdf,2021-12-28 17:58:42,HTTP request sent; awaiting response... 404 Not Found
https://stagingsitedomain.com/documents/5/document5.pdf,2021-12-28 17:58:42,ERROR 404: Not Found.
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,https://stagingsitedomain.com/documents/5
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,HTTP request sent; awaiting response... 301 Moved Permanently
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,Location: /documents/5/ [following]
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,https://stagingsitedomain.com/documents/5/
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,HTTP request sent; awaiting response... 302 Found
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,Location: /en/documents/5/ [following]
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,https://stagingsitedomain.com/en/documents/5/
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:42,HTTP request sent; awaiting response... 404 Not Found
https://stagingsitedomain.com/documents/5,2021-12-28 17:58:43,ERROR 404: Not Found.
https://stagingsitedomain.com/documents/9/document9.pdf,2021-12-28 17:58:43,https://stagingsitedomain.com/documents/9/document9.pdf
https://stagingsitedomain.com/documents/9/document9.pdf,2021-12-28 17:58:43,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/9/document9.pdf,2021-12-28 17:58:43,HTTP request sent; awaiting response... 404 Not Found
https://stagingsitedomain.com/documents/9/document9.pdf,2021-12-28 17:58:43,ERROR 404: Not Found.

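Logically, the key steps are:

  1. Replace every comma in the original output with ;
  2. Identify the separate stdout/stderr block for the GET of each of the three original files, based on the double newline
  3. For each original file's stdout/stderr block, the first column is the originally requested URL, repeated on every row; this is also the text that follows the datetime on the first line of that block
  4. The second column is the most recent datetime from the stdout/stderr, so a line without a datetime inherits it from the line above
  5. The third column is the remaining text of each line

Columns 1 and 3 are essential; column 2 is nice to have.

I have tried various configurations of multi-level sed, but doing a multi-line substitution while also using groups in the replacement has proven really difficult. My most recent attempt, which only tries to parse the first line of a single file's stdout/stderr, is: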

$ cat wget.txt | sed -E ":a;N;$!ba;s/\n\n--[0-9-\s]{19}--  (https?:\/\/.*?)\n(.*?)\n\n/\n\n\1,\2\n\n/"
sed: -e expression #1, char 74: Invalid range end

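I suspect that in the end sed may not be the best tool; perhaps awk is? In any case, this seems very difficult given my current knowledge of these tools, so any help would be greatly appreciated.

【Comments】:

  • Why do you need CSV? Text is text.
  • Simply adding quotes around the records seems to give you roughly what you need. Try awk RS='\n\n' '{ sub(/--[0-9-: ]+-- /, ""); sub(/\n/, ""); print "\047" $0 "\047" }' wget.txt >wget.csv
  • Have you considered curl as well?
  • @tripleee CSV is useful for analysis in Excel; for example, I can filter for URLs that resolve to 404 rather than 200, and count how many items resolve to each code. Thanks for the suggested awk command; unfortunately it does not work in GNU Awk 5.1.1, API: 3.1 (GNU MPFR 4.1.0, GNU MP 6.2.1).
  • @konsolebox I haven't yet, simply because I am using wget for the mirroring. Would it make this much easier?

Tags: awk sed wget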


【Solution 1】:

I would use GNU AWK to retrieve the key columns as follows. Let the content of file.txt be

--2021-12-28 17:58:42--  https://stagingsitedomain.com/documents/5/document5.pdf
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2021-12-28 17:58:42 ERROR 404: Not Found.

--2021-12-28 17:58:42--  https://stagingsitedomain.com/documents/5
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /documents/5/ [following]
--2021-12-28 17:58:42--  https://stagingsitedomain.com/documents/5/
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 302 Found
Location: /en/documents/5/ [following]
--2021-12-28 17:58:42--  https://stagingsitedomain.com/en/documents/5/
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2021-12-28 17:58:43 ERROR 404: Not Found.

--2021-12-28 17:58:43--  https://stagingsitedomain.com/documents/9/document9.pdf
Reusing existing connection to stagingsitedomain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2021-12-28 17:58:43 ERROR 404: Not Found.

then

awk '/^--[0-9]{4}-[0-9]{2}-[0-9]{2}/{link=$0=$NF}$0{gsub(",",";");print link "," $0}' file.txt

outputs

https://stagingsitedomain.com/documents/5/document5.pdf,https://stagingsitedomain.com/documents/5/document5.pdf
https://stagingsitedomain.com/documents/5/document5.pdf,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5/document5.pdf,HTTP request sent; awaiting response... 404 Not Found
https://stagingsitedomain.com/documents/5/document5.pdf,2021-12-28 17:58:42 ERROR 404: Not Found.
https://stagingsitedomain.com/documents/5,https://stagingsitedomain.com/documents/5
https://stagingsitedomain.com/documents/5,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5,HTTP request sent; awaiting response... 301 Moved Permanently
https://stagingsitedomain.com/documents/5,Location: /documents/5/ [following]
https://stagingsitedomain.com/documents/5/,https://stagingsitedomain.com/documents/5/
https://stagingsitedomain.com/documents/5/,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/5/,HTTP request sent; awaiting response... 302 Found
https://stagingsitedomain.com/documents/5/,Location: /en/documents/5/ [following]
https://stagingsitedomain.com/en/documents/5/,https://stagingsitedomain.com/en/documents/5/
https://stagingsitedomain.com/en/documents/5/,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/en/documents/5/,HTTP request sent; awaiting response... 404 Not Found
https://stagingsitedomain.com/en/documents/5/,2021-12-28 17:58:43 ERROR 404: Not Found.
https://stagingsitedomain.com/documents/9/document9.pdf,https://stagingsitedomain.com/documents/9/document9.pdf
https://stagingsitedomain.com/documents/9/document9.pdf,Reusing existing connection to stagingsitedomain.com:443.
https://stagingsitedomain.com/documents/9/document9.pdf,HTTP request sent; awaiting response... 404 Not Found
https://stagingsitedomain.com/documents/9/document9.pdf,2021-12-28 17:58:43 ERROR 404: Not Found.

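Explanation: if a line starts with -- followed by a date, set the link variable to the last field (URLs cannot contain spaces, so splitting in the wrong place should not be a problem) and set the whole line ($0) to that last field. For every line, replace each , with ; and print link and the line joined by ,. Since link only changes on lines that start with --date, it is always the link from the most recent such line. I use $0 as the condition to skip empty lines.

(tested in gawk 4.2.1)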
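
For readers who find the one-liner dense, the same logic can be spelled out as a commented function. This is only a sketch: the name wget_to_csv is hypothetical, the date pattern is written with repeated [0-9] so it does not depend on interval-expression support, and length($0) > 0 stands in for the bare $0 condition.

```shell
# Hypothetical helper: the one-liner above, spelled out with comments.
wget_to_csv() {
  awk '
    # Request header, e.g. "--2021-12-28 17:58:42--  https://host/path":
    # remember the URL (last field) and reduce the line to that URL.
    /^--[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ { link = $NF; $0 = $NF }

    # Skip blank lines; every other line becomes one CSV row.
    length($0) > 0 {
      gsub(/,/, ";")        # commas in the log text would create extra columns
      print link "," $0
    }
  ' "$@"
}
```

It reads the named files, or standard input when none are given, for example wget_to_csv file.txt.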

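【Comments】:

  • Is there a way to end up with https://stagingsitedomain.com/documents/5 in the first column of rows 5-17? This matters because it is a single URL from the V1 site, and the goal is to determine whether that particular URL resolves successfully on the V2 site. So I need to be able to tie https://stagingsitedomain.com/documents/5 on row 5 all the way through to the 404 error on rows 16 and 17.
  • @ChristopherBrooks If you need to remove the trailing / then add sub(/\/$/,"",link); before print. Similarly, you can remove /en/ if the following always holds: there is at most one /en/ AND it should always be removed AND it is always /en/ rather than, e.g., /fr/ or any other language code
  • That did not remove the trailing /; instead, all of these URLs now appear in column 1, rows 5-17 of the output: https://stagingsitedomain.com/documents/5 https://stagingsitedomain.com/documents/5/ https://stagingsitedomain.com/en/documents/5/. In column 1, rows 5-17, there should only be https://stagingsitedomain.com/documents/5. The URL in column 1 should only change after a \n\n in the input, e.g. between lines 4 and 6 of the input.
  • This matters because the redirects I am following do not obey conventions such as "add a trailing /" or "add /en/"; they may redirect to a completely different URL.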
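
One possible way to meet the requirement raised in these comments is to let column 1 change only after a blank line, and to carry the most recent timestamp along as column 2. The following is a minimal sketch, not something posted in the thread: the function name wget_blocks_to_csv is hypothetical, and it assumes the log looks like the sample in the question, i.e. blank lines occur only between requests and timestamps have the form YYYY-MM-DD HH:MM:SS.

```shell
# Hypothetical helper: one CSV row per log line, as original_url,timestamp,text.
wget_blocks_to_csv() {
  awk '
    # A blank line ends the current block; the next header starts a new one.
    /^[ \t]*$/ { url = ""; next }

    # Request header: "--YYYY-MM-DD HH:MM:SS--  URL"
    /^--[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
      ts = substr($0, 3, 19)        # the 19 characters after the leading "--"
      if (url == "") url = $NF      # only the first header of a block sets column 1
      $0 = $NF                      # column 3 is just the URL being fetched
    }

    # Lines such as "2021-12-28 17:58:43 ERROR 404: Not Found." carry their own timestamp.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] / {
      ts = substr($0, 1, 19)
      $0 = substr($0, 21)           # drop the timestamp and the space after it
    }

    {
      gsub(/,/, ";")                # protect the CSV from commas in the log text
      print url "," ts "," $0
    }
  ' "$@"
}
```

For example, wget_blocks_to_csv wget.txt > wget.csv. Redirect hops inside a block keep the originally requested URL in column 1, so the final status can be tied back to the V1 URL regardless of where the redirect leads. If wget also writes blank lines within a single request (as it can when it saves a file), the block detection would need a different rule.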