How do I use the --accept-regex option for downloading a website with wget?

【Question title】: How do I use the --accept-regex option for downloading a website with wget?
【Posted】: 2017-10-27 23:04:05
【Question】:

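I'm trying to download an archive of my website, 3dsforums.com, using wget, but there are millions of pages I don't want to download. I'd like to tell wget to download only pages that match certain URL patterns, and I'm running into some roadblocks.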

As an example, this is a URL I would like to download:

http://3dsforums.com/forumdisplay.php?f=46

...so I tried using the --accept-regex option:

wget -mkEpnp --accept-regex "(forumdisplay\.php\?f=(\d+)$)" http://3dsforums.com

But it only downloads the home page of the website.

The only command that has even remotely worked so far is the following:

wget -mkEpnp --accept-regex "(\w+\.php$)" http://3dsforums.com

It produces the following output:

Downloaded 9 files, 215K in 0.1s (1.72 MB/s)
Converting links in 3dsforums.com/faq.php.html... 16-19
Converting links in 3dsforums.com/index.html... 8-88
Converting links in 3dsforums.com/sendmessage.php.html... 14-15
Converting links in 3dsforums.com/register.php.html... 13-14
Converting links in 3dsforums.com/showgroups.php.html... 14-29
Converting links in 3dsforums.com/index.php.html... 16-80
Converting links in 3dsforums.com/calendar.php.html... 17-145
Converting links in 3dsforums.com/memberlist.php.html... 14-99
Converting links in 3dsforums.com/search.php.html... 15-16
Converted links in 9 files in 0.009 seconds.

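Is something wrong with my regular expression? Or am I misunderstanding how the --accept-regex option is used? I've been trying variations all day, but I still don't understand what the actual problem is.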

【Comments】:

Tags: regex wget

【Solution 1】:

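By default wget uses POSIX regular expressions, which have no \d class: digits are written as [[:digit:]], and the portable equivalent of \w is [[:alnum:]_] (there is no standard [:word:] class). Also, why all the grouping? If your wget was compiled with PCRE support, you can make your life easier and do this: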

wget -mkEpnp --regex-type pcre --accept-regex "forumdisplay\.php\?f=\d+$" http://3dsforums.com

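But... that won't work, because your forum software creates automatic session IDs (s=<session_id>) and injects them into all links, so you need to account for those as well: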

wget -mkEpnp --regex-type pcre --accept-regex "forumdisplay\.php\?(s=.*)?f=\d+(&s=.*)?$" http://3dsforums.com

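The only problem is that your files will now be saved with the session ID in their names, so you'll need one more step after wget finishes: bulk-rename every file that has a session ID in its name. You can probably do that by piping wget to sed, but I'll leave that to you :)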
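One way the rename step could look is sketched below. It is only an illustration under a few assumptions: that session IDs are hexadecimal strings, that wget saved the files with Unix-style names containing `?` and `&`, and that the helper names `strip_session` and `rename_mirror` (both invented here) suit your setup.

```shell
#!/bin/sh
# Strip a session parameter (s=<hex>) from a file name.
# Assumption: session IDs consist only of hexadecimal characters.
strip_session() {
    printf '%s\n' "$1" | sed -E \
        -e 's/([?&])s=[0-9a-fA-F]+&/\1/' \
        -e 's/[?&]s=[0-9a-fA-F]+//'
}

# Rename every file under the given directory whose name carries a session ID.
rename_mirror() {
    find "$1" -type f -name '*s=*' | while IFS= read -r path; do
        dir=$(dirname "$path")
        old=$(basename "$path")
        new=$(strip_session "$old")
        # Skip unchanged names and never overwrite an existing file.
        if [ "$new" != "$old" ] && [ ! -e "$dir/$new" ]; then
            mv "$path" "$dir/$new"
        fi
    done
}

# Usage (assuming wget mirrored into ./3dsforums.com):
# rename_mirror 3dsforums.com
```

Keep in mind that -k rewrites links to point at the saved file names, so links inside the pages would still carry the session ID after renaming and would probably need a similar substitution.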

If your wget doesn't support PCRE, the pattern will end up quite a bit longer, but hopefully it does...
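For reference, here is a rough sketch of what a POSIX version of the pattern might look like. The sample URLs and session ID values are invented for illustration. The pattern is tried out with grep -E, which uses the same POSIX extended syntax, before it is handed to wget.

```shell
#!/bin/sh
# POSIX ERE version of the pattern: [[:digit:]] replaces \d, and an
# optional session ID is allowed before or after the f= parameter.
pattern='forumdisplay\.php\?(s=[^&]*&)?f=[[:digit:]]+(&s=[^&]*)?$'

# Hypothetical sample URLs (the session IDs are made up).
urls='http://3dsforums.com/forumdisplay.php?f=46
http://3dsforums.com/forumdisplay.php?s=0123abcd&f=46
http://3dsforums.com/forumdisplay.php?f=46&s=0123abcd
http://3dsforums.com/showthread.php?t=100
http://3dsforums.com/faq.php'

# Print only the URLs the pattern would accept.
matched=$(printf '%s\n' "$urls" | grep -E "$pattern")
printf '%s\n' "$matched"

# Once the output looks right, the same pattern can be passed to wget:
# wget -mkEpnp --regex-type posix --accept-regex "$pattern" http://3dsforums.com
```

To find out whether your build has PCRE at all, look at the feature list printed by `wget --version`; an entry such as `+pcre` suggests it is available, though the exact wording may vary between versions.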

【Comments】: