wget a site from a specific URL and its subdirectories

【Question title】: wget site from specific URL and its subdirectories
【Posted】: 2017-06-17 22:43:06
【Question】:

I am trying to download all HTML files under https://www.workandincome.govt.nz/map/ to disk. That is, I need index.html and every other HTML file whose URL starts with https://www.workandincome.govt.nz/map/. For example, I need to download:

https://www.workandincome.govt.nz/map/income-support/extra-help/disability-allowance/medical-fees-01.html
https://www.workandincome.govt.nz/map/income-support/extra-help/community-costs/index.html

and so on. I do not need any other HTML pages from the same site whose URL does not contain map. I tried the following wget command:

wget --limit-rate=200k --recursive --html-extension --convert-links   --random-wait --follow-tags=a -U "Mozilla/5.0 (X11; Linux x86_64)" https://www.workandincome.govt.nz/map/index.html

With this command I get https://www.workandincome.govt.nz/map/index.html, then http://www.workandincome.govt.nz/robots.txt, and then HTML files I do not need, such as:

www.workandincome.govt.nz/online-services/index.html, www.workandincome.govt.nz/eligibility/index.html

Can someone review the wget command I am using and suggest a fix? Thanks.

【Comments】:

gnu.org/software/wget/manual/wget.html#Types-of-Files

Tags: html wget

【Answer 1】:

You need to use the -A option:

wget -A "*map*" --limit-rate=200k --recursive --html-extension --convert-links --random-wait --follow-tags=a -U "Mozilla/5.0 (X11; Linux x86_64)" https://www.workandincome.govt.nz/map/index.html
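
As far as I can tell from the wget manual, -A patterns are matched against the file name rather than the directory part of the URL, so "*map*" may not limit the crawl to /map/ by itself. A possible alternative, offered as a sketch and not tested against this site, is to restrict the recursion by directory with --no-parent and --include-directories. The function name mirror_map and the WGET override are made up for illustration; the wget options themselves are standard GNU wget flags (--adjust-extension is the current name for --html-extension).

```shell
#!/bin/sh
# Hypothetical helper: mirror only the pages under /map/.
# $1 is the start URL. WGET can be overridden (for example with "echo")
# to inspect the command line without touching the network.
mirror_map() {
  # --no-parent                 never ascend above the start directory
  # --include-directories=/map  only follow links whose path is under /map
  # --adjust-extension          current name for --html-extension
  ${WGET:-wget} \
    --recursive \
    --no-parent \
    --include-directories=/map \
    --adjust-extension \
    --convert-links \
    --limit-rate=200k \
    --random-wait \
    --follow-tags=a \
    --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
    "$1"
}

# Dry run in a subshell: print the command that would be executed.
(WGET=echo; mirror_map "https://www.workandincome.govt.nz/map/index.html")
```

To actually download, call mirror_map with the same URL outside the dry-run subshell. wget will still fetch robots.txt once, which is expected for recursive retrieval.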

【Comments】: