Scrape links from www and save as txt files (Bash or Python): answers

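【Question title】: Scrape links from www and save as txt files (Bash or Python)
【Posted】: 2014-02-16 19:57:10
【Question description】:

I have a small project at home where I need to scrape a website for links every once in a while and save the links in a txt file.

The script needs to run on my Synology NAS, so it has to be written in Bash or Python without any plugins or external libraries, because (as far as I know) I cannot install them on the NAS.

A link looks like this: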

<a href="http://www.example.com">Example text</a>

I want to save the following to my text file:

Example text - http://www.example.com

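I was thinking I could isolate the text with curl and some grep (or perhaps a regular expression). At first I looked into using Scrapy or BeautifulSoup, but I couldn't find a way to install either of them on the NAS.

Could any of you help me put a script together?

【Discussion】:

A typical web page may contain many "http..." strings that are NOT links, and I'm pretty sure you don't want to scrape those off the site. You probably want to find all the <a href> tags and take the links only from those elements. Can you provide the URL of the web page you want to scrape?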

Tags: python regex bash curl

【Solution 1】:

You can use urllib2, which ships with Python 2. With it you can easily fetch the HTML of any URL:

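import urllib2  # Python 2 standard library; Python 3 moved this to urllib.request

# Fetch the page and read the raw HTML into a string
response = urllib2.urlopen('http://python.org/')
html = response.read()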
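
For readers whose NAS ships Python 3 instead, the same fetch can be sketched with urllib.request, the standard-library module that replaced urllib2. The function name fetch_html, the 30-second timeout and the UTF-8 default are assumptions made for illustration; they are not part of the original answer.

```python
import urllib.request


def fetch_html(url, timeout=30, encoding="utf-8"):
    """Download url and return the response body decoded as text.

    fetch_html is a hypothetical helper name; timeout and encoding
    defaults are assumptions, adjust them for the site being scraped.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
    # Bytes that are not valid in the chosen encoding are replaced
    # instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace")
```

Calling fetch_html('http://python.org/') would return the page as a str, ready to hand to whichever parser is used afterwards.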

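Now, about parsing the HTML. You can still use BeautifulSoup without installing it. Their site says "You can also download the tarball and use BeautifulSoup.py in your project directly". So search the internet for that BeautifulSoup.py file. If you can't find it, download this one and save it to a local file inside your project. Then use it like this: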

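from BeautifulSoup import BeautifulSoup  # the BeautifulSoup.py file saved next to this script

soup = BeautifulSoup(html)
# Loop over every <a> tag and print its target URL, then its inner HTML
for link in soup("a"):
    print link["href"]
    print link.renderContents()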

【Discussion】:

【Solution 2】:

I recommend using Python's htmlparser library. It parses the page into an object hierarchy for you. You can then find the a href tags.

http://docs.python.org/2/library/htmlparser.html

There are plenty of examples of using this library to find links, so I won't list all the code, but here is a working example: Extract absolute links from a page using HTMLParser
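
As a rough illustration of the HTMLParser approach, here is a minimal sketch written for Python 3, where the module is named html.parser and is part of the standard library. The parser is event-driven: it calls handler methods as it meets tags and text rather than building a tree. The names LinkCollector and format_links are hypothetical, and the "text - url" output format is taken from the question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href="..."> element.

    LinkCollector is a hypothetical name used for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def format_links(html):
    """Return one 'text - url' line per link found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return ["%s - %s" % (text, href) for text, href in parser.links]
```

The returned lines could then be joined with newlines and written to a txt file. Because only text inside <a href> elements is collected, stray "http..." strings elsewhere on the page are ignored.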

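Edit:

As Oday pointed out, htmlparser is an external library and you may not be able to load it. In that case, here are two suggestions for built-in modules that can do what you need:

htmllib is included in Python 2.X.
xml is included in Python 2.X and 3.X.

There is also a good explanation elsewhere on this site of how to do the same thing with wget and grep: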
Spider a Website and Return URLs Only

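【Discussion】:

That's a good suggestion, but I believe the OP said he can't load external libraries or plugins.

【Solution 3】:

Based on your example, you need something like this: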

wget -q -O- https://dl.dropboxusercontent.com/s/wm6mt2ew0nnqdu6/links.html?dl=1 | sed -r 's#<a href="([^"]+)">([^<]+)</a>.*$#\2 - \1#' > links.txt

Output of cat links.txt:

1Visit W3Schools - http://www.w3schools.com/
2Visit W3Schools - http://www.w3schools.com/
3Visit W3Schools - http://www.w3schools.com/
4Visit W3Schools - http://www.w3schools.com/
5Visit W3Schools - http://www.w3schools.com/
6Visit W3Schools - http://www.w3schools.com/
7Visit W3Schools - http://www.w3schools.com/

【Discussion】:

Doesn't work. sed: illegal option -- r usage: sed script [-Ealn] [-i extension] [file ...] sed [-Ealn] [-i extension] [-e script] ... [-f script_file] ... [file ...]
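
The usage message quoted above lists -E, which is the extended-regex flag on BSD sed, so swapping -r for -E may be enough on that system, though this has not been tested here. As a way around differences between sed versions, the same pattern can be sketched in Python with the standard re module. LINK_RE and links_from_html are hypothetical names. Unlike the sed command, this sketch returns only the matching links and finds every match in the input, not just the first one on each line.

```python
import re

# Same pattern as the sed expression in Solution 3:
# group 1 is the URL, group 2 is the link text.
LINK_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')


def links_from_html(html):
    """Return 'text - url' lines for each simple <a href="...">text</a> match.

    links_from_html is a hypothetical helper. Like the sed version, it
    only matches anchors whose sole attribute is href and whose text
    contains no nested tags.
    """
    return ["%s - %s" % (text, url) for url, text in LINK_RE.findall(html)]
```

A regular expression like this is fragile against real-world HTML (extra attributes, single quotes, nested tags), so it mainly suits pages whose markup is known and simple, such as the example in the question.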