Retrieving anchor tags while creating a web crawler in Python: answers

【Question title】: Retrieving anchor tags while creating a web crawler in Python
【Posted】: 2018-03-12 14:31:21
【Question】:

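I am building a web crawler and running the program in PyCharm to retrieve the anchor tags of a URL. However, the output I get is exactly the URL I typed into the code. The code is as follows: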

    import urllib.request,urllib.parse,urllib.error
    from bs4 import BeautifulSoup
    import ssl
    ctx=ssl.create_default_context()
    ctx.check_hostname=False
    ctx.verify_mode=ssl.CERT_NONE

    url=input("https://en.wikipedia.org/wiki/Apple_Inc.")
    html=urllib.request.urlopen(url, context=ctx).read()
    soup=BeautifulSoup(html, 'html.parser')

    tags=soup("a")
    for tag in tags:
        print(tag.get("href",None))

One thing to note: of the urllib imports, PyCharm marks only urllib.error as a used statement, while urllib.request and urllib.parse are both marked as unused, and I don't understand why.

The output of this program is: https://en.wikipedia.org/wiki/Apple_Inc.

I am using Python 3.5.1 and PyCharm Community Edition.

【Comments】:

Tags: python-3.x beautifulsoup web-crawler pycharm urllib

【Solution 1】:

You should really use the requests package. It is very useful for crawling. See this user response about requests.

Here is your code, converted:

    import requests
    from bs4 import BeautifulSoup

    # Download the page and keep its HTML as text.
    html = requests.get("https://en.wikipedia.org/wiki/Apple_Inc.").text
    soup = BeautifulSoup(html, "html.parser")

    # Select only the <a> tags that actually carry an href attribute.
    anchors = soup.find_all("a", href=True)
    for a in anchors:
        print(a["href"])
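As for why the original program only printed the URL: the string passed to input() is the prompt shown to the user, not a default value. The script therefore printed the Wikipedia address as its prompt and then waited for something to be typed. If you prefer to stay with urllib, a minimal sketch of the fix could look like the following. Note the assumptions: the helper names AnchorCollector, extract_hrefs, fetch and main are made up for illustration, and the standard-library html.parser.HTMLParser is swapped in for BeautifulSoup so the sketch runs without third-party packages. The SSL settings are copied from the question and disable certificate verification.

```python
import ssl
import urllib.request
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return the href values of all anchor tags in an HTML string."""
    collector = AnchorCollector()
    collector.feed(html_text)
    return collector.hrefs


def fetch(url):
    """Download a page with urllib, using the SSL context from the question."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # as in the question: no certificate checks
    with urllib.request.urlopen(url, context=ctx) as response:
        return response.read().decode("utf-8", errors="replace")


def main():
    # The argument to input() is only a prompt; the URL must be typed in.
    url = input("Enter a URL: ").strip()
    for href in extract_hrefs(fetch(url)):
        print(href)
```

Calling main() and typing the URL at the prompt should then list the links. Alternatively, skip input() entirely and assign the address directly, for example url = "https://en.wikipedia.org/wiki/Apple_Inc.", which is what the requests version above does.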

【Comments】: