Web Scraping with Python by Ryan Mitchell, Chapter 3: Answers

[Question title]: Web Scraping with Python by Ryan Mitchell Chapter 3
[Posted]: 2020-12-29 16:54:21
[Question description]:

I am trying to teach myself web scraping with Python. I came across a line of code that I cannot fully understand. The line is:

for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):

The larger code snippet is here.

from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
# Seed with a float: Python 3.11+ rejects a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())

#Retrieves a list of all Internal links found on a page
def getInternalLinks(bs, includeUrl):
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme, urlparse(includeUrl).netloc)
    internalLinks = []
    #Finds all links that begin with a "/"
    for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if(link.attrs['href'].startswith('/')):
                    internalLinks.append(includeUrl+link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks

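I understand that the includeUrl function extracts the scheme and netloc to build a complete link. For example, if we use the following URL, we get this result.

'https://stackoverflow.com/questions/ask' is the URL,

'https' is the scheme

stackoverflow.com is the netloc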
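
The breakdown above can be checked directly with `urlparse`. This is a minimal sketch using only the standard library; the variable names `parts` and `include_url` are chosen here for illustration and do not come from the book.

```python
from urllib.parse import urlparse

url = 'https://stackoverflow.com/questions/ask'
parts = urlparse(url)

print(parts.scheme)  # https
print(parts.netloc)  # stackoverflow.com
print(parts.path)    # /questions/ask

# Same expression as in getInternalLinks: keep only scheme and netloc,
# dropping the path, so the result is the site root.
include_url = '{}://{}'.format(parts.scheme, parts.netloc)
print(include_url)   # https://stackoverflow.com
```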

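If you already have a well-formed link, such as https://www.facebook.com, what exactly does this function do? Does it only apply to incomplete links? Could someone give me an example of how to interpret this function correctly?

Thanks.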

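[Comments]:

includeUrl is not a function; it is a variable holding the URL constructed three lines above.
The line you don't understand contains only two function calls: find_all() and compile(). In the rest of the question you talk about the variable (not function) includeUrl. It is confusing and unclear which function you don't understand and what exactly about it is unclear.

Tags: python html web web-scraping

[Solution 1]: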

BeautifulSoup is an HTML parser. The bs object is the parsed DOM of the web page you are scraping.

for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):

Here, find_all gets you all the <a> ... </a> tags, and of those tags you keep only the ones whose href matches the regular expression '^(/|.*'+includeUrl+')', compiled with re.compile().
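
To see which href values that pattern keeps, here is a minimal sketch that applies the same regular expression to a few strings without BeautifulSoup. The value of `include_url` and the sample hrefs are assumptions made up for illustration; they are not taken from the book.

```python
import re

# Assumed value: what getInternalLinks would compute for a page on stackoverflow.com
include_url = 'https://stackoverflow.com'

# Same pattern as in the question:
# the href starts with "/" OR contains include_url somewhere
pattern = re.compile('^(/|.*' + include_url + ')')

# Hypothetical href values, for illustration only
hrefs = [
    '/questions/ask',                  # relative link: matches the "/" branch
    'https://stackoverflow.com/tags',  # absolute link on the same site: matches
    'https://www.facebook.com',        # link to another site: no match
    '#top',                            # fragment only: no match
]

internal = [h for h in hrefs if pattern.search(h)]
print(internal)  # ['/questions/ask', 'https://stackoverflow.com/tags']
```

So a well-formed link to another site is simply filtered out, while a well-formed link to the same site is kept as-is and a relative link gets includeUrl prepended by the function. Note that the dots in the URL are not escaped in the pattern, so they match any character; wrapping the URL in re.escape() would be stricter.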

[Comments]: