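How to scrape entire body of text from a URL? Answers

【Question title】: How to scrape entire body of text from a URL?
【Posted】: 2019-07-05 16:52:02
【Question description】:

I have a list of URLs collected from this page. They are basically just quotes from people, and I want to save the quotes from each URL in a separate file.

To get the list of URLs, I used: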

import bs4
from urllib.request import Request,urlopen as uReq
from bs4 import BeautifulSoup as soup
import re
#define url of interest
my_url = 'http://archive.ontheissues.org/Free_Trade.htm'

# set up a known browser user agent so the request is not rejected with an HTTPError
req=Request(my_url,headers={'User-Agent': 'Mozilla/5.0'})

#opening up connection, grabbing the page
uClient = uReq(req)
page_html = uClient.read()
uClient.close()

#html is jumbled at the moment, so call html using soup function
soup = soup(page_html, "html.parser")

# Test: print title of page
soup.title


tags = soup.findAll("a" , href=re.compile("javascript:pop"))
print(tags)

# get list of all URLS
for links in tags:
    link = links.get('href')
    if "java" in link: 
        print("http://archive.ontheissues.org" + link[18:len(link)-3])

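How do I extract the content from each link, including text, bullet points, and paragraphs, and then save it to separate files? Also, I don't want anything that isn't a quote, such as the other URLs within these pages.

【Comments】:

Tags: html python-3.x web-scraping beautifulsoup

【Solution 1】:

The "quote" pages you wish to scrape have some incomplete/dangling HTML tags. Parsing these can be a pain if you don't understand the parser you are using. For a hint about how the parsers differ, see this page.

Now, back to the code. For convenience, I used the lxml parser. If you look at the page source of any of these "quote" pages, you will see that most of the text you wish to scrape sits in one of the following tags: {h3, p, ul, ol}. Also note that every h3 tag is immediately followed by a string, which can be captured with .next_sibling. With those conditions established, let's move on to the code.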

from urllib.request import Request, urlopen as uReq
# HTTPError is imported so that links with no content/resource of interest can be skipped
from urllib.error import HTTPError
from bs4 import BeautifulSoup as soup_
import re
#define url of interest
my_url = 'http://archive.ontheissues.org/Free_Trade.htm'

#Creating a function to harness the power of scraping frequently
def make_soup(url):
    # set up a known browser user agent so the request is not rejected with an HTTPError
    req=Request(url,headers={'User-Agent': 'Mozilla/5.0'})

    #opening up connection, grabbing the page
    uClient = uReq(req)
    page_html = uClient.read()
    uClient.close()

    #html is jumbled at the moment, so call html using soup function
    soup = soup_(page_html, "lxml") 
    return soup

# Test: print title of page
#soup.title

soup = make_soup(my_url)
tags = soup.findAll("a" , href=re.compile(r"javascript:pop\("))
#print(tags)

# get list of all URLS
for links in tags:
    link = links.get('href')
    if "java" in link:
        main_url = "http://archive.ontheissues.org" + link[18:len(link)-3]
        print(main_url)
        try:
            sub_soup = make_soup(main_url)
            content_collexn = sub_soup.body.contents  # top-level children of <body>, for iterative access
            #text_data = [] #This list can be used to store data related to every person
            for item in content_collexn:
                # Accept an item only if it is one of the tags that hold quote text.
                # Bare strings in body.contents are bs4 NavigableString objects whose
                # .name is None, so they match neither branch and are skipped here.
                if item.name == "h3":
                    # Every h3 title has a string following it
                    print(item.get_text())
                    # Hence, grab that string too
                    print(item.next_sibling)
                elif item.name in ["p", "ul", "ol"]:
                    print(item.get_text())
        except HTTPError:  # Takes care of missing pages and related HTTP exceptions
            print("[INFO] Resource not found. Skipping to next link.")

        #print(text_data)
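
The slice link[18:len(link)-3] used above only works while every href has exactly the same wrapper around the path. As a hedged alternative, the sketch below pulls the first quoted argument out of the javascript:pop(...) call with a regular expression and resolves it against the page URL with urljoin. The helper name popup_href_to_url and the sample href are assumptions made for illustration; the exact href format on the live page may differ.

```python
import re
from urllib.parse import urljoin

# Page URL taken from the question; relative paths are resolved against it.
BASE_URL = "http://archive.ontheissues.org/Free_Trade.htm"

# Matches javascript:pop('<path>' ...) and captures <path>.
# Assumption: the first argument of pop() is a quoted path.
POP_HREF = re.compile(r"""javascript:pop\(\s*['"]([^'"]+)['"]""")


def popup_href_to_url(href, base_url=BASE_URL):
    """Return the absolute URL hidden in a javascript:pop(...) href, or None."""
    match = POP_HREF.search(href)
    if match is None:
        return None
    return urljoin(base_url, match.group(1))


# Hypothetical href, shaped like the ones the answer slices by position.
sample = "javascript:pop('/2020/Justin_Amash_Free_Trade.htm')"
print(popup_href_to_url(sample))
```

Returning None for hrefs that do not match lets the calling loop skip them instead of building a malformed URL.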

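【Comments】:

Hi argon, this code does a great job of getting the entire content of the page. I just have two questions. First, I'm terrible at reading HTML, so I was wondering whether there is some way to add an exclusion so that the script doesn't pick up "Click here for definitions and background information on Free Trade." at the very bottom of the page, and everything after it? Second, is it possible to use some kind of loop to save each scraped page in a txt file, with each candidate's name as the file name?
For the first question, you can use a regular expression to check whether the text of each content item is "Click here for definitions..." type text. If it appears in a content item, you can either skip that item or replace that particular text with an empty string. For the second question: yes, you can definitely do that. Use the text_data list that is commented out above to record the page's data, then write it to a file. To learn more about writing a list to a file, see this.
Also, since you build each person's page link with link[18:len(link)-3], you can extract the name from it. Simply store that part in a variable, say sub_link. You then have to extract the person's name from this string. Consider one possible value of sub_link: 2020/Justin_Amash_Free_Trade.htm. Here, replace the 2020/ and _Free_Trade.htm parts with an empty string. The string that remains is essentially the person's name. Use it in the file name argument of open() when writing the page data to a file.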
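
The suggestions in these comments can be sketched roughly as follows. This is a minimal sketch, not the answerer's code: the helper names strip_footer, person_name and save_quotes are hypothetical, the footer wording in FOOTER_PATTERN is taken from the comment and may not match the page exactly, and the default output directory quotes is an assumption. Note that strip_footer stops at the first footer-like item rather than merely skipping it, so that everything after the footer is dropped as well.

```python
import re
from pathlib import Path

# Assumed footer wording, taken from the comment; adjust it to the real page text.
FOOTER_PATTERN = re.compile(r"click here for definitions", re.IGNORECASE)


def strip_footer(items):
    """Keep text items up to (not including) the first footer-like item."""
    kept = []
    for text in items:
        if FOOTER_PATTERN.search(text):
            break
        kept.append(text)
    return kept


def person_name(sub_link, topic="Free_Trade"):
    """Turn e.g. '2020/Justin_Amash_Free_Trade.htm' into 'Justin_Amash'."""
    file_name = sub_link.rsplit("/", 1)[-1]        # drop the '2020/' part
    suffix = "_" + topic + ".htm"
    if file_name.endswith(suffix):
        file_name = file_name[: -len(suffix)]      # drop '_Free_Trade.htm'
    return file_name


def save_quotes(text_data, sub_link, out_dir="quotes"):
    """Write one page's text items to <out_dir>/<person>.txt and return the path."""
    out_path = Path(out_dir) / (person_name(sub_link) + ".txt")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("\n".join(strip_footer(text_data)), encoding="utf-8")
    return out_path
```

Inside the answer's loop, each string that is currently printed would be appended to text_data instead, and save_quotes(text_data, link[18:len(link)-3]) would be called once per page.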

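【Solution 2】:

Here are a couple of things that can help.

You can use a Session object for the efficiency of re-using connections.

With bs4 4.7.1 you can condense your opening code to get the correct URLs, as shown below, where I use an attribute = value CSS selector to restrict matches to hrefs containing javascript:pop. The * in *= is the contains operator.

[href*="javascript:pop"]

Then add the :contains pseudo-selector to further restrict matches to URLs that have the word quote in their innerText. This refines the list of matched elements to just the ones required. (Soup Sieve 2.1 and later spell this pseudo-selector :-soup-contains and deprecate the bare :contains form.)

:contains(quote)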

import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    r = s.get('http://archive.ontheissues.org/Free_Trade.htm')
    soup = bs(r.content, 'lxml')
    links = [item['href'] for item in soup.select('[href*="javascript:pop"]:contains(quote)')]
    for link in links:
        pass  # rest of code working with Session

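【Comments】:

Thanks a lot. I must admit I only understood about 20% of the words in your post, since I'm fairly new to Python, but I tried this code and got all the links I wanted, much more concisely. "You can use a Session object for the efficiency of re-using connections." What does this mean?
Hi, Session objects are explained here: 2.python-requests.org/en/master/user/advanced. A session is a way of re-using an existing connection rather than creating a new one each time. You create it as s with requests.Session() as s: and re-use it inside the with statement, so in the loop over links you can call s.get(link).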
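
To make the last comment concrete, here is a minimal sketch of re-using one session for every request in the loop. The function names fetch_pages and main are hypothetical, and the URL in main is an assumption built from the example path mentioned in the Solution 1 comments. requests is imported inside main so that fetch_pages can also be tried with any stand-in object that offers a get method.

```python
def fetch_pages(session, urls):
    """Fetch every URL through one shared session and return {url: body bytes}.

    `session` only needs a .get(url) method whose result has .status_code
    and .content, which requests.Session provides.
    """
    pages = {}
    for url in urls:
        response = session.get(url)       # goes through the session's connection pool
        if response.status_code != 200:   # skip missing pages instead of failing
            print("[INFO] Skipping", url, "status", response.status_code)
            continue
        pages[url] = response.content
    return pages


def main():
    # Imported here so fetch_pages can be exercised without requests installed.
    import requests

    # Assumed example URL; in practice this list comes from the scraped hrefs.
    urls = ["http://archive.ontheissues.org/2020/Justin_Amash_Free_Trade.htm"]
    with requests.Session() as s:
        pages = fetch_pages(s, urls)
    for url, body in pages.items():
        print(url, len(body), "bytes")
```

Calling main() runs the sketch against the live site. Because all requests go to the same host, the session can keep the underlying connection open between calls instead of reconnecting for each link.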