Trouble extracting list of quotes from webpage using bs4 and Python: answers

【Question title】: Trouble extracting list of quotes from webpage using bs4 and Python
【Posted】: 2021-07-28 20:29:24
【Question description】:

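I want to use bs4 to navigate to a webpage and extract all of the quotes on the page into a list.

I would also like to extract the total number of pages for that particular person (an element at the bottom of the page).

The code I am currently using is this: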

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
s = soup(webpage,"html.parser")
div_container = s.findAll("div", {"id":"quotesList"})

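I'm having trouble searching the div_container object for the quotes.

【Question discussion】:

Tags: python web-scraping beautifulsoup urllib

【Solution 1】:

This is my first time helping out, so I apologise if this isn't the best answer. I'm a Python novice, so I find it helpful to print things and save them to a file to see what the program is actually seeing. I do that with the following code: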

# This opens a file in "w" (write) mode. If 'export.txt' doesn't exist, Python creates it.
file1 = open('export.txt', 'w')
# This writes whatever I want to the file.
file1.write("This is what I want in the file")
# This safely closes the file.
file1.close()
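
The same write-then-inspect step can also be done with a `with` block, which closes the file even if the write raises. This is a minimal sketch rather than part of the original answer: the helper name `dump_text` and the temporary directory are assumptions made so the example cleans up after itself.

```python
import os
import tempfile


def dump_text(path, text):
    """Write text to path; the with block closes the file automatically."""
    with open(path, "w", encoding="utf-8") as export_file:
        export_file.write(text)


# A temporary directory is used here only so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    export_path = os.path.join(tmp_dir, "export.txt")
    dump_text(export_path, "This is what I want in the file")
    # Read the file back to confirm what was written.
    with open(export_path, encoding="utf-8") as export_file:
        contents = export_file.read()

print(contents)
```

Passing an explicit `encoding` is optional, but it avoids platform-dependent defaults when the scraped text contains non-ASCII characters.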

Applying this to your code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
s = soup(webpage,"html.parser")
div_container = s.findAll("div", {"id":"quotesList"})

# Because of findAll, I could not do file1.write(div_container) and instead had to iterate through each item in the list.
# findAll returns a "bs4.element.ResultSet", which has no .text attribute. However, by accessing each item in the ResultSet by index, you can then apply .text to it.
# In this case there is only one element. That is to say, div_container[1] doesn't exist.
# The file is opened once, before the loop, so that later items would not overwrite earlier ones.
file1 = open('export.txt', 'w')
for i in range(len(div_container)):
    # .text returns just the text inside the tag, with none of the HTML markup.
    file1.write(div_container[i].text)
file1.close()

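This gives us the following:

I'm in favour of religion as a tamer of arrogance. For a Greek Orthodox, the idea of God as creator outside the human is not God in God's terms. My God isn't the God of George Bush. Nassim Nicholas Taleb

God Religion Arrogance etc.

So what is happening here?

If we run the code again, but instead of looking at the div we use BS4's prettify method to look at the actual HTML, like this: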
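
The behaviour behind that output can be reproduced offline. The sketch below assumes a trimmed-down, hypothetical HTML string modelled on the page (`SAMPLE_HTML` is not the real markup). It shows two things: the `ResultSet` itself has no `.text`, and reading the text of the container element pulls in every descendant, so the quote, the author, and the keywords all come out together.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page has many more attributes.
SAMPLE_HTML = """
<div id="quotesList">
  <div id="qpos_1_1">
    <a title="view quote">My God isn't the God of George Bush.</a>
    <a title="view author">Nassim Nicholas Taleb</a>
    <div class="kw-box"><a>God</a><a>Religion</a><a>Arrogance</a></div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
containers = soup.find_all("div", {"id": "quotesList"})

# A ResultSet behaves like a list, so .text must be read from one of its items.
try:
    containers.text
    result_set_has_text = True
except AttributeError:
    result_set_has_text = False

# get_text() walks every descendant of the container, not just the quote link.
container_text = containers[0].get_text(" ", strip=True)

print(result_set_has_text)
print(container_text)
```

Here `get_text(" ", strip=True)` is used instead of `.text` only to make the whitespace predictable; both return the text of all nested tags.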

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
#.prettify() is here
s = soup(webpage,"html.parser").prettify()
file1 = open('export2.txt', 'w')
file1.write(s)
file1.close()

We can look at the text file to see what Python sees. One snippet of it is:

   <div class="bq_center ql_page">
    <div class="reflow_body bq_center">
     <div class="new-msnry-grid bqcpx grid-layout-hide" id="quotesList">
      <div class="m-brick grid-item boxy clearfix bqQt r-width" id="qpos_1_1">
       <a class="b-qt qt_530963 oncl_q" href="/quotes/nassim_nicholas_taleb_530963" title="view quote">
        I'm in favour of religion as a tamer of arrogance. For a Greek Orthodox, the idea of God as creator outside the human is not God in God's terms. My God isn't the God of George Bush.
       </a>
       <a class="bq-aut qa_530963 oncl_a" href="/quotes/nassim_nicholas_taleb_530963" title="view author">
        Nassim Nicholas Taleb
       </a>
       <div class="qbn-box">
        <div class="sh-cont">
         <a aria-label="Share this quote on Facebook" class="sh-fb sh-grey" href="/share/fb/530963" rel="nofollow" target="_blank">
          <img alt="Share on Facebook" class="bq-fa" src="/st/img/4341377/fa/facebook-f.svg"/>
         </a>
         <a aria-label="Share this quote on Twitter" class="sh-tw sh-grey" href="/share/tw/530963?ti=Nassim+Nicholas+Taleb+Quotes" rel="nofollow" target="_blank">
          <img alt="Share on Twitter" class="bq-fa" src="/st/img/4341377/fa/twitter.svg"/>
         </a>
         <a aria-label="Share this quote on LinkedIn" class="sh-tw sh-grey" href="/share/li/530963?ti=Nassim+Nicholas+Taleb+Quotes+-+BrainyQuote" rel="nofollow" target="_blank">
          <img alt="Share on LinkedIn" class="bq-fa" src="/st/img/4341377/fa/linkedin-in.svg"/>
         </a>
        </div>
       </div>
       <div class="kw-box">
        <a class="qkw-btn btn btn-xs oncl_klc" data-idx="0" href="/topics/god-quotes">
         God
        </a>
        <a class="qkw-btn btn btn-xs oncl_klc" data-idx="1" href="/topics/religion-quotes">
         Religion
        </a>
        <a class="qkw-btn btn btn-xs oncl_klc" data-idx="2" href="/topics/arrogance-quotes">
         Arrogance
        </a>
       </div>
      </div>

Why does this matter to you? Because you are pulling out the whole div, and stripped down to just the parts that contain text, that div is:

  <div class="m-brick grid-item boxy clearfix bqQt r-width" id="qpos_1_1">
   <a class="b-qt qt_530963 oncl_q" href="/quotes/nassim_nicholas_taleb_530963" title="view quote">
    I'm in favour of religion as a tamer of arrogance. For a Greek Orthodox, the idea of God as creator outside the human is not God in God's terms. My God isn't the God of George Bush.
   </a>
   <a class="bq-aut qa_530963 oncl_a" href="/quotes/nassim_nicholas_taleb_530963" title="view author">
    Nassim Nicholas Taleb
   </a>
   <div class="kw-box">
    <a class="qkw-btn btn btn-xs oncl_klc" data-idx="0" href="/topics/god-quotes">
     God
    </a>
    <a class="qkw-btn btn btn-xs oncl_klc" data-idx="1" href="/topics/religion-quotes">
     Religion
    </a>
    <a class="qkw-btn btn btn-xs oncl_klc" data-idx="2" href="/topics/arrogance-quotes">
     Arrogance
    </a>
   </div>
  </div>

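This is where we need to think about how best to get at the information. We can see that the quote sits inside a div tag and an a tag. However, if we simply pull every div or every a tag, we end up grabbing the same extra content again (the author link and the keyword links), so we need to find something that is unique to the quote and to nothing else.

So, if we look back at the second export and compare the a tags around the quotes: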

<a class="b-qt qt_530963 oncl_q" href="/quotes/nassim_nicholas_taleb_530963" title="view quote">
<a class="b-qt qt_531016 oncl_q" href="/quotes/nassim_nicholas_taleb_531016" title="view quote">

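We can see that the class and href values change every time and won't be much help, but the title attribute stays the same, so we can use that. Again using your code as a template: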

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
s = soup(webpage,"html.parser")
quotes = s.find_all("a", attrs={"title": "view quote"})
for a in quotes:
    listOfQuotes.append(a.text)

print(listOfQuotes)
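
To try the title-based selection without hitting the site, the same call can be run against an inline string. This is a sketch under stated assumptions: `SAMPLE_HTML` and its class names are made up for illustration, and `extract_quotes` is a hypothetical helper name. It uses `get_text(strip=True)` because, as the prettified export shows, the quote text is surrounded by newlines and indentation.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two quote cards whose class values differ but whose
# quote links share the same title attribute.
SAMPLE_HTML = """
<div id="quotesList">
  <div id="qpos_1_1">
    <a class="b-qt qt_A oncl_q" title="view quote">
      My God isn't the God of George Bush.
    </a>
    <a class="bq-aut qa_A oncl_a" title="view author">Nassim Nicholas Taleb</a>
  </div>
  <div id="qpos_1_2">
    <a class="b-qt qt_B oncl_q" title="view quote">
      Debt is a mistake between lender and borrower, and both should suffer.
    </a>
    <a class="bq-aut qa_B oncl_a" title="view author">Nassim Nicholas Taleb</a>
  </div>
</div>
"""


def extract_quotes(html):
    """Return the stripped text of every <a> whose title is "view quote"."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", attrs={"title": "view quote"})
    return [a.get_text(strip=True) for a in anchors]


quotes = extract_quotes(SAMPLE_HTML)
print(quotes)
```

The author links are skipped because their title is "view author", which is the property the answer relies on.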

For the second part of your question, I would use what Lucas posted before me, but I have adapted it to your code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
s = soup(webpage,"html.parser")
# Collect every "a" tag whose title is "view quote"
quotes = s.find_all("a", attrs={"title": "view quote"})
for a in quotes:
    # do something...
    listOfQuotes.append(a.text)

pagination = s.select('ul[class*="pagination"]')
if not pagination:
    pages = 0
else:
    # Subtract two: the "next" and "previous" items
    pages = len(pagination[0].find_all("li")) - 2

【Discussion】:

【Solution 2】:

The simplest way is to find them by their title (which all of the quotes have):

import requests
from bs4 import BeautifulSoup

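url = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
r = requests.get(url)
# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning.
soup = BeautifulSoup(r.text, "html.parser")

# Collect every "a" tag whose title is "view quote"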
all_a_quotes = soup.find_all("a", attrs={"title": "view quote"})
for a in all_a_quotes:
    # do something...
    print(a.text)

This outputs (60 in total):

I'm in favour of religion as a tamer of arrogance. For a Greek Orthodox, the idea of God as creator outside the human is not God in God's terms. My God isn't the God of George Bush.
You are rich if and only if money you refuse tastes better than money you accept.
If you take risks and face your fate with dignity, there is nothing you can do that makes you small; if you don't take risks, there is nothing you can do that makes you grand, nothing.
Steve Jobs, Bill Gates and Mark Zuckerberg didn't finish college. Too much emphasis is placed on formal education - I told my children not to worry about their grades but to enjoy learning.
[...]
Debt is a mistake between lender and borrower, and both should suffer.
Capitalism is about adventurers who get harmed by their mistakes, not people who harm others with their mistakes.
The next time you experience a blackout, take some solace by looking at the sky. You will not recognize it.

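For pagination, we check whether the "ul" element at the end of the page exists (if it does not, there is only one page). If it does, we count how many "li" items it has and subtract 2: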

pagination = soup.select('ul[class*="pagination"]')
if not pagination:
    pages = 0
else:
    # Subtract two: the "next" and "previous" items
    pages = len(pagination[0].find_all("li")) - 2
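
The counting rule can be checked against a small inline example. This sketch assumes a hypothetical pagination list with one "previous" item, one "next" item, and one `li` per page; the real page may also contain ellipsis or other filler items, in which case the count would need adjusting. Unlike the snippet above, the hypothetical `count_pages` helper returns 1 rather than 0 when no pagination list is found, following the "only one page" reading in the text. That choice is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup, not copied from the real page.
SAMPLE_HTML = """
<ul class="pagination bq_s">
  <li><a>Prev</a></li>
  <li><a>1</a></li>
  <li><a>2</a></li>
  <li><a>3</a></li>
  <li><a>Next</a></li>
</ul>
"""


def count_pages(html):
    """Count pages from a pagination <ul>, or assume 1 page if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # [class*="pagination"] matches any <ul> whose class contains the substring.
    pagination = soup.select('ul[class*="pagination"]')
    if not pagination:
        return 1
    # Subtract the "previous" and "next" items.
    return len(pagination[0].find_all("li")) - 2


pages_with_pager = count_pages(SAMPLE_HTML)
pages_without_pager = count_pages("<p>no pager here</p>")
print(pages_with_pager, pages_without_pager)
```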

【Discussion】: