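This is my first time offering help, so I apologize if it isn't the best. I'm new to Python, so I find it helpful to print things and save them to a file to see what the program actually sees.
I do that with the following code: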
#This opens a file in "w" ("write") mode. If 'export.txt' doesn't exist, Python creates it!
file1 = open('export.txt', 'w')
#This writes whatever I want to the file.
file1.write("This is what I want in the file")
#This safely closes the file.
file1.close()
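The same open/write/close steps can also be written with a `with` block, which closes the file for you. This is only a minimal sketch: the temporary directory is my own assumption so the example doesn't overwrite a real export.txt, and `encoding="utf-8"` is an optional extra that is not in the code above.

```python
import os
import tempfile

# Hypothetical location: a throwaway directory, so no real file is touched.
path = os.path.join(tempfile.mkdtemp(), "export.txt")

# "w" creates the file if it is missing and overwrites it if it exists.
with open(path, "w", encoding="utf-8") as file1:
    file1.write("This is what I want in the file")
# Leaving the with block closes the file, even if the write raised an error.

# Read the file back to check what was actually saved.
with open(path, encoding="utf-8") as saved:
    contents = saved.read()
print(contents)
```

The open/close version above does the same job. The `with` form just makes it harder to forget the close.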
Applying this to your code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
s = soup(webpage,"html.parser")
div_container = s.findAll("div", {"id":"quotesList"})
#Because findAll returns a list-like "bs4.element.ResultSet", I could not do file1.write(div_container) and had to iterate through each item instead.
#A ResultSet can't have .text on the end. However, each item you reach by index is a single tag, and .text works on that.
#In this case there is only one element. That is to say, div_container[1] doesn't exist.
#Open the file once, before the loop. Opening it with 'w' inside the loop would overwrite it on every pass.
file1 = open('export.txt', 'w')
for i in range(len(div_container)):
    #.text returns just the text inside the tag, with none of the HTML markup.
    file1.write(div_container[i].text)
file1.close()
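This gives us the following:
I'm in favour of religion as a tamer of arrogance. For a Greek Orthodox, the idea of God as creator outside the human is not God in God's terms. My God isn't the God of George Bush.
Nassim Nicholas Taleb
God
Religion
Arrogance
etc.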
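A side note on the findAll comments in the code above: since only one element has that id, `find` may be the simpler call, because it returns a single tag rather than a list. The sketch below uses a tiny made-up HTML string in place of the real page, so the markup and quote text are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page, just for this sketch.
html = '<div id="quotesList"><a title="view quote">Quote one</a></div>'
s = BeautifulSoup(html, "html.parser")

matches = s.find_all("div", {"id": "quotesList"})  # list-like ResultSet
single = s.find("div", {"id": "quotesList"})       # one Tag, or None if absent

print(len(matches))     # how many divs matched
print(matches[0].text)  # index into the ResultSet first, then use .text
print(single.text)      # no indexing needed with find
```

Either way, the text that ends up in the file is the output shown above.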
So what is happening here?
Let's run the code again, but instead of looking at the div, use BS4's prettify method to look at the actual HTML, like this:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
#.prettify() is here
s = soup(webpage,"html.parser").prettify()
file1 = open('export2.txt', 'w')
file1.write(s)
file1.close()
We can open the text file and see what Python sees. Here is a snippet of it:
<div class="bq_center ql_page">
<div class="reflow_body bq_center">
<div class="new-msnry-grid bqcpx grid-layout-hide" id="quotesList">
<div class="m-brick grid-item boxy clearfix bqQt r-width" id="qpos_1_1">
<a class="b-qt qt_530963 oncl_q" href="/quotes/nassim_nicholas_taleb_530963" title="view quote">
I'm in favour of religion as a tamer of arrogance. For a Greek Orthodox, the idea of God as creator outside the human is not God in God's terms. My God isn't the God of George Bush.
</a>
<a class="bq-aut qa_530963 oncl_a" href="/quotes/nassim_nicholas_taleb_530963" title="view author">
Nassim Nicholas Taleb
</a>
<div class="qbn-box">
<div class="sh-cont">
<a aria-label="Share this quote on Facebook" class="sh-fb sh-grey" href="/share/fb/530963" rel="nofollow" target="_blank">
<img alt="Share on Facebook" class="bq-fa" src="/st/img/4341377/fa/facebook-f.svg"/>
</a>
<a aria-label="Share this quote on Twitter" class="sh-tw sh-grey" href="/share/tw/530963?ti=Nassim+Nicholas+Taleb+Quotes" rel="nofollow" target="_blank">
<img alt="Share on Twitter" class="bq-fa" src="/st/img/4341377/fa/twitter.svg"/>
</a>
<a aria-label="Share this quote on LinkedIn" class="sh-tw sh-grey" href="/share/li/530963?ti=Nassim+Nicholas+Taleb+Quotes+-+BrainyQuote" rel="nofollow" target="_blank">
<img alt="Share on LinkedIn" class="bq-fa" src="/st/img/4341377/fa/linkedin-in.svg"/>
</a>
</div>
</div>
<div class="kw-box">
<a class="qkw-btn btn btn-xs oncl_klc" data-idx="0" href="/topics/god-quotes">
God
</a>
<a class="qkw-btn btn btn-xs oncl_klc" data-idx="1" href="/topics/religion-quotes">
Religion
</a>
<a class="qkw-btn btn btn-xs oncl_klc" data-idx="2" href="/topics/arrogance-quotes">
Arrogance
</a>
</div>
</div>
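Why does this matter to you? Because you are pulling out the whole div, and once that div is cut down to just the parts that carry text, it looks like this: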
<div class="m-brick grid-item boxy clearfix bqQt r-width" id="qpos_1_1">
<a class="b-qt qt_530963 oncl_q" href="/quotes/nassim_nicholas_taleb_530963" title="view quote">
I'm in favour of religion as a tamer of arrogance. For a Greek Orthodox, the idea of God as creator outside the human is not God in God's terms. My God isn't the God of George Bush.
</a>
<a class="bq-aut qa_530963 oncl_a" href="/quotes/nassim_nicholas_taleb_530963" title="view author">
Nassim Nicholas Taleb
</a>
<div class="kw-box">
<a class="qkw-btn btn btn-xs oncl_klc" data-idx="0" href="/topics/god-quotes">
God
</a>
<a class="qkw-btn btn btn-xs oncl_klc" data-idx="1" href="/topics/religion-quotes">
Religion
</a>
<a class="qkw-btn btn btn-xs oncl_klc" data-idx="2" href="/topics/arrogance-quotes">
Arrogance
</a>
</div>
</div>
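To see why asking the div for its text also returns the author and the topic links, here is a small sketch on a shortened copy of the HTML above. The markup is trimmed by me, and `stripped_strings` is just one way to list the pieces. It is not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened copy of the quote block, trimmed for this sketch.
html = """
<div id="qpos_1_1">
  <a title="view quote">My God isn't the God of George Bush.</a>
  <a title="view author">Nassim Nicholas Taleb</a>
  <div class="kw-box"><a href="/topics/god-quotes">God</a></div>
</div>
"""
s = BeautifulSoup(html, "html.parser")
brick = s.find("div", id="qpos_1_1")

# stripped_strings yields every piece of text nested anywhere inside the div,
# with surrounding whitespace removed and whitespace-only pieces skipped.
pieces = list(brick.stripped_strings)
print(pieces)
```

The quote, the author and the topic all come back together, because they all live somewhere inside the same div.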
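This is where we need to think about how best to get at the information. We can see that the quote sits inside a div tag and an a tag. However, if we select on those tags alone, we end up grabbing the same extra things again, so we need to find something that is unique to the quote and to nothing else.
So, if we look back at the second export and compare the a tags around the quotes: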
<a class="b-qt qt_530963 oncl_q" href="/quotes/nassim_nicholas_taleb_530963" title="view quote">
<a class="b-qt qt_531016 oncl_q" href="/quotes/nassim_nicholas_taleb_531016" title="view quote">
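We can see that the class and href parts change with every quote, so they won't help much, but the title stays the same, so we can use that. Using your code as a template again: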
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
s = soup(webpage,"html.parser")
quotes = s.find_all("a", attrs={"title": "view quote"})
for a in quotes:
    listOfQuotes.append(a.text)
print(listOfQuotes)
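The title filter can be tried without touching the website. In this sketch the HTML string and the quote texts are invented stand-ins, and `get_text(strip=True)` is my own addition to trim the newlines that surround the text inside each tag.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the page: two quote links and one author link.
html = """
<a class="b-qt qt_1 oncl_q" href="/quotes/x_1" title="view quote">
 First quote.
</a>
<a class="bq-aut qa_1 oncl_a" href="/quotes/x_1" title="view author">
 Some Author
</a>
<a class="b-qt qt_2 oncl_q" href="/quotes/x_2" title="view quote">
 Second quote.
</a>
"""
s = BeautifulSoup(html, "html.parser")

# Keep only the links whose title is exactly "view quote".
quote_links = s.find_all("a", attrs={"title": "view quote"})
quotes_demo = [a.get_text(strip=True) for a in quote_links]
print(quotes_demo)
```

The author link is skipped because its title is "view author", which is the point of filtering on the title.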
For the second part of your question, I would use what Lucas said before me, but I have adapted it to your code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
s = soup(webpage,"html.parser")
# Collect every "a" tag that has the title "view quote"
quotes = s.find_all("a", attrs={"title": "view quote"})
for a in quotes:
    # do something...
    listOfQuotes.append(a.text)
pagination = s.select('ul[class*="pagination"]')
if not pagination:
    pages = 0
else:
    # Subtract two: one for the "next" item and one for the "previous" item
    pages = len(pagination[0].find_all("li")) - 2
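The page-count logic can be checked on its own with a made-up pager. In the sketch below the `<ul>` markup and the helper name `count_pages` are hypothetical, and the real site's pager may be laid out differently, so treat the minus-two rule as an assumption to verify against the exported HTML.

```python
from bs4 import BeautifulSoup

# Made-up pager: "Prev", three page links, "Next".
pager_html = """
<ul class="pagination demo-pager">
  <li><a href="#">Prev</a></li>
  <li><a href="#">1</a></li>
  <li><a href="#">2</a></li>
  <li><a href="#">3</a></li>
  <li><a href="#">Next</a></li>
</ul>
"""

def count_pages(markup):
    """Hypothetical helper: count pager items, ignoring previous and next."""
    s = BeautifulSoup(markup, "html.parser")
    # Match any <ul> whose class attribute contains the word "pagination".
    pagination = s.select('ul[class*="pagination"]')
    if not pagination:
        return 0
    return len(pagination[0].find_all("li")) - 2

print(count_pages(pager_html))         # pager present
print(count_pages("<p>no pager</p>"))  # no pager on the page
```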