Python BS4 share info not showing

【Question title】: Python BS4 share info not showing
【Posted】: 2018-08-02 02:22:38
【Question】:

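Good morning all,

I recently started using BeautifulSoup and have been watching videos and reading about it. My goal is to scrape share price info from the web every day and add it to a CSV file that is already populated with historical share prices.

I have tried several modifications of my code (below). Whether I use a "div" or a "span" element and then add the full class name, I always end up with empty brackets "[]" as the printout in the console.

The website I am using is Yahoo Finance, so I tried another website, Sharenet, and had the same issue. I then tried scraping another part of the site (just the share name) and also got empty brackets. The only time I received a result was when I scraped a "div" that had multiple items nested in it. In the printout I could see the share price info, but surely there is a way to get just the price?

I have been using the following YouTube video as a guide. Together with a previous post on a similar question here, it has been very helpful, but I am still having trouble.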

https://www.youtube.com/watch?v=XQgXKtPSzUI

import yahoo finance stock price with beautifulsoup and request

Below is my code (I am using Python 2.7):

import urllib2
from bs4 import BeautifulSoup as soup

#Opens the connection and downloads the webpage
kio_site = urllib2.urlopen("https://finance.yahoo.com/quote/KIO.JO?p=KIO.JO")

#This reads all the html of the webpage into a string
kio_html = kio_site.read()
#Now closing the internet connection that you opened before
kio_site.close()

#now you want to parse the html file
page_soup = soup(kio_html, "html.parser")

#Specifically find certain elements
kio_info = page_soup.find_all("span", {"class":"Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)"})
print kio_info

When I use the code below instead, I get a result, but the share price is buried in the clutter:

kio_info = page_soup.find_all("div", {"class":"My(6px) smartphone_Mt(15px)"})

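In the printout I also see a "data-reactid"="14" just before the share price number, but even when I include that in my code (along with "span" and the class "Trsdu(0.3s)" etc.) it does not give me the price.

Should the way I am reading the web page not be as html? I tried using lxml but got an error.

Thanks in advance for your help!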

【Question comments】:

Tags: python beautifulsoup

【Solution 1】:

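I would suggest using the requests library, although that is not the problem here. The website recognizes the Python script by its default User-Agent and returns a different response.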
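To see what the site is reacting to, here is a minimal sketch that inspects the User-Agent the standard library sends by default. It assumes Python 3 (the question uses Python 2.7, where the module is urllib2), and the browser_agent string is a made-up placeholder, not a value from the answer. Nothing is sent over the network; the request object is only built and inspected.

```python
import urllib.request

# Headers urllib attaches when the caller supplies none.
opener = urllib.request.build_opener()
default_agent = dict(opener.addheaders)["User-agent"]
print(default_agent)  # starts with "Python-urllib/", which gives the script away

# Hypothetical browser-like value, used only for illustration.
browser_agent = "Mozilla/5.0 (X11; Linux x86_64)"

# Building a Request does not open a connection; it only stores the headers.
request = urllib.request.Request(
    "https://finance.yahoo.com/quote/KIO.JO?p=KIO.JO",
    headers={"User-Agent": browser_agent},
)
print(request.get_header("User-agent"))
```

A server that filters on this header can tell the two requests apart even though the URL is identical.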

You can pass a fake User-Agent through the requests module to make the script look like a real browser.

You can use this:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
r = requests.get('https://finance.yahoo.com/quote/KIO.JO?p=KIO.JO', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

print(soup.find_all("span", {"class": "Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)"}))

Output:

[<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="35"><!-- react-text: 36 -->33,101.00<!-- /react-text --></span>]

Or, to get just the value, use this:

print(soup.find("span", {"class": "Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)"}).text)

Output:

33,101.00
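Since the stated goal is to add the daily price to an existing CSV file, the scraped text could be turned into a number and appended along these lines. This is only a sketch in Python 3: the helper names parse_price and append_price, and the two-column (date, price) layout, are assumptions and not something given in the question.

```python
import csv
from datetime import date


def parse_price(text):
    """Convert a scraped price string such as '33,101.00' to a float."""
    # Drop thousands separators and surrounding whitespace before converting.
    return float(text.replace(",", "").strip())


def append_price(csv_path, price, day=None):
    """Append one (ISO date, price) row to the end of a CSV file."""
    day = day or date.today()
    # Mode "a" adds to the end of the file, leaving existing rows untouched.
    with open(csv_path, "a", newline="") as handle:
        csv.writer(handle).writerow([day.isoformat(), price])
```

For example, append_price("kio_history.csv", parse_price("33,101.00")) would add one row dated today, where kio_history.csv is a placeholder file name.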

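【Comments】:

Thanks very much for your help. When I use the code with ".text" at the end, it prints exactly the share price value. I do have a small question about the logic in your answer: are you saying the problem was that the way my Chrome browser reads the html on the website and sends it to Python was incorrect? How did you get the info in the "headers" variable? Another small question: how does ".find" differ from ".find_all"? Thanks again!
Read this for your first question and this for the second.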
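On the ".find" versus ".find_all" question, a small sketch may help. It runs against a made-up HTML snippet (not the Yahoo page) and assumes Python 3 with bs4 installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to show the difference.
html = """
<div>
  <span class="price">33,101.00</span>
  <span class="price">32,950.00</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag, or None when nothing matches.
first = soup.find("span", {"class": "price"})
missing = soup.find("span", {"class": "no-such-class"})

# find_all returns a list-like ResultSet of every match, empty when nothing matches.
all_matches = soup.find_all("span", {"class": "price"})
none_found = soup.find_all("span", {"class": "no-such-class"})

print(first.text)
print([tag.text for tag in all_matches])
print(missing)
print(none_found)
```

The empty result from find_all is the "[]" seen in the question. Note that ".text" works on a single tag, so with find_all it has to be applied to each item in the result.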