【Question Title】: BeautifulSoup Scraping U.S. News Today Stock <table>
【Posted】: 2018-06-23 13:46:06
【Question】:

Using Python, I am trying to scrape the table of stocks under $10 from U.S. Today Money Stocks Under $10, then add each element to a list (so that I can iterate over each stock). Currently I have this code:

import bs4 as bs
import requests

resp = requests.get('https://money.usnews.com/investing/stocks/stocks-under-10')
soup = bs.BeautifulSoup(resp.text, "lxml")
table = soup.find('table', {'class': 'table stock full-row search-content'})
tickers = []
# skip the header row, then take the first cell of every remaining row
for row in table.findAll('tr')[1:]:
    ticker = str(row.findAll('td')[0].text)
    tickers.append(ticker)

I keep getting this error:

Traceback (most recent call last):
  File "sandp.py", line 98, in <module>
    sandp(0)
  File "sandp.py", line 40, in sandp
    for row in table.findAll('tr')[1:]:
AttributeError: 'NoneType' object has no attribute 'findAll'

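【Comments】:

  • Can you show us what table looks like? Just to make sure you are actually getting a result.
  • @TomasFarias I added a print table line and the terminal shows None
  • Well, soup.find('table', {'class': 'table stock full-row search-content'}) does not seem to find a result. Are you sure that is the right class for the table? Have you checked that soup is really getting the right content? Maybe you have to pass some headers to requests.get
  • @TomasFarias Yes, it is the right table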
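
The check suggested in these comments can be sketched as follows. This is a minimal illustration, not the asker's code: `extract_tickers`, `SAMPLE_HTML`, and the tickers AAA and BBB are made up for the example, and only the class name `stock` is taken from the question. It returns an empty list instead of raising AttributeError when the table is missing from the HTML.

```python
from bs4 import BeautifulSoup


def extract_tickers(html):
    """Return the first-cell text of each data row, or [] if no stock table exists."""
    soup = BeautifulSoup(html, "html.parser")
    # Matches any <table> whose class list includes "stock" (class name from the question).
    table = soup.find("table", class_="stock")
    if table is None:
        # This is the case that produced the AttributeError in the question.
        return []
    tickers = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all("td")
        if cells:
            tickers.append(cells[0].get_text(strip=True))
    return tickers


# Hypothetical markup, used only to exercise the function.
SAMPLE_HTML = """
<table class="table stock full-row search-content">
  <tr><th>Ticker</th><th>Price</th></tr>
  <tr><td> AAA </td><td>$1.00</td></tr>
  <tr><td>BBB</td><td>$2.00</td></tr>
</table>
"""

print(extract_tickers(SAMPLE_HTML))
print(extract_tickers("<html><body>blocked</body></html>"))
```

With a real response you would pass resp.text to the function; an empty result then suggests that the returned HTML does not contain the table at all.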

Tags: python web-scraping beautifulsoup stocks


【Solution 1】:

The site is dynamic, so you can use selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
import collections
from bs4 import BeautifulSoup as soup
import re
d = webdriver.Chrome('/path/to/chromedriver') # positional driver path works in Selenium 3.x; newer releases take a Service object
d.get('https://money.usnews.com/investing/stocks/stocks-under-10')
while True:
  try:
    d.find_element(By.LINK_TEXT, "Load More").click() # keep clicking until every row is loaded
  except Exception:
    break
company = collections.namedtuple('company', ['name', 'abbreviation', 'description', 'stats'])
headers = [['a', {'class':'search-result-link'}], ['a', {'class':'text-muted'}], ['p', {'class':'text-small show-for-medium-up ellipsis'}], ['dl', {'class':'inline-dl'}], ['span', {'class':'stock-trend'}], ['div', {'class':'flex-row'}]]
final_data = [[getattr(i.find(a, b), 'text', None) for a, b in headers] for i in soup(d.page_source, 'html.parser').find_all('div', {'class':'search-result flex-row'})]
new_data = [[i[0], i[1], re.sub(r'\n+\s{2,}', '', i[2]), [re.findall(r'[\$\w\.%/]+', part) for part in i[3:]]] for i in final_data]
final_results = [i[:3]+[dict(zip(['Price', 'Daily Change', 'Percent Change'], filter(lambda x:re.findall(r'\d', x), i[-1][0])))] for i in new_data]
new_results = [company(*i) for i in final_results]

Output (first company):

company(name=u'Aileron Therapeutics Inc', abbreviation=u'ALRN', description=u'Aileron Therapeutics, Inc. is a clinical stage biopharmaceutical company, which focuses on developing and commercializing stapled peptides. Its ALRN-6924 product targets the tumor suppressor p53 for the treatment of a wide variety of cancers. It also offers the MDMX and MDM2. The company was founded by Gregory L. Verdine, Rosana Kapeller, Huw M. Nash, Joseph A. Yanchik III, and Loren David Walensky in June 2005 and is headquartered in Cambridge, MA.more\n', stats={'Daily Change': u'$0.02', 'Price': u'$6.04', 'Percent Change': u'0.33%'})
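
The stats dictionary in this output is built by tokenizing the text of each result with a regular expression, keeping only the tokens that contain a digit, and pairing them with fixed keys. A small standalone sketch of that step is below. The helper name `parse_stats` and the `sample` string are assumptions for illustration; the real text layout on the page may differ.

```python
import re

STAT_KEYS = ["Price", "Daily Change", "Percent Change"]


def parse_stats(text):
    """Pair the numeric tokens found in `text` with the keys in STAT_KEYS."""
    # Same character class as in the answer: $, word characters, dot, %, slash.
    tokens = re.findall(r"[\$\w\.%/]+", text)
    # Drop label words such as "Price"; keep tokens that contain at least one digit.
    numeric = [t for t in tokens if re.search(r"\d", t)]
    return dict(zip(STAT_KEYS, numeric))


# Hypothetical text of one stats block, not copied from the site.
sample = "Price\n  $6.04\n  Change\n  $0.02 (0.33%)"
print(parse_stats(sample))
```

Because zip stops at the shorter input, a block with fewer than three numeric tokens simply yields a dictionary with fewer keys.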

Edit:

All abbreviations:

abbrevs = [i.abbreviation for i in new_results]

Output:

[u'ALRN', u'HAIR', u'ONCY', u'EAST', u'CERC', u'ENPH', u'CASI', u'AMBO', u'CWBR', u'TRXC', u'NIHD', u'LGCY', u'MRNS', u'RFIL', u'AUTO', u'NEPT', u'ARQL', u'ITUS', u'SRAX', u'APTO']

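【Comments】:

  • I am completely new to this. What is the advantage of using Selenium instead of BS?
  • @Fidel_Willis When I tried to access the site with a plain requests call, the site blocked my request, so only a very small string of HTML came back. Calling BeautifulSoup.find for the table therefore returns None, which I think is why you are getting the AttributeError. The best solution is selenium, because it runs the client-side scripts the page needs to verify the IP, update the DOM, and so on. However, if your code does return the page's full HTML, you can still use my solution from line 14 (the headers definition) onward.
  • Ah, I understand now. Thank you. With your code I get the following error: os.path.basename(self.path), self.start_error_message) selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home Exception AttributeError: "'Service' object has no attribute 'process'" in <bound method Service.__del__ of <selenium.webdriver.chrome.service.Service object at 0x1a0bbc4a10>> ignored Is this because I put the file in the wrong folder?
  • @Fidel_Willis The path to the driver must be passed to the Chrome constructor. Please see my recent edit. Note, however, that you do not have to use Chrome. If you want to use Firefox, you can simply use webdriver.Firefox and download the Firefox driver.
  • After downloading the driver and so on, I finally got the PATH right. But now I get an error about permissions: Message: 'selenium' executable may have wrong permissions. How do I change these?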
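
One possible cause of the permissions message in the last comment is that the path passed to the constructor points at a folder, or at a file without the execute bit. The sketch below checks both on a POSIX system. The helper name `ensure_executable` is made up for this example, and this is only one way to approach the problem, not a confirmed fix for the commenter's setup.

```python
import os
import stat


def ensure_executable(path):
    """Add the owner-execute bit to `path` if needed; return True if it is executable."""
    if not os.path.isfile(path):
        # For example, the path points at the folder that contains the driver.
        return False
    mode = os.stat(path).st_mode
    if not mode & stat.S_IXUSR:
        # Equivalent to running `chmod u+x path` in a shell.
        os.chmod(path, mode | stat.S_IXUSR)
    return os.access(path, os.X_OK)
```

Calling ensure_executable('/path/to/chromedriver') before creating the driver would report False when the path is not a regular file, which helps separate a wrong path from a missing execute bit.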