Beautiful Soup returning "[]"

【Question title】: Beautiful Soup returning "[]"
【Posted】: 2020-06-10 12:35:31
【Question description】:

I am trying to extract company information from Bloomberg's company profile pages using the following code:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.bloomberg.com/profile/company/AAPL:US'

source = requests.get(URL)

soup = BeautifulSoup(source.content, 'lxml')

company_name = soup.findAll('h1', class_= 'companyName__9bd88132')

company_description = soup.findAll('div', class_ = 'description__ce057c5c')

print(company_name)
print(company_description)

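But all I get back is two empty lists ("[]"). Answers to similar questions say this happens when the wrong div is being selected, but I don't think that is the case here. Does anyone know why it isn't working? Edit: the part of the HTML I am trying to extract from is attached below: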

<section class="companyProfileOverview__aa874298 up__e13cf193"><section class="info__d075c560"><h1 class="companyName__9bd88132">Apple Inc</h1><div class="description__ce057c5c">Apple Inc. designs, manufactures, and markets personal computers and related personal computing and mobile communication devices along with a variety of related software, services, peripherals, and networking solutions. Apple sells its products worldwide through its online stores, its retail stores, its direct sales force, third-party wholesalers, and resellers.</div></section><section class="currentPriceContainer"><p class="currentPriceLabel__f1524605">CURRENT PRICE</p><div><div class="inlineRow__7728fc34"><span class="tickerText__d2e1ee30">AAPL:US</span><span class="priceText__0feeaba3">343.99</span><span class="currency__bef924de">USD</span></div><span class="triangle__73a7d8b2 up__a3b61807"></span><div class="inlineRow__7728fc34"><span class="priceChange__5e691975">+10.53</span><span class="percentChange__3c14f7c4">+3.16%</span></div><div class="time__245ca7bb "><span>As of 08:00 PM EDT 06/09/2020 </span></div><a class="quoteLink__d3ac120b" href="/quote/AAPL:US">SEE QUOTE</a></div></section><div class="infoTable__96162ad6"><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">SECTOR</h2><div class="infoTableItemValue__e188b0cb">Technology</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">INDUSTRY</h2><div class="infoTableItemValue__e188b0cb">Hardware</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">SUB-INDUSTRY</h2><div class="infoTableItemValue__e188b0cb">Communications Equipment</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">FOUNDED</h2><div class="infoTableItemValue__e188b0cb">01/03/1977</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">ADDRESS</h2><div 
class="infoTableItemValue__e188b0cb">1 Infinite Loop
Cupertino, CA 95014
United States</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">PHONE</h2><div class="infoTableItemValue__e188b0cb">1-408-996-1010</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">WEBSITE</h2><div class="infoTableItemValue__e188b0cb">www.apple.com</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">NO. OF EMPLOYEES</h2><div class="infoTableItemValue__e188b0cb">100000</div></section></div></section>

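I am trying to extract the company name (companyName__9bd88132) and the company description (description__ce057c5c). Eventually I would also like to extract the sector information.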

【Question discussion】:

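First, check whether the page adds its elements with JavaScript, because requests and BeautifulSoup cannot run JavaScript. Second, check print(source.text) to see what you actually get from the server. Here I see <title>Bloomberg - Are you a robot?</title>, which means the server recognized the script and sent different content. It may need more work, for example headers such as User-Agent, to look like a real visitor so that the server sends the correct data. Alternatively, you may need Selenium to control a real web browser, which behaves more like a real person.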
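The check described in this comment can be automated before any parsing is attempted. The following is a minimal sketch using only the standard library; the names page_title, looks_blocked and BLOCK_MARKERS are hypothetical, and the only marker text taken from this thread is "Are you a robot?". Other sites, or Bloomberg at a later date, may word their block page differently.

```python
import re

# Hypothetical marker list: only "are you a robot" comes from the comment above.
BLOCK_MARKERS = ("are you a robot",)


def page_title(html: str) -> str:
    """Return the text of the first <title> element, or '' if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def looks_blocked(html: str) -> bool:
    """Return True if the page title contains one of the known block-page markers."""
    title = page_title(html).lower()
    return any(marker in title for marker in BLOCK_MARKERS)
```

With the question's code, this could be called as looks_blocked(source.text) right after requests.get(URL). A True result suggests that the empty lists come from the block page and not from the selectors.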

Tags: python beautifulsoup python-requests findall

【Solution 1】:

Use this code:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

URL = 'https://www.bloomberg.com/profile/company/AAPL:US'

# Fixed User-Agent string; not used below, kept as an alternative to ua.random.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# Browser-like headers with a random User-Agent, so the server is less
# likely to answer with the "Are you a robot?" page.
ua = UserAgent()
hdr = {'User-Agent': ua.random,
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
source = requests.get(URL, headers=hdr)

soup = BeautifulSoup(source.content, features='html.parser')
# print(soup)  # uncomment to inspect what the server actually returned

# find_all returns a list of matching tags (an empty list if nothing matches).
company_name = soup.find_all('h1', class_='companyName__9bd88132')
company_description = soup.find_all('div', class_='description__ce057c5c')

print(company_name)
print(company_description)
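The question also mentions wanting the sector information. Once the server returns the real page, the info table could be read along these lines. This is only a sketch: it runs against a trimmed copy of the markup quoted in the question, not against the live site. It assumes the class names keep their readable prefix (such as infoTableItem) even if the hashed suffix changes. The helper names by_prefix and parse_profile are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted in the question (not fetched from the site).
SAMPLE_HTML = """
<section class="info__d075c560">
<h1 class="companyName__9bd88132">Apple Inc</h1>
</section>
<div class="infoTable__96162ad6">
<section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">SECTOR</h2><div class="infoTableItemValue__e188b0cb">Technology</div></section>
<section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">INDUSTRY</h2><div class="infoTableItemValue__e188b0cb">Hardware</div></section>
</div>
"""


def by_prefix(prefix):
    """Build a pattern matching a class made of `prefix`, '__' and any suffix."""
    return re.compile(r"^" + re.escape(prefix) + r"__")


def parse_profile(html):
    """Return the company name and a dict of info-table labels to values."""
    soup = BeautifulSoup(html, "html.parser")
    name_tag = soup.find("h1", class_=by_prefix("companyName"))
    info = {}
    for item in soup.find_all("section", class_=by_prefix("infoTableItem")):
        label = item.find("h2")
        value = item.find("div")
        if label and value:
            info[label.get_text(strip=True)] = value.get_text(" ", strip=True)
    return {
        "name": name_tag.get_text(strip=True) if name_tag else None,
        "info": info,
    }
```

Calling parse_profile(source.text) on a blocked response would give a name of None and an empty info dict, which is the same symptom as the "[]" in the question.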

【Discussion】:

Thanks for the code above! It didn't work at first; I had to change "ua=UserAgent()" to "ua=UserAgent(verify_ssl=False)" because SSL verification was failing. Thanks again.
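If fake_useragent cannot load its browser data, as with the SSL problem described here, one alternative is to skip the library and pick a User-Agent from a small hard-coded list. This is a minimal sketch and not part of the original answer. FALLBACK_USER_AGENTS and build_headers are hypothetical names, the strings are illustrative examples, and a fixed list gives no guarantee that the server will accept the request.

```python
import random

# Illustrative desktop browser strings; swap in whatever current browsers send.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
]


def build_headers(user_agent=None):
    """Return browser-like headers, choosing a fallback User-Agent if none is given."""
    if user_agent is None:
        user_agent = random.choice(FALLBACK_USER_AGENTS)
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.8",
    }
```

The result can be passed straight to requests, for example requests.get(URL, headers=build_headers()).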