【Posted】: 2020-06-10 12:35:31
【Question】:
I am trying to extract company information from Bloomberg's company profile pages using the following code:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.bloomberg.com/profile/company/AAPL:US'
source = requests.get(URL)
soup = BeautifulSoup(source.content, 'lxml')

# Look up the name and description by their full class names
company_name = soup.findAll('h1', class_='companyName__9bd88132')
company_description = soup.findAll('div', class_='description__ce057c5c')

print(company_name)
print(company_description)
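But all I get back is two empty lists ("[]"). Answers to similar questions say this happens when the wrong div is being targeted, but I don't think that is the case here. Does anyone know why it isn't working? Edit: below is the section of HTML I am trying to pull from: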
<section class="companyProfileOverview__aa874298 up__e13cf193"><section class="info__d075c560"><h1 class="companyName__9bd88132">Apple Inc</h1><div class="description__ce057c5c">Apple Inc. designs, manufactures, and markets personal computers and related personal computing and mobile communication devices along with a variety of related software, services, peripherals, and networking solutions. Apple sells its products worldwide through its online stores, its retail stores, its direct sales force, third-party wholesalers, and resellers.</div></section><section class="currentPriceContainer"><p class="currentPriceLabel__f1524605">CURRENT PRICE</p><div><div class="inlineRow__7728fc34"><span class="tickerText__d2e1ee30">AAPL:US</span><span class="priceText__0feeaba3">343.99</span><span class="currency__bef924de">USD</span></div><span class="triangle__73a7d8b2 up__a3b61807"></span><div class="inlineRow__7728fc34"><span class="priceChange__5e691975">+10.53</span><span class="percentChange__3c14f7c4">+3.16%</span></div><div class="time__245ca7bb "><span>As of 08:00 PM EDT 06/09/2020 </span></div><a class="quoteLink__d3ac120b" href="/quote/AAPL:US">SEE QUOTE</a></div></section><div class="infoTable__96162ad6"><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">SECTOR</h2><div class="infoTableItemValue__e188b0cb">Technology</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">INDUSTRY</h2><div class="infoTableItemValue__e188b0cb">Hardware</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">SUB-INDUSTRY</h2><div class="infoTableItemValue__e188b0cb">Communications Equipment</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">FOUNDED</h2><div class="infoTableItemValue__e188b0cb">01/03/1977</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">ADDRESS</h2><div 
class="infoTableItemValue__e188b0cb">1 Infinite Loop
Cupertino, CA 95014
United States</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">PHONE</h2><div class="infoTableItemValue__e188b0cb">1-408-996-1010</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">WEBSITE</h2><div class="infoTableItemValue__e188b0cb">www.apple.com</div></section><section class="infoTableItem__1003ce53"><h2 class="infoTableItemLabel__c9a5d511">NO. OF EMPLOYEES</h2><div class="infoTableItemValue__e188b0cb">100000</div></section></div></section>
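I am trying to extract the company name (companyName__9bd88132) and the company description (description__ce057c5c). Eventually I would also like to extract the sector information.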
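【Comments】:
- First, check whether the page adds these elements with JavaScript, because requests and BS cannot run JavaScript. Second, check print(source.text) to see what you actually received from the server. Here I see <title>Bloomberg - Are you a robot?</title>, which means the server recognized the script and sent different content. Getting the real page may take more work, e.g. headers such as User-Agent so the request looks like it comes from a real browser, after which the server may send the correct data. Or you may need Selenium to control a real web browser, which behaves more like a human user.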
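A minimal sketch of the checks described in the comment above, under a few assumptions: the header values in HEADERS are hypothetical browser-like values and may well not be enough to get past the bot check; the "Are you a robot" title is taken from the comment and may differ in practice; and the helper names (looks_blocked, parse_profile, fetch, main) are made up for illustration. The class names are matched by prefix on the assumption that the hash-like suffix (e.g. __9bd88132) is not stable.

```python
import re

# Marker reported in the comment above for the bot-check page (assumed stable).
BLOCK_TITLE = re.compile(r"<title>[^<]*are you a robot", re.IGNORECASE)

# Hypothetical browser-like headers; there is no guarantee the site accepts them.
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/83.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
}


def looks_blocked(html):
    """Return True if the HTML looks like the bot-check page, not the profile."""
    return BLOCK_TITLE.search(html) is not None


def parse_profile(html):
    """Extract name, description and the info table from profile HTML."""
    # Imported lazily so looks_blocked() works without third-party packages.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # Match on the class prefix rather than the full hashed class name.
    name = soup.select_one('h1[class^="companyName"]')
    description = soup.select_one('div[class^="description"]')

    info = {}
    for item in soup.select('section[class^="infoTableItem"]'):
        label = item.select_one('[class^="infoTableItemLabel"]')
        value = item.select_one('[class^="infoTableItemValue"]')
        if label and value:
            info[label.get_text(strip=True)] = value.get_text(" ", strip=True)

    return {
        "name": name.get_text(strip=True) if name else None,
        "description": description.get_text(strip=True) if description else None,
        "info": info,  # e.g. info.get("SECTOR") for the sector
    }


def fetch(url):
    """Download the page with browser-like headers and return the body text."""
    import requests  # third-party, same library as in the question

    response = requests.get(url, headers=HEADERS, timeout=10)
    return response.text


def main():
    html = fetch("https://www.bloomberg.com/profile/company/AAPL:US")
    if looks_blocked(html):
        print("Received the bot-check page; headers alone were not enough.")
    else:
        print(parse_profile(html))
```

Calling main() runs the whole flow. If looks_blocked still returns True with the extra headers, the remaining option mentioned in the comment is driving a real browser with Selenium and passing its page source to parse_profile.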
Tags: python beautifulsoup python-requests findall