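[Posted]: 2017-11-08 11:51:44
[Question]:
I've written a script in Python to parse some names out of certain elements. When I run the script it does parse the names, but the output looks odd: the names are run together, so each paragraph comes out as one big name. In the markup, the names are separated by br tags. How can I get each name individually?
The elements containing the names: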
html_content='''
<div class="second-child"><div class="richText"> <p></p>
<p><strong>D<br></strong>Daiwa House Industry<br>Danske Bank<br>DaVita HealthCare Partners<br>Delphi Automotive<br>Denso<br>Dentsply International<br>Deutsche Boerse<br>Deutsche Post<br>Deutsche Telekom<br>Diageo<br>Dialight<br>Digital Realty Trust<br>Donaldson Company<br>DSM<br>DS Smith </p>
<p><strong>E<br></strong>East Japan Railway Company<br>eBay<br>EDP Renováveis<br>Edwards Lifesciences<br>Elekta<br>EnerNOC<br>Enphase Energy<br>Essilor<br>Etsy<br>Eurazeo<br>European Investment Bank (EIB)<br>Evonik Industries<br>Express Scripts <br><br><strong>F<br></strong>Fielmann<br>First Solar<br>FMO<br>Ford Motor<br>Fresenius Medical Care<br><br></p></div></div>
'''
The script I wrote to parse the names:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")
for items in soup.select(".second-child"):
    # .text concatenates all text nodes of each <p>; <br> tags contribute nothing
    name = ' '.join([item.text for item in items.select("p")])
    print(name)
The output I'm getting (partial):
DDaiwa House IndustryDanske BankDaVita HealthCare PartnersDelphi AutomotiveDensoDentsply InternationalDeutsche
The output I'd like to get:
DDaiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International
FYI, when I look closely at the result, I can see that the individual names are joined directly to one another, with no space in between.
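One possible way to separate the names, offered as a minimal sketch rather than a definitive answer: the br tags carry no text of their own, so `.text` simply concatenates the text nodes around them. Iterating over each paragraph's `stripped_strings` yields every text node as its own item instead. The markup below is a shortened copy of the question's HTML, the helper name `extract_names` is hypothetical, and the built-in `"html.parser"` is used in place of `"lxml"` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (trimmed for brevity).
html_content = '''
<div class="second-child"><div class="richText"> <p></p>
<p><strong>D<br></strong>Daiwa House Industry<br>Danske Bank<br>DS Smith </p>
<p><strong>E<br></strong>eBay<br>Etsy <br><br><strong>F<br></strong>Fielmann<br>First Solar<br><br></p></div></div>
'''


def extract_names(html):
    """Return every text node found in the paragraphs, one entry per node."""
    # Hypothetical helper; "html.parser" is an assumption, "lxml" should work too.
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for paragraph in soup.select(".second-child p"):
        # stripped_strings yields each text node separately, with surrounding
        # whitespace removed and whitespace-only nodes skipped.
        names.extend(paragraph.stripped_strings)
    return names


names = extract_names(html_content)
for name in names:
    print(name)
```

Note that the heading letters inside the strong tags ("D", "E", "F") come out as entries of their own with this approach; they would need to be filtered out separately if only company names are wanted.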
[Discussion]:
Tags: python string python-3.x web-scraping css-selectors