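The text sits inside two <p> tags, so part of it is in each p.text rather than in div.text. However, you can extract all of the text in all of the <div>'s descendants by calling the text_content method instead of using the XPath text():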
import requests
import lxml.html as LH

url = ("https://www.goodeggs.com/sfbay/missionheirloom/"
       "seasonal-chicken-stew-16oz/53c68de974e06f020000073f")
# verify=False skips TLS certificate verification
page = requests.get(url, verify=False)
root = LH.fromstring(page.text)
path = '//section[@class="product-description"]/div[@class="description-body"]'
for div in root.xpath(path):
    # text_content() returns the text of the element and all of its descendants
    print(div.text_content())
yields
We’re super excited about the changing seasons! Because the new season brings wonderful new ingredients, we’ll be changing the flavor profile of our stews. Starting with deliveries on Thursday October 9th, the Chicken and Wild Rice stew will be replaced with a Classic Chicken Stew. We’re sure you’ll love it!Mission: Heirloom is a food company based in Berkeley. All of our food is sourced as locally as possible and 100% organic or biodynamic. We never cook with refined oils, and our food is always gluten-free, grain-free, soy-free, peanut-free, legume-free, and added sugar-free.
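The same behavior can be reproduced offline. The following is a minimal sketch, assuming a hypothetical two-paragraph snippet that only imitates the structure of the product page (the markup and its text are made up for illustration). It shows that div.text is empty when the <div> starts directly with a <p>, while text_content() gathers the text of both paragraphs:

```python
import lxml.html as LH

# Hypothetical markup standing in for the real product page
html = ('<section class="product-description">'
        '<div class="description-body">'
        '<p>First paragraph.</p><p>Second paragraph.</p>'
        '</div></section>')

root = LH.fromstring(html)
div = root.xpath('//div[@class="description-body"]')[0]

# The div has no text of its own before its first child element
div_text = div.text
# Each paragraph holds only its own part of the text
paragraph_texts = [p.text for p in div.xpath('./p')]
# text_content() concatenates the text of all descendants
full_text = div.text_content()

print(repr(div_text))
print(paragraph_texts)
print(full_text)
```

Note that text_content() inserts no separator between the paragraphs, which is why "love it!Mission" runs together in the output above.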
PS. dfsq has already suggested using the XPath ...//text(). That works too, but unlike text_content, it returns the pieces of text as separate items:
In [256]: root = LH.fromstring('<a>FOO <b>BAR <c>QUX</c> </b> BAZ</a>')
In [257]: root.xpath('//a//text()')
Out[257]: ['FOO ', 'BAR ', 'QUX', ' ', ' BAZ']
In [258]: [a.text_content() for a in root.xpath('//a')]
Out[258]: ['FOO BAR QUX  BAZ']
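If a single string is wanted from the //text() result, the pieces can be joined by hand. This is a small sketch, not part of the original answer, reusing the same toy markup. It also shows one possible way to collapse runs of whitespace with str.split, which is an extra step that neither approach performs on its own:

```python
import lxml.html as LH

root = LH.fromstring('<a>FOO <b>BAR <c>QUX</c> </b> BAZ</a>')
a = root.xpath('//a')[0]

# //text() returns each text node as a separate item
pieces = a.xpath('.//text()')
# Joining the items with an empty string matches text_content()
joined = ''.join(pieces)
# Optional: collapse whitespace runs into single spaces
normalized = ' '.join(a.text_content().split())

print(pieces)
print(repr(joined))
print(repr(normalized))
```

Keeping the pieces separate can be handy when each fragment needs its own processing, for example stripping or filtering out whitespace-only items before joining.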