【Question Title】: lxml cssselect Parsing
【Posted】: 2011-06-22 01:00:44
【Question】:
I have a document containing the following data:
<div class="ds-list">
<b>1. </b>
A domesticated carnivorous mammal
<i>(Canis familiaris)</i>
related to the foxes and wolves and raised in a wide variety of breeds.
</div>
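I want to get everything inside the ds-list class, without the <b> and <i> tags. Currently my code is doc.cssselect('div.ds-list'), but all that gives me is the newline before the <b>. How can I get the full text?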
【Comments】:
Tags:
python
html
parsing
css-selectors
lxml
【Solution 1】:
Perhaps you are looking for the text_content method?
import lxml.html as lh
content='''\
<div class="ds-list">
<b>1. </b>
A domesticated carnivorous mammal
<i>(Canis familiaris)</i>
related to the foxes and wolves and raised in a wide variety of breeds.
</div>'''
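doc = lh.fromstring(content)
# cssselect() returns a list of matching elements
for div in doc.cssselect('div.ds-list'):
    # text_content() returns the text of the element and all its children, without markup
    print(div.text_content())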
which yields
1.
A domesticated carnivorous mammal
(Canis familiaris)
related to the foxes and wolves and raised in a wide variety of breeds.
【Solution 2】:
# cssselect() returns a list, so index the first match before calling text_content()
doc.cssselect("div.ds-list")[0].text_content()