python - How to get text with some specific tags' text together in a tag by scrapy in python?

【Question title】: How to get text with some specific tags' text together in a tag by scrapy in python?
【Posted】: 2017-06-08 06:19:07
【Question description】:

I am new to scrapy. I want to scrape some data from the web. I have an HTML document like the one below.

<div class="user-info">
    <p class="user-img">
        something in p tag
    </p>
    <em>text</em> data I want
    <a href="#">
        something in a tag
    </a>
</div>

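I only want to get "text data I want". But "text" is inside an <em> tag. So if I use div[contains(@class, "user-info")]/text(), I only get "data I want". If I use div[contains(@class, "user-info")]/node() or div[contains(@class, "user-info")]/node()/text(), I get all the tags in div.user-info. So the question is: how can I get "text" and "data I want" together as "text data I want"?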

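【Comments】:

Tags: python xpath web-scraping scrapy selector

【Solution 1】:

If you want all the nodes after <p class="user-img"> and before <a href="#">something in a tag</a>, you can use the following axis:

"The following axis contains all nodes in the same document as the context node that are after the context node in document order, excluding any descendants and excluding attribute nodes and namespace nodes"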

>>> import scrapy
>>> s = scrapy.Selector(text='''<div class="user-info">
...     <p class="user-img">
...         something in p tag
...     </p>
...     <em>text</em> data I want
...     <a href="#">
...         something in a tag
...     </a>
... </div>''')
>>> s.css('p.user-img')
[<Selector xpath="descendant-or-self::p[@class and contains(concat(' ', normalize-space(@class), ' '), ' user-img ')]" data='<p class="user-img">\n        something i'>]

>>> s.css('p.user-img').xpath('following::text()[following::a]').getall()
['\n    ', 'text', ' data I want\n    ']

>>> ''.join(s.css('p.user-img').xpath('following::text()[following::a]').getall())
'\n    text data I want\n    '

【Comments】:

【Solution 2】:

Try the XPath below to concatenate the two required text nodes:

concat(//div[@class="user-info"]/em/text(), " ", //div[@class="user-info"]/text()[3])

【Comments】:

Thanks, but the tagged text may sit in the middle of the data, for example "data text I want", and then the order is lost.
Try //div[@class="user-info"]//text()[not(parent::a or parent::p)]

【Solution 3】:

I replaced <em> and </em> with "", then used div[contains(@class, "user-info")]/text()

【Comments】: