How to get text from inside or outside a span at the same time with XPath? Answers

【Question title】: How to get text from inside span or outside at the same time with xpath?
【Posted】: 2020-02-04 00:44:20
【Question description】:

I'm having trouble using XPath to scrape a price list whose markup is inconsistent.

Example

<td><span="green">$33.99</span></td>
<td>Out of stock</td>
<td><span="green">$27.99</span></td>
<td><span="green">$35.00</span></td>

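How can I get both the prices inside the spans and the "Out of stock" text at the same time? Right now I only get $33.99 and the other values that sit inside a span; any text that is not inside a span is skipped, which breaks the ordering.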

Update: my failed attempt using @piratefache's solution (Scrapy)

product_prices_tds = response.xpath('//td/')
    product_prices = []

    for td in product_prices_tds:
        if td.xpath('//span'):
            product_prices = td.xpath('//span/text()').extract()
        else:
            product_prices = td.xpath('//text()').extract()

    for n in range(len(product_names)):
        items['price'] = product_prices[n]
        yield items

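It doesn't work because product_prices does not end up with the right text: it picks up text from everywhere in the page, not just from inside or outside the span of the current cell as I intended.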

Update for those who come later: thanks to @piratefache, I fixed my code. Here is the corrected snippet for anyone who wants to use it.

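    # Fragment from inside a spider callback: response, product_names and items
    # are defined by the surrounding spider code.
    product_prices_tds = response.xpath('//td')
    product_prices = []

    for td in product_prices_tds:
        if td.xpath('span'):
            # Price cell: take the text inside the <span>.
            product_prices.append(td.xpath('span//text()').extract())
        else:
            # No <span>: take the cell's own text (relative path, no leading slash).
            product_prices.append(td.xpath('text()').extract())

    for n in range(len(product_names)):
        items['price'] = product_prices[n]
        yield items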

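【Question discussion】:

You may be using the BeautifulSoup library to get your HTML. What does your code look like so far?
At first I used //td/span/text(), but that does not return "Out of stock". Then I tried //td/text() | /span/text() and got a better result, but it does not merge everything into one list. It just lists what is outside the spans, with \n characters, and then what is inside the span; the $33.99 ends up in a separate array. It is not joined into one result.
OK, can you edit your post and add your code? It will be easier to debug.
I've updated the code in the post.
I still don't see any runnable Python code?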

Tags: python list xpath scrapy

【Solution 1】:

See the edit below for a Scrapy version.

Based on your HTML, with the BeautifulSoup library you can get the information like this:

from bs4 import BeautifulSoup

page = """<td><span="green">$33.99</span></td>
          <td>Out of stock</td>
            <td><span="green">$27.99</span></td>
            <td><span="green">$35.00</span></td>"""

soup = BeautifulSoup(page, features="lxml")
tds = soup.body.find_all('td')  # get all <td> cells

for td in tds:

    # if the cell contains a <span> element, print the span's text
    if td.find('span'):
        print(td.find('span').text)
    # if not, just print the cell's own text (here it's "Out of stock")
    else:
        print(td.text)

Output:

$33.99
Out of stock
$27.99
$35.00
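With BeautifulSoup the branch can also be avoided, since get_text() on a cell returns the text of the cell and of everything nested in it. A minimal sketch, assuming valid markup (class attribute, table wrapper) and the built-in "html.parser" backend rather than lxml; both are choices made for this example, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Assumption: valid markup with class="green" and a <table> wrapper.
html = """
<table><tr>
  <td><span class="green">$33.99</span></td>
  <td>Out of stock</td>
  <td><span class="green">$27.99</span></td>
  <td><span class="green">$35.00</span></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) collects nested text, so cells with and without
# a <span> are handled by the same call.
prices = [td.get_text(strip=True) for td in soup.find_all("td")]

print(prices)  # ['$33.99', 'Out of stock', '$27.99', '$35.00']
```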

Using Scrapy:

import scrapy

page = """<td><span="green">$33.99</span></td>
          <td>Out of stock</td>
            <td><span="green">$27.99</span></td>
            <td><span="green">$35.00</span></td>"""

response = scrapy.Selector(text=page, type="html")
tds = response.xpath('//td')

for td in tds:

    # if the cell contains a <span> element, print the span's text
    if td.xpath('span'):
        print(td.xpath('span//text()')[0].extract())
    # if not, just print the cell's own text (here it's "Out of stock")
    else:
        print(td.xpath('text()')[0].extract())

Output:

$33.99
Out of stock
$27.99
$35.00

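【Discussion】:

Sorry for not being specific, but I'm using Scrapy. bs4 can't get JS-rendered data.
Yes! I fixed my code with the correct syntax, as you suggested, and it works. You're a lifesaver. Thank you very much.

【Solution 2】:

An XPath solution (2.0 and later), using the same logic as the one @piratefache posted earlier: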

for $td in //td 
return 
if ($td[span]) 
then
$td/span/data() 
else 
$td/data()

Applied to the following XML, it returns the values listed after it:

<root>
    <td>
        <span>$33.99</span>
    </td>
    <td>Out of stock</td>
    <td>
        <span>$27.99</span>
    </td>
    <td>
        <span>$35.00</span>
    </td>
</root>

 $33.99
 Out of stock
 $27.99
 $35.00

By the way: <span="green"> is not valid XML. It is probably missing an attribute name such as @color or similar (?)
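Scrapy's selectors are built on lxml, which implements XPath 1.0, so the for/if/then/else expression above needs an XPath 2.0 processor and cannot be passed to response.xpath() as is. The same logic can be written as a loop in Python. A minimal sketch using only the standard library on the XML document from this answer; it is a translation of the idea, not of data() semantics in full:

```python
import xml.etree.ElementTree as ET

xml_doc = """<root>
    <td>
        <span>$33.99</span>
    </td>
    <td>Out of stock</td>
    <td>
        <span>$27.99</span>
    </td>
    <td>
        <span>$35.00</span>
    </td>
</root>"""

root = ET.fromstring(xml_doc)

values = []
for td in root.iter('td'):
    span = td.find('span')
    # Same branching as the XPath 2.0 expression: use the span if the cell
    # has one, otherwise the cell itself.
    node = span if span is not None else td
    values.append(''.join(node.itertext()).strip())

print(values)  # ['$33.99', 'Out of stock', '$27.99', '$35.00']
```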

【Discussion】: