【Question Title】: Extracting paragraph text including other element's content using Scrapy Selector
【Posted】: 2015-03-25 11:24:37
【Question】:

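Using Scrapy 0.24 Selectors, I want to extract the text of a paragraph including the text of any elements nested inside it (in the example below, that is the anchor <a>). How can I achieve this?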

Code

>>> from scrapy import Selector
>>> html = """
        <html>
            <head>
                <title>Test</title>
            </head>
            <body>
                <div>
                    <p>Hello, can I get this paragraph content without this <a href="http://google.com">Google link</a>?
                </div>
            </body>
        </html>
        """
>>> sel = Selector(text=html, type="html")
>>> sel.xpath('//p/text()').extract()
[u'Hello, can I get this paragraph content without this ', u'?']

Output

[u'Hello, can I get this paragraph content without this ', u'?']

Expected output

[u'Hello, can I get this paragraph content without this Google link?']

【Comments】:

Hmm. You could first extract the content inside <a href=...> and then parse it. An example of how to get started: stackoverflow.com/questions/3997525/python-replace-with-regex

Tags: python html xpath scrapy

【Solution 1】:

I would recommend BeautifulSoup. Scrapy is a complete crawling framework, whereas BS is a powerful parsing library (see "Difference between BeautifulSoup and Scrapy crawler?").

Documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Installation: pip install beautifulsoup4

For your case:

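# 'html' is the string you provided in the question
from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates the text of a <p> and of all its descendants, including <a>
res = [p.get_text().strip() for p in soup.find_all('p')]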

Result:

[u'Hello, can I get this paragraph content without this Google link?']
