How to capture inner text as well as inner tags in BeautifulSoup

[Question title]: How to capture inner text as well as inner tags in BeautifulSoup
[Posted]: 2014-04-01 11:12:49
[Question]:

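I am parsing a document that contains a list of div tags, but sometimes it also has bare inline text between them. I need to know how to extract all of that content in document order.

Say I have the following: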

<div>
<div>1</div>
<div>2</div>
3
<div>4</div>
</div>

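I need to extract all of the text above so that it comes out as 1234.

I have the following code, which gets all of the div tags but does not pick up the text that stands on its own.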

from ghost import Ghost
from BeautifulSoup import BeautifulSoup

# Match only <div> tags; bare text nodes are not tags, so they never match.
def tagfilter(tag):
    return tag.name == 'div'

ghost = Ghost()
ghost.open("testpage.html")

page, resources = ghost.wait_for_page_loaded()

soup = BeautifulSoup(ghost.content)
maindiv = soup.find('div', {'id': 'parentdiv'})
outtext = ''
# Concatenate the text of every nested div (this skips the bare "3").
for s in maindiv.findAll(tagfilter):
    outtext += s.text
print outtext

[Question comments]:

Tags: python html beautifulsoup screen-scraping

[Solution 1]:

Use stripped_strings (or strings, if you need the whitespace):

In [16]: soup = BeautifulSoup('''<div>
<div>1</div>
<div>2</div>
3
<div>4</div>
</div>''')


In [19]: list(soup.stripped_strings)
Out[19]: [u'1', u'2', u'3', u'4']


In [20]: ''.join(soup.stripped_strings)
Out[20]: u'1234'
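
To make the difference between strings and stripped_strings concrete, here is a minimal sketch as a standalone script. It assumes Beautiful Soup 4 (the bs4 package) on Python 3 and the built-in "html.parser" backend; the variable names raw and clean are illustrative only and do not come from the original answer.

```python
from bs4 import BeautifulSoup

html = """<div>
<div>1</div>
<div>2</div>
3
<div>4</div>
</div>"""

# Assumption: the stdlib "html.parser" backend; other parsers may wrap
# the fragment in <html>/<body> but yield the same text nodes here.
soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node in document order, including the
# whitespace-only nodes (newlines) that sit between the tags.
raw = list(soup.strings)

# .stripped_strings yields the same nodes with surrounding whitespace
# removed and whitespace-only nodes skipped entirely.
clean = list(soup.stripped_strings)

print(raw)
print(clean)
print("".join(clean))
```

The bare 3 shows up in both sequences because it is a text node of the outer div, which is exactly the piece that a tag-only search misses.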

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings
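
Applied to the code in the question, the same idea can be scoped to the parent element instead of the whole document. This is only a sketch under assumptions: it uses Beautiful Soup 4 rather than the BeautifulSoup 3 import in the question, it replaces the Ghost page load with an inline HTML string, and it assumes the outer div carries id="parentdiv" as the question's lookup implies (the header div is a made-up sibling added to show the scoping).

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for ghost.content.
html = """<html><body>
<div id="header">ignore me</div>
<div id="parentdiv">
<div>1</div>
<div>2</div>
3
<div>4</div>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the container first, as the question's code does.
maindiv = soup.find("div", {"id": "parentdiv"})

# Join every text node under the container in document order.
# Guard against a missing container, since find() returns None then.
outtext = "".join(maindiv.stripped_strings) if maindiv is not None else ""

print(outtext)
```

Iterating over stripped_strings on maindiv replaces the findAll loop entirely, so no tag filter is needed.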

[Comments]:

Thanks, I'll give it a try.