Combine stripping whitespace and HTML tags

【Question title】: Combine stripping white space and HTML tags
【Posted】: 2015-06-18 16:39:57
【Question】:

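I am looking for a way to strip both the HTML tags and the whitespace from text parsed with Beautiful Soup. The problem is that I cannot combine the two.

Here is the whole script: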

# -*- coding: utf-8 -*-

from urllib2 import urlopen
from bs4 import BeautifulSoup as bs

word = "Drop"
url = ('http://civil.ge/eng/category.php?id=10')
soup = bs(urlopen(url).read())
titz = soup.find("div", {"class": "archtype_category_block"})

for t in titz.find_all('div', {'class': 'archive_type_article_title'}):
    if word in t.encode('utf-8').strip():
        print t.prettify()

The result of prettify() is:

<div class="archive_type_article_title">
 Prosecutors Drop Objection to Release of Ex-MoD Officials from Pretrial     Detention
</div>

With get_text() I get clean text, but with a lot of whitespace before and after it. Is there a way around this?

Thanks!

【Question comments】:

Tags: python-2.7 beautifulsoup

【Solution 1】:

I used Python 3 and could not reproduce your spacing problem, so maybe that is an answer in itself!

I would change print t.prettify() to print t.prettify().join(mystring.split()) and see whether that solves your problem.
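The idiom this suggestion seems to be reaching for is to split the text on any run of whitespace and re-join the pieces with a single space, calling join on the separator string rather than on the tag. A minimal sketch, assuming the text has already been extracted (with Beautiful Soup it would come from t.get_text()); the helper name normalize_whitespace is illustrative and not part of the original script:

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    # str.split() with no arguments splits on any whitespace (spaces, tabs,
    # newlines) and discards empty pieces, so leading and trailing
    # whitespace disappears as well.
    return " ".join(text.split())


# Sample text shaped like the get_text() result described in the question.
raw = "\n Prosecutors Drop Objection to Release of Ex-MoD Officials from Pretrial     Detention\n"

clean = normalize_whitespace(raw)
print(clean)
```

In the loop from the question this would become print(normalize_whitespace(t.get_text())), which drops the tags and the extra whitespace in one step. Note that t.get_text(strip=True) only trims the ends of each string, so the run of spaces inside the title would survive it.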

Also, your code only gets the first archtype_category_block. Maybe that is what you want, but if you want all of them, you have to change titz = soup.find("div", {"class": "archtype_category_block"}) to for titz in soup.find_all("div", {"class": "archtype_category_block"}):
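Looping over every block could look roughly like the sketch below. It assumes Python 3 with bs4 installed, and it parses a small inline HTML string instead of fetching the live page; that markup is hypothetical and only imitates the class names from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described in the question.
html = """
<div class="archtype_category_block">
  <div class="archive_type_article_title">
    Prosecutors Drop Objection to Release of Ex-MoD Officials from Pretrial     Detention
  </div>
</div>
<div class="archtype_category_block">
  <div class="archive_type_article_title">
    Court Hears   Appeal
  </div>
  <div class="archive_type_article_title">
    Charges Drop   in Second Case
  </div>
</div>
"""

word = "Drop"
soup = BeautifulSoup(html, "html.parser")

titles = []
# find_all() returns every matching block, not just the first one.
for block in soup.find_all("div", {"class": "archtype_category_block"}):
    for t in block.find_all("div", {"class": "archive_type_article_title"}):
        # get_text() removes the tags; split/join collapses the whitespace.
        text = " ".join(t.get_text().split())
        if word in text:
            titles.append(text)

for title in titles:
    print(title)
```

Testing the word against the cleaned text rather than against the encoded tag also keeps a match inside an attribute or tag name from counting as a hit.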

【Comments】:

Thanks for your answer. join() produces TypeError: 'NoneType' object is not callable.
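One plausible cause of this error, assuming join was called on the Beautiful Soup tag itself rather than on a string: attribute access on a Tag is shorthand for find(), so t.join looks for a child join element, finds none, and returns None. Calling None then raises the TypeError. A small sketch with hypothetical markup, assuming Python 3 with bs4 installed:

```python
from bs4 import BeautifulSoup

# Hypothetical one-element document, just enough to reproduce the behaviour.
soup = BeautifulSoup(
    '<div class="archive_type_article_title"> Some   title </div>',
    "html.parser",
)
t = soup.div

# There is no <join> child element, so the attribute lookup yields None.
print(t.join)

try:
    t.join(["a", "b"])  # effectively None(["a", "b"])
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Calling join on a separator string behaves as intended.
clean = " ".join(t.get_text().split())
print(clean)
```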