How to get text tokens when using BeautifulSoup: answers

【Question title】: How to get the text tokens when using BeautifulSoup
【Posted】: 2017-09-05 14:52:39
【Question description】:

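I am new to text mining and am working on a toy project that scrapes text from a website and splits it into tokens. However, after downloading the content with BeautifulSoup, splitting it with the .split method fails. The code is as follows: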

# -*- coding: utf-8 -*-
import nltk
import operator
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
url= 'http://python.org/'
response = http.request('GET',url)
# nltk.clean_html has been dropped by NLTK
clean = BeautifulSoup(response.data,"html5lib")
# clean is expected to hold the whole page text with the HTML noise removed
tokens = [tok for tok in clean.split()]
print tokens[:100]

Python tells me:

TypeError: 'NoneType' object is not callable

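According to an earlier Stack Overflow question, this is because:

clean is not a string, it is a bs4.element.Tag. When you try to look up split on it, it does its magic and tries to find a child element named split, but there is none, so the lookup returns None. You are then calling that None.

In that case, how should I adjust my code to get the tokens? Thanks.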
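
The lookup behaviour described in the quoted explanation can be reproduced without BeautifulSoup. The sketch below is only an illustration: FakeTag is a hypothetical stand-in class, not part of bs4, that treats unknown attribute names as child lookups and returns None when nothing is found. It shows where the TypeError comes from and why splitting the extracted text works.

```python
class FakeTag(object):
    """Hypothetical stand-in for a parsed document; not part of bs4."""

    def __init__(self, text, children=None):
        self.text = text
        self.children = children or {}

    def __getattr__(self, name):
        # Runs only when normal attribute lookup fails. An unknown name
        # such as "split" is treated as a child lookup that finds nothing,
        # so None is returned instead of raising AttributeError.
        return self.children.get(name)

    def get_text(self):
        return self.text


clean = FakeTag("Welcome to Python.org")

lookup = clean.split  # no child named "split", so this is None

try:
    clean.split()  # equivalent to calling None()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Splitting the extracted string, rather than the object, works.
tokens = clean.get_text().split()

print(error_message)
print(tokens)
```

With a real BeautifulSoup object, the corresponding step is to call get_text() first and split the resulting string.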

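【Question comments】:

It seems to me that you have hardly read the BeautifulSoup documentation: crummy.com/software/BeautifulSoup/bs4/doc. There is no single method that extracts tokens from a page in a useful way. Each page has to be studied individually.
Possible duplicate of BeautifulSoup Grab Visible Webpage Text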

Tags: python python-2.7 web-scraping beautifulsoup

【Solution 1】:

You can use get_text() to return only the text in the HTML and pass it to NLTK's word_tokenize(), as follows:

from bs4 import BeautifulSoup
import requests
import nltk

response = requests.get('http://python.org/').content
soup = BeautifulSoup(response, "html.parser")
text_tokens = nltk.tokenize.word_tokenize(soup.get_text())

print text_tokens

(You could also use urllib3 to fetch the data.)

This gives output that begins with:

[u'Welcome', u'to', u'Python.org', u'{', u'``', u'@', u'context', u"''", u':'...

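If you are only interested in words, you can filter the returned list further to remove entries that consist only of punctuation, for example: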

import re, string  # both are needed by the filter below
text_tokens = [t for t in text_tokens if not re.match('[' + re.escape(string.punctuation) + ']+$', t)]
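
A small sketch of this kind of punctuation filter, run on a hand-written token list rather than on live page output. The names sample_tokens, starts_with_punct and only_punct are illustrative. It compares a pattern without an end anchor against one ending in '$', since re.match only anchors at the start of the token and the difference shows up for tokens such as "'s".

```python
import re
import string

# Hand-written tokens, shaped like typical word_tokenize() output.
sample_tokens = ['Welcome', 'to', 'Python.org', '{', '``', "'s", ':']

# re.escape() keeps characters such as ']' and '\' literal inside the class.
punct_class = '[' + re.escape(string.punctuation) + ']+'

# Without '$': matches any token that merely starts with punctuation.
starts_with_punct = re.compile(punct_class)
# With '$': matches only tokens made up entirely of punctuation.
only_punct = re.compile(punct_class + '$')

kept_unanchored = [t for t in sample_tokens if not starts_with_punct.match(t)]
kept_anchored = [t for t in sample_tokens if not only_punct.match(t)]

print(kept_unanchored)
print(kept_anchored)
```

The anchored pattern keeps "'s" while still dropping '{', '``' and ':'. Which behaviour is preferable depends on whether such clitic tokens count as words for the task at hand.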

【Comments】: