Python: parsing only text from HTML using bs4 and RegEx (answers)

【Question title】: Python: parsing only text from HTML using bs4 and RegEx
【Posted】: 2016-04-05 21:14:34
【Question】:

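I am building a Python 3 web crawler/scraper with bs4. Some parts of it need regular expressions. I only want to scrape the text content. How should I parse something like this: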

<p> This is blah blah
<a class="wordpresslink" href="https://wordpress.com/" rel="generator nofollow">WordPress.com</a>
<a href="http://www.whatever.com/"><span class="s1">Example</span></a>
Like blah blah
</p>

The output I want is:

This is blah blah WordPress.com Example Like blah blah

My code so far:

import urllib.request
from bs4 import BeautifulSoup

u = 'https://en.wikipedia.org/wiki/Adivasi'
r = urllib.request.urlopen(u)
soup = BeautifulSoup(r.read(), 'html.parser')

res = [i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
for p in res:
    print(p)

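【Question comments】:

Tags: python html regex python-3.x beautifulsoup

【Solution 1】:

Use the BeautifulSoup parser to parse the HTML. In the session below, s holds the HTML string from the question.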

>>> soup = BeautifulSoup(s)
>>> soup.find('p').text
u' This is blah blah\nWordPress.com\nExample\nLike blah blah\n'
>>> soup.find('p').text.replace('\n', ' ').strip()
u'This is blah blah WordPress.com Example Like blah blah'

If there is more than one paragraph, use find_all:

[i.text.replace('\n', ' ').strip() for i in soup.find_all('p')]
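
As a self-contained illustration of the approach above, here is a minimal sketch that runs against the sample markup from the question. The helper name paragraph_texts and the HTML constant are made up for this example, and it swaps the replace('\n', ' ') call for re.sub(r'\s+', ' ', ...) so that runs of consecutive newlines, tabs or spaces collapse to a single space. It also passes 'html.parser' explicitly, as the question's code does.

```python
import re

from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """<p> This is blah blah
<a class="wordpresslink" href="https://wordpress.com/" rel="generator nofollow">WordPress.com</a>
<a href="http://www.whatever.com/"><span class="s1">Example</span></a>
Like blah blah
</p>"""


def paragraph_texts(html):
    """Return the text of every <p> element with whitespace runs collapsed.

    Hypothetical helper, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    # \s+ matches any run of spaces, tabs and newlines, so several
    # newlines in a row become one space as well.
    return [
        re.sub(r"\s+", " ", p.get_text()).strip()
        for p in soup.find_all("p")
    ]


if __name__ == "__main__":
    for text in paragraph_texts(HTML):
        print(text)
```

For the sample above this should print the single line asked for in the question. A regex-free alternative with the same effect is ' '.join(p.get_text().split()).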

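【Comments】:

What if I have multiple paragraphs?
@MagicManSuperMan: Use soup.find_all() and a for loop instead. I suggest reading the documentation and trying it out before posting a question...
@AvinashRaj: Well, since there may be several \n characters in a row, I think re.sub('\n+', ' ', i.text).strip() would be better. (Or use str.splitlines() followed by str.join().)
Well, I did try, @KevinGuan, but got nowhere.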
I am trying to scrape this page: link. It contains text in a different script (आदिवासी), and my code fails there: File "C:/Users/James/Desktop/crawler1.py", line 16, in <module> print(p) File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode characters in position 16-22: character maps to <undefined>
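
The traceback in the last comment comes from print(), not from the parsing. The console encoding there is cp1252, which has no mapping for the Devanagari characters. The thread does not record a fix, so the following is only one possible workaround, sketched under that assumption: escape whatever the console encoding cannot represent before printing. The helper name to_console_safe is hypothetical.

```python
import sys


def to_console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks escaped.

    Hypothetical helper: defaults to the encoding of sys.stdout and
    falls back to UTF-8 when that is not available.
    """
    encoding = encoding or sys.stdout.encoding or "utf-8"
    # backslashreplace turns unencodable characters into \uXXXX escapes
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


if __name__ == "__main__":
    # Force cp1252 here to mimic the console from the traceback.
    print(to_console_safe("Adivasi: आदिवासी", encoding="cp1252"))
```

This keeps the script running but loses the readable characters. To keep the original text, other options are writing the results to a file opened with encoding='utf-8', or setting the PYTHONIOENCODING environment variable to utf-8 before running the script, provided the terminal can display UTF-8.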