Extract text from HTML in the same format as in the HTML using BeautifulSoup

【Question title】: Extract text from HTML in the same format as in the HTML using BeautifulSoup
【Posted】: 2020-08-26 11:21:06
【Question description】:

Code:

from bs4 import BeautifulSoup
body_text = BeautifulSoup(open(html)).text

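In the HTML page, a line such as "1. ETA basis expected metocean conditions" gets split across several lines when the text is extracted. This is the problem I need to fix.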

I tried string manipulation along these lines:

body_html = str(BeautifulSoup(open(html_file)))
body_html = body_html.replace('\n', ' ')  # remove all newlines
body_html = body_html.replace('/>', '/>\n')  # add newlines so that text from two different tags is not extracted onto the same line

Sample HTML page: https://easyupload.io/sh02xi

Is there a better way to extract the text in the same format as it appears when the HTML is rendered?

【Comments】:

Try soup = BeautifulSoup(open(html_file)) followed by html = soup.prettify()

Tags: python beautifulsoup

【Solution 1】:

html2text is better suited to this kind of extraction, where the text needs to keep the same layout as in the HTML:

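import html2text

h = html2text.HTML2Text()
h.body_width = 0         # 0 disables line wrapping, so long lines stay on one line
h.ignore_links = True    # leave out link targets
h.ignore_images = True   # leave out images
print(h.handle(html_string))  # html_string: the page's HTML source as a str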
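
If installing html2text is not an option, the same idea (collapse the newlines that sit inside the source text, and break lines only where a block-level tag starts or ends) can be roughly approximated with the standard library's html.parser. This is only a sketch and not what html2text does internally: the names BLOCK_TAGS, VisibleTextExtractor and html_to_lines are made up for illustration, the set of block-level tags is an assumption to adjust for the real page, and the sample string is invented rather than taken from the linked page.

```python
from html.parser import HTMLParser

# Tags treated as line-breaking. This set is an assumption; extend it as needed.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class VisibleTextExtractor(HTMLParser):
    """Collects text, starting a new line only at block-level tags."""

    def __init__(self):
        super().__init__()
        self.lines = []    # completed output lines
        self.current = []  # words of the line currently being built

    def _flush(self):
        # Finish the current line if it contains any words.
        if self.current:
            self.lines.append(" ".join(self.current))
            self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # split() drops source newlines and runs of spaces inside text nodes.
        self.current.extend(data.split())

    def close(self):
        super().close()
        self._flush()  # emit any text left after the last tag


def html_to_lines(html_string):
    parser = VisibleTextExtractor()
    parser.feed(html_string)
    parser.close()
    return parser.lines


# Invented sample: the first paragraph is broken across lines in the source.
sample = "<p>1. ETA basis\nexpected <b>metocean</b>\nconditions</p><p>2. Next item</p>"
print("\n".join(html_to_lines(sample)))
```

Known limitations of this sketch: words are always joined with a single space, so punctuation that directly follows an inline tag picks up an extra space, and the contents of script and style elements are not filtered out.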

【Comments】: