How to convert HTML to text in Python? Answers

【Question title】: How to convert HTML to text in Python?
【Posted】: 2019-11-16 20:59:10
【Question】:

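I know there are many answers to this question, but a lot of them are outdated, and when I do find one that "works", it does not work well enough.

This is my current code: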

import requests
from bs4 import BeautifulSoup

url = "http://example.com"

req = requests.get(url)


html = req.text


PlainText = BeautifulSoup(html, 'lxml')
print (PlainText.get_text())

This is the output I get:


Example Domain




    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }




Example Domain
This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
More information...

This is the output I want:

Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

More information...

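How can I print only the readable text from a website?

【Comments】:

What do you mean by plain text?
get_text() returns a string. What return type are you looking for?
Prayson W. Daniel: I mean the text that is displayed on the page, not all the other kinds of content. For example, when I run this script on a Wikipedia page, this is one of the sentences in the output: "Indian cultural history spans more than 4,500 years.[346] During the Vedic period (c. 1700 – c. 500 BCE)", which is not plain text.
Your question is still unclear. Can you explain what you mean by plain text? Can you show us an example of what the HTML looks like and what you want as output?
Using Soup.text will not give you plain text. Instead, you get all the tags and content details that Beautiful Soup retrieves. What you need to do is find the elements you are looking for and print their text.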

Tags: python html text beautifulsoup

【Solution 1】:

Something like this should work, as long as the "plain text" part does not contain the character "}".

import requests
from bs4 import BeautifulSoup

url = "http://example.com"

req = requests.get(url)


html = req.text


PlainText = BeautifulSoup(html, 'lxml')

text = PlainText.get_text()
# Keep only what follows the last '}', i.e. the text after the CSS block
split = text.split('}')
withoutCss = split[-1]

print(withoutCss)

【Comments】:

Yes, it works, but not for other links. I tried a Wikipedia link and it failed. I just got this output: );
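
Splitting on '}' breaks as soon as the page text itself contains that character, or a script follows the stylesheet, which may be what happened with the Wikipedia page. A possible alternative, sketched here under the assumption that the unwanted text comes from <style>, <script>, and <title> elements, is to delete those elements from the parsed tree before collecting the text. SAMPLE_HTML and visible_text are illustrative names, not part of the original post; the inline sample stands in for req.text so the sketch runs without network access, and the built-in 'html.parser' is used instead of 'lxml' only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded page (an assumption made so that the
# sketch runs offline); in the question this would be req.text.
SAMPLE_HTML = """<html>
<head>
<title>Example Domain</title>
<style>body { margin: 0; } div { width: 600px; }</style>
<script>console.log("not visible");</script>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>"""


def visible_text(html):
    """Return the text of html without <script>, <style>, or <title> content."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove the elements whose text a browser does not render in the page body.
    for tag in soup(["script", "style", "title"]):
        tag.decompose()
    # Collapse runs of whitespace inside each text node; one node per line.
    pieces = (" ".join(s.split()) for s in soup.stripped_strings)
    return "\n".join(pieces)


print(visible_text(SAMPLE_HTML))
```

Because each text node becomes its own line, inline markup such as <b> inside a sentence would split that sentence across lines; joining the pieces with a space instead is a reasonable variation.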

【Solution 2】:

Here is a Python program that uses a function to remove everything between "<" and ">" and return only the text outside the tags.

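def striphtmltags(s):
    """Return s with everything between '<' and '>' removed."""
    outside_tag = True
    result = ''
    for ch in s:
        if ch == '<':
            outside_tag = False
        if outside_tag:
            result += ch
        if ch == '>':
            outside_tag = True
    return result.strip()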

html="<html><body><h1>this is the header</h1>this is the main body<font color=blue>this is blue</font><h6>this is the footer</h6></body></html>"
text=striphtmltags(html)

print("text:", text)

This produces:

text: this is the headerthis is the main bodythis is bluethis is the footer
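
Note that the output above runs the pieces together, and a character-level filter like this would also keep the contents of <style> and <script> blocks. A minimal sketch of a standard-library alternative, assuming html.parser.HTMLParser is acceptable here: TextExtractor and html_to_text are hypothetical names, and the sample string is the one above with a <style> block added for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse whitespace and keep only non-empty text outside skipped tags.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(markup):
    """Return the text nodes of markup, one per line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.chunks)


sample = ("<html><head><style>h1 { color: red; }</style></head><body>"
          "<h1>this is the header</h1>this is the main body"
          "<font color=blue>this is blue</font><h6>this is the footer</h6>"
          "</body></html>")
print(html_to_text(sample))
```

Putting each text node on its own line is a design choice made for readability; joining with a space would keep inline elements such as <font> on the same line as the surrounding text.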

【Comments】: