Beautiful Soup HTML parsing anomaly (answers)

【Question title】: Beautiful Soup HTML parsing anomaly
【Posted】: 2020-03-13 07:34:30
【Question description】:

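I am trying to use Beautiful Soup to extract the text from a certain class in an HTML page. I get the text successfully, but it contains some anomalies (unrecognized characters). How can I fix this in Python code instead of removing the anomalies by hand?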

Code:

    import requests
    from bs4 import BeautifulSoup as BS

    try:
        html = requests.get(url)
    except requests.RequestException:
        print("no connection")
    try:
        soup = BS(html.text, 'html.parser')
    except Exception:
        print("parse error")
    print(soup.find('div', {'class': '_3WlLe clearfix'}).get_text())

【Comments on the question】:

This is an encoding error. html.text is most likely inferring the wrong encoding. What is the URL?
@GordonAitchJay timesofindia.com/india/…

Tags: html python-3.x web-scraping

【Solution 1】:

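When you access html.text, Requests tries to determine the character encoding so that it can correctly decode the raw bytes received from the server. The content-type header sent by timesofindia is text/html; charset=iso-8859-1, so that is the encoding Requests uses. The actual character encoding is almost certainly utf-8.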
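
To see the mismatch without making a network request, here is a minimal sketch that uses only the standard library. The sample sentence and the variable names are made up for illustration; the header value is the one quoted above. It reads the charset from the header, then decodes the same UTF-8 bytes both ways:

```python
from email.message import Message

# Header value quoted in the answer above.
header = Message()
header["Content-Type"] = "text/html; charset=iso-8859-1"
declared = header.get_content_charset()  # charset parameter, lowercased

# Hypothetical sample text standing in for the article body.
original = "India’s “new” policy"
raw = original.encode("utf-8")  # the bytes, assuming the page is really UTF-8

garbled = raw.decode(declared)  # result when the header is trusted
correct = raw.decode("utf-8")   # result with the real encoding

print(garbled == original)  # False
print(correct == original)  # True
```

Each non-ASCII character in the sample becomes several junk characters in garbled, which is the usual symptom of this kind of mismatch.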

You can fix this by setting the encoding of html to utf-8 before you access html.text:

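    import requests
    from bs4 import BeautifulSoup as BS

    try:
        html = requests.get(url)
        # Override the charset taken from the content-type header
        html.encoding = 'utf-8'
    except requests.RequestException:
        print("no connection")
        raise  # html is undefined past this point, so stop here
    try:
        soup = BS(html.text, 'html.parser')
    except Exception:
        print("parse error")
        raise
    print(soup.find('div', {'class': '_3WlLe clearfix'}).get_text())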

Or decode html.content as utf-8 and pass the result to BS instead of html.text:

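    import requests
    from bs4 import BeautifulSoup as BS

    try:
        html = requests.get(url)
    except requests.RequestException:
        print("no connection")
        raise  # html is undefined past this point, so stop here
    try:
        # html.content is the raw bytes; decode them explicitly as UTF-8
        soup = BS(html.content.decode('utf-8'), 'html.parser')
    except Exception:
        print("parse error")
        raise
    print(soup.find('div', {'class': '_3WlLe clearfix'}).get_text())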
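
If you would rather not hard-code utf-8 for every site, one option is to try a strict UTF-8 decode first and fall back to the declared charset only when that fails. The sketch below is an illustration under that assumption, not part of Requests or Beautiful Soup: decode_body is a hypothetical helper, and raw stands for the bytes in html.content.

```python
from typing import Optional


def decode_body(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode response bytes, preferring UTF-8 (hypothetical helper)."""
    try:
        # Strict decoding raises on byte sequences that are not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        try:
            # Fall back to the charset the server declared, if any.
            return raw.decode(declared or "iso-8859-1", errors="replace")
        except LookupError:
            # Unknown codec name: ISO-8859-1 maps every byte to a
            # character, so this last resort cannot fail.
            return raw.decode("iso-8859-1")
```

You would then pass decode_body(html.content, html.encoding) to BS instead of html.text. This works because text in a single-byte encoding that contains non-ASCII characters is rarely valid UTF-8 by accident, though it is a heuristic rather than a guarantee.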

I strongly recommend learning about character encodings and Unicode. It is easy to get tripped up by them. We have all been there.

Characters, Symbols and the Unicode Miracle - Computerphile, by Tom Scott and Sean Riley

What every programmer absolutely, positively needs to know about encodings and character sets to work with text, by David C. Zentgraf

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky
