Cannot scrape web page properly because of encoding issues

【Question title】: Cannot scrape web page properly because of encoding issues
【Posted】: 2020-11-11 14:15:12
【Question】:

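Although I set the encoding so that Turkish characters are detected, this web page is not captured and displayed correctly. The same code works for all other similar pages, which use the same charset and sit under the same domain. I don't understand why this happens. Any ideas? Thanks in advance!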

For example, I get:

Bilgisayar MühendisliÄŸi BÃ¶lÃ¼mü

instead of:

Bilgisayar Mühendisliği Bölümü

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

url = "http://bmb.osmaniye.edu.tr/personel-akademik"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser', from_encoding="utf-8")

print(soup.original_encoding)
print(soup)

Output:

windows-1252
<!DOCTYPE html>

<html lang="en"><head>
<title>Osmaniye Korkut Ata Ãœniversitesi - Bilgisayar MÃ¼hendisliÄŸi BÃ¶lÃ¼mÃ¼</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<!-------------<meta http-equiv="Content-Type" content="text/html; charset=windows-1254" />---------------->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="" name="google-site-verification">
<meta content=",Bilgisayar MÃ¼hendisliÄŸi BÃ¶lÃ¼mÃ¼" name="keywords"/>

【Comments】:

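Would soup = BeautifulSoup(page.content.decode('utf-8','backslashreplace'), 'html.parser') help?
@JosefZ It does work very well, but I don't understand why! Could you point me to some documentation on what is going on here? Also, why does setting the encoding to utf-8 in my code not work? And since I assume you are not familiar with the language in question, how did you spot this? Was it by comparing the ASCII values of the two outputs? Thanks again.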
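
On the "how did you spot this" question: garbled sequences such as "Ã¼" and "ÄŸ" are the typical result of UTF-8 bytes being read as windows-1252, which matches the windows-1252 value printed for soup.original_encoding. A minimal sketch of that check, using a hard-coded sample string rather than the live page (the variable names are illustrative only):

```python
# Sample text taken from the expected output in the question.
correct = "Mühendisliği Bölümü"

# Simulate the bug: encode as UTF-8, then read the bytes as windows-1252.
garbled = correct.encode("utf-8").decode("windows-1252")
print(garbled)

# Reverse the mistake: turn the garbled text back into bytes, then decode as UTF-8.
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired)
```

If the round trip restores the Turkish characters, the page bytes were UTF-8 all along and only the decoding step went wrong.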

Tags: python python-3.x beautifulsoup encoding request

【Solution 1】:

For your future web-scraping work, you might want to try this first:

page.encoding = page.apparent_encoding

Alternatively, as suggested in the comments, decode the raw bytes yourself with the backslashreplace error handler.

For example:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://bmb.osmaniye.edu.tr/personel-akademik")
# Decode the bytes ourselves; bytes that are not valid UTF-8 become \xNN escapes instead of raising.
html = page.content.decode("utf-8", "backslashreplace")
title = BeautifulSoup(html, 'html.parser').find("title").getText(strip=True)
print(title)

This gives you:

Osmaniye Korkut Ata Üniversitesi - Bilgisayar Mühendisliği Bölümü
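
As for why this works while from_encoding="utf-8" does not: one plausible explanation, which this thread does not confirm, is that the page is almost entirely UTF-8 but contains a few bytes that are not valid UTF-8. A strict UTF-8 decode then fails on the whole document, and BeautifulSoup falls back to windows-1252, as the printed original_encoding suggests. The sketch below reproduces that situation offline with a made-up byte string; the stray b"\xfd" byte is an assumption for illustration, not something taken from the real page.

```python
# Hypothetical sample: valid UTF-8 text followed by one stray byte that is not valid UTF-8.
raw = "Mühendisliği Bölümü".encode("utf-8") + b"\xfd"

# A strict decode rejects the whole input because of the single bad byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Decoding as windows-1252 never fails on these bytes, but garbles every non-ASCII character.
fallback = raw.decode("windows-1252")

# backslashreplace keeps the valid UTF-8 intact and escapes only the offending byte.
tolerant = raw.decode("utf-8", "backslashreplace")

print(strict_ok)
print(fallback)
print(tolerant)
```

Because the decoded string is then passed to BeautifulSoup as text rather than bytes, no encoding detection is needed and the whole document, not just the title, keeps its characters.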

【Comments】:

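The solution suggested by @JosefZ does work, but the .encoding solution you offered does not. By the way, I had already tried it before asking here. Also, the goal is not just to read the page title; I need to scrape the whole content with the correct characters.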