【Question title】: BeautifulSoup cannot read file despite correct charset
【Posted】: 2021-11-12 11:34:25
【Question】:

I am trying to open a file that declares utf-8 in its meta tag with BeautifulSoup, passing utf-8 as the encoding, but I get a decoding error:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(filename), "html.parser", from_encoding="utf-8")

File header:

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Logs
  </title>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>

Error:

$ python3.6 dom.py
Traceback (most recent call last):
  File "dom.py", line 56, in <module>
    soup = BeautifulSoup(open(filename), "html.parser", from_encoding="utf-8")
  File "/usr/local/lib/python3.6/site-packages/bs4/__init__.py", line 309, in __init__
    markup = markup.read()
  File "/usr/local/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 152902: ordinal not in range(128)

How should I go about debugging this? Thanks.

【Comments】:

  • Try: soup = BeautifulSoup(open(filename, "r", encoding="utf-8").read(), "html.parser")

Tags: python beautifulsoup encoding


【Solution 1】:

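You are not opening the file correctly. open(filename) without an encoding argument decodes the file with the locale's default encoding, which the traceback shows is ASCII here, so read() fails on the first non-ASCII byte before BeautifulSoup parses anything. The from_encoding argument does not change how the file object decodes its contents.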
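The failure can be reproduced without BeautifulSoup at all. The sketch below is only an illustration: the sample markup and the temporary file are made up, and the ASCII default is simulated by passing encoding="ascii" explicitly so that it behaves the same on any machine.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when no encoding argument is given.
# It depends on the locale of the machine running the script.
default_encoding = locale.getpreferredencoding(False)

# Hypothetical sample markup: a non-breaking space (U+00A0) is stored as
# the two bytes 0xC2 0xA0 in UTF-8, the same lead byte as in the traceback.
html = '<meta charset="utf-8"/><p>a\u00a0b</p>'
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(html.encode("utf-8"))
    path = tmp.name

try:
    # Simulate an ASCII default: read() raises on the first non-ASCII byte.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        ascii_error = None
    except UnicodeDecodeError as exc:
        ascii_error = exc

    # Stating the encoding explicitly makes the same file readable.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(path)

print("locale default:", default_encoding)
print("ascii read failed:", ascii_error)
```

Printing locale.getpreferredencoding(False) on the affected machine is a quick way to check whether the default really is ASCII there.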

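from bs4 import BeautifulSoup

# Decode the file as UTF-8 while reading. BeautifulSoup then receives text,
# so from_encoding is dropped (it only applies to bytes input).
with open(filename, "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

Or read the contents into a string first:

from bs4 import BeautifulSoup

# read() returns a str, which has no close(); the with block closes the file.
with open(filename, "r", encoding="utf-8") as f:
    markup = f.read()
soup = BeautifulSoup(markup, "html.parser")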
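As for how to debug an error like this, one option is to read the file as raw bytes and look at what sits at the reported position. The helper below is a minimal sketch, not part of bs4: find_decode_error, its parameters, and the sample file are all hypothetical names introduced for illustration.

```python
import os
import tempfile


def find_decode_error(path, encoding="utf-8", context=10):
    """Return None if the file decodes cleanly, else (offset, nearby bytes)."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset of the first undecodable byte.
        start = max(exc.start - context, 0)
        return exc.start, raw[start:exc.start + context]
    return None


# Hypothetical sample file containing one non-ASCII character.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write("<p>caf\u00e9</p>".encode("utf-8"))
    sample_path = tmp.name

try:
    as_ascii = find_decode_error(sample_path, encoding="ascii")
    as_utf8 = find_decode_error(sample_path, encoding="utf-8")
finally:
    os.remove(sample_path)

print("ascii:", as_ascii)
print("utf-8:", as_utf8)
```

If the file decodes cleanly as UTF-8 but not as ASCII, the file itself is fine and the problem is the encoding used to open it.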

【Comments】:
