How can I open HTML files as UTF-8 in Python?

【Question title】: How can I open HTML files as UTF-8 in Python?
【Posted】: 2022-01-10 14:36:35
【Question】:

I am trying to open files as UTF-8 in Python. I have a list of HTML file paths, and the code that builds the list works:

import glob
import os

def get_all_htmls(directory_path):
    # Lazily yield the path of every .html file directly inside directory_path
    return glob.iglob(os.path.join(directory_path, '*.html'))

directory_path = r'C:\Users\astar\Project\Articles\Articles'
links = []
for html_path in get_all_htmls(directory_path):
    links.append(html_path)

However, in this code:

for link in links:
    f=codecs.open(r'link','r','utf-8')
    document= BeautifulSoup(f)

This does not work for all of the HTML files. What can I do?

【Comments】:

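Python 3 strings are Unicode, and open already defaults to UTF-8 on most systems (the default is locale-dependent, so on Windows you may need to pass encoding='utf-8' explicitly). Beyond that, you don't need to do anything special to read UTF-8 files. If reading still fails, the file is not UTF-8.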
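
A minimal sketch of reading a file with an explicit encoding, so the result does not depend on the platform default. The helper name read_html_utf8 and the sample file are hypothetical and exist only for illustration:

```python
import locale
import tempfile
from pathlib import Path


def read_html_utf8(path):
    """Read a file as text, decoding it explicitly as UTF-8."""
    # Passing encoding= avoids relying on the locale-dependent default.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# The encoding open() would fall back to without encoding= (varies by platform).
print("Default encoding:", locale.getpreferredencoding(False))

# Hypothetical sample file, written only to demonstrate the round trip.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.html"
    text = "<p>שלום</p>"
    sample.write_text(text, encoding="utf-8")
    assert read_html_utf8(sample) == text
```

If the file is not valid UTF-8, the read raises UnicodeDecodeError instead of silently producing garbled text.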

Tags: python python-3.x encoding

【Solution 1】:

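If it works for some of your files but not all of them, then some files are correctly encoded as UTF-8 while others probably use a different encoding (for example ISO-8859-8, used for Hebrew). You don't say what goes wrong, which makes it hard to give an exact answer in code, but if you are getting a UnicodeDecodeError exception on that call, you can write a loop that tries each plausible encoding until one succeeds: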

import codecs
from bs4 import BeautifulSoup

for link in links:
    for encoding in ("utf-8", "iso-8859-8", "latin-1"):
        try:
            # Open with the encoding under test, not a hard-coded 'utf-8'
            with codecs.open(link, 'r', encoding) as f:
                document = BeautifulSoup(f)
        except UnicodeDecodeError:
            print(f"{encoding} failed for {link}, trying next encoding")
        else:
            print(f"Successfully read {link} as an {encoding} file")
            break
    else:  # for-level else, entered only if no "break" statement was executed,
           # i.e. if no codec worked (although latin-1, in particular, always succeeds)
        print(f"Could not correctly read {link} with any of the available encodings, skipping file")
        continue
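
A variation on the loop above, as a sketch: read the raw bytes once and try each candidate decoding in memory, which keeps file handling separate from parsing. The helper name decode_with_fallback and the default candidate list are assumptions for illustration. latin-1 is placed last because it maps every byte to a character and so never fails, even when it is the wrong guess.

```python
from pathlib import Path


def decode_with_fallback(path, encodings=("utf-8", "iso-8859-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes the file.

    Raises ValueError if none of the candidate encodings works.
    """
    raw = Path(path).read_bytes()  # read once, decode in memory
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"could not decode {path} with any of {encodings}")
```

The returned text can then be handed to the parser, for example BeautifulSoup(text), and the returned encoding name tells you which files are not UTF-8.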

【Comments】: