How to fix a wrong UTF character in string assignment? Answers

【Question title】: How to fix a wrong UTF character in string assignment?
【Posted】: 2019-08-29 19:00:26
【Question】:

I have a Python 3.6 loop that does the following:

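import os

# `path` (the directory to list) and `output` (a dict) are assumed to be
# defined earlier in the script; they are not shown in the question.
childrenList = []

output["text"] = path

for item in sorted(os.listdir(path)):
    contentFullPath = os.path.join(path, item)
    children = {}

    # Only regular files are reported; directories are skipped.
    if os.path.isfile(contentFullPath):
        print("item = "+str(item))
        children["text"] = item
        print("children = "+str(children))
        children["id"] = contentFullPath
        childrenList.append(children)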

I get the following output:

web_1     | item = Obamaééé.mp4
web_1     | children = {'icon': 'jstree-file', 'text': 
            'Obamae\udccc\udc81e\udccc\udc81e\udccc\udc81.mp4'}

Why is the text in the variable named children not displayed correctly?

【Comments on the question】:

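The str() of a container object (such as the dictionary in this example) is built from the repr() of the elements it contains, not from their str(). Both displays are correct; they are just formatted differently.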
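A minimal sketch of the point made in the comment above: the `str()` of a dict embeds the `repr()` of each value, and `repr()` escapes unprintable code points such as lone surrogates. The sample file name is a shortened, hypothetical stand-in for the one in the question.

```python
# Hypothetical sample: a file name holding two lone surrogates, similar to
# the value shown in the question's output (shortened for illustration).
item = 'Obamae\udccc\udc81.mp4'
children = {'text': item}

# str() of a container formats each element with repr(), not str(),
# so the surrogates appear as backslash escapes inside the dict's text.
container_text = str(children)
element_repr = repr(item)

print(container_text)
print(element_repr)
```

Both printed lines are plain ASCII, because the escaping happens in `repr()` before anything is written to the terminal.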
But then why does the following raise an error? children["text"] = item.encode().decode() gives UnicodeEncodeError: 'utf-8' codec can't encode character '\udccc' in position 8: surrogates not allowed
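The error in the comment above can be reproduced in isolation. `os.listdir` decodes file names with the `surrogateescape` error handler, so bytes that cannot be decoded in the file system encoding come back as lone surrogates in the range U+DC80 to U+DCFF, and a plain `.encode()` refuses them. The sketch below assumes that the underlying bytes are valid UTF-8 (here `0xCC 0x81`, the combining acute accent) and that the process ran under a non-UTF-8 locale; the helper name `recover_utf8_name` and the shortened sample name are hypothetical.

```python
def recover_utf8_name(name: str) -> str:
    """Turn surrogate-escaped bytes back into text, assuming they are UTF-8."""
    # surrogateescape maps each lone surrogate U+DCxx back to the raw byte 0xXX.
    raw = name.encode('utf-8', 'surrogateescape')
    # Decoding is strict on purpose: it fails loudly if the bytes are not UTF-8.
    return raw.decode('utf-8')


# Hypothetical sample mirroring the question's output (shortened).
item = 'Obamae\udccc\udc81.mp4'

error_reason = None
try:
    item.encode()  # default strict handler rejects lone surrogates
except UnicodeEncodeError as exc:
    error_reason = exc.reason

fixed = recover_utf8_name(item)
print(error_reason)
print(ascii(fixed))
```

If the container's locale can be changed, running Python under a UTF-8 locale (for example `LANG=C.UTF-8`) would likely avoid the surrogates in the first place, since the file names would then decode cleanly.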

Tags: python python-3.x utf

【Solution 1】:

import unidecode

text = 'Obamaééé.mp4'
# In Python 3 a str is already Unicode and has no .decode(); pass it directly.
strip_text = unidecode.unidecode(text)
print(strip_text)

[output] : Obamaeee.mp4

Hope this helps.

【Comments】:

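children["text"] = unidecode(item.decode('utf-8')) raises AttributeError: 'str' object has no attribute 'decode'
I think they meant to use encode, but either way this is not a real answer; it strips the diacritics (changing the string), which is a "fix" only in the sense that burning down your house fixes bad wallpaper.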