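【Question title】: UTF-8 conversion to real letter
【Posted】: 2016-06-19 11:32:43
【Question】:

I need help with one of my projects. I am cleaning a large amount of data for a bulk insert into Microsoft SQL. The data is about ten million lines. I wrote a script that extracts only the first 1,000 lines for cleaning, assuming the rest would look the same. I noticed a lot of UTF-8 characters, so I converted them to the closest real characters. But after extracting the first 100,000 lines, I noticed that many more UTF-8 conversions are needed, and I am converting them by hand, which is very exhausting. I would like to know whether there is an easier way to do this than typing everything in manually. Here is my code: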

import re

infile = r"C:\Users\Dave\Desktop\database\page-links_en.txt"
outfile = r"C:\Users\Dave\Desktop\database\Complete\cleanedpagelinks_file.txt"

fin = open(infile)
fout = open(outfile, "w+")

rex = re.compile(r'/([^/>]+)>')

for line in fin:
    #for word in delete_list:
    #    line = line.replace(word, "")
    line = line.replace("%C3%A9","e")
    line = line.replace("%C3%B3","o")
    line = line.replace("%E2%80%93","-")
    line = line.replace("%C3%A6","e")
    line = line.replace("%C3%A8","e")
    line = line.replace("_"," ")
    line = line.replace("%C3%A0","e")
    line = line.replace("%C3%A1","i")
    line = line.replace("%C5%82","l")
    line = line.replace("%C5%84","n")
    line = line.replace("%C3%BF", "y")
    line = line.replace("%C3%BE", "p")
    line = line.replace("%C3%BD", "y")
    line = line.replace("%C3%BC", "u")
    line = line.replace("%C3%BB", "u")
    line = line.replace("%C3%BA", "u")
    line = line.replace("%C3%B9", "o")
    line = line.replace("%C3%B6", "o")
    line = line.replace("%C3%B5", "o")
    line = line.replace("%C3%B4", "o")
    line = line.replace("%C3%B3", "o")
    line = line.replace("%C3%B2", "o")
    line = line.replace("%C3%B1", "n")
    line = line.replace("%C3%B0", "e")
    line = line.replace("%C3%AC", "i")
    line = line.replace("%C3%AD", "i")
    line = line.replace("%C3%AE", "i")
    line = line.replace("%C3%AF", "i")
    line = line.replace("%C3%81","A")
    line = line.replace("%C3%82","A")
    line = line.replace("%C3%83","A")
    line = line.replace("%C3%84","A")
    line = line.replace("%C3%85","A")
    line = line.replace("%C3%86","AE")
    line = line.replace("%C3%87","C")
    line = line.replace("%C3%88","E")
    line = line.replace("%C3%89","E")
    line = line.replace("%C3%8A","E")
    line = line.replace("%C3%8B","E")
    line = line.replace("%C3%8C","I")
    line = line.replace("%C3%8D","I")
    line = line.replace("%C3%8E","I")
    line = line.replace("%C3%8F","I")
    line = line.replace("%C3%90","D")
    line = line.replace("%C3%91","N")
    line = line.replace("%C3%92","O")
    line = line.replace("%C3%93","O")
    line = line.replace("%C3%94","O")
    line = line.replace("%C3%95","O")
    line = line.replace("%C3%96","O")
    line = line.replace("%C3%98","O")
    line = line.replace("%C3%99","U")
    line = line.replace("%C3%9A","U")
    line = line.replace("%C3%9B","U")
    line = line.replace("%C3%9C","U")
    line = line.replace("%C3%9D","Y")
    line = line.replace("%C3%9F","B")
    line = line.replace("%C3%a0","a")
    line = line.replace("%C3%a1","a")
    line = line.replace("%C3%a2","a")
    line = line.replace("%C3%a3","a")
    line = line.replace("%C3%a4","a")
    line = line.replace("%C3%a5","a")
    line = line.replace("%C3%a6","ae")
    line = line.replace("%C3%a7","c")
    line = line.replace("%C3%a8","e")
    line = line.replace("%C3%a9","e")
    line = line.replace("%C3%aa","e")
    line = line.replace("%C3%ab","e")

    match = rex.search(line)
    if match:
        newline = match.group(1)
    else:
        newline = ''
    fout.write(newline + '\n')
fin.close()
fout.close()

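As you can see in my code, I am replacing the escapes with real character values by hand. Here is a sample line from the text file that made me realize more conversions are needed.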

B%E1%BA%A3o %C4%90%E1%BA%A1i

【Comments】:

  • Can you share part of the file for testing? Or try line.decode('utf-8').encode('iso-8859-1')
  • What is the encoding of the input, and what should the encoding of the output be?
  • I tried line.decode('utf-8').encode('iso-8859-1'), but it says AttributeError: 'str' object has no attribute 'decode'. Also, here is a sample of some lines with Unicode escapes that I just extracted: Gotterd%C3%A4mmerung, Gurmukh%C4%AB alphabet, Hez%C3%A2rfen Ahmed Celebi, Hrad%C4%8Dany, Thich Nh%E1%BA%A5t H%E1%BA%A1nh, La%E1%B9%85k%C4%81vat%C4%81ra S%C5%ABtra, Hu%E1%BA%BF. I posted part of the file at pastebin.com/wckCENCc; see lines 89, 132 and 153 of the paste.
  • What is a real letter?
  • By real letter I mean that, for example, c3 bf is 'ÿ', but I just type 'y' @schwobaseggl
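
A note on the two points raised in these comments, as a minimal Python 3 sketch. A `str` is already decoded text, so it has no `.decode()` method, which is where the AttributeError comes from. Each `%XX` escape stands for one byte, and a run of such bytes forms one UTF-8 encoded character. The variable names are illustrative only.

```python
from urllib.parse import unquote, unquote_to_bytes

escaped = "%C3%BF"                 # two percent-escaped bytes: 0xC3 0xBF

raw = unquote_to_bytes(escaped)    # the raw bytes behind the escapes
text = raw.decode("utf-8")         # bytes -> str; only bytes objects have .decode()

print(raw)                         # b'\xc3\xbf'
print(text)                        # ÿ
print(unquote(escaped) == text)    # True: unquote() does both steps, assuming UTF-8
```

So the escapes in the file are not "UTF-8 characters" to be replaced one by one; they are percent-encoded bytes that can be decoded in a single step.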

Tags: python python-3.x utf


【Solution 1】:

You can use unidecode together with urllib.parse.unquote:

In [8]: from unidecode import unidecode

In [9]: from urllib.parse import unquote

In [10]: unidecode(unquote("Gotterd%C3%A4mmerung"))
Out[10]: 'Gotterdammerung'

unidecode transliterates non-ASCII characters into their closest ASCII equivalents.
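
If installing the third-party unidecode package is not an option, a rough standard-library approximation is possible. The following is only a sketch and not a replacement for unidecode: it strips combining marks after Unicode decomposition and relies on a small hand-written table (`EXTRA`, a hypothetical name covering only a handful of letters) for characters that do not decompose, such as Đ or ł. Anything it does not cover is silently dropped.

```python
import unicodedata
from urllib.parse import unquote

# Hypothetical, incomplete table for letters without a Unicode decomposition.
# Extend it with whatever shows up in your own data.
EXTRA = str.maketrans({
    "Đ": "D", "đ": "d", "Ł": "L", "ł": "l", "Ø": "O", "ø": "o",
    "Æ": "AE", "æ": "ae", "ß": "ss", "Þ": "Th", "þ": "th",
    "Ð": "D", "ð": "d", "–": "-",
})

def to_ascii(text):
    """Approximate ASCII transliteration using only the standard library."""
    text = text.translate(EXTRA)
    # NFKD splits 'ä' into 'a' plus a combining diaeresis, and so on.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop anything that is still not ASCII.
    return stripped.encode("ascii", "ignore").decode("ascii")

print(to_ascii(unquote("Gotterd%C3%A4mmerung")))          # Gotterdammerung
print(to_ascii(unquote("B%E1%BA%A3o %C4%90%E1%BA%A1i")))  # Bao Dai
print(to_ascii(unquote("Hrad%C4%8Dany")))                 # Hradcany
```

unidecode covers far more scripts and symbols, so it remains the better choice when it can be installed.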

【Comments】:

    【Solution 2】:

    You can use urllib.parse.unquote. It assumes UTF-8 by default, but if the data also contains URLs encoded with other codecs, you can add some auto-detection:

    from urllib.parse import unquote
    
    def cleanup(url):
        try:
            return unquote(url, errors='strict')
        except UnicodeDecodeError:
            return unquote(url, encoding='latin-1')
    

    B%E1%BA%A3o %C4%90%E1%BA%A1i is the last emperor of Vietnam:

    >>> cleanup('B%E1%BA%A3o %C4%90%E1%BA%A1i')
    'Bảo Đại'
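
To see the fallback branch in action, here is a small self-contained sketch. It repeats the `cleanup` function from this answer and feeds it one UTF-8 escape and one Latin-1 escape; the second input string is made up for illustration.

```python
from urllib.parse import unquote

def cleanup(url):
    try:
        # errors='strict' makes invalid UTF-8 raise instead of being replaced
        return unquote(url, errors='strict')
    except UnicodeDecodeError:
        return unquote(url, encoding='latin-1')

print(cleanup('Gotterd%C3%A4mmerung'))  # valid UTF-8 -> Gotterdämmerung
print(cleanup('Gotterd%E4mmerung'))     # lone 0xE4 is not valid UTF-8 -> Latin-1 -> Gotterdämmerung
```

Note that Latin-1 decoding never fails, so this is a last resort: if the real encoding is something else, the result will be wrong rather than an error.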
    

    If you want to convert these to their ASCII equivalents, you can use unidecode:

    >>> import unidecode
    >>> unidecode.unidecode('Bảo Đại')
    'Bao Dai'
    

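    【Comments】:

      【Solution 3】:

      Thanks everyone, I finally got it working. I had to install the unidecode module, which took me forever to figure out because I kept getting pip and command prompt errors. After installing the package, I added these lines and it worked.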

      line = cleanup(line)
      line = unidecode(line)
      

      Thank you very much for your help!
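
For readers wondering how those two lines fit into the original loop, here is one possible arrangement, offered as a sketch rather than the asker's actual final script. The names `clean_line` and `clean_file` and the `transliterate` parameter are hypothetical, and the sample URL is made up. It extracts the page name first and decodes afterwards, so that a decoded `/` or `>` cannot confuse the regular expression.

```python
import re
from urllib.parse import unquote

LINK_NAME = re.compile(r'/([^/>]+)>')  # same pattern as in the question

def clean_line(line, transliterate=lambda s: s):
    """Return the cleaned page name found in one line, or '' if there is none."""
    match = LINK_NAME.search(line)
    if not match:
        return ''
    name = unquote(match.group(1))   # percent-escapes -> Unicode text (UTF-8 assumed)
    name = name.replace('_', ' ')
    return transliterate(name)       # pass unidecode here for ASCII-only output

def clean_file(src, dst, transliterate=lambda s: s):
    """Stream src line by line so ten million lines never sit in memory at once."""
    with open(src, encoding='utf-8') as fin, \
         open(dst, 'w', encoding='utf-8') as fout:
        for line in fin:
            fout.write(clean_line(line, transliterate) + '\n')

# Made-up sample line in the style of the input file
print(clean_line('<http://example.org/resource/B%E1%BA%A3o_%C4%90%E1%BA%A1i>'))  # Bảo Đại
```

With unidecode installed, `clean_file(infile, outfile, transliterate=unidecode)` would write the ASCII forms instead.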

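      【Comments】:

        【Solution 4】:

        As I understand it, this text is URL-encoded, i.e. the characters are escaped so that they can be passed as parameters to a server.

        Use unquote_plus() from urllib (the snippet is written in Python 2 syntax):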

        import urllib

        s1 = u'B%E1%BA%A3o %E1%BA%A1i'
        print urllib.unquote_plus(s1)
        

        Output:

        Bảo ại
        

        【Comments】:

        • I tried it and got an invalid syntax error; I also tried other approaches, but still got the same error @VarunJoshi
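
The syntax error is most likely because the snippet in this answer uses the Python 2 `print` statement and `urllib.unquote_plus`, while the question is tagged python-3.x. A Python 3 equivalent might look like the sketch below; the sample string here includes the `%C4%90` escape for Đ, which the string in the answer leaves out.

```python
from urllib.parse import unquote, unquote_plus

s1 = 'B%E1%BA%A3o+%C4%90%E1%BA%A1i'

print(unquote_plus(s1))   # Bảo Đại  ('+' is turned into a space as well)
print(unquote(s1))        # Bảo+Đại  (plain unquote leaves '+' alone)
```

`unquote_plus` is meant for form-encoded query strings, where `+` stands for a space; for page names that may contain a literal `+`, plain `unquote` is the safer choice.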