Script cannot detect Chinese characters in the file after cx_freeze

【Question title】: Script cannot detect Chinese characters in the file after cx_freeze
【Posted】: 2015-05-30 09:21:55
【Question description】:

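I have two separate problems, but they are closely related:

1) My Python script copies the original code file to another file and reads it line by line (UTF-8), detects Chinese characters with a regular expression, sends them to Google Translate, and writes the translation after the line that contains the Chinese characters. This works perfectly when I run the script directly under Windows from PyCharm. After converting it to an executable with cx_freeze, however, it still reads the file but does not see any Chinese characters, so nothing is translated. Can you help?

2) The executable does work on some other Windows computers. I found that this depends strongly on the Windows system locale setting: after setting it to Chinese, we could make it work. I tried to change the locale from the script with the locale module, but without success.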
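One plausible reading of the locale dependence described above: when a file is opened in text mode without an explicit encoding, Python decodes it with the encoding reported by `locale.getpreferredencoding()`, which follows the system locale. The sketch below is only an illustration of that effect, written for Python 3 rather than the Python 2 of the post. The helper name `contains_chinese` and the sample line are made up for the example, and `cp1252` stands in for a typical Western Windows code page.

```python
import locale
import re

# Same character range as the regex in the question.
CJK = re.compile(r'[\u4e00-\u9fff]+')


def contains_chinese(raw_bytes, encoding):
    """Decode raw bytes with the given encoding and look for Chinese characters."""
    text = raw_bytes.decode(encoding, errors='replace')
    return CJK.search(text) is not None


# A hypothetical source line, stored on disk as UTF-8 bytes.
raw = 'x = 1  # 中文注释\n'.encode('utf-8')

# The encoding used by default when no encoding argument is passed.
print('locale default:', locale.getpreferredencoding(False))

# Decoded as UTF-8 the characters are found; decoded with a Western
# code page the same bytes turn into unrelated Latin symbols.
found_as_utf8 = contains_chinese(raw, 'utf-8')
found_as_cp1252 = contains_chinese(raw, 'cp1252')
print(found_as_utf8, found_as_cp1252)
```

If the frozen executable ends up decoding with a non-Chinese code page, the regex has nothing to match, which would be consistent with the symptom in 1).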

Here are the code snippets that may help in understanding the problem:

def initialize(self):
    #several imports here
    #several filename operations here

    writefileF = codecs.open(writefile, "w", "utf-8")

    # copy the original to another file with utf-8 encoding (to be safe)
    # note: no encoding is passed for the source, so io.open uses the locale default
    with io.open(self.orig_filename) as source:
        with io.open(readFileN, mode='w', encoding='utf-8') as target:
            try:
                shutil.copyfileobj(source, target)
            except:
                print 'trying single copy file with no metadata.. '
                shutil.copyfile(self.filename, readFileN)
    readFile = codecs.open(readFileN, "r", "utf-8")
    # generator function call
    creategen = self.readfilebylines(readFile)
    for iterator in creategen:
        endd = myconcat.join(iterator[0])
        writefileF.writelines(myconcat.join(endd))

def readfilebylines(self, myfileobj):
    linenum = 0
    for lines in myfileobj.readlines():            
        mygen = lines
        mymatch = self.regularexpmatch(lines)
        if mymatch:
            print 'chinese word detected'
            #do translation
        else:
            pass
        yield mygen, linenum

def regularexpmatch(self, mytext):
    chinese_compile = re.compile(ur'[\u4e00-\u9fff]+')
    matched = chinese_compile.search(mytext)
    return matched
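
For comparison, here is a minimal sketch of the read-and-detect step with the encoding stated explicitly on every open, so the result does not depend on the system locale. It is not the poster's code: it targets Python 3, the function name `lines_with_flags` and the sample file content are invented for the illustration, and it assumes the input really is UTF-8.

```python
import io
import os
import re
import tempfile

# Basic CJK Unified Ideographs block, the same range the question's regex uses.
CHINESE = re.compile(u'[\u4e00-\u9fff]+')


def lines_with_flags(path, encoding='utf-8'):
    """Yield (line_number, line, has_chinese) for each line of the file."""
    with io.open(path, mode='r', encoding=encoding) as handle:
        for line_number, line in enumerate(handle, start=1):
            yield line_number, line, CHINESE.search(line) is not None


# Hypothetical sample file, written as UTF-8 for the demonstration.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(sample_path, mode='w', encoding='utf-8') as handle:
    handle.write(u'int a = 0;\n// 初始化计数器\n')

results = list(lines_with_flags(sample_path))
os.remove(sample_path)

for line_number, line, has_chinese in results:
    print(line_number, has_chinese)
```

Iterating over the file object with `enumerate` also keeps the line number current, which the `linenum` variable in the snippet above never does.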

【Question discussion】:

Tags: python regex utf-8 io utf8-decode

【Solution 1】:

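I struggled with this for several hours and finally found a solution. The problem is that if you do not specify the encoding of the original file, the script somehow cannot convert the file to another encoding.

So in my case: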

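def copyfile_inUTF8(self, orig_file, copy_file):
    import chardet
    target_en = 'utf-8'
    # read raw bytes so chardet sees the undecoded data
    with open(orig_file, 'rb') as source:
        raw_data = source.read()
    # detect source file encoding
    orig_en = chardet.detect(raw_data)['encoding']
    print 'original file encoding:', orig_en
    # decode with the detected encoding, then re-encode as utf-8
    with open(copy_file, 'wb') as target:
        target.write(unicode(raw_data, orig_en).encode(target_en))
    # the copy is closed (and flushed) by now, so it is safe to read it back
    with open(copy_file, 'rb') as check:
        print 'copy file encoding:', chardet.detect(check.read())['encoding']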

With that function I was able to see the original encoding and convert the file to the target encoding, i.e. UTF-8.
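
Where chardet is not available, a rough alternative is to try a short list of candidate encodings in order and keep the first one that decodes cleanly. This is a different technique from the answer's chardet detection, shown only as a sketch in Python 3. The function name `copy_as_utf8`, the candidate list, and the sample file are assumptions made for the example, and the candidate list would need adjusting to the files actually expected.

```python
import io
import os
import shutil
import tempfile

# Hypothetical candidate list: tried in order, the first clean decode wins.
CANDIDATE_ENCODINGS = ('utf-8-sig', 'gb18030')


def copy_as_utf8(orig_file, copy_file, candidates=CANDIDATE_ENCODINGS):
    """Copy orig_file to copy_file as UTF-8; return the encoding that decoded it."""
    with io.open(orig_file, 'rb') as source:
        raw_data = source.read()
    for encoding in candidates:
        try:
            text = raw_data.decode(encoding)
        except UnicodeDecodeError:
            continue  # not this encoding, try the next candidate
        with io.open(copy_file, 'wb') as target:
            target.write(text.encode('utf-8'))
        return encoding
    raise ValueError('none of the candidate encodings could decode the file')


# Demonstration with a made-up GB18030-encoded source file.
workdir = tempfile.mkdtemp()
orig_path = os.path.join(workdir, 'orig.txt')
copy_path = os.path.join(workdir, 'copy.txt')
with io.open(orig_path, 'wb') as handle:
    handle.write(u'// 中文注释\n'.encode('gb18030'))

detected = copy_as_utf8(orig_path, copy_path)
with io.open(copy_path, 'rb') as handle:
    copied = handle.read().decode('utf-8')
shutil.rmtree(workdir)

print(detected)
```

The order matters: UTF-8 is strict and rejects most non-UTF-8 input, while GB18030 accepts many byte sequences, so it goes last and a successful decode with it is a guess rather than proof.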

【Discussion】: