Answers to "unicode string in file contain different"

【Question title】: unicode string in file contain different
【Posted】: 2012-04-14 13:30:58
【Question description】:

My system is Fedora. For some reason, the last field of each record is a Unicode string (the data is copied out of the guest machine with memcpy inside QEMU). The Unicode string is a Windows registry key name.

smss.exe|NtOpenKey|304|4|4|0|\^@R^@e^@g^@i^@s^@t^@r^@y^@\^@M^@a^@c^@h^@i^@n^@e^@\^@S^@y^@s^@t^@e^@m^@\^@C^@u^@r^@r^@e^@n^@t^@C^@o^@n^@t^@r^@o^@l^@S^@e^@t^@\^@C^@o^@n^@t^@r^@o^@l^@\^@S^@e^@s^@s^@i^@o^@n^@ ^@M^@a^@n^@a^@g^@e^@r^@
smss.exe|NtClose|304|4|4|0|
System|NtOpenKey|4|0|2147484532|0|\^@R^@e^@g^@i^@s^@t^@r^@y^@\^@M^@a^@c^@h^@i^@n^@e^@\^@S^@y^@s^@t^@e^@m^@\^@C^@u^@r^@r^@e^@n^@t^@C^@o^@n^@t^@r^@o^@l^@S^@e^@t^@
services.exe|NtOpenKey|680|624|636|0|\^@R^@E^@G^@I^@S^@T^@R^@Y^@\^@M^@A^@C^@H^@I^@N^@E^@\^@S^@y^@s^@t^@e^@m^@\^@C^@u^@r^@r^@e^@n^@t^@C^@o^@n^@t^@r^@o^@l^@S^@e^@t^@\^@S^@e^@r^@v^@i^@c^@e^@s^@

Here is a hex dump of some of the data. '|' is the field separator. The first 6 fields are ASCII strings. The last field is a Windows Unicode string (I think it is UTF-16).

0000000 6d73 7373 652e 6578 4e7c 4f74 6570 4b6e
0000010 7965 337c 3430 347c 347c 307c 5c7c 5200
0000020 6500 6700 6900 7300 7400 7200 7900 5c00
0000030 4d00 6100 6300 6800 6900 6e00 6500 5c00
0000040 5300 7900 7300 7400 6500 6d00 5c00 4300
0000050 7500 7200 7200 6500 6e00 7400 4300 6f00
0000060 6e00 7400 7200 6f00 6c00 5300 6500 7400
0000070 5c00 4300 6f00 6e00 7400 7200 6f00 6c00
0000080 5c00 5300 6500 7300 7300 6900 6f00 6e00
0000090 2000 4d00 6100 6e00 6100 6700 6500 7200
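The dump above appears to group the bytes into 16-bit words printed in little-endian order (an assumption, based on the first word `6d73` reading as "ms" rather than "sm"). Under that assumption, the following minimal Python 3 sketch swaps each byte pair to recover the order the bytes have in the file. The helper name `words_to_bytes` is made up for illustration.

```python
def words_to_bytes(dump_words):
    """Turn 16-bit hex words (e.g. '6d73') back into file-order bytes."""
    raw = bytes.fromhex("".join(dump_words.split()))
    # Each word is printed high byte first, but a little-endian word
    # stores its low byte first, so swap every pair of bytes.
    swapped = bytearray(len(raw))
    swapped[0::2] = raw[1::2]
    swapped[1::2] = raw[0::2]
    return bytes(swapped)

first_row = "6d73 7373 652e 6578 4e7c 4f74 6570 4b6e"
print(words_to_bytes(first_row))                   # b'smss.exe|NtOpenK'
print(bytes.fromhex(first_row.replace(" ", "")))   # b'mssse.exN|OtepKn'
```

The second line shows what the bytes look like if the words are taken at face value, without swapping.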

I want to parse it with Python and insert it into a database. This is how I handle it:

def parsecreate(filename):
    sourcefile = codecs.open("data.db",mode="r",encoding='utf-8')
    cx = sqlite3.connect("sqlite.db")
    cu = cx.cursor()
    cu.execute("create table data(id integer primary key,command text, ntfunc text, pid text, ppid text, handle text, roothandle text, genevalue text)")
    eachline = []
    for lines in sourcefile:
        eachline = lines.split('|')
        eachline[-1] = eachline[-1].strip('\n')
        eachline[-1] = eachline[-1].decode('utf-8')

        cu.execute("insert into data(command,ntfunc,pid,ppid,handle,roothandle,genevalue) values(?,?,?,?,?,?,?)",(eachline[0],eachline[1],eachline[2],eachline[3],eachline[4],eachline[5],eachline[-1]) )

    cx.commit()
    cx.close()

I get this error:

  File "./parse1.py", line 18, in parsecreate
    for lines in sourcefile:
  File "/usr/lib/python2.7/codecs.py", line 684, in next
    return self.reader.next()
  File "/usr/lib/python2.7/codecs.py", line 615, in next
    line = self.readline()
  File "/usr/lib/python2.7/codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "/usr/lib/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 51: invalid continuation byte

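This happens because the Unicode string may contain bytes that are not valid UTF-8. How can I read the last field correctly?

Put simply: the file as a whole is not UTF-16 encoded, but it contains a UTF-16 string. How do I get that field inserted into the database correctly? Python reads a file using a single encoding. Can I just read the raw bytes and then assemble those bytes into a unicode string?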

【Question discussion】:

Tags: python sqlite unicode utf-16

【Solution 1】:

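Your data file is not a plain text file, so open it in binary mode and decode the text field explicitly. I had to manipulate the data quite a bit to get back what I believe was the original binary data. It looks like the original data may have been a sqlite3.exe dump similar to my final output below, except that the final field was stored as a UTF-16-encoded BLOB instead of TEXT.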
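To illustrate why the file cannot be read as UTF-8 text, here is a minimal Python 3 sketch (the answer's own code targets Python 2). The record and the key name are made-up sample values. The NUL characters it shows are presumably what appears as `^@` in the question's sample.

```python
key = "\\Registry\\Machine"
raw = b"smss.exe|NtOpenKey|304|4|4|0|" + key.encode("utf-16-le")

# ASCII-only UTF-16LE is technically valid UTF-8, but every second
# character comes out as NUL (U+0000).
as_utf8 = raw.decode("utf-8")
print(repr(as_utf8.split("|")[-1][:6]))   # '\\\x00R\x00e\x00'

# Splitting the bytes first and decoding only the last field as
# UTF-16LE recovers the key name.
fields = raw.split(b"|", 6)
decoded = fields[-1].decode("utf-16-le")
print(decoded)                            # \Registry\Machine

# A non-ASCII character yields bytes that are not valid UTF-8.
try:
    "é".encode("utf-16-le").decode("utf-8")   # bytes E9 00
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)                             # True
```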

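Note that parsing line by line and splitting on '|' can run into problems if the UTF-16 data contains bytes that represent '\n' or '|', but I'll ignore that for now.

Here is my test: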

from binascii import unhexlify
import sqlite3

data = unhexlify('''\
6d73 7373 652e 6578 4e7c 4f74 6570 4b6e
7965 337c 3430 347c 347c 307c 5c7c 5200
6500 6700 6900 7300 7400 7200 7900 5c00
4d00 6100 6300 6800 6900 6e00 6500 5c00
5300 7900 7300 7400 6500 6d00 5c00 4300
7500 7200 7200 6500 6e00 7400 4300 6f00
6e00 7400 7200 6f00 6c00 5300 6500 7400
5c00 4300 6f00 6e00 7400 7200 6f00 6c00
5c00 5300 6500 7300 7300 6900 6f00 6e00
2000 4d00 6100 6e00 6100 6700 6500 7200'''.replace(' ','').replace('\n',''))

# OP's data dump must have been decoded from the original data
# as little-endian words, and is missing a final 0x00 byte.
# Byte-swapping and adding missing zero byte to get back what
# was likely the original binary data.
data = ''.join(a+b for a,b in zip(data[1::2],data[::2])) + '\x00'

with open('data.db','wb') as f:
    f.write(data)

def parsecreate(filename):
    with open(filename,'rb') as sourcefile:
        with sqlite3.connect("sqlite.db") as cx:
            cu = cx.cursor()
            cu.execute("create table data(id integer primary key,command text, ntfunc text, pid text, ppid text, handle text, roothandle text, genevalue text)")
            eachline = []
            for line in sourcefile:
                eachline = line.split('|')
                eachline[-1] = eachline[-1].decode('utf-16le')
                cu.execute("insert into data(command,ntfunc,pid,ppid,handle,roothandle,genevalue) values(?,?,?,?,?,?,?)",(eachline[0],eachline[1],eachline[2],eachline[3],eachline[4],eachline[5],eachline[-1]) )

parsecreate('data.db')

Output:

C:\>sqlite3 sqlite.db
SQLite version 3.7.9 2011-11-01 00:52:41
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> select * from data;
1|smss.exe|NtOpenKey|304|4|4|0|\Registry\Machine\System\CurrentControlSet\Control\Session Manager
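As a rough Python 3 take on the same idea, the sketch below limits the split to the first six separators, so a 0x7C byte inside the UTF-16 data stays in the last field. It assumes the six leading fields are ASCII and never contain '|'. The helper `parse_record`, the sample record, and the in-memory database are illustrative only, and it does not address the '\n' caveat.

```python
import sqlite3

def parse_record(line):
    """Split one raw record into six ASCII fields and one UTF-16LE field."""
    # maxsplit=6 keeps any 0x7C bytes of the UTF-16 data in the last field.
    *head, last = line.split(b"|", 6)
    return [f.decode("ascii") for f in head] + [last.decode("utf-16-le")]

# Hypothetical record: U+017C encodes to the bytes 7C 01 in UTF-16LE,
# so a plain split(b"|") would cut the key name in two.
record = b"smss.exe|NtOpenKey|304|4|4|0|" + "\\Registry\\\u017c".encode("utf-16-le")
row = parse_record(record)

cx = sqlite3.connect(":memory:")
cx.execute("create table data(id integer primary key, command text, ntfunc text, "
           "pid text, ppid text, handle text, roothandle text, genevalue text)")
cx.execute("insert into data(command,ntfunc,pid,ppid,handle,roothandle,genevalue) "
           "values(?,?,?,?,?,?,?)", row)
stored = cx.execute("select genevalue from data").fetchone()[0]
cx.close()
print(ascii(stored))
```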

【Comments】:

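Thank you very much. I just got home and I have an exam tomorrow. I can see two differences: 1. read the file with mode 'b'; 2. since the last field is already UTF-16LE, just decode it to a unicode string. By the way, the file data.db you wrote is "mssse.exN|OtepKnye3|404|4|0|\| a string". I think the problem is that I only copied a small part of the file. It should look like ".exe|NtOpenKey|".
Yes, it would be best to have the original raw data, or at least a byte dump rather than the little-endian words I suspect it is. I've updated my answer to decipher your data.
I have a problem handling '\n'. If I don't write a '\n' at the end of each record, how do I read one line from the file? If I do write a '\n', then: 1. without eachline[-1] = eachline[-1].strip('\n') I get UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 132: truncated data; 2. with eachline[-1] = eachline[-1].strip('\n'), I wonder whether it might remove a byte that belongs to the unicode string.
I misunderstood. The unicode string may itself contain the byte 0x0A, in which case "for lines in file" may return a wrong line (because it stops reading at '\n').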
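Both concerns in the last two comments can be reproduced with a small Python 3 sketch. The sample values are made up, and `drop_terminator` is a hypothetical helper that assumes the writer always appends exactly one '\n' to each record.

```python
# U+010A encodes to the bytes 0A 01 in UTF-16LE.
value = "a\u010ab".encode("utf-16-le")        # b'a\x00\n\x01b\x00'
record = b"x.exe|NtOpenKey|1|2|3|0|" + value + b"\n"

# Splitting the raw bytes on newlines cuts the record inside U+010A.
pieces = record.split(b"\n")
print(len(pieces))                            # 3

# strip(b"\n") removes every leading and trailing 0x0A byte, not just one.
# U+0A41 encodes to 41 0A, so a value ending in it loses its last byte.
tail = "\u0a41".encode("utf-16-le")           # b'A\n'
print((tail + b"\n").strip(b"\n"))            # b'A'

def drop_terminator(field):
    """Remove exactly one trailing newline byte, if present."""
    return field[:-1] if field.endswith(b"\n") else field

restored = drop_terminator(tail + b"\n").decode("utf-16-le")
print(restored == "\u0a41")                   # True
```

Removing a single terminator byte avoids the second problem but not the first. Finding record boundaries reliably would need something the current format lacks, such as a length prefix for the last field.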