【Question title】: Problems with saving unicode to sqlite
【Posted】: 2018-02-07 13:53:57
【Question】:

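I am currently working on a project that reads log files from an SMTP server and extracts meaningful information about each email that passes through. I have a table with a few columns that will later be relevant for searching: spam score, from domain, to domain, timestamp, subject, and so on. Everything worked fine until I ran into some non-ASCII characters, usually in the subject field (as expected).

I tried decoding the str as ISO-8859-1 (which is the file's encoding) and saving it, and I also tried encoding it back to UTF-8; honestly, I'm a bit lost here. I've heard that working with unicode in Python 2.7 is a nightmare, but I hadn't experienced it until now.

Anyway, let me explain. This is how I extract the subject: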

    if 'subject' in realInfo:
        emailDict[keywrd].setSubject(
            realInfo[realInfo.index('subject') + len('subject') + 1:].decode('ISO-8859-1'))

emailDict is a dictionary containing all the emails currently being processed.

This is how I insert everything into the database:

    info = (e.getID(), str(e.getSpamScore()), str(e.getMCPScore()), " ".join(e.getFrom()), " ".join(e.getTo()), e.getStatus(), e.getTimestamp(), e.getSubject(), dumps(e))
    print repr(e.getSubject())  # DEBUG
    print type(e.getSubject())  # DEBUG
    self.conn.cursor().execute(u"INSERT INTO emails (emailID, SpamScore, MCPScore, FromDomain, ToDomain, status, timestamp, subject, object)"
                      " VALUES (?,?,?,?,?,?,?,?,?)", info)
    self.conn.commit()

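I added two print statements to help me understand the problem.

'e' is an Email object that serves as a blueprint for each email. It holds the information previously gathered by the interpreter. Afterwards, I save the most important pieces in the columns mentioned above, which will be used for searching (the "object" column holds the Email object itself, pickled). But as soon as a special character appears, an exception is raised: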

u'VPXL \xffM-^W no more compromises. Better size, better life. \n'
<type 'unicode'>
Exception in thread Thread-25:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/local/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/ProjMail/projMail_lib.py", line 174, in refresher
self.interpreter.start()
File "/ProjMail/projMail_lib.py", line 213, in start
c.save(self.emailTracker)
File "/ProjMail/projMail_lib.py", line 56, in save
self.saveEmails()
File "/ProjMail/projMail_lib.py", line 62, in saveEmails
else: self.add(key) # If it's new
File "/ProjMail/projMail_lib.py", line 82, in add
" VALUES (?,?,?,?,?,?,?,?,?)", info)

ProgrammingError: You must not use 8-bit bytestrings unless you use a 
text_factory that can interpret 8-bit bytestrings (like text_factory = str). 
It is highly recommended that you instead just switch your application to 
Unicode strings.           

As far as I can tell, it is unicode, so I don't understand why sqlite is complaining. Any idea what I might be doing wrong here? Thanks in advance!

【Comments】:

    Tags: python python-2.7 unicode sqlite


    【Solution 1】:

    The problem is not inserting the subject itself into the database; it is inserting the pickled Email instance.

    >>> import pickle
    >>> import sqlite3
    >>> subject = u'VPXL \xffM-^W no more compromises. Better size, better life. \n'
    >>> conn = sqlite3.connect(':memory:')
    >>> c = conn.cursor()                            
    >>> c.execute("""CREATE TABLE foo (bar text, baz text)""")                                   
    <sqlite3.Cursor object at 0x7fab5cf280a0>
    >>> c.execute("""INSERT INTO foo VALUES (?, ?)""", (subject, 'random text'))
    <sqlite3.Cursor object at 0x7fab5cf280a0>
    
    >>> class Email(object):pass
    ... 
    >>> e = Email()
    >>> e.subject = subject
    >>> c.execute("""INSERT INTO foo VALUES (?, ?)""", (subject, pickle.dumps(e)))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
    

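    Pickling the Email instance creates a bytestring with mixed encodings internally, and that bytestring is what triggers the exception (pickling just subject does the same).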
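
To see why the pickled value is rejected, it can help to look at the bytes pickle produces. The following is a minimal sketch, not part of the original answer: it pickles only the subject with protocol 0 (the Python 2 default, passed explicitly here so the result is the same on Python 3) and checks for bytes above 127. The helper name is_valid_utf8 is made up for illustration.

```python
import pickle

subject = u'VPXL \xffM-^W no more compromises. Better size, better life. \n'

# Protocol 0 is what Python 2's pickle.dumps uses when no protocol is given.
payload = pickle.dumps(subject, protocol=0)

# bytearray() yields integers on both Python 2 and Python 3.
high_bytes = [b for b in bytearray(payload) if b > 127]


def is_valid_utf8(data):
    """Hypothetical helper: report whether a bytestring decodes as UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


print(high_bytes)
print(is_valid_utf8(payload))
```

The pickle contains the raw byte 0xff, so it is an 8-bit bytestring that cannot be read as UTF-8 text, which is the kind of value the error message complains about.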

    To prevent the exception, you can change the connection's text_factory attribute to str:

    >>> conn.text_factory = str
    >>> c.execute("""INSERT INTO foo VALUES (?, ?)""", (subject, pickle.dumps(e)))
    <sqlite3.Cursor object at 0x7fab5b3343b0>
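
One side effect to keep in mind: text_factory also controls what TEXT columns return when they are read back. The sketch below is an illustration under assumptions, not the answerer's code. It uses bytes, which is the same type as str on Python 2, so that it behaves the same on Python 3; the table and the sample value are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE foo (bar text)")
conn.execute("INSERT INTO foo VALUES (?)", (u'caf\xe9',))

# Default text_factory: TEXT comes back as a unicode string.
default_value = conn.execute("SELECT bar FROM foo").fetchone()[0]

# With a bytestring factory, the same column comes back as undecoded UTF-8 bytes.
conn.text_factory = bytes
raw_value = conn.execute("SELECT bar FROM foo").fetchone()[0]

print(repr(default_value))
print(repr(raw_value))
```

If the rest of the application expects unicode subjects, every value read this way has to be decoded by hand.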
    

    If you would rather keep the default unicode text_factory, you can store the pickled instance in a blob column, wrapped in a buffer instance.

    >>> conn.text_factory = unicode
    >>> c.execute("""CREATE TABLE foo2 (bar text, baz blob)""")
    >>> c.execute("""INSERT INTO foo2 VALUES (?, ?)""", (subject, buffer(pickle.dumps(e))))
    <sqlite3.Cursor object at 0x7fab5b3343b0>
    

    The pickled instance is restored on retrieval:

    >>> c.execute("""SELECT bar, baz FROM foo2""")
    <sqlite3.Cursor object at 0x7fab5b3343b0>
    >>> res = c.fetchone()
    >>> res
    (u'VPXL \xffM-^W no more compromises. Better size, better life. \n', <read-write buffer ptr 0x7fab5e9706c8, size 167 at 0x7fab5e970688>)
    >>> pickle.loads(res[1])
    <__main__.Email object at 0x7fab5b333ad0>
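
The blob approach can also be written so that it runs on both Python 2 and Python 3. This is a rough sketch under a few assumptions: sqlite3.Binary is used in place of buffer (it is buffer on Python 2 and memoryview on Python 3), a plain dict stands in for the Email object, and the function names, table layout, and field values are made up for illustration.

```python
import pickle
import sqlite3

subject = u'VPXL \xffM-^W no more compromises. Better size, better life. \n'


def save_object(conn, subject, obj):
    """Store the searchable subject as TEXT and the pickled object as a BLOB."""
    # sqlite3.Binary marks the pickled bytes as binary data, so sqlite3
    # does not try to interpret them as text.
    conn.execute("INSERT INTO emails (subject, object) VALUES (?, ?)",
                 (subject, sqlite3.Binary(pickle.dumps(obj))))
    conn.commit()


def load_objects(conn):
    """Return every stored object, unpickled."""
    rows = conn.execute("SELECT object FROM emails").fetchall()
    # bytes() turns a Python 2 buffer into a str; on Python 3 the value is already bytes.
    return [pickle.loads(bytes(blob)) for (blob,) in rows]


conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE emails (subject TEXT, object BLOB)")

# Hypothetical stand-in for the Email instance.
email = {'subject': subject, 'spam_score': 7}
save_object(conn, subject, email)
restored = load_objects(conn)
```

Keeping the subject in its own TEXT column means it stays searchable as unicode, while the pickle is treated as opaque binary data.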
    

    【Comments】:

    • I did what you suggested and it worked! Thank you very much!