Accessing unicode content from a DataFrame returns unicode content with an additional backslash in Python 3

【Question title】: Accessing unicode content from DataFrame returns unicode content with additional backslash in Python3
【Posted】: 2019-04-13 11:40:15
【Question】:

I have a CSV file containing some tweets downloaded through an API. The tweets contain some Unicode characters, and I have a fair idea of how to decode them.

I load the CSV file into a DataFrame:

import pandas as pd

df = pd.read_csv('sample.csv', header=None)
columns = ['time', 'tweet']
df.columns = columns

One of the tweets is:

b'RT : This little girl dressed as her father for Halloween, a  employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'

But when I access this tweet with df['tweet'][0], the output comes back in the following format:

"b'RT : This little girl dressed as her father for Halloween, a  employee \\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x91\\x8c (via ) '"

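I can't figure out why this extra backslash gets added to the tweet. Because of it, the content doesn't get decoded. Below are a few rows from the DataFrame.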

      time                         tweet
0   2018-11-02 05:55:46        b'RT : This little girl dressed as her father for Halloween, a  employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
1   2018-11-02 05:46:41        b'RT : This little girl dressed as her father for Halloween, a  employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
2   2018-11-02 03:44:35        b'Like, you could use a line map that just shows the whole thing instead of showing a truncated map that\xe2\x80\x99s confusing.\xe2\x80\xa6 (via )
3   2018-11-02 03:37:03        b' service is a joke. No service northbound  No service northbound from Navy Yard after a playoff game at 11:30pm. And they\xe2\x80\xa6'

[Screenshot of sample.csv]

As mentioned above, accessing any of these tweets directly produces output with an extra backslash added.

Can anyone explain why this happens and how to avoid it?

Thanks

【Comments】:

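Show some sample lines from the original .CSV. It looks like it was written incorrectly in the first place. If you wrote the CSV yourself, you might ask a new question about how to read from the API and write the data to CSV correctly. This looks like an XY Problem.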
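As a rough illustration of the comment's point, here is a minimal sketch of how such a file could come about and how decoding before writing avoids it. It assumes the tweets arrive as bytes and that the file was written with Python's csv module, neither of which the question confirms; the record values are made up for the example. csv.writer stringifies non-string values with str(), so a bytes value ends up in the file as its b'...' representation.

```python
import csv
import io

# Hypothetical record standing in for what an API client might return.
timestamp = '2018-11-02 05:55:46'
tweet_bytes = b'a  employee \xf0\x9f\x98\x82'

# StringIO stands in for open('sample.csv', 'w', encoding='utf-8', newline='').
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)

# Writing bytes directly: the cell becomes the text b'...' with literal backslashes.
writer.writerow([timestamp, tweet_bytes])
# Decoding first: the cell holds the actual text.
writer.writerow([timestamp, tweet_bytes.decode('utf-8')])

buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[0][1])  # the bytes representation, as in the question
print(rows[1][1])  # the decoded tweet text
```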

Tags: python-3.x pandas dataframe twitter unicode

【Solution 1】:

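You haven't shown the contents of the CSV file, but it looks like whoever created it recorded the string representation of the bytes objects that came from Twitter. That is, inside the CSV file itself you will find the literal characters b'\xff...'.

So when you read it from Python, each value looks like a bytes object when printed (something written as b'...'), but it is actually a str whose content is that representation. The doubled backslash is only how Python displays a str containing a literal backslash; the string itself holds a single backslash character.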
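The effect can be reproduced without pandas. The sketch below assumes the file was produced by calling str() on a bytes object, which the question does not confirm; the variable names are illustrative only.

```python
# Assumption: the CSV cell was produced by calling str() on a bytes object.
raw = b'employee \xf0\x9f\x98\x82'  # real bytes: 9 ASCII characters plus a 4-byte emoji
text = str(raw)                      # a str holding the *representation* of raw

print(type(text).__name__)  # the type is str, not bytes
print(text)                 # printed with single backslashes
print(repr(text))           # the REPL-style echo escapes each backslash, so it looks doubled

# Each backslash is one character in the string; only its display is doubled.
backslashes_in_text = text.count('\\')
backslashes_in_repr = repr(text).count('\\')
```

This is why df['tweet'][0] appears to have gained backslashes: the interactive echo shows repr() of the str, while print() would show the stored characters.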

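One way to get them back as proper strings is to have Python evaluate their content. They then become valid bytes objects, which can be decoded to text. ast.literal_eval is the safer choice here, because eval would execute arbitrary code.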
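A small sketch of that step on a single value, outside pandas. The cell content is a shortened stand-in for one tweet, and the second call is only there to show that ast.literal_eval refuses anything that is not a plain literal.

```python
import ast

# A str that merely looks like a bytes literal, as it would be read from the CSV.
cell = "b'employee \\xf0\\x9f\\x98\\x82'"

as_bytes = ast.literal_eval(cell)   # parses the literal and yields a real bytes object
decoded = as_bytes.decode('utf-8')  # the escaped bytes become the emoji character

# literal_eval accepts only literals; an expression containing a call is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```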

So, after you have loaded the data into the DataFrame, this fixes your tweet column:

import ast

df['tweet'] = df['tweet'].map(lambda x: ast.literal_eval(x).decode('utf-8') if x.startswith("b'") else x)
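If some cells are missing values or hold a truncated literal (row 2 in the question's listing appears to lack its closing quote, though that may only be a display artifact), the one-liner above would raise. A more defensive variant could look like the sketch below; decode_bytes_repr is a hypothetical helper name, not part of pandas or of the original answer, and falling back to the unchanged value is just one possible policy.

```python
import ast

def decode_bytes_repr(value):
    """Return decoded text if value is the repr of a bytes object, else value unchanged."""
    # Hypothetical helper for illustration; not a pandas API.
    if not isinstance(value, str) or not value.startswith(("b'", 'b"')):
        return value  # NaN, numbers, or ordinary text
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # malformed or truncated literal: leave it as found
    if not isinstance(parsed, bytes):
        return value
    # errors='replace' keeps going if the bytes are not valid UTF-8.
    return parsed.decode('utf-8', errors='replace')
```

It would be applied the same way as the lambda: df['tweet'] = df['tweet'].map(decode_bytes_repr).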

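【Comments】:

Thanks a lot, @jsbueno. Your solution works like a charm.
Since you asked, I'd like to mention that the CSV content is the same as what I showed; please see my edited post. It now contains a screenshot of the sample.csv file. Thanks again.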