【Question title】: 'Series' object has no attribute 'decode' in pandas
【Posted】: 2019-02-28 06:03:43
【Question description】:

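I am trying to decode UTF-8 encoded text in Python. The data is loaded into a pandas DataFrame, and then I decode it. This raises the error: AttributeError: 'Series' object has no attribute 'decode'. How do I correctly decode the text in a pandas column?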

>> preparedData.head(5).to_dict( )
{'id': {0: 1042616899408945154, 1: 1042592536769044487, 2: 1042587702040903680, 3: 1042587263643930626, 4: 1042586780292276230}, 'date': {0: '2018-09-20', 1: '2018-09-20', 2: '2018-09-20', 3: '2018-09-20', 4: '2018-09-20'}, 'time': {0: '03:30:14', 1: '01:53:25', 2: '01:34:13', 3: '01:32:28', 4: '01:30:33'}, 'text': {0: "b'\\xf0\\x9f\\x8c\\xb9 are red, violets are blue, if you want to buy us \\xf0\\x9f\\x92\\x90, here is a CLUE \\xf0\\x9f\\x98\\x89 Our #flowerpowered eye & cheek palette is AL\\xe2\\x80\\xa6 '", 1: "b'\\xf0\\x9f\\x8e\\xb5Is it too late now to say sorry\\xf0\\x9f\\x8e\\xb5 #tartetalk #memes'", 2: "b'@JillianJChase Oh no! Please email your order # to social@tarte.com & we can help \\xf0\\x9f\\x92\\x95'", 3: 'b"@Danikins__ It\'s best applied with our buffer brush! \\xf0\\x9f\\x92\\x9c\\xc2\\xa0"', 4: "b'@AdelaineMorin DEAD \\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3\\xf0\\x9f\\xa4\\xa3'"}, 'hasMedia': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0}, 'hasHashtag': {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}, 'followers_count': {0: 801745, 1: 801745, 2: 801745, 3: 801745, 4: 801745}, 'retweet_count': {0: 17, 1: 94, 2: 0, 3: 0, 4: 0}, 'favourite_count': {0: 181, 1: 408, 2: 0, 3: 0, 4: 14}}

My data looks like the above. I want to decode the 'text' column.

ExampleText = b'\xf0\x9f\x8c\xb9 are red, violets are blue, if you want to buy us \xf0\x9f\x92\x90, here is a CLUE \xf0\x9f\x98\x89 Our #flowerpowered eye & cheek palette is AL\xe2\x80\xa6'

I can decode the text above with

ExampleText = ExampleText.decode('utf8')

However, when I try to decode the text in the pandas DataFrame column, I get an error. This is what I tried:

preparedData['text'] = preparedData['text'].decode('utf8')

The error I get is:

Traceback (most recent call last):
  File "F:/Level 4 Research Project/makeViral/main.py", line 23, in <module>
    main()
  File "F:/Level 4 Research Project/makeViral/main.py", line 19, in main
    preprocessedData = preprocessData(preparedData)
  File "F:\Level 4 Research Project\makeViral\preprocess.py", line 34, in preprocessData
    preparedData['text'] = preparedData['text'].decode('utf8')
  File "C:\Users\Kabilesh\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 4376, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'decode'

I also tried

preparedData['text'] = preparedData['text'].str.decode('utf8', errors='strict')

This does not raise any error, but the resulting 'text' column looks like this:

'text': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}

【Comments】:

    Tags: python pandas


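    【Solution 1】:

    I might be wrong, but my guess is that what you have are actual byte strings, b"XXXXX", rather than text strings that contain the repr of a bytes object, "b'XXXXX'", as shown in your question. If so, you can do the following (you need to use the string accessor):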

    preparedData['text'] = preparedData['text'].str.decode('utf8')
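
Before choosing between the two approaches, it may help to check what the column actually holds. The following is only a sketch under assumptions: `real_bytes` and `bytes_as_text` are made-up stand-ins for `preparedData['text']`, not data from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for preparedData['text'].
real_bytes = pd.Series([b"\xf0\x9f\x92\x95 thanks", b"caf\xc3\xa9"], dtype=object)
bytes_as_text = pd.Series(["b'\\xf0\\x9f\\x92\\x95 thanks'"], dtype=object)

# Count the Python type of each element to see what the column really contains.
print(real_bytes.map(type).value_counts())     # every element is bytes
print(bytes_as_text.map(type).value_counts())  # every element is str

# Real bytes objects can be decoded through the .str accessor.
decoded = real_bytes.str.decode("utf-8")
print(decoded.tolist())
```

If the type count reports `str` rather than `bytes`, there is nothing for `.str.decode` to decode, which would be consistent with the NaN column described in the question.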
    

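    Edit: It looks like my assumption was wrong. In that case you can add a preprocessing step that parses each "b'...'" string back into a bytes object before decoding: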

    import ast
    preparedData['text'] = preparedData['text'].apply(ast.literal_eval).str.decode("utf-8")
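
If some rows are not valid bytes literals, `ast.literal_eval` raises an exception and the whole `apply` fails. A more defensive variant could look like the sketch below. The helper name `parse_bytes_repr` is hypothetical, and the fallback behaviour (returning unparseable values unchanged, replacing undecodable bytes) is an assumption about what is wanted, not something stated in the question.

```python
import ast


def parse_bytes_repr(value):
    """Decode "b'...'" text to str; return anything else unchanged."""
    if isinstance(value, bytes):
        # Already real bytes: decode directly.
        return value.decode("utf-8", errors="replace")
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Not a valid Python literal (for example, truncated): keep as-is.
            return value
        if isinstance(parsed, bytes):
            return parsed.decode("utf-8", errors="replace")
    return value


print(parse_bytes_repr("b'\\xf0\\x9f\\x92\\x95 thanks'"))
print(parse_bytes_repr("plain text stays the same"))
```

It could then be applied per element with `preparedData['text'].apply(parse_bytes_repr)`.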
    

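    【Discussion】:

    • I checked my data. Some values start with "b'@makeupbyalishan" and some with "b'Natural glam FTW!". I don't know how this happened. I fetched the data from Twitter with tweepy, and the data is UTF-8 encoded. What should I do?
    • See my edit; I think it solves the problem you are having (at least for the example you provided). It looks to me as though your byte strings were turned into actual strings at some point (perhaps you saved them to a file format and read them back in?). I can't say for sure without more digging.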
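
One way the situation guessed at in the last comment can arise is a CSV round trip. This is only an illustration of that hypothesis, not a confirmed diagnosis of the asker's pipeline; it uses the standard-library `csv` module and an in-memory buffer in place of a real file.

```python
import csv
import io

original = b"\xf0\x9f\x92\x95 thanks"

# csv.writer stringifies non-string cells with str(), so the bytes object
# is written as the text of its repr, including the leading b and quotes.
buffer = io.StringIO()
csv.writer(buffer).writerow([original])

buffer.seek(0)
restored = next(csv.reader(buffer))[0]
print(type(restored), restored)

# Decoding before writing avoids the problem: the file then holds real text.
clean = io.StringIO()
csv.writer(clean).writerow([original.decode("utf-8")])
clean.seek(0)
restored_clean = next(csv.reader(clean))[0]
print(restored_clean)
```

Under this assumption, decoding each tweet's bytes before saving would remove the need for the `ast.literal_eval` step on later loads.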