Splitting objects of different lengths in a pandas Series: answers

【Question title】: Splitting objects of different lengths in panda series
【Posted】: 2021-08-15 18:18:34
【Question】:

Python/pandas beginner here.

I have a pandas Series (a column of a larger df) that looks like this:

0                                   ['0344010000122413']
1                                   ['0344010000132886']
2                                   ['0344010000021642']
3      ['0344010000010731', '0344010000010732', '0344...
4                                   ['0344010000025264']
Name: NUMPOINTS, Length: 271, dtype: object

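Each NUMPOINT is 16 characters long. The number of NUMPOINTS per row varies from 0 to roughly 100.

As you can see, the Series has dtype object. My goal is to convert every row of this Series into a real list and the numbers into integers, but the object dtype and the [' and '] wrapped around the values prevent this. The variable length of each row also rules out some functions.

I tried df['NUMPOINTS'] = df.NUMPOINTS.apply(lambda x: x[2:-2].split(',')), but this only works for rows with a single NUMPOINT.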
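To illustrate why the slice-and-split attempt only works for single-value rows, here is a minimal sketch in plain Python. It assumes each cell is the string representation of a list, as the Series preview suggests; the helper name slice_and_split is made up for this illustration.

```python
# Assumed cell contents: string representations of lists (not real lists).
single = "['0344010000122413']"
multi = "['0344010000010731', '0344010000010732']"


def slice_and_split(text):
    # Same logic as the lambda in the question: drop the outer [' and '],
    # then split on commas.
    return text[2:-2].split(',')


print(slice_and_split(single))  # ['0344010000122413']
print(slice_and_split(multi))   # ["0344010000010731'", " '0344010000010732"]
```

With more than one value, the inner quotes and the space after each comma survive the split, which would explain a fragment such as "0344010000010731'" showing up in a later int() call.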

I also tried df['NUMPOINTS'].replace(regex=True,inplace=True,to_replace=r'\D',value=r''), but this "glues" the values together. For example, index 3 becomes:

3      0344010000010731034401000001073203440100000107...

Converting that to integers then goes wrong.
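The gluing happens because removing every non-digit character also removes the separators between values. As a hedged alternative (not taken from the answers in this thread), one could extract each run of digits instead of deleting everything else. The sketch below uses only the standard re module; extract_ints is a hypothetical helper that could be passed to Series.apply.

```python
import re

cell = "['0344010000010731', '0344010000010732']"

# Deleting all non-digits fuses the two values into one long string.
glued = re.sub(r'\D', '', cell)
print(glued)  # 03440100000107310344010000010732


def extract_ints(value):
    # Find each separate run of digits and convert it to int.
    # str() lets the same helper accept cells that are already lists.
    return [int(digits) for digits in re.findall(r'\d+', str(value))]


print(extract_ints(cell))  # [344010000010731, 344010000010732]
print(extract_ints("[]"))  # []
```

Note that int() drops the leading zero, so the values no longer have 16 digits once converted.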

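I also tried the solution from the question "pandas - convert string into list of strings", but that did not do the job either. Am I missing something here?

Edit: Trying the updated answer from https://stackoverflow.com/users/10035985/andrej-kesely gave me: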

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-374-5f4f43cc7fc1> in <module>()
      1 from ast import literal_eval
      2 df["NUMPOINTS"] = df["NUMPOINTS"].apply(
----> 3     lambda x: [
      4         int(value) for value in (literal_eval(x) if isinstance(x, str) else x)
      5     ]

2 frames
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-374-5f4f43cc7fc1> in <listcomp>(.0)
      2 df["NUMPOINTS"] = df["NUMPOINTS"].apply(
      3     lambda x: [
----> 4         int(value) for value in (literal_eval(x) if isinstance(x, str) else x)
      5     ]
      6 )

ValueError: invalid literal for int() with base 10: "0344010000010731'"

【Comments】:

Tags: python pandas list object

【Solution 1】:

You can apply ast.literal_eval and then int() in a list comprehension:

from ast import literal_eval

df["NUMPOINTS"] = df["NUMPOINTS"].apply(
    lambda x: [int(value) for value in literal_eval(x)]
)
print(df)

Prints:

                            NUMPOINTS
0                   [344010000122413]
1                   [344010000132886]
2                   [344010000021642]
3  [344010000010731, 344010000010732]
4                   [344010000025264]

Edit:

If your column contains a mix of strings and lists:

df["NUMPOINTS"] = df["NUMPOINTS"].apply(
    lambda x: [
        int(value.strip("'")) for value in (literal_eval(x) if isinstance(x, str) else x)
    ]
)
print(df)
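One caveat with stripping only the quote character: if the cells were already split by the earlier x[2:-2].split(',') attempt, some elements may also carry a leading space (for example " '0344010000010732"), and strip("'") stops at that space. The following is a sketch under that assumption, not the answer author's code; to_int_list is a hypothetical helper that strips both spaces and quotes before converting.

```python
from ast import literal_eval


def to_int_list(cell):
    # Parse string cells such as "['0344...']"; leave real lists untouched.
    values = literal_eval(cell) if isinstance(cell, str) else cell
    # Strip stray spaces and single quotes left over from earlier attempts.
    return [int(str(value).strip(" '")) for value in values]


print(to_int_list("['0344010000122413']"))
# [344010000122413]
print(to_int_list(["0344010000010731'", " '0344010000010732"]))
# [344010000010731, 344010000010732]
print(to_int_list("[]"))
# []
```

It could be applied with df["NUMPOINTS"].apply(to_int_list), assuming the column has no NaN values.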

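【Comments】:

This gives me the error: ValueError: malformed node or string:
@QB-science Are there NaN values in the column? Are there strings in the column other than [ ... ]?
No NaN, but some rows contain only [].
@QB-science That should not be a problem. Edit your question and put the full error traceback there (properly formatted).
@QB-science I have updated my answer.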

【Solution 2】:

You can also use map() together with pd.eval():

import pandas as pd

df['NUMPOINTS']=df['NUMPOINTS'].map(lambda x: [int(y.lstrip('0').rstrip("'")) for y in (pd.eval(x) if type(x).__name__=='str' else x)])

If you now print df, you get:

                            NUMPOINTS
0                   [344010000122413]
1                   [344010000132886]
2                   [344010000021642]
3  [344010000010731, 344010000010732]
4                   [344010000025264]

【Comments】:

This solution gave me this syntax error: File "<unknown>", line 1 [0 344010000122413 ] ^ SyntaxError: invalid syntax
Thanks for the help, but the updated answer gave me: File "", line 1 [0 344010000122413] ^ SyntaxError: invalid syntax