Can't seem to split list in pandas Series? (Answers)

【Question title】: Can't seem to split list in pandas Series?
【Posted】: 2018-02-12 02:33:14
【Question description】:

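I have a column that contains multiple genres per row. I am trying to split each list of genres so that I can get every genre individually, but no matter what I try, I keep getting NaN for the entire column of the DataFrame.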

This is what the data looks like:

0                                      [Drama,, Romance]
1                 [Animation,, Comedy,, Kids, &, Family]
2                         [Drama,, Mystery, &, Suspense]
3                                                [Drama]
4                                                    NaN
5                 [Art, House, &, International,, Drama]
6       [Art, House, &, International,, Drama,, Romance]
7                                          [Documentary]
8      [Action, &, Adventure,, Animation,, Art, House...
9               [Action, &, Adventure,, Drama,, Western]
10                                     [Comedy,, Horror]

I want to get: ['Drama', 'Romance'] ['Animation', 'Comedy', 'Kids & Family'] ...

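I am doing this because I want to see how many unique genres there are. At the moment I can only see the unique lists, but I want each unique genre. I am not even sure I am going about this the right way, so any help is much appreciated.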

Here is my most recent attempt (x is the data shown above, plus more rows):

 x = pd.Series(x)
 x = x.str.split()
 [i.str.split() for i in x]

Thanks a lot for your help!

【Question comments】:

Please post the input data and the desired output.
Could you also add what you want the final DataFrame to look like?

Tags: python pandas split series

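【Solution 1】:

Your data appears to be malformed, with some extraneous commas. Assuming the elements are actually strings, you need to evaluate the string representation of each list to turn it into an actual list.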
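Since everything here hinges on whether the elements are strings or lists, a quick check can settle that first. This is a minimal sketch on made-up sample data (`as_strings`, `as_lists`, and the helper `element_types` are illustrative names, not from the question). It also shows why `.str.split()` would give NaN for every row if the elements are already lists.

```python
import pandas as pd

# Hypothetical samples: one Series of strings, one Series of Python lists
as_strings = pd.Series(["[Drama,, Romance]", "[Drama]", None])
as_lists = pd.Series([["Drama,", "Romance"], ["Drama"], None])

def element_types(s):
    """Return the set of Python type names found among the non-null elements."""
    return {type(v).__name__ for v in s.dropna()}

print(element_types(as_strings))
print(element_types(as_lists))

# .str.split() only knows how to split strings; a list element has no
# .split() method, so pandas fills that row with NaN instead.
print(as_lists.str.split().isna().all())
```

If the check reports `list`, the string-based steps are not needed and the NaN in the question is explained by calling `.str.split()` on non-string elements.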

A few steps:

# First, import ast to use literal_eval() (and pandas, if not already imported)
import ast
import pandas as pd

# Then, remove the extraneous commas (plain substring replacement)
new_df = df[0].str.replace(', ', ' ', regex=False)

# Then, add quotes around your listed items to prep for eval.
# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching.
new_df = new_df.str.replace(r'(?P<item>\b[\w &]+)', r'"\1"', regex=True)

# Then, eval the string representation, skipping missing values
lst = [ast.literal_eval(i) for i in new_df if pd.notnull(i)]

# Or, you can just put all of this together:
lst = [ast.literal_eval(i) for i in df[0].str.replace(', ', ' ', regex=False).str.replace(r'(?P<item>\b[\w &]+)', r'"\1"', regex=True) if pd.notnull(i)]

Output:

[['Drama', 'Romance'],
 ['Animation', 'Comedy', 'Kids & Family'],
 ['Drama', 'Mystery & Suspense'],
 ['Drama'],
 ['Art House & International', 'Drama'],
 ['Art House & International', 'Drama', 'Romance'],
 ['Documentary'],
 ['Action & Adventure', 'Animation', 'Art House'],
 ['Action & Adventure', 'Drama', 'Western'],
 ['Comedy', 'Horror']]

If you want to keep the index and represent the result as a dictionary:

 d = {i: ast.literal_eval(j) for i, j in new_df.items() if pd.notnull(j)}

Output:

{0: ['Drama', 'Romance'],
 1: ['Animation', 'Comedy', 'Kids & Family'],
 2: ['Drama', 'Mystery & Suspense'],
 3: ['Drama'],
 5: ['Art House & International', 'Drama'],
 6: ['Art House & International', 'Drama', 'Romance'],
 7: ['Documentary'],
 8: ['Action & Adventure', 'Animation', 'Art House'],
 9: ['Action & Adventure', 'Drama', 'Western'],
 10: ['Comedy', 'Horror']}
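The steps so far assume string elements. The printed form `[Drama,, Romance]` would also be consistent with each element being a Python list of tokens such as `'Drama,'` and `'Romance'`. If that turns out to be the case, something like the following might work instead. It is only a sketch: the sample data and the helper name `tokens_to_genres` are assumptions, not taken from the question.

```python
import pandas as pd

# Hypothetical data: each element is a list of tokens, some with trailing commas
x = pd.Series([
    ["Drama,", "Romance"],
    ["Animation,", "Comedy,", "Kids", "&", "Family"],
    None,
])

def tokens_to_genres(tokens):
    """Join tokens with spaces, then split on ', ' to recover whole genres."""
    if not isinstance(tokens, list):
        return tokens  # leave missing values untouched
    return " ".join(tokens).split(", ")

genres = x.apply(tokens_to_genres)
print(genres[0])
print(genres[1])
```

Joining first and splitting on the comma afterwards keeps multi-word genres such as "Kids & Family" together, which a plain whitespace split would break apart.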

If you want this back in a DataFrame, I am not sure how you would like it represented, but once you have the dict or list it is easy to rebuild.
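As one possible way to get from the dict to the unique genres the question asks about, here is a sketch that assumes pandas 0.25 or later (for `Series.explode`). The dict `d` below is a shortened, hand-typed stand-in for the real parsed result.

```python
import pandas as pd

# Hypothetical parsed result, keyed by the original row index
d = {
    0: ["Drama", "Romance"],
    1: ["Animation", "Comedy", "Kids & Family"],
    3: ["Drama"],
}

genres = pd.Series(d)      # one list per row, original index preserved
flat = genres.explode()    # one genre per row; index values repeat

unique_genres = sorted(flat.unique())
counts = flat.value_counts()

print(unique_genres)
print(counts)
```

`flat.nunique()` gives the number of distinct genres directly, and `counts` shows how often each one occurs.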

【Comments】:

What is it producing? If it is an error, what kind of error do you get? Is your data already made of arrays, or is it actually strings?