Structuring a 2D array from a pandas dataframe: answers

【Question title】: Structuring a 2D array from a pandas dataframe
【Posted】: 2019-11-29 15:36:04
【Question description】:

I have a pandas dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['Text','Selection_Values'])
df["Text"] = ["Hi", "this is", "just", "a", "single", "sentence.", "This", np.nan, "is another one.","This is", "a", "third", "sentence","."]
df["Selection_Values"] = [0,0,0,0,0,1,0,0,1,0,0,0,0,0]
print(df)

Output:

               Text  Selection_Values
0                Hi                 0
1           this is                 0
2              just                 0
3                 a                 0
4            single                 0
5         sentence.                 1
6              This                 0
7               NaN                 0
8   is another one.                 1
9           This is                 0
10                a                 0
11            third                 0
12         sentence                 0
13                .                 0

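Now I want to regroup the Text column into a 2D array based on the Selection_Values column. All cells that appear between a 0 (the first integer, or the one right after a 1) and a 1 (inclusive) should go into the same inner list. The last sentence of the dataset may have no closing 1. This can be done as explained in this question: Regroup pandas column into 2D list based on another column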

[["Hi this is just a single sentence."],["This is another one."], ["This is a third sentence ."]]

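I want to go one step further and add the following condition: if a list contains more than max_number_of_cells_per_list non-NaN cells, it should be split into roughly equal parts, each holding max_number_of_cells_per_list cells, give or take 1.

Suppose max_number_of_cells_per_list = 2; then the expected output would be: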

 [["Hi this is"], ["just a"], ["single sentence."],["This is another one."], ["This is"], ["a third sentence ."]]

Example:

Based on the "Selection_Values" column, the cells can be regrouped into the following 2D list with this approach:

[[s.str.cat(sep=' ')] for s in np.split(df.Text, df[df.Selection_Values == 1].index+1) if not s.empty]

Output (original lists):

[["Hi this is just a single sentence."],["This is another one."], ["This is a third sentence ."]]
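As an aside, the same regrouping can be written without handing a Series to np.split. The following is only a minimal sketch, not the method used in the question; the names `list_id` and `regrouped` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Text": ["Hi", "this is", "just", "a", "single", "sentence.", "This",
             np.nan, "is another one.", "This is", "a", "third", "sentence", "."],
    "Selection_Values": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# A row starts a new list when the previous row carried a 1, so shift the
# flags down by one row before taking the running sum.
list_id = df["Selection_Values"].shift(fill_value=0).cumsum()

# Join the non-NaN cells of each list with single spaces.
regrouped = [[" ".join(cells.dropna())] for _, cells in df["Text"].groupby(list_id)]
print(regrouped)
```

Because the running sum only increases after a 1, a final sentence without a closing 1 still ends up in its own list.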

Let's look at the number of cells in these lists:

As you can see, list 1 has 6 cells, list 2 has 2 non-NaN cells, and list 3 has 5 cells.
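These counts can be checked programmatically. A small sketch, assuming the list boundaries described above; `list_id` and `cell_counts` are illustrative names that do not appear in the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Text": ["Hi", "this is", "just", "a", "single", "sentence.", "This",
             np.nan, "is another one.", "This is", "a", "third", "sentence", "."],
    "Selection_Values": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Label every row with the list it belongs to (a 1 closes the current list).
list_id = df["Selection_Values"].shift(fill_value=0).cumsum()

# count() ignores NaN, so this is the number of non-NaN cells per list.
cell_counts = df["Text"].groupby(list_id).count().tolist()
print(cell_counts)  # [6, 2, 5]
```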

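Now, what I want to achieve is this: if a list has more than a certain number of cells, it should be split so that each resulting list holds the desired number of cells, give or take 1.

For example, max_number_of_cells_per_list = 2

Modified lists:

Do you have any idea how to do this?

EDIT: Important: cells from different original lists must not be put into the same list.

EDIT 2: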

               Text  Selection_Values  New
0                Hi                 0  1.0
1           this is                 0  0.0
2              just                 0  1.0
3                 a                 0  0.0
4            single                 0  1.0
5         sentence.                 1  0.0
6              This                 0  1.0
7               NaN                 0  0.0
8   is another one.                 1  1.0
9           This is                 0  0.0
10                a                 0  1.0
11            third                 0  0.0
12         sentence                 0  0.0
13                .                 0  NaN

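【Question discussion】:

Can we define max_number_of_cells_per_list before this operation?
@anky_91, yes, you can... but you can't put two cells from different original lists together. So, for example, you can't put This from list 2 into list 1.

Tags: python pandas list

【Solution 1】:

IIUC, you can do this: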

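n = 2  # change this to set the number of cells per split
# Drop NaN cells and renumber so that positions are consecutive
s = df.Text.dropna().reset_index(drop=True)
# True at the first cell of every block of n; the shift pair resets the last flag to False
c = s.groupby(s.index//n).cumcount().eq(0).shift().shift(-1).fillna(False)

# The running sum of the flags labels each block; join each block's cells with spaces
[[i] for i in s.groupby(c.cumsum()).apply(' '.join).tolist()]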

[['Hi this is'], ['just a'], ['single sentence.'], 
    ['This is another one.'], ['This is a'], ['third sentence .']]
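Note that the snippet above chunks the NaN-free series as a whole and never consults Selection_Values. The chunks line up with the sentence boundaries here because the first two lists hold 6 and 2 non-NaN cells, both multiples of n = 2. The following is a hedged sketch of one way to respect the boundaries explicitly. It is not the answerer's method, and it assumes one particular reading of "roughly equal": each list is cut into ceil(len / max_cells) chunks whose sizes differ by at most one. The names `split_evenly`, `list_id`, and `max_cells` are hypothetical.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Text": ["Hi", "this is", "just", "a", "single", "sentence.", "This",
             np.nan, "is another one.", "This is", "a", "third", "sentence", "."],
    "Selection_Values": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})
max_cells = 2  # plays the role of max_number_of_cells_per_list


def split_evenly(cells, max_cells):
    """Cut a list into ceil(len / max_cells) chunks whose sizes differ by at most 1."""
    n_chunks = max(1, math.ceil(len(cells) / max_cells))
    base, extra = divmod(len(cells), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more cell
        chunks.append(cells[start:start + size])
        start += size
    return chunks


# Label every row with its original list, then split each list on its own,
# so cells from different original lists can never share a chunk.
list_id = df["Selection_Values"].shift(fill_value=0).cumsum()
result = [
    [" ".join(chunk)]
    for _, cells in df["Text"].groupby(list_id)
    for chunk in split_evenly(cells.dropna().tolist(), max_cells)
    if chunk  # skip lists that contain only NaN
]
print(result)
```

Under this reading no chunk exceeds max_cells, so the five-cell third list becomes chunks of 2, 2, and 1 cells. Taking the floor instead of the ceiling for the number of chunks would give fewer, larger chunks; which is preferable depends on how the +/- 1 tolerance is meant.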

EDIT:

d=dict(zip(df.loc[df.Text.notna(),'Text'].index,c.index))
ser=pd.Series(d)
df['new']=ser.reindex(range(ser.index.min(),
                        ser.index.max()+1)).map(c).fillna(False).astype(int)
print(df)

               Text  Selection_Values  new
0                Hi                 0    1
1           this is                 0    0
2              just                 0    1
3                 a                 0    0
4            single                 0    1
5         sentence.                 1    0
6              This                 0    1
7               NaN                 0    0
8   is another one.                 1    0
9           This is                 0    1
10                a                 0    0
11            third                 0    1
12         sentence                 0    0
13                .                 0    0

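【Discussion】:

Question: is it possible to generate a list like "Selection_Values", so that the final selection can be inserted into the dataset as a new column?
@henry Do you mean the c variable?: s.groupby(s.index//n).cumcount().eq(0).shift().shift(-1).fillna(False).astype(int) ?? A 1 marks the place where the next list starts.
Thanks, I added this New column. As you can see in EDIT 2, the 1s in the New column and in Selection_Values do not line up.
This is probably because the NaN values were not aligned correctly.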
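One way to sidestep this kind of misalignment is to compute the flags directly on the original index rather than on a NaN-free, renumbered series. The sketch below assumes fixed-size chunks of n cells that restart at every original list, which is a simplification and not the answerer's exact rule; `list_id`, `valid`, and `pos` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Text": ["Hi", "this is", "just", "a", "single", "sentence.", "This",
             np.nan, "is another one.", "This is", "a", "third", "sentence", "."],
    "Selection_Values": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})
n = 2  # cells per chunk

list_id = df["Selection_Values"].shift(fill_value=0).cumsum()
valid = df["Text"].notna()

# 0-based position of each non-NaN cell inside its own original list.
pos = valid.astype(int).groupby(list_id).cumsum() - 1

# A 1 marks the first cell of each chunk; NaN rows always get 0.
df["new"] = (valid & (pos % n == 0)).astype(int)
print(df["new"].tolist())
```

Since every intermediate Series keeps the index of df, the NaN row needs no re-mapping. Unlike the answer's output, this variant also flags the trailing "." in row 13 as the start of its own chunk.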