Add a new column in a dataframe with the common value: answers

【Question title】: Add a new column in dataframe with common value
【Posted】: 2022-02-03 14:39:19
【Question description】:

My dataframe df1 is:

col1	col2	col3
Apple	Apple	Apple
orange	0	orange
Cake	0	0
0	Banana	0
0	grape	grape

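Logic: add the common value to a new column. If a (non-zero) value is present in one or more columns, compare them and add the common value.

I want to add a new column (New_col) to this dataframe, with values as follows: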

col1	col2	col3	New_col
Apple	Apple	Apple	Apple
orange	0	orange	orange
Cake	0	0	Cake
0	Banana	0	Banana
0	grape	grape	grape

Any suggestions on how to do this? Thanks in advance!

【Comments on the question】:

df1['New_col'] = df1['col1']?
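Thanks. I have now edited the table, and this solution will not work.
That's why you should explain the logic rather than just dumping two tables and leaving us to guess ;) I bet I could find a solution whose logic is not what you want.
Yes! Sorry. I realized that after your first comment.
So, what is the logic? ;)

Tags: python-3.x pandas dataframe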

【Solution 1】:

Assuming you want the first non-zero value in each row, you can mask the zeros, bfill along the row, and then take the first column:

df['NewCol'] = df.mask(df.eq('0')).bfill(axis=1).iloc[:,0]

NB. I am also assuming that 0 is a string. You can now adapt this code as needed.

Output:

     col1    col2    col3  NewCol
0   Apple   Apple   Apple   Apple
1  orange       0  orange  orange
2    Cake       0       0    Cake
3       0  Banana       0  Banana
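
To see why the one-liner works, it can help to split the chain into its intermediate steps. The sketch below rebuilds the question's sample frame (the construction and the names `masked` and `filled` are illustrative, not from the answer). As one possible adaptation, it uses `isin([0, "0"])` instead of `eq('0')` so that both integer and string zeros are treated as missing; that choice is an assumption about your data.

```python
import pandas as pd

# Sample data from the question, with a mix of integer and string zeros.
# dtype=object keeps the mixed str/int cells exactly as given.
df = pd.DataFrame(
    {
        "col1": ["Apple", "orange", "Cake", 0, "0"],
        "col2": ["Apple", "0", 0, "Banana", "grape"],
        "col3": ["Apple", "orange", "0", 0, "grape"],
    },
    dtype=object,
)

# Step 1: replace every zero (integer or string) with NaN.
masked = df.mask(df.isin([0, "0"]))

# Step 2: back-fill along each row, so the first non-missing value
# of the row ends up in the leftmost column.
filled = masked.bfill(axis=1)

# Step 3: the leftmost column now holds the first non-zero value per row.
df["NewCol"] = filled.iloc[:, 0]

print(df)
```

A row made up entirely of zeros would end up with NaN in `NewCol`, since there is nothing to back-fill from.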

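【Comments】:

Thanks mozway! This works like a charm! :-)

【Solution 2】:

An alternative to @mozway's solution: use .apply on each row, with .ne("0") and .idxmax to get the index of the first value that differs from "0", and assign the result to the new column.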

>>> df["new_col"] = df.apply(lambda x: x[x.ne("0").idxmax()], axis=1)
>>> df

     col1    col2    col3  new_col
0   Apple   Apple   Apple    Apple
1  Orange       0  Orange   Orange
2    Cake       0       0     Cake
3       0  Banana       0   Banana
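
One caveat worth checking before relying on `idxmax` here: on a boolean Series it returns the label of the first `True`, but when a row contains only "0" it falls back to the first label, so the new column would get "0" for that row. A minimal sketch on two hand-made rows (the names `row`, `all_zero`, `first_label`, and `fallback` are illustrative):

```python
import pandas as pd

# A row where only the middle column holds a real value.
row = pd.Series({"col1": "0", "col2": "Banana", "col3": "0"})
is_nonzero = row.ne("0")            # col1 False, col2 True, col3 False
first_label = is_nonzero.idxmax()   # label of the first True
print(first_label, row[first_label])

# A row with no real value at all: every comparison is False,
# so idxmax returns the first label and the lookup yields "0".
all_zero = pd.Series({"col1": "0", "col2": "0", "col3": "0"})
fallback = all_zero.ne("0").idxmax()
print(fallback, all_zero[fallback])
```

If all-zero rows can occur and should be handled differently, the lambda would need an explicit check such as `x.ne("0").any()`.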

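【Comments】:

Thanks Joao! It is good to know other alternative solutions too!

【Solution 3】:

My understanding is that you are looking for the key that occurs most often in each row, to be used as the entry in a new AggregateCol.

With that in mind, I would first create a NumPy array whose length matches the number of rows in the dataframe, len(df).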

>>> import numpy as np

>>> import pandas as pd

>>> a = {'col1':['AA','BB','CC',0], 'col2':['AA',0,0,'DD'], 'col3':['AA','BB',0,0]}

>>> df = pd.DataFrame(a)

>>> print(df)

  col1 col2 col3
0   AA   AA   AA
1   BB    0   BB
2   CC    0    0
3    0   DD    0

>>> a = np.array(len(df) * [0], dtype='object')

>>> print(a)

[0 0 0 0]

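Next comes a function that (a) builds a dictionary holding the count of each item seen in the row, and (b) fills the numpy array with the item that occurs most often.

If you do not want to count 0, we can make a small modification here.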

>>> def f1():
    idx = 0
    for row in df.iterrows():

        # dictionary object to store occurrences of each item in the row
        # you could also use a collections.Counter to achieve this
        d = {}
        for item in row[1]:
            if item in d.keys():
                d[item] += 1
            else:
                d[item] = 1

        # find the item with max occurrence
        max_key = sorted(d, key=d.get, reverse=True)[0]
        #print(f"{idx=} {max_key=} {d=}")

        # fill the numpy array index with the max_key
        a[idx] = max_key
        idx += 1

>>> f1()

>>> print(a)

['AA' 'BB' 0 0]

The next part is simple: you just assign the filled numpy array to the new column.

>>> df['AggregateCol'] = a

>>> print(df)

  col1 col2 col3 AggregateCol
0   AA   AA   AA           AA
1   BB    0   BB           BB
2   CC    0    0            0
3    0   DD    0            0
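
The output above still counts 0 as a value, which is why rows 2 and 3 end up with 0. One way to make the "do not count 0" modification mentioned earlier is sketched below, using `collections.Counter` as the code comment suggests. The helper name `most_common_nonzero` and the choice to keep 0 for rows with no other value are assumptions for illustration, not part of the original answer.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["AA", "BB", "CC", 0],
        "col2": ["AA", 0, 0, "DD"],
        "col3": ["AA", "BB", 0, 0],
    }
)


def most_common_nonzero(row):
    """Return the most frequent value in the row, ignoring 0 and '0'."""
    counts = Counter(item for item in row if item not in (0, "0"))
    if not counts:
        # Assumption: keep 0 when the row holds no other value.
        return 0
    # most_common(1) returns a list with one (value, count) pair.
    return counts.most_common(1)[0][0]


df["AggregateCol"] = df.apply(most_common_nonzero, axis=1)
print(df)
```

With the sample data, rows 2 and 3 now pick up 'CC' and 'DD' instead of 0, which matches what the question asked for.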

【Comments】: