How to combine data from one row in dataframe? Answers

【Question title】: How to combine data from one row in dataframe?
【Posted】: 2021-11-13 11:13:00
【Question description】:

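I am working on sequential and frequent pattern mining. I was given a dataset of this type for the task and was told to build sequences from it before processing.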

Here is sample data taken from the dataset, in table format. The table in .csv format is at: https://drive.google.com/file/d/1j1rEy4Q600y_oym23cG3m3NNWuNvIcgG/view?usp=sharing

User	Item 1	Item 2	Item 3	Item 4	Item 5
A	milk	cake	citrus
B	cheese	milk	bread	cabbage	carrot
A	tea	juice	citrus	salmon
B	apple	orange
B	cake

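First, I think I have to load the csv file into a pandas DataFrame. I have no problem with that. What I want to ask is: how can I get results like the following from the dataframe?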

Expected result 1: the items a user bought in each transaction (one row) are grouped into a tuple

User	Transactions
A	(milk cake citrus)(tea juice citrus salmon)
B	(cheese milk bread cabbage carrot)(apple orange)(cake)

Expected result 2: every item a user bought, without grouping by transaction.

User	Transactions
A	milk, cake, citrus, tea, juice, citrus, salmon,
B	cheese, milk, bread, cabbage, carrot, apple, orange, cake

My question is, how do I build these dataframes? I have tried the solution from this post: How to group dataframe rows into list in pandas groupby, but still without success.

【Question comments】:

You should include your attempt and what did not work for you.

Tags: python pandas dataframe

【Solution 1】:

To get the first result:

out = df.set_index('User').apply(lambda x : tuple(x[x.notna()].tolist()),axis=1).groupby(level=0).agg(list).reset_index(name='Transactions')
Out[95]: 
  User                                       Transactions
0    A  [(milk, cake, citrus), (tea, juice, citrus, sa...
1    B  [(cheese, milk, bread, cabbage, carrot), (appl...

The second result is easier than the first:

import numpy as np  # needed for np.nan below
df.set_index('User').replace('',np.nan).stack().groupby(level=0).agg(','.join)
Out[97]: 
User
A             milk,cake,citrus,tea,juice,citrus,salmon
B    cheese,milk,bread,cabbage,carrot,apple,orange,...
dtype: object
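
For anyone who wants to try the two expressions above without downloading the file, here is a minimal, self-contained sketch. The inline CSV text is typed in from the sample table in the question, and the names `csv_text`, `tuples`, `result1` and `result2` are only illustrative. It uses `dropna()` in place of the `x[x.notna()]` mask, and adds an explicit `dropna()` after `stack()` on the assumption that not every pandas version drops missing values there by default.

```python
import io
import pandas as pd

# Sample rows typed in from the question; empty cells load as NaN.
csv_text = """User,Item 1,Item 2,Item 3,Item 4,Item 5
A,milk,cake,citrus,,
B,cheese,milk,bread,cabbage,carrot
A,tea,juice,citrus,salmon,
B,apple,orange,,,
B,cake,,,,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Result 1: one tuple per row (NaN dropped), then one list of tuples per user.
tuples = df.set_index('User').apply(lambda row: tuple(row.dropna()), axis=1)
result1 = tuples.groupby(level=0).agg(list).reset_index(name='Transactions')

# Result 2: flatten all items per user and join them into one string.
result2 = (df.set_index('User')
             .stack()
             .dropna()  # explicit, in case stack() keeps missing values
             .groupby(level=0).agg(', '.join)
             .reset_index(name='Transactions'))

print(result1)
print(result2)
```

Using `', '.join` rather than `','.join` gives the comma-plus-space layout shown in the question's expected result 2.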

【Comments】:

Your first solution includes NaN ;)
@mozway Not if the original is empty
How do I remove the NaN from the first output?
@DionisiusPratama Change it to this: x[x.notna()]

【Solution 2】:

Let's start with the second one:

(df.set_index('User')
   .stack()
   .groupby(level=0).apply(list)
   .rename('Transactions')
   .reset_index()
)

Output:

  User                                       Transactions
0    A   [milk, cake, citrus, tea, juice, citrus, salmon]
1    B  [cheese, milk, bread, cabbage, carrot, apple, ...

To get the first one, you only need to add a new column:

(df.assign(group=df.groupby('User').cumcount())
   .set_index(['User', 'group'])
   .stack()
   .groupby(level=[0,1]).apply(tuple)
   .groupby(level=0).apply(list)
   .rename('Transactions')
   .reset_index()
)

Output:

  User                                       Transactions
0    A  [(milk, cake, citrus), (tea, juice, citrus, sa...
1    B  [(cheese, milk, bread, cabbage, carrot), (appl...
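
The new `group` column is what keeps each original row apart. Here is a small sketch of what `groupby('User').cumcount()` produces for the sample's user order (the one-column frame `users` is only an illustration, not part of the answer):

```python
import pandas as pd

# Only the User column matters for numbering the transactions.
users = pd.DataFrame({'User': ['A', 'B', 'A', 'B', 'B']})

# cumcount() numbers each user's rows 0, 1, 2, ... in order of appearance.
users['group'] = users.groupby('User').cumcount()

print(users)
```

Each (User, group) pair then identifies exactly one original row, so aggregating with `tuple` at `level=[0,1]` rebuilds one tuple per transaction before the outer `groupby(level=0)` collects them into a list per user.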

【Comments】:

【Solution 3】:

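import pandas as pd

df = pd.read_csv('sampletable.csv')

# Use whichever item columns the file has (the sample table stops at 'Item 5').
item_cols = [c for c in df.columns if c.startswith('Item')]

# Join each row's items with spaces (str.cat skips NaN) and wrap in parentheses.
df['Transactions'] = '(' + df[item_cols].apply(lambda x: x.str.cat(sep=' '), axis=1) + ')'

# Concatenate each user's parenthesised transactions into one string.
df = df.groupby(['User'])['Transactions'].apply(lambda x: ''.join(x)).reset_index()

print(df)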

Output:

  User                                                  Transactions
0    A        (milk cake citrus)(tea juice citrus salmon)
1    B  (cheese milk bread cabbage carrot)(apple orange)(cake)
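
This approach needs no explicit NaN filtering because `Series.str.cat`, when called without `others` and without `na_rep`, omits missing values from the joined string. A minimal sketch on a hand-made row (the Series `row` is made up for illustration):

```python
import pandas as pd

# One transaction with a trailing empty cell, as in the sample table.
row = pd.Series(['milk', 'cake', None])

# With no `others` and no `na_rep`, missing entries are left out of the result.
joined = row.str.cat(sep=' ')

print(joined)
```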

For the second output, use this:

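df = pd.read_csv('sampletable.csv')

# Select the item columns present in the file instead of hard-coding 'Item 6'.
item_cols = [c for c in df.columns if c.startswith('Item')]

# Join each row's items with ', ', then join each user's rows the same way.
df['a'] = df[item_cols].apply(lambda x: x.str.cat(sep=', '), axis=1)

df = df.groupby(['User'])['a'].apply(lambda x: ', '.join(x)).reset_index()

print(df)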
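
The list-of-tuples results from Solutions 1 and 2 can also be rendered in the string layout shown in the question's expected result 1. Here is a minimal plain-Python sketch, where the helper name `format_transactions` is hypothetical:

```python
def format_transactions(transactions):
    """Render [('a', 'b'), ('c',)] as '(a b)(c)'."""
    return ''.join('(' + ' '.join(items) + ')' for items in transactions)


# User A's transactions from the sample table.
sample = [('milk', 'cake', 'citrus'), ('tea', 'juice', 'citrus', 'salmon')]
formatted = format_transactions(sample)

print(formatted)
```

Assuming the list-based frame is called `out`, as in Solution 1, something like `out['Transactions'].map(format_transactions)` would apply it per user.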

【Comments】: