【Question Title】: Create all possible permutations of a column partitioned by another column in a Pandas Dataframe
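【Posted】: 2017-06-12 15:25:06
【Question】:

My dataframe looks like this:

My goal is:

Explanation:

  1. Each customer has placed 3 orders
  2. Each order can contain purchases from multiple categories
  3. Desired state: get all possible permutations of the categories a customer purchased, in order sequence. The second image should help make this clearer
  4. In the desired state, Category1 is the category purchased in the first order, Category2 the category purchased in the second order, and so on.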
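
To make the desired state concrete: the number of rows expected for a customer is the product of the number of categories in each of that customer's orders. The sketch below counts those rows. It assumes the sample data used in the answers further down matches the question's data, and it uses the answers' column name `OrderSequence` rather than `order_seq`.

```python
import pandas as pd

# Sample data copied from the answers in this thread (an assumption:
# the original screenshots are not available here).
df = pd.DataFrame({
    'CustomerName': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'OrderSequence': [1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 1, 2, 3],
    'Category': ['Food', 'Food', 'Clothes', 'Furniture', 'Clothes', 'Food', 'Toys',
                 'Clothes', 'Toys', 'Food', 'Furniture', 'Toys', 'Food', 'Food'],
})

# Number of categories bought in each (customer, order) pair.
per_order = df.groupby(['CustomerName', 'OrderSequence'])['Category'].size()

# Expected rows per customer = product of the per-order category counts.
expected_rows = per_order.groupby(level='CustomerName').prod()
print(expected_rows)
```

For this sample, customer 1 has 1 x 3 x 3 category choices across the three orders, so the result should contain 9 rows for that customer.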

The code I am using:

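import time
import pandas as pd

start_time = time.time()

# base_df is the input dataframe with columns CustomerName, order_seq and Category
df = pd.DataFrame()
for CustomerName in base_df.CustomerName.unique():
    df1 = base_df[(base_df['CustomerName'] == CustomerName)][['CustomerName','order_seq','Category']]
    # Cartesian product of the categories in each order, one index level per order
    df2 = pd.DataFrame(index=pd.MultiIndex.from_product([subdf['Category'] for p, subdf in df1.groupby('order_seq')], names=df1.order_seq.unique())).reset_index()
    df2['CustomerName'] = CustomerName
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, df2])

print("--- %s seconds ---" % (time.time() - start_time))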

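This takes about 10 minutes to run on my dataset, so I am looking for a faster approach.

I am working in Pandas at the moment, but pointers in R or SQL are welcome too! Thanks!

【Discussion】:

  • Is this a permutation? Why can customer 1 only order Food in his first order?
  • Welcome to Stack Overflow! You can first take the tour, learn How to Ask a good question, and create a Minimal, Complete, and Verifiable example. That makes it easier for us to help you.
  • @PauloMiraMor - No, it could be anything. He could have bought Clothes, Furniture, or both in his first order. And yes, I need all products arranged in order sequence for each customer

Tags: python pandas stack pivot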


【Solution 1】:

Consider merging three OrderSequence dataframes, one per order, each joined to the distinct CustomerName values

import pandas as pd

df = pd.DataFrame({'CustomerName': [1,1,1,1,1,1,1,2,2,2,3,3,3,3],
                   'OrderSequence': [1,2,2,2,3,3,3,1,2,3,1,1,2,3],
                   'Category': ['Food','Food','Clothes','Furniture','Clothes','Food','Toys',
                                'Clothes','Toys','Food','Furniture','Toys','Food','Food']})

finaldf = pd.DataFrame(df['CustomerName'].drop_duplicates())

for i in range(1,4):
    # rows of order i, with Category renamed to Category<i>
    seqdf = (df[df['OrderSequence']==i][['CustomerName', 'Category']]
             .rename(columns={'Category':'Category'+str(i)}))
    finaldf = pd.merge(finaldf, seqdf, on=['CustomerName'])

print(finaldf)

#     CustomerName  Category1  Category2 Category3
# 0              1       Food       Food   Clothes
# 1              1       Food       Food      Food
# 2              1       Food       Food      Toys
# 3              1       Food    Clothes   Clothes
# 4              1       Food    Clothes      Food
# 5              1       Food    Clothes      Toys
# 6              1       Food  Furniture   Clothes
# 7              1       Food  Furniture      Food
# 8              1       Food  Furniture      Toys
# 9              2    Clothes       Toys      Food
# 10             3  Furniture       Food      Food
# 11             3       Toys       Food      Food
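
The loop above fixes the number of orders at three via `range(1,4)`. As a rough sketch, the same chain of merges can be driven by whatever sequence numbers appear in the data. The helper name `category_combinations` is hypothetical and not part of the original answer. Because these are inner joins, a customer missing any order number would drop out of the result.

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({
    'CustomerName': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'OrderSequence': [1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 1, 2, 3],
    'Category': ['Food', 'Food', 'Clothes', 'Furniture', 'Clothes', 'Food', 'Toys',
                 'Clothes', 'Toys', 'Food', 'Furniture', 'Toys', 'Food', 'Food'],
})


def category_combinations(frame):
    """Hypothetical helper: one inner merge per distinct OrderSequence value."""
    customers = frame[['CustomerName']].drop_duplicates()
    # One frame per order number, with Category renamed to Category<n>.
    parts = [
        frame.loc[frame['OrderSequence'] == seq, ['CustomerName', 'Category']]
             .rename(columns={'Category': f'Category{seq}'})
        for seq in sorted(frame['OrderSequence'].unique())
    ]
    # Fold the per-order frames onto the customer list, left to right.
    return reduce(lambda left, right: pd.merge(left, right, on='CustomerName'),
                  [customers] + parts)


result = category_combinations(df)
print(result)
```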

Admittedly, the merge approach above was first worked out in SQL using self-joins and then translated to pandas:

SELECT t1.CustomerName, t2.Category AS Category1, 
       t3.Category AS Category2, t4.Category AS Category3

FROM (SELECT DISTINCT CustomerName FROM DataFrame) AS t1 
INNER JOIN DataFrame AS t2 
ON t1.CustomerName = t2.CustomerName 
INNER JOIN DataFrame AS t3
ON t1.CustomerName = t3.CustomerName 
INNER JOIN DataFrame AS t4
ON t1.CustomerName = t4.CustomerName

WHERE (t2.OrderSequence=1) AND (t3.OrderSequence=2) AND (t4.OrderSequence=3);
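
One way to try the query without a database server is to load the frame into an in-memory SQLite database from Python. This is only a sketch: the table name `DataFrame` follows the query above, and using SQLite is an assumption of this example, not something the answer specifies.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    'CustomerName': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'OrderSequence': [1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 1, 2, 3],
    'Category': ['Food', 'Food', 'Clothes', 'Furniture', 'Clothes', 'Food', 'Toys',
                 'Clothes', 'Toys', 'Food', 'Furniture', 'Toys', 'Food', 'Food'],
})

query = """
SELECT t1.CustomerName, t2.Category AS Category1,
       t3.Category AS Category2, t4.Category AS Category3
FROM (SELECT DISTINCT CustomerName FROM DataFrame) AS t1
INNER JOIN DataFrame AS t2 ON t1.CustomerName = t2.CustomerName
INNER JOIN DataFrame AS t3 ON t1.CustomerName = t3.CustomerName
INNER JOIN DataFrame AS t4 ON t1.CustomerName = t4.CustomerName
WHERE (t2.OrderSequence=1) AND (t3.OrderSequence=2) AND (t4.OrderSequence=3);
"""

conn = sqlite3.connect(':memory:')
try:
    # Copy the frame into a table named after the one used in the query.
    df.to_sql('DataFrame', conn, index=False)
    sql_result = pd.read_sql_query(query, conn)
finally:
    conn.close()

print(sql_result)
```

The query has no ORDER BY, so the row order of `sql_result` may differ from the pandas output.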

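【Discussion】:

  • Thanks, I will try running your logic and see whether it runs faster on my data!
  • What did you find with the actual data?

【Solution 2】:

OK. It took some work, but I got it. Hope it helps.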

import pandas as pd
import numpy as np
from functools import reduce  # reduce is not a builtin in Python 3

df = pd.DataFrame([], columns=['CustomerName','Order Sequence','Category'])

df['CustomerName'] = [1,1,1,1,1,1,1,2,2,2,3,3,3,3]
df['Order Sequence'] = [1,2,2,2,3,3,3,1,2,3,1,1,2,3]
df['Category'] = ['Food','Food','Clothes','Furniture','Clothes','Food','Toys','Clothes','Toys','Food','Furniture','Toys','Food','Food']

df2 = pd.DataFrame([], columns=['CustomerName','Category1','Category2','Category3'])

for CN in sorted(set(df['CustomerName'])):

    df_temp = pd.DataFrame([], columns=['CustomerName','Category1','Category2','Category3'])

    list_OS_1 = []
    list_OS_2 = []
    list_OS_3 = []

    MMC = reduce(lambda x, y: x*y,df.loc[df['CustomerName']==CN, 'Order Sequence'].value_counts().values)

    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==1)), 'Category'])):

        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==1)), 'Category']:

            list_OS_1.append(CTG) 

    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==2)), 'Category'])):

        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==2)), 'Category']:

            list_OS_2.append(CTG) 

    for N in np.arange(MMC / len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==3)), 'Category'])):

        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==3)), 'Category']:

            list_OS_3.append(CTG) 

    df_temp['Category1'] = list_OS_1
    df_temp['Category2'] = list_OS_2
    df_temp['Category3'] = list_OS_3
    df_temp['CustomerName'] = CN

    df2 = pd.concat([df2, df_temp], axis=0)

print (df2)

Output:

   CustomerName  Category1  Category2 Category3
0           1.0       Food       Food   Clothes
1           1.0       Food    Clothes      Food
2           1.0       Food  Furniture      Toys
3           1.0       Food       Food   Clothes
4           1.0       Food    Clothes      Food
5           1.0       Food  Furniture      Toys
6           1.0       Food       Food   Clothes
7           1.0       Food    Clothes      Food
8           1.0       Food  Furniture      Toys
0           2.0    Clothes       Toys      Food
0           3.0  Furniture       Food      Food
1           3.0       Toys       Food      Food

ps: It is not dynamic, so if you add or remove categories it will break. But as long as the data follows the initial criteria you gave me, it will work
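
The three per-order loops above are hard-coded, and the printed output repeats some rows for customer 1. A more dynamic variant might use `itertools.product` on the per-order category lists. This is a sketch of a different technique, not the answerer's method, and it assumes every customer has the same set of order numbers, as the question states.

```python
from itertools import product

import pandas as pd

df = pd.DataFrame({
    'CustomerName': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'OrderSequence': [1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 1, 2, 3],
    'Category': ['Food', 'Food', 'Clothes', 'Furniture', 'Clothes', 'Food', 'Toys',
                 'Clothes', 'Toys', 'Food', 'Furniture', 'Toys', 'Food', 'Food'],
})

rows = []
for customer, cust_df in df.groupby('CustomerName'):
    # One list of categories per order; groupby sorts by order number.
    per_order = [grp['Category'].tolist()
                 for _, grp in cust_df.groupby('OrderSequence')]
    # Cartesian product: one category picked from each order.
    for combo in product(*per_order):
        rows.append((customer, *combo))

# Column names derived from the order numbers present in the data.
columns = ['CustomerName'] + [f'Category{seq}'
                              for seq in sorted(df['OrderSequence'].unique())]
combos = pd.DataFrame(rows, columns=columns)
print(combos)
```

Adding or removing categories changes only the lengths of the per-order lists, so nothing else needs to be edited.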

【Discussion】:

  • Thanks! Unfortunately, I may add new categories!