【Question Title】: Group by based on first part of tuples in Python using pandas
【Posted】: 2020-10-20 10:07:45
【Question Description】:

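I am having trouble grouping by the first two elements of a tuple. I have searched a lot and tried several things, but I cannot figure it out :(

I have this dataset: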

    id  id2 duplicates
0   a   b   (us2, us1, 1)
0   a   b   (us1, us4, 1)
0   a   b   (us4, us2, 1)
0   a   b   (us2, us5, 1)
0   a   b   (us5, us4, 1)
0   a   b   (us4, us1, 1)
0   a   b   (us1, us2, 1)
0   a   b   (us2, us1, 2)
0   a   b   (us1, us4, 4)
0   a   b   (us4, us2, 1)
0   a   b   (us2, us4, 1)
0   a   b   (us4, us2, 1)
1   c   b   (us1, us2, 1)
1   c   b   (us2, us1, 1)
1   c   b   (us1, us2, 1)
1   c   b   (us2, us4, 1)
1   c   b   (us4, us5, 1)
2   v   b   (us4, us5, 1)

I want to group by id, id2, and the two 'usx' values at the start of each tuple, so the output should be:

    id  id2 duplicates
0   a   b   (us2, us1, 1), (us2, us1, 2)
0   a   b   (us1, us4, 1), (us1, us4, 4)
0   a   b   (us4, us2, 1), (us4, us2, 1), (us4, us2, 1)
0   a   b   (us2, us5, 1)
0   a   b   (us5, us4, 1)
0   a   b   (us4, us1, 1)
0   a   b   (us1, us2, 1)
0   a   b   (us2, us4, 1)
1   c   b   (us1, us2, 1), (us1, us2, 1)
1   c   b   (us2, us1, 1)
1   c   b   (us2, us4, 1)
1   c   b   (us4, us5, 1)
2   v   b   (us4, us5, 1)

The code that produces the part that already works is:

import pandas as pd

d = {'id': [      "a",  "a",   "a", "a",   "a",   "a",   "a",   "a",   "a",   "c",   "c",   "c",   "c",   "c",   "a",   "a",   "a",   "a",   "v",   "v",   "c",   "c"], 
     'id2': ["b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b",   "b"], 
     'userid':         ["us1", "us2", "us1", "us2", "us4", "us4", "us5", "us1", "us2", "us1", "us2", "us1", "us2", "us4", "us4", "us2", "us4", "us2", "us4", "us5", "us4", "us5"],
     "time":            [11,    2,      3,     5,      4,   7,     6,      8,     9,    10,    11,    12,    13,    14,     15,   16,    17,    18,    19,    20,    21,    22]}

# Order events by time, then pair each user with the next one inside every (id, id2) group
df_test = pd.DataFrame(data=d).sort_values('time').reset_index()
df_test = (df_test.groupby(['id', 'id2'])
                  .apply(lambda x: list(zip(x['userid'][:-1], x['userid'][1:],
                                            x['time'][:-1], x['time'][1:])))
                  .reset_index(name='duplicates'))

# Keep only transitions between different users and store the elapsed time
df_test['duplicates'] = df_test.apply(lambda x: [(k, v, j - y) for k, v, y, j in x.duplicates if k != v], 1)
df_test['duplicates'] = df_test.apply(lambda x: [(k, v, y) for k, v, y in x.duplicates], 1)
# explode returns a new DataFrame, so assign the result: one tuple per row
df_test = df_test.explode('duplicates')

【Question Comments】:

    Tags: python pandas lambda tuples pandas-groupby


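    【Solution 1】:

    I believe you need to group by the first 2 values of each tuple, extracted by indexing with str. This works because the str accessor slices each element, and tuples support slicing just like strings: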

    df = (df_test.groupby(['id','id2', df_test['duplicates'].str[:2]], sort=False)['duplicates']
                 .apply(list)
                 .reset_index(level=2, drop=True)
                 .reset_index())
    print (df)
       id id2                                     duplicates
    0   a   b                 [(us2, us1, 1), (us2, us1, 2)]
    1   a   b                 [(us1, us4, 1), (us1, us4, 4)]
    2   a   b  [(us4, us2, 1), (us4, us2, 1), (us4, us2, 1)]
    3   a   b                                [(us2, us5, 1)]
    4   a   b                                [(us5, us4, 1)]
    5   a   b                                [(us4, us1, 1)]
    6   a   b                                [(us1, us2, 1)]
    7   a   b                                [(us2, us4, 1)]
    8   c   b                 [(us1, us2, 1), (us1, us2, 1)]
    9   c   b                                [(us2, us1, 1)]
    10  c   b                                [(us2, us4, 1)]
    11  c   b                                [(us4, us5, 1)]
    12  v   b                                [(us4, us5, 1)]
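
To see the mechanism in isolation, here is a minimal sketch on a tiny hand-made Series. The values and the names `s`, `keys`, and `grouped` are illustrative assumptions, not part of the original data. It shows that `.str[:2]` slices each tuple element-wise and that the sliced tuples can then serve as a grouping key:

```python
import pandas as pd

# Hypothetical sample: three tuples, two of which share the same first two items
s = pd.Series([("us2", "us1", 1), ("us1", "us4", 4), ("us2", "us1", 2)])

# .str[...] applies the slice to every element, so each tuple is cut to its prefix
keys = s.str[:2]
print(keys.tolist())

# Grouping by the prefix collects the full tuples that share it;
# sort=False keeps groups in order of first appearance
grouped = s.groupby(keys, sort=False).apply(list)
print(grouped.tolist())
```

The first print should show the two-element prefixes and the second the full tuples gathered per prefix, which is the same idea the answer applies together with the `id` and `id2` columns.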
    

    Edit:

    df_test['duplicates'] = df_test.apply(lambda x: [(x['id'], k,v,y) for k,v,y in x.duplicates], 1) 
    
    df_test = df_test.explode('duplicates')
    print (df_test)
      id id2        duplicates
    0  a   b  (a, us2, us1, 1)
    0  a   b  (a, us1, us4, 1)
    0  a   b  (a, us4, us2, 1)
    0  a   b  (a, us2, us5, 1)
    0  a   b  (a, us5, us4, 1)
    0  a   b  (a, us4, us1, 1)
    0  a   b  (a, us1, us2, 1)
    0  a   b  (a, us2, us1, 2)
    0  a   b  (a, us1, us4, 4)
    0  a   b  (a, us4, us2, 1)
    0  a   b  (a, us2, us4, 1)
    0  a   b  (a, us4, us2, 1)
    1  c   b  (c, us1, us2, 1)
    1  c   b  (c, us2, us1, 1)
    1  c   b  (c, us1, us2, 1)
    1  c   b  (c, us2, us4, 1)
    1  c   b  (c, us4, us5, 1)
    2  v   b  (v, us4, us5, 1)
    

    df = (df_test.groupby(['id','id2', df_test['duplicates'].str[1:3]], sort=False)['duplicates']
                  .apply(list)
                  .reset_index(level=2, drop=True)
                  .reset_index())
    print (df)
       id id2                                         duplicates
    0   a   b               [(a, us2, us1, 1), (a, us2, us1, 2)]
    1   a   b               [(a, us1, us4, 1), (a, us1, us4, 4)]
    2   a   b  [(a, us4, us2, 1), (a, us4, us2, 1), (a, us4, ...
    3   a   b                                 [(a, us2, us5, 1)]
    4   a   b                                 [(a, us5, us4, 1)]
    5   a   b                                 [(a, us4, us1, 1)]
    6   a   b                                 [(a, us1, us2, 1)]
    7   a   b                                 [(a, us2, us4, 1)]
    8   c   b               [(c, us1, us2, 1), (c, us1, us2, 1)]
    9   c   b                                 [(c, us2, us1, 1)]
    10  c   b                                 [(c, us2, us4, 1)]
    11  c   b                                 [(c, us4, us5, 1)]
    12  v   b                                 [(v, us4, us5, 1)]
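
If `.str` slicing of tuples misbehaves on a given pandas version, one possible alternative is to build the key with plain Python slicing through `Series.map`. This is only a sketch under assumptions: the small frame `df_demo` and the key name `pair` are made up for illustration, and this variant was not part of the original answer or checked against the older pandas version mentioned in the comments.

```python
import pandas as pd

# Hypothetical miniature of the exploded frame, with 4-element tuples
df_demo = pd.DataFrame({
    "id": ["a", "a", "a", "c"],
    "id2": ["b", "b", "b", "b"],
    "duplicates": [("a", "us2", "us1", 1), ("a", "us1", "us4", 1),
                   ("a", "us2", "us1", 2), ("c", "us1", "us2", 1)],
})

# Ordinary tuple slicing per element: items 1 and 2 are the two user ids
pair = df_demo["duplicates"].map(lambda t: t[1:3]).rename("pair")

out = (df_demo.groupby(["id", "id2", pair], sort=False)["duplicates"]
              .apply(list)
              .reset_index(level="pair", drop=True)  # drop the helper key level
              .reset_index())
print(out)
```

Naming the helper key lets `reset_index` refer to it by name instead of by position, which may be easier to read when more grouping columns are added later.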
    

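    【Comments】:

    • Hi, thank you! One issue remains: when I add this line before your code: df_test['duplicates'] = df_test.apply(lambda x: [(x['id'], k,v,y) for k,v,y in x.duplicates], 1) and then change the slice to .str[1:3], it stops working. Could you explain why, so I can understand the code better and fix this? :)
    • @CatarinaNogueira - It works fine for me, see the edited answer. What is your pandas version?
    • It was not up to date, you were right! Now I am on 1.0.5 and it works fine! Thank you :)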