【Title】:Python: get combinations of unique values from columns of a dataframe
【Posted】:2020-07-11 21:23:55
【Question】:

I have a dataframe like this:

id  a   b   c   d   e
0   a10 a11 a12 a13 a14
1   a10 a21 a12 a23 a24
2   a30 a21 a12 a33 a14
3   a30 a21 a12 a43 a44
4   a10 a51 a12 a53 a14

I want a list of all unique combinations of length 'x' from the dataframe. If the length is 3, some of the combinations would be:

[[a10,a11,a12],[a10,a21,a12],[a10,a51,a12],[a30,a11,a12],[a30,a21,a12],[a30,a51,a12],
[a11,a12,a13],[a11,a12,a23],[a11,a12,a33],[a11,a12,a43],[a11,a12,a53],[a21,a12,a13]....]

There are only two constraints:

1. The length of each combination list must equal 'x'.
2. A single combination may contain at most one value from any one column of the dataframe.

Minimal code to build the dataframe is given below. Any help is much appreciated. Thanks!

import pandas as pd

data_dict = {'a': ['a10', 'a10', 'a30', 'a30', 'a10'],
             'b': ['a11', 'a21', 'a21', 'a21', 'a51'],
             'c': ['a12', 'a12', 'a12', 'a12', 'a12'],
             'd': ['a13', 'a23', 'a33', 'a43', 'a53'],
             'e': ['a14', 'a24', 'a14', 'a44', 'a14']}
df1 = pd.DataFrame(data_dict)

【Comments】:

    Tags: python-3.x pandas dataframe combinations


    【Solution 1】:

    To take the unique values of each column and combine them across every choice of three columns:

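    from itertools import combinations, product
    import numpy as np

    # One list of tuples per 3-column choice; each tuple takes one unique value from each column.
    aa = [list(product(np.unique(df1[col1]),
                       np.unique(df1[col2]),
                       np.unique(df1[col3])))
          for col1, col2, col3 in combinations(df1.columns, 3)]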
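The snippet above hard-codes three columns and returns one nested list per column triple. A minimal sketch that generalizes it to any length x and flattens everything into a single list of lists could look like the following. The function name `column_value_combinations` is hypothetical and not part of the original answer, and the sketch assumes no value appears in more than one column (true for this data); otherwise duplicates could show up.

```python
from itertools import combinations, product

import pandas as pd

data_dict = {'a': ['a10', 'a10', 'a30', 'a30', 'a10'],
             'b': ['a11', 'a21', 'a21', 'a21', 'a51'],
             'c': ['a12', 'a12', 'a12', 'a12', 'a12'],
             'd': ['a13', 'a23', 'a33', 'a43', 'a53'],
             'e': ['a14', 'a24', 'a14', 'a44', 'a14']}
df1 = pd.DataFrame(data_dict)


def column_value_combinations(df, x):
    """Hypothetical helper: pick x distinct columns, then one unique value from each."""
    # Cache the sorted unique values of every column once.
    uniques = {col: sorted(df[col].unique()) for col in df.columns}
    result = []
    for cols in combinations(df.columns, x):
        # Cartesian product over the chosen columns' unique values.
        result.extend(list(t) for t in product(*(uniques[c] for c in cols)))
    return result


combos = column_value_combinations(df1, 3)
print(len(combos))   # 184 for the sample dataframe
print(combos[0])     # ['a10', 'a11', 'a12']
```

Because each combination draws from x different columns, the "at most one value per column" constraint holds by construction.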
    

    Old answer

    First, we flatten your matrix into a 1-D array with ndarray.flatten and take the unique values with np.unique; then we use itertools.combinations:

    from itertools import combinations
    import numpy as np

    a = np.unique(df1.to_numpy().flatten())
    aa = set(combinations(a, 3))
    
    {('a10', 'a11', 'a12'),
     ('a10', 'a11', 'a13'),
     ('a10', 'a11', 'a14'),
     ('a10', 'a11', 'a21'),
     ('a10', 'a11', 'a23'),
     ('a10', 'a11', 'a24'),
     ('a10', 'a11', 'a30'),
     ('a10', 'a11', 'a33'),
     ('a10', 'a11', 'a43'),
     ('a10', 'a11', 'a44'),
     ('a10', 'a11', 'a51'),
     ('a10', 'a11', 'a53'),
     ('a10', 'a12', 'a13'),
     ('a10', 'a12', 'a14'),
     ...
    

    Or, to actually get lists (less efficient):

    from itertools import combinations
    import numpy as np

    a = np.unique(df1.to_numpy().flatten())
    aa = [list(x) for x in set(combinations(a, 3))]
    
    [['a12', 'a33', 'a51'],
     ['a11', 'a12', 'a13'],
     ['a10', 'a11', 'a21'],
     ['a10', 'a23', 'a24'],
     ['a12', 'a14', 'a24'],
     ['a14', 'a43', 'a53'],
     ['a11', 'a21', 'a53'],
     ['a10', 'a12', 'a24'],
     ['a12', 'a21', 'a44'],
     ['a12', 'a30', 'a51'],
     ['a14', 'a23', 'a30'],
     ...
    

    【Comments】:

    • But "['a10', 'a11', 'a21']" has 2 values from column 'b'. There should be no more than 1 value from any single column.
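The objection in this comment can be checked mechanically. Below is a small sketch of such a check; the helper name `satisfies_column_constraint` is made up for illustration and does not come from the thread.

```python
import pandas as pd

data_dict = {'a': ['a10', 'a10', 'a30', 'a30', 'a10'],
             'b': ['a11', 'a21', 'a21', 'a21', 'a51'],
             'c': ['a12', 'a12', 'a12', 'a12', 'a12'],
             'd': ['a13', 'a23', 'a33', 'a43', 'a53'],
             'e': ['a14', 'a24', 'a14', 'a44', 'a14']}
df1 = pd.DataFrame(data_dict)


def satisfies_column_constraint(combo, df):
    """Hypothetical helper: True if combo holds at most one value from each column."""
    chosen = set(combo)
    return all(len(chosen & set(df[col])) <= 1 for col in df.columns)


# 'a11' and 'a21' both come from column 'b', so this combination is rejected.
bad = satisfies_column_constraint(['a10', 'a11', 'a21'], df1)
# One value each from columns 'a', 'b' and 'c', so this one is accepted.
good = satisfies_column_constraint(['a10', 'a11', 'a12'], df1)
print(bad, good)  # False True
```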
    【Solution 2】:

    Use combinations, filtered by sets built from each column, to enforce the second condition:

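    from itertools import combinations
    import numpy as np

    # One set of values per column, used to test the "at most one value per column" rule.
    L = [set(df1[x]) for x in df1]
    # Keep a 3-combination only if it shares fewer than 2 values with every column.
    a = [x for x in combinations(np.unique(df1.values.ravel()), 3)
         if all(len(set(x).intersection(y)) < 2 for y in L)]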
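To see why the filter works, it may help to unroll the comprehension into explicit steps. The sketch below does the same thing for the sample dataframe: it generates every 3-combination of the distinct values, then drops any combination that shares two or more values with a single column. The names `column_sets`, `at_most_one_per_column`, `candidates` and `kept` are illustrative only.

```python
from itertools import combinations

import numpy as np
import pandas as pd

data_dict = {'a': ['a10', 'a10', 'a30', 'a30', 'a10'],
             'b': ['a11', 'a21', 'a21', 'a21', 'a51'],
             'c': ['a12', 'a12', 'a12', 'a12', 'a12'],
             'd': ['a13', 'a23', 'a33', 'a43', 'a53'],
             'e': ['a14', 'a24', 'a14', 'a44', 'a14']}
df1 = pd.DataFrame(data_dict)

# Step 1: one set of values per column.
column_sets = [set(df1[col]) for col in df1.columns]

# Step 2: all distinct values in the whole frame (14 for this data).
all_values = np.unique(df1.to_numpy().ravel())


def at_most_one_per_column(combo):
    """Reject a combination as soon as two of its values fall in the same column."""
    chosen = set(combo)
    for col_values in column_sets:
        if len(chosen & col_values) >= 2:
            return False
    return True


# Step 3: generate every candidate, then filter.
candidates = list(combinations(all_values, 3))
kept = [c for c in candidates if at_most_one_per_column(c)]
print(len(candidates), len(kept))  # 364 184
```

The trade-off is that every candidate is generated before filtering, so the work grows with the number of distinct values rather than with the number of valid combinations.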
    

    【Comments】:

    • Thanks! I am still digesting how it works like a charm, but it is producing good results quickly.