【Question title】: Python: how to count collaborations between pairs in a pandas dataframe?
【Posted】: 2016-04-29 08:47:00
【Question】:

I have a dataframe like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [4, 4, 4, 3, 3, 5, 5, 5, 5]})
df 
Item Name  Weight
A    Tom     4
A    John    4
A    Paul    4
B    Tom     3
B    Frank   3
C    Tom     5
C    John    5
C    Richard 5
C    James   5 

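For each person, I would like the list of people who share an item with them, where each shared item counts as 1/Weight and the contributions are summed per pair: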

df1 
Name              People                          Times
Tom     [John, Paul, Frank, Richard, James]       [(1/4+1/5),1/4,1/3,1/5,1/5]
John    [Tom, Paul, Richard, James]               [(1/4+1/5),1/4,1/5,1/5]
Paul    [Tom, John]                               [1/4,1/4]
Frank   [Tom]                                     [1/3]
Richard [Tom, John, James]                        [1/5,1/5,1/5]
James   [Tom, John, Richard]                      [1/5,1/5,1/5]
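
The rule behind the Times column can be sketched in plain Python. This is only an illustration of the intended arithmetic, assuming each shared item adds 1/Weight to the pair; the `items` dict and the `times` name are made up for the example and restate the table above.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations

# Item -> (weight, members), restating the table above.
items = {
    'A': (4, ['Tom', 'John', 'Paul']),
    'B': (3, ['Tom', 'Frank']),
    'C': (5, ['Tom', 'John', 'Richard', 'James']),
}

# times[p][q] accumulates 1/Weight for every item that p and q share.
times = defaultdict(lambda: defaultdict(Fraction))
for weight, members in items.values():
    for p, q in permutations(members, 2):
        times[p][q] += Fraction(1, weight)

print(dict(times['Tom']))
```

Exact fractions keep a sum such as 1/4 + 1/5 readable, so it can be compared directly with the expected table.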

To count the number of collaborations without taking the weight into account, I did:

# self-merge on Item (many-to-many): one row per pair of people sharing an item
df1 = pd.merge(df, df, on=['Item'])

# drop the self-pairs, where Name_x == Name_y
df1 = df1[~(df1['Name_x'] == df1['Name_y'])]

# collect the collaborators of each person into a list
df1 = df1.groupby('Name_x')['Name_y'].apply(lambda x: x.tolist()).reset_index()
print(df1)
    Name_x                                     Name_y
0    Frank                                      [Tom]
1    James                       [Tom, John, Richard]
2     John           [Tom, Paul, Tom, Richard, James]
3     Paul                                [Tom, John]
4  Richard                         [Tom, John, James]
5      Tom  [John, Paul, Frank, John, Richard, James]


# np.unique returns the distinct names and how often each one occurs
df1['People'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[0])
df1['times'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[1])
# drop the helper column Name_y
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x': 'Name'})
print(df1)
      Name                               People            times
0    Frank                                [Tom]              [1]
1    James                 [John, Richard, Tom]        [1, 1, 1]
2     John          [James, Paul, Richard, Tom]     [1, 1, 1, 2]
3     Paul                          [John, Tom]           [1, 1]
4  Richard                   [James, John, Tom]        [1, 1, 1]
5      Tom  [Frank, James, John, Paul, Richard]  [1, 1, 2, 1, 1]

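The last dataframe gives the number of collaborations for every pair, but I would like the weighted count of their collaborations instead.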

【Comments】:

    Tags: python pandas group-by unique


    【Answer 1】:

    Starting with:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                       'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                       'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})
    
    # self-merge on Item, drop self-pairs, and index the remaining weight by pair
    df1 = pd.merge(df, df, on=['Item'])
    df1 = (df1[df1['Name_x'] != df1['Name_y']]
           .set_index(['Name_x', 'Name_y'])
           .drop(['Item', 'Weight_y'], axis=1))
    

    You can compute the values with .apply() and reshape them to wide format with .unstack():

    collab = df1.groupby(level=['Name_x', 'Name_y']).apply(lambda x: np.sum(1/x)).unstack().loc[:, 'Weight_x']
    
    Name_y      Frank  James  John  Paul  Richard       Tom
    Name_x                                                 
    Frank         NaN    NaN   NaN   NaN      NaN  0.333333
    James         NaN    NaN   0.2   NaN      0.2  0.200000
    John          NaN    0.2   NaN   0.5      0.2  0.700000
    Paul          NaN    NaN   0.5   NaN      NaN  0.500000
    Richard       NaN    0.2   0.2   NaN      NaN  0.200000
    Tom      0.333333    0.2   0.7   0.5      0.2       NaN
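
The same wide table can probably be built without .apply(), by inverting the weight first and letting groupby sum it. This is a sketch of an alternative, not the method used in this answer; it uses the same df as above, and the `pairs` and `inv` names are introduced only for the example.

```python
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank',
                            'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})

# One row per ordered pair of distinct people sharing an item.
pairs = pd.merge(df, df, on='Item')
pairs = pairs[pairs['Name_x'] != pairs['Name_y']]

# Each shared item contributes 1 / Weight to the pair's score.
pairs = pairs.assign(inv=1 / pairs['Weight_x'])

# Sum the contributions per pair, then pivot Name_y into columns.
collab = pairs.groupby(['Name_x', 'Name_y'])['inv'].sum().unstack()
print(collab)
```

Summing a single column returns a Series indexed by the pair, so .unstack() yields the Name_x by Name_y table directly and no .loc[:, 'Weight_x'] selection is needed.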
    

    Then iterate over the rows and convert them to lists:

    df = pd.DataFrame(columns=['People', 'Times'])
    for p, data in collab.iterrows():
        s = data.dropna()
        df.loc[p] = [s.index.tolist(), s.values]
    
                                          People  \
    Frank                                  [Tom]   
    James                   [John, Richard, Tom]   
    John             [James, Paul, Richard, Tom]   
    Paul                             [John, Tom]   
    Richard                   [James, John, Tom]   
    Tom      [Frank, James, John, Paul, Richard]   
    
                                            Times  
    Frank                        [0.333333333333]  
    James                         [0.2, 0.2, 0.2]  
    John                     [0.2, 0.5, 0.2, 0.7]  
    Paul                               [0.5, 0.5]  
    Richard                       [0.2, 0.2, 0.2]  
    Tom      [0.333333333333, 0.2, 0.7, 0.5, 0.2]
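
If the row loop is not wanted, the two list columns can also be collected with a second groupby. This is a hedged sketch under the same data as above; `pairs`, `inv`, `long_form` and `result` are names chosen for the example, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank',
                            'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})

# Ordered pairs of distinct people sharing an item, each worth 1 / Weight.
pairs = pd.merge(df, df, on='Item')
pairs = pairs[pairs['Name_x'] != pairs['Name_y']]
pairs = pairs.assign(inv=1 / pairs['Weight_x'])

# Long format: one row per (Name_x, Name_y) with the summed score.
long_form = pairs.groupby(['Name_x', 'Name_y'])['inv'].sum().reset_index()

# Collect collaborators and their scores into parallel lists per person.
result = long_form.groupby('Name_x').agg(People=('Name_y', list),
                                         Times=('inv', list))
print(result)
```

Staying in long format skips the NaN cells that the wide table has to drop row by row.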
    

    【Comments】:

    • This is what I wanted, but I get the following error