Pandas: Count iterating string occurrences over columns & rows with multiple conditions

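【Question title】: Pandas: Count iterating string occurrences over columns & rows with multiple conditions
【Posted】: 2020-11-09 17:07:03
【Question description】:

I have a data set of academic journal articles. The variable Top Journal is a dummy that equals 1 if the paper was published in a top journal. Publication Month is the numeric month in which the paper was published. author1, author2, etc. are the authors of the paper in that row.

For each author, I want to count the number of previous publications in top journals. That is, I want to count all earlier occurrences of his/her name in any of the authorX columns, but only for papers that were published in a top journal.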

import pandas as pd

df = pd.DataFrame({'Top Journal': [1,0,1],
                  'Publication Year': [2020, 2020, 2020],
                  'Publication Month': [8,8,7],
                  'author1': ['Hendren, Nathaniel', 'Backus, Matthew','Enke, Benjamin'],
                  'author2': ['Sprung-Keyser, Ben', 'Blake, Thomas', 'Hendren, Nathaniel'],
                  'author3': [None,'Larsen, Brad', None ]},
                 index = ['UID1', 'UID2', 'UID3'])

The output should look like this:

Top Journal  Publication Year  Publication Month  author1             author2             author3       previous_publications1  previous_publications2  previous_publications3
1            2020              8                  Hendren, Nathaniel  Sprung-Keyser, Ben  None          1                       0                       None
0            2020              8                  Backus, Matthew     Blake, Thomas       Larsen, Brad  0                       0                       0
1            2020              7                  Enke, Benjamin      Hendren, Nathaniel  None          0                       0                       None

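Important: an author who is listed in author1 in one observation may appear in any other position (e.g. author6) in another observation.

The number of previous top-journal publications should appear in the new columns previous_publications1, previous_publications2, and so on, where the number refers to the respective author. So author 1 of the first paper (Hendren, Nathaniel) has a higher count there than in the third row, where Hendren, Nathaniel appears as author 2 of an earlier paper.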

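【Comments】:

Could you post your expected output?
Of course, sorry. For example, since only "Hendren, Nathaniel" appears in another top journal, the additional columns for the first three rows would look like this: previous_publications1: 1 0 0 previous_publications2: 0 0 0 previous_publications3: None 0 None
Assuming your dataframe is called df, could you run df.to_dict() as well as df.index and copy and paste the output into your question?
Is this what you mean?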

Tags: python pandas dataframe conditional-statements countif

【Solution 1】:

Using the dataframe:

df = pd.DataFrame({'Top Journal': [1,0,1],
                  'Publication Year': [2020, 2020, 2020],
                  'Publication Month': [8,8,7],
                  'author1': ['Hendren, Nathaniel', 'Backus, Matthew','Enke, Benjamin'],
                  'author2': ['Sprung-Keyser, Ben', 'Blake, Thomas', 'Hendren, Nathaniel'],
                  'author3': [None,'Larsen, Brad', None]},
                 index = ['UID1', 'UID2', 'UID3'])

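The layout of the author columns makes wide_to_long a good fit: with author as the stub name, it stacks all three author columns into a single column. That matters because cumcount needs the relevant data in one column before it can count each author's earlier publications cumulatively. From there, unstack(3) moves the fourth index level ('author #') into the column headers, converting the data from long back to the original wide format. Finally, flatten the resulting MultiIndex columns with df.columns = [''.join(col) for col in df.columns] to recover the original column names; for that join to work, the author # labels must first be converted to strings with .rename({1: '1', 2: '2', 3: '3'}, axis=1):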

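# Stack author1..author3 into a single 'author' column, ordered chronologically
df = (pd.wide_to_long(df, stubnames='author', i=['Top Journal', 'Publication Year', 'Publication Month'], j='author #')
        .sort_values(['Publication Year', 'Publication Month']))
# Number each author's appearances in order: 0 for the first, 1 for the second, ...
df['previous_publications'] = df.groupby('author').cumcount()
# Drop empty author slots, then move 'author #' (index level 3) back into the columns
df = df[~df['author'].isnull()].unstack(3).rename({1: '1', 2: '2', 3: '3'}, axis=1).fillna('None')
# Flatten the MultiIndex columns: ('author', '1') -> 'author1'
df.columns = [''.join(col) for col in df.columns]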
df
Out[1]: 
                                                           author1  \
Top Journal Publication Year Publication Month                       
0           2020             8                     Backus, Matthew   
1           2020             7                      Enke, Benjamin   
                             8                  Hendren, Nathaniel   

                                                           author2  \
Top Journal Publication Year Publication Month                       
0           2020             8                       Blake, Thomas   
1           2020             7                  Hendren, Nathaniel   
                             8                  Sprung-Keyser, Ben   

                                                     author3  \
Top Journal Publication Year Publication Month                 
0           2020             8                  Larsen, Brad   
1           2020             7                          None   
                             8                          None   

                                                previous_publications1  \
Top Journal Publication Year Publication Month                           
0           2020             8                                     0.0   
1           2020             7                                     0.0   
                             8                                     1.0   

                                                previous_publications2  \
Top Journal Publication Year Publication Month                           
0           2020             8                                     0.0   
1           2020             7                                     0.0   
                             8                                     0.0   

                                                previous_publications3  
Top Journal Publication Year Publication Month                          
0           2020             8                                     0.0  
1           2020             7                                     NaN  
                             8                                     NaN

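【Discussion】:

Thank you so much - it works really well. Many thanks!!
Hi David, cumcount now counts all previous instances of an author's name, regardless of whether the Top Journal dummy is 1 or 0. I tried to fix it but couldn't get it to work. A similar question is answered here: stackoverflow.com/questions/51018739/… Do you have any ideas?
Among other things, I tried: df['previous_publications'] = df[df["Top Journal"]==1].groupby('author').cumcount() but of course this results in 0 for all instances where the author contributed to a paper with Top Journal = 0.
Fixed it using df['previous_publications'] = df.groupby('author').top_journal.apply(lambda x: x.shift().cumsum())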
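
The fix in the last comment refers to a column named top_journal, which the sample data does not have, and shift().cumsum() leaves NaN rather than 0 for an author's first appearance. As a hedged sketch of the same idea on the sample data (not the commenter's exact code), the following uses melt instead of wide_to_long and subtracts the current paper from a per-author running total. The names uid, slot, long, counts, and result are assumptions made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'Top Journal': [1, 0, 1],
                   'Publication Year': [2020, 2020, 2020],
                   'Publication Month': [8, 8, 7],
                   'author1': ['Hendren, Nathaniel', 'Backus, Matthew', 'Enke, Benjamin'],
                   'author2': ['Sprung-Keyser, Ben', 'Blake, Thomas', 'Hendren, Nathaniel'],
                   'author3': [None, 'Larsen, Brad', None]},
                  index=['UID1', 'UID2', 'UID3'])

author_cols = ['author1', 'author2', 'author3']

# Long format: one row per (paper, author slot); the UID index becomes a column.
long = (df.rename_axis('uid').reset_index()
          .melt(id_vars=['uid', 'Top Journal', 'Publication Year', 'Publication Month'],
                value_vars=author_cols, var_name='slot', value_name='author')
          .dropna(subset=['author'])
          .sort_values(['Publication Year', 'Publication Month'], kind='stable'))

# Running total of top-journal papers per author, minus the current paper,
# leaves only the top-journal papers that came earlier in the sort order.
long['previous_publications'] = (long.groupby('author')['Top Journal'].cumsum()
                                 - long['Top Journal'])

# Back to wide format: previous_publications1, previous_publications2, ...
counts = long.pivot(index='uid', columns='slot', values='previous_publications')
counts.columns = [c.replace('author', 'previous_publications') for c in counts.columns]

# Attach the counts to the original frame; empty author slots stay NaN.
result = df.join(counts)
print(result)
```

On the sample this gives Hendren, Nathaniel a count of 1 in UID1 and 0 in UID3, matching the expected output in the question. Note that papers published in the same month are ordered arbitrarily relative to each other, so a same-month top-journal paper may be counted as "previous"; a finer date column would be needed to resolve such ties.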