In Pandas, how do I create groups by using a function?

【Question title】: In Pandas, how do I create groups by using a function?
【Posted】: 2016-08-25 17:05:46
【Question description】:

I have the following DataFrame:

import numpy as np
import pandas as pd

data = np.random.randn(10, 10)
col = list('ABCDEFGHIJ')
idx = list('ababaaccab')
df = pd.DataFrame(data, columns=col, index=idx)

df

          A         B         C         D         E         F         H  
a -0.104171 -0.872001  1.459766 -0.026101  0.474336  2.032986 -0.795409   
b  0.778402  0.965868  1.672520 -2.463641  1.024571  1.501360  1.047823   
a  0.731303 -1.314826  1.477969 -1.018818  0.539794 -0.108252  0.038276   
b -1.180857 -1.931064 -0.287966 -0.387748 -0.324306  0.146812  0.674937   
a -0.151452  0.387804  0.853088  0.610810  0.091901 -0.246471 -0.677219   
a  1.392482  1.286639 -0.607495  0.682221  0.164414 -0.496787  0.502786   
c  0.039890  0.587645  0.577257 -0.381706 -1.477829  1.165732 -1.877052   
c -1.307827 -0.370028  0.136269 -0.968533  0.830933 -0.025641 -0.497450   
a  0.990024  0.003812 -0.698894  0.674133 -0.176148 -0.184096 -1.449170   
b -1.214920 -1.123358 -0.847955 -0.464895  0.517553 -0.080168 -1.162767

I also use this function from the pandas documentation to classify letters as vowels or consonants:

def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'v'
    else:
        return 'c'

My question is: how do I group by letter type, using the letters in the DataFrame index?

【Comments】:

pandas.pydata.org/pandas-docs/stable/generated/…
What is your expected output?

Tags: python pandas dataframe pandas-groupby

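【Solution 1】:

When you pass a function to groupby, it is called on each index label and the return value is used as the group key. So, if the values are in the index, you can do: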

df.groupby(get_letter_type).sum()
Out[122]: 
          A         B         C         D         E         F         G  \
c  5.504182  3.637560  2.659321  0.558187  0.206418 -1.194616  1.410917   
v  1.132699 -0.768438 -0.183739 -1.353405  1.148394 -0.668739 -1.376241   

          H         I         J  
c  3.388815 -1.086567 -2.223479  
v  0.456455 -0.904328  1.072830
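Because the question's data is random, the sums above cannot be checked by hand. Here is a minimal sketch of the same call on a small fixed frame; the name `small` and its values are made up for illustration and are not from the question.

```python
import pandas as pd

def get_letter_type(letter):
    # Same classification as in the question: vowel -> 'v', anything else -> 'c'
    return 'v' if letter.lower() in 'aeiou' else 'c'

# Hypothetical toy data (not the question's random frame)
small = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]},
                     index=list('abec'))

# groupby calls get_letter_type once per index label:
# a -> v, b -> c, e -> v, c -> c
result = small.groupby(get_letter_type).sum()
print(result)
```

Rows labelled `a` and `e` end up in group `v`, and rows labelled `b` and `c` in group `c`.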

For the more general case, you can use np.vectorize to get a vectorized version of the function:

import numpy as np    
get_letter_type_vectorized = np.vectorize(get_letter_type)

Then call that function with your index as the argument and group by the result (this also works for any input other than the index):

df.groupby(get_letter_type_vectorized(df.index)).sum()
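As a sketch of the "any input other than the index" case, the letters below live in an ordinary column instead of the index. The frame `frame` and its column names `letter` and `value` are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

def get_letter_type(letter):
    return 'v' if letter.lower() in 'aeiou' else 'c'

get_letter_type_vectorized = np.vectorize(get_letter_type)

# Hypothetical frame: the letters are a column, not the index
frame = pd.DataFrame({'letter': list('abec'), 'value': [1, 2, 3, 4]})

# The vectorized function returns a NumPy array with one group key per row
keys = get_letter_type_vectorized(frame['letter'].to_numpy())

# groupby accepts an array of keys as long as it has one entry per row
totals = frame.groupby(keys)['value'].sum()
print(totals)
```

The array of keys is never attached to the frame; groupby only needs it to line up row by row.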

If the dataset is large, you can also try a truly vectorized version of your own using np.where:

df.groupby(np.where(df.index.isin(list("aeiou")), "v", "c")).sum()

np.where returns an array of v's and c's (array(['v', 'c', 'v', 'c', 'v', 'v', 'c', 'c', 'v', 'c'], dtype='<U1')), and the grouping is done on that array.
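One difference worth noting: `isin(list("aeiou"))` is case-sensitive, whereas get_letter_type lower-cases the letter first. The question's index is all lower-case, so the results agree there. The sketch below adds `.str.lower()` as one way to keep the two consistent on mixed-case labels; the frame `small` is made-up data, and the lower-casing step is my addition rather than part of the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with mixed-case labels
small = pd.DataFrame({'A': [1, 2, 3, 4]}, index=list('AbeC'))

vowels = list('aeiou')

# Lower-case the labels first so that 'A' is also treated as a vowel,
# matching what get_letter_type does
is_vowel = small.index.str.lower().isin(vowels)
keys = np.where(is_vowel, 'v', 'c')

result = small.groupby(keys).sum()
print(result)
```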

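【Comments】:

Nice! I didn't know about np.vectorize().
@MaxU I find it very useful, although, as the documentation notes, it is essentially a loop, so it brings no performance benefit. For large datasets, it is better to look for a truly vectorized approach.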
Your df.groupby(np.where(df.index.isin(list("aeiou")), "v", "c")).sum() version is approx. 2.3 times faster, and 2.7 times faster compared to piRSquared's version, on a 100,000-row DF.
Thank you very much.

【Solution 2】:

Maybe you can try something like this:

# Print every index label that get_letter_type classifies as a vowel
for letter in df.index:
    if get_letter_type(letter) == 'v':
        print(letter)

【Comments】:

【Solution 3】:

Setup

import numpy as np
import pandas as pd

np.random.seed(314)
data = np.random.randn(10, 10)
col = list('ABCDEFGHIJ')
idx = list('ababaaccab')
df = pd.DataFrame(data, columns=col, index=idx)


def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'v'
    else:
        return 'c'

Solution

Append the letter types to df.index as a second index level and define a new DataFrame. Then use groupby(level=1):

letter_types = df.index.to_series().apply(get_letter_type)
df_w_letter_types = df.set_index(letter_types, append=True)
letter_type_groupby = df_w_letter_types.groupby(level=1)

Demo

Then you can do whatever you like with the groupby object:

print(letter_type_groupby.sum())

          A         B         C         D         E         F         G  \
c  0.155376 -0.544616 -2.274168 -0.721236 -1.214174  0.663555  2.668149   
v -1.196059 -0.264262 -0.252973  1.178112  0.030117 -0.392086  3.503615   

          H         I         J  
c  2.951569 -3.216444  3.976823  
v -2.790688 -0.343123 -4.346544
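To see what `set_index(..., append=True)` actually builds, here is a minimal sketch on a small fixed frame. The names `small`, `with_types` and `by_type` and the values are made up for illustration.

```python
import pandas as pd

def get_letter_type(letter):
    return 'v' if letter.lower() in 'aeiou' else 'c'

# Hypothetical toy data with repeated index labels, like the question's index
small = pd.DataFrame({'A': [1, 2, 3, 4]}, index=list('abab'))

# One letter type per row, computed from the index labels
letter_types = small.index.to_series().apply(get_letter_type)

# append=True keeps the original labels as level 0 and adds the types as level 1
with_types = small.set_index(letter_types, append=True)
print(with_types.index.tolist())

# level=1 refers to the appended letter-type level
by_type = with_types.groupby(level=1).sum()
print(by_type)
```

The original labels stay available in level 0, which can be handy if you want to group by both levels later.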

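【Comments】:

This is a nice solution, but you don't need to attach the generated column to your df (as an index or as a column) in order to group. You can simply do: df.groupby(letter_types).sum()
@ayhan Yes! I learned that from your answer. Thank you.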
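A quick sketch comparing the two routes discussed in these comments, again on made-up data (`small`, `direct` and `via_level` are illustrative names, not from the thread):

```python
import pandas as pd

def get_letter_type(letter):
    return 'v' if letter.lower() in 'aeiou' else 'c'

# Hypothetical toy data
small = pd.DataFrame({'A': [1, 2, 3, 4]}, index=list('abab'))
letter_types = small.index.to_series().apply(get_letter_type)

# Route 1: pass the Series of keys straight to groupby
direct = small.groupby(letter_types).sum()

# Route 2: append the keys as an index level, then group by that level
via_level = small.set_index(letter_types, append=True).groupby(level=1).sum()

print(direct)
print(via_level)
```

Both produce the same per-type sums here; the first route simply skips building the intermediate frame.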