Get the max of each group when using groupby() on two columns ~Python

【Question title】: Get the max of each group when using groupby() on two columns ~Python
【Posted】: 2017-06-14 22:16:21
【Question】:

I am working with a CSV in a format like the one below (the full CSV is ~500 x ~600,000, so most columns are omitted here):

       Sales  market_id  product_id

0         38   10001516     1132679
1         49   10001516     1138767
2          6   10001516     1132679
     ...        ...         ...
9969  245732    1002123     1383020
9970  247093    1006821     1383020

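and so on. I read it like this: df0=pd.read_csv('all_final_decomps2_small.csv', low_memory=False, encoding='iso8859_15')

I am trying to find, for each market_id, the product_id with the highest sales. To do that I need to sum the sales first, because the same product_id and market_id pair can appear in multiple rows.

I have tried the following, which produces the total sales of each product within each market: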

df_sales=df0[['Sales','market_id','product_id']] 
df_sales.groupby(['market_id', 'product_id'])['Sales'].sum()

This gives (shortened):

market_id  product_id
1006174    1132679             2789
           1382460             4586
           1382691               49
           1383020        269138089
1006638    1132679          5143156
           1382460           387250
           1383020        204456809
10002899   1132679              630
           1382464              220

Using:

df_sales.groupby(['market_id', 'product_id'])['Sales'].sum().max()

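returns the single largest sum across all groups rather than one value per market, so in this case it returns 269138089. I would like to get something like this instead: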

market_id  product_id      max_sales
1006174    1383020        269138089
1006638    1383020        204456809
10002899   1132679              630

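I have tried many different approaches but cannot find anything that helps with this case, so I would appreciate any help (apologies if this has been asked before).

I am using: Python 3.6.1 :: Anaconda 4.4.0 (64-bit)

【Question discussion】:

Tags: python pandas anaconda pandas-groupby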

【Solution 1】:

Use idxmax within a groupby

Setup

import pandas as pd
from io import StringIO

txt = """market_id  product_id         Sales
1006174    1132679             2789
1006174    1382460             4586
1006174    1382691               49
1006174    1383020        269138089
1006638    1132679          5143156
1006638    1382460           387250
1006638    1383020        204456809
10002899   1132679              630
10002899   1382464              220"""


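# squeeze=True was removed in pandas 2.0 and delim_whitespace is deprecated,
# so split on whitespace with sep and select the Sales column to get a Series
sales = pd.read_csv(StringIO(txt), sep=r'\s+', index_col=[0, 1])['Sales']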

Solution

sales.loc[sales.groupby(level=0).idxmax()]

market_id  product_id
1006174    1383020       269138089
1006638    1383020       204456809
10002899   1132679             630
Name: Sales, dtype: int64

Or

sales.loc[sales.groupby(level=0).idxmax()].reset_index(name='max_sales')

   market_id  product_id  max_sales
0    1006174     1383020  269138089
1    1006638     1383020  204456809
2   10002899     1132679        630
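To see why this works, it may help to look at the intermediate result. The sketch below rebuilds the same Series by hand (values copied from the setup above) and prints what idxmax returns for each market: a full (market_id, product_id) label, which .loc can then use to pick one row per market. The names best_labels and top are only illustrative.

```python
import pandas as pd

# Summed sales indexed by (market_id, product_id); values taken from the setup above
sales = pd.Series(
    [2789, 4586, 49, 269138089, 5143156, 387250, 204456809, 630, 220],
    index=pd.MultiIndex.from_tuples(
        [
            (1006174, 1132679), (1006174, 1382460), (1006174, 1382691),
            (1006174, 1383020), (1006638, 1132679), (1006638, 1382460),
            (1006638, 1383020), (10002899, 1132679), (10002899, 1382464),
        ],
        names=["market_id", "product_id"],
    ),
    name="Sales",
)

# For each market, idxmax returns the full index label of the largest value
best_labels = sales.groupby(level=0).idxmax()
print(best_labels.tolist())

# Passing those labels to .loc selects exactly one row per market
top = sales.loc[best_labels]
print(top)
```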

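【Discussion】:

I copied your code and it runs fine and does what I asked. However, I am having trouble creating the DataFrame from a file with this method. The actual CSV is about 10,000 x 500, a smaller version of a 600,000 x 500 file (so I cannot read it in as quoted text). This is my current read line: df0=pd.read_csv('exampledb.csv', low_memory=False, encoding='iso8859_15') Any idea how to change your code so that it works? Trying to set the columns 'market_id' and 'product_id' as the index after creating the DataFrame results in a MultiIndex indexing error.
@ebrithilotho My sales object is what I think you should have in your df_sales... given enough data. You should be able to run my code on your df_sales.
The line df_sales.loc[df_sales.groupby(level=0).idxmax()].reset_index(name='max_sales') is where I run into problems. With my read line it raises: ValueError: Cannot index with multidimensional key. If I change my read line to add index_col=['market_id','product_id'], I get the same error. They are not columns 0 and 1 in the table (they are mixed in with 500 other variables), so I would like to be able to refer to them by name. I know your code works on its own, but I am struggling to apply it to my file. If I have not provided enough data, is there other information I can give that would help?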
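A likely cause of the error in the comment above is that df_sales is a DataFrame rather than a Series: idxmax on a grouped DataFrame returns a DataFrame, and .loc rejects a two-dimensional key. A minimal sketch of one way around this, assuming the data is loaded as an ordinary DataFrame, is to sum Sales into a MultiIndexed Series first and only then apply idxmax. The small frame and the other_col column below are hypothetical stand-ins for the wide CSV.

```python
import pandas as pd

# Hypothetical stand-in for the wide CSV; with a real file, something like
# pd.read_csv(path, usecols=["Sales", "market_id", "product_id"]) would load
# only the needed columns by name
df0 = pd.DataFrame(
    {
        "Sales": [38, 49, 6, 100, 20],
        "other_col": ["a", "b", "c", "d", "e"],
        "market_id": [10001516, 10001516, 10001516, 1002123, 1002123],
        "product_id": [1132679, 1138767, 1132679, 1383020, 1132679],
    }
)

# Sum sales per (market_id, product_id); selecting ['Sales'] yields a Series
# whose index is a (market_id, product_id) MultiIndex
sales = df0.groupby(["market_id", "product_id"])["Sales"].sum()

# idxmax per market now returns a 1-D Series of index labels, which .loc accepts
result = sales.loc[sales.groupby(level="market_id").idxmax()].reset_index(name="max_sales")
print(result)
```

Referring to the index level by name (level="market_id") avoids depending on column positions in the file.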

【Solution 2】:

I somehow managed to get this working. I am not sure it is the best approach, but it works on my data:

df0=pd.read_csv('test.csv', low_memory=False, encoding='iso8859_15')

#Rank all items in each market by total sales
df_sales=df0[['Sales', 'market_id', 'product_id']] # int, int, int

# groups sales by market and product and sums product sales
gr_sales = df_sales.groupby(['market_id', 'product_id'], as_index = False).sum()

# gets the product sales in each market and sorts in order of decreasing sales
gr_sales = gr_sales.groupby('market_id').apply(pd.DataFrame.sort_values, 'Sales', ascending = False)

# Takes the per-column maximum within each market
# (note: product_id and Sales are each maximized independently)
max_sales = gr_sales.groupby('market_id').max()

This gives me:

In[621]: max_sales
Out[621]: 
    market_id  product_id       Sales
0     1006174     1383020   269138089
1     1006638     1383020  1330070614
2     1006678     1383020    58548417
3     1006684     1383020   215858049
4     1006692     1383020    21799689
5     1006732     1383020    58548417
6     1006733     1383020    58548417
7     1006739     1383020   215858049
8     1006819     1383020   605951504
9     1006820     1383020    59083807
10    1006821     1383020    25116872
11    1050511     1382672     6201692
12    1050512     1382672     5468317
13   10001493     1383020    21799689
14   10001516     1383020   204456809
15   10002899     1383020    62413425

and (shortened example):

In[624]: gr_sales
Out[624]: 
               market_id  product_id       Sales
market_id                                       
1006174   11     1006174     1383020   269138089
          9      1006174     1382672     5070111
          5      1006174     1382536     2442639
          7      1006174     1382602     1108361
          6      1006174     1382557      158488
          8      1006174     1382651       17214
          1      1006174     1382460        4586
          0      1006174     1132679        2789
          3      1006174     1382490         799
          2      1006174     1382464         105
          10     1006174     1382691          49
          4      1006174     1382522          16
1006638   28     1006638     1383020  1330070614
          25     1006638     1382672   109679596
          12     1006638     1132679     5143156
          17     1006638     1382536     4885278
          22     1006638     1382620     2668948
          21     1006638     1382602     2216722
          18     1006638     1382538      992228
          13     1006638     1382460      387250
          19     1006638     1382557      316976
          23     1006638     1382651       39616
          26     1006638     1382674       22388
          20     1006638     1382573        7412
          15     1006638     1382490        1598
          14     1006638     1382464         758
          24     1006638     1382665         120
          27     1006638     1382691          98
          16     1006638     1382522          32
1006678   32     1006678     1383020    58548417
                 ...         ...         ...

[117 rows x 3 columns]

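I am not sure how to remove the arbitrary index from the gr_sales output (it sits right in the middle, which is a bit annoying), or from the max_sales table.

【Discussion】: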
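One caveat with the approach above: groupby(...).max() takes the maximum of each column separately, so the reported product_id is the largest product_id in the market, not necessarily the one that achieved the largest Sales. The extra index level in gr_sales comes from groupby(...).apply, which adds the group key to the index (passing group_keys=False to groupby is one way to avoid it). The sketch below, on a small hypothetical frame, shows the mismatch and one possible alternative that keeps rows intact and ends with a clean 0..n-1 index.

```python
import pandas as pd

# Small hypothetical frame; market 2's best seller has the smaller product_id
df_sales = pd.DataFrame(
    {
        "Sales": [5, 7, 1, 50, 3],
        "market_id": [1, 1, 1, 2, 2],
        "product_id": [10, 20, 10, 10, 20],
    }
)

# Sum sales per (market_id, product_id), keeping the keys as ordinary columns
gr_sales = df_sales.groupby(["market_id", "product_id"], as_index=False)["Sales"].sum()

# A column-wise max mixes rows: it pairs the largest product_id with the largest Sales
mixed = gr_sales.groupby("market_id").max()
print(mixed)

# Sorting by Sales and keeping the first row per market keeps each row intact
max_sales = (
    gr_sales.sort_values("Sales", ascending=False)
    .drop_duplicates("market_id")
    .sort_values("market_id")
    .reset_index(drop=True)  # replaces the leftover row labels with 0, 1, ...
)
print(max_sales)
```

Here mixed reports product 20 for market 2 even though product 10 has the higher total, while max_sales keeps product 10.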