How to groupby with custom function in python cuDF? (Answers)

【Question title】: How to groupby with custom function in python cuDF?
【Posted】: 2022-07-31 05:35:57
【Question】:

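I am new to using GPUs for data manipulation and have been struggling to replicate some functionality in cuDF. For instance, I want to get the mode value for each group in the dataset. In pandas this is easily done with a custom function: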

import pandas as pd

df = pd.DataFrame({'group': [1, 2, 2, 1, 3, 1, 2],
                   'value': [10, 10, 30, 20, 20, 10, 30]})

| group | value |
| ----- | ----- |
| 1     | 10    |
| 2     | 10    |
| 2     | 30    |
| 1     | 20    |
| 3     | 20    |
| 1     | 10    |
| 2     | 30    |

def get_mode(customer):
    freq = {}
    for category in customer:
        freq[category] = freq.get(category, 0) + 1
    key = max(freq, key=freq.get)
    return [key, freq[key]]

df.groupby('group').agg(get_mode)

| group | value |
| ----- | ----- |
| 1     | 10    |
| 2     | 30    |
| 3     | 20    |

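However, I can't seem to replicate the same functionality in cuDF. There does seem to be a way to do it, and I have found some examples, but it doesn't work in my case. For example, this is the function I tried to use with cuDF: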

def get_mode(group, mode):
    print(group)
    freq = {}
    for i in range(cuda.threadIdx.x, len(group), cuda.blockDim.x):
        category = group[i]
        freq[category] = freq.get(category, 0) + 1
    mode = max(freq, key=freq.get)
    max_freq = freq[mode]
    
df.groupby('group').apply_grouped(get_mode, incols=['group'],
                                   outcols=dict(mode=np.float64))

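Could someone help me understand what is going wrong here and how to fix it? Trying to run the code above throws the following error (hopefully I managed to put it under a spoiler):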

Error output

TypingError: Failed in cuda mode pipeline (step: nopython frontend)
Failed in cuda mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_get at 0x7fa8f0500710>) found for signature:

>>> impl_get(DictType[undefined,undefined]<iv={}>, int64, Literal[int](0))

There are 2 candidate implementations:
    - Of which 1 did not match due to:
    Overload in function 'impl_get': File: numba/typed/dictobject.py: Line 710.
      With argument(s): '(DictType[undefined,undefined]<iv=None>, int64, int64)':
     Rejected as the implementation raised a specific error:
       TypingError: Failed in nopython mode pipeline (step: nopython frontend)
     non-precise type DictType[undefined,undefined]<iv=None>
     During: typing of argument at /opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py (719)
     
     File "../../opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py", line 719:
         def impl(dct, key, default=None):
             castedkey = _cast(key, keyty)
             ^

raised from /opt/conda/lib/python3.7/site-packages/numba/core/typeinfer.py:1086
    - Of which 1 did not match due to:
    Overload in function 'impl_get': File: numba/typed/dictobject.py: Line 710.
      With argument(s): '(DictType[undefined,undefined]<iv={}>, int64, Literal[int](0))':
     Rejected as the implementation raised a specific error:
       TypingError: Failed in nopython mode pipeline (step: nopython frontend)
     non-precise type DictType[undefined,undefined]<iv={}>
     During: typing of argument at /opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py (719)
     
     File "../../opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py", line 719:
         def impl(dct, key, default=None):
             castedkey = _cast(key, keyty)

During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.DictType'>, 'get') for DictType[undefined,undefined]<iv={}>)
During: typing of call at /tmp/ipykernel_33/2595976848.py (6)


File "../../tmp/ipykernel_33/2595976848.py", line 6:
<source missing, REPL/exec in use?>

During: resolving callee type: type(<numba.cuda.compiler.Dispatcher object at 0x7fa8afe49520>)
During: typing of call at <string> (10)


File "<string>", line 10:
<source missing, REPL/exec in use?>

Tags: rapids cudf

【Solution 1】:

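cuDF builds on top of Numba's CUDA target to enable UDFs. That target does not support using a dictionary inside the UDF, but you can express your use case with built-in operations in pandas or cuDF by combining value_counts and drop_duplicates.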

import pandas as pd

df = pd.DataFrame(
    {
        'group': [1, 2, 2, 1, 3, 1, 2],
        'value': [10, 10, 30, 20, 20, 10, 30]
    }
)

out = (
    df
    .value_counts()
    .reset_index(name="count")
    .sort_values(["group", "count"], ascending=False)
    .drop_duplicates(subset="group", keep="first")
)
print(out[["group", "value"]])
   group  value
4      3     20
1      2     30
0      1     10
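
As a sanity check on the approach above, here is a minimal sketch in plain pandas that compares the built-in pipeline with a dictionary-based mode function like the one in the question. The helper `get_mode_value` is a hypothetical reduction of the asker's `get_mode` that returns only the mode, and the extra `set_index`/`sort_index` steps are added here only to make the two results comparable.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group": [1, 2, 2, 1, 3, 1, 2],
        "value": [10, 10, 30, 20, 20, 10, 30],
    }
)


def get_mode_value(values):
    # Hypothetical helper: the question's dictionary-based logic,
    # reduced to return only the most frequent value.
    freq = {}
    for v in values:
        freq[v] = freq.get(v, 0) + 1
    return max(freq, key=freq.get)


# Per-group mode via a Python UDF (works in pandas, not in a cuDF UDF).
expected = df.groupby("group")["value"].agg(get_mode_value)

# Per-group mode via built-in operations only, following the answer.
builtin = (
    df.value_counts()
    .reset_index(name="count")
    .sort_values(["group", "count"], ascending=False)
    .drop_duplicates(subset="group", keep="first")
    .set_index("group")["value"]
    .sort_index()
)

same = builtin.to_dict() == expected.to_dict()
print(same)
```

The sample data has no ties inside a group. When two values are equally frequent, the two approaches may pick different winners, so the comparison should not be read as a general equivalence.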

【Discussion】:

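Do I understand correctly that cuDF has not yet implemented value_counts(), or am I confusing it with something? The solution works fine in pandas, but when I try to apply it with cuDF it gives "AttributeError: DataFrame object has no attribute value_counts".
It looks like DataFrame.value_counts was implemented recently in the current development branch. You can use it by installing the current 22.08 nightly from rapids.ai/start.html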
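
For releases where DataFrame.value_counts is missing, one possible workaround is to count the (group, value) pairs with groupby and size instead. The sketch below is written and checked against pandas only. Whether the same chain runs unchanged on a given cuDF version (in particular `reset_index(name=...)` on the result of `size()`) is an assumption to verify locally.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group": [1, 2, 2, 1, 3, 1, 2],
        "value": [10, 10, 30, 20, 20, 10, 30],
    }
)

# Count rows per (group, value) pair without DataFrame.value_counts.
counts = (
    df.groupby(["group", "value"])
    .size()
    .reset_index(name="count")
)

# Within each group, put the most frequent value first, then keep one row per group.
modes = (
    counts.sort_values(["group", "count"], ascending=[True, False])
    .drop_duplicates(subset="group", keep="first")
    .reset_index(drop=True)
)
print(modes)
```

This variant also keeps the `count` column, so each row carries both the mode and its frequency, which is what the question's original `get_mode` returned as a pair.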