How to pass more than one parameter to the map function in pandas: answers

【Question title】: How to pass more than one parameter to map function in pandas
【Posted】: 2014-04-06 15:59:48
【Question】:

I have the following DataFrame:

mn = pd.DataFrame({'fld1': [2.23, 4.45, 7.87, 9.02, 8.85, 3.32, 5.55],'fld2': [125000, 350000,700000, 800000, 200000, 600000, 500000],'lType': ['typ1','typ2','typ3','typ1','typ3','typ1','typ2'], 'counter': [100,200,300,400,500,600,700]})

and these mapping functions:

def getTag(rangeAttribute):
    sliceDef = {'tag1': [1, 4], 'tag2': [4, 6], 'tag3': [6, 9],
                'tag4': [9, 99]}
    for sl in sliceDef.keys():
        bounds = sliceDef[sl]
        if ((float(rangeAttribute) >= float(bounds[0]))
            and (float(rangeAttribute) <= float(bounds[1]))):
            return sl


def getTag1(rangeAttribute):
    sliceDef = {'100-150': [100000, 150000],
                '150-650': [150000, 650000],
                '650-5M': [650000, 5000000]}
    for sl in sliceDef.keys():
        bounds = sliceDef[sl]
        if ((float(rangeAttribute) >= float(bounds[0]))
            and (float(rangeAttribute) <= float(bounds[1]))):
            return sl

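I want to compute sums grouped by the tags derived from fld1 and fld2. At the moment I have to write a separate function with hard-coded values for each type of field, because map only accepts a function of one argument. Is there an alternative to map that would also let me pass sliceDef as an input parameter?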

mn.groupby([mn['fld1'].map(getTag),mn['fld2'].map(getTag1),'lType'] ).sum()

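【Comments】:

As far as I know, map operates on each element of a Series. If you want to apply an operation that needs several arguments row by row, you can use apply and set axis=1, like so: mn.apply(lambda row: getTag(row), axis=1). Inside getTag you can then select the columns like this: row['fld1'] and row['fld2']. That should achieve what you want.
You may also be interested in pd.cut, e.g. pd.cut(mn.fld1, [1, 4, 6, 9, 99], right=False). It is not exactly the form you are looking for, but in my experience it is very handy.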

Tags: python map pandas

【Solution 1】:

Instead of using map, you can use pd.cut (thanks to DSM and Jeff for pointing this out):

import numpy as np
import pandas as pd

mn = pd.DataFrame(
    {'fld1': [2.23, 4.45, 7.87, 9.02, 8.85, 3.32, 5.55],
     'fld2': [125000, 350000, 700000, 800000, 200000, 600000, 500000],
     'lType': ['typ1', 'typ2', 'typ3', 'typ1', 'typ3', 'typ1', 'typ2'],
     'counter': [100, 200, 300, 400, 500, 600, 700]})

result = mn.groupby(
    [pd.cut(mn['fld1'], [1,4,6,9,99], labels=['tag1', 'tag2', 'tag3', 'tag4']),
     pd.cut(mn['fld2'], [100000, 150000, 650000, 5000000],
            labels=['100-150', '150-650', '650-5M']),
     'lType'],
    observed=True,  # keep only tag combinations that occur (needs pandas >= 0.23)
).sum()

print(result)

which yields (the exact column order and index labels may vary with the pandas version):

                    counter   fld1    fld2
             lType                        
tag1 100-150 typ1       100   2.23  125000
     150-650 typ1       600   3.32  600000
tag2 150-650 typ2       900  10.00  850000
tag3 150-650 typ3       500   8.85  200000
     650-5M  typ3       300   7.87  700000
tag4 650-5M  typ1       400   9.02  800000

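This will be faster than map, which calls getTag or getTag1 once for every value in the Series. pd.cut instead uses np.searchsorted to return all of the bin indices in a single call. In addition, searchsorted locates each value with an O(log n) binary search written in C, rather than an O(n) loop over the bins written in Python.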
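
To illustrate the idea, here is a minimal sketch of binning with np.searchsorted directly. It is not a reproduction of pd.cut's internals; it assumes every value lies inside the overall range (1, 99], and the variable names (edges, labels, values) are made up for this example.

```python
import numpy as np

# Bin edges and labels taken from the pd.cut call in the answer.
edges = np.array([1, 4, 6, 9, 99])
labels = np.array(['tag1', 'tag2', 'tag3', 'tag4'])
values = np.array([2.23, 4.45, 7.87, 9.02, 8.85, 3.32, 5.55])

# One vectorized call finds the bin of every value.
# side='left' returns i such that edges[i-1] < v <= edges[i],
# which corresponds to right-closed bins (a, b].
idx = np.searchsorted(edges, values, side='left')

# Assumption: all values fall inside (1, 99]; out-of-range values
# would need separate handling (pd.cut returns NaN for them).
tags = labels[idx - 1]
print(tags.tolist())
```

The whole Series is tagged without a Python-level loop, which is where the speed-up comes from.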

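A subtle point: the keys returned by sliceDef.keys() are not guaranteed to come back in any particular order. With Python 3 versions before 3.7 the order can even change from one run to the next (since 3.7, dicts preserve insertion order). Your criterion uses fully closed intervals: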

    if ((float(rangeAttribute) >= float(bounds[0]))
        and (float(rangeAttribute) <= float(bounds[1]))):

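So if rangeAttribute falls exactly on one of the values in bounds, it matches two intervals, and which key is tested first determines the result.

On Python versions without a guaranteed dict order, your current code is therefore non-deterministic for boundary values.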
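
A small sketch of the boundary problem, assuming Python 3.7+ so that a dict iterates in insertion order. The two dicts (forward, backward) are hypothetical and only simulate two different key orders:

```python
def getTag(rangeAttribute, sliceDef):
    # Same closed-interval test as in the question, with sliceDef passed in.
    for sl in sliceDef.keys():
        lo, hi = sliceDef[sl]
        if lo <= float(rangeAttribute) <= hi:
            return sl

# Identical intervals, different key order.
forward = {'tag1': [1, 4], 'tag2': [4, 6]}
backward = {'tag2': [4, 6], 'tag1': [1, 4]}

# 4 lies on the shared boundary, so the first key tested wins.
print(getTag(4, forward))   # tag1
print(getTag(4, backward))  # tag2
```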

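pd.cut uses half-open intervals (right-closed, (a, b], by default), so each value inside the overall range falls into exactly one category, which avoids this problem.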
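
As a quick check of how boundary values are handled, the sketch below cuts two values that sit exactly on bin edges, once with the default right=True and once with right=False. The Series s is made up for this example.

```python
import pandas as pd

s = pd.Series([4.0, 6.0])  # both values lie exactly on a bin edge
bins = [1, 4, 6, 9, 99]
labels = ['tag1', 'tag2', 'tag3', 'tag4']

# Default right=True: bins are (1, 4], (4, 6], (6, 9], (9, 99]
right_closed = pd.cut(s, bins, labels=labels)

# right=False: bins are [1, 4), [4, 6), [6, 9), [9, 99)
left_closed = pd.cut(s, bins, labels=labels, right=False)

print(right_closed.tolist())  # ['tag1', 'tag2']
print(left_closed.tolist())   # ['tag2', 'tag3']
```

Either way each edge value gets a single, predictable tag; the right argument only decides which neighbouring bin owns the edge.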

To answer the general question: yes, there is a way to pass extra arguments. Use apply instead of map (thanks to Andy Hayden for pointing this out):

import numpy as np
import pandas as pd

def getTag(rangeAttribute, sliceDef):
    for sl in sliceDef.keys():
        bounds = sliceDef[sl]
        if ((float(rangeAttribute) >= float(bounds[0]))
            and (float(rangeAttribute) <= float(bounds[1]))):
            return sl

sliceDef = {'tag1': [1, 4], 'tag2': [4, 6], 'tag3': [6, 9],
            'tag4': [9, 99]}
sliceDef1 = {'100-150': [100000, 150000],
            '150-650': [150000, 650000],
            '650-5M': [650000, 5000000]}

mn = pd.DataFrame(
    {'fld1': [2.23, 4.45, 7.87, 9.02, 8.85, 3.32, 5.55],
     'fld2': [125000, 350000, 700000, 800000, 200000, 600000, 500000],
     'lType': ['typ1', 'typ2', 'typ3', 'typ1', 'typ3', 'typ1', 'typ2'],
     'counter': [100, 200, 300, 400, 500, 600, 700]})

result = mn.groupby([mn['fld1'].apply(getTag, args=(sliceDef,)),
                     mn['fld2'].apply(getTag, args=(sliceDef1,)),
                     'lType']).sum()
print(result)

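For this particular problem, though, I would not recommend apply: pd.cut is faster, easier to use, and avoids the problem of non-deterministic dict key order. Still, knowing that apply can accept extra positional arguments may help you in the future.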
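
If you would rather keep map, the extra argument can also be bound beforehand. The following is only a sketch of three equivalent spellings (functools.partial, a lambda, and keyword arguments forwarded by Series.apply); the variable names via_partial, via_lambda and via_kwargs are invented for this example.

```python
from functools import partial

import pandas as pd


def getTag(rangeAttribute, sliceDef):
    # Return the first tag whose closed interval contains the value.
    for sl, (lo, hi) in sliceDef.items():
        if lo <= float(rangeAttribute) <= hi:
            return sl


sliceDef = {'tag1': [1, 4], 'tag2': [4, 6], 'tag3': [6, 9], 'tag4': [9, 99]}
fld1 = pd.Series([2.23, 4.45, 7.87, 9.02, 8.85, 3.32, 5.55])

# map only passes the element, so bind sliceDef first.
via_partial = fld1.map(partial(getTag, sliceDef=sliceDef))
via_lambda = fld1.map(lambda v: getTag(v, sliceDef))

# Series.apply forwards extra keyword arguments to the function.
via_kwargs = fld1.apply(getTag, sliceDef=sliceDef)

print(via_partial.tolist())
```

All three call getTag once per element in Python, so they share the performance drawback described above.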

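【Comments】:

Looking back at this, I think you can use .apply(getTag, args=(sliceDef,)) rather than using partial.
@AndyHayden: Wow, I didn't know you could do that! Thanks.
I think this is basically what pd.cut does (you can then group by the returned intervals).
@Jeff and @DSM: Thanks. I've changed the answer to use pd.cut instead of np.searchsorted.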