Preventing row iteration in Python Pandas

【Question title】: Preventing row iteration in Python Pandas
【Posted】: 2021-11-06 14:19:14
【Question】:

I have converted the following file into a pandas df:

https://www.fca.org.uk/publication/data/position-limits-contract-names-vpc.xlsx

I have converted the rows that are relevant (to me) into a dictionary. The dict has the format {principal: [spot, aggregate, set(product codes)]}. I used the following code to build it:

from collections import defaultdict

ifeu_dict = defaultdict(lambda: [0, 0, set()])


for (_, row) in df.iterrows():
        if row.loc["Venue MIC"] == "IFEU":
            ifeu_dict[row.loc["Principal Venue Product Code"]][2].add(row.loc["Venue Product Codes"])
            if type(row.loc["Spot month single limit#"]) == int:
                # no need for append as default is to create a dict
                ifeu_dict[row.loc["Principal Venue Product Code"]][0] = row.loc["Spot month single limit#"]
                ifeu_dict[row.loc["Principal Venue Product Code"]][1] = row.loc["Other month limit#"]
            if type(row.loc["Spot month single limit#"]) == str:
                try:
                    val = int(str(row.loc["Spot month single limit#"]).split()[0].replace(",", ""))
                    val_2 = int(str(row.loc["Other month limit#"]).split()[0].replace(",", ""))
                    ifeu_dict[row.loc["Principal Venue Product Code"]][0] = val
                    ifeu_dict[row.loc["Principal Venue Product Code"]][1] = val_2
                except ValueError:
                    pass

However, this is really inefficient, so I have been trying to change the way I build this dictionary.

One attempt is as follows:

ifeu_dict_2 = defaultdict(lambda: [0, 0, set()])

ifeu_mask = df["Venue MIC"] == "IFEU"
ifeu_df = df.loc[ifeu_mask]
spot_mask_int = ifeu_df["Spot month single limit#"].apply(type) == int


def spot_transform(x):
    try:
        return int(str(x).split()[0].replace(",", ""))
    except ValueError:
        return


ifeu_df["Spot month single limit#"] = ifeu_df.loc[~spot_mask_int, "Spot month single limit#"].apply(spot_transform)
ifeu_df["Other month limit#"] = ifeu_df.loc[~spot_mask_int, "Other month limit#"].apply(spot_transform)
spot_mask_int = ifeu_df["Spot month single limit#"].apply(type) == int

Then I tried:

temp_df = [~spot_mask_int, ["Principal Venue Product Code", "Spot month single limit#", "Other month limit#"]]
ifeu_dict_2[temp_df.loc["Principal Venue Product Code"]][0] = temp_df.loc["Spot month single limit#"]

# this gives me AttributeError: 'list' object has no attribute 'loc'

Or:

ifeu_dict_2[ifeu_df.loc[spot_mask_int, "Principal Venue Product Code"]][2].add(ifeu_df.loc["Venue Product Codes"])
ifeu_dict_2[ifeu_df.loc[spot_mask_int, "Principal Venue Product Code"]][0] = ifeu_df.loc[spot_mask_int, "Spot month single limit#"]
ifeu_dict_2[ifeu_df.loc[spot_mask_int, "Principal Venue Product Code"]][1] = ifeu_df.loc[spot_mask_int, "Other month limit#"]

# this gives me TypeError: 'Series' objects are mutable, thus they cannot be hashed

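I have been stuck for a long time and don't know how to proceed. Any help would be much appreciated, whether an answer or a useful link! (For links: I am new to coding, so examples help me.)

If you want to play with the df: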

Index(['Commodity Derivative Name\n(including associated contracts)',
       'Venue MIC', 'Name of Trading Venue', 'Venue Product Codes',
       'Principal Venue Product Code', 'Spot month single limit#',
       'Other month limit#', 'Conversion Factor', 'Unit of measurement',
       'Definition of spot month'],
      dtype='object')

API2 Rotterdam Coal Average Price Options (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RCA,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
Gasoil Diff - Gasoil 50ppm FOB Rotterdam Barges vs Low Sulphur Gasoil 1st Line Future,IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ULH,ULH,2500,2500,nan,Lots,Calendar Month
Marine Fuel 0.5% FOB Rotterdam Barges (Platts) Future,IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,MF3,MF3,2500,2500,nan,Lots,Calendar Month
API2 Rotterdam Coal (supporting Cal 1x Options),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATC,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal (supporting Qtr 1x Options),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATQ,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Cal 1x Options (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATD,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Early (122 days) Single Expiry Option (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RDE,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Early (214 days) Single Expiry Option (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RDF,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Early (305 days) Single Expiry Option (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RDG,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Futures,IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATW,ATW,5,550 (24.9%),38,800 (20.5%),nan,Lots,Calendar Month
API2 Rotterdam Coal Options (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RCO,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Qtr 1x Options (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATH,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month

An entry in the finished dictionary should look like this:

ATW = [5550, 38800, {'ATH', 'ATC', 'RDF', 'ATQ', 'RCA', 'ATD', 'RCO', 'RDG', 'RDE', 'ATW'}]

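【Comments】:

What makes you say this is inefficient and that you need to do it differently? Does it take longer to run than your requirements allow?
@scign My manager basically said so... he would prefer that I not use iterrows, since there are no dependencies between the rows. It also genuinely takes longer in terms of script run time.
Convert the df to a numpy array and iterate over that; you will have to work out the indices of the columns you use.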
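
The NumPy suggestion in the last comment could look roughly like the sketch below. The three-row frame is made-up stand-in data, and `codes_by_principal` is a hypothetical name; only the column names come from the Index output in the question. Selecting the needed columns in a fixed order before converting means each array row can be unpacked by position.

```python
from collections import defaultdict

import pandas as pd

# Tiny stand-in frame (invented rows); column names follow the question's Index.
df = pd.DataFrame(
    {
        "Venue MIC": ["IFEU", "IFEU", "XXXX"],
        "Venue Product Codes": ["RCA", "ATW", "ZZZ"],
        "Principal Venue Product Code": ["ATW", "ATW", "ZZZ"],
    }
)

# Fix the column order once, then iterate over the plain array instead of
# building a Series for every row as iterrows does.
cols = ["Venue MIC", "Venue Product Codes", "Principal Venue Product Code"]
arr = df[cols].to_numpy()

codes_by_principal = defaultdict(set)
for mic, code, principal in arr:
    if mic == "IFEU":
        codes_by_principal[principal].add(code)
```

This is still a Python-level loop, so it mainly avoids the per-row overhead of iterrows rather than removing iteration altogether.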

Tags: python python-3.x pandas performance dictionary

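【Solution 1】:

Having looked at the data, I understand it now. The data contains multiple codes per product, and you ultimately need a dict with one entry per group of codes. Your approach goes row by row, but a more efficient way is to use the DataFrame.groupby method and process each group in one go.

The code below should be more efficient than going row by row.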

# Column names are written as in the question's Index output (no trailing
# spaces); adjust them if your copy of the spreadsheet differs.
df_ifeu = df[df['Venue MIC'] == 'IFEU']

ifeu_dict = {}
for principal, g in df_ifeu.groupby('Principal Venue Product Code'):
    # find the row where the product code is the same as the principal code
    pr = g['Venue Product Codes'] == principal
    # get the values for the principal (assumes such a row exists in the group)
    spot_val = g.loc[pr, 'Spot month single limit#'].iloc[0]
    other_val = g.loc[pr, 'Other month limit#'].iloc[0]
    # get the codes
    codes = set(g['Venue Product Codes'])
    # add the product to the dict
    ifeu_dict[principal] = [spot_val, other_val, codes]

# confirm we have one dict entry per principal product code
assert len(ifeu_dict) == df_ifeu['Principal Venue Product Code'].nunique()
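
The loop above stores the limits exactly as they appear in the sheet, so a value such as "5,550 (24.9%)" would stay a string. A minimal sketch of combining the groupby idea with the parsing from the question follows; `parse_limit` is a hypothetical helper, the four-row frame is stand-in data modelled on the sample rows, and taking the first parseable limit in each group (falling back to 0) is an assumption rather than the answerer's method.

```python
import pandas as pd


def parse_limit(value):
    """Return the leading integer of a limit like '5,550 (24.9%)', else None."""
    try:
        return int(str(value).split()[0].replace(",", ""))
    except (ValueError, IndexError):
        return None


# Stand-in rows modelled on the sample data in the question.
df = pd.DataFrame(
    {
        "Venue MIC": ["IFEU"] * 4,
        "Venue Product Codes": ["RCA", "ATW", "ATC", "ULH"],
        "Principal Venue Product Code": ["ATW", "ATW", "ATW", "ULH"],
        "Spot month single limit#": [
            "Aggregated with Principal", "5,550 (24.9%)",
            "Aggregated with Principal", 2500,
        ],
        "Other month limit#": [
            "Aggregated with Principal", "38,800 (20.5%)",
            "Aggregated with Principal", 2500,
        ],
    }
)

df_ifeu = df[df["Venue MIC"] == "IFEU"]

ifeu_dict = {}
for principal, g in df_ifeu.groupby("Principal Venue Product Code"):
    # Keep only the limits that parse to a number; text such as
    # "Aggregated with Principal" becomes None and is dropped.
    spot = g["Spot month single limit#"].map(parse_limit).dropna()
    other = g["Other month limit#"].map(parse_limit).dropna()
    ifeu_dict[principal] = [
        int(spot.iloc[0]) if not spot.empty else 0,
        int(other.iloc[0]) if not other.empty else 0,
        set(g["Venue Product Codes"]),
    ]
```

With this stand-in data the ATW entry comes out in the `[spot, aggregate, set(product codes)]` shape the question asks for.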

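【Comments】:

Thank you for taking the time to answer, much appreciated! I'm still a little unsure of the way to go, but your answer looks like what I should be aiming for: filter the data, then build the dictionary/products from the filtered dataframe. I have nearly finished my code (I had to prioritise some other things) and will share what I have too. I'm guessing that timing how long the code takes to run is probably the most appropriate testing mechanism?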
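
On the timing question, one possible way to compare the two approaches is the standard-library `timeit` module, checking first that both produce the same result. The sketch below uses synthetic data with invented sizes and codes, and `with_iterrows` / `with_groupby` are hypothetical names covering only the set-of-codes part of the task; actual timings will depend on the machine and the real file.

```python
import timeit
from collections import defaultdict

import pandas as pd

# Synthetic stand-in data; the size and the code values are made up.
n = 1_000
df = pd.DataFrame(
    {
        "Venue MIC": ["IFEU"] * n,
        "Venue Product Codes": [f"C{i}" for i in range(n)],
        "Principal Venue Product Code": [f"P{i % 50}" for i in range(n)],
    }
)


def with_iterrows(frame):
    """Row-by-row version, as in the question."""
    out = defaultdict(set)
    for _, row in frame.iterrows():
        if row["Venue MIC"] == "IFEU":
            out[row["Principal Venue Product Code"]].add(row["Venue Product Codes"])
    return dict(out)


def with_groupby(frame):
    """Filter once, then handle each principal code as a group."""
    ifeu = frame[frame["Venue MIC"] == "IFEU"]
    grouped = ifeu.groupby("Principal Venue Product Code")["Venue Product Codes"]
    return {principal: set(codes) for principal, codes in grouped}


# Correctness first: both versions must agree before speed matters.
assert with_iterrows(df) == with_groupby(df)

for fn in (with_iterrows, with_groupby):
    seconds = min(timeit.repeat(lambda: fn(df), number=2, repeat=3))
    print(f"{fn.__name__}: {seconds:.4f} s for 2 runs")
```

Taking the minimum over several repeats reduces noise from other processes; comparing the outputs for equality is what actually tests the rewrite, while the timing only shows whether it was worth doing.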