【Question title】:Preventing row iteration in Python Pandas
【Posted】:2021-11-06 14:19:14
【Question description】:

I have converted the following file into a pandas df:

https://www.fca.org.uk/publication/data/position-limits-contract-names-vpc.xlsx

I have converted the rows that are relevant to me into a dictionary. The dict has the format {principal: [spot, aggregate, set(product codes)]}. I used the following code to build it:

from collections import defaultdict

ifeu_dict = defaultdict(lambda: [0, 0, set()])

for (_, row) in df.iterrows():
    if row.loc["Venue MIC"] == "IFEU":
        ifeu_dict[row.loc["Principal Venue Product Code"]][2].add(row.loc["Venue Product Codes"])
        if type(row.loc["Spot month single limit#"]) == int:
            # no need to create the entry first: the defaultdict supplies [0, 0, set()]
            ifeu_dict[row.loc["Principal Venue Product Code"]][0] = row.loc["Spot month single limit#"]
            ifeu_dict[row.loc["Principal Venue Product Code"]][1] = row.loc["Other month limit#"]
        if type(row.loc["Spot month single limit#"]) == str:
            try:
                # e.g. "5,550 (24.9%)" -> 5550
                val = int(str(row.loc["Spot month single limit#"]).split()[0].replace(",", ""))
                val_2 = int(str(row.loc["Other month limit#"]).split()[0].replace(",", ""))
                ifeu_dict[row.loc["Principal Venue Product Code"]][0] = val
                ifeu_dict[row.loc["Principal Venue Product Code"]][1] = val_2
            except ValueError:
                # e.g. "Aggregated with Principal": leave the defaults in place
                pass

However, this is really inefficient, so I have been trying to change how I build this dictionary.

One attempt is as follows:

ifeu_dict_2 = defaultdict(lambda: [0, 0, set()])

ifeu_mask = df["Venue MIC"] == "IFEU"
ifeu_df = df.loc[ifeu_mask]
spot_mask_int = ifeu_df["Spot month single limit#"].apply(type) == int


def spot_transform(x):
    try:
        return int(str(x).split()[0].replace(",", ""))
    except ValueError:
        return


ifeu_df["Spot month single limit#"] = ifeu_df.loc[~spot_mask_int, "Spot month single limit#"].apply(spot_transform)
ifeu_df["Other month limit#"] = ifeu_df.loc[~spot_mask_int, "Other month limit#"].apply(spot_transform)
spot_mask_int = ifeu_df["Spot month single limit#"].apply(type) == int

I then tried:

temp_df = [~spot_mask_int, ["Principal Venue Product Code", "Spot month single limit#", "Other month limit#"]]
ifeu_dict_2[temp_df.loc["Principal Venue Product Code"]][0] = temp_df.loc["Spot month single limit#"]

# this gives me AttributeError: 'list' object has no attribute 'loc'

Or:

ifeu_dict_2[ifeu_df.loc[spot_mask_int, "Principal Venue Product Code"]][2].add(ifeu_df.loc["Venue Product Codes"])
ifeu_dict_2[ifeu_df.loc[spot_mask_int, "Principal Venue Product Code"]][0] = ifeu_df.loc[spot_mask_int, "Spot month single limit#"]
ifeu_dict_2[ifeu_df.loc[spot_mask_int, "Principal Venue Product Code"]][1] = ifeu_df.loc[spot_mask_int, "Other month limit#"]

# this gives me TypeError: 'Series' objects are mutable, thus they cannot be hashed

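I have been stuck for a long time and don't know how to proceed. Any help would be much appreciated, whether it is an answer or a useful link! (Regarding links: I am new to coding, so examples help me.)

If you want to play with the df: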

Index(['Commodity Derivative Name\n(including associated contracts)',
       'Venue MIC', 'Name of Trading Venue', 'Venue Product Codes',
       'Principal Venue Product Code', 'Spot month single limit#',
       'Other month limit#', 'Conversion Factor', 'Unit of measurement',
       'Definition of spot month'],
      dtype='object')

API2 Rotterdam Coal Average Price Options (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RCA,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
Gasoil Diff - Gasoil 50ppm FOB Rotterdam Barges vs Low Sulphur Gasoil 1st Line Future,IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ULH,ULH,2500,2500,nan,Lots,Calendar Month
Marine Fuel 0.5% FOB Rotterdam Barges (Platts) Future,IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,MF3,MF3,2500,2500,nan,Lots,Calendar Month
API2 Rotterdam Coal (supporting Cal 1x Options),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATC,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal (supporting Qtr 1x Options),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATQ,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Cal 1x Options (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATD,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Early (122 days) Single Expiry Option (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RDE,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Early (214 days) Single Expiry Option (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RDF,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Early (305 days) Single Expiry Option (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RDG,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Futures,IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATW,ATW,5,550 (24.9%),38,800 (20.5%),nan,Lots,Calendar Month
API2 Rotterdam Coal Options (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,RCO,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month
API2 Rotterdam Coal Qtr 1x Options (Futures Style Margin),IFEU,INTERCONTINENTAL EXCHANGE - ICE FUTURES EUROPE,ATH,ATW,Aggregated with Principal,Aggregated with Principal,nan,Lots,Calendar Month

An entry in the finished dictionary should look like this:

ATW = [5550, 38800, {'ATH', 'ATC', 'RDF', 'ATQ', 'RCA', 'ATD', 'RCO', 'RDG', 'RDE', 'ATW'}]

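【Comments】:

  • What makes you say this is inefficient and that you need to do it differently? Does it take longer than your requirements allow?
  • @scign My manager basically said so... He would prefer that I not use iterrows, since there are no dependencies between the rows. It also does take longer in terms of script run time.
  • Convert the df to a NumPy array and iterate over that; you will need to work out the positions of the columns you use beforehand.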
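
The last comment could be sketched roughly as follows. This is only an illustration under assumptions: the three-row frame is made-up sample data using the column names from the question, and `codes_by_principal` is a hypothetical name. It collects only the code sets, not the limits.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample rows; the column names are taken from the question.
df = pd.DataFrame(
    {
        "Venue MIC": ["IFEU", "IFEU", "XXXX"],
        "Venue Product Codes": ["ATW", "RCA", "ZZZ"],
        "Principal Venue Product Code": ["ATW", "ATW", "ZZZ"],
    }
)

cols = ["Venue MIC", "Venue Product Codes", "Principal Venue Product Code"]
# Look up the column positions once, before the loop.
mic_i, code_i, principal_i = (df.columns.get_loc(c) for c in cols)

codes_by_principal = defaultdict(set)
# Iterate over a plain NumPy array instead of building a Series per row.
for row in df.to_numpy():
    if row[mic_i] == "IFEU":
        codes_by_principal[row[principal_i]].add(row[code_i])
```

This still loops in Python, so it avoids the per-row overhead of iterrows rather than the iteration itself.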

Tags: python python-3.x pandas performance dictionary


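【Solution 1】:

Having looked at the data, I understand it now. The data contains multiple codes per product, and you ultimately need a dict with one entry per group of codes. Your approach goes row by row, but a more efficient way is to use the DataFrame.groupby method and process each group in one go.

The code below should be more efficient than going row by row.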

# column names as listed in the question's Index output (no trailing spaces)
df_ifeu = df[df['Venue MIC'] == 'IFEU']

ifeu_dict = {}
for principal, g in df_ifeu.groupby('Principal Venue Product Code'):
    # find where the product code is the same as the principal code
    pr = g['Venue Product Codes'] == principal
    # get the (raw, unparsed) limit values from the principal's own row
    spot_val = g.loc[pr, 'Spot month single limit#'].iloc[0]
    other_val = g.loc[pr, 'Other month limit#'].iloc[0]
    # get the codes
    codes = set(g['Venue Product Codes'])
    # add the product to the dict
    ifeu_dict[principal] = [spot_val, other_val, codes]

# confirm we have one dict entry per principal product code
assert len(ifeu_dict) == df_ifeu['Principal Venue Product Code'].nunique()
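
The groupby code stores the limit cells as they appear in the sheet, for example "5,550 (24.9%)", whereas the question expects plain integers. One possible way to add that step is sketched below. It is not part of the original answer. The helper `parse_limit` and the four-row sample frame are hypothetical, and the sketch assumes every principal code has exactly one row of its own.

```python
import pandas as pd


def parse_limit(value):
    """Return the leading number of a limit cell as an int, or None."""
    try:
        # "5,550 (24.9%)" -> "5,550" -> 5550
        return int(str(value).split()[0].replace(",", ""))
    except (ValueError, IndexError):
        # e.g. "Aggregated with Principal", NaN or an empty cell
        return None


# Hypothetical sample rows modelled on the data in the question.
df = pd.DataFrame(
    {
        "Venue MIC": ["IFEU"] * 4,
        "Venue Product Codes": ["ATW", "RCA", "ATC", "ULH"],
        "Principal Venue Product Code": ["ATW", "ATW", "ATW", "ULH"],
        "Spot month single limit#": [
            "5,550 (24.9%)", "Aggregated with Principal",
            "Aggregated with Principal", 2500,
        ],
        "Other month limit#": [
            "38,800 (20.5%)", "Aggregated with Principal",
            "Aggregated with Principal", 2500,
        ],
    }
)

ifeu = df[df["Venue MIC"] == "IFEU"]

# One set of product codes per principal code.
codes = ifeu.groupby("Principal Venue Product Code")["Venue Product Codes"].apply(set)

# The limits live on the rows whose product code equals the principal code.
# Assumption: there is exactly one such row per principal.
is_principal = ifeu["Venue Product Codes"] == ifeu["Principal Venue Product Code"]
principal_rows = ifeu[is_principal].set_index("Principal Venue Product Code")

ifeu_dict = {
    principal: [
        parse_limit(principal_rows.at[principal, "Spot month single limit#"]),
        parse_limit(principal_rows.at[principal, "Other month limit#"]),
        code_set,
    ]
    for principal, code_set in codes.items()
}
```

If a principal code has no row of its own, the `.at` lookup raises a KeyError, so real data may need a fallback for that case.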

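【Comments】:

  • Thanks for taking the time to answer, much appreciated! I am still a little unsure of the way to go, but your answer looks like what I should be aiming for: filter the data, then build the dict/products from the filtered DataFrame. I have almost finished my code (I had to prioritise some other things) and will share what I have. I am guessing that timing how long the code takes to run is probably the most appropriate test?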
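
On the timing question, a minimal sketch using the standard-library timeit module might look like this. The sample frame and the two functions `with_iterrows` and `with_mask` are hypothetical stand-ins for the real approaches. The sketch first checks that both produce the same result, since a speed comparison only means something if the outputs match.

```python
import timeit

import pandas as pd

# Hypothetical data: 1000 rows, half of them on the IFEU venue.
df = pd.DataFrame(
    {
        "Venue MIC": ["IFEU", "XXXX"] * 500,
        "Venue Product Codes": [f"C{i}" for i in range(1000)],
    }
)


def with_iterrows(frame):
    """Collect the IFEU product codes row by row."""
    return {
        row["Venue Product Codes"]
        for _, row in frame.iterrows()
        if row["Venue MIC"] == "IFEU"
    }


def with_mask(frame):
    """Collect the IFEU product codes with a boolean mask."""
    return set(frame.loc[frame["Venue MIC"] == "IFEU", "Venue Product Codes"])


# Check correctness before comparing speed.
assert with_iterrows(df) == with_mask(df)

# Total seconds for 5 runs of each approach.
t_iter = timeit.timeit(lambda: with_iterrows(df), number=5)
t_mask = timeit.timeit(lambda: with_mask(df), number=5)
print(f"iterrows: {t_iter:.4f}s  mask: {t_mask:.4f}s")
```

The absolute numbers depend on the machine and the pandas version, so it is best to compare the two timings from the same run.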