【Title】: Pandas `.loc` with multiple assignment causing a noticeable slowdown
【Posted】: 2019-02-11 14:46:18
【Question】:

I have some code that runs noticeably slower when I use a multiple assignment instead of assigning over several lines, e.g.

Fast:

onset = pitch_df.loc[idx, 'onset_time']
dur = pitch_df.loc[idx, 'duration']

Slow:

onset, dur = pitch_df.loc[idx, ['onset_time', 'duration']]

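Is there an obvious reason for this, or is there a more idiomatic pandas way to do what I'm doing? I want to assign to local names here to make my code more readable (i.e. I don't want to write .loc[...] everywhere).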

Here is a minimal working example (the difference here is about 4x):

import pandas as pd
import numpy as np
from timeit import timeit

df = pd.DataFrame(
    {'onset_time': [0, 0, 1, 2, 3, 4], 
     'pitch': [61, 60, 60, 61, 60, 60],
     'duration': [4, 1, 1, 0.5, 0.5, 2]}
).sort_values(['onset_time', 'pitch']).reset_index(drop=True)

def foo():
    for pitch, pitch_df in df.groupby('pitch'):
        for iloc in range(len(pitch_df)):
            idx = pitch_df.index[iloc]
            onset = pitch_df.loc[idx, 'onset_time']
            dur = pitch_df.loc[idx, 'duration']
            note_off = onset + dur

def bar():
    for pitch, pitch_df in df.groupby('pitch'):
        for iloc in range(len(pitch_df)):
            idx = pitch_df.index[iloc]
            onset, dur = pitch_df.loc[idx, ['onset_time', 'duration']]
            note_off = onset + dur

print(f'foo time: {timeit(foo, number=100)}')
print(f'bar time: {timeit(bar, number=100)}')


【Comments】:

  • You could also try .at instead of .loc to access a single cell; it should be faster.

Tags: python pandas time


【Answer 1】:

As Poolka mentioned in a comment on your question, .at has less overhead when you want scalar access. I'm no Python expert, but here is a solution that may work for you:

def foo2():
    for pitch, pitch_df in df.groupby('pitch'):
        for iloc in range(len(pitch_df)):
            idx = pitch_df.index[iloc]
            onset, dur = (pitch_df.at[idx, x] for x in ('onset_time', 'duration'))
            note_off = onset + dur

Timings in seconds (timeit, number=100):

foo time: 0.12590176300000167
bar time: 0.47044453300077294
foo2 time: 0.12269815599938738
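
A likely reason for the gap, offered as a sketch rather than a profile of pandas internals: selecting a list of columns with .loc builds a new Series on every call (here also upcasting the integer onset_time to float so both values share one dtype), whereas .at returns a plain scalar. The frame below mirrors the one in the question; the names row and cell are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'onset_time': [0, 0, 1, 2, 3, 4],
     'pitch': [61, 60, 60, 61, 60, 60],
     'duration': [4, 1, 1, 0.5, 0.5, 2]}
)

# A list of column labels makes .loc return a Series, i.e. a new
# object is constructed on every call inside the loop.
row = df.loc[0, ['onset_time', 'duration']]
print(type(row).__name__, row.dtype)

# A single row label plus a single column label yields a scalar;
# no intermediate Series is built.
cell = df.at[0, 'onset_time']
print(type(cell).__name__)
```

Unpacking onset, dur from row then iterates over that freshly built Series, which adds a little more work on top of its construction.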

【Comments】:

  • Great, thanks. I've split it over 2 lines (rather than using the comprehension), just to make it a touch clearer. Cheers.
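
The two-line variant described in this comment would presumably look like the sketch below. The commenter's actual code was not posted, so the function name foo3 and the returned list note_offs are assumptions added purely so the example can be run and checked.

```python
import pandas as pd

df = pd.DataFrame(
    {'onset_time': [0, 0, 1, 2, 3, 4],
     'pitch': [61, 60, 60, 61, 60, 60],
     'duration': [4, 1, 1, 0.5, 0.5, 2]}
).sort_values(['onset_time', 'pitch']).reset_index(drop=True)

def foo3():
    # Hypothetical name; collects results only so they can be inspected.
    note_offs = []
    for pitch, pitch_df in df.groupby('pitch'):
        for idx in pitch_df.index:
            # One scalar lookup per line, using .at instead of .loc.
            onset = pitch_df.at[idx, 'onset_time']
            dur = pitch_df.at[idx, 'duration']
            note_offs.append(onset + dur)
    return note_offs

print(foo3())
```

Iterating over pitch_df.index directly replaces the range(len(...)) plus index[iloc] lookup from the question; the labels visited are the same.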