【Question title】: Pandas: Faster method than rollforward?
【Posted】: 2016-03-14 15:57:26
【Question description】:

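I am preparing some data for cohort analysis. The information I have is similar to the fake dataset that can be generated with the code below: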

import random
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# prepare some fake data to build frames
subscription_prices = [x - 0.05 for x in range(100, 500, 25)]
companies = ['initech','ingen','weyland','tyrell']
starting_periods = ['2014-12-10','2015-1-15','2014-11-20','2015-2-9']

# use the lists from above to create a fake dataset
pieces = []
for company, period in zip(companies,starting_periods):
    data = {
        'company': company,
        'revenue': random.choice(subscription_prices),
        'invoice_date': pd.date_range(period,periods=12,freq='31D')
    }
    frame = DataFrame(data)
    pieces.append(frame)
df = pd.concat(pieces, ignore_index=True)

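I need to normalize the invoice dates to a monthly basis. For a number of reasons, it is best to shift all invoice_date values to the end of the month. I used this approach: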

from pandas.tseries.offsets import MonthEnd
df['rev_period'] = df['invoice_date'].apply(lambda x: MonthEnd(normalize=True).rollforward(x))

However, even with only one million rows (the size of my actual dataset), this becomes very slow:

In [11]: %time df['invoice_date'].apply(lambda x: MonthEnd(normalize=True).rollforward(x))
CPU times: user 3min 11s, sys: 1.44 s, total: 3min 12s
Wall time: 3min 17s

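The important property of this pandas date-offset approach is that if invoice_date already falls on the last day of the month, the date stays on that day. Another benefit is that the dtype remains datetime, whereas df['invoice_date'].apply(lambda x: x.strftime('%Y-%m')) is faster but converts the values to str.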
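To make the two properties above concrete, here is a minimal sketch on two hand-picked timestamps. The variable names (mid_month, month_end, and so on) are made up for illustration and are not part of the original dataset; both timestamps are at midnight, like the dates in the fake data.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

offset = MonthEnd(normalize=True)

# Illustrative timestamps (assumed values, both at midnight)
mid_month = pd.Timestamp('2015-01-15')
month_end = pd.Timestamp('2015-01-31')

# A mid-month date is rolled forward to the last day of its month
rolled_mid = offset.rollforward(mid_month)

# A date already on the last day of the month is left where it is
rolled_end = offset.rollforward(month_end)

# strftime produces a plain Python string, so the datetime type is lost
as_text = mid_month.strftime('%Y-%m')

print(rolled_mid, rolled_end, type(as_text).__name__)
```

Both rolled values are still Timestamp objects, while as_text is a str.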

Is there a vectorized way to do this? I tried MonthEnd(normalize=True).rollforward(df['invoice_date']), but got the error TypeError: Cannot convert input to Timestamp.

【Comments】:

    Tags: python pandas


    【Solution 1】:

    Yes, there is:

    df['rev_period'] = df['invoice_date'] + pd.offsets.MonthEnd(0)
    

    This should be at least an order of magnitude faster.
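As a quick sanity check, the sketch below compares the vectorized addition with the row-by-row rollforward from the question on a few made-up dates (the sample values and variable names are assumptions for illustration). It only checks that the results agree, not how fast each one runs.

```python
import pandas as pd

# Small illustrative sample: mid-month dates plus dates already on a month end
invoice_dates = pd.Series(
    pd.to_datetime(['2015-01-15', '2015-01-31', '2015-02-09', '2016-02-29'])
)

# Vectorized: an offset with n=0 rolls forward to the month end,
# leaving dates that are already on a month end unchanged
vectorized = invoice_dates + pd.offsets.MonthEnd(0)

# Row-by-row version from the question, for comparison
row_by_row = invoice_dates.apply(
    lambda x: pd.offsets.MonthEnd(normalize=True).rollforward(x)
)

# Element-wise comparison of the two results
same = bool((vectorized == row_by_row).all())
print(vectorized)
print(same)
```

One difference to keep in mind: the addition keeps any time-of-day component, while the question's version uses normalize=True. If the real data carries times, calling .dt.normalize() on the column before adding the offset is one way to drop them first.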

    【Discussion】:
