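[Posted]: 2016-05-05 02:17:37
[Question]:
I have a CSV file of customer purchases, in no particular order, which I read into a Pandas DataFrame. For each purchase, I want to add a column showing how much time has passed since that customer's previous purchase, i.e. grouped by customer. I'm not sure where the differences are coming from, but they are far too large (even if they are in seconds).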
CSV:
Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015
Python:
import pandas as pd
import time
start = time.time()
data = pd.read_csv('data.csv', low_memory=False)
data = data.sort_values(by=['Customer Id', 'Purchase Date'])
data['Purchase Date'] = pd.to_datetime(data['Purchase Date'])
data['Purchase Difference'] = (data.groupby(['Customer Id'])['Purchase Date']
                               .diff()
                               .fillna('-')
                               )
print data
Output:
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                   -
5         2322    2015-02-01    2678400000000000
4         2322    2015-03-01    2419200000000000
0         4543    2015-01-01                   -
1         4543    2015-02-05    3024000000000000
2         4543    2015-03-15    3283200000000000
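The large integers in this output are consistent with the timedeltas having been turned into raw nanosecond counts, possibly as a side effect of filling a timedelta column with the string '-'. That reading is an assumption and is not stated in the post. A minimal check of the arithmetic, using values copied from the output:

```python
import pandas as pd

# Values copied from the output above; assumed to be nanosecond counts.
raw = [2678400000000000, 2419200000000000, 3024000000000000]

# Interpret each integer as nanoseconds and read off the whole days.
days = [pd.Timedelta(value, unit='ns').days for value in raw]
print(days)  # [31, 28, 35]
```

Under that assumption the numbers are already the expected 31, 28 and 35 days, only expressed in a different unit.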
Desired output:
   Customer Id Purchase Date Purchase Difference
3         2322    2015-01-01                   -
5         2322    2015-02-01             31 days
4         2322    2015-03-01             28 days
0         4543    2015-01-01                   -
1         4543    2015-02-05             35 days
2         4543    2015-03-15             38 days
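One possible way to get differences in days is sketched below. It is not a confirmed answer from this thread. It assumes the dates are month/day/year, reads the sample rows from an in-memory string (`csv_text` is a stand-in for `data.csv`), and leaves the first purchase of each customer as `NaT` instead of '-' so the column keeps its timedelta dtype. It is written for Python 3, whereas the question uses the Python 2 `print` statement.

```python
import io
import pandas as pd

# Sample rows from the question; csv_text stands in for data.csv.
csv_text = u"""Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015
"""

data = pd.read_csv(io.StringIO(csv_text))

# Parse the dates before sorting, so rows are ordered chronologically
# rather than as strings (assumed format: month/day/year).
data['Purchase Date'] = pd.to_datetime(data['Purchase Date'], format='%m/%d/%Y')
data = data.sort_values(by=['Customer Id', 'Purchase Date'])

# diff() within each customer gives a timedelta column; the first
# purchase of each customer has no predecessor and stays NaT.
data['Purchase Difference'] = data.groupby('Customer Id')['Purchase Date'].diff()

print(data)
```

Sorting on the raw strings happens to work for the sample rows, but a date such as 10/1/2015 would sort before 2/1/2015 as text, which is why the sketch converts first. If a '-' placeholder is required for display, the column could be converted to strings at the very end, at the cost of losing the timedelta dtype.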
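[Comments]:
-
Is the last dataframe really your desired output, or is that the one where the differences are too large?
-
@IanS Thanks. That's not what I meant. I've corrected the question.
Tags: python python-2.7 pandas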