【Question title】:Iterating over date range between two pandas dataframes for category count
【Posted】:2015-12-23 10:45:30
【Question】:

I have two pandas dataframes (df1 and df2):

df1 has 12 columns, of which a1, a2, ..., a9 are empty. Here is a sample of df1:

Stock Start_Date          End_Date        a1 a2 a3 a4 .... a9
A   09-12-2015 20:04    10-12-2015 23:04                
B   09-12-2015 10:04    09-12-2015 20:14                
A   11-12-2015 00:22    11-12-2015 08:04                
C   08-12-2015 06:56    10-12-2015 20:54                

df2 has 4 columns. Here is a sample:

Stock   date_time     Opening   closing
A   09-12-2015 21:24    144.3   10
A   09-12-2015 21:27    225.51  24
B   09-12-2015 10:20    134.42  11
A   09-12-2015 20:04    231.22  17
B   09-12-2015 10:24    399.55  32
A   09-12-2015 20:04    246.77  21
B   09-12-2015 14:22    76.23   8
C   08-12-2015 09:44    232.22  15
C   09-12-2015 20:04    222.91  12
A   11-12-2015 02:06    93.21   7
B   09-12-2015 20:04    211.36  26
C   09-12-2015 20:04    111.21  8
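
For anyone who wants to reproduce the sample, here is a minimal sketch that builds both frames and parses the timestamps. The day-first format string (dd-mm-yyyy) is an assumption, since the question does not say which order the dates are in; swap `%d` and `%m` if they are month-first. The empty a1..a9 columns are left out.

```python
import pandas as pd

# Sample rows copied from the question (a1..a9 omitted because they start empty).
df1 = pd.DataFrame({
    "Stock": ["A", "B", "A", "C"],
    "Start_Date": ["09-12-2015 20:04", "09-12-2015 10:04",
                   "11-12-2015 00:22", "08-12-2015 06:56"],
    "End_Date": ["10-12-2015 23:04", "09-12-2015 20:14",
                 "11-12-2015 08:04", "10-12-2015 20:54"],
})
df2 = pd.DataFrame({
    "Stock": ["A", "A", "B", "A", "B", "A", "B", "C", "C", "A", "B", "C"],
    "date_time": ["09-12-2015 21:24", "09-12-2015 21:27", "09-12-2015 10:20",
                  "09-12-2015 20:04", "09-12-2015 10:24", "09-12-2015 20:04",
                  "09-12-2015 14:22", "08-12-2015 09:44", "09-12-2015 20:04",
                  "11-12-2015 02:06", "09-12-2015 20:04", "09-12-2015 20:04"],
    "Opening": [144.3, 225.51, 134.42, 231.22, 399.55, 246.77,
                76.23, 232.22, 222.91, 93.21, 211.36, 111.21],
    "closing": [10, 24, 11, 17, 32, 21, 8, 15, 12, 7, 26, 8],
})

# Assumption: dates are day-first (dd-mm-yyyy). An explicit format avoids guessing.
fmt = "%d-%m-%Y %H:%M"
for col in ("Start_Date", "End_Date"):
    df1[col] = pd.to_datetime(df1[col], format=fmt)
df2["date_time"] = pd.to_datetime(df2["date_time"], format=fmt)
```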

Now, I want the output, df1, to look like this:

Stock   Start_Date       End_Date          a1   a2  a3  a4 ....a9
A   09-12-2015 20:04    10-12-2015 23:04    0   2   2   0      0
B   09-12-2015 10:04    09-12-2015 20:14    1   1   2   0      0
A   11-12-2015 00:22    11-12-2015 08:04    1   0   0   0      0
C   08-12-2015 06:56    10-12-2015 20:54    0   0   0   1      0

That is, for each combination of Stock, Start_Date and End_Date in df1, the result should hold the count of df2 rows in each category that fall within that datetime range.

In the final output, a1 = count[opening(0-100) & closing(0-10)], a2 = count[opening(101-200) & closing(11-20)], a3 = count[opening(201-400) & closing(21-50)], a4 = count[opening(0-100) & closing(11-20)], and so on for all 9 combinations.
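
The nine categories are the 3 x 3 grid of opening bands and closing bands, so one way to count them all at once is to bin both columns with `pd.cut`. The sketch below is only an illustration under assumptions: the bin edges are taken as contiguous right-closed intervals (so an Opening of 100.5 lands in the 101-200 band), and the cells are labelled by band rather than a1..a9, because the question only spells out the first four of the nine names.

```python
import pandas as pd

# A few rows in the shape of df2 from the question.
df2 = pd.DataFrame({
    "Opening": [144.3, 225.51, 76.23, 93.21, 134.42],
    "closing": [10, 24, 8, 7, 11],
})

# Assumed edges: right-closed bins (0, 100], (100, 200], (200, 400].
# include_lowest=True also admits an exact 0 into the first bin.
open_bin = pd.cut(df2["Opening"], bins=[0, 100, 200, 400],
                  labels=["0-100", "101-200", "201-400"], include_lowest=True)
close_bin = pd.cut(df2["closing"], bins=[0, 10, 20, 50],
                   labels=["0-10", "11-20", "21-50"], include_lowest=True)

# observed=False keeps band pairs that never occur, so the table is always 3 x 3.
counts = (df2.groupby([open_bin, close_bin], observed=False)
             .size()
             .unstack(fill_value=0))
print(counts)
```

Each cell of `counts` corresponds to one of the nine categories; mapping cells to the names a1..a9 is left to whoever knows the intended order.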

I have R code for this, but it performs poorly on larger datasets. Can anyone help me do this in python/pandas? Any help is appreciated!

【Question comments】:

    Tags: python python-2.7 pandas


    【Solution 1】:

    You can try this solution. I removed the empty columns from df1, but it also works with them:

    #merge dataframes by Stock, select datetimes between start and end
    df = df1.merge(df2,on='Stock', how='left')
    df = df[(df.date_time >= df.Start_Date) & (df.date_time <= df.End_Date)]
    #remove column date_time
    df = df.drop(['date_time'], axis=1)
    print(df)
    #   Stock          Start_Date            End_Date  Opening  closing
    #0      A 2015-09-12 20:04:00 2015-10-12 23:04:00   144.30       10
    #1      A 2015-09-12 20:04:00 2015-10-12 23:04:00   225.51       24
    #2      A 2015-09-12 20:04:00 2015-10-12 23:04:00   231.22       17
    #3      A 2015-09-12 20:04:00 2015-10-12 23:04:00   246.77       21
    #5      B 2015-09-12 10:04:00 2015-09-12 20:14:00   134.42       11
    #6      B 2015-09-12 10:04:00 2015-09-12 20:14:00   399.55       32
    #7      B 2015-09-12 10:04:00 2015-09-12 20:14:00    76.23        8
    #8      B 2015-09-12 10:04:00 2015-09-12 20:14:00   211.36       26
    #13     A 2015-11-12 00:22:00 2015-11-12 08:04:00    93.21        7
    #14     C 2015-08-12 06:56:00 2015-10-12 20:54:00   232.22       15
    #15     C 2015-08-12 06:56:00 2015-10-12 20:54:00   222.91       12
    #16     C 2015-08-12 06:56:00 2015-10-12 20:54:00   111.21        8
    
    #values to new columns by conditions - cast boolean to integers
    df['a1'] = ((df.Opening.between(0,100)) & (df.closing.between(0,10))).astype(int)
    df['a2'] = ((df.Opening.between(100,200)) & (df.closing.between(11,20))).astype(int)
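    #add the remaining columns (a3 ... a9) in the same way as a1 and a2
    #note: between() is inclusive on both ends, so an Opening of exactly 100 matches both a1 and a2
    print(df)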
    #   Stock          Start_Date            End_Date  Opening  closing  a1  a2
    #0      A 2015-09-12 20:04:00 2015-10-12 23:04:00   144.30       10   0   0
    #1      A 2015-09-12 20:04:00 2015-10-12 23:04:00   225.51       24   0   0
    #2      A 2015-09-12 20:04:00 2015-10-12 23:04:00   231.22       17   0   0
    #3      A 2015-09-12 20:04:00 2015-10-12 23:04:00   246.77       21   0   0
    #5      B 2015-09-12 10:04:00 2015-09-12 20:14:00   134.42       11   0   1
    #6      B 2015-09-12 10:04:00 2015-09-12 20:14:00   399.55       32   0   0
    #7      B 2015-09-12 10:04:00 2015-09-12 20:14:00    76.23        8   1   0
    #8      B 2015-09-12 10:04:00 2015-09-12 20:14:00   211.36       26   0   0
    #13     A 2015-11-12 00:22:00 2015-11-12 08:04:00    93.21        7   1   0
    #14     C 2015-08-12 06:56:00 2015-10-12 20:54:00   232.22       15   0   0
    #15     C 2015-08-12 06:56:00 2015-10-12 20:54:00   222.91       12   0   0
    #16     C 2015-08-12 06:56:00 2015-10-12 20:54:00   111.21        8   0   0
    
    #groupby and sum rows
    df= df.groupby(['Stock', 'Start_Date', 'End_Date']).sum()
    df = df.drop(['Opening', 'closing'], axis=1)
    print(df.reset_index())
    #  Stock          Start_Date            End_Date  a1  a2
    #0     A 2015-09-12 20:04:00 2015-10-12 23:04:00   0   0
    #1     A 2015-11-12 00:22:00 2015-11-12 08:04:00   1   0
    #2     B 2015-09-12 10:04:00 2015-09-12 20:14:00   1   1
    #3     C 2015-08-12 06:56:00 2015-10-12 20:54:00   0   0
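
One thing to watch with the approach above: the date filter runs after the left merge, so a df1 row with no df2 rows inside its range disappears from the result instead of showing zeros. A possible way around that, sketched here under assumptions, is to compute the counts and then left-join them back onto df1. Stock "D" and the third df2 row are made-up data for illustration, and only a1 and a2 are computed.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Stock": ["B", "D"],  # "D" is a hypothetical stock with no rows in df2
    "Start_Date": pd.to_datetime(["2015-12-09 10:04", "2015-12-09 10:04"]),
    "End_Date": pd.to_datetime(["2015-12-09 20:14", "2015-12-09 20:14"]),
})
df2 = pd.DataFrame({
    "Stock": ["B", "B", "B"],
    "date_time": pd.to_datetime(["2015-12-09 10:20", "2015-12-09 14:22",
                                 "2015-12-09 23:00"]),  # last row is out of range
    "Opening": [134.42, 76.23, 50.0],
    "closing": [11, 8, 5],
})

keys = ["Stock", "Start_Date", "End_Date"]

# Same steps as the answer: merge on Stock, keep rows inside the range, flag categories.
m = df1.merge(df2, on="Stock", how="left")
m = m[m["date_time"].between(m["Start_Date"], m["End_Date"])].copy()
m["a1"] = (m["Opening"].between(0, 100) & m["closing"].between(0, 10)).astype(int)
m["a2"] = (m["Opening"].between(100, 200) & m["closing"].between(11, 20)).astype(int)
counts = m.groupby(keys, as_index=False)[["a1", "a2"]].sum()

# Left-join the counts back so ranges with no matching rows show 0 instead of vanishing.
result = df1.merge(counts, on=keys, how="left").fillna({"a1": 0, "a2": 0})
result[["a1", "a2"]] = result[["a1", "a2"]].astype(int)
print(result)
```

Note that the merge on Stock alone builds every df1/df2 pair per stock before filtering, which can get large on big inputs.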
    

    【Comments】:

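    • Thanks, it works great. One more thing: what if I have another column (double or float) in df1? Is it possible to get it in the final output by changing `how` in the merge?
    • I think `on` in the merge function is what does the matching. There is a better illustrated example here. `df = df1.merge(df2, on='Stock', how='left')` is the same as `df = pd.merge(df1, df2, on='Stock', how='left')`.
    • Thanks, I appreciate the help. It works like the merge function in R. Is there a way to add another column (with values) that exists in df1 to the final output?
    • Hmm, I think if you merge df1 and df2 by Stock, you get all the other columns from df1, so you will get that column too. Try it.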
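
To illustrate the last point: the merge does carry extra df1 columns along, but the later `groupby(...).sum()` would add the column up once per matching df2 row unless it is part of the group key. A minimal sketch, where `weight` is a hypothetical extra float column and the data is made up:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Stock": ["A"],
    "Start_Date": pd.to_datetime(["2015-12-09 20:04"]),
    "End_Date": pd.to_datetime(["2015-12-10 23:04"]),
    "weight": [0.5],  # hypothetical extra float column
})
df2 = pd.DataFrame({
    "Stock": ["A", "A"],
    "date_time": pd.to_datetime(["2015-12-09 21:24", "2015-12-09 21:27"]),
    "Opening": [44.3, 25.51],
    "closing": [10, 4],
})

m = pd.merge(df1, df2, on="Stock", how="left")  # same as df1.merge(df2, ...)
m = m[(m["date_time"] >= m["Start_Date"]) & (m["date_time"] <= m["End_Date"])].copy()
m["a1"] = (m["Opening"].between(0, 100) & m["closing"].between(0, 10)).astype(int)

# "weight" is part of the group key, so it is carried through unchanged
# rather than being summed once per matching df2 row.
out = m.groupby(["Stock", "Start_Date", "End_Date", "weight"], as_index=False)["a1"].sum()
print(out)
```

By default `groupby` drops rows whose key is NaN, so if the extra column can be missing, pass `dropna=False` (available in pandas 1.1 and later) or fill it first.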