【Question title】: Can I use date index to create dummies in pandas?
【Posted】: 2017-08-21 16:10:15
【Question description】:

I have been looking into whether dummies can be created from a date index in pandas, but have not found anything yet.

I have a df that is indexed by date:

                        dew    temp   
date
2010-01-02 00:00:00      129.0  -16     
2010-01-02 01:00:00      148.0  -15     
2010-01-02 02:00:00      159.0  -11     
2010-01-02 03:00:00      181.0   -7      
2010-01-02 04:00:00      138.0   -7   
...

I know I can turn date into a column with

df.reset_index(level=0, inplace=True)

and then use something like this to create the dummy,

df['main_hours'] = np.where((df['date'] >= '2010-01-02 03:00:00') & (df['date'] <= '2010-01-02 05:00:00'), 1, 0)

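However, I would like to create the dummy variable on the fly from the date index, without making date a column. Is there a way to do this in pandas? Any suggestions would be appreciated.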

【Comments on the question】:

What is your expected output? Do you want the dummy based on the time only, or on the date as well?

Tags: python pandas indexing dummy-variable

【Solution 1】:

IIUC：

df['main_hours'] = \
    np.where((df.index  >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00'),
             1,
             0)

Or:

In [8]: df['main_hours'] = \
            ((df.index >= '2010-01-02 03:00:00') & 
             (df.index <= '2010-01-02 05:00:00')).astype(int)

In [9]: df
Out[9]:
                       dew  temp  main_hours
date
2010-01-02 00:00:00  129.0   -16           0
2010-01-02 01:00:00  148.0   -15           0
2010-01-02 02:00:00  159.0   -11           0
2010-01-02 03:00:00  181.0    -7           1
2010-01-02 04:00:00  138.0    -7           1
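
For anyone who wants to reproduce this outside the IPython session, here is a minimal self-contained sketch of the same idea. The frame is rebuilt by hand from the values in the question, and the names `start`, `end` and `mask` are only illustrative; explicit `pd.Timestamp` bounds are used instead of strings, which is an assumption about style rather than a requirement.

```python
import pandas as pd

# Rebuild the sample frame from the question, indexed by "date".
idx = pd.DatetimeIndex(
    [
        "2010-01-02 00:00:00",
        "2010-01-02 01:00:00",
        "2010-01-02 02:00:00",
        "2010-01-02 03:00:00",
        "2010-01-02 04:00:00",
    ],
    name="date",
)
df = pd.DataFrame(
    {"dew": [129.0, 148.0, 159.0, 181.0, 138.0], "temp": [-16, -15, -11, -7, -7]},
    index=idx,
)

# Illustrative bounds for the "main hours" window (both ends inclusive).
start = pd.Timestamp("2010-01-02 03:00:00")
end = pd.Timestamp("2010-01-02 05:00:00")

# Comparing a DatetimeIndex with a Timestamp gives a boolean array;
# casting it to int yields the 0/1 dummy without touching the index.
mask = (df.index >= start) & (df.index <= end)
df["main_hours"] = mask.astype(int)

print(df)
```

The index is never reset here, so date stays an index rather than becoming a column.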

Timing for a 50,000-row DF:

In [19]: df = pd.concat([df.reset_index()] * 10**4, ignore_index=True).set_index('date')

In [20]: pd.options.display.max_rows = 10

In [21]: df
Out[21]:
                       dew  temp
date
2010-01-02 00:00:00  129.0   -16
2010-01-02 01:00:00  148.0   -15
2010-01-02 02:00:00  159.0   -11
2010-01-02 03:00:00  181.0    -7
2010-01-02 04:00:00  138.0    -7
...                    ...   ...
2010-01-02 00:00:00  129.0   -16
2010-01-02 01:00:00  148.0   -15
2010-01-02 02:00:00  159.0   -11
2010-01-02 03:00:00  181.0    -7
2010-01-02 04:00:00  138.0    -7

[50000 rows x 2 columns]

In [22]: %timeit ((df.index  >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00')).astype(int)
1.58 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [23]: %timeit np.where((df.index  >= '2010-01-02 03:00:00') & (df.index <= '2010-01-02 05:00:00'), 1, 0)
1.52 ms ± 28.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [24]: df.shape
Out[24]: (50000, 2)
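
The `%timeit` lines above only work inside IPython. A rough equivalent with the standard `timeit` module might look like the sketch below; the helper names `with_astype` and `with_where` and the run count are assumptions made for illustration, and the numbers will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DatetimeIndex(
    [
        "2010-01-02 00:00:00",
        "2010-01-02 01:00:00",
        "2010-01-02 02:00:00",
        "2010-01-02 03:00:00",
        "2010-01-02 04:00:00",
    ],
    name="date",
)
# Repeat the five timestamps 10**4 times to get a 50,000-entry index.
big_index = pd.DatetimeIndex(np.tile(base.values, 10**4), name="date")

lo = pd.Timestamp("2010-01-02 03:00:00")
hi = pd.Timestamp("2010-01-02 05:00:00")


def with_astype():
    # Boolean mask cast to int.
    return ((big_index >= lo) & (big_index <= hi)).astype(int)


def with_where():
    # Same mask, mapped to 1/0 via np.where.
    return np.where((big_index >= lo) & (big_index <= hi), 1, 0)


runs = 200  # arbitrary; raise it for steadier numbers
t_astype = timeit.timeit(with_astype, number=runs) / runs
t_where = timeit.timeit(with_where, number=runs) / runs
print(f"astype:   {t_astype * 1e3:.3f} ms per call")
print(f"np.where: {t_where * 1e3:.3f} ms per call")
```

Both variants build the same boolean mask first, so any difference comes only from the final conversion step.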

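【Discussion】:

That was fast :), awesome, works like a charm! I had tried this, but without astype(int). Thanks for the suggestion!
@i.n.n.m, glad it helped :)
Quick question, why didn't you use np.where?
@i.n.n.m, no particular reason. We can use np.where, it might be faster...
@i.n.n.m you can also look at pd.where and pd.mask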
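
The last comment presumably refers to the `where` and `mask` methods on Series and DataFrame (as far as I know there is no top-level `pd.where`). A small sketch of how they could produce the same dummy, assuming a rebuilt version of the sample frame; the names `cond`, `via_where` and `via_mask` are illustrative only:

```python
import pandas as pd

idx = pd.DatetimeIndex(
    [
        "2010-01-02 00:00:00",
        "2010-01-02 01:00:00",
        "2010-01-02 02:00:00",
        "2010-01-02 03:00:00",
        "2010-01-02 04:00:00",
    ],
    name="date",
)
df = pd.DataFrame({"temp": [-16, -15, -11, -7, -7]}, index=idx)

start = pd.Timestamp("2010-01-02 03:00:00")
end = pd.Timestamp("2010-01-02 05:00:00")

# Boolean condition as a Series that shares df's index.
cond = pd.Series((df.index >= start) & (df.index <= end), index=df.index)

# Series.where keeps values where cond is True and substitutes elsewhere.
via_where = pd.Series(1, index=df.index).where(cond, 0)

# Series.mask is the inverse: it substitutes where cond is True.
via_mask = pd.Series(0, index=df.index).mask(cond, 1)

df["main_hours"] = via_where
print(df)
```

For a plain 0/1 flag this is more roundabout than `astype(int)`; `where` and `mask` are more useful when the replacement values come from other columns.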

【Solution 2】:

Or use between:

pd.Series(df.index).between('2010-01-02 03:00:00',  '2010-01-02 05:00:00', inclusive=True).astype(int)

Out[1567]: 
0    0
1    0
2    0
3    1
4    1
Name: date, dtype: int32
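
Two caveats, shown in the hedged sketch below. First, the result above is indexed 0..4 rather than by date, so assigning it straight to a column of df would align on the wrong labels; converting to a NumPy array, or starting from `df.index.to_series()`, avoids that. Second, as far as I know recent pandas releases expect a string such as `'both'` for `inclusive` instead of `True`; since both endpoints are included by default, the sketch simply omits the argument. The frame is rebuilt by hand and the variable names are illustrative.

```python
import pandas as pd

idx = pd.DatetimeIndex(
    [
        "2010-01-02 00:00:00",
        "2010-01-02 01:00:00",
        "2010-01-02 02:00:00",
        "2010-01-02 03:00:00",
        "2010-01-02 04:00:00",
    ],
    name="date",
)
df = pd.DataFrame({"dew": [129.0, 148.0, 159.0, 181.0, 138.0]}, index=idx)

start = pd.Timestamp("2010-01-02 03:00:00")
end = pd.Timestamp("2010-01-02 05:00:00")

# Variant 1: positional result (index 0..4), assigned as a plain array
# so that pandas does not try to align it on the date labels.
positional = pd.Series(df.index).between(start, end).astype(int)
df["main_hours"] = positional.to_numpy()

# Variant 2: to_series() keeps the dates as the index, so the result
# aligns with df directly (the dates are unique in this sample).
aligned = df.index.to_series().between(start, end).astype(int)
df["main_hours_aligned"] = aligned

print(df)
```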

【Discussion】:

【Solution 3】:

df = df.assign(main_hours=0)
df.loc[df.between_time(start_time='3:00', end_time='5:00').index, 'main_hours'] = 1
>>> df
                     dew  temp  main_hours
2010-01-02 00:00:00  129   -16           0
2010-01-02 01:00:00  148   -15           0
2010-01-02 02:00:00  159   -11           0
2010-01-02 03:00:00  181    -7           1
2010-01-02 04:00:00  138    -7           1
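
Note that between_time looks only at the time of day, so on data spanning several days it flags 3:00 to 5:00 on every day, unlike the fixed datetime range used in the other answers. The sketch below illustrates this on a made-up two-day index (the extra 2010-01-03 rows and their values are not from the question) and uses `DatetimeIndex.indexer_between_time`, a positional relative of `between_time`, as one possible variant.

```python
import pandas as pd

# Hypothetical two-day sample; the 2010-01-03 rows are invented for illustration.
idx = pd.DatetimeIndex(
    [
        "2010-01-02 02:00:00",
        "2010-01-02 03:00:00",
        "2010-01-02 04:00:00",
        "2010-01-03 02:00:00",
        "2010-01-03 04:00:00",
    ],
    name="date",
)
df = pd.DataFrame({"temp": [-11, -7, -7, -9, -5]}, index=idx)

df["main_hours"] = 0

# Integer positions of all rows whose time of day lies in [3:00, 5:00].
pos = df.index.indexer_between_time("3:00", "5:00")

# Set the flag by position; rows on both days are affected.
df.iloc[pos, df.columns.get_loc("main_hours")] = 1

print(df)
```

If only one specific day should be flagged, a comparison against full timestamps is the safer choice.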

【Discussion】:

Thanks, you have a new way of assigning conditions! Good suggestion!