Store and extract numpy datetimes in PyTables: answers

【Question title】: Store and extract numpy datetimes in PyTables
【Posted】: 2014-09-07 10:59:38
【Question】:

I want to store numpy datetime64 data in a PyTables Table, and I want to do it without using Pandas.

What I've tried so far

Setup

In [1]: import tables as tb
In [2]: import numpy as np
In [3]: from datetime import datetime

Create data

In [4]: data = [(1, datetime(2000, 1, 1, 1, 1, 1)), (2, datetime(2001, 2, 2, 2, 2, 2))]
In [5]: rec = np.array(data, dtype=[('a', 'i4'), ('b', 'M8[us]')])
In [6]: rec  # a numpy array with my data
Out[6]: 
array([(1, datetime.datetime(2000, 1, 1, 1, 1, 1)),
       (2, datetime.datetime(2001, 2, 2, 2, 2, 2))], 
      dtype=[('a', '<i4'), ('b', '<M8[us]')])

Open a PyTables dataset with a `Time64Col` descriptor

In [7]: f = tb.open_file('foo.h5', 'w')  # New PyTables file
In [8]: d = f.create_table('/', 'bar', description={'a': tb.Int32Col(pos=0), 
                                                    'b': tb.Time64Col(pos=1)})
In [9]: d
Out[9]: 
/bar (Table(0,)) ''
  description := {
  "a": Int32Col(shape=(), dflt=0, pos=0),
  "b": Time64Col(shape=(), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (5461,)

Append the NumPy data to the PyTables dataset

In [10]: d.append(rec)
In [11]: d
Out[11]: 
/bar (Table(2,)) ''
  description := {
  "a": Int32Col(shape=(), dflt=0, pos=0),
  "b": Time64Col(shape=(), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (5461,)

What happened to my datetimes?

In [12]: d[:]
Out[12]: 
array([(1, 0.0), (2, 0.0)], 
      dtype=[('a', '<i4'), ('b', '<f8')])

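I understand that HDF5 doesn't provide native support for datetimes. I would have expected the extra metadata that PyTables layers on top to handle this.

My question

How can I store a numpy record array containing datetimes in PyTables? How can I efficiently extract that data from a PyTables table back into a NumPy array and keep my datetimes?

Common answers

I usually get this answer:

Use Pandas

I don't want to use Pandas because I don't have an index, I don't want one stored in my dataset, and Pandas doesn't let you avoid having or storing an index (see this question).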

【Comments】:

Tags: python datetime numpy pytables

【Solution 1】:

First, values need to be float64s when you put them into a Time64Col. You can do this with a call to astype, like so:

new_rec = rec.astype([('a', 'i4'), ('b', 'f8')])

Then you need to convert column b to seconds since the epoch, which means dividing by 1,000,000, since the values are in microseconds:

new_rec['b'] = new_rec['b'] / 1e6

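Then call d.append(new_rec).

When you read the array back into memory, do the reverse and multiply by 1,000,000. You have to make sure the values are in microseconds before putting anything in; astype('datetime64[us]') handles this automatically in numpy >= 1.7.x.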
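The two conversions described above can be wrapped in a pair of helpers. This is a minimal sketch, not part of the original answer: the function names `datetimes_to_seconds` and `seconds_to_datetimes` are hypothetical, and the code copies field by field instead of calling astype on the whole record array. It assumes the time column holds datetime64 values and rounds to the nearest microsecond on the way back.

```python
import numpy as np

def datetimes_to_seconds(rec, field):
    """Copy `rec`, replacing the datetime64 column `field` with float64 seconds."""
    names = rec.dtype.names
    new_dtype = [(n, 'f8' if n == field else rec.dtype[n]) for n in names]
    out = np.empty(rec.shape, dtype=new_dtype)
    for n in names:
        if n == field:
            # datetime64 -> integer microseconds since the epoch -> float seconds
            out[n] = rec[n].astype('M8[us]').astype('i8') / 1e6
        else:
            out[n] = rec[n]
    return out

def seconds_to_datetimes(rec, field):
    """Copy `rec`, replacing the float64 seconds column `field` with datetime64[us]."""
    names = rec.dtype.names
    new_dtype = [(n, 'M8[us]' if n == field else rec.dtype[n]) for n in names]
    out = np.empty(rec.shape, dtype=new_dtype)
    for n in names:
        if n == field:
            # float seconds -> nearest integer microsecond -> datetime64[us]
            micros = np.round(rec[n] * 1e6).astype('i8')
            out[n] = micros.astype('M8[us]')
        else:
            out[n] = rec[n]
    return out
```

With these, the write side would be d.append(datetimes_to_seconds(rec, 'b')) and the read side seconds_to_datetimes(d[:], 'b').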

I used the solution from this question: How to get unix timestamp from numpy.datetime64

Here is a working version of your example:

In [4]: data = [(1, datetime(2000, 1, 1, 1, 1, 1)), (2, datetime(2001, 2, 2, 2, 2, 2))]

In [5]: rec = np.array(data, dtype=[('a', 'i4'), ('b', 'M8[us]')])

In [6]: new_rec = rec.astype([('a', 'i4'), ('b', 'f8')])

In [7]: new_rec
Out[7]:
array([(1, 946688461000000.0), (2, 981079322000000.0)],
      dtype=[('a', '<i4'), ('b', '<f8')])

In [8]: new_rec['b'] /= 1e6

In [9]: new_rec
Out[9]:
array([(1, 946688461.0), (2, 981079322.0)],
      dtype=[('a', '<i4'), ('b', '<f8')])

In [10]: f = tb.open_file('foo.h5', 'w')  # New PyTables file

In [11]: d = f.create_table('/', 'bar', description={'a': tb.Int32Col(pos=0),
   ....:                                             'b': tb.Time64Col(pos=1)})

In [12]: d.append(new_rec)

In [13]: d[:]
Out[13]:
array([(1, 946688461.0), (2, 981079322.0)],
      dtype=[('a', '<i4'), ('b', '<f8')])

In [14]: r = d[:]

In [15]: r['b'] *= 1e6

In [16]: r.astype([('a', 'i4'), ('b', 'datetime64[us]')])
Out[16]:
array([(1, datetime.datetime(2000, 1, 1, 1, 1, 1)),
       (2, datetime.datetime(2001, 2, 2, 2, 2, 2))],
      dtype=[('a', '<i4'), ('b', '<M8[us]')])
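If the Time64Col metadata is not needed, a different technique avoids the float conversion entirely: reinterpret the datetime64[us] column as int64 microseconds with a view, and view it back after reading. This is not what the answer above does. It is a sketch under the assumption that the column would be stored in an integer column such as `tb.Int64Col` instead of `tb.Time64Col`; only the NumPy side is shown here.

```python
import numpy as np

dt_dtype = [('a', 'i4'), ('b', 'M8[us]')]
int_dtype = [('a', 'i4'), ('b', 'i8')]  # same layout, 'b' seen as raw int64

rec = np.zeros(2, dtype=dt_dtype)
rec['a'] = [1, 2]
rec['b'] = np.array(['2000-01-01T01:01:01', '2001-02-02T02:02:02'],
                    dtype='M8[us]')

# Reinterpret the same bytes: no copy and no arithmetic, so no rounding.
as_ints = rec.view(int_dtype)

# After reading the integers back, view them as datetimes again.
restored = as_ints.view(dt_dtype)
```

The view only works because both dtypes have the same item size and field layout; the unit (microseconds here) is not stored anywhere and has to be remembered by the reader.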

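【Comments】:

Now I remember why pandas doesn't use Time64Col: it is backed by a float64, which doesn't support nanosecond precision.
Yes. The amount of work needed to round-trip datetime64 dwarfs any benefit gained from giving PyTables the Time64Col metadata.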
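The precision point in the first comment can be checked with NumPy alone. This is a small sketch; the variable names are illustrative, and the timestamp is the first value from the example above, expressed as float64 seconds since the epoch.

```python
import numpy as np

# First timestamp from the example, as float64 seconds since the epoch.
seconds = np.float64(946688461.0)

# Gap between this float64 and the next representable one, in seconds.
resolution = np.spacing(seconds)

# One nanosecond, for comparison.
one_ns = 1e-9

# True when a float64 of this magnitude cannot tell adjacent nanoseconds apart.
too_coarse_for_ns = resolution > one_ns
```

For timestamps of this magnitude the gap is on the order of a tenth of a microsecond, which is fine for datetime64[us] values near these dates but too coarse for datetime64[ns].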