Converting pandas dataframe to numpy array with headers and dtypes: answers

【Question title】: Converting pandas dataframe to numpy array with headers and dtypes
【Posted】: 2018-09-18 22:34:17
【Question】:

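I have been trying to convert a pandas dataframe to a numpy array while keeping the dtypes and header names for ease of reference. I need to do this because processing in pandas is far too slow; numpy is about 10 times faster. I have this code from SO that gives me what I need, except that the result doesn't look like a standard numpy array, i.e. its shape does not show the number of columns.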

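[In]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 3), columns=['Acol', 'Ccol', 'Bcol'])
# df.to_numpy() replaces as_matrix(), which was removed in pandas 1.0
arr_ip = [tuple(i) for i in df.to_numpy()]
dtyp = np.dtype(list(zip(df.dtypes.index, df.dtypes)))
dfnp = np.array(arr_ip, dtype=dtyp)
print(dfnp.shape)
dfnp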

[Out]: 

(10,) #expecting (10,3)

array([(-1.0645345 ,  0.34590193,  0.15063829),
( 1.5010928 ,  0.63312454,  2.38309797),
(-0.10203999, -0.40589525,  0.63262773),
( 0.92725915,  1.07961763,  0.60425353),
( 0.18905164, -0.90602597, -0.27692396),
(-0.48671514,  0.14182815, -0.64240004),
( 0.05012859, -0.01969079, -0.74910076),
( 0.71681329, -0.38473052, -0.57692395),
( 0.60363249, -0.0169229 , -0.16330232),
( 0.04078263,  0.55943898, -0.05783683)],
dtype=[('Acol', '<f8'), ('Ccol', '<f8'), ('Bcol', '<f8')])

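Am I missing something, or is there another way to do this? I have many dataframes to convert, and their dtypes and column names vary, so I need an automated approach like this. Because of the large number of dataframes, it also needs to be efficient.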

【Comments】:

FYI, another approach is here (with the advantage of converting pandas dtype=object to numpy dtype=string): stackoverflow.com/questions/52579601/…

Tags: python arrays pandas numpy dataframe

【Solution 1】:

Use df.to_records() to convert your dataframe to a structured array.

You can pass index=False to exclude the index from the result.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10,3),columns=['Acol','Ccol','Bcol'])

res = df.to_records(index=False)

# rec.array([(0.12448699852020828, 0.7621451848466592, 0.0958529943831431),
#  (0.14534869167076214, 0.695297214355628, 0.3753874117495527),
#  (0.09890006207909052, 0.46364777245941025, 0.10216301104094272),
#  (0.3467673672203968, 0.4264108141950761, 0.1475998692158026),
#  (0.9272619907467186, 0.3116253419608288, 0.5681628329642517),
#  (0.34509767424461246, 0.5533523959180552, 0.02145207648054681),
#  (0.7982313824847291, 0.563383955627413, 0.35286630304880684),
#  (0.9574060540226251, 0.21296949881671157, 0.8882413119348652),
#  (0.0892793829627454, 0.6157843461905468, 0.8310360916075473),
#  (0.4691016244437851, 0.7007146447236033, 0.6672404967622088)], 
#           dtype=[('Acol', '<f8'), ('Ccol', '<f8'), ('Bcol', '<f8')])
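As a minimal sketch of how the header names and dtypes can then be used for reference, the fields of the record array can be read by name. The variable names below (names, a_col, same_col, first_row) are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 3), columns=['Acol', 'Ccol', 'Bcol'])
res = df.to_records(index=False)

# Column names survive as field names on the dtype.
names = res.dtype.names

# Each field is an ordinary 1-D array, selectable by name.
a_col = res['Acol']

# Record arrays also expose fields as attributes.
same_col = res.Acol

# A single row is one record holding all three fields.
first_row = res[0]
```

Selecting a field this way returns a view on the record array's memory, so no column data is copied.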

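A structured array built this way always has one dimension, since each row is a single record and the fields belong to the dtype rather than to an axis. That can't be changed.

But you can get the 2-D shape, as long as every field has the same dtype, via: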

res.view(np.float64).reshape(len(res), -1).shape  # (10, 3)
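The view above reinterprets the raw bytes, so it only makes sense when every field is float64. For mixed dtypes, one option (not part of the original answer) is numpy.lib.recfunctions.structured_to_unstructured, available in NumPy 1.16 and later, which casts all fields to a common dtype. A small sketch with made-up column values:

```python
import numpy as np
import pandas as pd
from numpy.lib import recfunctions as rfn

# Illustrative data: two float columns and one integer column.
df = pd.DataFrame({'Acol': [1.5, 2.5], 'Ccol': [1, 2], 'Bcol': [0.5, 0.25]})
res = df.to_records(index=False)

# Drop the recarray subclass, then cast every field to a common dtype.
plain = rfn.structured_to_unstructured(np.asarray(res))

print(plain.shape)
print(plain.dtype)
```

Unlike the view, this may copy the data, since the integer field has to be converted.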

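For performance, if you intend to manipulate the data, you are better off using a plain numpy.array via df.to_numpy() and recording the column names in a dictionary with integer keys.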
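A rough sketch of that suggestion, assuming all columns are numeric so that df.to_numpy() yields a homogeneous 2-D array. The names col_names and col_index are hypothetical helpers, not pandas or NumPy API.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 3), columns=['Acol', 'Ccol', 'Bcol'])

# Plain 2-D array: the shape now includes the column count.
arr = df.to_numpy()

# Integer position -> column name, as the answer suggests.
col_names = dict(enumerate(df.columns))

# Reverse lookup (name -> position) for convenient slicing.
col_index = {name: i for i, name in col_names.items()}

# Select a whole column by its header name.
ccol = arr[:, col_index['Ccol']]
print(arr.shape)
```

With mixed column dtypes, df.to_numpy() would fall back to a common dtype (possibly object), so this layout fits best when the columns share a numeric type.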

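【Comments】:

Thanks, that's great. After doing the heavy processing in numpy I did have to do a ravel to get it back into a pandas dataframe: res_pd = pd.DataFrame(res.ravel())
@GivenX Did you use your own code to convert to a numpy array, or just df.to_records()?
I just used df.to_records()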
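
For reference, a minimal sketch of the round trip mentioned in the comments above. The doubling step is only a stand-in for the real processing, and pd.DataFrame.from_records is used here as one possible way back to a dataframe, alongside the pd.DataFrame(res.ravel()) call from the comment.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 3), columns=['Acol', 'Ccol', 'Bcol'])
res = df.to_records(index=False)

# Stand-in for the heavy NumPy processing: double one field.
res['Acol'] = res['Acol'] * 2

# Field names become column names again.
df_back = pd.DataFrame.from_records(res)
print(df_back.dtypes)
```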