pandas: get a numpy ndarray using series.values

【Question title】: pandas get numpy ndarray using series.values
【Posted】: 2016-09-21 23:04:03
【Question】:

I want to convert a Series to a numpy.ndarray, since working with the ndarray is much more time-efficient:

numpy_martix = df[some_col].values

I found that series.values itself takes some time to do the conversion, so I'm wondering whether there is a faster way to do it.

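【Comments】:

How long is "some time", and how large is your data frame?
Have you tried as_matrix?
@Evert The data frame has about 100k rows, and it takes about 0.7-0.8 seconds.
Copying (converting) 100k rows to a numpy array in under 1 second isn't fast, but it isn't too slow either. If you are doing this repeatedly in some kind of loop, you may need to look for another way to structure your program flow.
Are you sure it takes that long? No conversion is needed when accessing .values. For a data frame with 1m rows, it takes me 2.36 µs.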

Tags: python-3.x numpy pandas series

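【Answer 1】:

(Edited)

When you call arr = df.values, a reference to the data of df is returned, so it is very fast (no real work is done). On the other hand, arr = df[list_of_cols].values first needs to do some consolidation inside df.

Try running it this way: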

arr = df.values[:, numeric_list_of_cols]

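It may be a bit faster, since all the work is done on the numpy array. The actual speed-up will most likely depend on the underlying data, though.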
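As a quick sanity check that the two spellings select the same data, here is a minimal sketch. The frame, its column names, and the positions [0, 2] are made up for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; column names are illustrative only.
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=["A", "B", "C", "D"])

list_of_cols = ["A", "C"]      # label-based selection
numeric_list_of_cols = [0, 2]  # the same columns, given by position

# Select by label in pandas, then convert to an ndarray.
by_label = df[list_of_cols].values

# Convert first, then select by position in NumPy.
by_position = df.values[:, numeric_list_of_cols]

print(np.array_equal(by_label, by_position))
```

Note that the NumPy side only understands integer positions (or boolean masks), not column labels.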

Tests

I decided to run some tests; here are the results.

First, a data frame containing only numeric values.

'''Setup'''
import numpy as np
import pandas as pd
a = np.random.rand(1000, 1000)
df = pd.DataFrame(a)
idx = np.arange(0, 1000, 3)

df.iloc[:3,:5]
Out[35]: 
          0         1         2         3         4
0  0.825100  0.556511  0.445429  0.972720  0.726258
1  0.818005  0.298689  0.684203  0.722038  0.848657
2  0.426488  0.270172  0.400533  0.946921  0.745236

Let's take every third column:

# data frame:
%timeit x = df.iloc[:,idx]
1000 loops, best of 3: 1.69 ms per loop
%timeit x = df.iloc[:,idx].copy()
100 loops, best of 3: 2.75 ms per loop

# underlying values:
%timeit x = df.values[:,idx]
1000 loops, best of 3: 1.61 ms per loop
%timeit x = df.values[:,idx].copy()
100 loops, best of 3: 2.23 ms per loop

# numpy array for comparison
%timeit x = a[:,idx]
1000 loops, best of 3: 1.53 ms per loop
%timeit x = a[:,idx].copy()
100 loops, best of 3: 2.16 ms per loop

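Access via .values is only slightly faster (in fact, in other tests I ran the difference was even smaller, less than 1%). But let's try the same thing with a contiguous group of columns.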

%timeit x = df.iloc[:,300:600]
10000 loops, best of 3: 153 µs per loop
%timeit x = df.iloc[:,300:600].copy()
1000 loops, best of 3: 1.18 ms per loop

%timeit x = df.values[:,300:600]
The slowest run took 9.67 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 15.7 µs per loop
%timeit x = df.values[:,300:600].copy()
1000 loops, best of 3: 568 µs per loop

%timeit x = a[:,300:600]
The slowest run took 24.73 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 414 ns per loop
%timeit x = a[:,300:600].copy()
1000 loops, best of 3: 497 µs per loop

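We may suspect that some of these calls return views rather than copies, so let's focus on the results with .copy(). Access via values is about 2 times faster.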
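The view-versus-copy suspicion can be checked directly on a plain NumPy array. The sketch below reuses the names a and idx from the setup and relies on np.shares_memory; it illustrates NumPy's indexing rules only and says nothing about what pandas does internally.

```python
import numpy as np

a = np.random.rand(1000, 1000)
idx = np.arange(0, 1000, 3)

# Basic slicing returns a view onto the same buffer, which is why it
# takes nanoseconds: no data is moved.
sliced = a[:, 300:600]

# Indexing with an integer array ("fancy" indexing) always allocates
# a new array and copies the selected columns into it.
fancy = a[:, idx]

print(np.shares_memory(a, sliced))
print(np.shares_memory(a, fancy))
```

This is why only the timings that end in .copy() compare like with like.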

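We can do better. Let's change the data layout in the underlying array to Fortran order. This means that the columns of the array are laid out contiguously in memory (rather than the rows, which is the default).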

a = np.asfortranarray(a)
df = pd.DataFrame(np.asfortranarray(a))
df.iloc[:3,:5]
Out[53]: 
          0         1         2         3         4
0  0.825100  0.556511  0.445429  0.972720  0.726258
1  0.818005  0.298689  0.684203  0.722038  0.848657
2  0.426488  0.270172  0.400533  0.946921  0.745236

I'm pasting only the results with copy:

# Every third column:
%timeit x = df.iloc[:,idx].copy()
100 loops, best of 3: 1.85 ms per loop
%timeit x = df.values[:,idx].copy()
1000 loops, best of 3: 1.2 ms per loop
%timeit x = a[:,idx].copy()
1000 loops, best of 3: 1.13 ms per loop

# Contiguous group of columns
%timeit x = df.iloc[:,300:600].copy()
1000 loops, best of 3: 635 µs per loop
%timeit x = df.values[:,300:600].copy()
1000 loops, best of 3: 655 µs per loop
%timeit x = a[:,300:600].copy()
1000 loops, best of 3: 586 µs per loop
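To make the layout difference concrete, the following sketch inspects the strides and contiguity flags of a C-ordered and a Fortran-ordered array. The variable name f is introduced here for illustration; the timings above were taken on a and df.

```python
import numpy as np

a = np.random.rand(1000, 1000)  # C order (row-major) by default
f = np.asfortranarray(a)        # same values, column-major layout

# Strides are the byte steps along each axis (float64 is 8 bytes).
print(a.strides)  # (8000, 8): stepping down a column jumps a whole row
print(f.strides)  # (8, 8000): stepping down a column moves one element

# A single column is scattered in C order but contiguous in Fortran order.
col_c = a[:, 0]
col_f = f[:, 0]
print(col_c.flags["C_CONTIGUOUS"], col_f.flags["C_CONTIGUOUS"])

# The values themselves are unchanged by the layout conversion.
print(np.array_equal(a, f))
```

Copying whole columns out of a column-major array touches memory sequentially, which is a plausible reason for the faster "every third column" copies above.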

But what happens when the data frame contains columns of mixed types? Let's convert every other column to strings.

for i in range(0, 1000, 2):
    df[i] = df[i].astype(str)

df.iloc[:3,:5]
Out[71]: 
                0         1               2         3               4
0  0.825100137204  0.556511  0.445428873093  0.972720  0.726258247769
1  0.818005069404  0.298689  0.684203047084  0.722038  0.848656512757
2   0.42648763586  0.270172  0.400532581854  0.946921  0.745235906595

%timeit x = df.iloc[:,idx].copy()
100 loops, best of 3: 8.24 ms per loop
%timeit x = df.values[:,idx].copy()
10 loops, best of 3: 51.6 ms per loop

%timeit x = df.iloc[:,300:600].copy()
100 loops, best of 3: 6.91 ms per loop
%timeit x = df.values[:,300:600].copy()
10 loops, best of 3: 48.3 ms per loop

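NumPy does not handle mixed types in one array well: df.values has to build a single object array. Accessing the data frame directly wins here.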
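A small sketch of why the mixed-type case is slow: once any column holds strings, .values has to fall back to one object-dtype array. The use of select_dtypes to pull out only the numeric columns is a suggestion of this sketch, not something the original answer measured.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 4))
all_float_dtype = df.values.dtype  # float64: one homogeneous block

# Turn one column into strings, as in the test above.
df[0] = df[0].astype(str)
mixed_dtype = df.values.dtype      # object: every element gets boxed

# Assumption of this sketch: if only the numeric columns are needed,
# select them first so the resulting array stays numeric.
numeric = df.select_dtypes(include="number")
numeric_dtype = numeric.values.dtype

print(all_float_dtype, mixed_dtype, numeric_dtype)
```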

Appendix: how to get numeric_list_of_cols from list_of_columns.

Pure Python:

cols = df.columns.tolist()
numeric_list_of_cols = [cols.index(i) for i in list_of_columns]

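NumPy:
```
numeric_list_of_cols, = np.in1d(df.columns, list_of_columns).nonzero()
```
The comma after numeric_list_of_cols is needed to unpack the tuple. The function in1d returns a boolean array, and nonzero() returns a tuple of arrays holding the indices of the non-zero elements.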

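Warning: this can change the order of the elements, because the indices come back in the order of df.columns rather than the order of list_of_columns.

To preserve the order, you can iterate over the elements of list_of_columns (something like np.nonzero(df.columns == elem)) to get the successive indices.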
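One way to keep the requested order without an explicit loop is pandas' Index.get_indexer, sketched below on a made-up frame. This is an alternative to the approach described above, not the answer's own method. The mask variant uses np.isin, which performs the same membership test as np.in1d.

```python
import numpy as np
import pandas as pd

# Hypothetical frame and column list, for illustration only.
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=["A", "B", "C", "D"])
list_of_columns = ["C", "A"]  # deliberately not in frame order

# Mask-based lookup: positions come back in the order of df.columns.
mask_positions = np.isin(df.columns, list_of_columns).nonzero()[0]

# get_indexer returns one position per requested label, in the order
# requested (-1 would mark a label that is not present).
ordered_positions = df.columns.get_indexer(list_of_columns)

print(mask_positions)     # positions in frame order
print(ordered_positions)  # positions in list_of_columns order

arr = df.values[:, ordered_positions]
```

With ordered_positions, the columns of arr line up with list_of_columns, matching what df[list_of_columns].values would return.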

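【Comments】:

I tried arr = df.values[:, numeric_list_of_cols], but got the error IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices.
@daiyue Did you use numeric indices in place of numeric_list_of_cols? If you have a data frame with columns 'A', 'B', 'C' in that order and want to get 'A' and 'C', you should do arr = df.values[:,[0, 2]]
How do I find the indices for numeric_list_of_col and pass them to values[]? Suppose I have a list of columns called list_of_col, which is a subset of the columns of df.
@daiyue See the update. I'm not sure it's the simplest way, but it should work.
Thanks for your input; your answer is definitely one way to get an ndarray from a list of columns.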