Answers: Multi-threaded integer matrix multiplication in NumPy/SciPy

【Question title】: Multi-threaded integer matrix multiplication in NumPy/SciPy
【Posted】: 2016-05-08 04:18:19
【Question】:

Doing something like

import numpy as np
a = np.random.rand(10**4, 10**4)
b = np.dot(a, a)

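uses multiple cores, and it runs nicely.

The elements in a, though, are 64-bit floats (or 32-bit on 32-bit platforms?), and I'd like to multiply 8-bit integer arrays. Trying the following, though: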

a = np.random.randint(2, size=(n, n)).astype(np.int8)

results in the dot product not using multiple cores, and thus running about 1000 times slower on my PC.

array: np.random.randint(2, size=shape).astype(dtype)

dtype    shape          %time (average)

float32 (2000, 2000)    62.5 ms
float32 (3000, 3000)    219 ms
float32 (4000, 4000)    328 ms
float32 (10000, 10000)  4.09 s

int8    (2000, 2000)    13 seconds
int8    (3000, 3000)    3min 26s
int8    (4000, 4000)    12min 20s
int8    (10000, 10000)  It didn't finish in 6 hours

float16 (2000, 2000)    2min 25s
float16 (3000, 3000)    Not tested
float16 (4000, 4000)    Not tested
float16 (10000, 10000)  Not tested

I know NumPy uses BLAS, which doesn't support integers, but if I use the SciPy BLAS wrappers, i.e.

import scipy.linalg.blas as blas
a = np.random.randint(2, size=(n, n)).astype(np.int8)
b = blas.sgemm(alpha=1.0, a=a, b=a)

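the computation is multi-threaded. Now, blas.sgemm runs with exactly the same timing as np.dot for float32, but for non-float inputs it converts everything to float32 and outputs floats, which is something np.dot doesn't do. (In addition, b is now in F_CONTIGUOUS order, which is a lesser issue.)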
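The behaviour described above can be checked on a small matrix. This is only a minimal sketch, assuming SciPy is installed; the size n = 64 and the names b_int and b_c are illustrative and not part of the original question.

```python
import numpy as np
import scipy.linalg.blas as blas

n = 64  # illustrative size, small enough to verify quickly
a = np.random.randint(2, size=(n, n)).astype(np.int8)

# sgemm converts the int8 input to float32 and returns a float32 array
b = blas.sgemm(alpha=1.0, a=a, b=a)
print(b.dtype, b.flags['F_CONTIGUOUS'])

# Entries are at most n here, so float32 holds them exactly and the
# conversion back to an integer type loses nothing
b_int = b.astype(np.int32)

# Copy into C order if later code expects it
b_c = np.ascontiguousarray(b_int)
```

The conversion back to integers is only safe while every entry of the product stays within the range float32 represents exactly (integers up to 2**24).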

So, if I want to do integer matrix multiplication, I have to do one of the following:

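1. Use NumPy's painfully slow np.dot and be glad I get to keep the 8-bit integers.
2. Use SciPy's sgemm and use up 4x the memory.
3. Use NumPy's np.float16 and only use 2x the memory, with the caveat that np.dot is much slower on float16 arrays than on float32 arrays, and even slower than on int8 arrays.
4. Find an optimized library for multi-threaded integer matrix multiplication (actually, Mathematica can do this, but I'd prefer a Python solution), ideally supporting 1-bit arrays, although 8-bit arrays are also fine... (I'm actually aiming to multiply matrices over the finite field Z/2Z, and I know I can do this with Sage, which is quite Pythonic, but, again, is there something strictly Python?)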

Can I follow option 4? Does such a library exist?

Disclaimer: I'm actually running NumPy + MKL, but I've tried a similar test on vanilla NumPy, with similar results.

【Comments on the question】:

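Regarding your option 4, maybe you could take a look at PyCuda or Theano? They allow large operations to be done on the GPU (with a simple numpy interface), with very good performance.
As a possible answer to option 4, bitbucket.org/malb/m4ri looks interesting. "M4RI is a library for fast arithmetic with dense matrices over F2." I think this is what Sage is already using, but I see no reason why it couldn't be used directly from Python with a suitable Cython wrapper. (In fact, you can probably find such a wrapper already in the Sage source.)
Nobody has mentioned numpy.einsum yet, but that might be a decent option 5.
Note that if you want to avoid integer overflow you will need to cast the result to something larger. If each element is either 0 or 1, you need an integer format that can hold values at least as large as n in order to guarantee no overflow. For your example, where n=10000, (u)int16 ought to be sufficient. Are your real matrices sparse? If so, you might be better off using scipy.sparse.csr_matrix.
Can you give a bit more context on the overall problem you're trying to solve? Multiplying large integer matrices together is a rather unusual thing to do. It would be particularly helpful to know more about the properties of these matrices. Are the values always either 0 or 1? If they can be larger, you may well find yourself ultimately limited by the largest integer that can be represented using uint64. How are the matrices generated? Do they have any special structure (e.g. symmetric, block, banded, etc.)?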
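The overflow concern raised in the comments can be illustrated with a short sketch. It assumes an all-ones test matrix and an illustrative size n = 200, neither of which comes from the original thread; the names wrapped, wide and exact are likewise only for illustration.

```python
import numpy as np

n = 200  # illustrative size; the true entries (200) do not fit in int8
a = np.ones((n, n), dtype=np.int8)

# int8 inputs give an int8 result, so the sums wrap around modulo 256
wrapped = a.dot(a)

# Widening the inputs first keeps the true counts; int16 holds values
# up to 32767, enough for 0/1 matrices with n = 10000
wide = a.astype(np.int16)
exact = wide.dot(wide)

print(wrapped.dtype, wrapped[0, 0], exact.dtype, exact[0, 0])
```

Widening costs memory (2x for int16), which is the same trade-off the question already notes for the float-based options.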

Tags: python multithreading numpy matrix-multiplication blas

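【Solution 1】:

Note that as this answer gets older, numpy might gain optimized integer support. Please verify that this answer is still faster on your setup.

Option 5 - Roll a custom solution: Partition the matrix product into a few sub-products and perform these in parallel. This can be implemented relatively easily with standard Python modules. The sub-products are computed with numpy.dot, which releases the global interpreter lock. Thus, it is possible to use threads, which are relatively lightweight and can access the arrays from the main thread for memory efficiency.

Implementation: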

import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time


def blockshaped(arr, nrows, ncols):
    """
    Return an array of shape (nrows, ncols, n, m) where
    n * nrows, m * ncols = arr.shape.
    This should be a view of the original array.
    """
    h, w = arr.shape
    n, m = h // nrows, w // ncols
    return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)


def do_dot(a, b, out):
    #np.dot(a, b, out)  # does not work. maybe because out is not C-contiguous?
    out[:] = np.dot(a, b)  # less efficient because the output is stored in a temporary array?


def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
    """
    Return the matrix product a * b.
    The product is split into nblocks * mblocks partitions that are performed
    in parallel threads.
    """
    n_jobs = nblocks * mblocks
    print('running {} jobs in parallel'.format(n_jobs))

    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)

    out_blocks = blockshaped(out, nblocks, mblocks)
    a_blocks = blockshaped(a, nblocks, 1)
    b_blocks = blockshaped(b, 1, mblocks)

    threads = []
    for i in range(nblocks):
        for j in range(mblocks):
            th = threading.Thread(target=dot_func, 
                                  args=(a_blocks[i, 0, :, :], 
                                        b_blocks[0, j, :, :], 
                                        out_blocks[i, j, :, :]))
            th.start()
            threads.append(th)

    for th in threads:
        th.join()

    return out


if __name__ == '__main__':
    a = np.ones((4, 3), dtype=int)
    b = np.arange(18, dtype=int).reshape(3, 6)
    assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))

    a = np.random.randn(1500, 1500).astype(int)

    start = time()
    pardot(a, a, 2, 4)
    time_par = time() - start
    print('pardot: {:.2f} seconds taken'.format(time_par))

    start = time()
    np.dot(a, a)
    time_dot = time() - start
    print('np.dot: {:.2f} seconds taken'.format(time_dot))

With this implementation I get a speedup of approximately 4x, which is the number of physical cores in my machine:

running 8 jobs in parallel
pardot: 5.45 seconds taken
np.dot: 22.30 seconds taken
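The same idea can also be written with a thread pool. The following is a hedged sketch, not the answer's implementation: the function name pardot_rows and its n_workers parameter are hypothetical, and it splits only the rows of a, so the dimensions do not need to be divisible by the number of blocks as blockshaped requires.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def pardot_rows(a, b, n_workers=4):
    """Return a.dot(b), computing horizontal bands of the result in threads."""
    out = np.empty((a.shape[0], b.shape[1]), dtype=np.result_type(a, b))
    # Band boundaries; bands may differ in size or be empty
    bounds = np.linspace(0, a.shape[0], n_workers + 1).astype(int)

    def work(band):
        lo, hi = band
        # Row bands of `out` are written by exactly one thread each
        out[lo:hi] = np.dot(a[lo:hi], b)

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() waits for all bands and re-raises any worker exception
        list(pool.map(work, zip(bounds[:-1], bounds[1:])))
    return out


if __name__ == '__main__':
    a = np.random.randint(2, size=(1001, 300)).astype(np.int32)
    b = np.random.randint(2, size=(300, 500)).astype(np.int32)
    assert np.array_equal(pardot_rows(a, b), np.dot(a, b))
```

Whether this is faster than a plain np.dot depends on the dtype and the NumPy build, so it is worth timing both on the target machine.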

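【Comments】:

It works! This is an O(n**3) matrix product, which is exactly n**2 dot products, right?
It splits the matrix product into multiple smaller matrix products. In the extreme case, these could be vector dot products.
When the type is float, pardot is slower than np.dot: running 4 jobs in parallel, running 8 jobs in parallel, pardot: 0.13 seconds taken, np.dot: 0.07 seconds taken
It is even worse when the dataset is 10x the size: pardot: 1212.89 seconds, np.dot: 73.11 seconds
@kory That is expected. Please use np.dot for floating-point multiplication.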

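【Solution 2】:

"Why is it faster to perform float by float matrix multiplication compared to int by int?" explains why integers are so slow: first, CPUs have high-throughput floating-point pipelines. Second, BLAS has no integer type.

Workaround: Converting the matrices to float32 values gives a big speedup. How's a 90x speedup on a 2015 MacBook Pro? (Using float64 is half as good.)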

import numpy as np
import time

def timeit(func):
    # Return the wall-clock time of a single call to func
    start = time.time()
    func()
    end = time.time()
    return end - start

# np.random.random_integers is deprecated; randint(0, 10) draws the same 0..9 range
a = np.random.randint(0, 10, size=(1000, 1000)).astype(np.int8)

timeit(lambda: a.dot(a))  # ≈0.9 sec
timeit(lambda: a.astype(np.float32).dot(a.astype(np.float32)).astype(np.int8) )  # ≈0.01 sec
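The float workaround is only exact while every partial sum stays within the range the float type represents exactly. A minimal sketch of that check follows; the helper name int_dot_via_float is hypothetical, it assumes non-empty 2-D integer inputs and a BLAS that uses ordinary multiply-add accumulation, and it returns int64 rather than int8 so that large entries are not truncated.

```python
import numpy as np


def int_dot_via_float(a, b):
    """Integer matrix product computed through float BLAS when provably exact."""
    # Largest absolute entries, taken as Python ints to avoid int8 overflow
    amax = max(abs(int(a.min())), abs(int(a.max())))
    bmax = max(abs(int(b.min())), abs(int(b.max())))
    # Upper bound on the magnitude of any partial sum in a dot product
    bound = a.shape[1] * amax * bmax

    if bound <= 2**24:      # float32 represents integers exactly up to 2**24
        ftype = np.float32
    elif bound <= 2**53:    # float64 represents integers exactly up to 2**53
        ftype = np.float64
    else:
        raise OverflowError('product may not be exact in floating point')

    c = np.dot(a.astype(ftype), b.astype(ftype))
    return c.astype(np.int64)


if __name__ == '__main__':
    g = np.random.randint(2, size=(500, 500)).astype(np.int8)
    # Product over Z/2Z: reduce the exact integer product modulo 2
    g2 = int_dot_via_float(g, g) % 2
    print(g2.dtype, g2.shape)
```

For 0/1 matrices the bound is simply the inner dimension, so float32 stays exact for inner dimensions up to 2**24.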

【Comments】: