并行化执行按位操作的代码答案

【问题标题】：Parallelize code which is doing bit wise operation并行化执行按位操作的代码
【发布时间】：2021-07-17 03:28:48
【问题描述】：

我有这段代码通过将 AU 矩阵的每个字节的 8 个元素打包到 A 中来减少 100k*200k 矩阵占用的空间，以减少内存消耗。如您所料，此代码需要永远运行，我也计划将行数增加到 200k。我在一个非常强大的实例（CPU 和 GPU）上运行代码并且可以对其进行扩展，因此任何人都可以帮助并行化此代码以使其更快。

import numpy as np
colm = int(2000000/8)
rows = 1000000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
start_time = time.time()

A = np.empty((rows,colm), dtype=np.uint8)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        A[i,j] = 0
        for k in range(8):
            if AU[i,(j*8)+k] == 1:
                A[i,j] = A[i,j] | (1<<(7-k))

【问题讨论】：

如果你只是使用 numba，它应该已经比你的代码和向量化的 numpy 例程快了很多。此外，您还应该能够为 GPU 目标编译它......我明天会整理一个版本并发布它......我仍然需要看看最里面的部分究竟做了什么......也许可以改进通过以某种方式重写循环来缓存局部性......
我假设随机部分只是测试数据，或者您只需生成该格式的数据，例如A = np.random.randint(256,size=(rows, colm),dtype=np.int8)
AU中的数据是我的真实数据，我会读一段时间，但由于它太大了，我把它压缩成这样
@2b-t 谢谢，感激不尽

标签： python multithreading performance parallel-processing dask

【解决方案1】：

Python 解决方案：

import numpy as np
import time

def compute(A, AU):
    A[:,:] = 0
    # Put every 8 columns in AU into A
    for i in range(A.shape[1]):
        A[:, i//8] = np.bitwise_or(A[:, i//8], np.left_shift(AU[:, i], i % 8))

colm = int(20000/8)
rows = 10000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
start_time = time.time()

A = np.empty((rows,colm), dtype=np.uint8)

start_time = time.time()

compute(A, AU)
    
end_time = time.time()
print(end_time - start_time)

在 1/2 秒内打包位

C 中的相同代码：

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char* argv[]) {

    int colm = 200000/8;
    int rows = 10000;
    int cols = colm*8;
    unsigned char *A = (unsigned char *)malloc(rows * colm * sizeof(unsigned char)); 
    unsigned char *AU = (unsigned char *)malloc(rows * cols * sizeof(unsigned char)); 
    int i, j;
    clock_t begin;
    clock_t end;
    double time_spent;

    begin = clock();
        
    // Create AU
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            *(AU + i*cols + j) = (unsigned char) (rand() & 0x01);  
            
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%lf seconds to create AU\n", time_spent);
            
    begin = clock();
    
    // Create a zeroed out A
    for (i = 0; i < rows; i++)
        for (j = 0; j < colm; j++)
            *(A + i*colm + j) = (unsigned char) 0;  

    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%lf seconds to create A\n", time_spent);

    begin = clock();
            
    // Pack into bits
    for (i = 0; i < rows; i++)
        for (j = 0; j < colm; j++) {
            int au_idx = i*cols + j*8;
            for (int k=0; k<8; k++)
                *(A + i*colm + j) |= *(AU + au_idx + k) << k;
            }
            
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%lf seconds to pack\n", time_spent);
            

    free(A); 
    free(AU);
    return 0;
}

使用 colm=200,000 进行测试。位打包需要 0.27 秒，而 Jérôme Richard 提供的优化 Python 版本需要 0.64 秒。对 rand() 的调用非常昂贵，并且大大增加了整体运行时间。在内存方面，C 版本的峰值为 2GB，而 Python 为 4.2GB。进一步的代码优化和并行化肯定会减少运行时间。

Julia version:

using Random
colm = 200000÷8
rows = 30000
cols = colm*8

AU = zeros(UInt8, (rows, cols))

rand!(AU)
AU .&= 0x01

A = zeros(UInt8, (rows, colm))

function compute(A, AU)
    for i in 1:size(A)[2]
        start_col = (i-1) << 3
        @views A[:, i] .=  AU[:, start_col + 1] .| 
                   (AU[:, start_col + 2] .<< 1) .|
                   (AU[:, start_col + 3] .<< 2) .|
                   (AU[:, start_col + 4] .<< 3) .|
                   (AU[:, start_col + 5] .<< 4) .|
                   (AU[:, start_col + 6] .<< 5) .|
                   (AU[:, start_col + 7] .<< 6) .|
                   (AU[:, start_col + 8] .<< 7)        
    end
end

@time compute(A, AU)

Julia 在性能方面的扩展性很好。 colm=25,000 和 rows=30,000 的结果：

Language  Total Run Time (secs)   Bit Packing Time (secs)  Peak Memory (GB)
Python    22.1                    3.0                      6
Julia     11.7                    1.2                      6

【讨论】：

它在第二个循环中产生错误TypeError: ufunc 'bitwise_or' output (typecode 'h') could not be coerced to provided output parameter (typecode 'B') according to the casting rule ''same_kind''
您可能必须通过指定数据类型来启动零字节的 numpy 数组。
你用的是什么编译器？我在 Numba 和 Clang 上使用 clang Code.c -O3 -march=native -o Code.exe 得到了完全相同的结果（都在 490 毫秒左右，都是单线程的），在 Numba 并行版本上使用了大约 100 毫秒。（使用'void(uint8[:,::1],int8[:,::1])' 或不输入签名）
@max9111 我正在使用 gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0。我得到了一个包含 30,000 行的核心转储。也许我的代码坏了（自 90 年代以来一直没有做 C 代码）。
@Tarik 你的 C 和 Julia 代码给出了错误的结果：正确的移位 (7-k) 而不是 k。我认为 Python 代码也有同样的问题。此外，请注意输入 AU 矩阵的类型与 OP 需要的并不完全相同：它是 int8 而不是 uint8。这可能会对性能产生影响。

【解决方案2】：

警告：您尝试分配大量内存：大约 2 TB 内存您可能没有。 p>

假设您有足够的内存或者您可以减小数据集的大小，您可以使用 Numba JIT 编写一个更快的实现。此外，您可以并行化代码并用无分支实现替换慢速条件，以显着加快计算速度，因为AU 填充了二进制值。最后，您可以展开处理k 的内部循环，以使代码更快。这是最终的实现：

import numpy as np
import numba as nb
colm = int(2000000/8)
rows = 1000000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
A = np.empty((rows,colm), dtype=np.uint8)

@nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def compute(A, AU):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = AU[i,offset] << 7
            res |= AU[i,offset+1] << 6
            res |= AU[i,offset+2] << 5
            res |= AU[i,offset+3] << 4
            res |= AU[i,offset+4] << 3
            res |= AU[i,offset+5] << 2
            res |= AU[i,offset+6] << 1
            res |= AU[i,offset+7]
            A[i,j] = res

compute(A, AU)

在我的机器上，此代码比在较小数据集（使用 colm=int(20000/8) 和 rows=10000）上的原始实现快 37851 倍。原始实现耗时 6min3s，优化后耗时 9.6ms。

这段代码在我的机器上是内存绑定的。对于当前输入，此代码接近最优，因为它大部分时间都在读取AU 输入矩阵。一个很好的附加优化是将AU 矩阵“压缩”为更小的矩阵（如果可能）。

【讨论】：

如果我错了，请纠正我，但20,000 / 8 * 10,000 = 25,000,000。我们在这里谈论的是 25MB。算不上什么壮举。不是要批评你的工作，但如果这就是 Python 能做的全部，是时候使用 C 或 Julia。
@Tarik 您忘记将大小乘以 8（请参阅 cols = int(colm*8)）。数组的大小为：rows * cols = 10000 * int(int(20000/8)*8) = 10000 * 20000 = 200 MB 和 rows * colm = 10000 * int(20000/8) = 25 MB。在 9.6 毫秒内读取/写入 225 MB 意味着内存吞吐量为 23.4 GB/s，这是相当高的。更不用说我有一个英特尔处理器。由于英特尔处理器缓存的写入分配行为，内存写入实际上会导致从内存中读取。因此，真正的内存吞吐量为 26 GB/s。这接近我可以达到的实际最大吞吐量（~38 GB/s）。
@Tarik 至于内存大小，我只是想做个小测试。我可以尝试 2.25 GB 的测试，但原始版本需要 1 小时来计算……优化版本在这个 2.25 GB 数据集上大约需要 0.13 秒。对我来说似乎非常好（对于普通的台式机）。
您也可以声明 c-contiguous 数组 'void(uint8[:,::1],int8[:,::1])' 或不带签名。差异不是很大（在单线程版本中只有大约 10%），但显式声明非连续数组几乎不是一个好主意。这通常会阻止 SIMD 向量化并且可能会慢得多。
@JérômeRichard 您提供的代码非常好。我不能用 Python 做得更好。我正在考虑搬到 Julia，但到目前为止我的结果好坏参半。 Julia 可以更好地扩展，但匹配优化的 Python 代码有时具有挑战性。

【解决方案3】：

昨天最初阅读您的帖子后，我实际上计划自己使用即时编译器 numba 编写一个例程，但 Jérôme 比我快，并为您提供了一个出色的解决方案。但是我有一个替代方案可以提供：当已经存在一个完全相同的 numpy 函数 时，为什么要重新发明轮子：numpy.packbits。

import numpy as np
A = np.packbits(AU,axis=-1)

完成这项工作。从我的测试来看，它似乎比 Jérôme 的版本慢了很多，但无论如何都比你的初始版本快。

【讨论】：