SIMD 矩阵乘法导致 segfault 或 segabrt答案

【问题标题】：SIMD matrix multiplication causing segfault or segabrtSIMD 矩阵乘法导致 segfault 或 segabrt
【发布时间】：2021-12-31 14:55:28
【问题描述】：

我从这个链接中得到启发，编写了一个矩阵乘法器，它是 4 的倍数：SSE matrix-matrix multiplication

我想出了一些类似的东西，但我观察到，如果 j 的 for 循环像建议代码中那样增加 4，则每 4 列只填充 1 列（这是有道理的）。我可以将for 循环减少2，结果是只填充了一半的列。

所以从逻辑上讲，解决方案应该是只将循环增加 1，但是当我在代码中进行更改时，如果我使用_mm_store_ps，我会收到 segfault 错误，如果我使用@，我会收到数据 corrupted size vs. prev_size 987654329@，这让我相信数据根本不对齐。

我应该如何以及如何对齐数据以不导致此类错误并填充结果矩阵？

这是我目前的代码：

void mat_mult(Matrix A, Matrix B, Matrix C, n) {
   for(int i = 0; i < n; ++i) {
   for(int j = 0; j < n; j+=1) {   
       __m128 vR = _mm_setzero_ps();
       for(int k = 0; k < n; k++) {  
           __m128 vA = _mm_set1_ps(A(i,k));  
           __m128 vB = _mm_loadu_ps(&B(k,j));  
           vR = _mm_add_ss(vR,vA*vB);
       }
       _mm_storeu_ps(&C(i,j), vR);
   }
 }
}

【问题讨论】：

Matrix 类是如何定义的？请提供minimal reproducible example。
我无法在此处发布整个课程，但几乎所有数据都存储在与static_cast<float*>(aligned_alloc(64, n * n * sizeof(float))) 对齐的浮点指针中。另外还有() 的运算符可以访问矩阵的元素。
始终使用调试器来查看您的段错误的确切位置。如果它在 movaps 指令上（或带有内存源操作数的传统 SSE 数学指令，如 addps xmm0, [rdi]），则可能存在对齐问题。否则它不是，就像当你通过存储在你的数组之外来破坏 malloc 簿记信息时，它会在其他地方出错。
您不能在矩阵乘法上使用 SIMD，除非您转置第二个矩阵。想想看。当您从矩阵 1 中提取四个连续值时，您需要从矩阵 2 中相应的四个元素是 VERTICAL。它们不是连续的。由于转置的成本通常很高，这使得它成为并行化的糟糕选择。
@TimRoberts：或者除非您重新构造循环以沿矩阵 和 += 的行运行到目标中的内存，而不是尝试累积一个向量行*列点积。这是一种众所周知的策略，例如在 What Every Programmer Should Know About Memory? 的附录中显示（以及缓存阻止版本）。每个 FMA 需要 3 次加载和一次存储并不是很好，但这可以稍微摊销。您关于写入数组末尾的答案可能是正确的；我对此表示赞成。

标签： c++ matrix simd sse

【解决方案1】：

我更正了您的代码，还实现了很多其他补充代码来完全运行测试和打印输出，包括我需要从头开始实现 Matrix 类。以下代码可以用 C++11 标准编译。

对您的函数的主要更正是：您应该单独处理 B 列数不是 4 的倍数的情况，这种不均匀的尾部情况应该由单独的循环处理，您实际上应该以 4 的步长运行 j 循环（因为 128 位 SSE float-32 寄存器包含 4 个浮点数），您应该使用 _mm_mul_ps(vA, vB) 而不是 vA * vB。

您的代码的主要错误是，您应该使用_mm_add_ps() 而不是您的_mm_add_ss()，因为您需要单独添加的不是单个值，而是其中的 4 个。仅由于使用了 _mm_add_ss()，您才发现 4 列中只有 1 列被填充（其余 3 列为零）。

或者，您可以使用_mm_load_ss() 代替_mm_loadu_ps() 和_mm_store_ss() 代替_mm_storeu_ps() 来修复您的代码工作。仅在此修复后，您的代码将给出正确的结果，但速度会很慢，不会比常规的非 SSE 解决方案快。要真正提高速度，您必须在任何地方仅使用 ..._ps() 指令，还要正确处理非 4 倍数的情况。

因为您不处理 B 列不是 4 的倍数的情况，因此您的程序段错误，您只需将内存存储在矩阵 C 的范围之外。

你还问了一个关于对齐的问题。永远不要像_mm_store_ps()/_mm_load_ps()那样使用对齐的存储/加载，总是使用_mm_storeu_ps()/_mm_loadu_ps()。因为未对齐的访问指令保证与相同内存指针值的对齐访问指令具有相同的速度。但是对齐的指令可能会出现段错误。所以未对齐总是更好，同样的速度，永远不会出现段错误。它曾经在旧 CPU 上以更快的速度对齐指令，但现在它们在 CPU 中以完全相同的速度实现。对齐指令不会产生任何利润，只会产生段错误。但是，如果您想确保程序的内存指针始终对齐，您可能仍想使用对齐指令来故意进行段错误。

我还使用矩阵的参考慢速乘法实现了一个单独的函数，以便运行参考测试来检查快速 (SSE) 乘法的正确性。

作为@АлексейНеудачин 的commented out，我以前版本的Matrix 类为数组分配未对齐的内存，现在我实现了新的帮助类AlignmentAllocator，它确保Matrix 分配对齐的内存，这个分配器由std::vector<> 使用存储底层 Matrix 的数据。

包含所有更正、测试和控制台输出以及所有额外补充代码的完整代码如下。另见代码后的控制台输出，我确实打印了由两个不同的乘法函数产生的两个矩阵，以便可以直观地比较两个矩阵。所有测试用例都是随机生成的。向下滚动我的代码以查看您的固定功能mat_mult()。如果你想在线查看/运行我的代码，也可以点击Try it online!链接。

Try it online!

#include <cmath>
#include <iostream>
#include <vector>
#include <random>
#include <stdexcept>
#include <string>
#include <iomanip>
#include <cstdlib>

#include <malloc.h>
#include <immintrin.h>

using FloatT = float;

template <typename T, std::size_t N>
class AlignmentAllocator {
public:
    typedef T value_type;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;
    typedef T * pointer;
    typedef const T * const_pointer;
    typedef T & reference;
    typedef const T & const_reference;

public:
    inline AlignmentAllocator() throw() {}
    template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) throw() {}
    inline ~AlignmentAllocator() throw() {}
    inline pointer adress(reference r) { return &r; }
    inline const_pointer adress(const_reference r) const { return &r; }
    inline pointer allocate(size_type n);
    inline void deallocate(pointer p, size_type);
    inline void construct(pointer p, const value_type & v) { new (p) value_type(v); }
    inline void destroy(pointer p) { p->~value_type(); }
    inline size_type max_size() const throw() { return size_type(-1) / sizeof(value_type); }
    template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
    bool operator!=(const AlignmentAllocator<T, N> & other) const { return !(*this == other); }
    bool operator==(const AlignmentAllocator<T, N> & other) const { return true; }
};

template <typename T, std::size_t N>
inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
    #ifdef _MSC_VER
        auto p = (pointer)_aligned_malloc(n * sizeof(value_type), N);
    #else
        auto p = (pointer)aligned_alloc(N, n * sizeof(value_type));
    #endif
    if (!p)
        throw std::bad_alloc();
    return p;
}

template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
    #ifdef _MSC_VER
        _aligned_free(p);
    #else
        std::free(p);
    #endif
}

static size_t constexpr MatrixAlign = 64;

template <typename T, size_t Align = MatrixAlign>
using AlignedVector = std::vector<T, AlignmentAllocator<T, Align>>;

class Matrix {
public:
    Matrix(size_t rows, size_t cols)
        : rows_(rows), cols_(cols) {
        cols_aligned_ = (sizeof(FloatT) * cols_ + MatrixAlign - 1)
            / MatrixAlign * MatrixAlign / sizeof(FloatT);
        Clear();
        if (size_t(m_.data()) % 64 != 0 ||
                (cols_aligned_ * sizeof(FloatT)) % 64 != 0)
            throw std::runtime_error("Matrix was allocated unaligned!");
    }
    Matrix & Clear() {
        m_.clear();
        m_.resize(rows_ * cols_aligned_);
        return *this;
    }
    FloatT & operator() (size_t i, size_t j) {
        if (i >= rows_ || j >= cols_)
            throw std::runtime_error("Matrix index (" +
            std::to_string(i) + ", " + std::to_string(j) + ") out of bounds (" +
            std::to_string(rows_) + ", " + std::to_string(cols_) + ")!");
        return m_[i * cols_aligned_ + j];
    }
    FloatT const & operator() (size_t i, size_t j) const {
        return const_cast<Matrix &>(*this)(i, j);
    }
    size_t Rows() const { return rows_; }
    size_t Cols() const { return cols_; }
    bool Equal(Matrix const & b, int round = 7) const {
        if (Rows() != b.Rows() || Cols() != b.Cols())
            return false;
        FloatT const eps = std::pow(FloatT(10), -round);
        for (size_t i = 0; i < Rows(); ++i)
            for (size_t j = 0; j < Cols(); ++j)
                if (std::fabs((*this)(i, j) - b(i, j)) > eps)
                    return false;
        return true;
    }
private:
    size_t rows_ = 0, cols_ = 0, cols_aligned_ = 0;
    AlignedVector<FloatT> m_;
};

void mat_print(Matrix const & A, int round = 7, size_t width = 0) {
    FloatT const pow10 = std::pow(FloatT(10), round);
    for (size_t i = 0; i < A.Rows(); ++i) {
        for (size_t j = 0; j < A.Cols(); ++j)
            std::cout << std::setprecision(round) << std::fixed << std::setw(width)
                << std::right << (std::round(A(i, j) * pow10) / pow10) << " ";
        std::cout << std::endl;;
    }
}

void mat_mult(Matrix const & A, Matrix const & B, Matrix & C) {
    if (A.Cols() != B.Rows())
        throw std::runtime_error("Number of A.Cols and B.Rows don't match!");
    if (A.Rows() != C.Rows() || B.Cols() != C.Cols())
        throw std::runtime_error("Wrong C rows, cols!");
    
    for (size_t i = 0; i < A.Rows(); ++i)
        for (size_t j = 0; j < B.Cols() - B.Cols() % 4; j += 4) {
            auto sum = _mm_setzero_ps();
            for (size_t k = 0; k < A.Cols(); ++k)
                sum = _mm_add_ps(
                    sum,
                    _mm_mul_ps(
                        _mm_set1_ps(A(i, k)),
                        _mm_loadu_ps(&B(k, j))
                    )
                );
            _mm_storeu_ps(&C(i, j), sum);
        }
        
    if (B.Cols() % 4 == 0)
        return;
        
    for (size_t i = 0; i < A.Rows(); ++i)
        for (size_t j = B.Cols() - B.Cols() % 4; j < B.Cols(); ++j) {
            FloatT sum = 0;
            for (size_t k = 0; k < A.Cols(); ++k)
                sum += A(i, k) * B(k, j);
            C(i, j) = sum;
        }
}

void mat_mult_slow(Matrix const & A, Matrix const & B, Matrix & C) {
    if (A.Cols() != B.Rows())
        throw std::runtime_error("Number of A.Cols and B.Rows don't match!");
    if (A.Rows() != C.Rows() || B.Cols() != C.Cols())
        throw std::runtime_error("Wrong C rows, cols!");
        
    for (size_t i = 0; i < A.Rows(); ++i)
        for (size_t j = 0; j < B.Cols(); ++j) {
            FloatT sum = 0;
            for (size_t k = 0; k < A.Cols(); ++k)
                sum += A(i, k) * B(k, j);
            C(i, j) = sum;
        }
}

void mat_fill_random(Matrix & A) {
    std::mt19937_64 rng{std::random_device{}()};
    std::uniform_real_distribution<FloatT> distr(-9.99, 9.99);
    for (size_t i = 0; i < A.Rows(); ++i)
        for (size_t j = 0; j < A.Cols(); ++j)
            A(i, j) = distr(rng);
}

int main() {
    try {
        {
            Matrix a(17, 23), b(23, 19), c(17, 19), d(c.Rows(), c.Cols());
            
            mat_fill_random(a);
            mat_fill_random(b);
            mat_mult_slow(a, b, c);
            
            mat_mult(a, b, d);
            
            if (!c.Equal(d, 5))
                throw std::runtime_error("Test failed, c != d.");
        }
                
        {
            Matrix a(3, 7), b(7, 5), c(3, 5), d(c.Rows(), c.Cols());
            
            mat_fill_random(a);
            mat_fill_random(b);
            mat_mult_slow(a, b, c);
            
            mat_mult(a, b, d);
            
            mat_print(c, 3, 8);
            std::cout << std::endl;
            mat_print(d, 3, 8);
        }
            
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

输出：

 -37.177 -114.438   36.094  -49.689 -139.857 
  22.113 -127.210  -94.434  -14.363   -6.336 
  71.878   94.234   33.372   32.573   73.310 

 -37.177 -114.438   36.094  -49.689 -139.857 
  22.113 -127.210  -94.434  -14.363   -6.336 
  71.878   94.234   33.372   32.573   73.310

【讨论】：

评论不用于扩展讨论；这个对话是moved to chat。