这似乎是问题Bitwise transpose of 8 bytes 的概括。这个问题只是关于 8x8 转置,所以你问的有点不同。但是您的问题在Hacker's Delight 一书的第 7.3 节中也得到了解答(您可能能够在 Google 图书上看到 the relevant pages)。那里显示的代码显然源自Guy Steele。
Hacker's Delight website 仅包含书中8x8 和32x32 案例的源代码,但后者可以简单地概括为您的 64x64 案例:
#include <stdint.h>
void
transpose64(uint64_t a[64]) {
int j, k;
uint64_t m, t;
for (j = 32, m = 0x00000000FFFFFFFF; j; j >>= 1, m ^= m << j) {
for (k = 0; k < 64; k = ((k | j) + 1) & ~j) {
t = (a[k] ^ (a[k | j] >> j)) & m;
a[k] ^= t;
a[k | j] ^= (t << j);
}
}
}
其工作方式是该函数连续交换较小的位块,从 32x32 块开始(不转置位在这些块中),然后在那些 32x32 块中交换适当的16x16 块等。保存块大小的变量是j。因此,外循环有j连续取值32、16、8、4、2和1,这意味着外循环运行了六次。内部循环运行在 half 位的行上,变量k 中的给定位等于零的行。当j 是 32 时,这些是 0-31 行,当 j 是 16 时,这些是 0-15 和 32-47 行,等等。循环的内部部分一起运行 6*32 = 192 次。在这个内部部分内部发生的是掩码m 确定应该交换哪些位,在t 中计算异或或那些位,并且使用异或的位列表来更新位两个地方都合适。
这本书(和网站)也有这个代码的一个版本,其中这些循环都已展开,并且掩码 m 不是计算的,而是分配的。我想这取决于寄存器的数量和指令缓存的大小这是否是一种改进?
为了测试它是否有效,假设我们定义了一些位模式,比如:
uint64_t logo[] = {
0b0000000000000000000000000000000000000000000100000000000000000000,
0b0000000000000000000000000000000000000000011100000000000000000000,
0b0000000000000000000000000000000000000000111110000000000000000000,
0b0000000000000000000000000000000000000001111111000000000000000000,
0b0000000000000000000000000000000000000000111111100000000000000000,
0b0000000000000000000000000000000000000000111111100000000000000000,
0b0000000000000000000000000000000000000000011111110000000000000000,
0b0000000000000000000000000000000000000000001111111000000000000000,
0b0000000000000000000000000000000000000000001111111100000000000000,
0b0000000000000000000000000000000010000000000111111100000000000000,
0b0000000000000000000000000000000011100000000011111110000000000000,
0b0000000000000000000000000000000111110000000001111111000000000000,
0b0000000000000000000000000000001111111000000001111111100000000000,
0b0000000000000000000000000000011111111100000000111111100000000000,
0b0000000000000000000000000000001111111110000000011111110000000000,
0b0000000000000000000000000000000011111111100000001111111000000000,
0b0000000000000000000000000000000001111111110000001111111100000000,
0b0000000000000000000000000000000000111111111000000111111100000000,
0b0000000000000000000000000000000000011111111100000011111110000000,
0b0000000000000000000000000000000000001111111110000001111111000000,
0b0000000000000000000000000000000000000011111111100001111111100000,
0b0000000000000000000000001100000000000001111111110000111111100000,
0b0000000000000000000000001111000000000000111111111000011111110000,
0b0000000000000000000000011111110000000000011111111100001111100000,
0b0000000000000000000000011111111100000000001111111110001111000000,
0b0000000000000000000000111111111111000000000011111111100110000000,
0b0000000000000000000000011111111111110000000001111111110000000000,
0b0000000000000000000000000111111111111100000000111111111000000000,
0b0000000000000000000000000001111111111111100000011111110000000000,
0b0000000000000000000000000000011111111111111000001111100000000000,
0b0000000000000000000000000000000111111111111110000011000000000000,
0b0000000000000000000000000000000001111111111111100000000000000000,
0b0000000000000000000000000000000000001111111111111000000000000000,
0b0000000000000000000000000000000000000011111111111100000000000000,
0b0000000000000000000111000000000000000000111111111100000000000000,
0b0000000000000000000111111110000000000000001111111000000000000000,
0b0000000000000000000111111111111100000000000011111000000000000000,
0b0000000000000000000111111111111111110000000000110000000000000000,
0b0000000000000000001111111111111111111111100000000000000000000000,
0b0000000000000000001111111111111111111111111111000000000000000000,
0b0000000000000000000000011111111111111111111111100000000000000000,
0b0000001111110000000000000001111111111111111111100000111111000000,
0b0000001111110000000000000000000011111111111111100000111111000000,
0b0000001111110000000000000000000000000111111111100000111111000000,
0b0000001111110000000000000000000000000000001111000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111110000001111111111111111111111111111000000111111000000,
0b0000001111110000001111111111111111111111111111000000111111000000,
0b0000001111110000001111111111111111111111111111000000111111000000,
0b0000001111110000001111111111111111111111111111000000111111000000,
0b0000001111110000001111111111111111111111111111000000111111000000,
0b0000001111110000001111111111111111111111111111000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111110000000000000000000000000000000000000000111111000000,
0b0000001111111111111111111111111111111111111111111111111111000000,
0b0000001111111111111111111111111111111111111111111111111111000000,
0b0000001111111111111111111111111111111111111111111111111111000000,
0b0000001111111111111111111111111111111111111111111111111111000000,
0b0000001111111111111111111111111111111111111111111111111111000000,
0b0000001111111111111111111111111111111111111111111111111111000000,
};
然后我们调用transpose32 函数并打印生成的位模式:
#include <stdio.h>
void
printbits(uint64_t a[64]) {
int i, j;
for (i = 0; i < 64; i++) {
for (j = 63; j >= 0; j--)
printf("%c", (a[i] >> j) & 1 ? '1' : '0');
printf("\n");
}
}
int
main() {
transpose64(logo);
printbits(logo);
return 0;
}
然后将其作为输出:
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000011000000011111100000111111
0000000000000000000000000000000000111111000000011111100000111111
0000000000000000000000000000000000111111000000011111100000111111
0000000000000000000000000000000000111111000000011111100000111111
0000000000000000000000000100000000011111000000011111100000111111
0000000000000000000000011110000000011111100000011111100000111111
0000000000000000000001111110000000011111100000011111100000111111
0000000000000000000001111111000000011111100000011111100000111111
0000000000000000000000111111000000011111100000011111100000111111
0000000000000000000000111111100000001111110000011111100000111111
0000000000000000000000011111100000001111110000011111100000111111
0000000000000100000000011111110000001111110000011111100000111111
0000000000001110000000001111110000001111110000011111100000111111
0000000000011110000000001111111000001111110000011111100000111111
0000000001111111000000000111111000000111111000011111100000111111
0000000000111111100000000111111100000111111000011111100000111111
0000000000111111110000000011111100000111111000011111100000111111
0000000000011111111000000011111100000111111000011111100000111111
0000000000001111111100000001111110000011111000011111100000111111
0000000000000111111100000001111110000011111100011111100000111111
0000000000000011111110000000111111000011111100011111100000111111
0001000000000001111111000000111111000011111100011111100000111111
0011110000000001111111100000111111100011111100011111100000111111
0111111000000000111111110000011111100001111100011111100000111111
0111111110000000011111111000011111110001111110011111100000111111
1111111111000000001111111000001111110001111110011111100000111111
0011111111100000000111111100001111111001111110011111100000111111
0001111111111000000011111110000111111001111110011111100000111111
0000111111111100000011111111000111111100111100000000000000111111
0000001111111110000001111111100011111100000000000000000000111111
0000000111111111100000111111110011111000000000000000000000111111
0000000011111111110000011111110001100000000000000000000000111111
0000000000111111111000001111111000000000000000000000000000111111
0000000000011111111110000111111000000000000000000000000000111111
0000000000001111111111000111110000000000011111111111111111111111
0000000000000011111111100011100000000000011111111111111111111111
0000000000000001111111111001000000000000011111111111111111111111
0000000000000000111111111100000000000000011111111111111111111111
0000000000000000001111111100000000000000011111111111111111111111
0000000000000000000111111000000000000000011111111111111111111111
0000000000000000000011110000000000000000000000000000000000000000
0000000000000000000000100000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
正如我们所希望的那样,这很好地翻转了。
编辑:
这实际上并不是您真正要求的,因为您要求的是此代码的非破坏性版本。您可以通过将 32x32 块的第一次交换从 x 转到 y 来实现这一点。例如,您可能会执行以下操作:
void
non_destructive_transpose64(uint64_t x[64], uint64_t y[64]) {
int j, k;
uint64_t m, t;
for (k = 0; k < 64; k += 2) {
((uint32_t *) y)[k] = ((uint32_t *) x)[k ^ 64 + 1];
((uint32_t *) y)[k + 1] = ((uint32_t *) x)[k + 1];
}
for (; k < 128; k += 2) {
((uint32_t *) y)[k] = ((uint32_t *) x)[k];
((uint32_t *) y)[k + 1] = ((uint32_t *) x)[k ^ 64];
}
for (j = 16, m = 0x0000FFFF0000FFFF; j; j >>= 1, m ^= m << j) {
for (k = 0; k < 64; k = ((k | j) + 1) & ~j) {
t = (y[k] ^ (y[k | j] >> j)) & m;
y[k] ^= t;
y[k | j] ^= (t << j);
}
}
}
与其他版本的代码不同,无论架构的字节序如何,它都不工作。另外,我知道 C 标准不允许您将 uint64_t 的数组作为 uint32_t 的数组来访问。但是,我喜欢这样操作时,移动块循环的第一次迭代不需要移位或异或。