【Question title】: Open XML document protection implementation (documentProtection class)
【Posted】: 2021-04-28 20:04:30
【Question】:

I'm trying to implement the Open XML documentProtection hash protection of an MS Word (2019) document in Python, in order to test the hashing algorithm. So I created a Word document and protected it against editing with this password: johnjohn. Then, opening the document as ZIP/XML, I see the following in the documentProtection section:

<w:documentProtection w:edit="readOnly" w:enforcement="1" w:cryptProviderType="rsaAES" w:cryptAlgorithmClass="hash" w:cryptAlgorithmType="typeAny" w:cryptAlgorithmSid="14" w:cryptSpinCount="100000" w:hash="pVjR9ktO9vlxijXcMPlH+4PLwD4Xwy1aqbNQOFmWaSpvBjipNh//T8S3nBhq6HRoRVfWL6s/+NdUCPTxUr0vZw==" w:salt="pH1TDVHSfGBxkd3Q88UNhQ==" />

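According to the Open XML documentation (ECMA-376-1:2016 #17.15.1.29):

cryptAlgorithmSid="14" points to the SHA-512 algorithm
cryptSpinCount="100000" means that the hashing must be done in 100k rounds, using the following algorithm (quoted from the standard above):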

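Specifies the number of times the hashing function shall be iteratively run (runs using each iteration's result plus a 4 byte value (0-based, little endian) containing the number of the iteration as the input for the next iteration) when attempting to compare a user-supplied password with the value stored in the hashValue attribute.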
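
Read literally, the quoted sentence leaves open how the 4-byte value is combined with the previous result. A minimal sketch of one reading, assuming the counter is appended to the previous digest before re-hashing (`spin` is a hypothetical helper name, not something defined by the standard):

```python
import hashlib
import struct


def spin(digest: bytes, rounds: int, alg: str = "sha512") -> bytes:
    """Re-hash `digest` `rounds` times, appending the round number each time."""
    for i in range(rounds):
        # 0-based round number as a 4-byte little-endian unsigned integer
        counter = struct.pack("<I", i)
        # Assumption: "plus" means the counter follows the previous digest
        digest = hashlib.new(alg, digest + counter).digest()
    return digest


seed = hashlib.sha512(b"seed").digest()
print(len(spin(seed, 3)))  # SHA-512 digests stay 64 bytes long
```

Under this reading the counter is part of the input to each round, so it never appears in the final digest and nothing has to be trimmed afterwards.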

The BASE64-encoded salt used for hashing ("pH1TDVHSfGBxkd3Q88UNhQ==") is prepended to the original password. The target BASE64-encoded hash must be "pVjR9ktO9vlxijXcMPlH+4PLwD4Xwy1aqbNQOFmWaSpvBjipNh//T8S3nBhq6HRoRVfWL6s/+NdUCPTxUr0vZw=="
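
As a quick sanity check on these attribute values, here is a sketch that reads them with the standard library only. The `w:settings` wrapper element, the `read_protection` helper and the `SID_TO_HASHLIB` map are illustrative assumptions; only the mapping of sid 14 to SHA-512 comes from the text above.

```python
import base64
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The element from the question, wrapped in a root so the w: prefix resolves
# (assumed wrapper; some attributes omitted for brevity).
FRAGMENT = (
    f'<w:settings xmlns:w="{W_NS}">'
    '<w:documentProtection w:edit="readOnly" w:enforcement="1" '
    'w:cryptAlgorithmSid="14" w:cryptSpinCount="100000" '
    'w:hash="pVjR9ktO9vlxijXcMPlH+4PLwD4Xwy1aqbNQOFmWaSpvBjipNh//'
    'T8S3nBhq6HRoRVfWL6s/+NdUCPTxUr0vZw==" '
    'w:salt="pH1TDVHSfGBxkd3Q88UNhQ=="/>'
    '</w:settings>'
)

# Only sid 14 -> SHA-512 is stated in the text; extend the map as needed.
SID_TO_HASHLIB = {14: "sha512"}


def read_protection(xml_text):
    """Return the hash parameters of the documentProtection element."""
    root = ET.fromstring(xml_text)
    elem = root.find(f"{{{W_NS}}}documentProtection")

    def attr(name):
        # Attributes are namespaced too, so look them up in Clark notation
        return elem.get(f"{{{W_NS}}}{name}")

    return {
        "alg": SID_TO_HASHLIB[int(attr("cryptAlgorithmSid"))],
        "spin": int(attr("cryptSpinCount")),
        "salt": base64.b64decode(attr("salt")),
        "hash": base64.b64decode(attr("hash")),
    }


params = read_protection(FRAGMENT)
print(params["alg"], params["spin"], len(params["salt"]), len(params["hash"]))
# prints: sha512 100000 16 64
```

The decoded hash is 64 bytes, which is the SHA-512 digest size, and the salt is 16 raw bytes rather than text.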

So my Python script tries to generate the same hash with the algorithm described above, as follows:

import hashlib
import base64
import struct

TARGET_HASH = 'pVjR9ktO9vlxijXcMPlH+4PLwD4Xwy1aqbNQOFmWaSpvBjipNh//T8S3nBhq6HRoRVfWL6s/+NdUCPTxUr0vZw=='

TARGET_SALT = 'pH1TDVHSfGBxkd3Q88UNhQ=='
bsalt = base64.b64decode(TARGET_SALT)

def hashit(what, alg='sha512', **kwargs):
    if alg == 'sha1':
        return hashlib.sha1(what)
    elif alg == 'sha512':
        return hashlib.sha512(what)
    # etc...
    else:
        raise Exception(f'Unsupported hash algorithm: {alg}')

def gethash(data, salt=None, alg='sha512', iters=100000, base64result=True, returnstring=True):
    # encode password in UTF-16LE
    # ECMA-376-1:2016 17.15.1.29 (p. 1026)
    if isinstance(data, str): data = data.encode('utf-16-le')
    
    # prepend salt if provided
    if not salt is None:
        if isinstance(salt, str): salt = salt.encode('utf-16-le')
        ghash = salt + data
    else:
        ghash = data
    
    # hash iteratively for 'iters' rounds
    for i in range(iters):
        try:
            # next hash = hash(previous data) + 4-byte integer (previous round number) with LE byte ordering
            # ECMA-376-1:2016 17.15.1.29 (p. 1020)
            ghash = hashit(ghash, alg).digest() + struct.pack('<I', i)
        except Exception as err:
            print(err)
            break
    
    # remove trailing round number bytes
    ghash = ghash[:-4]

    # BASE64 encode if requested
    if base64result:
        ghash = base64.b64encode(ghash)
    # return as an ASCII string if requested
    if returnstring:
        ghash = ghash.decode()
        
    return ghash

But when I run

print(gethash('johnjohn', bsalt))

I get the following hash, which is not equal to the target one:

G47RT4/+JdE6pnrP6MqUKa3JyL8abeYSCX+E4+9J+6shiZqImBJ8M6bb+IMKEdvKd6+9dVnQ3oeOsgQz/aCdcQ==

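Is my implementation wrong somewhere, or do you think there is a difference in the low-level hash function implementation (Python's hashlib vs. Open XML)?

UPDATE

I realized that Word uses a legacy algorithm to pre-process the password (for compatibility with older versions). This algorithm is described in detail in Part 4 of ECMA-376-1:2016 (Transitional Migration Features, #14.8.1 "Legacy Password Hash Algorithm"). So I managed to write a script that reproduces the official ECMA example: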

def strtobytes(s, trunc=15):    
    b = s.encode('utf-16-le')
    # remove BOM symbol if present
    if b[0] == 0xfeff: b = b[1:]    
    pwdlen = min(trunc, len(s))
    if pwdlen < 1: return None
    return bytes([b[i] or b[i+1] for i in range(0, pwdlen * 2, 2)])

def process_pwd(pwd):
    # 1. PREPARE PWD STRING (TRUNCATE, CONVERT TO BYTES)
    pw = strtobytes(pwd) if isinstance(pwd, str) else pwd[:15]
    pwdlen = len(pw)
    
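    # 2. HIGH WORD CALC
    # InitialCodeArray and EncryptionMatrix are the constant tables of the
    # Legacy Password Hash Algorithm (their definitions are not shown here)
    HW = InitialCodeArray[pwdlen - 1]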
    for i in range(pwdlen):
        r = 15 - pwdlen + i
        for ibit in range(7):
            if (pw[i] & (0x0001 << ibit)):                
                HW ^= EncryptionMatrix[r][ibit]
    
    # 3. LO WORD CALC
    LW = 0
    for i in reversed(range(pwdlen)):
        LW = (((LW >> 14) & 0x0001) | ((LW << 1) & 0x7FFF)) ^ pw[i]
    LW = (((LW >> 14) & 0x0001) | ((LW << 1) & 0x7FFF)) ^ pwdlen ^ 0xCE4B    
    
    # 4. COMBINE AND REVERSE
    return bytes([LW & 0xff, LW >> 8, HW & 0xff, HW >> 8])

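So when I execute process_pwd('Example'), I get the value given in the ECMA example (0x7EEDCE64). The hashing function was modified as well (the initial SALT + HASH step should not be included in the main iteration loop, as I found on a forum):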

def gethash(data, salt=None, alg='sha512', iters=100000, base64result=True, returnstring=True):
    
    def hashit(what, alg='sha512'):
        return getattr(hashlib, alg)(what)
    
    # encode password with legacy algorithm if a string is given
    if isinstance(data, str): 
        data = process_pwd(data)
        
    if data is None: 
        print('WRONG PASSWORD STRING!')
        return None
    
    # prepend salt if provided
    if not salt is None:
        if isinstance(salt, str): 
            salt = process_pwd(salt)
            if salt is None:
                print('WRONG SALT STRING!')
                return None
        ghash = salt + data
    else:
        ghash = data
    
    # initial hash (salted)
    ghash = hashit(ghash, alg).digest()
    
    # hash iteratively for 'iters' rounds
    for i in range(iters):
        try:
            # next hash = hash(previous data + 4-byte integer (previous round number) with LE byte ordering)
            # ECMA-376-1:2016 17.15.1.29 (p. 1020)
            ghash = hashit(ghash + struct.pack('<I', i), alg).digest()
        except Exception as err:
            print(err)
            return None

    # BASE64 encode if requested
    if base64result:
        ghash = base64.b64encode(ghash)
        
    # return as an ASCII string if requested
    if returnstring:
        ghash = ghash.decode()
        
    return ghash

I have rechecked this code many times and can no longer see any errors. But I still cannot reproduce the target hash from the test Word document:

myhash = gethash('johnjohn', base64.b64decode('pH1TDVHSfGBxkd3Q88UNhQ=='))
print(myhash)
print(TARGET_HASH == myhash)

I get:

wut2VOpT+X8pKXky6u/+YtwRX2inDv1WVC8FtZcdxKsyX0gHNBJGYwBgV8xzq7Rke/hWMfWe9JVvqDQAZ11A5w==

False

【Comments】:

Tags: python hash openxml sha hashlib

【Solution 1】:

I had to look at this today as well and managed to reverse-engineer it.

In simple terms, the steps are:

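1. Truncate the password to 15 characters (it is unclear whether this is ASCII or UTF8 encoded - some documentation refers to a "Unicode password", but all the examples seem to be ASCII based). My implementation simply takes the truncated bytes after UTF8 conversion (which preserves the ASCII set).
2. Get the high-order word from a magic list based on the length of the password. If the password has length 0, it is just two zero bytes.
3. For each byte in the password, get the bits from the encryption matrix based on its position (note that the last character always corresponds to the last row of the matrix if the password is shorter than 15 bytes). For bits 1 to 7 (the seven low-order bits), if the bit is set, XOR the matching matrix entry with the current value of the high-order word. Repeat for each character.
4. Take a low-order word (2 bytes) initialized to zero. Perform the following operation for each character, starting with the last character of the password and working towards the first: low-order word = (((low-order word >> 14) & 0x0001) | ((low-order word << 1) & 0x7FFF)) ^ character (byte) (<< and >> are the bitwise shift-left and shift-right operators, respectively. |, &, ^ are bitwise OR, AND and XOR, respectively.)
5. Then do low-order word = (((low-order word >> 14) & 0x0001) | ((low-order word << 1) & 0x7FFF)) ^ password length ^ 0xCE4B.
6. Form the key by appending the low-order word to the high-order word. Then reverse the byte order.
7. For some reason, Microsoft Word then takes the Unicode hex representation of the above key and converts that representation to bytes (see the link in the code comments).
8. Now compute the hash once, with the salt bytes prepended to the result of the above. If there are no salt bytes, hash the key bytes on their own.
9. If iterations are to be computed, then for each iteration convert the iteration count (0-based) to a 32-bit (4-byte) integer (little endian) and append it to the currently computed hash (the documentation is not clear on this, it only says to "add" the bytes, but to match the output I had to append them). Apply the requested hashing algorithm (Word seems to default to SHA512, but from testing I found it works fine with the other options too).
10. Return the above as a base-64 encoded string. This is what goes into the documentProtection attribute.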
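
The steps above can be sketched in Python roughly as follows. This is an unofficial port of the C# code below, not a reference implementation: the names `legacy_key` and `word_protection_hash` are made up for illustration, the tables are transcribed from the C# arrays as 16-bit integers, and the password is assumed to be ASCII.

```python
import base64
import hashlib
import struct

# Tables transcribed from the C# implementation below, as 16-bit integers.
INITIAL_CODES = [
    0xE1F0, 0x1D0F, 0xCC9C, 0x84C0, 0x110C, 0x0E10, 0xF1CE, 0x313E,
    0x1872, 0xE139, 0xD40F, 0x84F9, 0x280C, 0xA96A, 0x4EC3,
]
ENCRYPTION_MATRIX = [
    [0xAEFC, 0x4DD9, 0x9BB2, 0x2745, 0x4E8A, 0x9D14, 0x2A09],
    [0x7B61, 0xF6C2, 0xFDA5, 0xEB6B, 0xC6F7, 0x9DCF, 0x2BBF],
    [0x4563, 0x8AC6, 0x05AD, 0x0B5A, 0x16B4, 0x2D68, 0x5AD0],
    [0x0375, 0x06EA, 0x0DD4, 0x1BA8, 0x3750, 0x6EA0, 0xDD40],
    [0xD849, 0xA0B3, 0x5147, 0xA28E, 0x553D, 0xAA7A, 0x44D5],
    [0x6F45, 0xDE8A, 0xAD35, 0x4A4B, 0x9496, 0x390D, 0x721A],
    [0xEB23, 0xC667, 0x9CEF, 0x29FF, 0x53FE, 0xA7FC, 0x5FD9],
    [0x47D3, 0x8FA6, 0x0F6D, 0x1EDA, 0x3DB4, 0x7B68, 0xF6D0],
    [0xB861, 0x60E3, 0xC1C6, 0x93AD, 0x377B, 0x6EF6, 0xDDEC],
    [0x45A0, 0x8B40, 0x06A1, 0x0D42, 0x1A84, 0x3508, 0x6A10],
    [0xAA51, 0x4483, 0x8906, 0x022D, 0x045A, 0x08B4, 0x1168],
    [0x76B4, 0xED68, 0xCAF1, 0x85C3, 0x1BA7, 0x374E, 0x6E9C],
    [0x3730, 0x6E60, 0xDCC0, 0xA9A1, 0x4363, 0x86C6, 0x1DAD],
    [0x3331, 0x6662, 0xCCC4, 0x89A9, 0x0373, 0x06E6, 0x0DCC],
    [0x1021, 0x2042, 0x4084, 0x8108, 0x1231, 0x2462, 0x48C4],
]


def legacy_key(password: str) -> bytes:
    """Legacy 4-byte key, returned with the byte order already reversed."""
    pw = password.encode("utf-8")[:15]
    high = INITIAL_CODES[len(pw) - 1] if pw else 0
    for i, byte in enumerate(pw):
        # The last password byte always maps to the last matrix row
        row = ENCRYPTION_MATRIX[15 - len(pw) + i]
        for bit in range(7):
            if byte & (1 << bit):
                high ^= row[bit]

    def rotate(word):
        # 15-bit rotate-left as given in the low-order word formula
        return ((word >> 14) & 0x0001) | ((word << 1) & 0x7FFF)

    low = 0
    for byte in reversed(pw):
        low = rotate(low) ^ byte
    low = rotate(low) ^ len(pw) ^ 0xCE4B
    # high||low with all bytes reversed equals low, then high, little-endian
    return struct.pack("<HH", low, high)


def word_protection_hash(password, salt=None, spin=100000, alg="sha512"):
    # Word-specific stage: uppercase hex text of the key, as UTF-16LE bytes
    key = legacy_key(password).hex().upper().encode("utf-16-le")
    # Initial hash over salt + key; it does not count as an iteration
    digest = hashlib.new(alg, (salt or b"") + key).digest()
    for i in range(spin):
        # Append the 0-based counter (4 bytes, little endian), then re-hash
        digest = hashlib.new(alg, digest + struct.pack("<I", i)).digest()
    return base64.b64encode(digest).decode("ascii")


print(legacy_key("Example").hex().upper())  # 7EEDCE64, the ECMA example value
salt = base64.b64decode("pH1TDVHSfGBxkd3Q88UNhQ==")
print(word_protection_hash("johnjohn", salt))
```

The last line is meant to be compared against the w:hash value from the question, which is the same check the C# test further down performs.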

Here is my implementation in C# (NuGet):

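using System;
using System.Collections;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Note: ToBitArray(), BitSequence / ToBitSequence() and HashAlgorithmName.Create() are helper
// extensions that are not shown here; presumably they ship with the NuGet package mentioned above.

/// <summary>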
/// Class that generates hashes suitable for use with OpenXML Wordprocessing ML documents with the documentProtection element.
/// </summary>
public class WordprocessingMLDocumentProtectionHashGenerator
{
    private static readonly byte[][] HighOrderWords = new byte[][]
    {
        new byte[] { 0xE1, 0xF0 },
        new byte[] { 0x1D, 0x0F },
        new byte[] { 0xCC, 0x9C },
        new byte[] { 0x84, 0xC0 },
        new byte[] { 0x11, 0x0C },
        new byte[] { 0x0E, 0x10 },
        new byte[] { 0xF1, 0xCE },
        new byte[] { 0x31, 0x3E },
        new byte[] { 0x18, 0x72 },
        new byte[] { 0xE1, 0x39 },
        new byte[] { 0xD4, 0x0F },
        new byte[] { 0x84, 0xF9 },
        new byte[] { 0x28, 0x0C },
        new byte[] { 0xA9, 0x6A },
        new byte[] { 0x4E, 0xC3 }
    };

    private static readonly byte[,,] EncryptionMatrix = new byte[,,]
    {
        { { 0xAE, 0xFC }, { 0x4D, 0xD9 }, { 0x9B, 0xB2 }, { 0x27, 0x45 }, { 0x4E, 0x8A }, { 0x9D, 0x14 }, { 0x2A, 0x09 } },
        { { 0x7B, 0x61 }, { 0xF6, 0xC2 }, { 0xFD, 0xA5 }, { 0xEB, 0x6B }, { 0xC6, 0xF7 }, { 0x9D, 0xCF }, { 0x2B, 0xBF } },
        { { 0x45, 0x63 }, { 0x8A, 0xC6 }, { 0x05, 0xAD }, { 0x0B, 0x5A }, { 0x16, 0xB4 }, { 0x2D, 0x68 }, { 0x5A, 0xD0 } },
        { { 0x03, 0x75 }, { 0x06, 0xEA }, { 0x0D, 0xD4 }, { 0x1B, 0xA8 }, { 0x37, 0x50 }, { 0x6E, 0xA0 }, { 0xDD, 0x40 } },
        { { 0xD8, 0x49 }, { 0xA0, 0xB3 }, { 0x51, 0x47 }, { 0xA2, 0x8E }, { 0x55, 0x3D }, { 0xAA, 0x7A }, { 0x44, 0xD5 } },
        { { 0x6F, 0x45 }, { 0xDE, 0x8A }, { 0xAD, 0x35 }, { 0x4A, 0x4B }, { 0x94, 0x96 }, { 0x39, 0x0D }, { 0x72, 0x1A } },
        { { 0xEB, 0x23 }, { 0xC6, 0x67 }, { 0x9C, 0xEF }, { 0x29, 0xFF }, { 0x53, 0xFE }, { 0xA7, 0xFC }, { 0x5F, 0xD9 } },
        { { 0x47, 0xD3 }, { 0x8F, 0xA6 }, { 0x0F, 0x6D }, { 0x1E, 0xDA }, { 0x3D, 0xB4 }, { 0x7B, 0x68 }, { 0xF6, 0xD0 } },
        { { 0xB8, 0x61 }, { 0x60, 0xE3 }, { 0xC1, 0xC6 }, { 0x93, 0xAD }, { 0x37, 0x7B }, { 0x6E, 0xF6 }, { 0xDD, 0xEC } },
        { { 0x45, 0xA0 }, { 0x8B, 0x40 }, { 0x06, 0xA1 }, { 0x0D, 0x42 }, { 0x1A, 0x84 }, { 0x35, 0x08 }, { 0x6A, 0x10 } },
        { { 0xAA, 0x51 }, { 0x44, 0x83 }, { 0x89, 0x06 }, { 0x02, 0x2D }, { 0x04, 0x5A }, { 0x08, 0xB4 }, { 0x11, 0x68 } },
        { { 0x76, 0xB4 }, { 0xED, 0x68 }, { 0xCA, 0xF1 }, { 0x85, 0xC3 }, { 0x1B, 0xA7 }, { 0x37, 0x4E }, { 0x6E, 0x9C } },
        { { 0x37, 0x30 }, { 0x6E, 0x60 }, { 0xDC, 0xC0 }, { 0xA9, 0xA1 }, { 0x43, 0x63 }, { 0x86, 0xC6 }, { 0x1D, 0xAD } },
        { { 0x33, 0x31 }, { 0x66, 0x62 }, { 0xCC, 0xC4 }, { 0x89, 0xA9 }, { 0x03, 0x73 }, { 0x06, 0xE6 }, { 0x0D, 0xCC } },
        { { 0x10, 0x21 }, { 0x20, 0x42 }, { 0x40, 0x84 }, { 0x81, 0x08 }, { 0x12, 0x31 }, { 0x24, 0x62 }, { 0x48, 0xC4 } }
    };

    /// <summary>
    /// Generates a base-64 string according to the Wordprocessing ML Document DocumentProtection security algorithm.
    /// </summary>
    /// <param name="password"></param>
    /// <param name="salt"></param>
    /// <param name="iterations"></param>
    /// <param name="hashAlgorithmName"></param>
    /// <returns></returns>
    public string GenerateHash(string password, byte[] salt, int iterations, HashAlgorithmName hashAlgorithmName)
    {
        if (password == null)
        {
            throw new ArgumentNullException(nameof(password));
        }

        // Algorithm given in ECMA-376, 1st Edition, December 2006
        // https://www.ecma-international.org/wp-content/uploads/ecma-376_first_edition_december_2006.zip
        // Alternatively: https://c-rex.net/projects/samples/ooxml/e1/Part4/OOXML_P4_DOCX_documentProtection_topic_ID0EJVTX.html
        byte[] passwordBytes = Encoding.UTF8.GetBytes(password);
        passwordBytes = passwordBytes.Take(15).ToArray();
        int passwordLength = passwordBytes.Length;

        // If the password length is 0, the key is 0.
        byte[] highOrderWord = new byte[] { 0x00, 0x00 };
        if (passwordLength > 0)
        {
            highOrderWord = HighOrderWords[passwordLength - 1].ToArray();
        }
        for (int i = 0; i < passwordLength; i++)
        {
            byte passwordByte = passwordBytes[i];
            int encryptionMatrixIndex = i + (EncryptionMatrix.GetLength(0) - passwordLength);

            BitArray bitArray = passwordByte.ToBitArray();

            for (int j = 0; j < EncryptionMatrix.GetLength(1); j++)
            {
                bool isSet = bitArray[j];

                if (isSet)
                {
                    for (int k = 0; k < EncryptionMatrix.GetLength(2); k++)
                    {
                        highOrderWord[k] = (byte)(highOrderWord[k] ^ EncryptionMatrix[encryptionMatrixIndex, j, k]);
                    }
                }
            }
        }

        byte[] lowOrderWord = new byte[] { 0x00, 0x00 };
        BitSequence lowOrderBitSequence = lowOrderWord.ToBitSequence();
        BitSequence bitSequence1 = new byte[] { 0x00, 0x01 }.ToBitSequence();
        BitSequence bitSequence7FFF = new byte[] { 0x7F, 0xFF }.ToBitSequence();

        for (int i = passwordLength - 1; i >= 0; i--)
        {
            byte passwordByte = passwordBytes[i];
            lowOrderBitSequence = (((lowOrderBitSequence >> 14) & bitSequence1) | ((lowOrderBitSequence << 1) & bitSequence7FFF)) ^ new byte[] { 0x00, passwordByte }.ToBitSequence();
        }

        lowOrderBitSequence = (((lowOrderBitSequence >> 14) & bitSequence1) | ((lowOrderBitSequence << 1) & bitSequence7FFF)) ^ new byte[] { 0x00, (byte)passwordLength }.ToBitSequence() ^ new byte[] { 0xCE, 0x4B }.ToBitSequence();
        lowOrderWord = lowOrderBitSequence.ToByteArray();

        byte[] key = highOrderWord.Concat(lowOrderWord).ToArray();
        key = key.Reverse().ToArray();

        // https://docs.microsoft.com/en-us/openspecs/office_standards/ms-oe376/fb220a2f-88d4-488c-a9b7-e094756b6699
        // In Word, an additional third stage is added to the process of hashing and storing a user supplied password.  In this third stage, the reversed byte order legacy hash from the second stage shall be converted to Unicode hex string representation [Example: If the single byte string 7EEDCE64 is converted to Unicode hex string it will be represented in memory as the following byte stream: 37 00 45 00 45 00 44 00 43 00 45 00 36 00 34 00. end example], and that value shall be hashed as defined by the attribute values.
        key = Encoding.Unicode.GetBytes(BitConverter.ToString(key).Replace("-", string.Empty));

        HashAlgorithm hashAlgorithm = hashAlgorithmName.Create();

        byte[] computedHash = key;

        if (salt != null)
        {
            computedHash = salt.Concat(key).ToArray();
        }

        // Word requires that the initial hash of the password with the salt not be considered in the count.
        computedHash = hashAlgorithm.ComputeHash(computedHash);

        for (int i = 0; i < iterations; i++)
        {
            // ISO/IEC 29500-1 Fourth Edition, 2016-11-01
            // 17.15.1.29 - spinCount
            // Specifies the number of times the hashing function shall be iteratively run (runs using each iteration's result plus a 4 byte value (0-based, little endian) containing the number of the iteration as the input for the next iteration) when attempting to compare a user-supplied password with the value stored in the hashValue attribute.
            byte[] iterationBytes = BitConverter.GetBytes(i);
            computedHash = computedHash.Concat(iterationBytes).ToArray();
            computedHash = hashAlgorithm.ComputeHash(computedHash);
        }

        return Convert.ToBase64String(computedHash);
    }
}

I tested it against your example hash and checked that it passes:

[TestClass]
[TestCategory("WordprocessingMLDocumentProtectionHashGenerator")]
public class WordprocessingMLDocumentProtectionHashGeneratorTests
{
    [TestMethod]
    public void GeneratesKnownHashes()
    {
        WordprocessingMLDocumentProtectionHashGenerator wordprocessingMLDocumentProtectionHashGenerator = new WordprocessingMLDocumentProtectionHashGenerator();

        Assert.AreEqual("sstT7oPzpUQTchSUE6WbidCrZv1c8k+/5D1Pm+weZt7QoaeSnBEg/cZFg2W+1eohg1mgXGXLci1CWbnbHDYsXQ==", wordprocessingMLDocumentProtectionHashGenerator.GenerateHash("Example", Convert.FromBase64String("KPr2WqWFihenPDtAmpqUtw=="), 100000, HashAlgorithmName.SHA512));
        Assert.AreEqual("uBuZhlyVTOQtRwQuOGjY7GU3FnJbe1VFKvN+j9u27HSbthOY+n1/daU/WCkqV40fG6HxX+pxgR+Ow4ZvAE7aZg==", wordprocessingMLDocumentProtectionHashGenerator.GenerateHash("Password", Convert.FromBase64String("On9D022mrdqvHTb6eEkFGA=="), 100000, HashAlgorithmName.SHA512));
        Assert.AreEqual("mkGbBri0a1icL1nJKTQL7PyLUY2Uei2wyMHC0Y6s1+DOMYvPWdB6cy0Npao15O0+yqtyZW4hAP0+dcdyrEk7qg==", wordprocessingMLDocumentProtectionHashGenerator.GenerateHash("Password", Convert.FromBase64String("On9D022mrdqvHTb6eEkFGA=="), 0, HashAlgorithmName.SHA512));
        Assert.AreEqual("qdPI8cSBM/21Mr29mfFrR6l7hIn8oLKKT1nTDXHsAQA=", wordprocessingMLDocumentProtectionHashGenerator.GenerateHash("Testerman", Convert.FromBase64String("On9D022mrdqvHTb6eEkFGA=="), 100000, HashAlgorithmName.SHA256));
        Assert.AreEqual("d5FZvHnQhm6Mzqy6cYE7ZbniYXA/8qJxkAze0sFcNirWYhaLpScmSsfBHptuEmuBreLuNjyV5IjdUoOFWM9mbQ==", wordprocessingMLDocumentProtectionHashGenerator.GenerateHash("Password", null, 100000, HashAlgorithmName.SHA512));
        Assert.AreEqual("pVjR9ktO9vlxijXcMPlH+4PLwD4Xwy1aqbNQOFmWaSpvBjipNh//T8S3nBhq6HRoRVfWL6s/+NdUCPTxUr0vZw==", wordprocessingMLDocumentProtectionHashGenerator.GenerateHash("johnjohn", Convert.FromBase64String("pH1TDVHSfGBxkd3Q88UNhQ=="), 100000, HashAlgorithmName.SHA512));
    }
}

【Discussion】:

Many thanks! This looks like the business. Now I need to test this in Python :)