Bencode parser using a stack: answers

【Question title】: Bencode parser using stack
【Posted】: 2020-08-20 13:48:34
【Question】:

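I am trying to parse bencoded strings with a stack-based approach. This link describes the encoding: https://www.bittorrent.org/beps/bep_0003.html

My pseudocode fails when lists are nested: [1, [2]] and [[1, 2]] both come back as [[1, 2]], even though the encodings are clearly different: "li1eli2eee" versus "lli1ei2eee".

Here is my pseudocode so far: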

input: string
output: map/list/integer/string in a bencoded data structure
first, tokenize the string into valid tokens
Valid tokens "d, l, [words], [numbers], e, s (virtual token)"
Strings are tokenized as 4:spam becomes "s spam e" with s being a virtual token
Eg. li1el4:spamee becomes [l i 1 e i 22 e l s spam e i 2 e e e]
Parsing:
make two stacks:
stack1
stack2
for token in tokens:

    if stack is empty
        return error

    if the token isn’t an “e”
        push token onto stack1

    while the stack isn’t empty:
        elem = pop off the stack
        if elem is “i”
            elem2 = pop elem off stack2 and check if it can be converted to an int
            if not
                return error
            push elem2 onto stack2 again
        elif elem is “d”
            make a new dict
            while stack2 isn’t empty:
                key = pop off stack2
                if stack2 is empty:
                    return error (because then we have an odd key value encoding)
                value = pop off stack2
                dict[key] = value
            push dict onto stack2
        elif elem is “l”
            make a new list
            while stack2 isn’t empty:
                append pop off stack2 to l
            push l onto stack2
        elif elem is “s”
            dont need to do anything :P
        else
            push elem onto stack2

if stack2 isn’t empty:
    ret = pop the lone element off stack2
if stack2 isn’t empty:
    return error

return ret

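【Comments】:

The spec for "Bencoding" isn't very clear to me, even after looking at the link and a few other places such as the npm package. This code doesn't seem to account for the possibility that dictionaries can be nested. Is that not possible? As for lists and dictionaries, the "while stack2 isn't empty" loops always seem wrong. If you have an implementation, sharing that would make more sense than pseudocode. Thanks for clarifying.
Dictionaries can be nested, and so can lists. The idea behind "while stack2 isn't empty" is to gather everything into the list or dictionary. Essentially, stack2 serves as temporary storage for the list or dictionary being built.
OK, thanks. Does my answer meet your needs? If not, feel free to tell me what I missed and I'll update.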

Tags: algorithm parsing recursion stack tokenize

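【Solution 1】:

I don't quite follow the spec or the pseudocode, but implementing a subset of "Bencoding" that handles the two strings you've shown (lists and integers) seems straightforward. As far as I can tell, everything else is relatively trivial: dictionaries are more or less the same as lists, and strings and other non-recursively defined data types are basically the same as integers.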

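My algorithm is as follows:

Make a stack and put an empty array into it.
For each index in the bencoded string:
- If the current character is i, parse the integer and fast-forward the index to the e that closes the integer. Append the integer to the array at the top of the stack.
- If the current character is l, push a new array onto the stack.
- If the current character is e, pop the stack and push the popped array onto the array below it (i.e. the new top).
Return the only element left in the stack.

Here it is in JS: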

const tinyBencodeDecode = s => {
  const stack = [[]];

  for (let i = 0; i < s.length; i++) {
    if (s[i] === "i") {
      // Skip the "i", then scan forward to the "e" that closes the integer.
      let j = ++i;
      while (s[j] !== "e") j++;

      stack[stack.length-1].push(+s.slice(i, j));
      i = j;
    }
    else if (s[i] === "l") {
      stack.push([]);
    }
    else if (s[i] === "e") {
      stack[stack.length-2].push(stack.pop());
    }
  }

  return stack[0];
};

[
  "i1ei2e",     // => [1, 2]
  "lli1ei2eee", // => [[[1, 2]]]
  "li1eli2eee", // => [[1, [2]]]

  // [44, [1, [23, 561, [], 1, [78]]], 4]
  "i44eli1eli23ei561elei1eli78eeeei4e",
].forEach(e => console.log(JSON.stringify(tinyBencodeDecode(e))));

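No error handling is performed and the input is assumed to be well-formed, but error handling doesn't affect the basic algorithm; it's just a matter of adding a bunch of conditionals that check the index, stack and string as you go.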
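As a rough illustration of those checks, here is a minimal sketch of the same list-and-integer decoder with validation added. The name tinyBencodeDecodeStrict and the choice of SyntaxError are assumptions made for this example, not part of the original answer; the integer rule (no leading zeros, no "-0") follows BEP 3.

```javascript
// Sketch only: the list/integer subset with basic validation added.
// tinyBencodeDecodeStrict is a hypothetical name used for illustration.
const tinyBencodeDecodeStrict = s => {
  const stack = [[]];

  for (let i = 0; i < s.length; i++) {
    if (s[i] === "i") {
      const end = s.indexOf("e", i + 1);

      if (end === -1) {
        throw new SyntaxError(`unterminated integer at index ${i}`);
      }

      const digits = s.slice(i + 1, end);

      // BEP 3 forbids leading zeros (other than "0" itself) and "-0".
      if (!/^(0|-?[1-9]\d*)$/.test(digits)) {
        throw new SyntaxError(`invalid integer "${digits}" at index ${i}`);
      }

      stack[stack.length - 1].push(Number(digits));
      i = end;
    }
    else if (s[i] === "l") {
      stack.push([]);
    }
    else if (s[i] === "e") {
      // Only the root array is left, so there is no open list to close.
      if (stack.length < 2) {
        throw new SyntaxError(`unexpected "e" at index ${i}`);
      }

      const closed = stack.pop();
      stack[stack.length - 1].push(closed);
    }
    else {
      throw new SyntaxError(`unexpected character "${s[i]}" at index ${i}`);
    }
  }

  // Anything above the root array is a list that was never closed.
  if (stack.length !== 1) {
    throw new SyntaxError("unterminated list");
  }

  return stack[0];
};
```

With this version, tinyBencodeDecodeStrict("li1eli2eee") still returns [[1, [2]]], while inputs such as "li1e", "i03e" or a stray "e" throw instead of looping forever or silently producing a partial result.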

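Here is an (admittedly lazy) example of how to support all 4 data types. Again, error handling is omitted. The idea is basically the same as above, with a bit more fuss to work out whether we are building a dictionary or a list. Since null doesn't appear to be a valid value in the spec, I use it as a placeholder: a key whose value is null is still waiting for its value.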

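In both cases, a minor adjustment is needed if Bencoding allows only a single root element (a list or dictionary). In that case, s = "i42ei43e" would be invalid at the top level, and we would start with an empty stack.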

const back = (a, n=1) => a[a.length-n];

const append = (stack, data) => {
  if (Array.isArray(back(stack))) {
    back(stack).push(data);
  }
  else {
    const emptyKey = Object.entries(back(stack))
                           .find(([k, v]) => v === null);

    if (emptyKey) {
      back(stack)[emptyKey[0]] = data;
    }
    else {
      back(stack)[data] = null;
    }
  }
};

const bencodeDecode = s => {
  const stack = [[]];

  for (let i = 0; i < s.length; i++) {
    if (s[i] === "i") {
      // Skip the "i", then scan forward to the "e" that closes the integer.
      let j = ++i;
      while (s[j] !== "e") j++;

      append(stack, +s.slice(i, j));
      i = j;
    }
    else if (/\d/.test(s[i])) {
      // Scan forward to the ":" that separates the length prefix from the string.
      let j = i;
      while (s[j] !== ":") j++;
      
      const num = +s.slice(i, j++);
      append(stack, s.slice(j, j + num));
      i = j + num - 1;
    }
    else if (s[i] === "l") {
      stack.push([]);
    }
    else if (s[i] === "d") {
      stack.push({});
    }
    else if (s[i] === "e") {
      append(stack, stack.pop());
    }
  }

  return stack[0];
};

[
  "i1ei2e",     // => [1, 2]
  "lli1ei2eee", // => [[[1, 2]]]
  "li1eli2eee", // => [[1, [2]]]
  "li1e4:spamli2eee", // => [[1, "spam", [2]]]

  // [[1, "spam", {"cow": "moo", "spam": {"eggs": [6, "rice"]}}, [2]]]
  "li1e4:spamd3:cow3:moo4:spamd4:eggsli6e4:riceeeeli2eee",

  // [44, [1, [23, 561, [], 1, [78]]], 4]
  "i44eli1eli23ei561elei1eli78eeeei4e",
].forEach(e => console.log(JSON.stringify(bencodeDecode(e))));
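The single-root adjustment mentioned above could look roughly like the sketch below. It starts from an empty stack, keeps the one top-level value in a separate variable, and has each dictionary frame remember its pending key instead of using a null placeholder. The names decodeSingleRoot and emit, the frame layout, and the use of SyntaxError are all assumptions made for this example. Like the code above, it assumes ASCII input, so a string's length prefix matches its number of characters.

```javascript
// Sketch only: single-root variant. decodeSingleRoot and emit are
// hypothetical names; each stack frame is one open list or dictionary.
const decodeSingleRoot = s => {
  const stack = []; // open containers only, so it starts empty
  let root;
  let hasRoot = false;

  // Deliver a finished value to the innermost open container, or to the root.
  const emit = value => {
    const frame = stack[stack.length - 1];

    if (!frame) {
      if (hasRoot) throw new SyntaxError("more than one root element");
      root = value;
      hasRoot = true;
    }
    else if (Array.isArray(frame.value)) {
      frame.value.push(value);
    }
    else if (frame.key === undefined) {
      if (typeof value !== "string") {
        throw new SyntaxError("dictionary keys must be strings");
      }
      frame.key = value; // remember the key until its value arrives
    }
    else {
      frame.value[frame.key] = value;
      frame.key = undefined;
    }
  };

  for (let i = 0; i < s.length; i++) {
    if (s[i] === "i") {
      const end = s.indexOf("e", i);
      if (end === -1) throw new SyntaxError("unterminated integer");
      emit(Number(s.slice(i + 1, end)));
      i = end;
    }
    else if (/\d/.test(s[i])) {
      const colon = s.indexOf(":", i);
      if (colon === -1) throw new SyntaxError("missing ':' in string");
      const length = Number(s.slice(i, colon));
      emit(s.slice(colon + 1, colon + 1 + length));
      i = colon + length; // last character of the string
    }
    else if (s[i] === "l") {
      stack.push({ value: [] });
    }
    else if (s[i] === "d") {
      stack.push({ value: {}, key: undefined });
    }
    else if (s[i] === "e") {
      if (stack.length === 0) throw new SyntaxError('unexpected "e"');
      emit(stack.pop().value);
    }
  }

  if (stack.length > 0) throw new SyntaxError("unterminated container");
  return root;
};
```

Under these assumptions, decodeSingleRoot("li1eli2eee") returns [1, [2]] and decodeSingleRoot("lli1ei2eee") returns [[1, 2]], so the two encodings from the question stay distinct, while "i42ei43e" is rejected for having two root elements.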
