Given original string and encoded string, how to induce encoding?

【Question title】: Given original string and encoded string, how to induce encoding?
【Posted at】: 2016-02-17 16:59:36
【Question description】:

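Suppose I have an original string and an encoded string, like this:

"abcd" -> "0010111111001010". One possible solution is then that "a" maps to "0010", "b" to "1111", "c" to "1100", and "d" to "1010".

How can I write a program that, given these two strings, finds a possible encoding rule?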

My first attempt looks like this:

fun partition(orgl, encode) =
let
    val part = size(orgl)
    fun porpt(str, i, len) =
        if i = len - 1 then
            [substring(str, len * (len - 1), size(str) - (len - 1) * len)]
        else
            substring(str, len * i, len)::porpt(str, i + 1, len)
in
    porpt(encode, 0, part)
end;

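But obviously it cannot check whether two substrings match the same character, and there are many other possibilities besides splitting the encoded string into equal parts.

What would be a suitable algorithm for this problem?

P.S. Only prefix codes are allowed.

What I have learned so far has not really covered serious algorithms, but I did some searching on backtracking and wrote a second version of the code: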

fun partition(orgl, encode) =
let
    val part = size(orgl)
    fun backtrack(str, s, len, count, code) =
        let
           val current =
               if count = 1 then
                  code@[substring(str, s, size(str) - s)]
               else
                  code@[substring(str, s, len)]
        in
           if len > size(str) - s then []
           else
              if proper_prefix(0, orgl, code) then
                  if count = 1 then current
                  else
                     backtrack(str, s + len, len, count - 1, current)
              else
                 backtrack(str, s, len + 1, count, code)
        end
 in
    backtrack(encode, 0, 1, part, [])
 end;

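The function proper_prefix is meant to check both the prefix-code property and the uniqueness of the mapping. However, this function does not work correctly.

For example, when I enter: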

partition("abcd", "001111110101101");

the result is:

uncaught exception Subscript

FYI, the body of proper_prefix looks like this:

fun proper_prefix(i, orgl, nil) = true
  | proper_prefix(i, orgl, x::xs) =
    let
      fun check(j, str, nil) = true
        | check(j, str, x::xs) =
          if String.isPrefix str x then
             if str = x andalso substring(orgl, i, 1) = substring(orgl, i + j + 1, 1) then
                check(j + 1, str, xs)
             else
                false
          else
             check(j + 1, str, xs)
    in
      if check(0, x, xs) then proper_prefix(i + 1, orgl, xs)
      else false
    end;

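【Comments on the question】:

Do you consider this a substitution cipher? Are you assuming that all bit strings have the same length? Are you assuming that every character maps to a unique bit string? These assumptions greatly affect the problem, and therefore any algorithm that determines the translation.
Yes, it is a substitution cipher, and the code words may have unequal lengths; that is why I said my first attempt is bad, because it assumes every character's code word has the same length. And yes, the mapping should be injective.

Tags: algorithm character-encoding sml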

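【Solution 1】:

I would try a backtracking approach:

Start with an empty hypothesis (i.e. set all encodings to unknown). Then process the encoded string character by character.

For each new code character you have two options: append the code character to the encoding of the current source character, or move on to the next source character. If you encounter a source character that already has an encoding, check that it matches and continue; if it does not match, go back and try another option. You can also check the prefix property during this traversal.

Your example input could be processed as follows: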

Assume 'a' == '0'
Go to next source character
Assume 'b' == '0'
Violation of prefix property, go back
Assume 'a' == '00'
Go to next source character
Assume 'b' == '1'
...

This explores the entire range of possible encodings. You can return the first encoding found, or all possible encodings.
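The traversal described above could be sketched in Standard ML roughly as follows. This is only one possible reading of the answer, not its author's code: the names `conflicts`, `induce`, `go` and `try` are made up for illustration, and instead of appending one code character at a time it tries code words of increasing length for each new source character, which visits the same hypotheses. It returns the first prefix-free assignment it finds.

```sml
(* Two code words conflict if either is a prefix of the other.
   Equal strings are prefixes of each other, so this also
   enforces that the mapping is injective. *)
fun conflicts (u, v) = String.isPrefix u v orelse String.isPrefix v u

(* induce : string * string -> (char * string) list option
   Hypothetical helper: first prefix-free assignment, or NONE. *)
fun induce (orig, enc) =
  let
    val n = size orig
    val m = size enc
    fun lookup (c, tbl) =
      case List.find (fn (c', _) => c' = c) tbl of
        SOME (_, w) => SOME w
      | NONE => NONE
    (* i: index into orig, pos: index into enc, tbl: hypothesis so far *)
    fun go (i, pos, tbl) =
      if i = n then (if pos = m then SOME (rev tbl) else NONE)
      else
        let val c = String.sub (orig, i) in
          case lookup (c, tbl) of
            SOME w =>
              (* c already has a code word: it must occur at pos *)
              if pos + size w <= m
                 andalso String.substring (enc, pos, size w) = w
              then go (i + 1, pos + size w, tbl)
              else NONE
          | NONE => try (i, pos, 1, c, tbl)
        end
    (* try code words of increasing length len for the new character c *)
    and try (i, pos, len, c, tbl) =
      if pos + len > m then NONE
      else
        let val w = String.substring (enc, pos, len) in
          if List.exists (fn (_, v) => conflicts (w, v)) tbl
          then try (i, pos, len + 1, c, tbl)
          else
            case go (i + 1, pos + len, (c, w) :: tbl) of
              NONE => try (i, pos, len + 1, c, tbl)  (* backtrack *)
            | result => result
        end
  in
    go (0, 0, [])
  end
```

For example, `induce ("aa", "0101")` evaluates to `SOME [(#"a", "01")]`, and `induce ("ab", "00")` evaluates to `NONE` because no two distinct prefix-free code words can cover "00". Collecting all solutions instead of the first would mean returning a list from `go` and concatenating the results of every `try` branch.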


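【Solution 2】:

Naively iterating over all possible translations of abcd → 0010111111001010 can lead to a combinatorial explosion. Simple iteration also seems to produce many invalid translations that must be skipped: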

(a, b, c, d) → (0, 0, 1, 0111111001010) is invalid because a = b
(a, b, c, d) → (0, 0, 10, 111111001010) is invalid because a = b
(a, b, c, d) → (0, 01, 0, 111111001010) is invalid because a = c
(a, b, c, d) → (00, 1, 0, 111111001010) is one possibility
(a, b, c, d) → (0, 0, 101, 11111001010) is invalid because a = b
(a, b, c, d) → (0, 010, 1, 11111001010) is another possibility
(a, b, c, d) → (001, 0, 1, 11111001010) is another possibility
(a, b, c, d) → (0, 01, 01, 11111001010) is invalid because b = c
(a, b, c, d) → (00, 1, 01, 11111001010) is another possibility
(a, b, c, d) → (00, 10, 1, 11111001010) is another possibility
...

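If every character occurs only once in the original string, then this explosion of results is the answer. If the same character occurs more than once, that further constrains the solutions. For example, matching abca → 111011 could generate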

(a, b, c, a) → (1, 1, 1, 011) is invalid because a = b = c, a ≠ a
(a, b, c, a) → (1, 1, 10, 11) is invalid because a = b, a ≠ a
(a, b, c, a) → (1, 11, 0, 11) is invalid because a = b, a ≠ a
(a, b, c, a) → (11, 1, 0, 11) is one possibility
... (remaining combinations are checked the same way; (1, 11, 01, 1) also passes these equality checks)

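For a given hypothesis, you can choose the order in which to verify the constraints. Either

see whether any mappings overlap (I think this is what Nico calls the prefix property), or
see whether each character that occurs more than once is encoded by the same bit string at every one of its positions.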
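The first of these checks could be sketched as follows. It assumes that "overlap" means one code word being a prefix of another (which also covers two equal code words); `overlaps` and `prefixFree` are hypothetical names used only for illustration.

```sml
(* True if either bit string is a prefix of the other. *)
fun overlaps (u, v) = String.isPrefix u v orelse String.isPrefix v u

(* prefixFree : string list -> bool
   True if no code word in the list overlaps a later one. *)
fun prefixFree [] = true
  | prefixFree (w :: ws) =
      not (List.exists (fn v => overlaps (w, v)) ws) andalso prefixFree ws
```

With this definition, `prefixFree ["0010", "1111", "1100", "1010"]` (the mapping from the question) is `true`, while `prefixFree ["11", "1", "0"]` is `false`, because "1" is a prefix of "11". Whether a hypothesis such as (11, 1, 0, 11) counts as acceptable therefore depends on whether the prefix property is enforced in addition to injectivity.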

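An algorithm using this search strategy has to find an order in which to check the constraints so that hypotheses can be tested as quickly as possible. My intuition says that a constraint a → β is worth investigating early if the bit string β is long and occurs many times.

Another strategy is to rule out that a particular character can map to any bit string of, above, or below a certain length. For example, aaab → 1111110 rules out a mapping to any bit string longer than 2, and abcab → 1011101 rules out a mapping to any bit string whose length is not 2.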
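The upper bound in the first example follows from simple arithmetic: if a character occurs k times and every other character occurrence needs at least one bit, its code word can be at most (|enc| - (|orig| - k)) div k bits long. A minimal sketch, with hypothetical names `count` and `maxCodeLen`, and assuming the character actually occurs in the original string (otherwise the division raises `Div`):

```sml
(* Number of occurrences of c in s. *)
fun count (c, s) = length (List.filter (fn x => x = c) (explode s))

(* maxCodeLen : char * string * string -> int
   Upper bound on the code length of c, assuming every other
   character occurrence takes at least one bit.
   Assumes c occurs in orig at least once. *)
fun maxCodeLen (c, orig, enc) =
  let val k = count (c, orig)
  in (size enc - (size orig - k)) div k end
```

Here `maxCodeLen (#"a", "aaab", "1111110")` gives 2, matching the first example. This only covers the length arithmetic; tighter bounds, such as the exact length claimed for the second example, would need the content of the bit string as well.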

For the programming part, try to think of ways to represent a hypothesis. For example:

(* For the hypothesis (a, b, c, a) → (11, 1, 0, 11) *)

(* Order signifies first occurrence *)
val someHyp1 = ([(#"a", 2), (#"b", 1), (#"c", 1)], "abca", "111011")

(* Somehow recurse over hypothesis and accumulate offsets for each character, e.g. *)
val someHyp2 = ([(#"a", 2), (#"b", 1), (#"c", 1)],
                [(#"a", 0), (#"b", 2), (#"c", 3), (#"a", 4)])

Then create a function that generates new hypotheses in some order, and a function that decides whether a hypothesis is valid.

fun nextHypothesis (hyp, origStr, encStr) = ... (* should probably return SOME/NONE *)
fun validHypothesis (hyp, origStr, encStr) =
    allStr (fn (i, c) => (* is bit string for c at its
                            accumulated offset in encStr? *)) origStr

(* Helper function that checks whether a predicate is true for each
   character in a string. The predicate function takes both the index
   and the character as argument. *)
and allStr p s =
    let val len = size s
        fun loop i = i >= len orelse p (i, String.sub (s, i)) andalso loop (i+1)
    in loop 0 end
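One way the validity check in this skeleton could be filled in is shown below. It is a sketch under stated assumptions rather than the answer's intended implementation: a hypothesis is taken to be a list of (character, code length) pairs as in someHyp1, `lenOf` is a hypothetical helper, and only the repeated-occurrence constraint is checked (not the prefix property). It walks the original string while accumulating offsets, instead of using `allStr`.

```sml
(* Length assigned to c in the hypothesis, if any. *)
fun lenOf (hyp : (char * int) list, c) =
  case List.find (fn (c', _) => c' = c) hyp of
    SOME (_, n) => SOME n
  | NONE => NONE

(* validHypothesis : (char * int) list * string * string -> bool
   True if the lengths in hyp split encStr exactly and every repeated
   character gets the same bit string at each of its occurrences. *)
fun validHypothesis (hyp, origStr, encStr) =
  let
    (* seen: code words fixed by earlier occurrences; off: current offset *)
    fun loop (i, off, seen) =
      if i = size origStr then off = size encStr
      else
        let val c = String.sub (origStr, i) in
          case lenOf (hyp, c) of
            NONE => false
          | SOME n =>
              n >= 1 andalso off + n <= size encStr andalso
              let
                val w = String.substring (encStr, off, n)
              in
                case List.find (fn (c', _) => c' = c) seen of
                  SOME (_, w') => w = w' andalso loop (i + 1, off + n, seen)
                | NONE => loop (i + 1, off + n, (c, w) :: seen)
              end
        end
  in
    loop (0, 0, [])
  end
```

With the hypothesis from someHyp1, `validHypothesis ([(#"a", 2), (#"b", 1), (#"c", 1)], "abca", "111011")` is `true`, whereas giving every character length 1 yields `false` because the string is not consumed consistently.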

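An improvement on this skeleton would be to change the order in which hypotheses are explored, since some search paths rule out more invalid mappings than others.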
