Decoding strings including UTF-8 literals like '\xc3\xa6' in Swift? Answers

【Question title】: Decoding strings including utf8-literals like '\xc3\xa6' in Swift?
【Posted】: 2022-01-01 04:59:21
【Question】:

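This is a follow-up to my previous thread about UTF-8 literals.

It was established there that you can decode UTF-8 literals from a string that contains nothing but UTF-8 literals: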

import Foundation

let s = "\\xc3\\xa6"
let bytes = s
    .components(separatedBy: "\\x")
    // components(separatedBy:) would produce an empty string as the first element
    // because the string starts with "\x". We drop this
    .dropFirst() 
    .compactMap { UInt8($0, radix: 16) }
if let decoded = String(bytes: bytes, encoding: .utf8) {
    print(decoded)
} else {
    print("The UTF8 sequence was invalid!")
}

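However, this only works when the string contains nothing but UTF-8 literals. I am fetching a list of Wi-Fi networks whose names contain these UTF-8 literals mixed with ordinary text, so how do I decode the entire string?

Example: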

let s = "This is a WiFi Name \\xc3\\xa6 including UTF-8 literals \\xc3\\xb8"

Expected result:

print(s)
> This is a WiFi Name æ including UTF-8 literals ø

In Python there is a simple solution for this:

contents = source_file.read()
uni = contents.decode('unicode-escape')
enc = uni.encode('latin1')
dec = enc.decode('utf-8')

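Is there a similar way to decode these strings in Swift 5?

【Comments】:

How are the hex literals delimited from the text? Will they always be followed by a space?
@flanker Unfortunately they sit right in the text, with no space or anything else between or after them. So a common string such as "Netværk 5GHz" would be "Netv\\xc3\\xa6rk 5GHz".
Loop: use a regex to find the range of the next literal, extract that range and decode it, then replace the range with the decoded version? If nobody comes up with a workable solution I'll hack something together later.
@flanker, thanks. I was hoping for a simpler Swift option, like the Python example, because I'm not very good with regex. But if nobody has an answer, I would definitely appreciate your effort!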

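Tags: swift string utf-8 decode swift5

【Solution 1】:

As far as I know, there is no native Swift solution. To make it look as compact at the call site as the Python version, you can build an extension on String that hides the complexity: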

import Foundation

extension String {
   func replacingUtf8Literals() -> Self {

      // One or more consecutive "\xHH" escapes, where HH is two hex digits
      let regex = #"(\\x[a-fA-F0-9]{2})+"#
      
      var str = self
      
      // Replace one run of escapes per iteration until none are left
      while let range = str.range(of: regex, options: .regularExpression) {
         // Turn "\xc3\xa6" into the bytes [0xc3, 0xa6]
         let literalbytes = str[range]
            .components(separatedBy: "\\x")
            .dropFirst()
            .compactMap{UInt8($0, radix: 16)}
         guard let actuals = String(bytes: literalbytes, encoding: .utf8) else {
            fatalError("Regex error")
         }
         str.replaceSubrange(range, with: actuals)
      }
      return str
   }
}

This lets you call

print(s.replacingUtf8Literals())

//prints: This is a WiFi Name æ including UTF-8 literals ø

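For convenience I use fatalError to catch a failed conversion. You may want to handle this in a better way in production code (although it should only happen if the regex is wrong or the escaped bytes are not valid UTF-8, for example a lone \xff). There needs to be some form of break or thrown error here, otherwise you have an infinite loop.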
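One way to avoid both the crash and the infinite loop is to walk the string once and copy undecodable runs through unchanged. The following is only a sketch under that assumption: the method name replacingUtf8LiteralsLeniently is hypothetical and not part of the answer above, and keeping invalid escapes verbatim is a design choice rather than the only reasonable behavior.

```swift
import Foundation

extension String {
    /// Hypothetical variant of replacingUtf8Literals(): decodes runs of "\xHH"
    /// escapes and leaves any run that is not valid UTF-8 exactly as it was.
    func replacingUtf8LiteralsLeniently() -> String {
        let pattern = #"(?:\\x[0-9a-fA-F]{2})+"#
        var result = ""
        var searchStart = startIndex

        while let range = self.range(of: pattern,
                                     options: .regularExpression,
                                     range: searchStart..<endIndex) {
            // Copy the untouched text in front of the match.
            result.append(contentsOf: self[searchStart..<range.lowerBound])

            // Turn "\xc3\xa6" into the bytes [0xc3, 0xa6].
            let bytes = self[range]
                .components(separatedBy: "\\x")
                .dropFirst()
                .compactMap { UInt8($0, radix: 16) }

            // Fall back to the original text if the bytes are not valid UTF-8.
            result.append(String(bytes: bytes, encoding: .utf8) ?? String(self[range]))

            // Resume after the match, so every iteration makes progress.
            searchStart = range.upperBound
        }

        // Copy whatever follows the last match.
        result.append(contentsOf: self[searchStart...])
        return result
    }
}
```

Because the search always resumes after the previous match, the loop terminates even when a run cannot be decoded, and already-decoded text is never scanned again.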

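【Comments】:

If you remove two of the backslashes from both your regex and "separatedBy", so that it's let regex = #"(?:\\x[a-zAZ0-9]{2})+"# and separatedBy: "\\x", this works great. Thanks!
Yes, that was careless of me. While I was playing with the regex I escaped an already-escaped string and ended up with double escaping... and then copied that over! I've edited the answer.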

【Solution 2】:

First add the decoding code to a String extension as a computed property (or create a function):

extension String {
    var decodeUTF8: String {
        let bytes = self.components(separatedBy: "\\x")
            .dropFirst()
            .compactMap { UInt8($0, radix: 16) }
        return String(bytes: bytes, encoding: .utf8) ?? self
    }
}

Then use a regular expression and a while loop to replace every match:

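// `string` must be a mutable String (var) holding the text to decode.
// Caveat: {2} matches exactly two escapes, so only two-byte UTF-8 sequences are decoded.
// If a match is not valid UTF-8, decodeUTF8 returns it unchanged and this loop never ends.
while let range = string.range(of: #"(\\x[a-f0-9]{2}){2}"#, options: [.regularExpression, .caseInsensitive]) {
    string.replaceSubrange(range, with: String(string[range]).decodeUTF8)
}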
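For anyone who would rather avoid regular expressions, the same result can probably be reached by scanning the UTF-8 bytes directly. This is roughly the idea behind the Python snippet in the question: turn every escape into a raw byte, then decode the whole buffer as UTF-8 in one go. The sketch below is not from either answer; hexValue and decodingHexEscapes are hypothetical names, and returning nil for an invalid byte sequence is an assumption about what the caller wants.

```swift
import Foundation

/// Hypothetical helper: numeric value of an ASCII hex digit, or nil.
func hexValue(_ byte: UInt8) -> UInt8? {
    switch byte {
    case UInt8(ascii: "0")...UInt8(ascii: "9"): return byte - UInt8(ascii: "0")
    case UInt8(ascii: "a")...UInt8(ascii: "f"): return byte - UInt8(ascii: "a") + 10
    case UInt8(ascii: "A")...UInt8(ascii: "F"): return byte - UInt8(ascii: "A") + 10
    default: return nil
    }
}

/// Hypothetical helper: replaces every "\xHH" escape with the byte 0xHH and
/// decodes the result as UTF-8. Returns nil if the result is not valid UTF-8.
func decodingHexEscapes(in text: String) -> String? {
    let input = Array(text.utf8)
    var output: [UInt8] = []
    output.reserveCapacity(input.count)

    var i = 0
    while i < input.count {
        // An escape needs four bytes: backslash, "x", and two hex digits.
        if i + 3 < input.count,
           input[i] == UInt8(ascii: "\\"),
           input[i + 1] == UInt8(ascii: "x"),
           let high = hexValue(input[i + 2]),
           let low = hexValue(input[i + 3]) {
            output.append((high << 4) | low)
            i += 4
        } else {
            // Ordinary byte (including bytes of existing non-ASCII text).
            output.append(input[i])
            i += 1
        }
    }
    return String(bytes: output, encoding: .utf8)
}
```

This handles sequences of any length (two, three, or four bytes) in a single pass and cannot loop forever. If a lossy result is preferable to nil, String(decoding: output, as: UTF8.self) could be used for the last step instead; it substitutes U+FFFD for invalid bytes.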

【Comments】: