【Question Title】: Using Swift, how do you re-encode then decode a String like this short script in Python?
【Posted】: 2018-09-18 22:55:35
【Question】:

XKCD's API has some strange encoding issues.

Minor encoding issue with xkcd alt texts in chat

The solution (in Python) is to encode the string as latin1 and then decode it as utf8, but how can this be done in Swift?

Test string:

"Be careful\u00e2\u0080\u0094it's breeding season"

Expected output:

Be careful—it's breeding season

Python (from the link above):

import json
a = '''"Be careful\u00e2\u0080\u0094it's breeding season"'''
print(json.loads(a).encode('latin1').decode('utf8'))

How is this done in Swift?

let strdata = "Be careful\\u00e2\\u0080\\u0094it's breeding season".data(using: .isoLatin1)!
let str = String(data: strdata, encoding: .utf8)

This doesn't work!

【Comments】:

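  • Sorry, I don't know Swift, so I don't know what to suggest. The Latin1 "trick" works because for 0 <= n < 256, Unicode code point n encodes as the byte with value n in Latin1. That is, b''.join([chr(i).encode('latin1') for i in range(256)]) == bytes(range(256)) is True.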
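  • What is your expected result for the Swift example?
  • @PM2Ring So would this work for this comic? xkcd.com/1814/info.0.json
  • @MartinR Updated to make the expected output and the correct string in Swift clearer
  • Sure. I get ♫ When the spacing is tight / And the difference is slight / That's a moiré ♫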
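
The Latin1 "trick" from the first comment can be checked in Swift on the three scalars from the test string. This is only a minimal sketch; the variable names are illustrative and not taken from the thread.

```swift
import Foundation

// The three scalars that the JSON escapes \u00e2\u0080\u0094 decode to.
let mojibake = "\u{00E2}\u{0080}\u{0094}"

// Latin-1 maps each code point n in 0...255 to the single byte with value n.
let bytes = Array(mojibake.data(using: .isoLatin1) ?? Data())

// Read as UTF-8, the bytes E2 80 94 are the encoding of U+2014 (the dash in the expected output).
let repaired = String(decoding: bytes, as: UTF8.self)
print(repaired == "\u{2014}")
```

The Latin-1 step recovers the raw bytes that were originally UTF-8, and the UTF-8 step reinterprets them correctly.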

Tags: python swift encoding utf-8 character-encoding


【Solution 1】:

You must first decode the JSON data, then extract the string, and finally "fix" the string. Here is a self-contained example with the JSON from https://xkcd.com/1814/info.0.json:

import Foundation

let data = """
    {"month": "3", "num": 1814, "link": "", "year": "2017", "news": "",
    "safe_title": "Color Pattern", "transcript": "",
    "alt": "\\u00e2\\u0099\\u00ab When the spacing is tight / And the difference is slight / That's a moir\\u00c3\\u00a9 \\u00e2\\u0099\\u00ab",
    "img": "https://imgs.xkcd.com/comics/color_pattern.png",
    "title": "Color Pattern", "day": "22"}
""".data(using: .utf8)!

// Alternatively:
// let url = URL(string: "https://xkcd.com/1814/info.0.json")!
// let data = try! Data(contentsOf: url)

do {
    if let dict = (try JSONSerialization.jsonObject(with: data, options: [])) as? [String: Any],
        var alt = dict["alt"] as? String {

        // Now try to fix the "alt" string
        if let isoData = alt.data(using: .isoLatin1),
            let altFixed = String(data: isoData, encoding: .utf8) {
            alt = altFixed
        }

        print(alt)
        // ♫ When the spacing is tight / And the difference is slight / That's a moiré ♫
    }
} catch {
    print(error)
}
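
The same three steps (decode the JSON, extract the string, repair it) could also be sketched with Codable. The `Comic` type, the `repairedAlt(from:)` function, and the shortened sample JSON below are assumptions made for illustration; they are not part of the answer above.

```swift
import Foundation

// Hypothetical model: only the field used here is declared.
struct Comic: Decodable {
    let alt: String
}

// Hypothetical helper: decodes the JSON and repairs the "alt" text if possible.
func repairedAlt(from json: Data) -> String? {
    guard let comic = try? JSONDecoder().decode(Comic.self, from: json) else {
        return nil // not valid JSON, or no "alt" string
    }
    // Fall back to the decoded text when the Latin-1/UTF-8 round trip is not possible.
    guard let latin1 = comic.alt.data(using: .isoLatin1),
          let fixed = String(data: latin1, encoding: .utf8) else {
        return comic.alt
    }
    return fixed
}

// Shortened sample; the raw string literal keeps the \u escapes for the JSON parser.
let sample = Data(#"{"num": 1814, "alt": "That's a moir\u00c3\u00a9"}"#.utf8)
print(repairedAlt(from: sample) ?? "decoding failed")
```

Returning the unrepaired text as a fallback is a design choice for this sketch, so that strings that were never mis-encoded pass through unchanged.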

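If you only have a string of the form

Be careful\u00e2\u0080\u0094it's breeding season

then you can still use JSONSerialization to decode the \uNNNN escape sequences, and then continue as above.

A simple example (error checking omitted for brevity):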

let strbad = "Be careful\\u00e2\\u0080\\u0094it's breeding season"
let decoded = try! JSONSerialization.jsonObject(with: Data("\"\(strbad)\"".utf8), options: .allowFragments) as! String
let strgood = String(data: decoded.data(using: .isoLatin1)!, encoding: .utf8)!
print(strgood)
// Be careful—it's breeding season
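
Because the short example omits error checking, each `!` and `try!` traps on input that does not fit the pattern. A variant without force unwraps might look like the sketch below. The function names `decodingJSONEscapes` and `repairingLatin1Mojibake` are hypothetical, and the sketch assumes the input contains no unescaped double quotes.

```swift
import Foundation

/// Hypothetical helper: decodes \uNNNN escapes by parsing the text as a JSON string fragment.
/// Returns nil if the text is not a valid JSON string body.
func decodingJSONEscapes(_ text: String) -> String? {
    let json = Data("\"\(text)\"".utf8)
    return (try? JSONSerialization.jsonObject(with: json, options: .allowFragments)) as? String
}

/// Hypothetical helper: undoes "UTF-8 bytes read as Latin-1" mojibake.
/// Returns nil if a scalar is above U+00FF or the resulting bytes are not valid UTF-8.
func repairingLatin1Mojibake(_ text: String) -> String? {
    guard let latin1Data = text.data(using: .isoLatin1) else { return nil }
    return String(data: latin1Data, encoding: .utf8)
}

let raw = "Be careful\\u00e2\\u0080\\u0094it's breeding season"
// Keep the raw input if either step fails.
let fixed = decodingJSONEscapes(raw).flatMap(repairingLatin1Mojibake) ?? raw
print(fixed)
```

Each step returns an optional, so the caller decides what to do with input that cannot be repaired instead of crashing.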

【Comments】:

    【Solution 2】:

    I couldn't find anything built in, but I did manage to write this for you.

    import Foundation

    extension String {
        func range(nsRange: NSRange) -> Range<Index> {
            return Range(nsRange, in: self)!
        }
    
        func nsRange(range: Range<Index>) -> NSRange {
            return NSRange(range, in: self)
        }
    
        var fullRange: Range<Index> {
            return startIndex..<endIndex
        }
    
        var fullNSRange: NSRange {
            return nsRange(range: fullRange)
        }
    
        subscript(nsRange: NSRange) -> Substring {
            return self[range(nsRange: nsRange)]
        }
    
        func convertingUnicodeCharacters() -> String {
            var string = self
            // Characters need to be replaced in groups in case of clusters
            let groupedRegex = try! NSRegularExpression(pattern: "(\\\\u[0-9a-fA-F]{1,8})+")
            for match in groupedRegex.matches(in: string, range: string.fullNSRange).reversed() {
                let groupedHexValues = String(string[match.range])
                var characters = [Character]()
                let regex = try! NSRegularExpression(pattern: "\\\\u([0-9a-fA-F]{1,8})")
                for hexMatch in regex.matches(in: groupedHexValues, range: groupedHexValues.fullNSRange) {
                    let hexString = groupedHexValues[Range(hexMatch.range(at: 1), in: groupedHexValues)!]
                    if let hexValue = UInt32(hexString, radix: 16),
                        let scalar = UnicodeScalar(hexValue) {
                        characters.append(Character(scalar))
                    }
                }
                string.replaceSubrange(Range(match.range, in: string)!, with: characters)
            }
            return string
        }
    }
    

    It basically looks for any \u<1-8 digit hex> values and converts them to scalars. Should be fairly straightforward...

    My playground test code is simply:

    let string = "Be careful\\u00e2\\u0080\\u0094-\\u1F496\\u65\\u301it's breeding season"
    let expected = "Be careful\u{00e2}\u{0080}\u{0094}-\u{1f496}\u{65}\u{301}it's breeding season"
    string.convertingUnicodeCharacters() == expected // true ?
    

    【Comments】:

    • Martin R's solution looks better than mine. Much simpler, tbh. But this was a fun challenge :)
    • Actually his code crashes with my test string above.
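
A likely reason for the crash mentioned in the last comment: JSON escapes need exactly four hex digits, so `\u65` and `\u301` are not valid JSON, and `\u1F496` is read as U+1F49 followed by the digit 6, which Latin-1 cannot encode. With `try!` and `!`, either failure traps. The sketch below runs both steps with optionals instead; the variable names are illustrative.

```swift
import Foundation

let testString = "Be careful\\u00e2\\u0080\\u0094-\\u1F496\\u65\\u301it's breeding season"

// Step 1: parse as a JSON string fragment; nil if the parser rejects the text.
let unescaped = (try? JSONSerialization.jsonObject(
    with: Data("\"\(testString)\"".utf8), options: .allowFragments)) as? String

// Step 2: Latin-1 round trip; nil if any scalar is above U+00FF
// or the resulting bytes are not valid UTF-8.
let repaired = unescaped
    .flatMap { $0.data(using: .isoLatin1) }
    .flatMap { String(data: $0, encoding: .utf8) }

print(unescaped == nil ? "JSON parsing failed" : "JSON parsing succeeded")
print(repaired ?? "no repair possible")
```

So the two answers solve different problems: Solution 1 repairs mis-encoded text that arrives as valid JSON, while Solution 2 only unescapes `\u` sequences of arbitrary length and leaves the scalars as they are.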