Facebook JSON badly encoded: answers

【Question title】: Facebook JSON badly encoded
【Posted】: 2018-10-05 02:31:57
【Question】:

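I downloaded my Facebook Messenger data (in your Facebook account, go to Settings, then Your Facebook Information, then Download Your Information, then create a file with at least the Messages box checked) to do some cool statistics.

However, there is a small problem with the encoding. I'm not sure, but it looks like Facebook used a bad encoding for this data. When I open the file in a text editor I see something like this: Rados\u00c5\u0082aw. When I open it with Python (as UTF-8) I get RadosÅ\x82aw. What I should get is: Radosław.

My Python script: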

text = open(os.path.join(subdir, file), encoding='utf-8')
conversations.append(json.load(text))

I tried a few of the most common encodings. Example data:

{
  "sender_name": "Rados\u00c5\u0082aw",
  "timestamp": 1524558089,
  "content": "No to trzeba ostatnie treningi zrobi\u00c4\u0087 xD",
  "type": "Generic"
}

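【Question discussion】:

Why do you assume the data is UTF-8? If you don't know its encoding, have you tried other plausible possibilities, such as Windows-1250 or ISO 8859-2?
I tried a few. None of them worked. I came across this earlier question: stackoverflow.com/questions/19161501/… but I don't know how to make it work for me.
Not sure if it helps, but emoji encoding seems funky in Facebook's API: stackoverflow.com/questions/20045268/…
@JakubJendryka: right, I'm not familiar with that system; perhaps there is indeed a Mojibake in there: UTF-8 data decoded as Latin-1, then encoded to JSON.
@Patrick: that is pretty ancient history by now. We no longer use that encoding (and it only applied to emoji).

Tags: python python-3.x unicode mojibake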

【Solution 1】:

This is my approach for Node 17.0.1, based on @hotigeftas's recursive code and using the iconv-lite package.

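import fs from 'fs';
import iconv from 'iconv-lite';

// Recursively undo the mojibake in every string of a parsed JSON value.
function parseObject(object) {
  if (typeof object === 'string') {
    // Turn each character back into one byte, then decode the bytes as UTF-8.
    return iconv.decode(iconv.encode(object, 'latin1'), 'utf8');
  }

  if (typeof object === 'object' && object !== null) {
    for (const key in object) {
      object[key] = parseObject(object[key]);
    }
  }

  return object;
}

// usage (fileName is the path to the exported JSON file)
let file = JSON.parse(fs.readFileSync(fileName, 'utf8'));
file = parseObject(file);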

【Discussion】:

Your answer could be improved by adding more information on what the code does and how it helps the OP.

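【Solution 2】:

Expanding on Martijn's solution #1, which as I see it can lead to recursive object processing (it certainly led me there at first):

You can apply it to the whole serialized JSON string instead, provided you pass ensure_ascii=False: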

json.dumps(obj, ensure_ascii=False, indent=2).encode('latin-1').decode('utf-8')

Then write it to a file or something.
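
A minimal sketch of that "write it to a file" step, assuming the export is small enough to load into memory. The function name repair_export and both file paths are hypothetical, not part of the original answer:

```python
import json
from pathlib import Path


def repair_export(src: Path, dst: Path) -> None:
    """Load a Facebook export, undo the mojibake, and write readable JSON."""
    obj = json.loads(src.read_text(encoding="utf-8"))
    # ensure_ascii=False keeps the mojibake characters unescaped, so that
    # Latin-1 can map each of them back to a single byte.
    text = json.dumps(obj, ensure_ascii=False, indent=2)
    fixed = text.encode("latin-1").decode("utf-8")
    dst.write_text(fixed, encoding="utf-8")
```

A call such as repair_export(Path("message_1.json"), Path("message_1.fixed.json")) would write the repaired copy next to the original. Note that the encode step raises UnicodeEncodeError if the file contains any character above U+00FF, that is, text that was not mangled in the first place.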

PS: This should have been a comment on @Martijn's answer: https://stackoverflow.com/a/50011987/1309932 (but I can't add comments).

【Discussion】:

【Solution 3】:

This is @Geekmoss's answer, adapted for Python 3:

import json
from pathlib import Path

def parse_facebook_json(json_file_path):
    def parse_obj(obj):
        # json.load calls this hook for every decoded JSON object (dict).
        for key in obj:
            if isinstance(obj[key], str):
                obj[key] = obj[key].encode('latin_1').decode('utf-8')
            elif isinstance(obj[key], list):
                # Only strings directly inside the list are repaired here.
                obj[key] = list(map(lambda x: x if type(x) != str else x.encode('latin_1').decode('utf-8'), obj[key]))
        return obj
    with json_file_path.open('rb') as json_file:
        return json.load(json_file, object_hook=parse_obj)

# Usage
parse_facebook_json(Path("/.../message_1.json"))

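【Discussion】:

【Solution 4】:

Facebook programmers seem to have mixed up the concepts of Unicode encoding and escape sequences, probably while implementing their own ad-hoc serializer. Further details are in Invalid Unicode encodings in Facebook data exports.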
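
To make the mix-up concrete, here is a small sketch that contrasts the escape a standard serializer produces with the one found in the export. It only assumes the sample name from the question; how Facebook's serializer actually works internally is not known:

```python
import json

name = "Radosław"

# What a correct serializer emits: one escape per code point (ł is U+0142).
correct = json.dumps(name)

# What the export appears to contain: the UTF-8 bytes of the text, each
# misread as a Latin-1 character and then escaped on its own.
mangled = json.dumps(name.encode("utf-8").decode("latin-1"))

print(correct)  # "Rados\u0142aw"
print(mangled)  # "Rados\u00c5\u0082aw"
```

The second line of output matches the sender_name value in the question, which is why every repair in this thread boils down to reversing that Latin-1 step.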

Try this:

import json
import io

class FacebookIO(io.FileIO):
    def read(self, size: int = -1) -> bytes:
        # Ignores `size` and always returns the whole repaired file.
        data: bytes = super().readall()
        new_data = bytearray()
        i: int = 0
        while i < len(data):
            # \u00c4\u0085
            # 0123456789ab
            if data.startswith(b'\\u00', i):
                u: int = 0
                raw = bytearray()
                # Collect the run of \u00hh escapes as raw bytes.
                while data.startswith(b'\\u00', i + u):
                    raw.append(int(data[i+u+4:i+u+6], 16))
                    u += 6

                char: str = raw.decode('utf-8')
                # Re-escape as valid JSON; [1:-1] drops the surrounding quotes.
                new_data += json.dumps(char)[1:-1].encode('ascii')
                i += u
            else:
                new_data.append(data[i])
                i += 1

        return bytes(new_data)

if __name__ == '__main__':
    with FacebookIO('data.json', 'rb') as f:
        d = json.load(f)
    print(d)

【Discussion】:

【Solution 5】:

Here is a command-line solution with jq and iconv. Tested on Linux.

cat message_1.json | jq . | iconv -f utf8 -t latin1 > m1.json

【Discussion】:

Why is jq . needed? It will just pretty-print the original file.
@jjmerelo You need it to convert the escaped characters into their raw, b0rked form.
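
To illustrate the point about jq, here is a rough Python rendering of the two pipeline stages. It is only an approximation of what jq and iconv do, using the sample name from the question:

```python
import json

escaped = r'{"sender_name": "Rados\u00c5\u0082aw"}'

# The file as exported is pure ASCII, so iconv alone would change nothing.
assert escaped.encode("utf-8") == escaped.encode("latin-1")

# Roughly what `jq .` does: parse, then print non-ASCII characters unescaped.
expanded = json.dumps(json.loads(escaped), ensure_ascii=False)

# Roughly what `iconv -f utf8 -t latin1` does to that text: one byte per
# character, which happens to be the original UTF-8 byte sequence.
repaired_bytes = expanded.encode("latin-1")
print(repaired_bytes.decode("utf-8"))  # {"sender_name": "Radosław"}
```

Without the expansion step there are no non-ASCII characters for the conversion to act on, so the escapes would pass through untouched.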

【Solution 6】:

I would like to extend @Geekmoss's answer with the following recursive code snippet, which I used to decode my Facebook data.

import json

def parse_obj(obj):
    if isinstance(obj, str):
        return obj.encode('latin_1').decode('utf-8')

    if isinstance(obj, list):
        return [parse_obj(o) for o in obj]

    if isinstance(obj, dict):
        return {key: parse_obj(item) for key, item in obj.items()}

    return obj

decoded_data = parse_obj(json.loads(file))

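I found that this works better, because the Facebook data you download may contain lists of dicts, in which case those dicts would just be returned "as is" because of the lambda identity function.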
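
A short usage sketch of the recursive function on nested data. The function is repeated so the example runs on its own, and the sample document, including the tags field, is made up for illustration:

```python
import json


def parse_obj(obj):
    # Same recursive approach as above, repeated so the example runs alone.
    if isinstance(obj, str):
        return obj.encode("latin_1").decode("utf-8")
    if isinstance(obj, list):
        return [parse_obj(o) for o in obj]
    if isinstance(obj, dict):
        return {key: parse_obj(item) for key, item in obj.items()}
    return obj


# Hypothetical sample: a list of dicts, a list nested in a list, and a number.
raw = (
    r'{"participants": [{"name": "Rados\u00c5\u0082aw"}], '
    r'"tags": [["zrobi\u00c4\u0087"]], "timestamp": 1524558089}'
)
decoded = parse_obj(json.loads(raw))
print(decoded)
```

Strings are repaired at any depth, while non-string values such as the timestamp are passed through unchanged.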

【Discussion】:

【Solution 7】:

Based on @Martijn Pieters's solution, I wrote something similar in Java.

public String getMessengerJson(Path path) throws IOException {
    String badlyEncoded = Files.readString(path, StandardCharsets.UTF_8);
    String unescaped = unescapeMessenger(badlyEncoded);
    byte[] bytes = unescaped.getBytes(StandardCharsets.ISO_8859_1);
    String fixed = new String(bytes, StandardCharsets.UTF_8);
    return fixed;
}

The unescape method is inspired by org.apache.commons.lang.StringEscapeUtils.

private String unescapeMessenger(String str) {
    if (str == null) {
        return null;
    }
    try {
        StringWriter writer = new StringWriter(str.length());
        unescapeMessenger(writer, str);
        return writer.toString();
    } catch (IOException ioe) {
        // this should never ever happen while writing to a StringWriter
        throw new UnhandledException(ioe);
    }
}

private void unescapeMessenger(Writer out, String str) throws IOException {
    if (out == null) {
        throw new IllegalArgumentException("The Writer must not be null");
    }
    if (str == null) {
        return;
    }
    int sz = str.length();
    StrBuilder unicode = new StrBuilder(4);
    boolean hadSlash = false;
    boolean inUnicode = false;
    for (int i = 0; i < sz; i++) {
        char ch = str.charAt(i);
        if (inUnicode) {
            unicode.append(ch);
            if (unicode.length() == 4) {
                // unicode now contains the four hex digits
                // which represents our unicode character
                try {
                    int value = Integer.parseInt(unicode.toString(), 16);
                    out.write((char) value);
                    unicode.setLength(0);
                    inUnicode = false;
                    hadSlash = false;
                } catch (NumberFormatException nfe) {
                    throw new NestableRuntimeException("Unable to parse unicode value: " + unicode, nfe);
                }
            }
            continue;
        }
        if (hadSlash) {
            hadSlash = false;
            if (ch == 'u') {
                inUnicode = true;
            } else {
                out.write("\\");
                out.write(ch);
            }
            continue;
        } else if (ch == '\\') {
            hadSlash = true;
            continue;
        }
        out.write(ch);
    }
    if (hadSlash) {
        // then we're in the weird case of a \ at the end of the
        // string, let's output it anyway.
        out.write('\\');
    }
}

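【Discussion】:

So I spent some time trying your Java solution, only to debug it and learn that in the larger unescapeMessenger routine, at the top of the for loop, you have an if (inUnicode), which you set to false before the loop starts... so nothing gets processed... what is going on?
But the for-loop body does not end with the first conditional block. The inUnicode variable is set to true in the second conditional block when we are on the 'u' character of a '\u' prefix.
Well, it never worked for me. I parsed the string another way, crude but effective.

【Solution 8】:

My solution for parsing objects uses the object_hook callback of the load/loads functions: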

import json


def parse_obj(dct):
    for key in dct:
        dct[key] = dct[key].encode('latin_1').decode('utf-8')
        pass
    return dct


# Raw string, so the \u escapes reach the JSON parser untouched.
data = r'{"msg": "Ahoj sv\u00c4\u009bte"}'

# String
json.loads(data)
# Out: {'msg': 'Ahoj svÄ\x9bte'}
json.loads(data, object_hook=parse_obj)
# Out: {'msg': 'Ahoj světe'}

# File
with open('/path/to/file.json') as f:
    json.load(f, object_hook=parse_obj)
    # Out: {'msg': 'Ahoj světe'}

Update:

The solution above does not work for lists of strings. So here is the updated solution:

import json


def parse_obj(obj):
    for key in obj:
        if isinstance(obj[key], str):
            obj[key] = obj[key].encode('latin_1').decode('utf-8')
        elif isinstance(obj[key], list):
            obj[key] = list(map(lambda x: x if type(x) != str else x.encode('latin_1').decode('utf-8'), obj[key]))
        pass
    return obj
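
For context on why a per-dict hook also covers nested objects: the json module calls object_hook once for every JSON object it decodes, innermost first, including objects that sit inside lists. A small sketch that records those calls (recording_hook is a hypothetical name used only here):

```python
import json

seen = []


def recording_hook(dct):
    # Record every dict the decoder hands to the hook, then return it unchanged.
    seen.append(dict(dct))
    return dct


json.loads('{"a": {"b": 1}, "c": [{"d": 2}]}', object_hook=recording_hook)
print(seen)
# [{'b': 1}, {'d': 2}, {'a': {'b': 1}, 'c': [{'d': 2}]}]
```

So dicts inside lists are handled by the hook itself. What the updated parse_obj does not reach are strings in lists nested inside other lists.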

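【Discussion】:

Thank you so much! This problem has been driving me crazy, and your solution works perfectly.

【Solution 9】:

I can indeed confirm that the Facebook download data is incorrectly encoded; it is a Mojibake. The original data is UTF-8 encoded but was decoded as Latin-1 instead. I'll make sure to file a bug report.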

In the meantime, you can repair the damage in two ways:

Decode the data as JSON, then re-encode any strings as Latin-1 and decode them again as UTF-8:

>>> import json
>>> data = r'"Rados\u00c5\u0082aw"'
>>> json.loads(data).encode('latin1').decode('utf8')
'Radosław'

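Load the data as binary, replace every \u00hh sequence with the byte that its last two hex digits represent, decode the result as UTF-8, and then decode it as JSON: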

import json
import os
import re
from functools import partial

fix_mojibake_escapes = partial(
     re.compile(rb'\\u00([\da-f]{2})').sub,
     lambda m: bytes.fromhex(m.group(1).decode()))

with open(os.path.join(subdir, file), 'rb') as binary_data:
    repaired = fix_mojibake_escapes(binary_data.read())
data = json.loads(repaired.decode('utf8'))

From your sample data, this produces:

{'content': 'No to trzeba ostatnie treningi zrobić xD',
 'sender_name': 'Radosław',
 'timestamp': 1524558089,
 'type': 'Generic'}
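
Both repairs assume that every string in the file is mangled. If an export mixes mangled and clean text, the encode or decode step raises an exception. A defensive sketch of the first approach, where fix_mojibake is a hypothetical helper and not part of the answer above, leaves a string alone when it does not round-trip:

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8-read-as-Latin-1 damage, or return the text unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards:
        # assume the string was never mangled.
        return text


print(fix_mojibake("RadosÅ\x82aw"))  # Radosław
print(fix_mojibake("Radosław"))      # Radosław (already clean)
```

The trade-off is that a genuinely corrupted string is passed through silently instead of being reported, so logging the exception may be preferable when auditing an export.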

【Discussion】:

@Alper: without a sample of the data that leads to that error, I'm afraid there is not much I can do, sorry.