lua cjson cannot decode specific unicode char? Answers

【Question title】: lua cjson cannot decode specific unicode char?
【Posted】: 2017-01-12 02:18:39
【Question description】:

When trying to decode a specific Unicode character, I get the following error from lua cjson:

root@9dc8433e6d83:~/torch-rnn# th train.lua -input_h5 data/aud.h5 -input_json data/aud.json -batch_size 50 -seq_length 100 -rnn_size 256 -max_epochs 50
Running with CUDA on GPU 0  
/root/torch/install/bin/luajit: train.lua:77: Expected value but found invalid unicode escape code at character 350873
stack traceback:
    [C]: in function 'read_json'
    train.lua:77: in main chunk
    [C]: in function 'dofile'
    /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

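Tracing through the source, I can see that read_json in train.lua uses cjson behind the scenes.

The offending Unicode escape code is \uda85

If I go to https://www.branah.com/unicode-converter, it tells me the character that the escape should decode to.

The Unicode escape was generated with Python's unichr(55941) and written to a file by redirecting the Python script's output, with PYTHONIOENCODING=UTF-8 set.

The following demonstrates how the character was generated: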

echo "print unichr(55941)" > test.py
python test.py
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    print unichr(55941)
UnicodeEncodeError: 'ascii' codec can't encode character u'\uda85' in position 0: ordinal not in range(128)

# export PYTHONIOENCODING=UTF-8
# python test.py
���
# python test.py > tfile
# cat tfile
���
# python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f=open("tfile",'r')
>>> s=f.readline()
>>> s
'\xed\xaa\x85\n'
>>> print s
���

>>> s.decode('utf-8')
u'\uda85\n'
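
For reference, the three bytes seen in the session above can be reproduced in Python 3 (the session itself used Python 2.7, so this is a sketch of the same behaviour, not the original script). It shows that a strict UTF-8 encoder refuses the lone code point 0xDA85, while the `surrogatepass` error handler emits the same `ed aa 85` bytes that ended up in tfile.

```python
# Python 3 sketch; the question's own session used Python 2.7.
cp = 55941  # 0xDA85, the value passed to unichr() in the question
ch = chr(cp)

# Strict UTF-8 encoding rejects a lone surrogate code point.
try:
    ch.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The 'surrogatepass' handler encodes it anyway, giving the bytes seen in tfile.
raw = ch.encode("utf-8", "surrogatepass")

print(hex(cp), strict_ok, raw)
```

Python 2.7 wrote those bytes without complaint, which is why the problem only surfaced later, when a stricter decoder read the file.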

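Overall, what I am trying to do is take a set of integers in the range 0-65535, map them to UTF-8 characters with Python, and write them to a file. I then want to train an RNN on the character sequence using torch-rnn, which uses Lua. The error occurs when running th train.lua on the files generated by torch-rnn's scripts/preprocess.py.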

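【Comments】:

\uda85 is the first code of a surrogate pair; it must be followed by a second code (dc00-dfff) to complete the Unicode character. A first half without a second half is an error.
Interesting, thanks. Do you know of a list showing just the surrogate codes? In this application I can simply swap them for different values without any problem, so I could hard-code them. - Also, just out of interest, how does the decoding site I linked to produce a valid character when I give it only \uda85 to decode?
@MattWarren The leading or "high surrogate" range is D800–DBFF, and the trailing or "low surrogate" range is DC00–DFFF. See en.wikipedia.org/wiki/…
Thanks, I had been looking around but hadn't found that page. Now I can hack around them :)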
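
The ranges given in the comments above can be written down as a small check. This is a minimal Python 3 sketch; the names `surrogate_kind` and `combine_pair` are made up for illustration and do not come from cjson or torch-rnn. The second function shows how a high and a low surrogate together form one code point, which is why a high surrogate on its own is incomplete.

```python
HIGH_SURROGATES = range(0xD800, 0xDC00)  # D800-DBFF, leading half
LOW_SURROGATES = range(0xDC00, 0xE000)   # DC00-DFFF, trailing half


def surrogate_kind(cp):
    """Return 'high', 'low', or None for an integer code point."""
    if cp in HIGH_SURROGATES:
        return "high"
    if cp in LOW_SURROGATES:
        return "low"
    return None


def combine_pair(high, low):
    """Combine a high/low surrogate pair into the code point it represents."""
    if surrogate_kind(high) != "high" or surrogate_kind(low) != "low":
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(surrogate_kind(0xDA85))              # the code point from the question
print(hex(combine_pair(0xD83D, 0xDE00)))   # a complete pair
```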

Tags: python unicode utf-8 lua torch

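【Solution 1】:

It seems the problem is Unicode surrogates; understanding that means I can filter them out or swap them for different values. For this use case that is not a big problem.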
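
One way to do the swapping, sketched in Python 3 under assumptions: the answer does not say which replacement values were used, so the shift below (moving every value from 0xD800 upward past the surrogate block) is only one possible choice, and `int_to_char` / `char_to_int` are hypothetical helper names. The mapping stays one-to-one, so the original integers can be recovered after training. Whether the shifted code points suit torch-rnn's preprocessing has not been verified here.

```python
SURROGATE_START = 0xD800
SURROGATE_COUNT = 0x800  # D800-DFFF is 2048 code points wide


def int_to_char(n):
    """Map an integer in 0..65535 to a character, skipping the surrogate block."""
    if not 0 <= n <= 0xFFFF:
        raise ValueError("expected an integer in 0..65535")
    cp = n + SURROGATE_COUNT if n >= SURROGATE_START else n
    return chr(cp)


def char_to_int(ch):
    """Inverse of int_to_char."""
    cp = ord(ch)
    if cp >= SURROGATE_START + SURROGATE_COUNT:
        return cp - SURROGATE_COUNT
    return cp


# Every value now encodes under strict UTF-8, including 55941 from the question.
text = "".join(int_to_char(n) for n in range(65536))
data = text.encode("utf-8")
print(hex(ord(int_to_char(55941))))
```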
