Node JS not displaying UTF-8 character from Buffer

【Question title】: Node JS not displaying UTF-8 character from Buffer
【Posted】: 2016-02-03 18:51:34
【Question】:

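I am doing some web crawling and noticed that I was getting some odd documents containing characters like "�".

I visited the websites in question, but there was no visible problem with the documents themselves.

I took a buffer that was being displayed incorrectly and started testing. The problem seems to be in Node.js?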

    var actual = Buffer.from([0x50, 0x72, 0x65, 0xe7, 0x6f]) // this is the buffer I got
    var correct = 'Preço' // This is what I expected to be displayed

    console.log('Correct: ', correct)
    console.log('Actual:', actual.toString('utf8'))

    // Test code per code
    console.log(correct.charCodeAt(0) + '=' + parseInt(actual[0]))
    console.log(correct.charCodeAt(1) + '=' + parseInt(actual[1]))
    console.log(correct.charCodeAt(2) + '=' + parseInt(actual[2]))
    console.log(correct.charCodeAt(3) + '=' + parseInt(actual[3]))
    console.log(correct.charCodeAt(4) + '=' + parseInt(actual[4]))

Output:

Correct:  Preço
Actual: Pre�o
80=80
114=114
101=101
231=231
111=111

As you can see, every byte matches the corresponding character code! How can they produce different results?

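【Comments】:

Are you sure it's utf-8?
@MinusFour The buffer is definitely a valid utf8 char code sequence; I checked it against a utf8 table.
Right, so you are trying to get U+00E7 from 0xe7, but that is not how it works. U+00E7 in utf-8 is 0xC3, 0xA7.
@MinusFour That worked...
I would try to see whether it works with the binary encoding instead of utf8 (using your original buffer).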
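
The point made in these comments can be checked directly. The following is a minimal sketch, assuming a Node.js version where `Buffer.from` and the `latin1` encoding name are available (`latin1` is the newer name for the `binary` encoding); the variable names are illustrative only.

```javascript
// The bytes from the question: "Pre", then 0xE7, then "o".
const actual = Buffer.from([0x50, 0x72, 0x65, 0xe7, 0x6f]);

// charCodeAt() returns UTF-16 code units, not UTF-8 bytes.
// In UTF-8, U+00E7 (c with cedilla) is encoded as the two bytes 0xC3 0xA7.
const utf8Bytes = Buffer.from('\u00e7', 'utf8');
console.log(utf8Bytes); // <Buffer c3 a7>

// Decoded as UTF-8, the lone 0xE7 byte is invalid and becomes U+FFFD.
const asUtf8 = actual.toString('utf8');
console.log(asUtf8 === 'Pre\ufffdo'); // true

// Decoded as a single-byte encoding, every byte maps to one character.
const asLatin1 = actual.toString('latin1');
console.log(asLatin1); // Preço
```

So the buffer in the question is not UTF-8 at all; it is text in a single-byte encoding whose byte values happen to equal the Unicode code points.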

Tags: javascript node.js web-crawler

【Solution 1】:

Try iconv:

    var actual = Buffer.from([0x50, 0x72, 0x65, 0xe7, 0x6f]) // this is the buffer I got

    var correct = 'Preço' // This is what I expected to be displayed

    console.log('Correct: ', correct)
    console.log('Actual:', actual.toString('utf8'))

    // iconv is a third-party package (npm install iconv)
    var iconv = require('iconv');
    // Treat the input bytes as windows-1250 and convert them to UTF-8
    var converter = new iconv.Iconv('windows-1250', 'utf8');
    var data = converter.convert(actual).toString();
    console.log('iconv: ', data);
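
If adding a native dependency is undesirable, a similar conversion can be sketched with the `TextDecoder` class that ships with Node.js. This assumes a Node.js build with full ICU support (the default in recent official releases); without it, legacy encodings such as windows-1250 may not be available to `TextDecoder`.

```javascript
const { TextDecoder } = require('util');

// Same bytes as in the question.
const actual = Buffer.from([0x50, 0x72, 0x65, 0xe7, 0x6f]);

// Interpret the bytes as windows-1250 instead of UTF-8.
const decoder = new TextDecoder('windows-1250');
const decoded = decoder.decode(actual);
console.log('TextDecoder:', decoded); // TextDecoder: Preço
```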

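【Comments】:

Any tips on how to handle multiple encodings in the same document? As far as I can see, they have windows-1250 and utf-8 in the same document! That is clearly part of the problem I am having.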
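
One possible approach to mixed encodings, offered as a sketch rather than a complete solution: decode each chunk strictly as UTF-8 and fall back to a legacy encoding only when that fails. The helper name `decodeWithFallback`, the choice of windows-1250 as the fallback, and the idea of splitting the document into chunks (for example by line) are all assumptions, and the code assumes a Node.js build with full ICU support.

```javascript
const { TextDecoder } = require('util');

// Hypothetical helper: try strict UTF-8 first, then the fallback encoding.
function decodeWithFallback(bytes, fallbackEncoding = 'windows-1250') {
  try {
    // fatal: true makes decode() throw on invalid UTF-8
    // instead of silently inserting U+FFFD.
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch (err) {
    return new TextDecoder(fallbackEncoding).decode(bytes);
  }
}

// The same word stored in two different encodings.
const legacyChunk = Buffer.from([0x50, 0x72, 0x65, 0xe7, 0x6f]);     // windows-1250
const utf8Chunk = Buffer.from([0x50, 0x72, 0x65, 0xc3, 0xa7, 0x6f]); // UTF-8

const fromLegacy = decodeWithFallback(legacyChunk);
const fromUtf8 = decodeWithFallback(utf8Chunk);
console.log(fromLegacy, fromUtf8); // Preço Preço
```

This is only a heuristic: some legacy byte sequences are also valid UTF-8, so a chunk can decode without error and still be wrong.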

【Solution 2】:

Use this code sample to represent 2-byte characters in a string. In the example above, the buffer is trimming the higher byte.

    function SpecialCharsTest(str) {
      // Example input: '€uro'. The € sign is the 2-byte code unit 0x20AC.
      console.log('InStr: ', str);
      var buf = new ArrayBuffer(str.length * 2); // 2 bytes for each char
      var bufView = new Uint16Array(buf);
      var strLen = str.length;
      for (var i = 0; i < strLen; i++) {
        bufView[i] = str.charCodeAt(i); // store each UTF-16 code unit
      }
      console.log('InStr to bufView Array (2-byte):      ', bufView);
      console.log('InStr to buf back to String (2-byte): ' + String.fromCharCode.apply(null, new Uint16Array(buf)));
      return buf;
    }

    SpecialCharsTest('€uro');

Result:

InStr:  €uro
InStr to bufView Array (2-byte):       Uint16Array(4) [8364, 117, 114, 111]
InStr to buf back to String (2-byte): €uro
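
For comparison, Node's `Buffer` can store two bytes per UTF-16 code unit directly through its `utf16le` encoding, and the same string can also be stored as UTF-8. A brief sketch with illustrative variable names:

```javascript
const str = '\u20acuro'; // "€uro"

// UTF-16LE: two bytes per code unit, low byte first.
const utf16 = Buffer.from(str, 'utf16le');
console.log(utf16); // <Buffer ac 20 75 00 72 00 6f 00>
console.log(utf16.toString('utf16le') === str); // true

// UTF-8: the euro sign takes three bytes, each ASCII letter one.
const utf8 = Buffer.from(str, 'utf8');
console.log(utf8); // <Buffer e2 82 ac 75 72 6f>
console.log(utf8.toString('utf8') === str); // true
```

Either encoding round-trips the string, as long as the same encoding name is used for writing and reading.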
