Splitting JavaScript Text into Characters

These are notes on things to watch out for when splitting a JavaScript string into individual characters.

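Introduction

I wrote down what I noticed while splitting "良い天気?" into ["良", "い", "天", "気", "?"] (the last character is an emoji).

The following article overlaps with the explanation here and may also be helpful.

JavaScript text length in half-width units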
- https://qiita.com/yoya/items/5da038312279f98bdd28

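These notes only consider Japanese. The Unicode notion of "one character" is complicated to begin with, so if you want to handle it properly it is better to use a dedicated library, such as this one.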

https://www.npmjs.com/package/graphemesplit
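
As a point of comparison that is not part of the original notes, recent JavaScript runtimes also provide the standard Intl.Segmenter API, which splits text into grapheme clusters. The following is a minimal sketch, assuming a runtime where Intl.Segmenter is available; the helper name splitGraphemes and the "ja" locale are my own choices for illustration.

```javascript
// Sketch: split text into grapheme clusters with the built-in Intl.Segmenter.
// splitGraphemes is a hypothetical helper name, not from the article.
function splitGraphemes(text, locale = "ja") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
    // Each item yielded by segment() has a `segment` property holding the substring.
    return Array.from(segmenter.segment(text), s => s.segment);
}

// Family emoji written with escapes: man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const graphemes = splitGraphemes("良い" + family + "!");
console.log(graphemes.length);  // 4
```

The rest of these notes build the splitting by hand, which is useful for seeing what such an API has to deal with.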

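text[i]

In JavaScript, each character of a string can be accessed as if it were an array element.
Most characters can be retrieved one at a time as follows.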

const text = "良い天気?";
const charArr = []
for (let i = 0; i < text.length; i++) {
    charArr.push(text[i]);
}
console.log(charArr);

[ '良', 'い', '天', '気', '�', '�' ]

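Japanese characters such as kanji and hiragana are generally fine, but emoji and some kanji are not.
Emoji in particular were added later, so their code points do not fit in 16 bits.
As a result, each of them is stored as a so-called surrogate pair, and text[i] returns only one half of the pair.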
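
To make the surrogate pair concrete, here is a small sketch of the UTF-16 arithmetic. The helper name toSurrogatePair and the sample emoji U+1F600 are my own stand-ins, not taken from the example above.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper name used only for illustration.
function toSurrogatePair(codePoint) {
    const offset = codePoint - 0x10000;
    const high = 0xd800 + (offset >> 10);    // upper 10 bits
    const low = 0xdc00 + (offset & 0x3ff);   // lower 10 bits
    return [high, low];
}

const emoji = "\u{1F600}";  // stand-in emoji outside the 16-bit range
console.log(emoji.length);  // 2
console.log(toSurrogatePair(emoji.codePointAt(0)).map(u => u.toString(16)));  // [ 'd83d', 'de00' ]
console.log(emoji.charCodeAt(0).toString(16), emoji.charCodeAt(1).toString(16));  // d83d de00
```

Indexing with emoji[0] or emoji[1] returns one of these two halves, which is why each half prints as a broken character.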

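Array.from

Array.from goes through the string's iterator, which walks the string by code point, so surrogate pairs are kept together.
With this, most emoji can be split off as a single character.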

const text = Array.from("良い天気?");
const charArr = []
for (let i = 0; i < text.length; i++) {
    charArr.push(text[i]);
}
console.log(charArr);

[ '良', 'い', '天', '気', '?' ]

for...of behaves the same way.

const text = "良い天気?";
const charArr = []
for (const c of text) {
    charArr.push(c);
}
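
As a quick check of the difference between the two approaches, the sketch below compares splitting by UTF-16 code unit with splitting by code point. The sample string ends in U+1F600, which is my own stand-in for the emoji.

```javascript
const sample = "良い天気\u{1F600}";  // the trailing emoji is a stand-in

const byUnit = sample.split("");   // splits by UTF-16 code unit
const byPoint = [...sample];       // spread uses the string iterator (code points)

console.log(byUnit.length);   // 6: the emoji counts as two code units
console.log(byPoint.length);  // 5: the emoji stays in one piece
```

Spread syntax, Array.from, and for...of all rely on the same string iterator, so they give the same result.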

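However, composite glyphs (Unicode ligatures such as emoji joined by ZWJ) are not kept together. Variant characters do not work either.

const text = "葛\u{E0100}城市👨‍👩‍👧‍👦";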
const charArr = []
for (const c of text) {
    charArr.push(c);
}
console.log(charArr);

[
  '葛', '󠄀', '城', '市',
  '👨', '‍', '👩', '‍',
  '👧', '‍', '👦'
]

Converting each element to hexadecimal with c.codePointAt(0).toString(16) gives:

[
   '845b',  'e0100',   '57ce',  '5e02',
  '1f468',   '200d',  '1f469',  '200d',
  '1f467',   '200d',  '1f466'
]

The element between '葛' and '城' is 0xe0100, which is a variation selector.
The elements sandwiched between the emoji are 0x200d, which is the ZWJ (Zero Width Joiner).
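
Since both of these code points are invisible, a labeled dump can be easier to read than the raw array. The following is a rough sketch; dumpCodePoints is a hypothetical helper name, and the upper bound 0xe01ef is the end of the Variation Selectors Supplement block.

```javascript
// Sketch: list each code point as "U+XXXX" with a label for invisible ones.
// dumpCodePoints is a hypothetical helper, not part of the original notes.
function dumpCodePoints(text) {
    return Array.from(text, c => {
        const code = c.codePointAt(0);
        const hex = "U+" + code.toString(16).toUpperCase().padStart(4, "0");
        if (code === 0x200d) {
            return hex + " ZWJ";
        }
        if ((0xfe00 <= code && code <= 0xfe0f) ||
            (0xe0100 <= code && code <= 0xe01ef)) {
            return hex + " VS";
        }
        return hex + " " + c;  // visible character: show it next to its code point
    });
}

const dump = dumpCodePoints("葛\u{E0100}\u{1F468}\u200D\u{1F469}");
console.log(dump);
// [ 'U+845B 葛', 'U+E0100 VS', 'U+1F468 👨', 'U+200D ZWJ', 'U+1F469 👩' ]
```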

ZWJ (Zero Width Joiner)

> Array.from("👨‍👩‍👧‍👦").map(c => c.codePointAt(0).toString(16))
(7) ['1f468', '200d', '1f469', '200d', '1f467', '200d', '1f466']

When 0x200d (ZWJ) follows a character, the next character has to be merged into the same group. That processing looks like this.

const text = "?👨‍👩‍👧‍👦";
const charArr = []
let chara = []
let needCode = 0;

for (const c of text) {
    const code = c.codePointAt(0);
    if (code === 0x200d) {  // ZWJ (Zero Width Joiner)
        needCode += 1;
    } else if (needCode > 0) {
        needCode -= 1;
    } else if (chara.length > 0) {
        charArr.push(chara.join(''));
        chara = [];
    }
    chara.push(c);
}
if (chara.length > 0) {
    charArr.push(chara.join(''));
    chara = [];
}

console.log(charArr);

[ '?', '👨‍👩‍👧‍👦' ]

Variation selectors

Variant characters also need attention.

> Array.from("葛\u{E0100}城市").map(c => c.codePointAt(0).toString(16))
(4) ['845b', 'e0100', '57ce', '5e02']

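'845b' and 'e0100' together form the single character '葛'.

These two kinds of variation selector are the ones most likely to appear in a Japanese-language environment.

Used for	Range
SVS	FE00 to FE0F
IVS	E0100 to E01EF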

https://ja.wikipedia.org/wiki/異体字セレクタ

 if (((0xfe00 <= code) && (code <= 0xfe0f)) ||
     ((0xe0100 <= code) && (code <= 0xe01ef))) {
     // Variation Selector: keep it attached to the preceding character
 }
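
One use of this range check is counting characters while ignoring selectors. The following is a minimal sketch under that assumption; the names isVariationSelector and countWithoutSelectors are hypothetical, and the upper bound 0xe01ef is the end of the Variation Selectors Supplement block.

```javascript
// Returns true for code points in the SVS or IVS selector ranges.
function isVariationSelector(code) {
    return (0xfe00 <= code && code <= 0xfe0f) ||
           (0xe0100 <= code && code <= 0xe01ef);
}

// Hypothetical helper: count code points, skipping variation selectors.
function countWithoutSelectors(text) {
    let count = 0;
    for (const c of text) {
        if (!isVariationSelector(c.codePointAt(0))) {
            count += 1;
        }
    }
    return count;
}

const kuzu = "葛\u{E0100}城市";
console.log(Array.from(kuzu).length);      // 4: the selector is counted separately
console.log(countWithoutSelectors(kuzu));  // 3: the selector is ignored
```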

Emoji modifiers

Emoji modifiers are handled as well. Like variation selectors, they come after the character they modify.

Array.from("👍🏻👍🏼👍🏽👍🏾👍🏿").map(c => c.codePointAt(0).toString(16))
(10) ['1f44d', '1f3fb', '1f44d', '1f3fc', '1f44d', '1f3fd', '1f44d', '1f3fe', '1f44d', '1f3ff']

 if ((0x1f3fb <= code) && (code <= 0x1f3ff)) {
     // Emoji Modifier: keep it attached to the preceding emoji
 }

With this check, each modifier stays with its emoji.

[ '👍🏻', '👍🏼', '👍🏽', '👍🏾', '👍🏿' ]
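
The fragment above only shows the range check, so here is a self-contained sketch of the same idea. The generic helper splitAttaching and the predicate isEmojiModifier are names I made up for illustration; the helper glues onto the previous element any code point that the predicate accepts.

```javascript
// Hypothetical helper: split by code point, attaching any code point for which
// isAttached(code) returns true to the element before it.
function splitAttaching(text, isAttached) {
    const result = [];
    for (const c of text) {
        if (result.length > 0 && isAttached(c.codePointAt(0))) {
            result[result.length - 1] += c;
        } else {
            result.push(c);
        }
    }
    return result;
}

// Emoji modifiers (skin tones) occupy U+1F3FB to U+1F3FF.
const isEmojiModifier = code => 0x1f3fb <= code && code <= 0x1f3ff;

// Thumbs up with two different modifiers, then a plain thumbs up.
const thumbs = splitAttaching("\u{1F44D}\u{1F3FB}\u{1F44D}\u{1F3FF}\u{1F44D}", isEmojiModifier);
console.log(thumbs.length);  // 3
```

Passing a different predicate would let the same helper cover variation selectors, although it does not handle ZWJ, which needs the look-ahead shown earlier.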

Summary

function textCharaSplit(text) {
    const charArr = []
    let chara = []
    let needCode = 0;
    for (const c of text) {
        const code = c.codePointAt(0);
        if (code === 0x200d) {  // ZWJ (Zero Width Joiner)
            needCode += 1;
        } else if (((0xfe00 <= code) && (code <= 0xfe0f)) ||
                   ((0xe0100 <= code) && (code <= 0xe01ef))) {
            // Variation Selector: stays attached to the preceding character
        } else if ((0x1f3fb <= code) && (code <= 0x1f3ff)) {
            // Emoji Modifier: stays attached to the preceding emoji
        } else if (needCode > 0) {
            needCode -= 1;
        } else if (chara.length > 0) {
            charArr.push(chara.join(''));
            chara = [];
        }
        chara.push(c);
    }
    if (chara.length > 0) {
        charArr.push(chara.join(''));
        chara = [];
    }
    return charArr;
}

> textCharaSplit("A01赤-葛\u{E0100}城市?👨‍👩‍👧‍👦!");
(11) ['A', '0', '1', '赤', '-', '葛󠄀', '城', '市', '?', '👨‍👩‍👧‍👦', '!']

It seems to work.

References

Do you know how emoji are handled in JavaScript?
- https://tech.smartcamp.co.jp/entry/emoji-problem-on-javascript
How to count character codes and the "number of characters" in JavaScript
- https://blog.jxck.io/entries/2017-03-02/unicode-in-javascript.html
A story about struggling to correctly handle emoji, surrogate pairs, combining characters, and grapheme clusters in JavaScript
- https://qiita.com/amanoese/items/68bb9999829de4323302
JavaScript: counting characters with variation selectors taken into account
- https://blog.sarabande.jp/post/79238170880
  - https://ja.wikipedia.org/wiki/異体字セレクタ
Variation selectors: character list, part 1, Unicode U+FE00 to U+FE0F (the 65025th to the 65040th character)
- https://0g0.org/category/FE00-FE0F/1/
https://www.weblio.jp/wkpja/content/その他の記号及び絵記号_その他の記号及び絵記号の概要

Original article: https://www.likecs.com/show-308622250.html