Android - How to filter emoji (emoticons) from a string? Answers

【Question title】: Android - How to filter emoji (emoticons) from a string?
【Posted】: 2014-04-06 08:46:28
【Question description】:

I am developing an Android app, and I don't want people to use emoji in their input.

How can I remove emoji from a string?

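【Question discussion】:

A regular expression is one option. Alternatively, if the list of emoji is well known, a simple list that you iterate over, removing every match from the input, would work well.
See stackoverflow.com/questions/12013341/…
You could use the Character class: stackoverflow.com/questions/28366172/check-if-letter-is-emoji/…
@user2474486 That is not what is being asked here. The Character class can indeed identify surrogate pairs, but that does not mean the character is an emoji. For example, U+1D120 is not an emoji, yet it is encoded as a surrogate pair.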

Tags: android emoji

【Solution 1】:

Emoji can be found in the following ranges (source):

U+2190 to U+21FF
U+2600 to U+26FF
U+2700 to U+27BF
U+3000 to U+303F
U+1F300 to U+1F64F
U+1F680 to U+1F6FF

You can use this line in your script to filter them all at once:
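The one-liner itself is not shown here. As a minimal sketch only, assuming the six ranges listed in this answer, a Java version could use a regular expression with `\x{...}` code point escapes, which accept values longer than four hex digits. The names `RangeEmojiFilter` and `removeEmoji` are made up for illustration. Note that U+3000 to U+303F is the CJK Symbols and Punctuation block, so this range also strips characters such as 、 and 。.

```java
import java.util.regex.Pattern;

final class RangeEmojiFilter {
    // Assumption: only the six ranges listed in the answer count as emoji.
    // The x{...} escape takes a full code point, so 1F300 works as written.
    private static final Pattern EMOJI = Pattern.compile(
            "[\\x{2190}-\\x{21FF}"
            + "\\x{2600}-\\x{26FF}"
            + "\\x{2700}-\\x{27BF}"
            + "\\x{3000}-\\x{303F}"
            + "\\x{1F300}-\\x{1F64F}"
            + "\\x{1F680}-\\x{1F6FF}]");

    /** Removes every code point that falls inside one of the ranges above. */
    static String removeEmoji(String input) {
        return EMOJI.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // U+1F600 written as a surrogate pair, followed by U+2600.
        String text = "Hi \uD83D\uDE00 there \u2600";
        System.out.println(removeEmoji(text));
    }
}
```

Java's regex engine matches by code point, so a range such as U+1F300 to U+1F64F matches the whole surrogate pair rather than its halves.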

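【Discussion】:

This is a possible answer, though it does not handle all cases.
@user210504 Which cases does it not handle? Saying "this doesn't handle all cases" is not useful unless you give an example.
\u takes 4 hex digits, so how is this supposed to work for 1f300 and the like?
Doesn't work. In the end I used github.com/vdurmont/emoji-java. For example, to remove all emoji: EmojiParser.removeAllEmojis(text);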

【Solution 2】:

The latest emoji data can be found here:

http://unicode.org/Public/emoji/

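There is one folder per emoji version. As an app developer, it is a good idea to use the latest version available.

Inside a folder you will find text files. The one to check is emoji-data.txt. It contains all the standard emoji code points.

Emoji are spread over many small code point ranges. For the best support, check all of them in your app.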
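To make the idea concrete, here is a rough sketch of reading ranges out of emoji-data.txt. It assumes the file's usual layout of `codepoints ; property # comment`, where the first field is a single hex code point or a `start..end` range. The class `EmojiDataParser`, its methods, and the sample lines are illustrative assumptions, not part of the answer. One thing to watch: the plain `Emoji` property also covers the digits 0-9, `#` and `*`, so filtering on it alone would strip digits from user input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EmojiDataParser {
    /** Collects [start, end] ranges for one property from lines in emoji-data.txt format. */
    static List<int[]> parse(List<String> lines, String property) {
        List<int[]> ranges = new ArrayList<>();
        for (String line : lines) {
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue; // comment-only or blank line
            }
            String[] fields = data.split(";");
            if (fields.length < 2 || !fields[1].trim().equals(property)) {
                continue; // a different property, or a malformed line
            }
            String[] bounds = fields[0].trim().split("\\.\\.");
            int start = Integer.parseInt(bounds[0], 16);
            int end = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : start;
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    /** Returns true if the code point lies inside any of the parsed ranges. */
    static boolean contains(List<int[]> ranges, int codePoint) {
        for (int[] r : ranges) {
            if (codePoint >= r[0] && codePoint <= r[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample lines in the file's format (assumed; load the real file in practice).
        List<String> sample = Arrays.asList(
                "# comment line",
                "0030..0039    ; Emoji                # digits",
                "231A..231B    ; Emoji_Presentation   # watch, hourglass",
                "1F5FB..1F64F  ; Emoji_Presentation");
        List<int[]> ranges = parse(sample, "Emoji_Presentation");
        System.out.println(contains(ranges, 0x1F600));
        System.out.println(contains(ranges, '5'));
    }
}
```

Which property to filter on is a design choice. `Emoji_Presentation` is narrower than `Emoji` and leaves ordinary digits alone, at the cost of missing symbols that only become emoji with a variation selector.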

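Someone asked why \u only accepts 4 digits when there are 5-digit codes. Those codes are encoded as surrogate pairs: an emoji is usually encoded with 2 chars (UTF-16 code units).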
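A small sketch of what that means in Java, using U+1F600 as an arbitrary example code point. The class name `SupplementaryDemo` and the helper `fromCodePoint` are made up for illustration; `Character.toChars`, `String.codePointCount` and `String.codePointAt` are standard library calls.

```java
final class SupplementaryDemo {
    /** Builds a String from one code point; values above U+FFFF become two chars. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String grin = fromCodePoint(0x1F600);
        // Number of UTF-16 chars versus number of code points.
        System.out.println(grin.length());
        System.out.println(grin.codePointCount(0, grin.length()));
        // The same string written as a literal needs two 4-digit escapes.
        System.out.println(grin.equals("\uD83D\uDE00"));
        System.out.println(Integer.toHexString(grin.codePointAt(0)));
    }
}
```

Here `length()` reports 2 while `codePointCount` reports 1, which is why code that walks a string char by char sees two halves instead of one emoji.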

For example, suppose we have a string.

String s = ...;

Get its UTF-16 representation:

byte[] utf16 = s.getBytes("UTF-16BE");

Iterate over the UTF-16 bytes, two at a time:

for(int i = 0; i < utf16.length; i += 2) {

Read one char:

char c = (char)((char)(utf16[i] & 0xff) << 8 | (char)(utf16[i + 1] & 0xff));

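Now check for surrogate pairs. The emoji that need a pair live in the first supplementary plane (plane 1, U+10000 to U+1FFFF), so check whether the first half of the pair is in the range 0xd800..0xd83f.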

if(c >= 0xd800 && c <= 0xd83f) {
    high = c;
    continue;
}

The second half of a surrogate pair lies in the range 0xdc00..0xdfff. We can now convert the pair into a single 5-digit code.

else if(c >= 0xdc00 && c <= 0xdfff) {
    low = c;
    long unicode = (((long)high - 0xd800) * 0x400) + ((long)low - 0xdc00) + 0x10000;
}

All other symbols are not pairs, so handle them as they are.

else {
    long unicode = c;
}
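As a quick check of the surrogate arithmetic used above, the sketch below applies the same formula to the pair 0xD83D 0xDE00 and compares the result with the standard `Character.toCodePoint`. The class `SurrogateMath` and the method `combine` are illustrative names only.

```java
final class SurrogateMath {
    /** Combines a high and a low surrogate with the formula from the answer. */
    static int combine(char high, char low) {
        return ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        char high = '\uD83D';
        char low = '\uDE00';
        // (0xD83D - 0xD800) * 0x400 = 0xF400
        // (0xDE00 - 0xDC00)         = 0x0200
        // 0xF400 + 0x0200 + 0x10000 = 0x1F600
        int manual = combine(high, low);
        System.out.println(Integer.toHexString(manual));
        System.out.println(manual == Character.toCodePoint(high, low));
    }
}
```

In application code `Character.toCodePoint(high, low)` does the same job, so the manual formula is mainly useful for understanding what happens.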

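Now use the data from emoji-data.txt to check whether it is an emoji. If it is, skip it. If not, copy its bytes to the output byte array.

Finally, convert the byte array back to a string: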

String out = new String(outarray, Charset.forName("UTF-16BE"));
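The steps above can be put together more compactly. The following is a sketch under stated assumptions, not the answer's own code: it swaps the UTF-16BE byte array for iteration by code point (`String.codePointAt` and `Character.charCount` handle the surrogate decoding), and `EMOJI_RANGES` is a small hypothetical stand-in for ranges loaded from emoji-data.txt.

```java
final class CodePointEmojiFilter {
    // Hypothetical stand-in for ranges loaded from emoji-data.txt; not complete.
    private static final int[][] EMOJI_RANGES = {
        {0x2600, 0x27BF},
        {0x1F300, 0x1F64F},
        {0x1F680, 0x1F6FF},
    };

    /** Returns true if the code point falls inside one of the assumed ranges. */
    static boolean isEmoji(int codePoint) {
        for (int[] range : EMOJI_RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    /** Copies every non-emoji code point to the output, skipping the rest. */
    static String removeEmoji(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // decodes a surrogate pair if present
            if (!isEmoji(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);    // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeEmoji("a\uD83D\uDE00b\u2764c"));
    }
}
```

Because the loop works on whole code points, a supplementary character that is not in the table, such as U+1D120, passes through intact instead of being split.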

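【Discussion】:

P.S. If you want to remove some additional symbols, the Unicode ranges can be found here: jrgraphix.net/research/unicode.php
The link seems broken to me :(
@jacoballenwood Works for me. Try googling "Unicode character ranges".
Here is the emoji-data.txt file for version 13.0.0: unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt. For later versions, go to unicode.org/Public > [version] > ucd > emoji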

【Solution 3】:

For those using Kotlin, Char.isSurrogate can also help. Use it to find the right indices and remove them.
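For comparison, Java has the same check as `Character.isSurrogate(char)`. The sketch below is a hypothetical helper (`SurrogateStripper.stripSurrogates`), not code from the answer. It drops every surrogate code unit, which means it removes all supplementary characters, emoji or not, and leaves BMP emoji such as U+2764 untouched.

```java
final class SurrogateStripper {
    /** Removes every UTF-16 surrogate code unit from the input. */
    static String stripSurrogates(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isSurrogate(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is removed; U+2764 is in the BMP and stays.
        System.out.println(stripSurrogates("a\uD83D\uDE00b\u2764"));
    }
}
```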

【Discussion】:

This doesn't help if the emoji is made up of several parts, for example the ones with skin tones.

【Solution 4】:

This is the method I use to remove emoji. Note: it only works on API 24 and above.

// Requires: import android.os.Build;
public String remove_Emojis_For_Devices_API_24_Onwards(String name) {
    // Character.UnicodeScript was not added until API 24, so on older
    // devices return the input unchanged rather than an empty string.
    if (Build.VERSION.SDK_INT < 24) {
        return name;
    }

    // Rebuild the name from every char whose script is not UNKNOWN.
    StringBuilder newName = new StringBuilder(name.length());
    for (int i = 0; i < name.length(); i++) {
        char c = name.charAt(i);
        // charAt() yields single UTF-16 code units. Each half of a surrogate
        // pair reports the UNKNOWN script on its own, and that is what drops
        // emoji outside the BMP here.
        if (Character.UnicodeScript.of(c) != Character.UnicodeScript.UNKNOWN) {
            newName.append(c); // not a surrogate half, so keep it
        }
    }
    return newName.toString();
}

So if we pass in a string:

remove_Emojis_For_Devices_API_24_Onwards("? test ? Indic:ढ Japanese:な ? Korean:ㅂ");

Returns: test Indic:ढ Japanese:な Korean:ㅂ

The position or number of emoji doesn't matter.
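To see why this works, and where it stops working, it helps to look at what `Character.UnicodeScript.of` returns for a few values. The probe below is a plain-JVM sketch (no Android `Build` check), assuming a recent JDK; `ScriptProbe` and `scriptOf` are illustrative names.

```java
final class ScriptProbe {
    /** Thin wrapper so the lookups below read uniformly. */
    static Character.UnicodeScript scriptOf(int value) {
        return Character.UnicodeScript.of(value);
    }

    public static void main(String[] args) {
        // The two halves of U+1F600, looked up as separate code units.
        System.out.println(scriptOf(0xD83D));
        System.out.println(scriptOf(0xDE00));
        // The same emoji looked up as a whole code point.
        System.out.println(scriptOf(0x1F600));
        // A BMP emoji (heavy black heart) and a Hiragana letter.
        System.out.println(scriptOf(0x2764));
        System.out.println(scriptOf(0x306A));
    }
}
```

The surrogate halves come back as UNKNOWN, while U+1F600 and U+2764 as whole code points come back as COMMON. So the method is effectively filtering surrogate code units, and emoji that fit in a single char pass through.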

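【Discussion】:

Really interesting, but not perfect. It fails to filter "❤" and "☤", which live in the Dingbats and Miscellaneous Symbols blocks.
@Jenix Those are not emoji.
@Auras Yes, they are. See this: unicode.org/cldr/utility/…
Emoji are scattered all over. That is why it is hard to filter them all. You can find them in the General Punctuation, Dingbats, Emoticons, Miscellaneous Symbols, Miscellaneous Symbols and Pictographs, Supplemental Symbols and Pictographs, and Transport and Map Symbols blocks. As for "☤", I think it was a typo, but "❤" really is an emoji.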