【Question title】: How to convert unicode of an emoji into CLDR Short Name
【Posted】: 2020-04-30 13:25:59
【Question】:

I am using Python to extract comments and display them. When printed, they look like this:

This was heart wrenching \u2764\ufe0f
Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
\u2764\ufe0f\u2764\ufe0f\u2764\ufe0f

How can I convert the Unicode of an emoji into its respective CLDR short name? For example, U+1F44D should print as "thumbs up".

【Comments】:

    Tags: python unicode emoji extraction data-extraction


    【Solution 1】:

    EDIT: I think I found a solution for the problem with the code \ud83d\udc9c

    text = text.encode('utf-16', 'surrogatepass').decode('utf-16')
    

    It converts the surrogate pair \ud83d\udc9c into the correct emoji code point \U0001f49c
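As a minimal sketch of that conversion, the one-liner can be wrapped in a small helper. The name `fix_surrogates` is hypothetical and not part of any library; the sketch assumes the surrogates in the input always come in well-formed high/low pairs.

```python
def fix_surrogates(text):
    """Merge UTF-16 surrogate pairs in a str into single code points.

    'surrogatepass' lets the encoder write the surrogates out as raw
    UTF-16 code units; decoding those bytes as UTF-16 then combines
    each high/low pair into the character it represents.
    """
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


broken = 'Amazing compassion \ud83d\udc9c #tears'
fixed = fix_surrogates(broken)

# The two surrogate code units collapse into one code point, U+1F49C.
print(len(broken) - len(fixed))   # 1
print(hex(ord(fixed[19])))        # 0x1f49c
```

Running the helper once on the whole extracted text, before any name lookup, avoids dealing with surrogates in every later step.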

    Source: How to work with surrogate pairs in Python?

    Wikipedia: Surrogate

    Other: Unicode character inspector


    Using Google I found

    print('\U0001F44D'.encode('ascii', 'namereplace').decode())
    

    Result:

    \N{THUMBS UP SIGN}
    

    And also

    import unicodedata
    
    print(unicodedata.name('\U0001F44D'))
    

    Result:

    THUMBS UP SIGN
    

    So it is better to use Google before asking on Stack Overflow

    https://docs.python.org/3/howto/unicode.html


    The same works for the whole text

    text = '''This was heart wrenching \u2764\ufe0f
    Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
    \u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''
    
    print(text.encode('ascii', 'namereplace').decode())
    

    Result:

    This was heart wrenching \N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}
    Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
    \N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}\N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}\N{HEAVY BLACK HEART}\N{VARIATION SELECTOR-16}
    

    Now you may need to remove the \N{} wrappers
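One way to strip the `\N{...}` wrappers is a regular expression over the escaped string. This is a sketch only: the helper name `unwrap_names` and the `:NAME:` output template are assumptions made for illustration. Any literal `\N{...}` text already present in the input would be rewritten as well.

```python
import re

# Matches the \N{NAME} escapes produced by the 'namereplace' error handler.
NAME_ESCAPE = re.compile(r'\\N\{([^}]+)\}')


def unwrap_names(escaped, template=':{}:'):
    """Replace every \\N{NAME} escape with the name, formatted by `template`."""
    return NAME_ESCAPE.sub(lambda m: template.format(m.group(1)), escaped)


escaped = 'Nice \U0001F44D'.encode('ascii', 'namereplace').decode()
print(escaped)                # Nice \N{THUMBS UP SIGN}
print(unwrap_names(escaped))  # Nice :THUMBS UP SIGN:
```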

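    But there is a problem with \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c


    You can use unicodedata in a for-loop to get the name of every character in the text, but it raises ValueError for characters that have no name, e.g. '\n'. It also returns names for ordinary characters, so you may have to use unicodedata.category() to decide which characters to replace.

    This approach also has a problem with \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c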

    import unicodedata
    
    # http://www.unicode.org/reports/tr44/#General_Category_Values
    
    for char in text:
        try:
            print(char, '|', unicodedata.category(char), '|', unicodedata.name(char))
        except ValueError:
            print(repr(char), '| (repr)')
    

    Result:

    T | Lu | LATIN CAPITAL LETTER T
    h | Ll | LATIN SMALL LETTER H
    i | Ll | LATIN SMALL LETTER I
    s | Ll | LATIN SMALL LETTER S
      | Zs | SPACE
    w | Ll | LATIN SMALL LETTER W
    a | Ll | LATIN SMALL LETTER A
    s | Ll | LATIN SMALL LETTER S
      | Zs | SPACE
    h | Ll | LATIN SMALL LETTER H
    e | Ll | LATIN SMALL LETTER E
    a | Ll | LATIN SMALL LETTER A
    r | Ll | LATIN SMALL LETTER R
    t | Ll | LATIN SMALL LETTER T
      | Zs | SPACE
    w | Ll | LATIN SMALL LETTER W
    r | Ll | LATIN SMALL LETTER R
    e | Ll | LATIN SMALL LETTER E
    n | Ll | LATIN SMALL LETTER N
    c | Ll | LATIN SMALL LETTER C
    h | Ll | LATIN SMALL LETTER H
    i | Ll | LATIN SMALL LETTER I
    n | Ll | LATIN SMALL LETTER N
    g | Ll | LATIN SMALL LETTER G
      | Zs | SPACE
    ❤ | So | HEAVY BLACK HEART
    ️ | Mn | VARIATION SELECTOR-16
    '\n' | (repr)
    A | Lu | LATIN CAPITAL LETTER A
    m | Ll | LATIN SMALL LETTER M
    a | Ll | LATIN SMALL LETTER A
    z | Ll | LATIN SMALL LETTER Z
    i | Ll | LATIN SMALL LETTER I
    n | Ll | LATIN SMALL LETTER N
    g | Ll | LATIN SMALL LETTER G
      | Zs | SPACE
    c | Ll | LATIN SMALL LETTER C
    o | Ll | LATIN SMALL LETTER O
    m | Ll | LATIN SMALL LETTER M
    p | Ll | LATIN SMALL LETTER P
    a | Ll | LATIN SMALL LETTER A
    s | Ll | LATIN SMALL LETTER S
    s | Ll | LATIN SMALL LETTER S
    i | Ll | LATIN SMALL LETTER I
    o | Ll | LATIN SMALL LETTER O
    n | Ll | LATIN SMALL LETTER N
      | Zs | SPACE
    '\ud83d' | (repr)
    '\udc9c' | (repr)
    '\ud83d' | (repr)
    '\udc9c' | (repr)
    '\ud83d' | (repr)
    '\udc9c' | (repr)
      | Zs | SPACE
    # | Po | NUMBER SIGN
    t | Ll | LATIN SMALL LETTER T
    e | Ll | LATIN SMALL LETTER E
    a | Ll | LATIN SMALL LETTER A
    r | Ll | LATIN SMALL LETTER R
    s | Ll | LATIN SMALL LETTER S
    '\n' | (repr)
    ❤ | So | HEAVY BLACK HEART
    ️ | Mn | VARIATION SELECTOR-16
    ❤ | So | HEAVY BLACK HEART
    ️ | Mn | VARIATION SELECTOR-16
    ❤ | So | HEAVY BLACK HEART
    ️ | Mn | VARIATION SELECTOR-16
    

    Because \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c causes problems, I replace each surrogate with ?

    import unicodedata
    
    text = '''This was heart wrenching \u2764\ufe0f
    Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
    \u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''
    
    result = []
    
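    for char in text:
        category = unicodedata.category(char)
        if category in ('So', 'Mn'):
            # symbols and variation selectors: replace with the character name
            result.append(':{}:'.format(unicodedata.name(char)))
        elif category == 'Cs':
            # lone surrogates have no name: replace with a placeholder
            result.append('?')
        else:
            result.append(char)
    
    print(''.join(result))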
    

    Result:

    This was heart wrenching :HEAVY BLACK HEART::VARIATION SELECTOR-16:
    Amazing compassion ?????? #tears
    :HEAVY BLACK HEART::VARIATION SELECTOR-16::HEAVY BLACK HEART::VARIATION SELECTOR-16::HEAVY BLACK HEART::VARIATION SELECTOR-16:
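The surrogate fix and the category check can be combined so that the purple hearts get a name instead of a `?`. The sketch below makes a few assumptions: the function name `name_symbols` is hypothetical, only category `So` is treated as an emoji, and VARIATION SELECTOR-16 is dropped instead of named. It yields Unicode character names, which are not always the same as CLDR short names.

```python
import unicodedata


def name_symbols(text):
    """Replace 'So' (Symbol, other) characters with :UNICODE NAME:."""
    # Merge surrogate pairs first so each emoji is a single code point.
    text = text.encode('utf-16', 'surrogatepass').decode('utf-16')
    result = []
    for char in text:
        if char == '\ufe0f':
            # VARIATION SELECTOR-16 only requests emoji presentation; skip it.
            continue
        if unicodedata.category(char) == 'So':
            # The second argument is returned when the character has no name.
            result.append(':{}:'.format(unicodedata.name(char, 'UNKNOWN')))
        else:
            result.append(char)
    return ''.join(result)


print(name_symbols('Amazing compassion \ud83d\udc9c #tears \u2764\ufe0f'))
# Amazing compassion :PURPLE HEART: #tears :HEAVY BLACK HEART:
```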
    

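    EDIT: Using Google again I found the external module emoji, which can convert some of the names. It also has a problem with \ud83d\udc9c, so I use repr to display the result, but repr also prints newlines as \n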

    text = '''This was heart wrenching \u2764\ufe0f
    Amazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears
    \u2764\ufe0f\u2764\ufe0f\u2764\ufe0f'''
    
    import emoji
    
    #print( repr(emoji.demojize(text, use_aliases=True)) ) 
    print( repr(emoji.demojize(text)) ) 
    

    Result:

    'This was heart wrenching :heart:\nAmazing compassion \ud83d\udc9c\ud83d\udc9c\ud83d\udc9c #tears\n:heart::heart::heart:'
    

    http://www.unicode.org/emoji/charts/full-emoji-list.html

    https://www.webfx.com/tools/emoji-cheat-sheet/

    http://unicode.org/Public/emoji/12.0/emoji-test.txt
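The emoji-test.txt file linked above lists each emoji sequence together with its CLDR short name after a `#`, so it can serve as a lookup table without extra modules. The following is a sketch under assumptions: the helper names are hypothetical, and the sample lines are hand-written to mimic the file's layout, not copied from it. Newer versions of the file put a version token such as `E0.6` before the name, which the pattern treats as optional.

```python
import re

# code points ; status # <emoji> [E<version>] <CLDR short name>
LINE_RE = re.compile(
    r'^(?P<cps>[0-9A-F ]+?)\s*;\s*(?P<status>[\w-]+)\s*#\s*\S+\s+'
    r'(?:E\d+\.\d+\s+)?(?P<name>.+)$'
)


def parse_emoji_test(lines):
    """Map each fully-qualified emoji sequence to its CLDR short name."""
    names = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match['status'] == 'fully-qualified':
            sequence = ''.join(chr(int(cp, 16)) for cp in match['cps'].split())
            names[sequence] = match['name']
    return names


def replace_with_names(text, names):
    """Replace every known emoji sequence in text with :short name:."""
    if not names:
        return text
    # Try longer sequences first so multi-code-point emoji win.
    keys = sorted(names, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(key) for key in keys))
    return pattern.sub(lambda m: ':{}:'.format(names[m.group(0)]), text)


# Hand-written sample lines in the assumed layout of emoji-test.txt.
SAMPLE = [
    '# group: People & Body',
    '1F44D ; fully-qualified # \U0001F44D thumbs up',
    '2764 FE0F ; fully-qualified # \u2764\ufe0f red heart',
    '2764 ; unqualified # \u2764 red heart',
    '1F49C ; fully-qualified # \U0001F49C E0.6 purple heart',
]

names = parse_emoji_test(SAMPLE)
print(replace_with_names('Nice \U0001F44D \u2764\ufe0f', names))
# Nice :thumbs up: :red heart:
```

With the real file, the lines could be read from a downloaded copy and passed to `parse_emoji_test` in the same way. Text containing surrogate pairs would need to be converted to proper code points first.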


    BTW: I found the module demoji, which can find emoji and give their names. But it also has a problem with the code \ud83d\udc9c

    import demoji
    
    # run only once after installing module
    demoji.download_codes()
    
    print(demoji.findall(text))
    

    demoji.download_codes() needs to be run only once, after installing the module.

    Result:

    {'❤️': 'red heart'}
    

    If you receive it as JSON data "\ud83d\udc9c" then you should have no problem, because the JSON parser should convert it automatically

    import json
    
    # escaped unicode in " "  
    data = r'"\ud83d\udc9c"' 
    print(json.loads(data))
    

    In other cases you have to convert it yourself

    # convert to escaped unicode and put in " "  
    data = '"{}"'.format('\ud83d\udc9c'.encode('unicode-escape').decode())
    print(json.loads(data))
    

    How to work with surrogate pairs in Python?

    【Comments】:
