【Question title】: Extract parenthesized acronyms and abbreviations based on letter count and length
【Posted】: 2022-01-21 23:00:41
【Question】:

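I realize this has already been addressed here (e.g., Retrieve definition for parenthesized abbreviation, based on letter count). However, I hope this question is sufficiently different.

I want to extract parenthesized acronyms and abbreviations from a given string.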

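import re

def extract_acronyms_abbreviations(text):
    eaa = {}
    # Match anything inside a pair of parentheses (non-greedy)
    for match in re.finditer(r"\((.*?)\)", text):
        start_index = match.start()
        abbr = match.group(1)
        # Assume one definition word per character inside the parentheses
        size = len(abbr)
        words = text[:start_index].split()[-size:]
        definition = " ".join(words)

        eaa[abbr] = definition

    return eaa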

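However, the function above treats (such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant) as an acronym, just like (MSWC). It treats everything inside the parentheses as an acronym.

In my case, I want to extract only abbreviations and acronyms that are uppercase and shorter than 8 characters inside the parentheses. If the definition contains "and" or "&", I need to take one additional word for the definition.

Sample text: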

text = """Today we are extremely excited to announce the initial release of the Multilingual Spoken Words Corpus (MSWC), a large and growing audio dataset of spoken words in 50 different languages. These languages are collectively spoken by over 5 billion people and for most languages, this is the first publicly available, free-to-use dataset for training voice interfaces. It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices (such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant).Resource and Information Management (RIM)"""

Output

extract_acronyms_abbreviations(text)
{'MSWC': 'Multilingual Spoken Words Corpus',
 'such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant': 'It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices',
'RIM': 'and Information Management'}

Expected output

{'MSWC': 'Multilingual Spoken Words Corpus',
'RIM': 'Resource and Information Management'}

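【Comments】:

  • If I understand correctly, your problem is that the word count comes out wrong when there are extra small words such as "and"?
  • It is the number of characters; if there is an "&" or "and", one more word needs to be added for the abbreviation.

Tags: python regex string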


【Solution 1】:

You can change the regular expression to r"\(([A-Z]{1,7})\)". This matches only uppercase letters A-Z and ensures the acronym is 1 to 7 characters long.
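
As a rough sketch, that pattern could be combined with the "and"/"&" rule from the question roughly as follows. The function name extract_acronyms, the CONNECTORS set, and the word-tokenizing pattern are assumptions made for illustration; they do not come from the original post. The idea is to walk backwards through the words before each match and let connector words ride along without counting toward the letter total.

```python
import re

# Assumption: only these connector words get a "free" slot in the definition.
CONNECTORS = {"and", "&"}

def extract_acronyms(text, max_len=7):
    """Map each uppercase parenthesized acronym to the words preceding it."""
    result = {}
    pattern = r"\(([A-Z]{1,%d})\)" % max_len
    for match in re.finditer(pattern, text):
        abbr = match.group(1)
        # Tokenize on word characters so that "Assistant).Resource"
        # yields "Assistant" and "Resource" as separate words.
        words = re.findall(r"[\w&'’-]+", text[:match.start()])
        needed = len(abbr)
        picked = []
        for word in reversed(words):
            if needed == 0:
                break
            picked.append(word)
            # Connector words are kept but do not use up a letter.
            if word.lower() not in CONNECTORS:
                needed -= 1
        result[abbr] = " ".join(reversed(picked))
    return result

sample = (
    "the Multilingual Spoken Words Corpus (MSWC), used in smart devices "
    "(such as Apple's Siri).Resource and Information Management (RIM)"
)
print(extract_acronyms(sample))
```

On this shortened sample the lowercase parenthetical is skipped, and 'RIM' maps to 'Resource and Information Management'. Definitions that contain other small words (for example "of" or "the") would need those added to CONNECTORS.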

【Comments】:

  • Sorry, my mistake. I have just updated the question. Please take another look.