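【Posted】: 2022-01-21 23:00:41
【Question】:
I do realize this has already been addressed here (e.g., Retrieve definition for parenthesized abbreviation, based on letter count). However, I hope this question is different.
I want to extract parenthesized acronyms and abbreviations, together with their definitions, from a given string.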
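import re

def extract_acronyms_abbreviations(text):
    eaa = {}
    # Every parenthesized span is treated as a candidate abbreviation.
    for match in re.finditer(r"\((.*?)\)", text):
        start_index = match.start()
        abbr = match.group(1)
        size = len(abbr)
        # Take as many preceding words as the abbreviation has characters.
        words = text[:start_index].split()[-size:]
        definition = " ".join(words)
        eaa[abbr] = definition
    return eaa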
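However, the function above treats both (such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant) and (MSWC) as acronyms. It treats everything inside parentheses as an acronym.
In my case, I want to extract all abbreviations and acronyms where the acronym is uppercase and the text inside the parentheses is shorter than 8 characters. If the definition contains "and" or "&", I need to include one more word in the definition.
Sample text: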
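The filter described above (uppercase only, fewer than 8 characters) can be written directly into the regular expression. The following is a minimal sketch, not a definitive rule: the name ACRONYM_PATTERN is hypothetical, and the lower bound of 2 letters is an assumption that goes beyond the stated requirement.

```python
import re

# Assumption: an acronym is 2 to 7 uppercase ASCII letters,
# with nothing else inside the parentheses.
ACRONYM_PATTERN = re.compile(r"\(([A-Z]{2,7})\)")

sample = "Corpus (MSWC), smart devices (such as Siri), Management (RIM), (ABCDEFGH)"
print(ACRONYM_PATTERN.findall(sample))  # ['MSWC', 'RIM']
```

The long parenthetical and the 8-letter candidate are both skipped because the pattern requires the closing parenthesis right after at most 7 uppercase letters.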
text = """Today we are extremely excited to announce the initial release of the Multilingual Spoken Words Corpus (MSWC), a large and growing audio dataset of spoken words in 50 different languages. These languages are collectively spoken by over 5 billion people and for most languages, this is the first publicly available, free-to-use dataset for training voice interfaces. It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices (such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant).Resource and Information Management (RIM)"""
Output
extract_acronyms_abbreviations(text)
{'MSWC': 'Multilingual Spoken Words Corpus',
'such as Apple’s Siri, Amazon’s Alexa, or Google’s Voice Assistant': 'It is licensed under CC-BY 4.0 to inspire academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make keyword spotting voice-based interfaces available for any keyword in any language.Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a common application in many smart devices',
'RIM': 'and Information Management'}
Expected output
{'MSWC': 'Multilingual Spoken Words Corpus',
'RIM': 'Resource and Information Management'}
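【Comments】:
-
If I understand correctly, your problem is that the number of words is not counted correctly when there are extra small words such as "and"?
-
Yes.
The word count follows the number of characters in the acronym; if there is an & or "and", one more word needs to be included.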
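The counting rule from this exchange could be sketched as follows: walk backwards from the parenthesis, count one word per acronym letter, and keep "and" or "&" in the definition without counting it. This is one possible approach under stated assumptions, not an accepted answer: the names extract_acronyms, TOKEN_RE, and CONNECTORS are hypothetical, the 2-letter minimum is assumed, and words are found with a regex rather than split() so that text glued to a preceding sentence (as in "Assistant).Resource") is still separated.

```python
import re

# Assumption: an acronym is 2 to 7 uppercase ASCII letters alone in parentheses.
ACRONYM_RE = re.compile(r"\(([A-Z]{2,7})\)")
# A token is "&" or a word that may contain inner hyphens or apostrophes.
TOKEN_RE = re.compile(r"&|\w+(?:[-'’]\w+)*")
# Connectors stay in the definition but do not use up an acronym letter.
CONNECTORS = {"and", "&"}


def extract_acronyms(text):
    result = {}
    for match in ACRONYM_RE.finditer(text):
        abbr = match.group(1)
        tokens = list(TOKEN_RE.finditer(text[:match.start()]))
        needed = len(abbr)  # one counted word per acronym letter
        first = None        # earliest token that belongs to the definition
        for token in reversed(tokens):
            if needed == 0:
                break
            first = token
            if token.group().lower() not in CONNECTORS:
                needed -= 1
        if first is not None:
            # Slice the original text so spacing and connectors are preserved.
            result[abbr] = text[first.start():match.start()].strip()
    return result


print(extract_acronyms("Resource and Information Management (RIM)"))
# {'RIM': 'Resource and Information Management'}
```

Tokenizing the whole prefix for every match keeps the sketch short; for long documents it would be cheaper to tokenize once and reuse the positions.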