Extract acronyms and Māori (non-English) words in a dataframe, and put them in adjacent columns within the dataframe

【Question title】: Extract Acronyms and Māori (non-english) words in a dataframe, and put them in adjacent columns within the dataframe
【Posted】: 2021-10-13 16:16:33
【Question description】:

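Regular expressions have a steep learning curve for me. I have a dataframe containing text (up to 300,000 rows). The outcome column of a dummy file named foo_df.csv contains text that mixes English words, acronyms, and Māori words. foo_df.csv looks like this: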

    outcome
0   I want to go to DHB
1   Self Determination and Self-Management Rangatiratanga
2   mental health wellness and AOD counselling
3   Kai on my table
4   Fishing
5   Support with Oranga Tamariki Advocacy
6   Housing pathway with WINZ
7   Deal with personal matters
8   Referral to Owaraika Health services

The result I want is the table below, with added Abbreviation and Māori_word columns:

    outcome                                                 Abbreviation     Māori_word             
0   I want to go to DHB                                     DHB      
1   Self Determination and Self-Management Rangatiratanga                    Rangatiratanga
2   mental health wellness and AOD counselling              AOD              
3   Kai on my table                                                          Kai
4   Fishing                                                                  
5   Support with Oranga Tamariki Advocacy                                    Oranga Tamariki
6   Housing pathway with WINZ                               WINZ             
7   Deal with personal matters                                               
8   Referral to Owaraika Health services                                     Owaraika

My approach is to extract the abbreviations with a regular expression and to extract the Māori words with the nltk module.

I have been able to extract the abbreviations with a regular expression using the following code:

pattern = '(\\b[A-Z](?:[\\.&]?[A-Z]){1,7}\\b)'
foo_df['Abbreviation'] = foo_df.outcome.str.extract(pattern)

I have been able to extract the non-English words from a sentence using the following code:

import nltk
nltk.download('words')
from nltk.corpus import words

words = set(nltk.corpus.words.words())

sent = "Self Determination and Self-Management Rangatiratanga"
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
         if not w.lower() in words or not w.isalpha())

However, when I try to apply the code above across the dataframe, I get the error TypeError: expected string or bytes-like object. My attempt is as follows:

def no_english(text):
  words = set(nltk.corpus.words.words())
  " ".join(w for w in nltk.wordpunct_tokenize(text['outcome']) \
         if not w.lower() in words or not w.isalpha())

foo_df['Māori_word'] = foo_df.apply(no_english, axis = 1)
print(foo_df)

Any help in python3 would be greatly appreciated. Thanks.

【Question discussion】:

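Please add the code you have tried, and explain the criteria for treating a substring/word as an abbreviation, since PackNSave does not appear to be an abbreviation. Also explain what counts as a Māori word.
This looks like a natural language processing problem; you will need a library such as SpaCy to process the text.
I improved the formatting of the data and replaced PakNSave with a proper abbreviation, DHB. My idea is to use regex to extract the acronyms and a suitable nlp library to extract the Māori words. Any text that is neither English nor an acronym is a Māori word.
I have partially answered my own question by extracting the abbreviations with a regular expression, and I have been able to extract Māori words from a sentence with the nltk library. However, when I apply this code across the dataframe, I get the error TypeError: expected string or bytes-like object.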

Tags: python-3.x regex pandas nlp acronym

【Solution 1】:

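A short, simple regular expression cannot magically tell whether a word is English, Māori, or an abbreviation. In practice, some words are likely to belong to more than one category, so the task itself is not a clean binary (or, in this case, ternary) classification.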

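What you are trying to do is natural language processing, and there are examples of libraries for language detection in python. They give you the probability that the input is in a given language. This is usually run on whole texts, but you can apply it to individual words.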

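Another approach is to use dictionaries of Māori words and of abbreviations (that is, exhaustive or hand-selected word lists) and write a function that tells whether a word belongs to one of them, assuming English otherwise.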
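The dictionary approach could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `MAORI_WORDS` is a tiny hypothetical word list built from the sample rows in the question (a real list would be far larger), the helper names `extract_acronym` and `extract_maori` are made up for this example, and the acronym pattern is the one from the question. Both helpers return an empty string for non-string input such as NaN, since passing a missing value to a tokenizer is one likely cause of the `TypeError` in the question. Note also that the function in the question never returns its result.

```python
import re

# Acronym pattern from the question: 2 to 8 capital letters,
# optionally separated by "." or "&".
ACRONYM_RE = re.compile(r"\b[A-Z](?:[.&]?[A-Z]){1,7}\b")

# Runs of Unicode letters, so macron vowels such as "ā" stay inside a word.
TOKEN_RE = re.compile(r"[^\W\d_]+")

# Hypothetical, hand-picked list for illustration only.
MAORI_WORDS = {"rangatiratanga", "kai", "oranga", "tamariki", "owaraika"}


def extract_acronym(text):
    """Return the first acronym in text, or "" if there is none."""
    if not isinstance(text, str):  # e.g. NaN from an empty cell
        return ""
    match = ACRONYM_RE.search(text)
    return match.group(0) if match else ""


def extract_maori(text):
    """Return the words of text found in MAORI_WORDS, joined by spaces."""
    if not isinstance(text, str):  # e.g. NaN from an empty cell
        return ""
    return " ".join(w for w in TOKEN_RE.findall(text) if w.lower() in MAORI_WORDS)


if __name__ == "__main__":
    for row in ["I want to go to DHB", "Kai on my table", float("nan")]:
        print(repr(extract_acronym(row)), repr(extract_maori(row)))
```

With pandas, these helpers could be applied to the column directly, for example `foo_df['Māori_word'] = foo_df['outcome'].apply(extract_maori)`, which avoids the row-wise `axis = 1` call.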

【Discussion】: