PYTHON: Identify non-English words in a Pandas dataframe using the enchant library

【Question title】: PYTHON: Identify Non-English words in a Pandas dataframe using enchant library
【Posted】: 2021-08-25 03:21:05
【Question】:

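I like working with pandas because of my affinity for the tidyverse in R when dealing with tables. I have a table of about 200,000 rows and need to replace the punctuation and extract the non-English words, putting them in another column named non_english in the same table. I prefer the enchant library because I have found it more accurate than the nltk library. My dummy table df has the column dundee that I am working on. The dummy data looks like this: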

df = pandas.DataFrame({'dundee':    ["I love:Marae", "My Whanau is everything",  "I love Matauranga", "Tāmaki Makaurau is Whare", "AOD problem is common"]})

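My idea is to strip the punctuation first, write a function that extracts the non-English words, and then apply that function to the dataframe, but I get this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Here is my code: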

import pandas as pd
import enchant
import re
import string

# remove punctuations
df['dundee1'] = df['dundee'].str.replace(r'[^\w\s]+', ' ', regex=True)

# change words to lower case
df['dundee1'] = df['dundee1'].str.lower()


# Function to check if a word is english
def check_eng(word):
    
    # use all available english dictionary
    en_ls = ['en_NZ', 'en_US', 'en_AU', 'en_GB']
    en_bool = False
            
    # check all common dictionaries if word is English 
    for en in en_ls:
        dic = enchant.Dict(en)
        if word != '':
            if dic.check(word) == True:
                en_bool = True
                break

    disp_non_en = ""
    word = word.str.split(' ')

    if len(word) != 0:
        if en_bool == False:
             disp_non_en = disp_non_en + word + ', '

    return disp_non_en

df['non_english'] = check_eng(df['dundee1'])

The desired table looks like this:

    dundee                          non_english
0   I love:Marae                    Marae
1   My Whanau is everything         Whanau
2   I lov Matauranga                love, Matauranga
3   Tāmaki Makaurau is Whare        Tāmaki Makaurau, Whare
4   AOD problem is common           AOD

【Comments】:

Tags: python pandas text nlp

【Solution 1】:

The error comes from this call:

check_eng(df['dundee1'])

Here df['dundee1'] is a Series, and the function contains an if statement that tries to reduce the following comparison to a single boolean:

 if word != '':

word is a Series there, so you should use:

df['dundee1'].apply(check_eng)

instead.
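A small sketch of why the direct call fails and why apply avoids it. The Series s and the names mask, raised and seen_types are made up for illustration and are not part of the original code:

```python
import pandas as pd

s = pd.Series(["i love marae", ""])

# Comparing a Series to a scalar gives a boolean Series, not a single bool.
mask = s != ""

# Using that Series in an `if` asks for one truth value, which pandas refuses.
try:
    if mask:
        pass
    raised = False
except ValueError:
    raised = True

# Series.apply calls the function once per element, so each call sees a str.
seen_types = s.apply(lambda word: type(word).__name__)
```

With apply, the comparison word != '' inside the function is an ordinary string comparison again.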

There is another problem in check_eng:

Instead of:

if len(word) != 0:
    if en_bool == False:
        disp_non_en = disp_non_en + word + ', '

you should use:

words = word.str.split(' ')
for word in words:
    if en_bool == False:
        disp_non_en = disp_non_en + word + ', '

because you do this:

word = word.str.split(' ')

This changes the type of word from str to list, which makes the if check that follows meaningless.
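To make the type change concrete, here is a minimal sketch using a plain str, which is what the function receives once it is called per element. The sample string and the names parts, empty_parts and concat_failed are illustrative only:

```python
word = "i love marae"
parts = word.split(" ")  # plain str method: returns a list of str

# An empty string still splits into one (empty) element,
# so a `len(...) != 0` check on the result is always true.
empty_parts = "".split(" ")

# Concatenating the list onto a str, as the original function does, fails.
try:
    "" + parts + ", "
    concat_failed = False
except TypeError:
    concat_failed = True
```

This is why the split result needs to be looped over word by word rather than appended as a whole.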

For more background on the error itself, see the question "Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()".

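【Comments】:

Thanks @sophos. I changed check_eng(df['dundee1']) to df['dundee1'].apply(check_eng), but now I get the error AttributeError: 'str' object has no attribute 'str'
@WiktorStribiżew: Sorry, you are wrong. The problem is the type of word being changed inside the function. This is now pointed out in the answer.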

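【Solution 2】:

Remove str from word.str.split(' ') and it will work. Inside apply, word is a plain str, so the .str accessor does not exist on it. Try this: words = word.split(' ')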
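Putting both answers together, one possible shape for the whole thing is sketched below. It is not the original poster's final code: the helper names make_enchant_checker and non_english_words, the is_english parameter and the toy vocabulary are all hypothetical, and the enchant part assumes pyenchant plus at least one of the listed dictionaries is installed. The sketch checks each word separately (the function in the question tests the whole string once) and builds the dictionaries once instead of on every call:

```python
import re
import pandas as pd

EN_TAGS = ["en_NZ", "en_US", "en_AU", "en_GB"]  # tags taken from the question


def make_enchant_checker(tags=EN_TAGS):
    """Hypothetical helper: build an is-English predicate from installed dictionaries."""
    import enchant  # lazy import, so the rest of the sketch runs without pyenchant

    # Skip tags that have no dictionary installed; if none are installed,
    # every word will be reported as non-English.
    dicts = [enchant.Dict(tag) for tag in tags if enchant.dict_exists(tag)]
    return lambda word: any(d.check(word) for d in dicts)


def non_english_words(text, is_english):
    """Return the words of `text` rejected by `is_english`, joined by ', '."""
    # \w+ skips punctuation and never yields empty tokens; it is
    # Unicode-aware for str patterns, so 'Tāmaki' stays one word.
    words = re.findall(r"\w+", text)
    return ", ".join(w for w in words if not is_english(w))


df = pd.DataFrame({"dundee": ["I love:Marae", "My Whanau is everything"]})

# Stand-in predicate so the example runs without any dictionaries;
# for real use, replace it with: is_english = make_enchant_checker()
toy_vocab = {"i", "love", "my", "is", "everything"}
is_english = lambda w: w.lower() in toy_vocab

df["non_english"] = df["dundee"].apply(lambda t: non_english_words(t, is_english))
```

Note that this lists every rejected word on its own, so a run such as "Tāmaki Makaurau" would come out as "Tāmaki, Makaurau" rather than grouped as in the desired table.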

【Comments】: