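【Posted】: 2021-08-25 03:21:05
【Question】:
I like working with pandas because I am used to the tidyverse in R for handling tables. I have a table of about 200,000 rows and need to replace punctuation and extract the non-English words, putting them in another column of the same table named non_english. I prefer the enchant library because I have found it more accurate than nltk. My dummy table df has the dundee column I am working on. The dummy data looks like this: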
import pandas
df = pandas.DataFrame({'dundee': ["I love:Marae", "My Whanau is everything", "I love Matauranga", "Tāmaki Makaurau is Whare", "AOD problem is common"]})
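My idea was to strip the punctuation first, write a function that extracts the non-English words, and then apply that function to the DataFrame, but I get this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Here is my code: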
import pandas as pd
import enchant
import re
import string

# remove punctuations
df['dundee1'] = df['dundee'].str.replace(r'[^\w\s]+', ' ')
# change words to lower case
df['dundee1'] = df['dundee1'].str.lower()

# Function to check if a word is english
def check_eng(word):
    # use all available english dictionary
    en_ls = ['en_NZ', 'en_US', 'en_AU', 'en_GB']
    en_bool = False
    # check all common dictionaries if word is English
    for en in en_ls:
        dic = enchant.Dict(en)
        if word != '':
            if dic.check(word) == True:
                en_bool = True
                break
    disp_non_en = ""
    word = word.str.split(' ')
    if len(word) != 0:
        if en_bool == False:
            disp_non_en = disp_non_en + word + ', '
    return disp_non_en

df['non_english'] = check_eng(df['dundee1'])
The table I want looks like this:
   dundee                     non_english
0  I love:Marae               Marae
1  My Whanau is everything    Whanau
2  I lov Matauranga           love, Matauranga
3  Tāmaki Makaurau is Whare   Tāmaki Makaurau, Whare
4  AOD problem is common      AOD
【Discussion】:
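The ValueError most likely comes from `if word != ''`: `check_eng` is called with the whole column, so `word` is a Series, the comparison returns a Series of booleans, and `if` cannot reduce that to a single True/False. One way around it is to write the function for a single string and run it once per row. Below is a minimal sketch of that idea, not a definitive implementation. The names `non_english_words`, `is_english` and `TOY_ENGLISH` are hypothetical, and the tiny word set only stands in for a real dictionary so the example runs without pyenchant; the comments show how enchant could be plugged in instead.

```python
import re

# Hypothetical stand-in for a real dictionary lookup, so this sketch runs
# without pyenchant. With pyenchant installed, one could build the
# dictionaries once (not once per word) and accept a word if any of them
# knows it, for example:
#   dicts = [enchant.Dict(tag) for tag in ('en_NZ', 'en_US', 'en_AU', 'en_GB')]
#   is_english = lambda w: any(d.check(w) for d in dicts)
# (this assumes all four dictionaries are installed on the machine)
TOY_ENGLISH = {'i', 'love', 'my', 'is', 'everything', 'problem', 'common'}


def is_english(word):
    """Toy checker: case-insensitive lookup in the stand-in word set."""
    return word.lower() in TOY_ENGLISH


def non_english_words(text, checker=is_english):
    """Return the words of ONE string that the checker rejects, comma-separated."""
    # Replace runs of punctuation with a space, then split on whitespace.
    words = re.sub(r'[^\w\s]+', ' ', text).split()
    return ', '.join(w for w in words if not checker(w))


rows = ["I love:Marae", "My Whanau is everything", "AOD problem is common"]
for row in rows:
    print(row, '->', non_english_words(row))
```

With the DataFrame from the question, the same function could be applied row by row with `df['non_english'] = df['dundee'].apply(non_english_words)`. Note that this sketch separates every rejected word with a comma, so it would give "Tāmaki, Makaurau, Whare" rather than the grouped "Tāmaki Makaurau, Whare" shown in the desired table.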