How to check dataset for typos and replace them?

【Question title】: How to check dataset for typos and replace them?
【Posted】: 2021-11-17 06:50:32
【Question】:

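I have a question. Is there a way to check whether a specific column contains typos? I have an Excel sheet that I read with pandas.

First, I need to build a list of unique values in Python based on the column name; second, I need to replace the wrong values with new ones.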

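【Comments】:

What do you mean by typos? Like misspelled words?
The 'Region' column contains several values: Midwest Northwest West Northeast East Coast Central South International Centrall Typo => needs to be changed South Typo => needs to be changed
There is no really easy way to do this. I suppose you could try building some kind of pattern-matching algorithm to identify terms that don't fit your normal standard, but that won't be easy. You could take a look at this and see whether it fits your use case.
Is there a way to use group_by to find and replace values in the file?
You can certainly find the rows whose values are not in the list, but it is hard for a computer to determine what the intended value was. It is easier to present them to the user for manual correction.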

Tags: python pandas replace unique

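【Solution 1】:

Working in a Jupyter notebook and doing this semi-manually is probably the best approach. One option is to start by creating a list of correct spellings: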

correct= ['terms','that','are','spelt','correctly']

Then create a subset of your dataframe containing the rows whose values are not in that list.

# isin() tests for an exact match; startswith() would let 'Centrall' pass as 'Central'.
df[~df['columnname'].isin(correct)]

You will then know how many rows are affected. Next, you can count the number of different variants:

df['columnname'].value_counts()
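
The filter and the count above can also be combined so that only the unrecognized values are tallied. This is a minimal sketch, not part of the original answer: the sample data, the `Region` column, and the `valid` list are made up for illustration, and it assumes an exact-match filter with `isin` is what you want.

```python
import pandas as pd

# Hypothetical sample data; the column name and the valid list are assumptions.
df = pd.DataFrame(
    {"Region": ["West", "Central", "Centrall", "South", "Centrall", "Sout"]}
)
valid = ["West", "Central", "South"]

# Keep only the values that are not an exact match for a valid spelling.
unknown = df.loc[~df["Region"].isin(valid), "Region"]

# Count how often each unrecognized variant occurs.
typo_counts = unknown.value_counts()
print(typo_counts)
```

The resulting index is a short list of candidates to review by hand, which is usually much smaller than the full list of unique values.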

If the number is manageable, you can look at the unique values and put them into a list:

listoftypos = list(df['columnname'].unique())
print(listoftypos)

Then, again semi-manually, create a dictionary that maps each typo to its correct spelling:

typodict = {'terma': 'term', 'thaaat': 'that', 'arree': 'are', 'speelt': 'spelt', 'cooorrectly': 'correctly'}

Then iterate over your original dataframe and, if the value in a row is a key in the typo dictionary, replace it with the corresponding correct spelling, like this:

for index, row in df.iterrows():
    value = row['columnname']
    # Only rewrite cells whose value is a known typo.
    if value in typodict:
        df.at[index, 'columnname'] = typodict[value]

A strong caveat applies here: if the dataframe is very large, this will of course have to be done iteratively.
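
For larger frames, a row-by-row loop can be slow. One possible alternative, which is not part of the original answer, is to pass the typo dictionary to pandas' `Series.replace`, which swaps exact matches in one call. The data and mapping below are hypothetical.

```python
import pandas as pd

# Hypothetical data and mapping, for illustration only.
df = pd.DataFrame({"columnname": ["term", "thaaat", "are", "speelt", "thaaat"]})
typodict = {"thaaat": "that", "speelt": "spelt"}

# Series.replace with a dict swaps exact matches and leaves other values alone.
df["columnname"] = df["columnname"].replace(typodict)
print(df["columnname"].tolist())
```

Values that do not appear as keys in the dictionary are left untouched, so correct spellings pass through unchanged.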

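【Comments】:

【Solution 2】:

Keep in mind that general-purpose spell checking is a demanding problem, but I believe this solution will meet your needs with the lowest chance of false matches:

Setup: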

import difflib
from itertools import permutations

import pandas as pd

cardinal_directions = ['north', 'south', 'east', 'west']
regions = ['coast', 'central', 'international', 'mid']

# Build every ordered two-word combination (e.g. 'northwest', 'eastcoast'),
# then add the single words themselves as candidates.
p_lst = list(permutations(cardinal_directions + regions, 2))
area = [''.join(i) for i in p_lst] + cardinal_directions + regions

df = pd.DataFrame({"ID": list(range(0, 9)), "region": ['Midwest', 'Northwest', 'West', 'Northeast', 'East coast', 'Central', 'South', 'International', 'Centrall']})
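
To get a feel for the candidate list built in the setup, it can help to inspect its size and contents. The sketch below rebuilds it with the standard library only; the variable names `words` and `pairs` are introduced here for illustration.

```python
from itertools import permutations

cardinal_directions = ['north', 'south', 'east', 'west']
regions = ['coast', 'central', 'international', 'mid']

words = cardinal_directions + regions
# Every ordered pair of distinct words, joined without a separator.
pairs = [''.join(p) for p in permutations(words, 2)]
area = pairs + words

print(len(pairs))  # 8 * 7 ordered pairs
print(len(area))
print('northwest' in area, 'eastcoast' in area, 'midwest' in area)
print('coastmid' in area)  # nonsense combinations are candidates too
```

Because every ordered pair is generated, the list also contains combinations that are not real region names, such as 'coastmid'. That is harmless as long as no input is closer to one of those than to a real name, but it is worth keeping in mind.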

Initial DF:

ID	region
0	Midwest
1	Northwest
2	West
3	Northeast
4	East coast
5	Central
6	South
7	International
8	Centrall

Function:

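def spell_check(my_str, name_bank):
    # Score every candidate against the lowercased, stripped input.
    cleaned = my_str.lower().strip()
    prcnt = [difflib.SequenceMatcher(None, y, cleaned).ratio() for y in name_bank]
    # Return the candidate with the highest similarity ratio.
    return name_bank[prcnt.index(max(prcnt))]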
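
Note that the function above always returns its best guess, even when nothing in the name bank is really close. A possible variant, sketched here with the standard library's `difflib.get_close_matches`, returns `None` below a similarity threshold so that such values can be reviewed manually. The function name `spell_check_with_cutoff`, the shortened `names` list, and the 0.8 cutoff are assumptions chosen for illustration.

```python
import difflib

def spell_check_with_cutoff(my_str, name_bank, cutoff=0.8):
    """Return the closest name in name_bank, or None if nothing is close enough."""
    cleaned = my_str.lower().strip()
    matches = difflib.get_close_matches(cleaned, name_bank, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical, shortened name bank.
names = ['north', 'south', 'east', 'west', 'central', 'international']
print(spell_check_with_cutoff('Centrall', names))
print(spell_check_with_cutoff('Europe', names))
```

With this name bank, 'Centrall' resolves to 'central', while 'Europe' has no sufficiently similar candidate and comes back as `None`.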

Apply the function to the DF:

df.region=df.region.apply(lambda x: spell_check(x, area))

Resulting DF:

ID	region
0	midwest
1	northwest
2	west
3	northeast
4	eastcoast
5	central
6	south
7	international
8	central
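
One side effect visible in the result is that every label is lowercased and 'East coast' becomes 'eastcoast'. If the original display form matters, one option is to match on a normalized key and map back to the canonical label. This is a different approach from the answer's permutation list: it assumes you already know the full set of valid labels, and the names `canonical`, `by_key`, and `normalize_region` are hypothetical.

```python
import difflib

# Canonical labels as they should appear in the data (taken from the example).
canonical = ['Midwest', 'Northwest', 'West', 'Northeast', 'East coast',
             'Central', 'South', 'International']

# Map a normalized key (lowercase, no spaces) back to its display label.
by_key = {label.lower().replace(' ', ''): label for label in canonical}

def normalize_region(value):
    key = value.lower().strip().replace(' ', '')
    # Pick the known key with the highest similarity ratio to the input.
    best = max(by_key, key=lambda k: difflib.SequenceMatcher(None, k, key).ratio())
    return by_key[best]

print([normalize_region(v) for v in ['East coast', 'Centrall', 'Midwest']])
```

Like the original function, this always returns the nearest label, so it is only suitable when every input is expected to be some variant of a known region.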

I hope this answers your question, and good luck.

【Comments】: