R function to correct words by frequency of more proximate word

【Question title】: R function to correct words by frequency of more proximate word
【Posted】: 2014-09-09 19:36:33
【Question】:

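I have a table of misspelled words. I need to correct each of them by replacing it with the most similar word that has a higher frequency.

For example, after I run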

aggregate(CustomerID ~ Province, ventas2, length)

I get

    1                             
    2                     AMBA         29
    3                   BAIRES          1
    4              BENOS AIRES          1

    12            BUENAS AIRES          1

    17           BUENOS  AIRES          4
    18            buenos aires          7
    19            Buenos Aires          3
    20            BUENOS AIRES      11337
    35                 CORDOBA       2297
    36                cordoba           1
    38               CORDOBESA          1
    39              CORRIENTES        424

So I need to replace buenos aires, Buenos Aires, Baires and BUENOS  AIRES with BUENOS AIRES, but AMBA should not be replaced. Likewise, CORDOBESA and cordoba should be replaced with CORDOBA, not with CORRIENTES.

How can I do this in R?

Thanks!

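【Comments】:

You could measure string distances with, for example, the adist function (or other methods from the stringdist package), group the more similar strings together (using a distance threshold), and then replace every string in a group with the most frequent string of that group. The catch is that this approach is quite sensitive to the distance computation and to the chosen threshold. For example, note that for probably all string distance methods, AMBA is closer to BAIRES than BUENOS AIRES is...
Can you post your code?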
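
The threshold concern in the first comment can be checked directly with base R's adist, which computes the generalized Levenshtein distance. This is only a small illustration on three strings from the question, using default costs of 1 per insertion, deletion and substitution:

```r
# Three of the strings from the question's table
x <- c("AMBA", "BAIRES", "BUENOS AIRES")

# adist() returns a matrix of pairwise edit distances
d <- adist(x)
dimnames(d) <- list(x, x)

d["AMBA", "BAIRES"]          # 5
d["BAIRES", "BUENOS AIRES"]  # 6
d["AMBA", "BUENOS AIRES"]    # much larger, the lengths alone differ by 8
```

So a plain "nearest string" rule would attach BAIRES to AMBA rather than to BUENOS AIRES, which is why some preprocessing and a carefully chosen grouping rule matter.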

Tags: r dictionary replace misspelling

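【Solution 1】:

Here is a possible solution.

Disclaimer:
This code seems to work for your current example. I can't guarantee that the current parameters (e.g. cut height, cluster agglomeration method, distance method, etc.) will be valid for your real (complete) data.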

# recreating your data
data <- 
read.csv(text=
'City,Occurr
AMBA,29
BAIRES,1
BENOS AIRES,1
BUENAS AIRES,1
BUENOS  AIRES,4
buenos aires,7
Buenos Aires,3
BUENOS AIRES,11337
CORDOBA,2297
cordoba,1
CORDOBESA,1
CORRIENTES,424',stringsAsFactors=F)


# simple pre-processing to city strings:
# - removing spaces
# - turning strings to uppercase
cities <- gsub('\\s+','',toupper(data$City))

# string distance computation
# N.B. here you can play with single components of distance costs 
d <- adist(cities, costs=list(insertions=1, deletions=1, substitutions=1))
# assign original cities names to distance matrix
rownames(d) <- data$City
# clustering cities
hc <- hclust(as.dist(d),method='single')

# plot the cluster dendrogram
plot(hc)
# add the cluster rectangles (just to see the clusters) 
# N.B. I decided to cut at distance height < 5
#      (read it as: "I consider equal 2 strings needing
#       less than 5 modifications to pass from one to the other")
#      Obviously you can use another value.
rect.hclust(hc,h=4.9)

# get the clusters ids
clusters <- cutree(hc,h=4.9) 
# turn into data.frame
clusters <- data.frame(City=names(clusters),ClusterId=clusters)

# merge with frequencies
merged <- merge(data,clusters,all.x=TRUE,by='City') 

# add CityCorrected column to the merged data.frame
ret <- by(merged, 
          merged$ClusterId,
          FUN=function(grp){
                # use the full column name instead of relying on partial matching
                idx <- which.max(grp$Occurr)
                grp$CityCorrected <- grp[idx,'City']
                return(grp)
              })

fixed <- do.call(rbind,ret)

Result:

> fixed
              City Occurr ClusterId CityCorrected
1             AMBA     29         1          AMBA
2.2         BAIRES      1         2  BUENOS AIRES
2.3    BENOS AIRES      1         2  BUENOS AIRES
2.4   BUENAS AIRES      1         2  BUENOS AIRES
2.5  BUENOS  AIRES      4         2  BUENOS AIRES
2.6   buenos aires      7         2  BUENOS AIRES
2.7   Buenos Aires      3         2  BUENOS AIRES
2.8   BUENOS AIRES  11337         2  BUENOS AIRES
3.9        cordoba      1         3       CORDOBA
3.10       CORDOBA   2297         3       CORDOBA
3.11     CORDOBESA      1         3       CORDOBA
4       CORRIENTES    424         4    CORRIENTES
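
The fixed table maps each observed spelling to its corrected form, but the original rows still need to be updated. A minimal sketch of that last step, assuming the raw data is a data frame like the question's ventas2 with a Province column. The small fixed and ventas2 objects below are hypothetical stand-ins, and ProvinceFixed is a column name made up for this example:

```r
# Hypothetical stand-in for the 'fixed' table produced above
fixed <- data.frame(
  City          = c("AMBA", "BAIRES", "BUENOS AIRES", "cordoba", "CORDOBA"),
  CityCorrected = c("AMBA", "BUENOS AIRES", "BUENOS AIRES", "CORDOBA", "CORDOBA"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the raw data; SALTA was never clustered
ventas2 <- data.frame(
  CustomerID = 1:6,
  Province   = c("BAIRES", "AMBA", "cordoba", "BUENOS AIRES", "CORDOBA", "SALTA"),
  stringsAsFactors = FALSE
)

# Named vector: names are the observed spellings, values the corrections
lookup <- setNames(fixed$CityCorrected, fixed$City)

# Look up each row; spellings missing from the table come back as NA
corrected <- unname(lookup[ventas2$Province])

# Keep the original value wherever no correction is known
ventas2$ProvinceFixed <- ifelse(is.na(corrected), ventas2$Province, corrected)
```

Writing to a new column rather than overwriting Province keeps the original spelling available for review.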

Cluster dendrogram:
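
Since the disclaimer warns that the cut height may not suit the full data, it can help to see how the number of clusters changes with it before settling on a value. A rough sketch that rebuilds the same clustering and counts clusters at several candidate heights (the heights vector is an arbitrary choice for illustration):

```r
# Same city strings as in the answer
cities_raw <- c("AMBA", "BAIRES", "BENOS AIRES", "BUENAS AIRES", "BUENOS  AIRES",
                "buenos aires", "Buenos Aires", "BUENOS AIRES",
                "CORDOBA", "cordoba", "CORDOBESA", "CORRIENTES")

# Same preprocessing: uppercase and strip whitespace
cities <- gsub("\\s+", "", toupper(cities_raw))

# Edit distances and single-linkage clustering, as in the answer
d <- adist(cities)
rownames(d) <- cities_raw
hc <- hclust(as.dist(d), method = "single")

# Count the clusters obtained at each candidate cut height
heights <- c(0.9, 1.9, 2.9, 3.9, 4.9, 5.9)
n_clusters <- sapply(heights, function(h) length(unique(cutree(hc, h = h))))

print(data.frame(height = heights, n_clusters = n_clusters))
```

At h = 4.9 this gives the four clusters shown in the result table. A sharp drop in the count between two neighbouring heights is a hint that unrelated names are starting to merge.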

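【Discussion】:

This is what I need, but I get an error in the last part (at # add CityCorrected column to the merged data.frame): $<-.data.frame(*tmp*, "CityCorrected", value = integer(0)) : replacement has 0 rows, data has 4
@GabyP: maybe you have NAs in your frequency column for some cities. The code inside the by function doesn't handle that case.
I have no NAs in Occurr, since it is a frequency.
For some reason, grp (i.e. the subset of merged with the same cluster id) is causing a problem. So add a try-catch inside function(grp){ ... } and print grp when an error occurs, to check what is wrong...
I did not use the last part. Instead, I wrote: merged library(plyr) total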
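
The debugging advice above can be sketched as follows. One way an error of this kind can arise is when which.max returns integer(0), for instance if the frequency column has a different name (so grp$Occurr is NULL) or contains only NA; whether that was the cause here is not known. The merged data frame below is a hypothetical stand-in, and pick_most_frequent is a helper name invented for this example:

```r
# Hypothetical stand-in for the 'merged' data frame from the answer
merged <- data.frame(
  City      = c("AMBA", "BAIRES", "BUENOS AIRES", "cordoba", "CORDOBA"),
  Occurr    = c(29, 1, 11337, 1, 2297),
  ClusterId = c(1, 2, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Per-group function with a tryCatch that prints the offending group
pick_most_frequent <- function(grp) {
  tryCatch({
    idx <- which.max(grp$Occurr)
    # which.max() gives integer(0) for NULL or all-NA input
    if (length(idx) == 0) stop("no usable frequency in this group")
    grp$CityCorrected <- grp$City[idx]
    grp
  }, error = function(e) {
    message("Problem in group: ", conditionMessage(e))
    print(grp)
    # Fallback (an assumption of this sketch): leave the names unchanged
    grp$CityCorrected <- grp$City
    grp
  })
}

fixed <- do.call(rbind, by(merged, merged$ClusterId, pick_most_frequent))
```

With this in place a problematic group is printed instead of aborting the whole run, which makes it easier to see what differs from the example data.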

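【Solution 2】:

Here is my small reproduction of your aggregated result. You will need to change all the calls to the data frame to fit your own data structure.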

df
#output
#       word freq
#1         a    1
#2         b    2
#3         c    3

#find the max frequency
mostFrequent<-max(df[,2])  #doesn't handle ties

#find the word we will be replacing with
replaceString<-df[df[,2]==mostFrequent,1]
#[1] "c"

#find all the other words to be replaced
tobereplaced<-df[df[,2]!=mostFrequent,1]
#[1] "a" "b"

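Now suppose you have the following data frame containing the whole dataset; I will reproduce only the single column that holds the words.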

totalData
 #    [,1]
 #[1,] "a" 
 #[2,] "c" 
 #[3,] "b" 
 #[4,] "d" 
 #[5,] "f" 
 #[6,] "a" 
 #[7,] "d" 
 #[8,] "b" 
 #[9,] "c"

We can replace every word in tobereplaced with replaceString using the following call

totalData[totalData%in%tobereplaced]<-replaceString
 #    [,1]
 #[1,] "c" 
 #[2,] "c" 
 #[3,] "c" 
 #[4,] "d" 
 #[5,] "f" 
 #[6,] "c" 
 #[7,] "d" 
 #[8,] "c" 
 #[9,] "c"

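As you can see, all the a's and b's were replaced with c, and the other words stayed the same.

【Discussion】:

I don't need to replace all the rows with the same most frequent value. I will have many different groups. For example, in this case AMBA should not be replaced.
@GabyP What distinguishes AMBA from the words that should be included? Obviously it is probably not a misspelling of Buenos Aires, but I can only work from what you have provided above. Do you also need a way to group "similar" words together?
Yes, and then replace them with the most frequent word within the group.
@GabyP Could you add a more complete picture of your aggregated results to your question? Right now I would just group all the words by their first letter, but I'm sure you have more words.
OK, I just added more data from the results.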
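
The snippets above print df and totalData without showing how they are built. A self-contained version of the same steps, assuming df is a data frame with word and freq columns and totalData is a one-column character matrix as in the printed output. Using which.max instead of max is a small change in this sketch so that a tie resolves to the first maximum:

```r
# Frequency table, as printed above
df <- data.frame(word = c("a", "b", "c"), freq = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Full data: a one-column character matrix, as printed above
totalData <- matrix(c("a", "c", "b", "d", "f", "a", "d", "b", "c"), ncol = 1)

# Most frequent word; which.max() picks the first one if there is a tie
replaceString <- df$word[which.max(df$freq)]

# Every other word in the frequency table gets replaced
tobereplaced <- setdiff(df$word, replaceString)

# Replace in place; words not listed in df ("d", "f") are left alone
totalData[totalData %in% tobereplaced] <- replaceString
```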