【Question title】: How can I fuzzy string match multiple strings from different sized data frames?
【Posted】: 2019-09-12 02:27:00
【Question description】:

I want to match each string in my first dataset with all of its closest matches in the second dataset.

The data look like this:

Dataset 1:

California 
Texas 
Florida 
New York

Dataset 2:

Californiia 
callifoornia
T3xas
Te xas
texas
Fl0 rida
folrida
New york
new york

The desired result is:

col_1                col_2              col_3            col4
California           Californiia        callifoornia
Texas                T3xas              texas            Te xas
Florida              folrida            Fl0 rida
New York             New york           new york

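The question is:

How do I search for common strings between the first dataset and the second dataset, and generate, for each term in the first dataset, the list of terms in the second dataset that correspond to it?

Thanks in advance.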

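【Comments】:

Define "closest". What did your research turn up that is relevant? How did you apply it in your program? Once you have a table with a correct column and a fuzzy column, do you know how to do the separate step of turning multiple rows into one multi-column row? You are really asking two questions here. Both are probably FAQs. What did you find on SO? What can you do?
See the stringdist package and dcast from data.table. There is a nice way to do this in R, but I don't have time to write the code right now. stringdist is relatively easy to use with some basic R chops.
There is a lot of related information on Stack Overflow, for example: stackoverflow.com/questions/27975705/… stackoverflow.com/questions/2231993/… stackoverflow.com/questions/16145064/… stackoverflow.com/questions/5721883/… stackoverflow.com/questions/6044112/… etc.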
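
One way to make "closest" concrete, as the first comment asks, is an edit-distance cutoff. The following is a minimal sketch that uses base R's adist() (generalized Levenshtein distance) instead of the stringdist package. The names ref, cand and max_dist, the cutoff of 3, and the choice to ignore case are illustrative assumptions, not something the question specifies.

```r
# Reference spellings (dataset 1) and noisy spellings (dataset 2) from the question
ref  <- c("California", "Texas", "Florida", "New York")
cand <- c("Californiia", "callifoornia", "T3xas", "Te xas", "texas",
          "Fl0 rida", "folrida", "New york", "new york")

# Levenshtein distance between every candidate (rows) and every reference (columns);
# ignore.case = TRUE so that "texas" and "Texas" count as identical
d <- adist(cand, ref, ignore.case = TRUE)
dimnames(d) <- list(cand, ref)

# Assumed cutoff: a candidate matches a reference if they are at most 3 edits apart
max_dist <- 3

# For each reference string, collect the candidates within the cutoff
matches <- lapply(ref, function(r) cand[d[, r] <= max_dist])
names(matches) <- ref
```

Printing matches shows, per reference term, which dataset 2 strings fall within the cutoff. A smaller max_dist is stricter; a larger one risks pairing unrelated short strings.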

Tags: r string join stringdist

【Solution 1】:

I read a bit about stringdist and came up with this. It's a workaround, but I like it. It can definitely be improved:

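library(stringdist)  # stringdistmatrix()
library(janitor)     # remove_constant()

# Assumes both CSV files have a column called `name`
ds1a <- read.csv('dataset1', stringsAsFactors = FALSE)
ds2a <- read.csv('dataset2', stringsAsFactors = FALSE)

# Rows are the dataset 2 strings, columns are the dataset 1 strings
distancematrix <- stringdistmatrix(ds2a$name, ds1a$name, useNames = "strings")

# check.names = FALSE keeps column names such as "New York" unchanged
df <- data.frame(distancematrix, check.names = FALSE)

# Go through df column by column: every cell with a distance < 4 is replaced
# with the column name, every other cell with an empty string
for (j in seq_len(ncol(df))) {
      trigger <- df[[j]] < 4
      df[[j]] <- ifelse(trigger, names(df)[j], "")
}

# Drop columns that hold the same value in every row (no match at all)
df <- remove_constant(df)

write.csv(df, file="~/Desktop/df.csv")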

【Comments】:

【Solution 2】:

library(fuzzyjoin); library(tidyverse)
dataset1 %>%
  stringdist_left_join(dataset2, 
                       max_dist = 3) %>%
  rename(col_1 = "states.x") %>%
  group_by(col_1) %>%
  mutate(col = paste0("col_", row_number() + 1)) %>%
  spread(col, states.y)

#Joining by: "states"
## A tibble: 4 x 4
## Groups:   col_1 [4]
#  col_1      col_2       col_3        col_4
#  <chr>      <chr>       <chr>        <chr>
#1 California Californiia callifoornia NA   
#2 Florida    Fl0 rida    folrida      NA   
#3 New York   New york    new york     NA   
#4 Texas      T3xas       Te xas       texas

Data:

dataset1 <- data.frame(states = c("California",
                                "Texas",
                                "Florida",
                                "New York"), 
                       stringsAsFactors = F)

dataset2 <- data.frame(stringsAsFactors = F,
  states = c(
    "Californiia",
    "callifoornia",
    "T3xas",
    "Te xas",
    "texas",
    "Fl0 rida",
    "folrida",
    "New york",
    "new york"
  )
)
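
The reshaping step in the pipeline above (number the matches within each group, then spread them into columns) can also be written without tidyr. This is a base R sketch under stated assumptions: the data frame long and its columns ref and match are hypothetical stand-ins for the output of a fuzzy join, shortened to two states for brevity.

```r
# Hypothetical long-format result: one row per (reference, matched string) pair
long <- data.frame(
  ref   = c("California", "California", "Texas", "Texas", "Texas"),
  match = c("Californiia", "callifoornia", "T3xas", "Te xas", "texas"),
  stringsAsFactors = FALSE
)

# Group the matched strings by reference term
by_ref <- split(long$match, long$ref)

# Pad every group with NA up to the length of the longest group
width  <- max(lengths(by_ref))
padded <- lapply(by_ref, function(m) c(m, rep(NA_character_, width - length(m))))

# Stack the padded groups: one row per reference term
m <- do.call(rbind, padded)

# Reference term in col_1, its matches in col_2, col_3, ...
wide <- data.frame(col_1 = rownames(m), unname(m), stringsAsFactors = FALSE)
names(wide)[-1] <- paste0("col_", seq_len(width) + 1)
```

Padding with NA mirrors what spread() does in the pipeline when a state has fewer matches than the widest group.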

【Comments】: