【Question title】: How can I fuzzy string match multiple strings from different sized data frames?
【Posted】: 2019-09-12 02:27:00
【Question】:

I want to match each string in my first dataset with all of its closest common matches in the second.

The data looks like this:

Dataset 1:

California 
Texas 
Florida 
New York

Dataset 2:

Californiia 
callifoornia
T3xas
Te xas
texas
Fl0 rida
folrida
New york
new york

The desired result is:

col_1                col_2              col_3            col_4
California           Californiia        callifoornia
Texas                T3xas              texas            Te xas
Florida              folrida            Fl0 rida
New York             New york           new york

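The question is:

  • How do I search for common strings between the first dataset and the second dataset, and generate a list of the terms in the second dataset that align with each term in the first?

Thanks in advance.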

Tags: r string join stringdist


【Solution 1】:

I read up a bit on stringdist and came up with this. It's a workaround, but I like it. It can definitely be improved:

library(stringdist)
library(janitor)

# Assumes both files contain a column called `name`
ds1a <- read.csv("dataset1")
ds2a <- read.csv("dataset2")

# Rows: strings from dataset 2; columns: strings from dataset 1
distancematrix <- stringdistmatrix(ds2a$name, ds1a$name, useNames = "strings")

# check.names = FALSE keeps column names such as "New York" unchanged
df <- data.frame(distancematrix, check.names = FALSE)

# Go through df column by column: every cell below 4 is replaced with the
# column name (the dataset 1 string), every other cell with an empty string
for (j in seq_len(ncol(df))) {
  trigger <- df[[j]] < 4
  df[[j]] <- ifelse(trigger, names(df)[j], "")
}

# Drop columns that hold the same value in every row
df <- remove_constant(df)

write.csv(df, file = "~/Desktop/df.csv")
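One possible way to go from the distance matrix straight to the wide layout requested in the question is sketched below. It is only an illustration under stated assumptions: the data is typed in from the question, the cutoff of 4 mirrors the loop above, the OSA distance is chosen explicitly, and the names `ds1`, `ds2`, `d`, `matches`, `wide` and `result` are made up for this example.

```r
library(stringdist)

# Example data copied from the question
ds1 <- c("California", "Texas", "Florida", "New York")
ds2 <- c("Californiia", "callifoornia", "T3xas", "Te xas", "texas",
         "Fl0 rida", "folrida", "New york", "new york")

# Distance matrix: one row per ds1 term, one column per ds2 term
d <- stringdistmatrix(ds1, ds2, method = "osa")

# Assumption: same threshold as the loop above (distance strictly below 4)
cutoff <- 4

# For each ds1 term, keep the ds2 terms that are closer than the cutoff
matches <- lapply(seq_along(ds1), function(i) ds2[d[i, ] < cutoff])

# Pad every match vector with NA to a common length and stack them as rows
width <- max(lengths(matches))
wide <- do.call(rbind, lapply(matches, function(m) {
  c(m, rep(NA_character_, width - length(m)))
}))

# One row per ds1 term, followed by its matches in col_2, col_3, ...
result <- data.frame(col_1 = ds1, wide, stringsAsFactors = FALSE)
names(result)[-1] <- paste0("col_", seq_len(width) + 1)
result
```

Matches appear in the order they have in `ds2`, so a different column order needs an explicit sort, for example by distance.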


    【Solution 2】:
    library(fuzzyjoin); library(tidyverse)
    dataset1 %>%
      stringdist_left_join(dataset2, 
                           max_dist = 3) %>%
      rename(col_1 = "states.x") %>%
      group_by(col_1) %>%
      mutate(col = paste0("col_", row_number() + 1)) %>%
      spread(col, states.y)
    
    #Joining by: "states"
    ## A tibble: 4 x 4
    ## Groups:   col_1 [4]
    #  col_1      col_2       col_3        col_4
    #  <chr>      <chr>       <chr>        <chr>
    #1 California Californiia callifoornia NA   
    #2 Florida    Fl0 rida    folrida      NA   
    #3 New York   New york    new york     NA   
    #4 Texas      T3xas       Te xas       texas
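When choosing `max_dist`, it can help to look at the distances themselves. The sketch below is an illustration rather than part of the answer: it uses the `distance_col` and `ignore_case` arguments of `stringdist_left_join`, passes `by = "states"` explicitly, and the names `pairs` and `dist` are chosen only for this example. The data is copied from the question.

```r
library(fuzzyjoin)
library(dplyr)

# Example data copied from the question
dataset1 <- data.frame(states = c("California", "Texas", "Florida", "New York"),
                       stringsAsFactors = FALSE)
dataset2 <- data.frame(states = c("Californiia", "callifoornia", "T3xas",
                                  "Te xas", "texas", "Fl0 rida", "folrida",
                                  "New york", "new york"),
                       stringsAsFactors = FALSE)

# Keep the computed distance in a column called `dist` (illustrative name)
# and compare case-insensitively, so "texas" vs "Texas" counts as distance 0
pairs <- stringdist_left_join(dataset1, dataset2,
                              by = "states",
                              max_dist = 3,
                              ignore_case = TRUE,
                              distance_col = "dist")

# Sort so the closest candidates for each state come first
pairs <- arrange(pairs, states.x, dist)
pairs
```

Inspecting `dist` shows how much headroom the threshold leaves before unrelated strings would start to match.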
    

    Data:

    dataset1 <- data.frame(states = c("California",
                                    "Texas",
                                    "Florida",
                                    "New York"), 
                           stringsAsFactors = FALSE)
    
    dataset2 <- data.frame(stringsAsFactors = FALSE,
      states = c(
        "Californiia",
        "callifoornia",
        "T3xas",
        "Te xas",
        "texas",
        "Fl0 rida",
        "folrida",
        "New york",
        "new york"
      )
    )
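In current tidyr, `spread()` is superseded by `pivot_wider()`. A minimal sketch of the same pipeline with that function follows. It assumes the data shown above, passes `by = "states"` explicitly, and stores the output in a variable called `result`, a name chosen only for this example.

```r
library(fuzzyjoin)
library(dplyr)
library(tidyr)

# Same data as above
dataset1 <- data.frame(states = c("California", "Texas", "Florida", "New York"),
                       stringsAsFactors = FALSE)
dataset2 <- data.frame(states = c("Californiia", "callifoornia", "T3xas",
                                  "Te xas", "texas", "Fl0 rida", "folrida",
                                  "New york", "new york"),
                       stringsAsFactors = FALSE)

result <- dataset1 %>%
  # Pair each state with every dataset2 string within edit distance 3
  stringdist_left_join(dataset2, by = "states", max_dist = 3) %>%
  rename(col_1 = states.x) %>%
  # Number the matches within each state: col_2, col_3, ...
  group_by(col_1) %>%
  mutate(col = paste0("col_", row_number() + 1)) %>%
  ungroup() %>%
  # One row per state, one column per match
  pivot_wider(names_from = col, values_from = states.y)

result
```

The row order may differ from the `spread()` output shown earlier, so sort on `col_1` if a fixed order matters.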
    
