【Question Title】: Joining by multiple columns with stringdist_join
【Posted】: 2020-12-27 14:52:26
【Question】:

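I have two data frames in which column x may contain typos and column y is always correct. I don't understand why joining on multiple columns with stringdist produces these pairs: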

library(dplyr)
library(fuzzyjoin)
a <- data.frame(x = c("season", "season", "season", "package", "package"), y = c("1","2", "3", "1","6"))

b <- data.frame(x = c("season", "seson", "seson", "package", "pakkage"), y = c("1","2", "3", "2","6"))

c <- a %>%
  stringdist_left_join(b, by = c("x", "y"), max_dist = c(1,0))

      x.x y.x     x.y  y.y
1  season   1  season    1
2  season   1   seson    2
3  season   1   seson    3
4  season   2   seson    2
5  season   3  season    1
6  season   3   seson    2
7  season   3   seson    3
8 package   1 package    2
9 package   6    <NA> <NA>

I would like to get:

      x.x y.x     x.y  y.y
1  season   1  season    1
2  season   2   seson    2
3  season   3   seson    3
4 package   1    <NA> <NA>
5 package   6 pakkage    6

【Question Comments】:

    Tags: r join left-join


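    【Solution 1】:

    One way to do this is to create a new column in both datasets that groups similar values of column "x", and then perform a regular left_join on that column together with "y":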

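    library(stringdist)
    library(dplyr)
    # Join on the soundex code of x (grp) plus an exact match on y
    a %>%
      mutate(grp = phonetic(x)) %>%
      # Keep a copy of b's y as y2 so it stays visible after the join
      left_join(b %>% mutate(grp = phonetic(x), y2 = y), by = c("grp", "y")) %>%
      select(-grp)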
    

    Output:

    #      x.x y     x.y   y2
    #1  season 1  season    1
    #2  season 2   seson    2
    #3  season 3   seson    3
    #4 package 1    <NA> <NA>
    #5 package 6 pakkage    6
    

    Another option is to change the method in stringdist_left_join from its default (osa: optimal string alignment, i.e. restricted Damerau-Levenshtein distance) to soundex (distance based on soundex encoding):

    library(fuzzyjoin)
    a %>%
       stringdist_left_join(b, by = c("x", "y"), max_dist = c(1,0), 
                method = "soundex")
    #      x.x y.x     x.y  y.y
    #1  season   1  season    1
    #2  season   2   seson    2
    #3  season   3   seson    3
    #4 package   1    <NA> <NA>
    #5 package   6 pakkage    6
    

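    According to ?"stringdist-metrics":

    For the soundex distance (method='soundex'), strings are translated to a soundex code (see phonetic for a specification). The distance between strings is 0 when they have the same soundex code, otherwise 1. Note that soundex recoding is only meaningful for characters in the ranges a-z and A-Z. A warning is emitted when non-printable or non-ASCII characters are encountered.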
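
To make the quoted definition concrete, here is a minimal sketch using the spellings from the question. It relies only on stringdist() and phonetic() from the stringdist package; the variable names are illustrative and not part of the original answer.

```r
library(stringdist)

# Correct / misspelled pairs taken from the question's data
correct <- c("season", "package")
typo    <- c("seson", "pakkage")

# Soundex codes for each spelling
codes_correct <- phonetic(correct, method = "soundex")
codes_typo    <- phonetic(typo, method = "soundex")

# Soundex distance is binary: 0 for identical codes, 1 otherwise
d_same <- stringdist(correct, typo, method = "soundex")
d_diff <- stringdist("season", "package", method = "soundex")

# The default method ("osa") counts edit operations instead
d_osa <- stringdist(correct, typo, method = "osa")

print(data.frame(correct, typo, codes_correct, codes_typo, d_same, d_osa))
```

Because the soundex distance is only ever 0 or 1, a threshold of 1 accepts every pair under this method, so 0 is the only threshold that actually filters.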

    【Comments】:

    • Do you also happen to know why setting max_dist = c(1,0) in stringdist_left_join does not work?
    • @Maya You can change the method; see the updated post.
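
Regarding the max_dist = c(1,0) question in the comments: a possible explanation, offered as an assumption rather than something confirmed in this thread, is that stringdist_left_join treats max_dist as a single threshold for all columns in by, so a length-two vector is not read as one limit per column. A sketch of an alternative is to use fuzzy_left_join from the same package, which accepts a list of match functions, one per column pair. The helpers x_close and y_equal below are hypothetical names introduced for illustration.

```r
library(fuzzyjoin)
library(stringdist)

a <- data.frame(x = c("season", "season", "season", "package", "package"),
                y = c("1", "2", "3", "1", "6"),
                stringsAsFactors = FALSE)
b <- data.frame(x = c("season", "seson", "seson", "package", "pakkage"),
                y = c("1", "2", "3", "2", "6"),
                stringsAsFactors = FALSE)

# Hypothetical helper: fuzzy match on x, allowing at most one edit (OSA distance)
x_close <- function(u, v) stringdist(u, v, method = "osa") <= 1

# Hypothetical helper: exact match on y
y_equal <- function(u, v) u == v

# One match function per column pair listed in `by`
res <- fuzzy_left_join(a, b, by = c("x", "y"),
                       match_fun = list(x_close, y_equal))
print(res)
```

On the question's data this should pair each row of a with the row of b that has the same y and an x within one edit, leaving package/1 unmatched.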
    【Solution 2】:

    cbind can reproduce your desired output.

    cbind(a,b)
           x y      x y
    1 season 1 season 1
    2 season 2  seson 2
    3 season 3  seson 3
    4 season 4  seson 4
    5 season 6  seson 6
    

    Edit

    If a and b have different numbers of rows, you can try full_join from dplyr:

    full_join(a,b, by = "y")
         x.x y    x.y
    1 season 1 season
    2 season 2  seson
    3 season 3  seson
    4 season 4  seson
    5 season 6  seson
    

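    【Comments】:

    • The data frames I am working with are not sorted the same way and do not have the same length, which is why I cannot use cbind.
    • Please check my edited answer. Does it work as you expect?
    • I think my reproducible example was not a good one. Sorry, I have edited my question!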