Join on substring of a key in R: answers

【Question title】: Join on substring of a key in R
【Posted】: 2020-07-31 08:32:22
【Question】:

I am trying to join two tables where the key in one table may be a substring of the key in the other.

Event
id  date  ProductId  quantity
a   xyz   1234567    30
a   abc   5826811    20
b   def   3619100    10
b   ghi   9268420    50

ProductDimension
code     name  type
234-567  p1    c1
826-81   p2    c2
61-9100  p3    c3  


Result should be:
eventsAU
id  date  ProductId  quantity  name  type
a   xyz   1234567    30        p1    c1
a   abc   5826811    20        p2    c2
b   def   3619100    10        p3    c3

Taking a hint from another question, I am trying a fuzzy join as follows:

ProductDimension$regex <- gsub("-", "", ProductDimension$code)

eventTbl <- tbl_df(Events)
prodcutTbl <- tbl_df(ProductDimension)

eventsAU <- regex_left_join(eventTbl , prodcutTbl , by = c(ProductId = "regex"))

But I get the following error:

Error: All columns in a tibble must be 1d or 2d objects: * Column `col` is NULL

【Comments】:

Tags: r join fuzzyjoin

【Solution 1】:

One option using dplyr and fuzzyjoin:

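library(dplyr)      # %>% and mutate()
library(fuzzyjoin)  # stringdist_inner_join()

# df1 and df2 presumably stand for the question's Event and ProductDimension tables
stringdist_inner_join(df1,
                      df2 %>%
                        mutate(code = sub("-", "", code)),  # drop the hyphen
                      method = "lv",                         # Levenshtein distance
                      by = c("ProductId" = "code"))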

  id    date  ProductId quantity code   name  type 
  <chr> <chr>     <int>    <int> <chr>  <chr> <chr>
1 a     xyz     1234567       30 234567 p1    c1   
2 a     abc     5826811       20 82681  p2    c2   
3 b     def     3619100       10 619100 p3    c3
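
To see why these three rows match and the fourth event does not, it can help to look at the Levenshtein distances directly. The following is a minimal sketch using base R's adist() rather than fuzzyjoin itself; the vectors ids and codes are typed in by hand from the tables in the question, and the threshold of 2 is what I understand to be the default max_dist of the stringdist joins, so treat that value as an assumption.

```r
ids   <- c("1234567", "5826811", "3619100", "9268420")  # ProductId values as text
codes <- c("234567", "82681", "619100")                  # codes with the hyphen removed

# adist() from base R (utils) computes the Levenshtein distance for every pair
d <- adist(ids, codes)
dimnames(d) <- list(ids, codes)
print(d)

# Pairs within a distance of 2 (assumed default threshold of the join)
print(which(d <= 2, arr.ind = TRUE))
```

Each matching ProductId differs from its stripped code by only one or two dropped digits, while 9268420 is farther than 2 from every code, which is consistent with it being absent from the join result.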

Or, if you specify a maximum distance, you can skip the dplyr sub() step:

stringdist_inner_join(df1, 
                      df2,
                      method = "lv",
                      max_dist = 3,
                      by = c("ProductId" = "code"))
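
An edit-distance join can also pair keys that are merely similar. If the requirement is strictly that the code, with hyphens removed, occurs inside ProductId (as the question describes), a literal substring test is an alternative. The sketch below uses base R only and is not the method from the answer above; the helper first_match and the variables patterns, hit and keep are hypothetical names introduced for illustration, and the data frames are retyped from the question.

```r
# Sample data copied from the question
Event <- data.frame(
  id        = c("a", "a", "b", "b"),
  date      = c("xyz", "abc", "def", "ghi"),
  ProductId = c(1234567L, 5826811L, 3619100L, 9268420L),
  quantity  = c(30L, 20L, 10L, 50L),
  stringsAsFactors = FALSE
)
ProductDimension <- data.frame(
  code = c("234-567", "826-81", "61-9100"),
  name = c("p1", "p2", "p3"),
  type = c("c1", "c2", "c3"),
  stringsAsFactors = FALSE
)

# Strip every hyphen so each code can be searched for as a literal string
patterns <- gsub("-", "", ProductDimension$code, fixed = TRUE)

# Hypothetical helper: row index of the first product whose code occurs
# inside the given ProductId, or NA when no code is contained in it
first_match <- function(pid) {
  found <- which(vapply(patterns, grepl, logical(1), x = pid, fixed = TRUE))
  if (length(found) > 0) found[[1]] else NA_integer_
}

hit  <- vapply(as.character(Event$ProductId), first_match, integer(1))
keep <- !is.na(hit)

# Keep only events with a match and attach the product columns
result <- cbind(Event[keep, ], ProductDimension[hit[keep], c("name", "type")])
rownames(result) <- NULL
print(result)
```

This keeps only the first matching product per event and compares every event against every code, so it is meant for small tables like the ones shown here.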

【Comments】: