【Question title】: Match a string with the same string from another data frame
【Posted】: 2021-02-02 13:25:19
【Question description】:

I have this data frame (DF1)

structure(list(ID = 1:3, Text = c("there was not clostridium", "clostridium difficile positive", "test was OK")), class = "data.frame", row.names = c(NA, -3L)) 

ID TEXT
1  "there was not clostridium"
2  "clostridium difficile positive"
3  "test was OK"

and this data frame (DF2)

structure(list(ID = 1:3, Microorganisms = c("ESCHERICHIA COLI", "CLOSTRIDIUM DIFFICILE", "FUNGI")), class = "data.frame", row.names = c(NA, -3L))

ID Microorganisms
1  ESCHERICHIA COLI
2  CLOSTRIDIUM DIFFICILE
3  FUNGI

I want to use a regular expression to find the matches between DF1 and DF2 and put them in a new column, like this

ID TEXT                                Microorganism
1  "there was not clostridium"         CLOSTRIDIUM DIFFICILE
2  "clostridium difficile positive"    CLOSTRIDIUM DIFFICILE
3  "test was OK"                       no

I tried something like this

DF1 %>% mutate(Mikroorganism = ifelse(grepl(DF2$Microorganisms, TEXT), str_extract(TEXT, DF2$Microorganisms), "no"))

but it does not work.

【Comments】:

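  • A simple regex will not work for your first row: it contains no "difficile". Are you looking for a match on any word in DF2, rather than on the whole string?
  • Yes, I want to match any word in DF2. Is that possible?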
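
The distinction raised in the comments can be illustrated with a small base R sketch. The variable names (`text`, `organism`, `any_word_ptn`) are made up for illustration. Note also that `grepl()` takes a single pattern, so passing the whole `DF2$Microorganisms` column as the pattern does not test each organism in turn.

```r
text <- "there was not clostridium"
organism <- "CLOSTRIDIUM DIFFICILE"

# Whole name as the pattern: no match, because "difficile" is absent from the text
whole_hit <- grepl(organism, text, ignore.case = TRUE)

# Any-word pattern: whitespace becomes regex alternation, "CLOSTRIDIUM|DIFFICILE"
any_word_ptn <- gsub("\\s+", "|", organism)
word_hit <- grepl(any_word_ptn, text, ignore.case = TRUE)

print(c(whole = whole_hit, any_word = word_hit))
```

The `ignore.case = TRUE` argument matters here as well, since DF2 is upper case and DF1 is lower case.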

Tags: r matching string-matching


【Solution 1】:

One approach is to use the fuzzyjoin package.

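library(dplyr)  # provides %>%, transmute() and select()

DF1 %>%
  fuzzyjoin::regex_left_join(
    # turn each name into an "any word" pattern, e.g. "CLOSTRIDIUM|DIFFICILE"
    transmute(DF2, Microorganisms, ptn = gsub("\\s+", "|", Microorganisms)),
    by = c("Text" = "ptn"), ignore_case = TRUE) %>%
  select(-ptn)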
#   ID                           Text        Microorganisms
# 1  1      there was not clostridium CLOSTRIDIUM DIFFICILE
# 2  2 clostridium difficile positive CLOSTRIDIUM DIFFICILE
# 3  3                    test was OK                  <NA>
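
The join returns `<NA>` for rows without a match, whereas the question asks for "no". As a rough sketch of the same idea without fuzzyjoin, the base R code below builds the per-organism patterns, matches them case-insensitively, and fills in "no" when nothing matches. The helper `first_match` and the column name `Microorganism` are assumptions for illustration, and keeping only the first matching organism is a simplification that is not part of the answer above.

```r
DF1 <- data.frame(ID = 1:3,
                  Text = c("there was not clostridium",
                           "clostridium difficile positive",
                           "test was OK"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(ID = 1:3,
                  Microorganisms = c("ESCHERICHIA COLI",
                                     "CLOSTRIDIUM DIFFICILE",
                                     "FUNGI"),
                  stringsAsFactors = FALSE)

# One alternation pattern per organism, e.g. "CLOSTRIDIUM|DIFFICILE"
ptn <- gsub("\\s+", "|", DF2$Microorganisms)

# Hypothetical helper: return the first organism whose pattern matches, else "no"
first_match <- function(txt) {
  hit <- vapply(ptn, function(p) grepl(p, txt, ignore.case = TRUE), logical(1))
  if (any(hit)) DF2$Microorganisms[which(hit)[1]] else "no"
}

DF1$Microorganism <- vapply(DF1$Text, first_match, character(1),
                            USE.NAMES = FALSE)
print(DF1)
```

Because these are plain substring matches, a short word such as "COLI" would also match inside a longer word like "colitis". Wrapping each alternative in `\\b` word boundaries is one possible way to tighten this if it matters for the data.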

【Comments】:

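  • Will it work when the Text column in DF1 contains more than one string? I mean not just "there was not clostridium", but c("there was not clostridium", "some text", "some text").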