String matching character by character in R

【Question title】: String matching character by character in R
【Posted】: 2019-03-12 05:06:55
【Question】:

I have a data frame with an ID column and a string column. I call it history.

history
ID   string
1.1  a b b b c c s d s ....
1.2  a b b b b c s s d ....
2.1  a c c s s d b d b ....
2.2  a c s c s d b d b ....
3.1  a z z x d b d d f ....
3.2  a z x z d d f b d ....
...

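The string in each row is very long. IDs that share the same leading number (such as 1.1 and 1.2) have similar strings with only minor differences, whereas IDs such as 1.1 and 2.2 differ much more. The original data has about 70 rows.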

test
string
a c c c s s d b d b....

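My goal: given another data frame containing a string that does not exist in "history", I want to find the ID in "history" that it matches best. I know there are many text-matching methods for this. My problem is that I cannot match the entire string in "test" against "history".

The point is to see whether I can work out which ID the string in "test" belongs to without having to match the whole string. One idea I had is to filter "history" down progressively as more of "test" is matched.

My expected output: here I assume that matching starts by comparing the first character of the string in "test" with the first character of each string in "history", and then proceeds one character at a time. Neither assumption is fixed. The strings in "history" and "test" can also differ in length.

The first character in "test", "a", matches every string in "history", so no filtering happens at this step.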

test
string
a

Result:

history
    ID   string
    1.1  a b b b c c s d s ....
    1.2  a b b b b c s s d ....
    2.1  a c c s s d b d b ....
    2.2  a c s c s d b d b ....
    3.1  a z z x d b d d f ....
    3.2  a z x z d d f b d ....
    ...

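The second character is "c". To make sure we are not matching some random "c" elsewhere in "history", I think a rule would help: a row counts as a match only if "a" is followed by "c".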

test
    string
    a c

Result:

history
    ID   string
    2.1  a c c s s d b d b ....
    2.2  a c s c s d b d b ....

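This already narrows the match down to history IDs 2.1 and 2.2. Frankly, as I said earlier, we could even stop here, because the difference between these two is small. In short, once "history" has been filtered down to a single ID, the output should be the ID that best matches the "test" string.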
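The progressive filtering described in the question can be sketched in base R as follows. This is only an illustration under the question's own assumptions (matching starts at the first token and tokens are separated by single spaces); `filter_by_prefix` is a hypothetical helper name, not something from the original post.

```r
# Sample data copied from the question (truncated strings).
history <- data.frame(
  ID = c(1.1, 1.2, 2.1, 2.2, 3.1, 3.2),
  string = c("a b b b c c s d s",
             "a b b b b c s s d",
             "a c c s s d b d b",
             "a c s c s d b d b",
             "a z z x d b d d f",
             "a z x z d d f b d"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: narrow the candidate rows one test token at a time
# and stop as soon as a single ID remains.
filter_by_prefix <- function(history, test) {
  hist_tokens <- strsplit(history$string, " ", fixed = TRUE)
  test_tokens <- strsplit(test, " ", fixed = TRUE)[[1]]
  candidates <- seq_len(nrow(history))
  used <- 0L
  for (i in seq_along(test_tokens)) {
    # Keep rows whose i-th token exists and equals the i-th test token.
    keep <- vapply(
      hist_tokens[candidates],
      function(tok) length(tok) >= i && tok[i] == test_tokens[i],
      logical(1)
    )
    if (!any(keep)) break          # nobody agrees: keep the previous set
    candidates <- candidates[keep]
    used <- i
    if (length(candidates) == 1L) break
  }
  list(ids = history$ID[candidates], tokens_used = used)
}

res_full  <- filter_by_prefix(history, "a c c c s s")
res_short <- filter_by_prefix(history, "a c")
```

With the sample data, `res_short$ids` holds 2.1 and 2.2 (the state shown after "a c"), and `res_full` stops at the third token with 2.1 as the only remaining ID. If no row agrees with the next token, the sketch returns the last non-empty candidate set rather than an empty result; that is a design choice, not a requirement stated in the question.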

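【Comments】:

How large is your data? In my view there is no objective answer for best efficiency, because it depends on the size of the full strings and on how many elements are typically needed to narrow things down. Are your strings really all space-separated sequences of letters?

Tags: r dataframe string-matching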

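【Solution 1】:

Building on the excellent example given by AntoniosK:

You can apply a weighting factor to each position. If position 1 is very important, multiply it by 10,000, and the second position by only 1,000. Then sum the values per row and pick the highest sum to get the best-fitting string.

(In other words, a b c d e f matches a b c x x x better than it matches a b x d e f.)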

library(tidyverse)
df = data.frame(ID = c(1.1,1.2,2.1,2.2,3.1,3.2),
                string = c("a b b b c c s d s",
                           "a b b b b c s s d",
                           "a c c s s d b d b",
                           "a c s c s d b d b",
                           "a z z x d b d d f",
                           "a z x z d d f b d"),
                stringsAsFactors = FALSE)
# string to test
test <-  "a c c c s s"

# one weight per position; earlier positions count more
weights <- c(1000,100,10,10,10,10,10,10,10)

df_answer <- df %>%
  separate_rows(string) %>%                       # one row per token
  group_by(ID) %>%
  # align the i-th test token with the i-th token of each ID (NA past the end of test)
  mutate(test = unlist(strsplit(test, split = " "))[row_number()]) %>%
  mutate(scores = (string == test) * weights) %>% # weight of each matching position
  summarise(scores = sum(scores, na.rm = TRUE)) %>%
  filter(scores == max(scores))                   # keep the best-scoring ID(s)

# A tibble: 2 x 2
#     ID scores
#  <dbl>  <dbl>
#1   2.1   1120
#2   2.2   1120
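As a small worked check of the weighting idea, the base R sketch below scores the two candidates from the parenthetical remark against a b c d e f. `weighted_score` and the `steep` weight vector are hypothetical names introduced here for illustration; `flat_tail` mirrors the first six weights used above. It suggests that the claim only holds when the weights fall off steeply enough.

```r
# Hypothetical helper: position-wise weighted match score over the
# positions that candidate, test and weights all have in common.
weighted_score <- function(candidate, test, weights) {
  cand <- strsplit(candidate, " ", fixed = TRUE)[[1]]
  tst  <- strsplit(test, " ", fixed = TRUE)[[1]]
  idx  <- seq_len(min(length(cand), length(tst), length(weights)))
  sum((cand[idx] == tst[idx]) * weights[idx])
}

test <- "a b c d e f"
flat_tail <- c(1000, 100, 10, 10, 10, 10)  # first six weights from the answer
steep     <- 10^(6:1)                      # assumption: each position outweighs all later ones

flat_prefix  <- weighted_score("a b c x x x", test, flat_tail)  # 1110
flat_gap     <- weighted_score("a b x d e f", test, flat_tail)  # 1130
steep_prefix <- weighted_score("a b c x x x", test, steep)      # 1110000
steep_gap    <- weighted_score("a b x d e f", test, steep)      # 1101110
```

With the flat tail of 10s, a b x d e f scores slightly higher (1130 versus 1110) because its three late matches outweigh the single missing third position. With the steeper weights, a b c x x x wins, which is the behaviour the remark describes.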

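【Comments】:

Interesting idea. I know how to split the strings with separate. But how do I match the test string and include the weighting factor?

【Solution 2】:

Here are two tidyverse solutions that return the ID value(s) with the largest number of matches against your test string, together with that number of matches: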

df = data.frame(ID = c(1.1,1.2,2.1,2.2,3.1,3.2),
                string = c("a b b b c c s d s",
                           "a b b b b c s s d",
                           "a c c s s d b d b",
                           "a c s c s d b d b",
                           "a z z x d b d d f",
                           "a z x z d d f b d"), 
                stringsAsFactors = FALSE)

library(tidyverse)

# string to test
test = "a c c c s s"

Option 1 (counts position-wise matches anywhere in the string)

df %>%
  separate_rows(string) %>%
  group_by(ID) %>%
  mutate(test = unlist(strsplit(test, split = " "))[row_number()]) %>%
  na.omit() %>%
  summarise(matches = sum(string == test)) %>%
  filter(matches == max(matches))

# # A tibble: 2 x 2
#      ID matches
#   <dbl>   <int>
# 1   2.1       4
# 2   2.2       4

Option 2 (counts only consecutive matches from the start)

df %>%
  separate_rows(string) %>%
  group_by(ID) %>%
  mutate(test = unlist(strsplit(test, split = " "))[row_number()]) %>%
  na.omit() %>%
  summarise(matches = sum(cumprod(string == test))) %>%
  filter(matches == max(matches))

# # A tibble: 1 x 2
#        ID matches
#     <dbl>   <dbl>
#   1   2.1       3

【Comments】:

What if the strings have different lengths?
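One possible way to handle strings of different lengths, sketched in base R rather than taken from the answers above, is to compare only the positions that both strings have. `leading_matches` is a hypothetical helper, and the shortened history rows below are made-up test data, not the original data.

```r
# Hypothetical helper: number of consecutive matching tokens from the start,
# compared only over the overlapping prefix of the two strings.
leading_matches <- function(candidate, test) {
  cand <- strsplit(candidate, " ", fixed = TRUE)[[1]]
  tst  <- strsplit(test, " ", fixed = TRUE)[[1]]
  idx  <- seq_len(min(length(cand), length(tst)))
  sum(cumprod(cand[idx] == tst[idx]))
}

# Made-up history rows with deliberately different lengths.
history <- data.frame(
  ID = c(1.1, 2.1, 2.2),
  string = c("a b b", "a c c s s d b d b", "a c"),
  stringsAsFactors = FALSE
)
test <- "a c c c s s"

scores <- vapply(history$string, leading_matches, numeric(1),
                 test = test, USE.NAMES = FALSE)
best <- history$ID[scores == max(scores)]
```

Here `scores` is 1, 3 and 2 for IDs 1.1, 2.1 and 2.2, so `best` is 2.1. Shorter strings can never score above their own length, so dividing by the compared length may be worth considering if short history rows should not be penalised.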