String matching two dataframes with a sliding window: answers

【Question title】: String matching two dataframes with a sliding window
【Posted】: 2019-01-23 01:01:54
【Question】:

I have two data frames.

df1
       col1
    1    a
    2    b
    3    c
    4    c
df2
   setID col1
1      1    a
2      1    b
3      1    b
4      1    a
5      2    w
6      2    v
7      2    c
8      2    b
9      3    a
10     3    a
11     3    b
12     3    a
13     4    a
14     4    b
15     4    c
16     4    a

I am using the following code to match them.

scorematch <- function () 
{
      require("dplyr")
      #to make sure every element is preceded by the one before that element
      combm <- rev(sapply(rev(seq_along(df1$col1)), function(i) paste0(df1$col1[i-1], df1$col1[i])));
      tempdf <- df2
      #group the history by their ID
      tempdf <- group_by(tempdf, setID)
      #collapse strings in history
      tempdf <- summarise(tempdf, ss = paste(col1, collapse = ""))
      tempdf <- rowwise(tempdf)
      #add score based on how it matches compared to path
      tempdf <- mutate(tempdf, score = sum(sapply(combm, function(x) sum(grepl(x, ss)))))
      tempdf <- ungroup(tempdf)
      #filter so that only IDs with scores more than 0 are available
      tempdf <- filter(tempdf, score != 0)
      tempdf <- pull(tempdf, setID)
      #filter original history to reflect new history
      tempdf2 <- filter(df2, setID %in% tempdf)
      tempdf2
    }
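To see what the function scores against, the patterns built on the `combm` line can be inspected on their own. The following is a minimal sketch using the sample df1 from the question; it builds the same patterns with `head`/`tail` instead of the reversed `sapply`, and matches them literally (`fixed = TRUE`), which is an assumption that gives the same result here because the data are single letters.

```r
# Sample data from the question
df1 <- data.frame(col1 = c("a", "b", "c", "c"), stringsAsFactors = FALSE)

# Pair each element with its predecessor; the first element has no
# predecessor, so it stays a single letter
x <- df1$col1
combm <- c(x[1], paste0(head(x, -1), tail(x, -1)))
combm
# [1] "a"  "ab" "bc" "cc"

# Score of one collapsed history string: number of patterns it contains
sum(sapply(combm, grepl, x = "abca", fixed = TRUE))
# [1] 3
```

Because the single letter "a" is itself a pattern, any set whose string contains an "a" gets a non-zero score and passes the `score != 0` filter.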

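This code works well, but I want to take it a step further. I want to apply a sliding window function to select the df1 values that I match against df2. So far, I am using this function as my sliding window.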

slidingwindow <- function(data, window, step)
{
  #data is dataframe with colname
  total <- length(data)
  #spots are start of each window
  spots <- seq(from=1, to=(total-step), by=step)
  result <- vector(length = length(spots))
  for(i in 1:length(spots)){
    ...
  }
  return(result)
}

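The scorematch function would be nested inside the slidingwindow function. I am not sure how to proceed from there. Ideally, df1 would be split into windows. Starting with the first window, it would be matched against df2 using the scorematch function to get a filtered df2. Then I want the second window of df1 to be matched against the newly filtered df2, and so on. The loop should end once df2 has been filtered down so that it contains only 1 distinct setID value. The final output can be either the whole filtered df2 or just the remaining setID. The ideal output is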

  setID col1
1     4    a
2     4    b
3     4    c
4     4    a

or

[1] "4"

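【Question comments】:

Can you explain why the last a is matched as well? I would not expect it to match, since the next letter in the sequence in df1 is c. So the order of the matches does not matter? Also, if I run only the scorematch function, I also get a match for setID 1 with a b b a, which I would expect to be the second part. Could you elaborate further on the purpose of the matching? I think there may be a shorter way with paste and a regular expression...
tempdf <- pull(tempdf, setID) ::: Error in eval_tidy(enquo(var), var_env) : object 'setID' not found
@kath The matching function is score-based. I have set it up so that any score other than 0 is returned. It is not looking for a perfect match, only the closest match. The purpose is to find out which set in df2 is closest to df1. Before you give me a simpler way to match the whole of df1 against df2: that is not what I need. I need df1 to be matched against df2 window by window. And yes, order matters.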

Tags: r dataframe dplyr sliding-window

【Solution 1】:

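Here is a solution that does not use a for loop. I use stringr for its nice, consistent syntax, purrr for map (although lapply would be sufficient in this case), and dplyr to group_by setID and collapse the strings of each group.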

library(dplyr)
library(purrr)
library(stringr)

First I collapse the strings of each group. This makes the pattern matching with str_detect later on easier:

df2_collapse <- df2 %>% 
  group_by(setID) %>% 
  summarise(string = str_c(col1, collapse = ""))

df2_collapse
# A tibble: 4 x 2
#   setID string
#   <int> <chr> 
# 1     1 abba  
# 2     2 wvcb  
# 3     3 aaba  
# 4     4 abca

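The "lookup" string is collapsed as well, and then the substrings (i.e. the sliding windows) are extracted with str_sub. Here I work along the length of the string (str_length) and, for each letter, extract every substring that starts at that letter.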

string <- str_c(df1$col1, collapse = "")

string
# [1] "abcc"

substrings <- 
  unlist(map(1:str_length(string), ~ str_sub(string, start = .x, end = .x:str_length(string))))

The substrings are stored in a tibble, with their length as the score.

substrings
# [1] "a"    "ab"   "abc"  "abcc" "b"    "bc"   "bcc"  "c"    "cc"   "c" 

substrings <- tibble(substring = substrings, 
                     score = str_length(substrings))

substrings
# A tibble: 10 x 2
#    substring score
#    <chr>     <int>
#  1 a             1
#  2 ab            2
#  3 abc           3
#  4 abcc          4
#  5 b             1
#  6 bc            2
#  7 bcc           3
#  8 c             1
#  9 cc            2
# 10 c             1

For each setID, extract the highest score among the substrings it contains, then keep only the rows with the highest score across all setIDs.

df2_collapse %>%
  mutate(score = map_dbl(string, 
                          ~ max(substrings$score[str_detect(.x, substrings$substring)]))) %>% 
  filter(score == max(score))

# A tibble: 1 x 3
#     setID string score
#     <int> <chr>  <dbl>
#   1     4 abca       3
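Two details of this step may matter on real data: str_detect interprets the substrings as regular expressions by default, and max() of an empty vector returns -Inf with a warning when a set matches no substring at all. The following base R sketch is a variant, not the answer's method: it matches literally with `fixed = TRUE` and returns 0 when nothing matches. The helper name `best_score` and the hard-coded strings are assumptions for illustration.

```r
# Collapsed strings per setID, as in df2_collapse above
strings <- c("1" = "abba", "2" = "wvcb", "3" = "aaba", "4" = "abca")
lookup <- "abcc"

# All contiguous substrings of the lookup string
n <- nchar(lookup)
starts <- rep(seq_len(n), times = n - seq_len(n) + 1)
ends <- unlist(lapply(seq_len(n), function(i) i:n))
subs <- unique(substring(lookup, starts, ends))

# Hypothetical helper: length of the longest substring found literally
# in s, or 0 if none is found
best_score <- function(s) {
  hit <- vapply(subs, grepl, logical(1), x = s, fixed = TRUE)
  if (any(hit)) max(nchar(subs[hit])) else 0L
}

scores <- vapply(strings, best_score, integer(1))
scores
names(scores)[scores == max(scores)]
# [1] "4"
```

The number of substrings grows quadratically with the length of the lookup string, so for a long df1 it may be worth capping the substring length.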

Data

df1 <- structure(list(col1 = c("a", "b", "c", "c")), 
                 class = "data.frame", row.names = c("1", "2", "3", "4"))
df2 <- 
  structure(list(setID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), 
                 col1 = c("a", "b", "b", "a", "w", "v", "c", "b", "a", "a", "b", "a", "a", "b", "c", "a")), 
            class = "data.frame", 
            row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16"))

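【Comments】:

Hi. Can you tell me whether this matches all elements of df1 against df2 at once? The reason I want a sliding window approach is that I want to find the match without having to use all of df1. This sample data is probably not long enough to illustrate that, but in the real data I have more than 1000 rows in df1 and more than 10 million rows in df2. So I am looking for a solution that matches df1 to a setID of df2 before going through all rows of df1. I hope that makes sense.
Yes, this matches all of df1 (all subsets, in fact) against df2, which may take a long time, but maybe you could try it on a subset of your data and see how long it takes.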
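The window-by-window filtering asked for in the question, stopping as soon as a single setID is left, could be sketched roughly as follows. This is not part of the answer above. The function name `match_by_window`, the default window size and step, the literal substring match (instead of the scorematch score), and the rule that a window matching nothing is skipped are all assumptions made for illustration.

```r
# Sample data from the question
df1 <- data.frame(col1 = c("a", "b", "c", "c"), stringsAsFactors = FALSE)
df2 <- data.frame(
  setID = rep(1:4, each = 4),
  col1 = c("a", "b", "b", "a", "w", "v", "c", "b",
           "a", "a", "b", "a", "a", "b", "c", "a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: match df1 window by window, keeping only the
# setIDs whose collapsed string contains the current window
match_by_window <- function(df1, df2, window = 2, step = 1) {
  # One collapsed string per setID, computed once
  strings <- tapply(df2$col1, df2$setID, paste, collapse = "")
  candidates <- names(strings)
  starts <- seq(1, max(1, nrow(df1) - window + 1), by = step)
  for (s in starts) {
    # Stop as soon as a single setID is left
    if (length(candidates) <= 1) break
    e <- min(s + window - 1, nrow(df1))
    pattern <- paste(df1$col1[s:e], collapse = "")
    hit <- grepl(pattern, strings[candidates], fixed = TRUE)
    # Assumption: a window that matches nothing is skipped, not fatal
    if (any(hit)) candidates <- candidates[hit]
  }
  candidates
}

remaining <- match_by_window(df1, df2)
remaining
# [1] "4"
df2[df2$setID %in% remaining, ]
```

On the sample data the first window ("ab") keeps setIDs 1, 3 and 4, and the second window ("bc") leaves only setID 4, so the third window is never evaluated. Whether this early stop saves time on 10 million rows depends on how quickly the candidates narrow down.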