Lookup value from multiple columns that is the closest match to another column: answers

【Question title】: Lookup value from multiple columns that is the closest match to another column
【Posted】: 2018-08-16 20:21:02
【Question】:

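Using data on when a website was accessed, I am trying to find the closest match for the value of lastseen among a series of other columns (d1, d2, d3, d4, d5), in order to create a new column nextvisit. Its value is taken from d1, d2, d3, d4 or d5 and is the next-largest value after the one in lastseen (i.e., the individual's next visit after they were last seen).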

A reproducible example:

indiv lastseen d1  d2  d3  d4   d5
A     2         2   4   5   8   10
B     5         2   3   5   7    9
C     9         1   6   9  11   15

So the answer I'm looking for is:

indiv lastseen d1  d2  d3  d4   d5  nextvisit
A     2         2   4   5   8   10          4
B     5         2   3   5   7    9          7
C     9         1   6   9  11   15         11

For example, for individual A, 4 is the next-largest number above 2 in columns d1:d5.

I have tried using tidyr and dplyr, but I can't find an efficient way to get the next-largest match.

Thanks

【Question comments】:

Tags: r

【Solution 1】:

Assume df is your data.frame. Here is a pure base R approach:

> ind <- (df[, -c(1,2)]- df[, 2])>0
> df$nextvisit <- apply(df[, -c(1,2)]*ind, 1, function(x) min(x[x!=0]))
> df
  indiv lastseen d1 d2 d3 d4 d5 nextvisit
1     A        2  2  4  5  8 10         4
2     B        5  2  3  5  7  9         7
3     C        9  1  6  9 11 15        11

【Discussion】:

【Solution 2】:

Another base R option:

idx <- which(DF[-c(1, 2)] == DF$lastseen, arr.ind = TRUE)
idx[, "col"] <- idx[, "col"] + 1 # lastseen + 1 = next visit (in terms of column positions)
DF$nextvisit <- DF[-c(1, 2)][idx]
DF
#  indiv lastseen d1 d2 d3 d4 d5 nextvisit
#1     A        2  2  4  5  8 10         4
#2     B        5  2  3  5  7  9         7
#3     C        9  1  6  9 11 15        11

Data

DF <- structure(list(indiv = c("A", "B", "C"), lastseen = c(2L, 5L, 
9L), d1 = c(2L, 2L, 1L), d2 = c(4L, 3L, 6L), d3 = c(5L, 5L, 9L
), d4 = c(8L, 7L, 11L), d5 = c(10L, 9L, 15L)), .Names = c("indiv", 
"lastseen", "d1", "d2", "d3", "d4", "d5"), class = "data.frame", row.names = c(NA, 
-3L))
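
The column-shift trick above relies on two things that hold for the sample data: lastseen must appear exactly once among d1:d5 in every row, and the matches returned by which() must come back in row order. which() walks a matrix column by column, so the second condition can fail on other data. A minimal sketch of a row-order-safe variant, using a small made-up data frame (the data and the extra ordering step are illustrative assumptions, not part of the original answer):

```r
# Hypothetical data: row A matches in d3, row B matches in d1,
# so which() reports row B before row A.
DF <- data.frame(
  indiv = c("A", "B"),
  lastseen = c(5L, 2L),
  d1 = c(2L, 2L), d2 = c(4L, 3L), d3 = c(5L, 5L),
  d4 = c(8L, 7L), d5 = c(10L, 9L)
)

visits <- as.matrix(DF[-c(1, 2)])
idx <- which(visits == DF$lastseen, arr.ind = TRUE)
idx <- idx[order(idx[, "row"]), , drop = FALSE]  # restore row order
idx[, "col"] <- idx[, "col"] + 1                 # step to the next column
DF$nextvisit <- visits[idx]
DF$nextvisit
```

Without the order() step, the two values would be assigned to the wrong rows in this example. The sketch still assumes the match is never in the last column.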

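【Discussion】:

【Solution 3】:

Using tidyverse, we gather the 'd1' to 'd5' columns into 'long' format, group by 'indiv', create a column holding the difference between 'val' and 'lastseen', slice the row with the smallest positive difference, select the columns of interest, and join with the original dataset: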

library(tidyverse)
df1 %>% 
   gather(key, val, d1:d5) %>%
   group_by(indiv) %>% 
   mutate(Diff = val -lastseen, 
          Diff = replace(Diff, Diff <=0, NA)) %>% 
   slice(which.min(Diff)) %>% 
   select(indiv, val) %>% 
   right_join(df1) %>%
   select(names(df1), everything())
# A tibble: 3 x 8
# Groups:   indiv [3]
#  indiv lastseen    d1    d2    d3    d4    d5   val
#  <chr>    <int> <int> <int> <int> <int> <int> <int>
#1 A            2     2     4     5     8    10     4
#2 B            5     2     3     5     7     9     7
#3 C            9     1     6     9    11    15    11

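Another option is to use max.col from base R. Store the differences between the 'd' columns and 'lastseen' in an object ('m1'), replace values less than or equal to 0 with a very large number, and use max.col to get, for each row, the index of the column with the maximum value (the logic is reversed by negating 'm1', so this picks the smallest positive difference). Then cbind that with the row index and extract the corresponding values from the 'd' columns.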

m1 <- df1[3:7] -df1$lastseen
m1[m1 <=0] <- 999
df1$val <- df1[3:7][cbind(seq_len(nrow(df1)), max.col(-m1, 'first'))]
df1$val
#[1]  4  7 11

Data

df1 <- structure(list(indiv = c("A", "B", "C"), lastseen = c(2L, 5L, 
9L), d1 = c(2L, 2L, 1L), d2 = c(4L, 3L, 6L), d3 = c(5L, 5L, 9L
), d4 = c(8L, 7L, 11L), d5 = c(10L, 9L, 15L)), 
class = "data.frame", row.names = c(NA, -3L))

【Discussion】:

【Solution 4】:

df = read.table(text = "
indiv lastseen d1  d2  d3  d4   d5
                A     2         2   4   5   8   10
                B     5         2   3   5   7    9
                C     9         1   6   9  11   15
                ", header=T)

library(tidyverse)

df %>%
  group_by(indiv, lastseen) %>%  # for each combination
  nest() %>%                     # nest data
  mutate(nextvisit = map2(lastseen, data, ~{vec = unlist(.y); min(vec[vec > .x])})) %>%  # get the minimum value higher than the corresponding lastseen value
  unnest()                       # unnest data

# # A tibble: 3 x 8
#   indiv lastseen nextvisit    d1    d2    d3    d4    d5
#   <fct>    <int>     <int> <int> <int> <int> <int> <int>
# 1 A            2         4     2     4     5     8    10
# 2 B            5         7     2     3     5     7     9
# 3 C            9        11     1     6     9    11    15

【Discussion】:

Nice use of map2.
Thanks. I was looking for a "nicer" function to supply, but it seems I have to unlist.

【Solution 5】:

Another base R way:

df$nextvisit <- apply(df[,-1], 1, function(x) min(tail(x,5)[tail(x,5)>head(x,1)]))

Or, less readable but shorter:

df$nextvisit <- apply(df[,-1], 1, function(x) min(x[-1][x[-1]>x[1]]))
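
Both one-liners call min() on the visits later than lastseen; if a row has no later visit, min() of an empty vector returns Inf with a warning. A minimal sketch of a guarded version follows. The helper name next_after and the third row of the data are hypothetical, added only to show the empty case:

```r
# Hypothetical helper: smallest visit strictly after `last`, or NA if none.
next_after <- function(last, visits) {
  later <- visits[visits > last]
  if (length(later) == 0) NA_integer_ else min(later)
}

df <- data.frame(
  indiv = c("A", "B", "C"),
  lastseen = c(2L, 5L, 20L),   # row C has no visit after 20
  d1 = c(2L, 2L, 1L), d2 = c(4L, 3L, 6L), d3 = c(5L, 5L, 9L),
  d4 = c(8L, 7L, 11L), d5 = c(10L, 9L, 15L)
)

# x[1] is lastseen, x[-1] holds d1..d5 for the current row.
df$nextvisit <- apply(df[, -1], 1, function(x) next_after(x[1], x[-1]))
df$nextvisit
```

Returning NA for rows with no later visit is a design choice for this sketch; keeping Inf would also work if downstream code expects a numeric sentinel.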

【Discussion】:

【Solution 6】:

Here is a solution that combines data.table with the Find() function:

library(data.table)
setDT(df1)[, nextvisit := Find(function(x) x > lastseen, .SD), .SDcols = d1:d5, by = indiv]
df1[]

   indiv lastseen d1 d2 d3 d4 d5 nextvisit
1:     A        2  2  4  5  8 10         4
2:     B        5  2  3  5  7  9         7
3:     C        9  1  6  9 11 15        11

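Find() applies the filter function function(x) x > lastseen to the .SD columns from left to right within each row and returns the first element that satisfies the condition. The .SD columns are specified by .SDcols = d1:d5.

Note that the values in each row must already be sorted in ascending order from left to right. If unsure, .SD can be replaced by sort(.SD).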
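
To illustrate the sorting caveat, here is a small sketch with deliberately unsorted rows. It sorts the flattened row with sort(unlist(.SD)) rather than sort(.SD); that substitution, the table dt, and its values are assumptions made for this example. It further assumes each indiv occurs once and every row has at least one later visit (Find() returns NULL otherwise):

```r
library(data.table)

# Hypothetical data with visit columns NOT in ascending order.
dt <- data.table(
  indiv = c("A", "B"),
  lastseen = c(2L, 5L),
  d1 = c(10L, 9L), d2 = c(4L, 3L), d3 = c(2L, 7L)
)

# Sort each row's visits first, then take the first one after lastseen.
dt[, nextvisit := Find(function(x) x > lastseen, sort(unlist(.SD))),
   .SDcols = c("d1", "d2", "d3"), by = indiv]
dt$nextvisit
```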

【Discussion】:

【Solution 7】:

A tidyverse solution, fully vectorized, with no conversion to a matrix:

library(tidyverse)
df1 %>% 
  transmute_at(vars(starts_with("d")), ~ ifelse(.x>.y, .x, Inf), .$lastseen) %>%
  invoke(pmin,.) %>%
  bind_cols(df1,nextvisit=.)

#      indiv lastseen d1 d2 d3 d4 d5 nextvisit
#    1     A        2  2  4  5  8 10         4
#    2     B        5  2  3  5  7  9         7
#    3     C        9  1  6  9 11 15        11
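
For comparison, the same masking idea can be sketched in base R: replace every visit that is not later than lastseen with Inf, then take the row-wise minimum with pmin(). This is an illustrative equivalent rather than the answerer's code; the column names are taken from the question's data:

```r
df1 <- data.frame(
  indiv = c("A", "B", "C"),
  lastseen = c(2L, 5L, 9L),
  d1 = c(2L, 2L, 1L), d2 = c(4L, 3L, 6L), d3 = c(5L, 5L, 9L),
  d4 = c(8L, 7L, 11L), d5 = c(10L, 9L, 15L)
)

# Mask visits that are not after lastseen, one column at a time.
masked <- lapply(df1[paste0("d", 1:5)],
                 function(col) ifelse(col > df1$lastseen, col, Inf))

# pmin() takes the element-wise (here: row-wise) minimum across columns.
df1$nextvisit <- do.call(pmin, masked)
df1$nextvisit
```

A row with no later visit would come out as Inf under this approach.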

【Discussion】: