【Question title】: R - Assign value to vector based on first episode
【Posted】: 2018-06-15 15:03:41
【Question】:

I have a sequence dataset that looks like this:

  id epnum clockst
1  1     1       0
2  1     2       1
3  1     3       2
4  2     1       4
5  2     2       5
6  2     3       6
7  3     1       4
8  3     2       5
9  3     3       6

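What I want is a new vector that holds, for every row, the clockst value from the row where epnum == 1 within the same id.

So, basically, I want this: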

  id epnum clockst ep_start
1  1     1       0        0
2  1     2       1        0
3  1     3       2        0
4  2     1       4        4
5  2     2       5        4
6  2     3       6        4
7  3     1       4        4
8  3     2       5        4
9  3     3       6        4

However, I am having a hard time doing this.

I came up with the following, but it doesn't quite work.

dt$ep_start = ifelse(dt$epnum == 1 & dt$clockst == 0, 0, 
    ifelse(dt$epnum == 1 & dt$clockst == 4, 4, -9))

Any ideas?

Data

dt = structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L), .Label = c("1", "2", "3"), class = "factor"), epnum = structure(c(1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2", "3"), class = "factor"), 
clockst = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 4L, 5L, 6L), .Label = c("0", 
"1", "2", "4", "5", "6"), class = "factor")), .Names = c("id", 
"epnum", "clockst"), row.names = c(NA, -9L), class = "data.frame")

【Comments】:

It is not entirely clear what you are asking. I don't understand what ep_start is.
That said, it doesn't deserve a downvote.

Tags: r vector sequence

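【Solution 1】:

Here is a solution using the tidyverse:

First check the condition epnum == 1: where it is TRUE, use the clockst value, otherwise NA. Then fill each NA with the preceding non-missing value. This relies on the epnum == 1 row coming first within each id.

Since clockst is a factor, it has to be converted to numeric while keeping the same values, hence the as.numeric(as.character(...)) call.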

library(tidyverse)
dt %>%
  mutate(ep_start = ifelse(epnum == 1, as.numeric(as.character(clockst)), NA)) %>%
  fill(ep_start, .direction = "down")
#output:
  id epnum clockst ep_start
1  1     1       0        0
2  1     2       1        0
3  1     3       2        0
4  2     1       4        4
5  2     2       5        4
6  2     3       6        4
7  3     1       4        4
8  3     2       5        4
9  3     3       6        4
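The factor conversion mentioned above can be checked in isolation. This is a minimal base R sketch; the standalone vector `clockst` is rebuilt here by hand to mirror the question's column, and the names `codes` and `values` are illustrative only.

```r
# clockst as in the question's data: a factor with levels "0","1","2","4","5","6"
clockst <- factor(c("0", "1", "2", "4", "5", "6", "4", "5", "6"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(clockst)
# codes is 1 2 3 4 5 6 4 5 6

# converting to character first recovers the printed values
values <- as.numeric(as.character(clockst))
# values is 0 1 2 4 5 6 4 5 6
```

Without the intermediate as.character(), ep_start for id 2 would come out as 4 only by coincidence of the level position, and other values would be wrong.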

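Here is a quick comparison of the available answers. I chose to use a 90k-row dataset:

library(tidyverse)
library(data.table)
library(microbenchmark)

# Replicate the 9-row data frame from the question 10,000 times (90,000 rows)
df <- dt[rep(seq_len(nrow(dt)), times = 10000), ]
dt <- data.table(df)
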
bench <- microbenchmark(SunBee = dt[, ep_start := .SD[1]$clockst, by = "id"],
                        missuse = df %>%
                          mutate(ep_start = ifelse(epnum == 1, as.numeric(as.character(clockst)), NA)) %>%
                          fill(ep_start, .direction = "down"),
                        d.b. = df$clockst[rep(which(df$epnum == 1), rle(cumsum(df$epnum == 1))$lengths)],
                        www = df %>%
                          arrange(id, epnum) %>%
                          group_by(id) %>%
                          mutate(ep_start = first(clockst)) %>%
                          ungroup())

plot(bench)

Using a 900k-row dataset:

Wow, I really need to learn DT.

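【Comments】:

@giacomo Added a benchmark of the available solutions.
Thanks, great! I am working with large databases now, so data.table is essential! dplyr is too slow, even though its syntax is pretty!

【Solution 2】:

Another tidyverse solution. If you are sure the rows are already in the correct order, you don't need arrange.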

library(dplyr)

dt2 <- dt %>%
  arrange(id, epnum) %>%
  group_by(id) %>%
  mutate(ep_start = first(clockst)) %>%
  ungroup()
dt2
# # A tibble: 9 x 4
#   id     epnum  clockst ep_start
#   <fctr> <fctr> <fctr>  <fctr>  
# 1 1      1      0       0       
# 2 1      2      1       0       
# 3 1      3      2       0       
# 4 2      1      4       4       
# 5 2      2      5       4       
# 6 2      3      6       4       
# 7 3      1      4       4       
# 8 3      2      5       4       
# 9 3      3      6       4

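【Comments】:

Interesting, thanks. I didn't know the function first.
Another assumption of my answer is that every id group has a 1 in epnum. If that is not the case, the other answers would be better.
Yes, that is actually the case, but thanks for pointing it out. One issue is that I am working with a huge dataset, so data.table is more efficient here. Thanks.

【Solution 3】:

You can do this with library(data.table) as follows: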

library(data.table)
dt <- data.table(dt)
dt[, ep_start := .SD[1]$clockst, by = "id"]

This gives:

   id epnum clockst ep_start
1:  1     1       0        0
2:  1     2       1        0
3:  1     3       2        0
4:  2     1       4        4
5:  2     2       5        4
6:  2     3       6        4
7:  3     1       4        4
8:  3     2       5        4
9:  3     3       6        4
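The answer above takes the first row of each id, so it depends on row order. A possible variant, not taken from the original answer, selects the row by epnum == 1 instead. It is a sketch under the assumption that the columns are plain numerics; `dt_example` is a hypothetical name used so the question's `dt` is left alone.

```r
library(data.table)

# Rebuild the question's data with numeric columns (an assumption made here
# for brevity; the original columns are factors)
dt_example <- data.table(
  id      = rep(1:3, each = 3),
  epnum   = rep(1:3, times = 3),
  clockst = c(0, 1, 2, 4, 5, 6, 4, 5, 6)
)

# Within each id, take clockst from the row where epnum == 1;
# [1L] yields NA when a group has no such row
dt_example[, ep_start := clockst[epnum == 1][1L], by = id]
```

Because the column is subset directly rather than through .SD, this form also avoids building a per-group subset table.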

【Comments】:

【Solution 4】:

dt$ep_start = dt$clockst[rep(which(dt$epnum == 1), rle(cumsum(dt$epnum == 1))$lengths)]
dt
#  id epnum clockst ep_start
#1  1     1       0        0
#2  1     2       1        0
#3  1     3       2        0
#4  2     1       4        4
#5  2     2       5        4
#6  2     3       6        4
#7  3     1       4        4
#8  3     2       5        4
#9  3     3       6        4
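The one-liner above comes without an explanation, so here is a step-by-step breakdown as a sketch. It assumes numeric vectors, rows sorted so that each sequence starts with epnum == 1, and a first row that is an episode 1; the intermediate names `starts`, `run_id` and `run_len` are illustrative.

```r
epnum   <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
clockst <- c(0, 1, 2, 4, 5, 6, 4, 5, 6)

starts  <- which(epnum == 1)     # row index of each episode 1:  1 4 7
run_id  <- cumsum(epnum == 1)    # running sequence counter:     1 1 1 2 2 2 3 3 3
run_len <- rle(run_id)$lengths   # number of rows per sequence:  3 3 3

# repeat each start index once per row of its sequence, then look up clockst
ep_start <- clockst[rep(starts, run_len)]
# ep_start is 0 0 0 4 4 4 4 4 4
```

Under the same assumptions, `starts[run_id]` produces the same index vector as `rep(starts, run_len)`, which makes the rle() step optional.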

【Comments】:

【Solution 5】:

Using match:

clock = dt[dt$epnum == 1, ]
dt$ep_start = clock$clockst[match(dt$id, clock$id)]
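One property of the match approach is that an id with no epnum == 1 row gets NA rather than a wrong value. The sketch below illustrates this on hypothetical data (`toy` is not part of the question) where id 3 lacks a first episode.

```r
# Hypothetical data: id 3 has no row with epnum == 1
toy <- data.frame(
  id      = c(1, 1, 2, 2, 3, 3),
  epnum   = c(1, 2, 1, 2, 2, 3),
  clockst = c(0, 1, 4, 5, 5, 6)
)

# keep one row per id that has an episode 1, then look each id up in it
clock <- toy[toy$epnum == 1, ]
toy$ep_start <- clock$clockst[match(toy$id, clock$id)]
# toy$ep_start is 0 0 4 4 NA NA
```

It also does not depend on the row order of the data.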

【Comments】: