[Posted]: 2022-01-23 19:12:46
[Question]:
I have this dataset and want to perform some calculations on it based on certain conditions:
library(tidyverse)
library(lubridate)
filas <- structure(list(Año = c(rep(2020, 4), rep(2021, 4), 2022),
                        Mes = c(2:5, 3:4, 9, 11, 1),
                        Id = c(rep(1, 7), 2, 2)),
                   row.names = c(NA, -9L),
                   class = c("tbl_df", "tbl", "data.frame")) %>%
  mutate(fecha = make_date(Año, Mes, 1),
         meses_imp = make_date(2999, 1, 1))
| Año | Mes | Id | fecha | meses_imp |
|---|---|---|---|---|
| 2020 | 2 | 1 | 2020-02-01 | 2999-01-01 |
| 2020 | 3 | 1 | 2020-03-01 | 2999-01-01 |
| 2020 | 4 | 1 | 2020-04-01 | 2999-01-01 |
| 2020 | 5 | 1 | 2020-05-01 | 2999-01-01 |
| 2021 | 3 | 1 | 2021-03-01 | 2999-01-01 |
| 2021 | 4 | 1 | 2021-04-01 | 2999-01-01 |
| 2021 | 9 | 1 | 2021-09-01 | 2999-01-01 |
| 2021 | 11 | 2 | 2021-11-01 | 2999-01-01 |
| 2022 | 1 | 2 | 2022-01-01 | 2999-01-01 |
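For each `Id`, whenever there is a "hole" (one or more missing months) between two consecutive dates, I need to add a row for every missing month and then count the added rows. I have achieved this with a `while` loop: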
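i <- 2
# Walk the rows; indexing past the last row yields NA, which ends the loop
while (!is.na(filas[i, ]$Id)) {
  # A gap of more than 31 days within the same Id means a month is missing
  if (as.double(difftime(filas[i, ]$fecha, filas[i - 1, ]$fecha, units = "days")) > 31 &&
      filas[i, ]$Id == filas[i - 1, ]$Id) {
    # Insert the next missing month, tagged with the last observed date
    filas <- add_row(filas,
                     Id = filas[i, ]$Id,
                     fecha = filas[i - 1, ]$fecha + months(1),
                     meses_imp = pmin(filas[i - 1, ]$fecha,
                                      filas[i - 1, ]$meses_imp),
                     .after = i - 1)
  }
  i <- i + 1
}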
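# Count the added rows per Id and gap (the 2999-01-01 placeholder marks original rows)
filas2 <- filas %>%
  group_by(Id, meses_imp) %>%
  summarise(cant_meses_imp = n()) %>%
  ungroup() %>%
  filter(meses_imp != as.Date("2999-01-01"))
# Attach the counts back to the expanded data
filas <- left_join(filas,
                   filas2,
                   by = c("Id", "meses_imp"))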
| Año | Mes | Id | fecha | meses_imp | cant_meses_imp |
|---|---|---|---|---|---|
| 2020 | 2 | 1 | 2020-02-01 | 2999-01-01 | NA |
| 2020 | 3 | 1 | 2020-03-01 | 2999-01-01 | NA |
| 2020 | 4 | 1 | 2020-04-01 | 2999-01-01 | NA |
| 2020 | 5 | 1 | 2020-05-01 | 2999-01-01 | NA |
| NA | NA | 1 | 2020-06-01 | 2020-05-01 | 9 |
| NA | NA | 1 | 2020-07-01 | 2020-05-01 | 9 |
| NA | NA | 1 | 2020-08-01 | 2020-05-01 | 9 |
| NA | NA | 1 | 2020-09-01 | 2020-05-01 | 9 |
| NA | NA | 1 | 2020-10-01 | 2020-05-01 | 9 |
| NA | NA | 1 | 2020-11-01 | 2020-05-01 | 9 |
| NA | NA | 1 | 2020-12-01 | 2020-05-01 | 9 |
| NA | NA | 1 | 2021-01-01 | 2020-05-01 | 9 |
| NA | NA | 1 | 2021-02-01 | 2020-05-01 | 9 |
| 2021 | 3 | 1 | 2021-03-01 | 2999-01-01 | NA |
| 2021 | 4 | 1 | 2021-04-01 | 2999-01-01 | NA |
| NA | NA | 1 | 2021-05-01 | 2021-04-01 | 4 |
| NA | NA | 1 | 2021-06-01 | 2021-04-01 | 4 |
| NA | NA | 1 | 2021-07-01 | 2021-04-01 | 4 |
| NA | NA | 1 | 2021-08-01 | 2021-04-01 | 4 |
| 2021 | 9 | 1 | 2021-09-01 | 2999-01-01 | NA |
| 2021 | 11 | 2 | 2021-11-01 | 2999-01-01 | NA |
| NA | NA | 2 | 2021-12-01 | 2021-11-01 | 1 |
| 2022 | 1 | 2 | 2022-01-01 | 2999-01-01 | NA |
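Since I want to apply this to a much larger dataset (about 300k rows), how can I rewrite it in a vectorized way so that it is more efficient (and perhaps more elegant)?
Thanks!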
[Comments]:
- You could use an SQL-style join to create the extra rows, which avoids the complicated logic and the loop.
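The join idea from the comment could look roughly like the sketch below. It is not the asker's method and not a confirmed answer, only one possible reading of the suggestion: build a calendar of every month between each `Id`'s first and last date, left-join the original rows onto it, and derive `meses_imp` and `cant_meses_imp` with grouped operations. The names `rangos`, `calendario`, `imputada`, `ultima_obs`, and `resultado` are hypothetical, and the sketch assumes every `fecha` falls on the first day of a month and that `(Id, fecha)` pairs are unique.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Same data as in the question, without the meses_imp placeholder column
filas <- tibble(
  Año = c(rep(2020, 4), rep(2021, 4), 2022),
  Mes = c(2:5, 3:4, 9, 11, 1),
  Id  = c(rep(1, 7), 2, 2)
) %>%
  mutate(fecha = make_date(Año, Mes, 1))

# First and last observed month per Id (hypothetical helper table)
rangos <- filas %>%
  group_by(Id) %>%
  summarise(ini = min(fecha), fin = max(fecha), .groups = "drop")

# Cross join of Ids and months, restricted to each Id's own range
calendario <- crossing(
  Id    = rangos$Id,
  fecha = seq(min(filas$fecha), max(filas$fecha), by = "month")
) %>%
  inner_join(rangos, by = "Id") %>%
  filter(fecha >= ini, fecha <= fin) %>%
  select(Id, fecha)

resultado <- calendario %>%
  # Months with no match in filas come back with NA in Año and Mes
  left_join(filas, by = c("Id", "fecha")) %>%
  arrange(Id, fecha) %>%
  mutate(imputada   = is.na(Año),
         ultima_obs = if_else(imputada, as.Date(NA), fecha)) %>%
  group_by(Id) %>%
  # Carry the last observed date forward into the added rows
  fill(ultima_obs, .direction = "down") %>%
  ungroup() %>%
  mutate(meses_imp = if_else(imputada, ultima_obs, make_date(2999, 1, 1))) %>%
  group_by(Id, meses_imp) %>%
  # Size of each gap, reported only on the added rows
  mutate(cant_meses_imp = if_else(imputada, n(), NA_integer_)) %>%
  ungroup() %>%
  select(Año, Mes, Id, fecha, meses_imp, cant_meses_imp)
```

One design caveat: the cross join materialises every `Id` against every month in the overall date range before filtering, so with many `Id`s spread over a long period the intermediate table can become large. Generating the month sequence per group (for example with `tidyr::complete()` on grouped data) is an alternative worth comparing on the real dataset.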
Tags: r loops vectorization