【Question title】:Turn loop-based code into vectorised code in R?
【Posted】:2022-01-23 19:12:46
【Question】:

I have this dataset and want to perform some calculations based on certain conditions:

library(tidyverse)
library(lubridate)

filas <- structure(list(Año = c(rep(2020,4),rep(2021,4),2022), 
                        Mes = c(2:5,3:4,9,11,1), 
                        Id = c(rep(1,7),2,2)), 
                   row.names = c(NA, -9L),
                   class = c("tbl_df", "tbl", "data.frame")) %>% 
   mutate(fecha = make_date(Año,Mes,1),
          meses_imp = make_date(2999,1,1))

Año   Mes  Id  fecha       meses_imp
2020    2   1  2020-02-01  2999-01-01
2020    3   1  2020-03-01  2999-01-01
2020    4   1  2020-04-01  2999-01-01
2020    5   1  2020-05-01  2999-01-01
2021    3   1  2021-03-01  2999-01-01
2021    4   1  2021-04-01  2999-01-01
2021    9   1  2021-09-01  2999-01-01
2021   11   2  2021-11-01  2999-01-01
2022    1   2  2022-01-01  2999-01-01

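For each "Id", I need to add rows wherever there is a "hole" (one or more missing months) between two consecutive rows, and then count the rows that were added. I have achieved this with a "while" loop: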

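i <- 2
# Walk the rows; once i runs past the last row, Id is NA and the loop stops
while (!is.na(filas[i, ]$Id)) {
  # A "hole": more than 31 days since the previous row of the same Id
  if (as.double(difftime(filas[i, ]$fecha, filas[i - 1, ]$fecha, units = "days")) > 31 &&
      filas[i, ]$Id == filas[i - 1, ]$Id) {
    # Insert the next missing month; meses_imp keeps the date of the
    # last original row before the hole
    filas <- add_row(filas,
                     Id = filas[i, ]$Id,
                     fecha = filas[i - 1, ]$fecha + months(1),
                     meses_imp = pmin(filas[i - 1, ]$fecha,
                                      filas[i - 1, ]$meses_imp),
                     .after = i - 1)
  }
  i <- i + 1
}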

filas2 <- filas %>%
  group_by(Id,meses_imp) %>% 
  summarise(cant_meses_imp = n()) %>%
  ungroup() %>% 
  filter(meses_imp != "2999-01-01")

filas <- left_join(filas,
                   filas2,
                   by=c("Id","meses_imp"))

Año   Mes  Id  fecha       meses_imp   cant_meses_imp
2020    2   1  2020-02-01  2999-01-01  NA
2020    3   1  2020-03-01  2999-01-01  NA
2020    4   1  2020-04-01  2999-01-01  NA
2020    5   1  2020-05-01  2999-01-01  NA
NA     NA   1  2020-06-01  2020-05-01  9
NA     NA   1  2020-07-01  2020-05-01  9
NA     NA   1  2020-08-01  2020-05-01  9
NA     NA   1  2020-09-01  2020-05-01  9
NA     NA   1  2020-10-01  2020-05-01  9
NA     NA   1  2020-11-01  2020-05-01  9
NA     NA   1  2020-12-01  2020-05-01  9
NA     NA   1  2021-01-01  2020-05-01  9
NA     NA   1  2021-02-01  2020-05-01  9
2021    3   1  2021-03-01  2999-01-01  NA
2021    4   1  2021-04-01  2999-01-01  NA
NA     NA   1  2021-05-01  2021-04-01  4
NA     NA   1  2021-06-01  2021-04-01  4
NA     NA   1  2021-07-01  2021-04-01  4
NA     NA   1  2021-08-01  2021-04-01  4
2021    9   1  2021-09-01  2999-01-01  NA
2021   11   2  2021-11-01  2999-01-01  NA
NA     NA   2  2021-12-01  2021-11-01  1
2022    1   2  2022-01-01  2999-01-01  NA

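Since I want to apply this to a much larger dataset (about 300k rows), how can I rewrite it in a vectorised way to make it more efficient (and perhaps more elegant)?

Thanks!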

【Comments】:

  • You could use a SQL-style join to create the extra rows, which avoids the complex logic and the loop.
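
One possible reading of this suggestion, as a minimal sketch rather than the commenter's actual code: build a complete monthly calendar per Id, left-join the original rows onto it, and derive the two extra columns from the rows that found no match. The names filas0, calendario, resultado, hueco and bloque are made up for this illustration; the sketch assumes dplyr, tidyr and lubridate are installed and that every fecha falls on the first of a month.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# The question's data as it is BEFORE the while loop runs
# (filas0 is a hypothetical name used only in this sketch)
filas0 <- tibble(
  Año = c(rep(2020, 4), rep(2021, 4), 2022),
  Mes = c(2:5, 3:4, 9, 11, 1),
  Id  = c(rep(1, 7), 2, 2)
) %>%
  mutate(fecha = make_date(Año, Mes, 1))

# One row per Id and month between that Id's first and last date
calendario <- filas0 %>%
  group_by(Id) %>%
  summarise(fecha = list(seq(min(fecha), max(fecha), by = "month")),
            .groups = "drop") %>%
  unnest(fecha)

resultado <- calendario %>%
  left_join(filas0, by = c("Id", "fecha")) %>%  # unmatched months get NA in Año/Mes
  arrange(Id, fecha) %>%
  group_by(Id) %>%
  mutate(hueco  = is.na(Mes),       # TRUE for rows created by the join
         bloque = cumsum(!hueco)) %>%  # an original row plus the hole that follows it
  group_by(Id, bloque) %>%
  mutate(meses_imp      = if_else(hueco, first(fecha), as.Date("2999-01-01")),
         cant_meses_imp = if_else(hueco, n() - 1L, NA_integer_)) %>%
  ungroup() %>%
  select(Año, Mes, Id, fecha, meses_imp, cant_meses_imp)
```

Each block starts with an original row, so first(fecha) is the last observed month before the hole and n() - 1L is the number of rows added after it. On the sample data this yields 23 rows with hole sizes 9, 4 and 1, matching the loop's output shown in the question.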

Tags: r loops vectorization


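【Solution 1】:

You can use the following code, which relies on the padr and zoo packages.

The idea is:

  1. Add the missing dates with the padr::pad() function.
  2. Remove the unwanted rows (those with a non-integer Id value).
  3. Create the na and grp columns to identify the rows added in step 1.
  4. Group by grp and create a cant_meses_imp column that counts the number of consecutive NAs in each group.
  5. Select only the required columns.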

library(dplyr)
library(padr)
library(zoo)

# filas is the original data from the question, before the while loop adds rows
filas %>% 
  pad(by = "fecha") %>% # add missing dates
  mutate(Id = na.approx(Id)) %>% # interpolate NA values in the Id column
  subset(Id %% 1 == 0) %>% # keep only integer Id values (drops months between two Ids)
  # This part generates the cant_meses_imp column
  mutate(na = ifelse(is.na(Mes), 1, 0), # 1 marks a row added by pad()
         grp = rle(na)$lengths %>% {rep(seq(length(.)), .)}) %>% # one id per run of equal na values
  group_by(grp) %>% 
  mutate(cant_meses_imp = ifelse(na == 0, NA, n())) %>% # run length, for added rows only
  ungroup() %>% 
  select(-c(na, grp))
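
The grp line is the least obvious part of the pipeline, so here is that run-length idiom in isolation, as a small base R sketch on a made-up 0/1 vector (the values and the names r, tam and cant are illustrative only, not taken from the data above):

```r
# 1 marks a padded row, 0 an original row (made-up example values)
na <- c(0, 0, 1, 1, 1, 0, 1, 0)

# rle() finds runs of equal values; repeating each run's index by its
# length gives every element the id of the run it belongs to
r   <- rle(na)
grp <- rep(seq_along(r$lengths), r$lengths)
grp
# 1 1 2 2 2 3 4 5

# Size of each run, kept only where the run consists of padded rows
tam  <- ave(na, grp, FUN = length)
cant <- ifelse(na == 1, tam, NA)
cant
# NA NA 3 3 3 NA 1 NA
```

Because the run ids come from consecutive positions only, two holes that happen to be adjacent in the table would be counted as one run; that does not occur in the sample data, where an original row always separates them.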

Note that the padr/zoo code does not fully reproduce the fecha column, because there is no guideline for its values.

【Comments】:
