
【Question title】: Is there a way to calculate a median from a subset of rows based on a value from a separate data frame?
【Posted】: 2021-03-26 13:12:45
【Question description】:

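I'm struggling with some data analysis (I've been out of the R game for a few years, and getting back in is hard).

I have two sets of information and need to extract portions of one data frame based on values in the other. The one I need to extract information from is a data frame of GPS points with associated times (one GPS point per second). Four separate GPS units are listed in the file, so for each time point there are four locations.

I have a second data frame that contains a list of times (over 1,000 time points). For each time listed in that second data frame, I need to calculate the median GPS coordinates of each GPS unit over the one minute around that time. Is this easily achievable?

Here are a few lines from the GPS file (the colors represent the different GPS units):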

UNIT    LONG        LAT         TIME    
BLUE    -133.528    57.317723   11:00:00    
ORANGE  -133.546435 57.316681   11:00:00    
PURPLE  -133.54297  57.3112     11:00:00    
YELLOW  -133.53807  57.319616   11:00:00    
BLUE    -133.527995 57.317725   11:00:01    
ORANGE  -133.546425 57.316681   11:00:01    
PURPLE  -133.542961 57.311201   11:00:01    
YELLOW  -133.538061 57.319616   11:00:01    
BLUE    -133.527991 57.317725   11:00:02    
ORANGE  -133.546415 57.316681   11:00:02    
PURPLE  -133.542955 57.311203   11:00:02    
YELLOW  -133.538053 57.319615   11:00:02

The other data file is just a list of times:

StartTime
11:00:00
11:51:25
12:15:17

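and so on.

I'd love to tell you what I've tried, but honestly I haven't come up with anything workable in R; I have only tried manipulating the data in Excel before importing it. Please let me know if you need more information from me in order to help. Long-time reader, first-time poster.

Thanks!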

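【Comments】:

It would help if you provided some data: example rows from each data frame, as plain text (not images), plus some example output.
I'll edit the post to include some data. Thanks!

Tags: r dataframe subset median

【Solution 1】: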

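The idea is to convert these times to objects of class "POSIXct", which is how base R stores date-times. From there, we can take time differences and find the rows that fall within the acceptable time window. Finally, split the data by UNIT, calculate the medians, and bind the results back into a data.frame that looks like the original data.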
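As a minimal sketch of the windowing step on its own, the snippet below uses a handful of made-up clock times (not taken from the question's data) and assumes that "one minute around" a start time means 30 seconds on either side. The names `times`, `start`, `gap`, and `in_window` are illustrative only.

```r
# Hypothetical example values, chosen to sit on both sides of the window edge
times <- c("11:00:00", "11:00:29", "11:00:31", "10:59:30")
start <- "11:00:00"

# Attach an arbitrary date so the clock times can be stored as POSIXct;
# a fixed time zone keeps the result independent of the local settings
times_ct <- as.POSIXct(paste("2000-01-01", times), tz = "UTC")
start_ct <- as.POSIXct(paste("2000-01-01", start), tz = "UTC")

# Request seconds explicitly instead of relying on automatically chosen units
gap <- abs(as.numeric(difftime(times_ct, start_ct, units = "secs")))

# TRUE for rows that fall inside the assumed +/- 30 second window
in_window <- gap <= 30
in_window
# TRUE TRUE FALSE TRUE
```

The comparison is inclusive, so a reading exactly 30 seconds away counts as inside the window; use `<` instead if the boundary should be excluded.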

Create example data

UNIT <- c("BLUE", "ORANGE", "PURPLE", "YELLOW")
# zero-pad each field so the times match the question's "HH:MM:SS" format
TIME <- sprintf("%02d:%02d:%02d", rep(10:12, each = 3600), rep(0:59, each = 60), 0:59)
x <- data.frame(
    UNIT = UNIT,
    LONG = NA, LAT = NA,
    TIME = rep(TIME, each = length(UNIT))
)
x$LONG <- stats::rnorm(nrow(x), mean = -133.53, sd = 0.1)
x$LAT  <- stats::rnorm(nrow(x), mean =   57.31, sd = 0.1)


StartTime <- c("11:00:00", "11:51:25", "12:15:17")

The actual calculation

# the "2000-01-01" is arbitrary, just used to convert our times to class "POSIXct" 
#     which can be used to calculate time differences
.StartTime <- as.POSIXct(paste0("2000-01-01 ", StartTime))
.TIME      <- as.POSIXct(paste0("2000-01-01 ", x$TIME))


# loop through each StartTime
median.dat <- lapply(.StartTime, function(.StartTimei) {
    
    
    # figure out which TIME values are within the 1 minute window
    #     (30 seconds either side of the start time)
    in.window <- abs(.TIME - .StartTimei) <= as.difftime(30, units = "secs")
    
    
    # select the rows corresponding to times within the acceptable range
    dat <- x[in.window, , drop = FALSE]
    
    
    # select columns "LONG" and "LAT", then split the rows by "UNIT"
    dat <- split(dat[c("LONG", "LAT")], dat$UNIT)
    
    
    # for each "UNIT", ...    
    dat <- lapply(X = dat, FUN = function(dati) {
        
        
        # find the median of "LONG" and "LAT"
        vapply(X = dati, FUN = "median", FUN.VALUE = NA_real_)
    })
    
    
    # bind this list of data into a matrix
    dat <- do.call("rbind", dat)
    
    
    # attach the "UNIT" and "TIME" so it looks like the original data.frame
    #     (format the time back to "HH:MM:SS" to drop the arbitrary date)
    return(data.frame(
        UNIT = rownames(dat),
        dat,
        TIME = format(.StartTimei, "%H:%M:%S"),
        row.names = NULL
    ))
})


# bind this list of data into a data.frame
median.dat <- do.call("rbind", median.dat)


print(median.dat)
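
For comparison, here is a more compact sketch of the same approach that uses `aggregate()` for the per-unit medians. It is not part of the original answer. It runs on a few rows copied from the question (BLUE and ORANGE only) and again assumes a window of 30 seconds either side of each start time. The names `gps`, `start_times`, `to_ct`, and `window_medians` are made up for this example.

```r
# A few rows in the shape of the question's GPS file
gps <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
UNIT    LONG        LAT        TIME
BLUE    -133.528    57.317723  11:00:00
ORANGE  -133.546435 57.316681  11:00:00
BLUE    -133.527995 57.317725  11:00:01
ORANGE  -133.546425 57.316681  11:00:01
BLUE    -133.527991 57.317725  11:00:02
ORANGE  -133.546415 57.316681  11:00:02
")
start_times <- c("11:00:00")

# Convert "HH:MM:SS" strings to POSIXct on an arbitrary date
to_ct <- function(hms) as.POSIXct(paste("2000-01-01", hms), tz = "UTC")
gps_ct <- to_ct(gps$TIME)

# Median LONG/LAT per UNIT within +/- half_width seconds of one start time
window_medians <- function(start, half_width = 30) {
    gap <- abs(as.numeric(difftime(gps_ct, to_ct(start), units = "secs")))
    keep <- gps[gap <= half_width, , drop = FALSE]
    if (nrow(keep) == 0L) return(NULL)  # no GPS rows near this start time
    out <- aggregate(cbind(LONG, LAT) ~ UNIT, data = keep, FUN = median)
    out$TIME <- start
    out
}

# One block of rows per start time, stacked into a single data.frame
medians <- do.call(rbind, lapply(start_times, window_medians))
print(medians)
```

Returning `NULL` for an empty window means `rbind` silently skips start times that have no GPS rows nearby. If those times should show up in the result, return a row of `NA` values instead.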

【Discussion】: