Subsetting a data frame to include 20 rows before and after: answers

【Question title】: Subsetting a data frame to include 20 rows before and after
【Posted】: 2013-08-05 16:53:05
【Question】:

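I know the title is kind of lame, but I couldn't think of what else to call it. I am trying to subset a large data frame using the values that occur in lon (the longitude column). My current subsetting script works: it creates a subset any time a -180 (the n/a value) occurs, and it includes the first non -180 number before and after each run of one or more -180s.

My problem is that I would like the subsets to be comprised of the 20 longitudes before any -180s, and 20 after. Since many of my files start with -180s and end with -180s, this creates an error. I just don't know how to tell the script to subset the -180s but also disregard any that occur in the first or last lines. Ideally the script would only subset -180s that have 20 longitudes before them and 20 after. Also, I will never know how many -180s will occur at the beginning and end of a file, which is the biggest obstacle to solving this on my own. Below is a sample of my data and my current subsetting code. Thank you in advance for your help!

EDIT: It is also important that the rows stay in the same order and are not sorted in any way, since this is chronological data. My data frame has 4461 rows and 7 columns.

EDIT: Below is a small sample of my data frame.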

cols <- structure(list(fixType = structure(c(39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L),
.Label = c("firstfix +indoors +startpoint", "firstfix +indoors +startpoint +cluster_center", "firstfix +indoors +stationary", "firstfix +indoors +stationary +cluster_center", "firstfix +invehicle +startpoint", "firstfix +invehicle +startpoint +cluster_center", "firstfix +invehicle +stationary +cluster_center", "firstfix +outdoors +startpoint", "firstfix +outdoors +startpoint +cluster_center", "firstfix +outdoors +stationary", "firstfix +outdoors +stationary +cluster_center", "inserted +indoors +midpoint", "inserted +indoors +pausepoint", "inserted +indoors +stationary", "inserted +indoors +stationary +cluster_center", "inserted +invehicle +midpoint", "inserted +invehicle +pausepoint", "inserted +invehicle +stationary", "inserted +invehicle +stationary +cluster_center", "inserted +outdoors +midpoint", "inserted +outdoors +pausepoint", "inserted +outdoors +stationary", "inserted +outdoors +stationary +cluster_center", "lastfix +indoors +endpoint", "lastfix +indoors +endpoint +cluster_center", "lastfix +indoors +stationary", "lastfix +indoors +stationary +cluster_center", "lastfix +invehicle +endpoint", "lastfix +invehicle +endpoint +cluster_center", "lastfix +outdoors +endpoint", "lastfix +outdoors +endpoint +cluster_center", "lastfix +outdoors +stationary", "lastvalidfix +indoors +stationary", "lastvalidfix +indoors +stationary +cluster_center", "lastvalidfix +invehicle +stationary", "lastvalidfix +invehicle +stationary +cluster_center", "lastvalidfix +outdoors +stationary", "lastvalidfix +outdoors +stationary +cluster_center", "unknown", "valid +indoors +endpoint", "valid +indoors +endpoint +cluster_center", "valid +indoors +midpoint", "valid +indoors +pausepoint", "valid +indoors +pausepoint +cluster_center", "valid +indoors +startpoint", "valid +indoors +startpoint +cluster_center", "valid +indoors +stationary", "valid +indoors +stationary +cluster_center", "valid +invehicle +endpoint", "valid +invehicle +endpoint +cluster_center", "valid +invehicle +midpoint", "valid +invehicle +pausepoint", "valid +invehicle +startpoint", "valid +invehicle +startpoint +cluster_center", "valid +invehicle +stationary", "valid +invehicle +stationary +cluster_center", "valid +outdoors +endpoint", "valid +outdoors +endpoint +cluster_center", "valid +outdoors +midpoint", "valid +outdoors +pausepoint", "valid +outdoors +pausepoint +cluster_center", "valid +outdoors +startpoint", "valid +outdoors +startpoint +cluster_center", "valid +outdoors +stationary", "valid +outdoors +stationary +cluster_center"), class = "factor"),
lon = c(-180, -180, -180, -180, -180, -180, -180, -180, -180, -180),
lat = c(-180, -180, -180, -180, -180, -180, -180, -180, -180, -180),
activityIntensity = c(2L, 2L, 1L, 2L, 2L, 2L, 0L, 2L, 1L, 0L),
Impute = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
ID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 4352L, 4353L, 4354L),
subsetNum = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)),
.Names = c("fixType", "lon", "lat", "activityIntensity", "Impute", "ID", "subsetNum"),
row.names = c(4462L, 4463L, 4464L, 4465L, 4466L, 4467L, 4468L, 8813L, 8814L, 8815L), class = "data.frame")

Subsetting code:

set.seed(5)
n <- length(df) #make it equal to the length of whatever the input file is

impCols <- df[ , c("fixType", "lon", "lat", "activityIntensity", "Impute", "ID", "subsetNum")]

test.df <- data.frame(impCols)

df <- test.df
obs <- dim(df)[1]

counter <- 1
subM.List <- list()

start.idx <- NA

for(i in 1:obs){
    if (is.na(start.idx) & (substr(df[i,"lon"], 1, 4) == -180)){
      start.idx <- i-1
  }
  else if (!is.na(start.idx) & (substr(df[i,"lon"], 1, 4) != -180)){
    end.idx <- i+1 #the plus one will give you the first two instances of signal instead of just the first
    subMat <- df[start.idx:end.idx,]
    subM.List[[counter]] <- subMat
    start.idx <- NA
    counter <- counter + 1
  }
}

【Comments】:

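Consider using something like max(i, 20) and min(i, nrow(data)) to start/end the selection of 20 rows.
@EllisValentiner I should also have mentioned that the rows are in chronological order, and for my application it is very important to preserve that order. I am experimenting with min and max right now, but I think they might change the order of my rows.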
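diff(lon == -180) is your friend here. If you can post a reproducible example specifically of the df data, that would be very helpful.
Your sample lon has neither 7 nor 4461 values. Is it part of a column in your data set? In any case, Ricardo's diff, or similarly which(df$lon != -180), will get you started. Once you know the row indices of the transitions from "-180" to "non -180", it is easy to add or subtract 20 from those indices to get the subsets you want.
@CarlWitthoft Yes, sorry, it is just one of the 7 columns in my data frame. I am working on adding a reproducible example now.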
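
To make the suggestions in these comments concrete, here is a minimal sketch on a made-up longitude vector. The toy values and the helper name `clamp.window` are assumptions for illustration, not part of the asker's data or code. It shows how `diff()` on the logical vector marks where runs of -180 begin and end, and how `max()`/`min()` keep a window inside the data without reordering any rows.

```r
# Toy longitude vector (made-up values): a leading run, an interior run
# and a trailing run of the -180 "missing" code
lon <- c(-180, -180, -117.1, -117.2, -180, -180, -180, -117.3, -117.4, -180)

is.missing <- lon == -180

# diff() on the logical vector is +1 where a -180 run begins
# and -1 where a -180 run ends
change <- diff(is.missing)
run.begins <- which(change == 1) + 1   # first -180 row of runs starting after row 1
run.ends   <- which(change == -1)      # last -180 row of runs ending before the last row

# Clamp a window of k rows either side of row i to the valid row range.
# The result is an ascending index, so row order is never changed.
clamp.window <- function(i, k, n) seq(max(1, i - k), min(n, i + k))

run.begins                 # 5 10
run.ends                   # 2 7
clamp.window(1, 2, 10)     # 1 2 3
clamp.window(9, 2, 10)     # 7 8 9 10
```

Note that a run touching the first row has an end but no begin, and a run touching the last row has a begin but no end, which is one way to recognise the edge runs the asker wants to ignore.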

Tags: r dataframe subset

【Solution 1】:

indicators <-  df$lon == -180

# the first and last rows that are NOT -180 are your index boundaries
indx.min <- min(which(!indicators))    # will issue warning if lon is nothing but '-180'
indx.max <- max(which(!indicators))    # will issue warning if lon is nothing but '-180'

# Quoting the question:
#   "My problem is that I would like the subsets to be comprised of
#    the 20 longitudes before any -180s, and 20 after"


# `inPlay` are the indicators that are not at the extreme ends
inPlay <- which(indicators)
inPlay <- inPlay[inPlay > indx.min & inPlay < indx.max]

# Sample Size
S <- 20  # use a variable so you can change it as needed

diffPlay <- diff(inPlay)
# named run.start / run.stop so that base::start() and base::stop() are not masked
run.stop  <- c(which(diffPlay != 1), length(inPlay))
run.start <- c(1, which(diffPlay != 1) + 1)

# these are the row ranges of the `-180` runs.  You can have a look if you'd like
rbind(inPlay[run.start], inPlay[run.stop])

# we are going to take the S rows before the first row of each run
#   and the S rows after its last row (first row + "plus")
inPlayPlus <- inPlay[run.stop] - inPlay[run.start]
inPlayStart <- inPlay[run.start]

## Label each run by its row range; these labels can be used to name the list of subsets
names(inPlayStart) <- ifelse(inPlayPlus > 0, paste0("Rows", inPlayStart, "_to_", inPlayStart+inPlayPlus), paste0("Row", inPlayStart))

subsetsList <- 
  lapply(seq_along(inPlayStart), function(i) {
      # This can be one line.  Broken up so you can see what's happening
      from <- max(indx.min, inPlayStart[[i]]-S) # notice, calling max on the min
      to   <- min(indx.max, inPlayStart[[i]] + inPlayPlus[[i]] +S) #    and  min on the max

      cat("i is ", i, "\tPlus=", inPlayPlus[[i]], "\t(from, to) = (", from, ", ", to, ")\tDIFF=", to-from, "\n", sep="")
      indx <- if (inPlayPlus[[i]] == 0) from:to else setdiff(from:to, inPlayStart[[i]]+(1:inPlayPlus[[i]]) )
      df[indx, ] 
    })


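## lapply() over seq_along() does not carry names over, so set them explicitly,
##   then have a look at the results
names(subsetsList) <- names(inPlayStart)
subsetsList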
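
To see what the run-detection step in this answer produces, here is a small worked example on a made-up vector. The toy values and the names `gaps`, `run.first` and `run.last` are illustrative assumptions; the logic follows the `diff(inPlay)` idea used in the answer.

```r
# Toy data (made up): rows 1-2 and 12 are edge -180s,
# rows 5-7 and row 10 are interior -180 runs
lon <- c(-180, -180, -117.1, -117.2, -180, -180, -180,
         -117.3, -117.4, -180, -117.5, -180)

indicators <- lon == -180
indx.min <- min(which(!indicators))   # first valid row: 3
indx.max <- max(which(!indicators))   # last valid row: 11

# -180 rows strictly between the first and last valid rows
inPlay <- which(indicators)
inPlay <- inPlay[inPlay > indx.min & inPlay < indx.max]   # 5 6 7 10

# A jump larger than 1 between consecutive -180 rows separates two runs
gaps <- which(diff(inPlay) != 1)
run.first <- inPlay[c(1, gaps + 1)]            # first row of each run: 5 10
run.last  <- inPlay[c(gaps, length(inPlay))]   # last row of each run:  7 10

rbind(run.first, run.last)
```

The edge rows 1, 2 and 12 never enter `inPlay`, which is how the leading and trailing -180s are excluded regardless of how many there are.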

【Discussion】:

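Yeah, right: lifted straight from my comment :-). On the serious side, though, this answer only finds one block of rows bounded by -180, doesn't it? I'd guess that in 4600 rows there are several such blocks, so you need to find more than just the min and max of indicators.
@CarlWitthoft, have a look at the lapply statement. It iterates over every occurrence of -180. The sole purpose of the min/max is to exclude any starting/ending series.
@RicardoSaporta This ran, but it gave me 335 subsets (out of 4461 rows), which is far too many, and each subset consisted of 41 rows that were just a series of NAs. I'm not sure what went wrong, but I'm trying a few different things.
@KMess., I'm confused. You asked for each subset to contain 41 rows (20 before + 1 + 20 after = 41). By the way, I quoted your exact request in my answer. What output are you expecting? It sounds like you have 335 -180s among your 4461 values.
@RicardoSaporta, sorry, I thought I had made that clear. The -180s usually occur in groups, so I would like the whole group of -180s to be subsetted together with the 20 non -180s before it and the 20 non -180s after it. An example of a subset would be: -117.2229 -117.2232 -117.2232 -117.2232 -117.2232 -117.2232 -117.2231 -117.2226 -117.2228 -117.2229 -117.2227 -117.2229 -117.2229 -117.2229 -117.2229 -117.2229 -117.2229 -117.2229 -117.2229 -117.2229 -180.0000 -180.0000 -180.0000 -117.2229 -117.2232 -117.2232 -117.2232 -117.2232 -117.2231 -117.2226 -117.2229 -117.2229 (up to 20)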
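
Given that clarification, one possible way to keep each whole group of -180s together with its neighbours is to work on runs via `rle()`. This is only a sketch under stated assumptions, not the answer's code: the function name `subset_runs` and its arguments are hypothetical, `lon` is assumed to contain no `NA`, and each subset takes up to `k` valid rows on either side (fewer if the neighbouring stretch of valid rows is shorter than `k`).

```r
# Hypothetical helper: one data frame per interior run of the missing code,
# together with up to k valid rows before and after it
subset_runs <- function(df, k = 20, missing.code = -180) {
  is.missing <- df$lon == missing.code
  runs <- rle(is.missing)                  # consecutive TRUE / FALSE stretches
  run.end   <- cumsum(runs$lengths)        # last row of each stretch
  run.start <- run.end - runs$lengths + 1  # first row of each stretch
  n.runs <- length(runs$lengths)
  idx <- seq_len(n.runs)

  # keep missing-value runs that are neither the first nor the last stretch,
  # so leading and trailing -180s are ignored however long they are
  keep <- idx[runs$values & idx > 1 & idx < n.runs]

  lapply(keep, function(j) {
    # stretches j - 1 and j + 1 are valid rows; do not reach past them
    from <- max(run.start[j - 1], run.start[j] - k)
    to   <- min(run.end[j + 1],   run.end[j] + k)
    df[from:to, , drop = FALSE]            # ascending index keeps row order
  })
}

# Made-up toy data: edge runs in rows 1-2 and 11-12, interior run in rows 5-7
toy <- data.frame(
  ID  = 1:12,
  lon = c(-180, -180, -117.1, -117.2, -180, -180, -180,
          -117.3, -117.4, -117.5, -180, -180)
)

res <- subset_runs(toy, k = 2)
length(res)     # 1
res[[1]]$ID     # 3 4 5 6 7 8 9
```

To keep only groups that have a full `k` valid rows on both sides, as the question describes as ideal, the sketch could additionally filter `keep` on `runs$lengths[keep - 1] >= k & runs$lengths[keep + 1] >= k`.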