[Posted]: 2013-08-05 16:53:05
[Question]:
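I know the title is a bit lame, but I couldn't think of what else to call it. I am trying to subset a large data frame based on the values in lon (the longitude column). My current subsetting script works: it creates a subset every time a -180 (the n/a value) appears, and each subset includes the first non -180 value before and after a run of one or more -180s. My problem is that I want each subset to consist of the 20 longitudes before any run of -180s and the 20 longitudes after it. Because many of my files begin and end with -180s, this causes an error. I just can't work out how to tell the script to subset around the -180s while ignoring any that occur in the first or last rows. Ideally, the script would only subset -180s that are preceded by 20 longitudes and followed by 20 longitudes. Also, I never know in advance how many -180s will appear at the beginning and end of a file, which has been the biggest obstacle to solving this myself. Below are a sample of my data and my current subsetting code. Thanks in advance for your help!
Edit: It is also important that the rows stay in the same order and are not sorted in any way, because this is chronological data. My data frame has 4461 rows and 7 columns.
Edit: Below is a small sample of my data frame.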
cols <- structure(list(fixType = structure(c(39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L), .Label = c("firstfix +indoors +startpoint", "firstfix +indoors +startpoint +cluster_center", "firstfix +indoors +stationary", "firstfix +indoors +stationary +cluster_center", "firstfix +invehicle +startpoint", "firstfix +invehicle +startpoint +cluster_center", "firstfix +invehicle +stationary +cluster_center", "firstfix +outdoors +startpoint", "firstfix +outdoors +startpoint +cluster_center", "firstfix +outdoors +stationary", "firstfix +outdoors +stationary +cluster_center", "inserted +indoors +midpoint", "inserted +indoors +pausepoint", "inserted +indoors +stationary", "inserted +indoors +stationary +cluster_center", "inserted +invehicle +midpoint", "inserted +invehicle +pausepoint", "inserted +invehicle +stationary", "inserted +invehicle +stationary +cluster_center", "inserted +outdoors +midpoint", "inserted +outdoors +pausepoint", "inserted +outdoors +stationary", "inserted +outdoors +stationary +cluster_center", "lastfix +indoors +endpoint", "lastfix +indoors +endpoint +cluster_center", "lastfix +indoors +stationary", "lastfix +indoors +stationary +cluster_center", "lastfix +invehicle +endpoint", "lastfix +invehicle +endpoint +cluster_center", "lastfix +outdoors +endpoint", "lastfix +outdoors +endpoint +cluster_center", "lastfix +outdoors +stationary", "lastvalidfix +indoors +stationary", "lastvalidfix +indoors +stationary +cluster_center", "lastvalidfix +invehicle +stationary", "lastvalidfix +invehicle +stationary +cluster_center", "lastvalidfix +outdoors +stationary", "lastvalidfix +outdoors +stationary +cluster_center", "unknown", "valid +indoors +endpoint", "valid +indoors +endpoint +cluster_center", "valid +indoors +midpoint", "valid +indoors +pausepoint", "valid +indoors +pausepoint +cluster_center", "valid +indoors +startpoint", "valid +indoors +startpoint +cluster_center", "valid +indoors +stationary", "valid +indoors +stationary +cluster_center", "valid +invehicle +endpoint", "valid +invehicle +endpoint +cluster_center", "valid +invehicle +midpoint", "valid +invehicle +pausepoint", "valid +invehicle +startpoint", "valid +invehicle +startpoint +cluster_center", "valid +invehicle +stationary", "valid +invehicle +stationary +cluster_center", "valid +outdoors +endpoint", "valid +outdoors +endpoint +cluster_center", "valid +outdoors +midpoint", "valid +outdoors +pausepoint", "valid +outdoors +pausepoint +cluster_center", "valid +outdoors +startpoint", "valid +outdoors +startpoint +cluster_center", "valid +outdoors +stationary", "valid +outdoors +stationary +cluster_center"), class = "factor"), lon = c(-180, -180, -180, -180, -180, -180, -180, -180, -180, -180), lat = c(-180, -180, -180, -180, -180, -180, -180, -180, -180, -180), activityIntensity = c(2L, 2L, 1L, 2L, 2L, 2L, 0L, 2L, 1L, 0L), Impute = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), ID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 4352L, 4353L, 4354L), subsetNum = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("fixType", "lon", "lat", "activityIntensity", "Impute", "ID", "subsetNum"), row.names = c(4462L, 4463L, 4464L, 4465L, 4466L, 4467L, 4468L, 8813L, 8814L, 8815L), class = "data.frame")
Subsetting code:
set.seed(5)
n <- length(df) #make it equal to the length of whatever the input file is
impCols <- df[ , c("fixType", "lon", "lat", "activityIntensity", "Impute", "ID", "subsetNum")]
test.df <- data.frame(impCols)
df <- test.df
obs <- dim(df)[1]
counter <- 1
subM.List <- list()
start.idx <- NA
for(i in 1:obs){
  if (is.na(start.idx) & (substr(df[i,"lon"], 1, 4) == -180)){
    start.idx <- i-1
  } else if (!is.na(start.idx) & (substr(df[i,"lon"], 1, 4) != -180)){
    end.idx <- i+1 #the plus one will give you the first two instances of signal instead of just the first
    subMat <- df[start.idx:end.idx,]
    subM.List[[counter]] <- subMat
    start.idx <- NA
    counter <- counter + 1
  }
}
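To make the desired behaviour concrete, here is a minimal sketch of one possible reading of it, not the script from the post. The helper name subset_gaps, its arguments pad and missing_value, and the toy data frame are all hypothetical. It assumes that a run of -180s should be skipped only when it touches the first or last row, and that the window of pad rows on each side is simply clipped to the bounds of the data frame (so a window may contain other -180 rows).

```r
# Hypothetical helper (not from the original post): return one data frame per
# interior run of -180s, including up to `pad` rows before and after the run.
subset_gaps <- function(df, pad = 20, missing_value = -180) {
  # TRUE for rows holding the missing-value marker; NA longitudes count as FALSE.
  is_gap <- !is.na(df$lon) & df$lon == missing_value

  # Run-length encoding gives the start and end row of every run.
  runs <- rle(is_gap)
  run_end <- cumsum(runs$lengths)
  run_start <- run_end - runs$lengths + 1
  n <- nrow(df)

  # Assumption: skip runs of -180 that touch the first or the last row.
  keep <- runs$values & run_start > 1 & run_end < n

  # Consecutive row ranges keep the rows in their original (chronological) order.
  Map(function(s, e) df[max(1, s - pad):min(n, e + pad), , drop = FALSE],
      run_start[keep], run_end[keep])
}

# Toy data: a leading run, one interior run, and a trailing run of -180s.
toy <- data.frame(ID = 1:12,
                  lon = c(-180, -180, 10, 11, 12, -180, -180, 13, 14, 15, -180, -180))
pieces <- subset_gaps(toy, pad = 2)  # pad = 2 only to keep the example small
pieces[[1]]$ID                       # rows 4 to 9
```

Because the runs at the start and end of the file are detected from the data itself, their length does not need to be known in advance.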
[Comments]:
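- Consider using something like max(i, 20) and min(i, nrow(data)) to start/end your selection of 20 rows.
- @EllisValentiner I should also mention that the rows are in chronological order, and preserving that order is very important for my application. I am currently experimenting with min and max, but I think it might change the order of my rows.
- Here, diff(lon==180) is your friend. If you can post a reproducible example specifically for your df data, that would be very helpful.
- Your example lon has neither 7 nor 4461 values. Is it just part of one column of your dataset? In any case, Ricardo's diff, or similarly which(df$lon !=180), will get you started. Once you know the row indices of the transitions from "180" to "not 180", it is easy to add or subtract 20 from those indices to get the subsets you want.
- @CarlWitthoft Yes, sorry, it is only one of the 7 columns in my data frame. I am working on adding a reproducible example now.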
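The suggestions in the comments can be combined roughly as follows. This is only a sketch on a made-up lon vector. It assumes the marker value is -180, as in the posted data, rather than the 180 written in the comments, and the variable names are illustrative. diff() locates the transitions, and pmax()/pmin() clamp the padded windows to the valid row range.

```r
# Toy longitude vector (made up): leading, interior and trailing runs of -180.
lon <- c(-180, -180, 10, 11, 12, -180, -180, 13, 14, 15, -180, -180)
pad <- 2  # the question asks for 20; 2 keeps the toy example readable

is_gap <- lon == -180
change <- diff(is_gap)  # +1 where a gap begins on the next row, -1 where a gap just ended

gap_start <- which(change == 1) + 1  # first -180 row of each gap that follows valid data
gap_end   <- which(change == -1)     # last -180 row of each gap followed by valid data

# A leading gap has an end but no start; a trailing gap has a start but no end.
# Dropping those unmatched transitions leaves only the interior gaps.
if (is_gap[1]) gap_end <- gap_end[-1]
if (is_gap[length(lon)]) gap_start <- gap_start[-length(gap_start)]

# Pad each gap and clamp the window to the valid range of row numbers.
from <- pmax(gap_start - pad, 1)
to   <- pmin(gap_end + pad, length(lon))
windows <- Map(seq, from, to)

windows[[1]]  # rows 4 to 9
```

Each element of windows is an increasing sequence of row numbers, so indexing a data frame with it, for example df[windows[[1]], ], returns the rows in their original order.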