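There are several ways to do this; I'll show one that uses base functions. (Another approach is to use dplyr, which is also well suited to this problem. However, a base example should be simple enough.)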
Generating the Data
(This is here only because we don't have any of your data.)
n <- 10
for (ii in 1:3) {
  dat <- runif(n)
  writeLines(paste(dat, collapse = ','),
             con = sprintf('user2062207-file%s.txt', ii))
}
readLines('user2062207-file1.txt')
## [1] "0.929472318384796,0.921938128070906,0.707776406314224,0.236701443558559,0.271322417538613,0.388766387710348,0.422867075540125,0.324589917669073,0.92406965768896,0.171326051233336"
Reading the Data
Assuming you have a simple pattern for finding the files, this is where you would start.
## escape the dot and anchor the pattern so only the intended files match
fnames <- list.files(pattern = '^user2062207-file.*\\.txt$')
allData <- unlist(sapply(fnames, read.table, sep = ','))
allRange <- range(allData)
df <- data.frame(x = allData)
head(df)
## x
## 1 0.9294723
## 2 0.9219381
## 3 0.7077764
## 4 0.2367014
## 5 0.2713224
## 6 0.3887664
dim(df)
## [1] 30 1
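As a variation on the reading step, here is a minimal sketch that uses scan() instead of read.table, so each file comes back as a plain numeric vector. The directory name, the demo-file names, and the variables demoFiles and demoData are hypothetical, made up for illustration only.

```r
# Hypothetical setup: three small comma-separated files in a temporary directory.
tmp <- file.path(tempdir(), "bin-demo")
dir.create(tmp, showWarnings = FALSE)
for (ii in 1:3) {
  writeLines(paste(ii + c(0.1, 0.2, 0.3), collapse = ","),
             con = file.path(tmp, sprintf("demo-file%s.txt", ii)))
}

# Escape the dot and anchor the pattern so only the intended files match.
demoFiles <- list.files(tmp, pattern = "^demo-file.*\\.txt$", full.names = TRUE)

# scan() returns one numeric vector per file; unlist() concatenates them.
demoData <- unlist(lapply(demoFiles, scan, sep = ",", quiet = TRUE))
length(demoData)
range(demoData)
```

Whether scan() or read.table is the better fit depends on the real files; scan() assumes every field is numeric.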
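Setting up the Bins
The floor/ceiling rounding below is needed because cut treats each bin as closed on only one side (the right, by default), so a minimum value that sits on the lowest break would not be binned. Rounding the range outward to multiples of binSize pushes the lowest break below the minimum (unless the minimum is itself an exact multiple of binSize) and also ensures that the bins fall on round boundaries.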
binSize <- 0.05
allBins <- seq(floor(allRange[1] / binSize) * binSize,
               ceiling(allRange[2] / binSize) * binSize,
               by = binSize)
## bin the data
df$bin <- cut(df$x, breaks = allBins)
head(df)
## x bin
## 1 0.9294723 (0.9,0.95]
## 2 0.9219381 (0.9,0.95]
## 3 0.7077764 (0.7,0.75]
## 4 0.2367014 (0.2,0.25]
## 5 0.2713224 (0.25,0.3]
## 6 0.3887664 (0.35,0.4]
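To see the one-sided inclusion in isolation, here is a small sketch with made-up values whose minimum and maximum sit exactly on break points. The data and the names demoX, demoBreaks, openLeft, and closedLeft are assumptions for illustration; a bin size of 0.5 is used because it is exactly representable in floating point.

```r
demoBinSize <- 0.5                       # exactly representable, so breaks are exact
demoX <- c(0, 0.2, 0.5, 0.7, 1)          # hypothetical data; min and max sit on breaks
demoBreaks <- seq(floor(min(demoX) / demoBinSize) * demoBinSize,
                  ceiling(max(demoX) / demoBinSize) * demoBinSize,
                  by = demoBinSize)      # 0.0, 0.5, 1.0

# Default: intervals are (a, b], so the value 0 falls outside every bin (NA).
openLeft <- cut(demoX, breaks = demoBreaks)
openLeft

# include.lowest = TRUE also closes the first interval on the left.
closedLeft <- cut(demoX, breaks = demoBreaks, include.lowest = TRUE)
closedLeft
```

If your real data can contain values that are exact multiples of the bin size, include.lowest = TRUE is one way to avoid silently losing the minimum.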
Stats per Bin
sapply(levels(df$bin), function(lvl) median(df$x[df$bin == lvl], na.rm = TRUE))
## (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3] (0.3,0.35]
## 0.03802277 NA 0.11528715 0.18195392 0.22918094 0.27132242 0.33626971
## (0.35,0.4] (0.4,0.45] (0.45,0.5] (0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7]
## 0.38009637 0.42184059 NA 0.53826028 0.57820253 0.64165116 0.67825992
## (0.7,0.75] (0.75,0.8] (0.8,0.85] (0.85,0.9] (0.9,0.95] (0.95,1]
## 0.74243926 NA 0.80759621 0.88974267 0.92406966 0.95691077
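This is an area where many other options are available. For example, the base function by works, though the data structure it returns is not always intuitive to work with, even if the function call itself is easy to read: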
head(by(df$x, df$bin, median, na.rm = TRUE))
## df$bin
## (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3]
## 0.03802277 NA 0.11528715 0.18195392 0.22918094 0.27132242
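Another base option, if the object returned by by feels awkward, is tapply, which returns a named array that can be indexed by bin label. This is a minimal sketch on made-up values; demoVals, demoBin, and demoMeds are hypothetical names, not part of the answer above.

```r
demoVals <- c(0.1, 0.4, 0.6, 0.9, 2.3)             # hypothetical data
demoBin  <- cut(demoVals, breaks = c(0, 1, 2, 3))  # the bin (1,2] stays empty

# One median per factor level; empty levels come back as NA.
demoMeds <- tapply(demoVals, demoBin, median)
demoMeds

# Individual bins can be looked up by their label.
demoMeds[["(0,1]"]]
```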
You can also easily use dplyr. This example starts from the original allData and allBins:
library(dplyr)
data.frame(x = allData) %>%
  mutate(bin = cut(x, breaks = allBins)) %>%
  group_by(bin) %>%
  summarise(median(x))
## Source: local data frame [17 x 2]
## bin median(x)
## 1 (0,0.05] 0.03802277
## 2 (0.1,0.15] 0.11528715
## 3 (0.15,0.2] 0.18195392
## 4 (0.2,0.25] 0.22918094
## 5 (0.25,0.3] 0.27132242
#### ..snip..
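The first example keeps the empty bins (as NA), as does by in the output shown above, whereas the dplyr version only reports bins that actually contain data. There may be ways to make dplyr include the empty bins too, but this seemed sufficient.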
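One possible way to keep empty bins in a dplyr pipeline is the .drop argument of group_by. This sketch assumes a dplyr version that supports .drop (0.8.0 or later, as far as I know) and uses made-up values; demoDf and demoRes are hypothetical names.

```r
library(dplyr)

# Hypothetical data: the bin (1,2] contains no values.
demoDf <- data.frame(val = c(0.1, 0.4, 0.6, 0.9, 2.3))
demoDf$bin <- cut(demoDf$val, breaks = c(0, 1, 2, 3))

# .drop = FALSE keeps factor levels that have no rows,
# so the empty bin appears with a median of NA.
demoRes <- demoDf %>%
  group_by(bin, .drop = FALSE) %>%
  summarise(med = median(val))
demoRes
```

With the default .drop = TRUE the same pipeline would return only the two occupied bins.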
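Edit
After some discussion in chat, it turned out that the range of the data is too wide for the approach above given the bin width of 0.0005, so a better solution was devised. (Sorry, there is no sample data to share; it is not mine to provide.) I'll use random data to simulate the process: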
set.seed(42)
x <- 5e7 * runif(5e5)
library(dplyr)
binSize <- 0.0005
df <- data.frame(dat = sort(x))
df$bin <- floor(df$dat / binSize) * binSize
head(df)
## dat bin
## 1 410.9577 410.9575
## 2 456.6275 456.6270
## 3 552.3674 552.3670
## 4 875.4898 875.4895
## 5 1018.6806 1018.6805
## 6 1102.2436 1102.2435
system.time(results <- df %>% group_by(bin) %>% summarize(med = median(dat)))
## user system elapsed
## 12.08 0.00 12.11
head(results)
## Source: local data frame [6 x 2]
## bin med
## 1 410.9575 410.9577
## 2 456.6270 456.6275
## 3 552.3670 552.3674
## 4 875.4895 875.4898
## 5 1018.6805 1018.6806
## 6 1102.2435 1102.2436
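My reading of why the floor-based labels help: with values up to 5e7 and a bin width of 0.0005, a seq-based breaks vector would need on the order of 5e7 / 0.0005 = 1e11 entries, whereas floor(dat / binSize) * binSize only ever produces labels for bins that contain data. The sketch below shows the same idea on a tiny made-up vector using base tapply; demoDat, demoLabel, and demoMed are hypothetical names, and the bin size of 0.5 is chosen because it is exactly representable.

```r
demoSize <- 0.5
demoDat  <- c(0.1, 0.4, 0.5, 0.9, 2.3)   # hypothetical data

# Label each value with the lower edge of its bin: [b, b + demoSize).
# Unlike cut's default, these bins are closed on the left.
demoLabel <- floor(demoDat / demoSize) * demoSize   # 0, 0, 0.5, 0.5, 2

# Median per occupied bin; bins without data never appear.
demoMed <- tapply(demoDat, demoLabel, median)
demoMed
```

Because the labels are computed by floating-point division, values lying very close to a bin edge may land in the neighbouring bin; that is worth checking if the exact edges matter for your data.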