Calculating peaks in histograms or density functions: answers

【Question title】: Calculating peaks in histograms or density functions
【Posted】: 2012-10-19 10:57:03
【Question】:

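There seem to be a lot of "peaks in density function" threads already, but I don't see one that addresses this point specifically. Sorry for the duplicate if I missed it.

My problem: given a vector of 1000 values (sample attached), I would like to identify the peaks in the histogram or in the density function of the data. From the image of the sample data below, I can see peaks in the histogram at ~0, 6200, and 8400. But I need to obtain the exact values of these peaks, preferably with a simple procedure, because I have several thousand of these vectors to process.

I originally started working with the histogram output itself, but couldn't get any peak-finding command to work properly (as in, not at all). I'm not even sure how to get the peaks() command from the splus2R package to work on a histogram object or on a density object. Working from the histogram would still be my preference, because I would like to identify the exact data value of the maximum frequency of each peak (as opposed to the density function value, which is slightly different), but I can't figure that out either.

I would post the sample data themselves, but I can't see a way to do that here (sorry if I'm just missing it).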

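【Comments】:

Searching for [r] local maximum leads you here: stackoverflow.com/q/6836409/602276
There doesn't seem to be a peaks function in the sm package.
Here is another one, from Brian Ripley @ R-help: finzi.psych.upenn.edu/R/Rhelp02a/archive/33097.html, which is also referenced here on S.O.: stackoverflow.com/questions/6324354/…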

Tags: r histogram

【Solution 1】:

If your y values are smooth (as in your sample plot), this should find the peaks quite reliably:

peakx <- x[which(diff(sign(diff(y)))==-2)]
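A small sketch of how that one-liner behaves, using made-up `x` and `y` vectors (they are illustrative, not the asker's data). Each `diff()` shortens the vector by one, so `which()` reports the position just before each local maximum; adding 1 maps it back onto the original grid. On a dense, smooth grid the shift is one grid step and may not matter.

```r
# Made-up data: two local maxima, at positions 3 and 7
y <- c(0, 1, 3, 1, 0, 2, 5, 2, 0)
x <- seq_along(y) * 10          # hypothetical x grid: 10, 20, ..., 90

# sign(diff(y)) is +1 while rising and -1 while falling;
# a jump from +1 to -1 (difference of -2) marks a local maximum
turn     <- diff(sign(diff(y)))
raw_idx  <- which(turn == -2)   # positions in the shortened vector
peak_idx <- raw_idx + 1         # positions in the original vectors

x[raw_idx]    # one grid step to the left of each maximum
x[peak_idx]   # x values at the maxima themselves
y[peak_idx]   # heights of the maxima
```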

【Comments】:

Thanks. Please see below.

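【Solution 2】:

Finding the peaks of a density function is, as already noted in the comments, related to Finding local maxima and minima, where you can find more solutions. chthonicdaemon's answer comes close to the peaks, but each diff reduces the vector length by one, so the indices need to be shifted back.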

#Create Dataset
x <- c(1,1,4,4,9)

#Estimate Density
d <- density(x)

#Two ways to get highest Peak
d$x[d$y==max(d$y)]  #Gives you all highest Peaks
d$x[which.max(d$y)] #Gives you the first highest Peak

#3 ways to get all Peaks (FALSE or +1 shifts the index back after the diffs)
d$x[c(FALSE, diff(diff(d$y)>=0)<0)] #This detects also a plateau
d$x[c(FALSE, diff(sign(diff(d$y)))<0)]
d$x[which(diff(sign(diff(d$y)))<0)+1]

#In case you also want the height of the peaks
data.frame(d[c("x", "y")])[c(FALSE, diff(diff(d$y)>=0)<0),]

#In case you need a higher "precision"
d <- density(x, n=1e4)
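Since the question mentions several thousand vectors, the approach above could be wrapped in a function and applied to a list. This is only a sketch: `density_peaks` and the `vectors` list are hypothetical names and made-up data, and the default bandwidth chosen by `density()` strongly affects how many peaks are reported.

```r
# Hypothetical helper (not from the answer): local maxima of a
# kernel density estimate, returned with their heights
density_peaks <- function(v, n = 512) {
  d   <- density(v, n = n)
  idx <- which(diff(sign(diff(d$y))) < 0) + 1  # +1 undoes the diff() shift
  data.frame(x = d$x[idx], y = d$y[idx])
}

# Made-up input: two vectors, each with two well-separated clusters
vectors <- list(
  a = c(seq(0, 1, length.out = 50), seq(10, 11, length.out = 50)),
  b = c(seq(0, 1, length.out = 50), seq(20, 21, length.out = 50))
)

# One data frame of peak locations and heights per input vector
peaks_by_vector <- lapply(vectors, density_peaks)
peaks_by_vector$a
```

Very flat regions of the estimate can produce tiny spurious maxima, so filtering on a minimum height (for example, a fraction of `max(d$y)`) may be worth considering for real data.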

【Comments】:

【Solution 3】:

Since you are thinking in terms of histograms, maybe you should use the histogram output directly?

data <- c(rnorm(100,mean=20),rnorm(100,mean=12))

peakfinder <- function(d){
  dh <- hist(d,plot=FALSE)
  ins <- dh[["density"]] ## was "intensities", which current R no longer returns
  nbins <- length(ins)
  ss <- which(rank(ins)%in%seq(from=nbins-2,to=nbins)) ## pick the top 3 intensities
  dh[["mids"]][ss]
}

peaks <- peakfinder(data)

hist(data)
sapply(peaks,function(x) abline(v=x,col="red"))

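This isn't perfect. For example, it simply finds the top bins, even if they are adjacent. Perhaps you could define "peak" more precisely? Hope that helps.

【Comments】:

This is basically what I was looking to do. But you're right, it grabs each of the top bins whether it is "noise" or a true data "peak". Defining the peak more precisely would definitely help. I was deliberately vague because I wasn't sure how to do that objectively. I guess I might need to look at something like "a peak = a top bin where at least 4 of the 6 adjacent bins on each side are decreasing." That might get out of hand. So, thanks for this starting code. If I come up with anything more specific, I will report back here.
My initial script failed, probably because "intensities" is not part of the histogram object. I replaced it with "counts" and the code seems to work as expected:
peakfinder <- function(d){
  dh <- hist(d,plot=FALSE)
  ins <- dh[["counts"]] # original script has "intensities"
  nbins <- length(ins)
  ss <- which(rank(ins)%in%seq(from=nbins-2,to=nbins)) ## pick the top 3 intensities
  dh[["mids"]][ss]
}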
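One possible way to make "peak" more precise, along the lines discussed in these comments, is to require that a bin be the strict maximum within a window of neighbouring bins. This is a sketch of that idea, not the answerer's method: `peakfinder_window`, its `w` parameter, and the sample data are all assumptions made for illustration.

```r
# Hypothetical variant: a bin counts as a peak only if its count is
# non-zero and strictly larger than every bin within `w` bins on either side
peakfinder_window <- function(d, w = 3, breaks = "Sturges") {
  dh  <- hist(d, breaks = breaks, plot = FALSE)
  cnt <- dh$counts
  n   <- length(cnt)
  is_peak <- vapply(seq_len(n), function(i) {
    lo <- max(1, i - w)
    hi <- min(n, i + w)
    neigh <- cnt[lo:hi][-(i - lo + 1)]   # window without bin i itself
    cnt[i] > 0 && all(cnt[i] > neigh)
  }, logical(1))
  dh$mids[is_peak]                       # bin centres of the peaks
}

# Made-up data with modes in the bins (2,3] and (7,8]
d <- c(rep(1.5, 5), rep(2.5, 10), rep(3.5, 4),
       rep(6.5, 3), rep(7.5, 8),  rep(8.5, 2))
peakfinder_window(d, w = 2, breaks = 0:10)
```

Because the comparison is strict, two adjacent bins with identical top counts are both rejected; a larger `w` suppresses more noise but can also merge genuinely close peaks.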

【Solution 4】:

More than 8 years later, this is still a valid and classic question. Here is a complete answer, building on the good clue given by @chthonicdaemon.

library(ggplot2)    # the package is called ggplot2
library(data.table)
library(magrittr)   # provides the %>% pipe used below
### I use a preloaded data.table. You can use any data.table with one numeric column x.
### Extract counts & breaks of the histogram bins. 
### I have taken breaks as 40 but you can take any number as needed.
### But do keep a large number of breaks so that you get multiple peaks.
h <- hist(dt1$x, breaks = 40, plot = FALSE) # compute the bins once, without drawing
counts <- h$counts
breaks <- h$breaks
### Note: the data.table `dt1` should contain at least one numeric column, x

### now name the counts vector with the corresponding breaks 
### note: the length of counts is 1 less than the breaks
names(counts) <- breaks[-length(breaks)]

### Find index for those counts that are the peaks 
### (see previous classic clue to take a double diff)
### note: the double diff causes the 2 count shrink, hence
#### I have added a FALSE before and after the results 
### to align the T/F vector with the count vector

peak_indx <- c(F,diff(sign(c(diff(counts))))==-2,F) %>% which()
topcounts <- counts[peak_indx]
topbreaks <- names(topcounts) %>% as.numeric()

### Now let's use ggplot to plot the histogram along with visualised peaks

dt1 %>%     
ggplot() + 
geom_histogram(aes(x),bins = 40,col="grey51",na.rm = T) + 
geom_vline(xintercept = topbreaks + 50,lty = 2) 
# adjust the value 50 to bring the lines in the centre
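As an alternative to hand-tuning the `+ 50` offset, the object returned by `hist()` has a `mids` component holding the bin centres, which can be indexed with the same peak positions. A minimal base-R sketch on made-up data (the vector `x` and the breaks are assumptions chosen for illustration):

```r
# Made-up data: modes in the bins (10,20] and (50,60]
x <- c(rep(5, 2), rep(15, 7), rep(25, 3),
       rep(45, 1), rep(55, 6), rep(65, 4))
h <- hist(x, breaks = seq(0, 70, by = 10), plot = FALSE)

# Same double-diff test as above, padded with FALSE on both ends
# so the logical vector lines up with h$counts
peak_indx <- which(c(FALSE, diff(sign(diff(h$counts))) == -2, FALSE))

peak_mids   <- h$mids[peak_indx]     # bin centres of the peaks
peak_counts <- h$counts[peak_indx]   # heights of the peaks
peak_mids
```

The padding means the first and last bins are never reported as peaks, which matches the behaviour of the code above.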

【Comments】: