【Question title】: How to find clusters of values over threshold for timeseries
【Posted】: 2021-03-13 15:04:48
【Question】:

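I have a time series and need to find clusters of values over a threshold, then plot each cluster on a separate plot.

Here is my code sample. Unfortunately, I don't know how to generate well-clustered values.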

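# Packages used below: tibble() from tibble, ggplot() from ggplot2,
# date_format() from scales
library(tibble)
library(ggplot2)
library(scales)

# Generate sample data

# "English" is a Windows-style locale name; it gives English month labels
Sys.setlocale("LC_ALL","English")

set.seed(8)

# Overwritten by the rpois() call below; kept so the random stream is unchanged
Values <- sample(0:100, 24241, replace = TRUE)

Values <- rpois(24241, lambda = 60)

start <- as.POSIXct("2012-01-15 06:10:00")
interval <- 15  # seconds between observations
end <- start + as.difftime(4, units = "days") + as.difftime(5, units = "hours")

DateTimes <- seq(from = start, by = interval, to = end)

my_data_sample <- tibble(datetime = DateTimes, Value = Values)

threshold <- 82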


ggplot(data = my_data_sample, aes(x = datetime, y = Value)) +
  geom_line(size = 1, color = "darkgreen") +
  geom_hline(yintercept=threshold, linetype="dashed", color = "red") +
  theme_bw()  +
  labs(
    x= ""    ,
    y = "",
    title = paste("Threshold:", threshold )
    ) +
  scale_x_datetime(date_breaks = "8 hour", labels = date_format("%b %d - %H:%M")) +
  theme(axis.text.x = element_text(angle = 25, vjust = 1.0, hjust = 1.0))

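This is what I need:

I need to find clusters of values over the threshold (consecutive or close to each other), sort the clusters by their length in seconds (longest cluster) or by their sum of values (most powerful cluster), and plot the top 3 of those time periods on separate plots.

Any suggestions?

【Question comments】:

    Tags: r ggplot2 cluster-analysis outliers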


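    【Solution 1】:

    You can use run-length encoding (RLE) to find the runs you are looking for. At the RLE level, you can filter out runs that are too short on either side: short sub-threshold gaps between clusters, and short super-threshold runs. You can play around with the run_threshold values until they match your data.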
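To make the gap-filling idea concrete, here is a minimal sketch on a toy logical vector. It is not part of the original answer; the vector `above` and the name `max_gap` are assumptions chosen for illustration only.

```r
# Toy logical vector: TRUE marks a value above the threshold
above <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE,
           FALSE, FALSE, FALSE, FALSE, TRUE)

runs <- rle(above)
# runs$lengths: 2 1 3 4 1
# runs$values:  TRUE FALSE TRUE FALSE TRUE

# Treat FALSE gaps shorter than max_gap as part of a cluster
max_gap <- 2  # hypothetical value, tune it for your data
runs$values[!runs$values & runs$lengths < max_gap] <- TRUE

# Decode and re-encode so that adjacent TRUE runs merge into one
merged <- rle(inverse.rle(runs))
merged$lengths  # 6 4 1
merged$values   # TRUE FALSE TRUE
```

The single-element gap is absorbed into one cluster of length 6, while the longer gap of 4 still separates it from the last value.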

    # Put some actual deviating runs in the data
    my_data_sample$Value[5001:5100] <- rpois(100, lambda = 80)
    my_data_sample$Value[10001:11000] <- rpois(1000, lambda = 80)
    
    threshold <-  82
    
    rle <- rle(my_data_sample$Value > threshold)
    # Find sub-threshold values in between super-threshold values,
    # convert these to other class
    run_threshold <- 20
    rle$values[!rle$values & rle$lengths < run_threshold] <- TRUE
    # Restructure rle
    rle <- rle(inverse.rle(rle))
    
    # Find short super-threshold values to filter
    run_threshold <- 5
    rle$values[rle$values & rle$lengths < run_threshold] <- FALSE
    rle <- rle(inverse.rle(rle))
    
    # Find run starts and ends (row indices into my_data_sample)
    rle_end <- cumsum(rle$lengths)
    rle_start <- rle_end - rle$lengths + 1
    
    # Format as data.frame for ggplot
    rle_df <- data.frame(
      min = my_data_sample$datetime[rle_start],
      max = my_data_sample$datetime[rle_end],
      value = rle$values
    )
    
    ggplot(data = my_data_sample, aes(x = datetime, y = Value)) +
      geom_line(size = 1, color = "darkgreen") +
      geom_rect(aes(xmin = min, xmax = max, ymin = 0, ymax = 10, fill = value),
                data = rle_df, inherit.aes = FALSE) +
      geom_hline(yintercept=threshold, linetype="dashed", color = "red") +
      theme_bw()  +
      labs(
        x= ""    ,
        y = "",
        title = paste("Threshold:", threshold )
      ) +
      scale_x_datetime(date_breaks = "8 hour", labels = date_format("%b %d - %H:%M")) +
      theme(axis.text.x = element_text(angle = 25, vjust = 1.0, hjust = 1.0))
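The question also asks to rank clusters by duration or by sum of values. The following is a rough sketch of one way to do that with the same RLE steps, not something taken from the answer itself. The helper `summarise_clusters()`, its arguments `max_gap` and `min_len`, and the toy data are all hypothetical. Duration is measured from the first to the last timestamp of a cluster, and the sum includes any sub-threshold values inside a merged gap.

```r
# Hypothetical helper: one row per above-threshold cluster
summarise_clusters <- function(datetime, value, threshold, max_gap, min_len) {
  runs <- rle(value > threshold)
  # Merge clusters separated by sub-threshold gaps shorter than max_gap
  runs$values[!runs$values & runs$lengths < max_gap] <- TRUE
  runs <- rle(inverse.rle(runs))
  # Drop clusters shorter than min_len observations
  runs$values[runs$values & runs$lengths < min_len] <- FALSE
  runs <- rle(inverse.rle(runs))

  run_end <- cumsum(runs$lengths)
  run_start <- run_end - runs$lengths + 1
  keep <- which(runs$values)
  s <- run_start[keep]
  e <- run_end[keep]

  data.frame(
    start = datetime[s],
    end = datetime[e],
    duration_secs = as.numeric(difftime(datetime[e], datetime[s], units = "secs")),
    n_points = e - s + 1,
    total = vapply(seq_along(s),
                   function(i) sum(as.numeric(value[s[i]:e[i]])),
                   numeric(1))
  )
}

# Toy series: 40 observations, 15 seconds apart (assumed data)
datetime <- as.POSIXct("2012-01-15 06:10:00", tz = "UTC") + 15 * (0:39)
value <- c(rep(60, 5), rep(150, 6), rep(60, 10),
           rep(95, 3), 60, rep(95, 4), rep(60, 11))

clusters <- summarise_clusters(datetime, value, threshold = 82,
                               max_gap = 3, min_len = 5)

# Top 3 by duration (longest) and by sum of values (most powerful)
longest <- head(clusters[order(-clusters$duration_secs), ], 3)
strongest <- head(clusters[order(-clusters$total), ], 3)
```

In this toy series the two rankings disagree: the second cluster lasts longer, but the first has the larger sum.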
    
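For the "separate plots" part of the question, one possible approach is to build one ggplot object per cluster window. This is only a sketch under stated assumptions: the data frame `series`, the table `top` of cluster start and end times, and the function `plot_cluster()` are hypothetical stand-ins, not names from the answer.

```r
library(ggplot2)

# Assumed inputs: a series and a table of the clusters to plot
series <- data.frame(
  datetime = as.POSIXct("2012-01-15 06:10:00", tz = "UTC") + 15 * (0:39),
  Value = c(rep(60, 5), rep(150, 6), rep(60, 10),
            rep(95, 3), 60, rep(95, 4), rep(60, 11))
)
top <- data.frame(
  start = series$datetime[c(6, 22)],
  end = series$datetime[c(11, 29)]
)
threshold <- 82

# Build the plot for the i-th cluster, restricted to its time window
plot_cluster <- function(i) {
  in_window <- series$datetime >= top$start[i] & series$datetime <= top$end[i]
  ggplot(series[in_window, ], aes(x = datetime, y = Value)) +
    geom_line(color = "darkgreen") +
    geom_hline(yintercept = threshold, linetype = "dashed", color = "red") +
    theme_bw() +
    labs(x = "", y = "",
         title = paste("Cluster", i, "starting", format(top$start[i])))
}

plots <- lapply(seq_len(nrow(top)), plot_cluster)
# print(plots[[1]]) draws the first cluster on its own plot
```

Keeping the plots in a list makes it easy to print them one by one or hand them to a layout package later.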

    【Comments】:

    • How can I calculate the power of each group of values > threshold? Something like this: rle_df <- data.frame(min = one_machine_privDataOnly$datetime[rle_start], max = one_machine_privDataOnly$datetime[rle_end], value = rle$values, mean = ??????? ) %>% filter(value == T)
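One way the per-run mean asked about here might be computed is to label every observation with the index of its run and aggregate by that label. This is a sketch on toy data, not a reply from the original thread; `x`, `run_id`, and `run_df` are illustrative names.

```r
# Toy values and a threshold of 5 (assumed for illustration)
x <- c(1, 9, 9, 9, 1, 1, 8, 8)
runs <- rle(x > 5)

run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# Label each observation with the index of the run it belongs to
run_id <- rep(seq_along(runs$lengths), runs$lengths)

run_df <- data.frame(
  start = run_start,
  end = run_end,
  value = runs$values,
  mean = as.numeric(tapply(x, run_id, mean)),
  sum = as.numeric(tapply(x, run_id, sum))
)

# Keep only the above-threshold runs
above_df <- run_df[run_df$value, ]
```

Because `run_id` is built from the same `rle` object as `run_start` and `run_end`, the aggregated columns line up row by row with the run boundaries.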