Python - finding pattern in a plot

【Question title】: Python - finding pattern in a plot
【Posted】: 2017-11-11 12:49:49
【Question description】:

This plot was generated by the following gnuplot script. The estimated.csv file is available at this link: https://drive.google.com/open?id=0B2Iv8dfU4fTUaGRWMm9jWnBUbzg

# ###### GNU Plot
   set style data lines
   set terminal postscript eps enhanced color "Times" 20

   set output "cubic33_cwd_estimated.eps"

   set title "Estimated signal"

    set style line 99 linetype 1 linecolor rgb "#999999" lw 2
    #set border 1 back ls 11
    set key right top
    set key box linestyle 50
    set key width -2
    set xrange [0:10]
    set key spacing 1.2
    #set nokey

    set grid xtics ytics mytics
    #set size 2
    #set size ratio 0.4

    #show timestamp
    set xlabel "Time [Seconds]"
    set ylabel "Segments"

    set style line 1 lc rgb "#ff0000" lt 1 pi 0 pt 4 lw 4 ps 0

    # Congestion control send window

    plot  "estimated.csv" using ($1):2 with lines title "Estimated";

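I want to find the pattern in the estimated signal of the previous plot that comes close to the plot below. My ground truth (the actual signal) is shown in the following plot.

This was my initial approach: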

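#!/usr/bin/env python
import sys

import numpy as np
from shapely.geometry import LineString
#-------------------------------------------------------------------------------
def load_data(fname):
    # Each CSV row is an (x, y) pair; the rows become the vertices of a LineString.
    return LineString(np.genfromtxt(fname, delimiter = ','))
#-------------------------------------------------------------------------------
lines = list(map(load_data, sys.argv[1:]))

# The intersection may be a single geometry or a collection; Shapely 2.x
# requires iterating over collections through .geoms.
result = lines[0].intersection(lines[1])
for g in getattr(result, 'geoms', [result]):
    if g.geom_type != 'Point':
        continue
    print('%f,%f' % (g.x, g.y))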

I then call this Python script directly from gnuplot, as follows:

set terminal pngcairo
set output 'fig.png'

set datafile separator comma
set yr [0:700]
set xr [0:10]

set xtics 0,2,10
set ytics 0,100,700

set grid

set xlabel "Time [seconds]"
set ylabel "Segments"

plot \
    'estimated.csv' w l lc rgb 'dark-blue' t 'Estimated', \
    'actual.csv' w l lc rgb 'green' t 'Actual', \
    '<python filter.py estimated.csv actual.csv' w p lc rgb 'red' ps 0.5 pt 7 t ''

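This gives the following plot. However, it does not seem to give me the right pattern, since gnuplot is not the best tool for this kind of task.

Is there a way to find the pattern of the first plot (estimated.csv) with Python, by forming a curve from its peaks? Looking from the end, the pattern actually seems to be visible. Any help would be appreciated.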

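【Discussion】:

Have you tried a Kalman filter? It should follow the curve the way you want.
Basically, it tries to "follow" your curve with a velocity, so it smooths your signal, but on reflection I don't think it applies to your case. :/ It is great for removing noise and finding the "real" signal, but not much help here; still interesting if you need it in the future. en.wikipedia.org/wiki/Kalman_filter#/media/File:Kalman.png
OK, thanks. I will read up on it.
After some thought: you could use a peak detection algorithm, then a clustering algorithm such as DBSCAN to remove outliers, and finally a Kalman filter to smooth the result. :D stackoverflow.com/a/22583761/6341054 scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html scipy-cookbook.readthedocs.io/items/KalmanFiltering.html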
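
The peak-detection step suggested in the last comment could look roughly like the sketch below. It uses scipy.signal.find_peaks on a synthetic signal; the signal itself, the distance=20 setting, and all variable names are assumptions for illustration and are not derived from estimated.csv. The DBSCAN and Kalman steps are left out.

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic stand-in for estimated.csv (assumption: the real file is not used).
t = np.linspace(0.0, 10.0, 2001)
envelope = 100.0 + 40.0 * t                                # slowly rising pattern
signal = envelope * np.abs(np.sin(2.0 * np.pi * 2.0 * t))  # spiky signal below it

# Local maxima that are at least 20 samples apart (a tunable assumption).
peak_idx, _ = find_peaks(signal, distance=20)
peak_t, peak_y = t[peak_idx], signal[peak_idx]

# Connecting (peak_t, peak_y) gives a curve that follows the upper envelope.
print(len(peak_idx), peak_y[:3])
```

Plotting peak_t against peak_y instead of the raw signal is one way to "form a curve from the peaks" as asked in the question.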

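Tags: python python-3.x numpy scipy time-series

【Solution 1】:

I think a rolling maximum is the right approach (Series.rolling(...).max(); older pandas versions exposed this as pd.rolling_max()). We load the data into a DataFrame and compute the rolling maximum over 8500 values. The curves look similar afterwards. You can experiment with the parameters to optimize the result.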

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
plt.ion()
names = ['actual.csv','estimated.csv']
#-------------------------------------------------------------------------------
def load_data(fname):
    return np.genfromtxt(fname, delimiter = ',')
#-------------------------------------------------------------------------------

data = [load_data(name) for name in names]
actual_data = data[0]
estimated_data = data[1]
df = pd.read_csv('estimated.csv', names=('x','y'))
# pd.rolling_max() was removed from later pandas releases; use Series.rolling().max().
df['rolling_max'] = df['y'].rolling(8500).max()
plt.figure()
plt.plot(actual_data[:,0],actual_data[:,1], label='actual')
plt.plot(estimated_data[:,0],estimated_data[:,1], label='estimated')
plt.plot(df['x'], df['rolling_max'], label = 'rolling')

plt.legend()
plt.title('Actual vs. Interpolated')
plt.xlim(0,10)
plt.ylim(0,500)
plt.xlabel('Time [Seconds]')
plt.ylabel('Segments')
plt.grid()
plt.show(block=True)
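
Since the CSV files may not be at hand, here is a minimal sketch of the same idea on a synthetic sawtooth. The signal, the window of 50 samples, and the names are assumptions chosen for illustration; the window only needs to be longer than the distance between neighbouring peaks.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for estimated.csv: a sawtooth whose peaks follow a slow ramp.
x = np.linspace(0.0, 10.0, 1001)
ramp = 100.0 + 30.0 * x
y = ramp * ((x * 4.0) % 1.0)     # falls back towards 0 four times per second

df = pd.DataFrame({'x': x, 'y': y})
window = 50                       # assumed: about two sawtooth periods (25 samples each)
df['rolling_max'] = df['y'].rolling(window).max()

# Where it is defined, the rolling maximum never lies below the raw signal
# and no longer shows the periodic drops.
valid = df['rolling_max'].notna()
print(df.loc[valid, ['x', 'y', 'rolling_max']].head())
```

A window that is too short lets the drops show through, while a very long one flattens the curve, which is why the window length is worth tuning.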

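In response to the question in the comments:

Because rolling() works on a window of fixed size, the first values of rolling().max() are NaN (with a window of 8500, the first 8499). To replace these NaNs, I suggest reversing the whole series and computing the window backwards. We can then replace all NaNs with the backward-computed values. I adjusted the window length for the backward computation; otherwise we would get wrong data.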
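
A tiny series makes the mechanics visible. The values and the window lengths (3 forward, 2 backward) are made up for illustration; the point is that fillna aligns on the index, so the reversed series fills the right positions.

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

forward = s.rolling(3).max()          # first 2 entries are NaN
backward = s[::-1].rolling(2).max()   # window runs from the end; the last entry is NaN
filled = forward.fillna(backward)     # NaNs are filled by index label, not by position

print(filled.tolist())
# [3.0, 4.0, 4.0, 4.0, 5.0, 9.0, 9.0, 9.0]
```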

This code works:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
plt.ion()

df = pd.read_csv('estimated.csv', names=('x','y'))
df['rolling_max'] = df['y'].rolling(8500).max()
df['rolling_max_backwards'] = df['y'][::-1].rolling(850).max()
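# Assign the result back instead of using inplace=True on a column accessor,
# which newer pandas versions may apply to a temporary copy.
df['rolling_max'] = df['rolling_max'].fillna(df['rolling_max_backwards'])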
plt.figure()
plt.plot(df['x'], df['rolling_max'], label = 'rolling')

plt.legend()
plt.title('Actual vs. Interpolated')
plt.xlim(0,10)
plt.ylim(0,700)
plt.xlabel('Time [Seconds]')
plt.ylabel('Segments')
plt.grid()
plt.show(block=True)

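We get the following result:

【Discussion】:

Hi Franz. Thank you very much. Please don't use the actual.csv file. It is my ground truth and should not be given to the program. The pattern should be detected from estimated.csv alone.
It is not used in the computation. I only use it to show the similarity. If you like, remove the lines that use actual.csv.
I added a solution to the code that handles the missing values. It is not as clean as I would like, but it works.
Hi Franz, have you worked with Kalman filters before? I wonder whether we could try fitting the same data with a Kalman filter model. Thanks!
Hi Desta, unfortunately I'm not really into Kalman filters. I have only read about them and have not needed them so far. Have you tried something like pykalman: pykalman.github.io?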
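
For readers curious about the Kalman filter idea raised above, here is a minimal scalar sketch written directly in NumPy. It is not the pykalman API; the function name kalman_1d, the random-walk state model, and both variance values are assumptions for illustration. As noted in the earlier comments, such a filter smooths noise around the signal rather than tracing its peaks, so it may suit a stage after peak detection better than the raw data.

```python
import numpy as np

def kalman_1d(z, process_var, meas_var):
    """Scalar Kalman filter for a random-walk state observed with noise."""
    x_est = np.empty(len(z))
    x, p = z[0], 1.0                 # initial state estimate and its variance
    for i, zi in enumerate(z):
        p = p + process_var          # predict: state unchanged, uncertainty grows
        k = p / (p + meas_var)       # Kalman gain
        x = x + k * (zi - x)         # update with the measurement residual
        p = (1.0 - k) * p            # uncertainty shrinks after the update
        x_est[i] = x
    return x_est

# Synthetic data (assumption): a ramp with Gaussian noise of standard deviation 20.
rng = np.random.default_rng(0)
truth = np.linspace(100.0, 400.0, 500)
noisy = truth + rng.normal(0.0, 20.0, size=truth.size)
smooth = kalman_1d(noisy, process_var=25.0, meas_var=400.0)

print(np.mean((noisy - truth) ** 2), np.mean((smooth - truth) ** 2))
```

A larger process_var makes the estimate follow the data more closely, while a larger meas_var makes it smoother but slower to react.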