【Question title】: How do I discretize a continuous function avoiding noise generation (see picture)
【Posted】: 2022-01-16 17:43:32
【Question】:

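I have a continuous input function that I want to discretize into 5-10 discrete bins between 1 and 0. Right now I am using np.digitize and rescaling the output bins to 0-1. The problem is that some datasets (blue line) yield results like this:

I tried increasing the number of discretization bins, but the noise stayed the same and I only got more increments. As an example where the algorithm worked with the same settings but another dataset:

This is the code I am using there, where NumOfDisc is the number of bins: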

intervals = np.linspace(0,1,NumOfDisc)
discretized_Array = np.digitize(Continuous_Array, intervals)
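
The snippet above stops at the bin indices and does not show the rescaling to 0-1 mentioned in the question. A minimal sketch of one way to do it, assuming the input already lies in [0, 1]; the function name `discretize_to_levels` and the choice of mapping each sample to the lower edge of its bin are assumptions, not the asker's actual code:

```python
import numpy as np


def discretize_to_levels(continuous, num_edges):
    """Snap each sample in [0, 1] to the lower edge of the bin it falls in."""
    intervals = np.linspace(0, 1, num_edges)
    # np.digitize returns i such that intervals[i-1] <= x < intervals[i]
    idx = np.digitize(continuous, intervals)
    # Shift to 0-based indices; the clip guards against samples outside [0, 1]
    idx = np.clip(idx - 1, 0, num_edges - 1)
    return intervals[idx]


levels = discretize_to_levels(np.array([0.0, 0.1, 0.3, 0.5, 1.0]), 5)
print(levels)
```

Because the returned values are the bin edges themselves, the output is already on the same 0-1 scale as the input and can be plotted on the same axis.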

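The red line in the plots is not important. The continuous blue line is what I am trying to discretize, and the green line is the discretized result. The plots are created with matplotlib.pyplot using the following code: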

import logging
import matplotlib.pyplot as plt

def CheckPlots(discretized_Array, Continuous_Array, Temperature, time, PlotName):
    logging.info("Plotting...")

    # Setting axis properties and titles
    fig, ax = plt.subplots(1, 1)
    ax.set_title(PlotName)
    ax.set_ylabel('Temperature [°C]')
    ax.set_ylim(40, 110)
    ax.set_xlabel('Time [s]')
    # Positional flag works across Matplotlib versions (the b= keyword was renamed to visible=)
    ax.grid(True, which="both")
    ax2 = ax.twinx()
    ax2.set_ylabel('DC Power [%]')
    ax2.set_ylim(-1.5, 3.5)

    # Plotting: temperature on the left axis, both power curves on the right axis
    ax.plot(time, Temperature, label="Input Temperature", color='#c70e04')
    ax2.plot(time, Continuous_Array, label="Continuous Power", color='#040ec7')
    ax2.plot(time, discretized_Array, label="Discrete Power", color='#539600')

    fig.legend(loc="upper left", bbox_to_anchor=(0, 1), bbox_transform=ax.transAxes)

    logging.info("Done!")
    logging.info("---")
    return

Any ideas what I can do to get a sensible discretization like in the second case?

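【Comments】:

Could you add a minimal reproducible example?
I am very sorry, but I do not understand what you mean.
No problem. Could you add a piece of code that can be copy-pasted to get the plots you show here? That makes it easier for others to try it out and work with it.
I updated the question. Is it better now?
Please note that you are expected to know what a minimal reproducible example is before posting.

Tags: python numpy discretization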

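【Solution 1】:

The following solution gives the exact result you need.

Basically, the algorithm finds an ideal line and tries to replicate it as closely as it can with fewer data points. It starts with 2 points at the edges (a straight line), then adds one in the center, then checks which segment has the largest error and adds a point at its center, and so on until it reaches the desired bin count. Simple :)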

import warnings
import numpy as np

# On NumPy >= 2.0 this class lives at np.exceptions.RankWarning
warnings.simplefilter('ignore', np.RankWarning)


def line_error(x0, y0, x1, y1, ideal_line, integral_points=100):
    """Assume a straight line between (x0,y0)->(x1,y1). Then sample the perfect line multiple times and compute the distance."""
    straight_line = np.poly1d(np.polyfit([x0, x1], [y0, y1], 1))
    xs = np.linspace(x0, x1, num=integral_points)
    ys = straight_line(xs)

    perfect_ys = ideal_line(xs)
    
    err = np.abs(ys - perfect_ys).sum() / integral_points * (x1 - x0)  # Remove (x1 - x0) to only look at avg errors
    return err


def discretize_bisect(xs, ys, bin_count):
    """Returns xs and ys of discrete points"""
    # For a large number of datapoints, without loss of generality you can treat xs and ys as bin edges
    # If it gives bad results, you can compute edges in other ways, e.g. with np.histogram_bin_edges
    ideal_line = np.poly1d(np.polyfit(xs, ys, 50))
    
    new_xs = [xs[0], xs[-1]]
    new_ys = [ys[0], ys[-1]]
    
    while len(new_xs) < bin_count:
        
        errors = []
        for i in range(len(new_xs)-1):
            err = line_error(new_xs[i], new_ys[i], new_xs[i+1], new_ys[i+1], ideal_line)
            errors.append(err)

        max_segment_id = np.argmax(errors)
        new_x = (new_xs[max_segment_id] + new_xs[max_segment_id+1]) / 2
        new_y = ideal_line(new_x)
        new_xs.insert(max_segment_id+1, new_x)
        new_ys.insert(max_segment_id+1, new_y)

    return new_xs, new_ys


BIN_COUNT = 25

new_xs, new_ys = discretize_bisect(xs, ys, BIN_COUNT)

plot_graph(xs, ys, new_xs, new_ys, f"Discretized and Continuous comparison, N(cont) = {N_MOCK}, N(disc) = {BIN_COUNT}")
print("Bin count:", len(new_xs))

Also, here is the simplified plotting function I tested with.

def plot_graph(cont_time, cont_array, disc_time, disc_array, plot_name):
    """A simplified version of the provided plotting function"""
    
    # Setting Axis properties and titles
    fig, ax = plt.subplots(figsize=(20, 4))
    ax.set_title(plot_name)
    ax.set_xlabel('Time [s]')
    ax.set_ylabel('DC Power [%]')

    # Plotting stuff
    ax.plot(cont_time, cont_array, label="Continuous Power", color='#0000ff')
    ax.plot(disc_time, disc_array, label="Discrete Power",   color='#00ff00')

    fig.legend(loc="upper left", bbox_to_anchor=(0,1), bbox_transform=ax.transAxes)
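
The names `xs`, `ys` and `N_MOCK` used above are not defined in this post; they come from the linked notebook. To try the code locally, a hypothetical stand-in could be a smooth ramp between two flat regions, roughly like the blue curve described in the question. The curve shape, time span and sample count below are assumptions:

```python
import numpy as np

N_MOCK = 1000  # number of samples in the mock continuous signal (assumed)

# Time axis in seconds (assumed span)
xs = np.linspace(0.0, 100.0, N_MOCK)
# Logistic ramp from about 0 to about 1 with flat regions at both ends
ys = 1.0 / (1.0 + np.exp(-(xs - 50.0) / 5.0))
```

With these three names defined, the call to `discretize_bisect` and the plotting function above have everything they reference.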

Finally, here is the Google Colab

【Comments】:

Thank you very much!!

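【Solution 2】:

If what I described in the comments is the problem, there are a few options to deal with it:

1. Do nothing: depending on why you are discretizing, you may want the discrete values to reflect the continuous values accurately.
2. Change the bins: you could shift the bins or change the number of bins so that the relatively "flat" parts of the blue line stay within a single bin, which gives a flat green line in those sections as well. That would be visually more pleasing, like in your second plot.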
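
To illustrate the second option: when a nearly flat stretch of the signal sits right on a bin edge, tiny wobbles flip the output between two bins, while moving the edges by half a bin width keeps the stretch inside one bin. A small sketch with a synthetic plateau; the signal, the helper name `count_jumps` and the half-bin shift are illustrative assumptions:

```python
import numpy as np


def count_jumps(signal, edges):
    """Count how often the bin index changes between consecutive samples."""
    return int(np.count_nonzero(np.diff(np.digitize(signal, edges))))


t = np.linspace(0, 1, 200)
# Nearly flat stretch at 0.5 with a small wobble of +/-0.01
plateau = 0.5 + 0.01 * np.sin(40 * np.pi * t)

edges = np.linspace(0, 1, 5)            # 0.5 is itself a bin edge
bin_width = edges[1] - edges[0]
shifted_edges = edges - bin_width / 2   # 0.5 now sits in the middle of a bin

print(count_jumps(plateau, edges))          # wobble crosses the edge repeatedly
print(count_jumps(plateau, shifted_edges))  # 0
```

This also suggests why adding more bins alone may not help: more edges means more places where a flat stretch can end up sitting on an edge.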

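【Comments】:

1. Not an option, since the values need to be discretized. 2. I played with this, but for some reason increasing the number of bins did not help... Right now I am trying a new idea: I first hard-code the two constant lines at the beginning and the end, and then apply np.digitize only to the remaining dynamic part between the two constant values.
Sorry, maybe I did not explain the first option well. I meant: discretize the way you do now, then do nothing else and accept that the method gives you a wobbly green line. I did not mean "don't discretize".
Also, I see that Morton's solution works well, but it is not the same as mapping the y-values of a continuous function to X bins. If Morton's solution is indeed what you want, great! If not, I can update my answer to explain in more detail what I mean. Let me know!
Actually, you are right. Morton's solution is great, very elaborate and extensive, but it does not really map the continuous input into discrete bins. I spent some time thinking about how to improve this further. I hard-coded the constant regions at the beginning and the end, since they are always the same (by design of the experiment), and applied my discretization method to the rest. The result is slightly better, but still not perfect.
Then I changed the way the intervals are created in the code to intervals = np.arange(min,max,0.05), where min and max are the lowest and highest values and 0.05 is the step size.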
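
The change described in the last comment can be sketched as follows. The helper name `digitize_with_step` is made up for illustration, and the edges are taken from the data's own minimum and maximum as the comment describes. Note that np.arange excludes the stop value, so the last edge lies below the maximum and every sample at or above it lands in the top bin:

```python
import numpy as np


def digitize_with_step(continuous, step=0.05):
    """Bin samples using edges spaced `step` apart across the data range."""
    lo, hi = continuous.min(), continuous.max()
    # Edges start at the minimum; the maximum itself is not included
    intervals = np.arange(lo, hi, step)
    return np.digitize(continuous, intervals), intervals


data = np.array([0.10, 0.12, 0.31, 0.33])
idx, intervals = digitize_with_step(data)
print(idx, intervals)
```

With a fixed step, the number of bins follows from the range of the data rather than being chosen up front, which is the main difference from the np.linspace version in the question.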