【Question title】: Smoothing / interpolating categorical data (fast)
【Posted】: 2022-01-13 06:59:22
【Question description】:

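I am currently working with an array of categorical data. The categories are: None, zoneA, zoneB. The array holds measurements from a sensor, and each measurement tells me whether the sensor was in zoneA, in zoneB, or in neither at that moment.

My goal is to smooth these values.

For example, the sensor might be outside zoneA or zoneB for a stretch of 30 measurements; when that happens, I want those measurements to be "smoothed".

For example:

the array [zoneA, zoneA, zoneA, None, None, zoneA, zoneA, None, None, None, zoneA]

should give

the array [zoneA, zoneA, zoneA, zoneA, zoneA, zoneA, zoneA, None, None, None, zoneA]

with a threshold of 2.

At the moment I iterate over the array, but that is too computationally expensive: a run can take 1 or 2 minutes. Is there an existing algorithm that solves this problem?

My current code: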

 def smooth(self, df: pd.DataFrame) -> pd.DataFrame:
    """
    Args:
        df (pd.DataFrame): dataframe with landlot column to smooth.
    Returns:dataframe smoothed
    """
    df_iter = df
    last = "None"
    last_index = 0
    for num, line in df_iter.iterrows():
        if (
                (line.landlot != "None")
                and (line.landlot == last)
                and (num - last_index <= self.delay)
                and (
                df_iter.iloc[(num - 1), df_iter.columns.get_loc("landlot")]
                == "None"
        )
        ):
            df_iter.iloc[
            last_index: (num + 1),  # noqa: E203
            df_iter.columns.get_loc("landlot"),
            ] = last
        if line.landlot != "None":
            last = line.landlot
            last_index = num
    return df_iter

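【Question comments】:

  • What should [zoneA, None, zoneB] become?
  • Hi @timgeb, it should stay [zoneA, None, zoneB]; smoothing only happens between two equal measurements.
  • Why is [zoneA, None, None, zoneA] smoothed while [zoneA, None, None, None, zoneA] is not? Edit: ah, is that what the threshold of 2 means?
  • Could you give an example of your current code?
  • @EdgarH I just edited my post to include it. A landlot is a "zone".

Tags: python algorithm numpy iteration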


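【Solution 1】:

Python implementation

I like to start things like this in a clean and simple way. So I just wrote a simple class that does exactly what is needed, without thinking much about optimization. I call it Interpolator, because this looks to me like categorical interpolation.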

class Interpolator:
    def __init__(self, data):
        self.data = data
        self.current_idx = 0
        self.current_nan_region_start = None
        self.result = None
        self.maxgap = 1

    def run(self, maxgap=2):
        # Initialization
        self.result = [None] * len(self.data)
        self.maxgap = maxgap
        self.current_nan_region_start = None
        prev_isnan = 0

        for idx, item in enumerate(self.data):
            isnan = item is None
            self.current_idx = idx
            if isnan:
                if prev_isnan:
                    # Result is already filled with empty data.
                    # Do nothing.
                    continue
                else:
                    self.entered_nan_region()
                    prev_isnan = 1
            else:  # not nan
                if prev_isnan:
                    self.exited_nan_region()
                    prev_isnan = 0
                else:
                    self.continuing_in_categorical_region()

    def entered_nan_region(self):
        self.current_nan_region_start = self.current_idx

    def continuing_in_categorical_region(self):
        self.result[self.current_idx] = self.data[self.current_idx]

    def exited_nan_region(self):

        nan_region_end = self.current_idx - 1
        nan_region_length = nan_region_end - self.current_nan_region_start + 1

        # Always copy the empty region endpoint even if gap is not filled
        self.result[self.current_idx] = self.data[self.current_idx]

        if nan_region_length > self.maxgap:
            # Do not interpolate as exceeding maxgap
            return

        if self.current_nan_region_start == 0:
            # Special case. data starts with "None"
            # ->  Cannot interpolate
            return

        if self.data[self.current_nan_region_start - 1] != self.data[self.current_idx]:
            # Do not fill as both ends of missing data
            # region do not have same value
            return

        # Fill the gap
        for idx in range(self.current_nan_region_start, self.current_idx):
            self.result[idx] = self.data[self.current_idx]


def interpolate(data, maxgap=2):
    """
    Interpolate categorical variables over missing
    values (None's).

    Parameters
    ----------
    data: list of objects
        The data to interpolate. Holds
        categorical data, such as 'cat', 'dog'
        or 108. None is handled as missing data.
    maxgap: int
        The maximum gap to interpolate over.
        For example, with maxgap=2, ['car', None,
        None, 'car', None, None, None, 'car']
        would become ['car', 'car', 'car', 'car',
        None, None, None, 'car'].

    Note: Interpolation will only occur on missing
    data regions where both ends contain the same value.
    For example, [1, None, 2, None, 2] will become
    [1, None, 2, 2, 2].
    """

    interpolator = Interpolator(data)
    interpolator.run(maxgap=maxgap)
    return interpolator.result

This is how to use it (the code for get_data() is given below):

data = get_data(k=100)
interpolated_data = interpolate(data)
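
As a quick check against the sample from the question, the same rules (fill a run of None only if it is at most maxgap long and both neighbours are equal) can also be sketched with itertools.groupby. This is a compact stand-alone sketch, not the class above; fill_short_gaps is a hypothetical name introduced only for this illustration.

```python
from itertools import groupby


def fill_short_gaps(data, maxgap=2):
    """Fill runs of None no longer than maxgap when both neighbours match."""
    result = list(data)
    pos = 0
    # groupby splits the data into alternating runs: missing / not missing.
    for is_gap, run in groupby(data, key=lambda item: item is None):
        length = sum(1 for _ in run)
        start, end = pos, pos + length  # the run occupies result[start:end]
        pos = end
        if not is_gap or length > maxgap:
            continue
        if start == 0 or end == len(result):
            continue  # the gap touches an edge, so it has only one neighbour
        if result[start - 1] == result[end]:
            result[start:end] = [result[end]] * length
    return result


sample = ["zoneA", "zoneA", "zoneA", None, None,
          "zoneA", "zoneA", None, None, None, "zoneA"]
print(fill_short_gaps(sample, maxgap=2))
# ['zoneA', 'zoneA', 'zoneA', 'zoneA', 'zoneA', 'zoneA', 'zoneA', None, None, None, 'zoneA']
```

The first gap (length 2) is filled, while the second (length 3) exceeds maxgap and is left untouched, which matches the expected output given in the question.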

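Copy-paste Cython implementation

The Python implementation is most likely fast enough: with an array of 1,000,000 items, processing the data takes 0.504 seconds on my laptop. Still, creating Cython versions is fun and may give some extra speedup.

Steps required:

  • Copy-paste the Python implementation into a new file called fast_categorical_interpolate.pyx
  • Create setup.py in the same folder, with the following contents: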
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        "fast_categorical_interpolate.pyx",
        language_level="3",
    ),
)
  • Run python setup.py build_ext --inplace to build the Cython extension. You should see something like fast_categorical_interpolate.cp38-win_amd64.pyd in the same folder.
  • Now you can use the interpolator like this:
import fast_categorical_interpolate as fpi
data = get_data(k=100)
interpolated_data = fpi.interpolate(data)
  • Of course, you could make some optimizations in the Cython code to make it even faster, but on my machine the speed improved by 38% with N=1,000,000 and by 126% with N=10,000.
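
Since the question is tagged numpy, another possible direction is to avoid the Python-level loop altogether. The following is a minimal sketch of that idea using NumPy rather than Cython; it is not part of the implementation above and has not been benchmarked here. The name interpolate_numpy is hypothetical, and the sketch assumes missing values are stored as None in a one-dimensional sequence.

```python
import numpy as np


def interpolate_numpy(data, maxgap=2):
    """Fill runs of None of length <= maxgap whose two neighbours are equal."""
    values = np.asarray(data, dtype=object)
    n = len(values)
    missing = values == None  # noqa: E711 (elementwise test on an object array)

    # Locate each run of missing values as a half-open interval [start, end).
    padded = np.concatenate(([False], missing, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)

    # Drop runs touching either end of the data: they have only one neighbour.
    inner = (starts > 0) & (ends < n)
    starts, ends = starts[inner], ends[inner]

    # Keep runs that are short enough and enclosed by equal values.
    same = (values[starts - 1] == values[ends]).astype(bool)
    keep = (ends - starts <= maxgap) & same
    starts, ends = starts[keep], ends[keep]

    # Turn the kept intervals into a boolean mask via a difference array.
    marks = np.zeros(n + 1, dtype=np.int64)
    marks[starts] += 1
    marks[ends] -= 1
    fill = np.cumsum(marks[:-1]) > 0

    # Index of the most recent non-missing item at every position.
    last_valid = np.maximum.accumulate(np.where(missing, 0, np.arange(n)))

    result = values.copy()
    result[fill] = values[last_valid[fill]]
    return result
```

For a DataFrame like the one in the question, the column would first have to be converted, for example by mapping the string "None" back to None, before calling the function on the resulting sequence.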

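Timings on my machine

  • With N=100 (the number of items in the list), the Python implementation is about 160x faster and the Cython implementation about 250x faster than smooth.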
In [8]: timeit smooth(test_df, delay=2)
10.2 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: timeit interpolate(data)
64.8 µs ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [10]: timeit fpi.interpolate(data)
41.3 µs ± 4.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

  • With N=10,000, the speedup over smooth is about 190x (Python) and 302x (Cython).
In [5]: timeit smooth(test_df, delay=2)
1.08 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: timeit interpolate(data)
5.69 ms ± 852 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: timeit fpi.interpolate(data)
3.57 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

  • With N=1,000,000, the Python implementation is about 210x faster and the Cython implementation about 287x faster than smooth.
In [9]: timeit smooth(test_df, delay=2)
1min 45s ± 24.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: timeit interpolate(data)
504 ms ± 67.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: timeit fpi.interpolate(data)
365 ms ± 38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
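
The figures above come from IPython's timeit magic. Outside IPython, a similar measurement can be taken with the standard-library timeit module. The sketch below assumes hypothetical helpers make_data and best_time, and uses the built-in list as a stand-in for the function being measured; swap in interpolate or smooth as needed.

```python
import random
import timeit


def make_data(k, seed=0):
    """Generate k random categorical items (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.choices([None, "ZoneA", "ZoneB"], weights=[4, 3, 2], k=k)


def best_time(func, data, repeat=5, number=10):
    """Return the best per-call time in seconds over `repeat` rounds."""
    timings = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    return min(timings) / number


data = make_data(10_000)
# `list` is only a placeholder for the function under test.
print(f"{best_time(list, data) * 1e6:.1f} µs per call")
```

Taking the minimum over several rounds reduces the influence of background noise; the IPython output above reports mean and standard deviation instead, so the numbers are not directly comparable.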

Appendix

Test data generator get_data()

import random
random.seed(0)

def get_data(k=100):
    return random.choices(population=[None, "ZoneA", "ZoneB"], weights=[4, 3, 2], k=k)

The smooth() function and test data used for testing

import pandas as pd

data = get_data(k=1000)
test_df = pd.DataFrame(dict(landlot=data)).fillna("None")


def smooth(df: pd.DataFrame, delay=2) -> pd.DataFrame:
    """
    Args:
        df (pd.DataFrame): dataframe with landlot column to smooth.
    Returns:dataframe smoothed
    """
    df_iter = df
    last = "None"
    last_index = 0
    for num, line in df_iter.iterrows():
        if (
            (line.landlot != "None")
            and (line.landlot == last)
            and (num - last_index <= delay)
            and (df_iter.iloc[(num - 1), df_iter.columns.get_loc("landlot")] == "None")
        ):
            df_iter.iloc[
                last_index : (num + 1),  # noqa: E203
                df_iter.columns.get_loc("landlot"),
            ] = last
        if line.landlot != "None":
            last = line.landlot
            last_index = num
    return df_iter

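A note on the "current code"

I think there must be a copy-paste error somewhere, because the "current code" does not work as posted. I replaced self.delay with a delay=2 keyword argument to indicate the maximum gap; I assume that is what was intended. Even then, the logic does not work correctly on the simple example data you provided.

【Discussion】: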
