Identify duplicate time-series sequences in Postgres: answers

【Question title】: Identify duplicate time-series sequences in Postgres
【Posted】: 2014-12-09 13:37:53
【Question】:

I have a time-series table (in a Postgres database) with the columns

item_id,  country_id,  year,  month, value

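The table contains duplicate time series: they have the same country_id and the same dates/values, but were assigned different item_ids, for example 'Red Apples' and 'Apples, Red'.

How can I identify these duplicate time series? I want (country_id, year, month, value) to match for every date on which the item exists.

I'm a beginner, so please excuse any details I've left out. I'm mainly looking for a conceptual approach; I can implement it in Postgres or in python/Pandas.

For example, I would like to be able to detect something like this: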

item_id,     country_id,     year,     month,    value
-------------------------------------------------------
Red Apples   5               1996      1         300
Red Apples   5               1996      2         500
Red Apples   5               1996      3         370
Apples, Red  5               1996      1         300
Apples, Red  5               1996      2         500
Apples, Red  5               1996      3         370

I would like the output to look like this:

item_id1,     item_id2,      country_id,     year,     month_range
-----------------------------------------------------------------
Red Apples    Apples, Red         5          1996       [1,3]

This would also be fine:

item_id1,     item_id2,      country_id,     year,     time_month,   value
--------------------------------------------------------------------------
Red Apples    Apples, Red         5          1996         1           300
Red Apples    Apples, Red         5          1996         2           500
Red Apples    Apples, Red         5          1996         3           370

I thought about trying something like this:

select distinct A.country_id, A.item_id, B.item_id, A.year, A.month, A.value
                      from my_table as A,
                      my_table as B 
                      where
                      (A.country_id=B.country_id and 
                      A.item_id<>B.item_id and 
                      A.year=B.year and 
                      A.month=B.month and 
                      A.value=B.value )

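Then I would check that all dates/values are present for each identified pair of item_ids. If possible, though, I would like to check all dates/values in one go.

I'm not sure whether a table join is the right approach...?

【Comments】:

If your data has another entry, say Yellow Bananas,5,1996,1,300, does that count as a duplicate too?
I only want to identify duplicated time series, or at least subsequences, not coincidences on a single date.
What is the minimum length of a series? And how should series that span years be handled, such as 1996-12,1997-1?
Each (item_id, country_id) pair has several years of data, and I want to find stretches of at least 3 consecutive months that are identical. The exact output format doesn't matter, as long as it returns all item_ids, country_ids and dates for which the values are the same.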

Tags: python sql postgresql time-series

【Solution 1】:

See the update below!

Unless you provide more details about sample data and expected results, I think the following query might help:

SELECT country_id,  year,  month, value
  FROM a_table
 GROUP BY country_id,  year,  month, value
HAVING count(*) > 1;

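This query shows all entries that are identical except for item_id. If you want to find all rows that belong to the duplicate groups, use the following query: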

SELECT item_id, country_id,  year,  month, value
  FROM a_table
 WHERE (country_id,  year,  month, value)
    IN (
    SELECT country_id,  year,  month, value
      FROM a_table
     GROUP BY country_id,  year,  month, value
    HAVING count(*) > 1)
 ORDER BY country_id,  year,  month, value, item_id;

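I've put the item_id column last in the sort order, which should make the duplicates easier to spot. Feel free to adjust. This query may take a while, depending on your data.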
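
Since the question also allows a Python implementation, the same grouping idea can be sketched in plain Python. This is only an illustration and not part of the original answer: the `rows` list and the `find_shared_points` helper are hypothetical names, and the sample data mirrors the question plus one unrelated item.

```python
from collections import defaultdict

# Hypothetical sample rows: (item_id, country_id, year, month, value)
rows = [
    ("Red Apples", 5, 1996, 1, 300),
    ("Red Apples", 5, 1996, 2, 500),
    ("Red Apples", 5, 1996, 3, 370),
    ("Apples, Red", 5, 1996, 1, 300),
    ("Apples, Red", 5, 1996, 2, 500),
    ("Apples, Red", 5, 1996, 3, 370),
    ("Yellow Bananas", 5, 1996, 4, 120),
]


def find_shared_points(rows):
    """Group rows by (country_id, year, month, value) and keep only the
    groups with more than one row, like GROUP BY ... HAVING count(*) > 1."""
    groups = defaultdict(list)
    for item_id, country_id, year, month, value in rows:
        groups[(country_id, year, month, value)].append(item_id)
    return {key: sorted(items) for key, items in groups.items() if len(items) > 1}


shared = find_shared_points(rows)
for key in sorted(shared):
    print(key, shared[key])
```

Like the SQL above, this only finds individual data points shared by several item_ids; it does not yet check that the matches form a series.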

To avoid this situation (duplicate dates) in the future, you may want to create a unique constraint, like this:

ALTER TABLE a_table ADD CONSTRAINT u_cymv
    UNIQUE (country_id,  year,  month, value);

EDIT: After the comments were added, I came up with the following query to find series of duplicates:

WITH a_table(item_id,country_id,year,month,value) AS (VALUES
    ('Red Apples'::text,5,1996,1,300::numeric),
    ('Red Apples',5,1996,2,500),
    ('Red Apples',5,1996,3,370),
    ('Apples, Red',5,1996,1,300),
    ('Apples, Red',5,1996,2,500),
    ('Apples, Red',5,1996,3,370)
), dups AS (
    SELECT string_agg(item_id,'/') AS items,
           country_id,value,
           daterange(to_date(year::text||month,'YYYYMM'),
                     (to_date(year::text||month,'YYYYMM')
                      +INTERVAL'1mon')::date,'[)') AS range
      FROM a_table
     GROUP BY country_id,year,month,value
    HAVING count(*) > 1
)
SELECT grp,count(*),items,country_id,
       daterange(min(lower(range)), max(upper(range)), '[)') r,
       array_agg(value)
  FROM ( 
    SELECT items,country_id,range,value,
           sum(g) OVER (ORDER BY country_id, range) grp
      FROM (
        SELECT items,country_id,
               range,value,
               CASE WHEN lag(range) OVER (PARTITION BY country_id
                                          ORDER BY range) -|- range
                    THEN NULL ELSE 1 END g
          FROM dups) s
    ) s
 GROUP BY grp,country_id,items
HAVING count(*) >= 3
 ORDER BY country_id,r,items;

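What it does:

a_table is a copy of the sample data provided;
dups is the one that finds the duplicated records. I've also converted the year,month columns into a daterange, as I see no other way to properly find series that cross the New Year;
having listed the duplicates, I compare the previous range (within country_id) with the current one and set the group flag g if they are not adjacent;
next, I use the running-total effect of the sum() function to create a group identifier grp. For the sample data, however, this yields only one group;
finally, I use grp in the GROUP BY to group the data into series. I've also included country_id and items in the GROUP BY, but only to avoid wrapping them in aggregate functions; they're unique per grp anyway. I also build a new daterange column by hand, since the range type has no built-in aggregate function.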
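
The steps above can also be sketched in plain Python. This is not a translation of the query: it is a pairwise variant that reports each pair of item_ids separately, and it uses a month index (year * 12 + month - 1) in place of a daterange so that runs crossing the New Year stay adjacent. The `duplicate_runs` and `to_year_month` names, the `min_len` parameter and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample rows: (item_id, country_id, year, month, value)
rows = [
    ("Red Apples", 5, 1996, 1, 300),
    ("Red Apples", 5, 1996, 2, 500),
    ("Red Apples", 5, 1996, 3, 370),
    ("Apples, Red", 5, 1996, 1, 300),
    ("Apples, Red", 5, 1996, 2, 500),
    ("Apples, Red", 5, 1996, 3, 370),
    ("Yellow Bananas", 5, 1996, 1, 300),  # single-month coincidence
]


def to_year_month(month_idx):
    """Convert a month index back to (year, month)."""
    year, month0 = divmod(month_idx, 12)
    return (year, month0 + 1)


def duplicate_runs(rows, min_len=3):
    """Return (item_a, item_b, country_id, first, last, length) for every run
    of at least min_len consecutive months in which two different item_ids
    have the same value in the same country."""
    # Step 1: item_ids sharing the same (country_id, month index, value).
    by_point = defaultdict(set)
    for item_id, country_id, year, month, value in rows:
        by_point[(country_id, year * 12 + month - 1, value)].add(item_id)

    # Step 2: for each pair of items, the months in which they agree.
    months_by_pair = defaultdict(set)
    for (country_id, month_idx, _value), items in by_point.items():
        for a, b in combinations(sorted(items), 2):
            months_by_pair[(a, b, country_id)].add(month_idx)

    # Step 3: split each pair's months into runs of adjacent months.
    result = []
    for (a, b, country_id), months in sorted(months_by_pair.items()):
        ordered = sorted(months)
        start = prev = ordered[0]
        for m in ordered[1:] + [None]:  # None flushes the last run
            if m is not None and m == prev + 1:
                prev = m
                continue
            length = prev - start + 1
            if length >= min_len:
                result.append((a, b, country_id, to_year_month(start),
                               to_year_month(prev), length))
            start = prev = m
    return result


for run in duplicate_runs(rows):
    print(run)
```

For the sample rows this reports only the 'Apples, Red' / 'Red Apples' pair for 1996-01 through 1996-03; the Yellow Bananas coincidence is dropped because its run is a single month. On a table with millions of rows, enumerating pairs for every shared data point could get expensive, so treat this as a conceptual outline rather than a tuned implementation.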

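You may need to increase work_mem to, I'd say, 1GB before running this query (depending on the number of rows in the real table). Please give it a try and let me know whether it works for you. It would be great if you could share the EXPLAIN (analyze, buffers) output for it.

【Comments】:

Sorry, maybe my question is clearer now. I'm not trying to identify duplicate rows, but entire data series that were given 2 different names.
(Your suggestion might work, but I would really like to see the two conflicting item_id values)
Thanks, this gets me part of the way there. However, it still doesn't identify the conflicting item_ids, and my original table has 10 million rows and perhaps 1000 different item_ids, so it can't really be done by hand.
@user3591836, I don't understand what you mean by identify. The query I provided returns only the duplicated series. Please be precise.
I want the output to contain the item_ids of all duplicated time series, along with the interval over which they are identical. Something like this: 'Red Apples', 'Apples, Red', 1996, [1,3]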

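【Solution 2】:

-- SELECT * cannot be combined with GROUP BY in Postgres, so the grouped columns are listed instead
SELECT country_id, year, month, value
FROM my_table
GROUP BY country_id, year, month, value
HAVING count(item_id) > 1

!This is untested!

【Comments】: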