【Question title】: SQL moving average
【Posted】: 2012-05-24 09:23:28
【Question】:

How do I create a moving average in SQL?

Current table:

Date             Clicks 
2012-05-01       2,230
2012-05-02       3,150
2012-05-03       5,520
2012-05-04       1,330
2012-05-05       2,260
2012-05-06       3,540
2012-05-07       2,330

Desired table or output:

Date             Clicks    3 day Moving Average
2012-05-01       2,230
2012-05-02       3,150
2012-05-03       5,520          4,360
2012-05-04       1,330          3,330
2012-05-05       2,260          3,120
2012-05-06       3,540          3,320
2012-05-07       2,330          3,010

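【Comments】:

  • Which database system are you using?
  • @BrianWebster: He said in a comment on my (now deleted) post that he is using Hive. But you removed that tag.
  • OK, fixed. To be honest, I didn't realize that was a database system.

Tags: sql hive moving-average


【Answer 1】:

This is an evergreen Joe Celko question. I am ignoring which DBMS platform is used; in any case, Joe was able to answer it more than 10 years ago with standard SQL.

Quote from Joe Celko's SQL Puzzles and Answers: "That last update attempt suggests that we could use the predicate to construct a query that would give us a moving average:"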

SELECT S1.sample_time, AVG(S2.load) AS avg_prev_hour_load
FROM Samples AS S1, Samples AS S2
WHERE S2.sample_time
BETWEEN (S1.sample_time - INTERVAL 1 HOUR)
AND S1.sample_time
GROUP BY S1.sample_time;

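"Is the extra column or the query approach better? The query is technically better because the UPDATE approach will denormalize the database. However, if the historical data being recorded is not going to change and computing the moving average is expensive, you might consider using the column approach."

MS SQL example: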

CREATE TABLE #TestDW
( Date1 datetime,
  LoadValue Numeric(13,6)
);

INSERT INTO #TestDW VALUES('2012-06-09' , '3.540' );
INSERT INTO #TestDW VALUES('2012-06-08' , '2.260' );
INSERT INTO #TestDW VALUES('2012-06-07' , '1.330' );
INSERT INTO #TestDW VALUES('2012-06-06' , '5.520' );
INSERT INTO #TestDW VALUES('2012-06-05' , '3.150' );
INSERT INTO #TestDW VALUES('2012-06-04' , '2.230' );

SQL Puzzle query:

SELECT S1.date1,  AVG(S2.LoadValue) AS avg_prev_3_days
FROM #TestDW AS S1, #TestDW AS S2
WHERE S2.date1
    BETWEEN DATEADD(d, -2, S1.date1 )
    AND S1.date1
GROUP BY S1.date1
order by 1;
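
To answer the request for a query against the table in the question, here is a minimal sketch of the same self-join pattern, run through Python's sqlite3 module so it can be tried without a server. The table name clicks_table and the column name day are assumptions made for the illustration, and SQLite's date(day, '-2 days') stands in for DATEADD(d, -2, ...).

```python
import sqlite3

# Sample data from the question (clicks stored as plain integers).
rows = [
    ("2012-05-01", 2230), ("2012-05-02", 3150), ("2012-05-03", 5520),
    ("2012-05-04", 1330), ("2012-05-05", 2260), ("2012-05-06", 3540),
    ("2012-05-07", 2330),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks_table (day TEXT PRIMARY KEY, clicks INTEGER)")
conn.executemany("INSERT INTO clicks_table VALUES (?, ?)", rows)

# Self-join: each row s1 is paired with every row s2 whose day lies in
# [s1.day - 2 days, s1.day]; AVG then collapses each window to one value.
query = """
SELECT s1.day, AVG(s2.clicks) AS avg_prev_3_days
FROM clicks_table AS s1, clicks_table AS s2
WHERE s2.day BETWEEN date(s1.day, '-2 days') AND s1.day
GROUP BY s1.day
ORDER BY s1.day
"""
result = conn.execute(query).fetchall()

for day, avg_clicks in result:
    print(day, round(avg_clicks, 1))
```

Note that the first two days are averaged over one and two rows respectively, because AVG only sees the rows that exist inside the window.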

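【Comments】:

  • Thanks for the information, but I am having trouble translating it to see how it solves the problem. Could you give the query you would use for the table in the question?
  • This is better because it can be modified to compute a moving average over N months.
【Answer 2】:

One way to do this is to join the same table to itself a few times.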

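-- Averages the current day plus the three previous days (four points);
-- a missing day adds 0 to the sum but still counts in the divisor.
select
 (Current.Clicks 
  + isnull(P1.Clicks, 0)
  + isnull(P2.Clicks, 0)
  + isnull(P3.Clicks, 0)) / 4 as MovingAvg3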
from
 MyTable as Current
 left join MyTable as P1 on P1.Date = DateAdd(day, -1, Current.Date)
 left join MyTable as P2 on P2.Date = DateAdd(day, -2, Current.Date)
 left join MyTable as P3 on P3.Date = DateAdd(day, -3, Current.Date)

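Adjust the DateAdd component of the ON clauses to match whether you want the moving average to run strictly from the past up to the current day, or from a few days before to a few days after it.

  • This works well when the moving average only needs a few data points.
  • It is not the best solution for moving averages with many data points.

【Comments】:

  • Use LEFT JOINs for those (note that the first two rows have no earlier rows).
  • Isn't doing 4 joins an expensive operation for large tables?
  • It depends on the data, but in my experience this is a very fast operation.
【Answer 3】: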
select t2.date, round(sum(ct.clicks)/3) as avg_clicks
from
(select date from clickstable) as t2,
(select date, clicks from clickstable) as ct
where datediff(t2.date, ct.date) between 0 and 2
group by t2.date

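Example here

Obviously you can change the interval to whatever you need. You could also use count() instead of the magic number to make it easier to change, but that will also slow it down.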
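
The difference between the fixed divisor and count() can be seen in a small sketch, run here through Python's sqlite3 module. The table and column names are assumptions for the illustration, and julianday() differences stand in for datediff().

```python
import sqlite3

# Sample data from the question (clicks stored as plain integers).
rows = [
    ("2012-05-01", 2230), ("2012-05-02", 3150), ("2012-05-03", 5520),
    ("2012-05-04", 1330), ("2012-05-05", 2260), ("2012-05-06", 3540),
    ("2012-05-07", 2330),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clickstable (day TEXT PRIMARY KEY, clicks INTEGER)")
conn.executemany("INSERT INTO clickstable VALUES (?, ?)", rows)

# fixed_divisor always divides by 3, even when fewer rows are in the window;
# counted_divisor divides by the number of rows actually found.
query = """
SELECT t2.day,
       SUM(ct.clicks) / 3.0            AS fixed_divisor,
       SUM(ct.clicks) * 1.0 / COUNT(*) AS counted_divisor
FROM clickstable AS t2, clickstable AS ct
WHERE julianday(t2.day) - julianday(ct.day) BETWEEN 0 AND 2
GROUP BY t2.day
ORDER BY t2.day
"""
result = conn.execute(query).fetchall()

for day, fixed, counted in result:
    print(day, round(fixed, 1), round(counted, 1))
```

The two columns agree from the third day onward. On the first two days the fixed divisor understates the value, since one or two rows are still divided by 3.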

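【Comments】:

  • Your first two entries are 1-day and 2-day averages. The question asks for those entries to be NULL.
【Answer 4】:

A general template for rolling averages that scales well to large data sets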

WITH moving_avg AS (
  SELECT 0 AS [lag] UNION ALL
  SELECT 1 AS [lag] UNION ALL
  SELECT 2 AS [lag] UNION ALL
  SELECT 3 AS [lag] --ETC
)
SELECT
  DATEADD(day,[lag],[date]) AS [reference_date],
  [otherkey1],[otherkey2],[otherkey3],
  AVG([value1]) AS [avg_value1],
  AVG([value2]) AS [avg_value2]
FROM [data_table]
CROSS JOIN moving_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];

For a weighted rolling average:

WITH weighted_avg AS (
  SELECT 0 AS [lag], 1.0 AS [weight] UNION ALL
  SELECT 1 AS [lag], 0.6 AS [weight] UNION ALL
  SELECT 2 AS [lag], 0.3 AS [weight] UNION ALL
  SELECT 3 AS [lag], 0.1 AS [weight] --ETC
)
SELECT
  DATEADD(day,[lag],[date]) AS [reference_date],
  [otherkey1],[otherkey2],[otherkey3],
  AVG([value1] * [weight]) / AVG([weight]) AS [wavg_value1],
  AVG([value2] * [weight]) / AVG([weight]) AS [wavg_value2]
FROM [data_table]
CROSS JOIN weighted_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];
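
As a rough check of the weighted template, the sketch below runs the same cross-join idea in SQLite via Python's sqlite3 module, using the click counts from the question as value1. The single-column table, the lag_days column name, and the absence of the otherkey columns are simplifications assumed for the illustration.

```python
import sqlite3

# Click counts from the question, used here as value1.
rows = [
    ("2012-05-01", 2230), ("2012-05-02", 3150), ("2012-05-03", 5520),
    ("2012-05-04", 1330), ("2012-05-05", 2260), ("2012-05-06", 3540),
    ("2012-05-07", 2330),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_table (day TEXT PRIMARY KEY, value1 INTEGER)")
conn.executemany("INSERT INTO data_table VALUES (?, ?)", rows)

# Every row is copied once per lag and shifted forward by that many days,
# so each reference_date collects its own value and the three before it.
# AVG(v * w) / AVG(w) equals SUM(v * w) / SUM(w) over the rows present.
query = """
WITH weighted_avg(lag_days, weight) AS (
    VALUES (0, 1.0), (1, 0.6), (2, 0.3), (3, 0.1)
)
SELECT date(day, '+' || lag_days || ' days') AS reference_date,
       AVG(value1 * weight) / AVG(weight)    AS wavg_value1
FROM data_table
CROSS JOIN weighted_avg
GROUP BY reference_date
ORDER BY reference_date
"""
wavg = dict(conn.execute(query).fetchall())

for reference_date, value in wavg.items():
    print(reference_date, round(value, 1))
```

For 2012-05-04 this works out to (1330·1.0 + 5520·0.6 + 3150·0.3 + 2230·0.1) / 2.0 = 2905. Because rows are shifted forward, the result also contains reference dates after the last data day (here up to 2012-05-10), which may need filtering.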

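【Comments】:

  • Interesting approach for the weights. It won't work (well) for more discrete time points (timestamps rather than dates), though.
  • @msciwoj Outside of an academic exercise, what use is a fixed-weight rolling average over non-uniform intervals? Wouldn't you bin the data first, or compute the weights based on the interval size?
  • Definitely uniform. You just throw each point into the appropriate weight bucket depending on its distance from the current time point. For example: "weight=1 for data points within 24 hours of the current data point; weight=0.5 for data points within 48 hours…". In that case it matters how far apart consecutive data points (like 6:12 AM and 11:48 PM) are from each other… One use case I can think of is trying to smooth a histogram wherever the data points are not dense enough.
【Answer 5】: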
select *
        , (select avg(c2.clicks) from #clicks_table c2 
            where c2.date between dateadd(dd, -2, c1.date) and c1.date) mov_avg
from #clicks_table c1

【Comments】:

    【Answer 6】:

    Use a different join predicate:

    SELECT current.date
           ,avg(periods.clicks)
    FROM current left outer join current as periods
           ON current.date BETWEEN dateadd(d,-2, periods.date) AND periods.date
    GROUP BY current.date HAVING COUNT(*) >= 3
    

    The HAVING clause will prevent any dates without at least N values from being returned.
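
A small sketch of what HAVING does to the result, compared with a CASE expression that keeps every date. It runs in SQLite through Python's sqlite3 module; the table and column names are assumptions, and the window here looks backward (current day and the two before it), which is also an assumption chosen to match the question.

```python
import sqlite3

# Sample data from the question (clicks stored as plain integers).
rows = [
    ("2012-05-01", 2230), ("2012-05-02", 3150), ("2012-05-03", 5520),
    ("2012-05-04", 1330), ("2012-05-05", 2260), ("2012-05-06", 3540),
    ("2012-05-07", 2330),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks_table (day TEXT PRIMARY KEY, clicks INTEGER)")
conn.executemany("INSERT INTO clicks_table VALUES (?, ?)", rows)

# HAVING removes whole groups, so dates with an incomplete window disappear.
having_query = """
SELECT cur.day, AVG(win.clicks)
FROM clicks_table AS cur
LEFT JOIN clicks_table AS win
       ON win.day BETWEEN date(cur.day, '-2 days') AND cur.day
GROUP BY cur.day
HAVING COUNT(*) >= 3
ORDER BY cur.day
"""
having_rows = conn.execute(having_query).fetchall()

# CASE keeps every date and returns NULL where the window is incomplete.
case_query = """
SELECT cur.day,
       CASE WHEN COUNT(win.day) >= 3 THEN AVG(win.clicks) END
FROM clicks_table AS cur
LEFT JOIN clicks_table AS win
       ON win.day BETWEEN date(cur.day, '-2 days') AND cur.day
GROUP BY cur.day
ORDER BY cur.day
"""
case_rows = conn.execute(case_query).fetchall()

print(len(having_rows), "rows with HAVING;", len(case_rows), "rows with CASE")
```

With the sample data the HAVING version returns five rows starting at 2012-05-03, while the CASE version returns all seven, with NULL (None in Python) for the first two.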

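    【Comments】:

    • This will not show the rows for May 1 and May 2, which the asker wants to see with NULLs.
    【Answer 7】:

    Assuming x is the value to be averaged and xDate is the date value:

    SELECT avg(x) FROM myTable WHERE xDate BETWEEN dateadd(d, -2, xDate) AND xDate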

    【Comments】:

      【Answer 8】:

      In Hive, maybe you could try

      select date, clicks, avg(clicks) over (order by date rows between 2 preceding and current row) as moving_avg from clicktable;
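
The same ROWS BETWEEN 2 PRECEDING AND CURRENT ROW frame can be tried locally in SQLite, which supports window functions from version 3.25 (an assumption about the SQLite build bundled with your Python). The sketch below also adds a COUNT(*) over the same frame so the first two rows come out as NULL, as the question asks; that CASE wrapper is an addition for illustration, not part of the Hive query above.

```python
import sqlite3

# Sample data from the question (clicks stored as plain integers).
rows = [
    ("2012-05-01", 2230), ("2012-05-02", 3150), ("2012-05-03", 5520),
    ("2012-05-04", 1330), ("2012-05-05", 2260), ("2012-05-06", 3540),
    ("2012-05-07", 2330),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicktable (day TEXT PRIMARY KEY, clicks INTEGER)")
conn.executemany("INSERT INTO clicktable VALUES (?, ?)", rows)

# The named window w covers the current row and the two rows before it.
# COUNT(*) OVER w is below 3 only for the first two rows, so the CASE
# expression leaves those averages NULL.
query = """
SELECT day,
       clicks,
       CASE WHEN COUNT(*) OVER w = 3 THEN AVG(clicks) OVER w END AS moving_avg
FROM clicktable
WINDOW w AS (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
ORDER BY day
"""
result = conn.execute(query).fetchall()

for day, clicks, moving_avg in result:
    print(day, clicks, None if moving_avg is None else round(moving_avg, 1))
```

A ROWS frame counts rows, not calendar days, so this only equals a 3-day average when there is exactly one row per day with no gaps.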
      

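      【Comments】:

        【Answer 9】:

        For this, I would create an auxiliary/dimension date table, like

        create table date_dim(date date, date_1 date, date_2 date, date_3 date ...)
        

        date is the key, date_1 stands for the current day, date_2 contains the current day and the day before; date_3...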

        Then you can do an equi-join in Hive.

        Using a view like the following:

        select date, date               from date_dim
        union all
        select date, date_add(date, -1) from date_dim
        union all
        select date, date_add(date, -2) from date_dim
        union all
        select date, date_add(date, -3) from date_dim
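
A minimal sketch of how such a view turns the range condition into an equi-join, run in SQLite through Python's sqlite3 module. For brevity it derives the dates from the clicks table itself instead of a separate date_dim table and uses offsets 0 to -2 for a 3-day window; the names date_window, anchor_day and member_day are hypothetical.

```python
import sqlite3

# Sample data from the question (clicks stored as plain integers).
rows = [
    ("2012-05-01", 2230), ("2012-05-02", 3150), ("2012-05-03", 5520),
    ("2012-05-04", 1330), ("2012-05-05", 2260), ("2012-05-06", 3540),
    ("2012-05-07", 2330),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks_table (day TEXT PRIMARY KEY, clicks INTEGER)")
conn.executemany("INSERT INTO clicks_table VALUES (?, ?)", rows)

# One row per (anchor day, day belonging to that anchor's window).
conn.execute("""
CREATE VIEW date_window AS
SELECT day AS anchor_day, day AS member_day FROM clicks_table
UNION ALL
SELECT day, date(day, '-1 days') FROM clicks_table
UNION ALL
SELECT day, date(day, '-2 days') FROM clicks_table
""")

# The join condition is now a plain equality on member_day.
query = """
SELECT w.anchor_day, AVG(c.clicks) AS moving_avg
FROM date_window AS w
JOIN clicks_table AS c ON c.day = w.member_day
GROUP BY w.anchor_day
ORDER BY w.anchor_day
"""
result = conn.execute(query).fetchall()

for anchor_day, moving_avg in result:
    print(anchor_day, round(moving_avg, 1))
```

Window days with no matching row simply drop out of the inner join, so the first two anchors are averaged over fewer rows.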
        

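        【Comments】:

          【Answer 10】:

          Note: this is not an answer but an enhanced code sample for Diego Scaravaggi's answer. I am posting it as an answer because the comment section is too limited. Note that I have parameterized the period of the moving average.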

          declare @p int = 3
          declare @t table(d int, bal float)
          insert into @t values
          (1,94),
          (2,99),
          (3,76),
          (4,74),
          (5,48),
          (6,55),
          (7,90),
          (8,77),
          (9,16),
          (10,19),
          (11,66),
          (12,47)
          
          select a.d, avg(b.bal)
          from
                 @t a
                 left join @t b on b.d between a.d-(@p-1) and a.d
          group by a.d
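
The parameterized period can also be tried outside SQL Server. The sketch below uses Python's sqlite3 module with a bound parameter in place of @p; the helper name moving_average and the table name t are assumptions for the illustration, while the data values are the ones from the sample above.

```python
import sqlite3

# (d, bal) pairs from the sample above.
balances = [94, 99, 76, 74, 48, 55, 90, 77, 16, 19, 66, 47]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (d INTEGER PRIMARY KEY, bal REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(i, bal) for i, bal in enumerate(balances, start=1)],
)


def moving_average(period):
    """Return (d, average of bal over the last `period` values of d)."""
    query = """
    SELECT a.d, AVG(b.bal)
    FROM t AS a
    LEFT JOIN t AS b ON b.d BETWEEN a.d - (? - 1) AND a.d
    GROUP BY a.d
    ORDER BY a.d
    """
    return conn.execute(query, (period,)).fetchall()


for d, avg_bal in moving_average(3):
    print(d, round(avg_bal, 2))
```

Calling moving_average(1) returns each balance unchanged, which is a quick sanity check on the window bounds.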
          

          【Comments】:

            【Answer 11】:
            --@p1 is period of moving average, @o1 is offset
            
            declare @p1 as int
            declare @o1 as int
            set @p1 = 5;
            set @o1 = 3;
            
            with np as(
            select *, rank() over(partition by cmdty, tenor order by markdt) as r
            from p_prices p1
            where
            1=1 
            )
            , x1 as (
            select s1.*, avg(s2.val) as avgval from np s1
            inner join np s2 
            on s1.cmdty = s2.cmdty and s1.tenor = s2.tenor
            and s2.r between s1.r - (@p1 - 1) - (@o1) and s1.r - (@o1)
            group by s1.cmdty, s1.tenor, s1.markdt, s1.val, s1.r
            )
            -- a final SELECT is needed so the CTE chain forms a complete statement
            select * from x1
            

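            【Comments】:

              【Answer 12】:

              I am not sure that your expected result (output) shows the classic "simple moving (rolling) average" over 3 days. For example, by definition the first triple of numbers gives: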

              ThreeDaysMovingAverage = (2.230 + 3.150 + 5.520) / 3 = 3.6333333
              

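              but you expect 4.360, which is confusing.

              Nevertheless, I suggest the following solution, which uses the window function AVG. This approach is much more efficient (clearer and less resource-intensive) than the SELF-JOIN introduced in other answers (and I am surprised that no one has given a better solution).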

              -- Oracle-SQL dialect 
              with
                data_table as (
                   select date '2012-05-01' AS dt, 2.230 AS clicks from dual union all
                   select date '2012-05-02' AS dt, 3.150 AS clicks from dual union all
                   select date '2012-05-03' AS dt, 5.520 AS clicks from dual union all
                   select date '2012-05-04' AS dt, 1.330 AS clicks from dual union all
                   select date '2012-05-05' AS dt, 2.260 AS clicks from dual union all
                   select date '2012-05-06' AS dt, 3.540 AS clicks from dual union all
                   select date '2012-05-07' AS dt, 2.330 AS clicks from dual  
                ),
                param as (select 3 days from dual)
              select
                 dt     AS "Date",
                 clicks AS "Clicks",
              
                 case when rownum >= p.days then 
                     avg(clicks) over (order by dt
                                        rows between p.days - 1 preceding and current row)
                 end    
                        AS "3 day Moving Average"
              from data_table t, param p;
              

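              You can see that AVG is wrapped in case when rownum >= p.days then to force NULLs in the first rows, where a "3 day Moving Average" is meaningless.

              【Comments】:

                【Answer 13】:

                We can apply Joe Celko's "dirty" left outer join method (as cited above by Diego Scaravaggi) to answer the question as it was asked.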

                declare @ClicksTable table  ([Date] date, Clicks int)
                insert into @ClicksTable
                    select '2012-05-01', 2230 union all
                    select '2012-05-02', 3150 union all
                    select '2012-05-03', 5520 union all
                    select '2012-05-04', 1330 union all
                    select '2012-05-05', 2260 union all
                    select '2012-05-06', 3540 union all
                    select '2012-05-07', 2330
                

                This query:

                SELECT
                    T1.[Date],
                    T1.Clicks,
                    -- AVG ignores NULL values so we have to explicitly NULLify
                    -- the days when we don't have a full 3-day sample
                    CASE WHEN count(T2.[Date]) < 3 THEN NULL
                        ELSE AVG(T2.Clicks) 
                    END AS [3-Day Moving Average] 
                FROM @ClicksTable T1
                LEFT OUTER JOIN @ClicksTable T2
                    ON T2.[Date] BETWEEN DATEADD(d, -2, T1.[Date]) AND T1.[Date]
                GROUP BY T1.[Date], T1.Clicks
                

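                produces the requested layout, with NULL for the first two days. The averages are the integer means of each 3-day window, so they differ from the sample figures in the question:

                Date             Clicks    3-Day Moving Average
                2012-05-01       2,230
                2012-05-02       3,150
                2012-05-03       5,520          3,633
                2012-05-04       1,330          3,333
                2012-05-05       2,260          3,036
                2012-05-06       3,540          2,376
                2012-05-07       2,330          2,710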
                

                【Comments】:
