【Question title】: Linear regression analysis for a Date column in SQL Server
【Posted】: 2014-12-27 05:06:08
【Question】:

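I have the following block of code, which uses linear regression (least squares) to calculate the formula of a trend line. It returns the R-squared value and the regression coefficients (Alpha and Beta) for the X and Y values.

It calculates correct values when the X and Y values are int or float.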

CREATE FUNCTION [dbo].[LinearReqression] (@Data AS XML)
RETURNS TABLE AS RETURN (
    WITH Array AS (
        SELECT  x = n.value('@x', 'float'),
                y = n.value('@y', 'float')
        FROM @Data.nodes('/r/n') v(n)
    ),
    Medians AS (
        SELECT  xbar = AVG(x), ybar = AVG(y)
        FROM Array ),
    BetaCalc AS (
        SELECT  Beta = SUM(xdelta * (y - ybar)) / NULLIF(SUM(xdelta * xdelta), 0)
        FROM Array 
        CROSS JOIN Medians
        CROSS APPLY ( SELECT xdelta = (x - xbar) ) xd ),
    AlphaCalc AS (
        SELECT  Alpha = ybar - xbar * beta
        FROM    Medians
        CROSS JOIN BetaCalc),
    SSCalc AS (
        SELECT  SS_tot = SUM((y - ybar) * (y - ybar)),
                SS_err = SUM((y - (Alpha + Beta * x)) * (y - (Alpha + Beta * x)))
        FROM Array
        CROSS JOIN Medians
        CROSS JOIN AlphaCalc
        CROSS JOIN BetaCalc )
    SELECT  r_squared = CASE WHEN SS_tot = 0 THEN 1.0
                             ELSE 1.0 - ( SS_err / SS_tot ) END,
            Alpha, Beta
    FROM AlphaCalc
    CROSS JOIN BetaCalc
    CROSS JOIN SSCalc
)

Usage:

DECLARE @DataTable TABLE (
    SourceID    INT,
    x           FLOAT,  -- FLOAT so the sample values below load; the question is how to handle a DATE here
    y           FLOAT
) ;
INSERT INTO @DataTable ( SourceID, x, y )
SELECT ID = 0, x = 1.2, y = 1.0
UNION ALL SELECT 1, 1.6, 1
UNION ALL SELECT 2, 2.0, 1.5
UNION ALL SELECT 3, 2.0, 1.75
UNION ALL SELECT 4, 2.1, 1.85
UNION ALL SELECT 5, 2.1, 2
UNION ALL SELECT 6, 2.2, 3
UNION ALL SELECT 7, 2.2, 3
UNION ALL SELECT 8, 2.3, 3.5
UNION ALL SELECT 9, 2.4, 4
UNION ALL SELECT 10, 2.5, 4
UNION ALL SELECT 11, 3, 4.5 ;

-- Create and view XML data array
DECLARE @DataXML XML ;
SET @DataXML = (
    SELECT  -- FLOAT values are formatted in XML like "1.000000000000000e+000", increasing the character count
            -- Converting them to VARCHAR first keeps the XML small without sacrificing precision
            -- They are unpacked as FLOAT in the function either way
            [@x] = CAST(x AS VARCHAR(20)), 
            [@y] = CAST(y AS VARCHAR(20))
    FROM @DataTable
    FOR XML PATH('n'), ROOT('r') ) ;

SELECT @DataXML ;

-- Get the results
SELECT * FROM dbo.LinearReqression (@DataXML) ;

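In my case, the X axis may also be a date column. How can I run the same regression analysis on a date column?

【Comments】:

  • A date can be converted to a float (fractional days since 1 January 1970) or to a bigint (seconds since any point in time you choose).

Tags: sql sql-server linear-regression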
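
To illustrate the conversion suggested in this comment, here is a minimal sketch. The sample timestamp and the variable names `@t` and `@epoch` are only placeholders for illustration; a real query would use the date column instead.

```sql
-- Hypothetical sample value; a datetime column would be handled the same way.
DECLARE @t     datetime = '2014-12-27T05:06:08';
DECLARE @epoch datetime = '1970-01-01T00:00:00';

SELECT
    -- Whole seconds since the chosen start point.
    -- DATEDIFF returns int, so it overflows for gaps longer than roughly 68 years.
    SecondsSinceEpoch = CAST(DATEDIFF(second, @epoch, @t) AS bigint),
    -- Fractional days since the start point; dividing by a float gives a float.
    DaysSinceEpoch    = DATEDIFF(second, @epoch, @t) / CAST(86400 AS float);
```

Either value can then be used as X in the regression. The choice of start point only shifts X, so it should change Alpha but not Beta or R-squared.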


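【Solution 1】:

The short answer: calculating a trend line for dates is almost the same as calculating one for floats.

For dates, you can pick some start date and use the number of days between the start date and each date as X.

I did not check your function itself; I assume the formulas in it are correct.

Also, I don't understand why you generate XML from a table and then parse it back into a table inside the function. That is rather inefficient. You can simply pass the table.

I made two variants of your function: one that handles floats and one that handles dates. I am using SQL Server 2008 in this example.

First, create a user-defined table type so that we can pass a table to the function: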

CREATE TYPE [dbo].[FloatRegressionDataTableType] AS TABLE(
    [x] [float] NOT NULL,
    [y] [float] NOT NULL
)
GO

Then create a function that accepts such a table:

CREATE FUNCTION [dbo].[LinearRegressionFloat] (@ParamData dbo.FloatRegressionDataTableType READONLY)
RETURNS TABLE AS RETURN (
    WITH Array AS (
        SELECT  x,
                y
        FROM @ParamData
    ),
    Medians AS (
        SELECT  xbar = AVG(x), ybar = AVG(y)
        FROM Array ),
    BetaCalc AS (
        SELECT  Beta = SUM(xdelta * (y - ybar)) / NULLIF(SUM(xdelta * xdelta), 0)
        FROM Array 
        CROSS JOIN Medians
        CROSS APPLY ( SELECT xdelta = (x - xbar) ) xd ),
    AlphaCalc AS (
        SELECT  Alpha = ybar - xbar * beta
        FROM    Medians
        CROSS JOIN BetaCalc),
    SSCalc AS (
        SELECT  SS_tot = SUM((y - ybar) * (y - ybar)),
                SS_err = SUM((y - (Alpha + Beta * x)) * (y - (Alpha + Beta * x)))
        FROM Array
        CROSS JOIN Medians
        CROSS JOIN AlphaCalc
        CROSS JOIN BetaCalc )
    SELECT  r_squared = CASE WHEN SS_tot = 0 THEN 1.0
                             ELSE 1.0 - ( SS_err / SS_tot ) END,
            Alpha, Beta
    FROM AlphaCalc
    CROSS JOIN BetaCalc
    CROSS JOIN SSCalc
)
GO

In much the same way, create a type for a table with dates:

CREATE TYPE [dbo].[DateRegressionDataTableType] AS TABLE(
    [x] [date] NOT NULL,
    [y] [float] NOT NULL
)
GO

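Then create a function that accepts such a table. For each given date, it uses DATEDIFF to calculate the number of days between 2001-01-01 and the given date x, then casts the result to float so that the rest of the calculation is correct. If you remove the cast to float, you will see a different result. You can choose any other start date; it does not have to be 2001-01-01.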
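
The likely reason the cast matters is that DATEDIFF returns an int, and AVG over int values uses integer arithmetic, so xbar would be truncated. A small sketch showing the effect; the table variable `@Days` and its values are made up for illustration:

```sql
-- Hypothetical day offsets, such as DATEDIFF(day, ...) would return.
DECLARE @Days TABLE (d int NOT NULL);

INSERT INTO @Days (d)
VALUES (12), (16), (21);

SELECT
    IntAvg   = AVG(d),                 -- int result: 49 / 3 is truncated to 16
    FloatAvg = AVG(CAST(d AS float))   -- float result: approximately 16.33
FROM @Days;
```

With a truncated xbar, every xdelta in the function is slightly off, which would change Beta, Alpha and r_squared.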

CREATE FUNCTION [dbo].[LinearRegressionDate] (@ParamData dbo.DateRegressionDataTableType READONLY)
RETURNS TABLE AS RETURN (
    WITH Array AS (
        SELECT  CAST(DATEDIFF(day, '2001-01-01', x) AS float) AS x,
                y
        FROM @ParamData
    ),
    Medians AS (
        SELECT  xbar = AVG(x), ybar = AVG(y)
        FROM Array ),
    BetaCalc AS (
        SELECT  Beta = SUM(xdelta * (y - ybar)) / NULLIF(SUM(xdelta * xdelta), 0)
        FROM Array 
        CROSS JOIN Medians
        CROSS APPLY ( SELECT xdelta = (x - xbar) ) xd ),
    AlphaCalc AS (
        SELECT  Alpha = ybar - xbar * beta
        FROM    Medians
        CROSS JOIN BetaCalc),
    SSCalc AS (
        SELECT  SS_tot = SUM((y - ybar) * (y - ybar)),
                SS_err = SUM((y - (Alpha + Beta * x)) * (y - (Alpha + Beta * x)))
        FROM Array
        CROSS JOIN Medians
        CROSS JOIN AlphaCalc
        CROSS JOIN BetaCalc )
    SELECT  r_squared = CASE WHEN SS_tot = 0 THEN 1.0
                             ELSE 1.0 - ( SS_err / SS_tot ) END,
            Alpha, Beta
    FROM AlphaCalc
    CROSS JOIN BetaCalc
    CROSS JOIN SSCalc
)
GO

Here is how to test the functions:

-- test float data
DECLARE @FloatDataTable [dbo].[FloatRegressionDataTableType];

INSERT INTO @FloatDataTable (x, y)
VALUES
(1.2, 1.0)
,(1.6, 1)
,(2.0, 1.5)
,(2.0, 1.75)
,(2.1, 1.85)
,(2.1, 2)
,(2.2, 3)
,(2.2, 3)
,(2.3, 3.5)
,(2.4, 4)
,(2.5, 4)
,(3, 4.5);

SELECT * FROM dbo.LinearRegressionFloat(@FloatDataTable);


-- test date data
DECLARE @DateDataTable [dbo].[DateRegressionDataTableType];

INSERT INTO @DateDataTable (x, y)
VALUES
 ('2001-01-13', 1.0)
,('2001-01-17', 1)
,('2001-01-21', 1.5)
,('2001-01-21', 1.75)
,('2001-01-22', 1.85)
,('2001-01-22', 2)
,('2001-01-23', 3)
,('2001-01-23', 3)
,('2001-01-24', 3.5)
,('2001-01-25', 4)
,('2001-01-26', 4)
,('2001-01-31', 4.5);

SELECT * FROM dbo.LinearRegressionDate(@DateDataTable);

Here are the two result sets (float data first, then date data):

r_squared            Alpha                Beta
----------------------------------------------------------
0.798224907472009    -2.66524390243902    2.46417682926829


r_squared            Alpha                Beta
----------------------------------------------------------
0.79822490747201     -2.66524390243902    0.246417682926829
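
The test dates are chosen so that the number of days since 2001-01-01 is ten times the corresponding float x (for example, 2001-01-13 is 12 days after the start date, matching 1.2). That is why Beta in the second result is one tenth of the first, while r_squared and Alpha agree. As a sketch of how the date-based coefficients might then be used, the following evaluates the trend line for one date. The variable names and the sample date `@d` are assumptions for illustration; the coefficients are copied from the second result set above.

```sql
-- Coefficients copied from the date-based result set above.
DECLARE @Alpha float = -2.66524390243902;
DECLARE @Beta  float = 0.246417682926829;

-- Hypothetical date for which we want the trend-line value.
DECLARE @d date = '2001-02-05';

-- X must be computed the same way as inside LinearRegressionDate:
-- days since the same start date, cast to float.
SELECT PredictedY = @Alpha + @Beta * CAST(DATEDIFF(day, '2001-01-01', @d) AS float);
```

The start date here has to match the one used inside the function; otherwise Alpha no longer corresponds to the X values.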
