Efficient way of storing date ranges: answers

【Question title】: Efficient way of storing date ranges
【Posted】: 2017-03-24 18:59:23
【Question】:

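I need to store some simple data. Suppose I have products with a code as the primary key, some attributes, and a validity range. The data could look like this: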

Products
code    value   begin_date  end_date
10905   13      2005-01-01  2016-12-31
10905   11      2017-01-01  null

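These ranges do not overlap, so on any given date I have a list of unique products and their attributes. To make this easier to use, I created this function: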

create function dbo.f_Products
(
    @date date
)
returns table
as
return (
    select p.code, p.value
    from dbo.Products as p
    where
        @date >= p.begin_date and
        -- a null end_date marks an open-ended range
        (p.end_date is null or @date <= p.end_date)
)

This is how I intend to use it:

select
    *
from <some table with product codes> as t
    left join dbo.f_Products(@date) as p on
        p.code = t.product_code

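This all works fine, but how can I let the optimizer know that these rows are unique, so that it produces a better execution plan?

I did some googling and found a couple of really good articles on DDL that prevents overlapping ranges from being stored in the table.

But even when I tried those constraints, I could see that the optimizer does not understand that the resulting recordset will return unique codes.

What I'd like is an approach that gives me basically the same performance as if I stored the list of products for each date and selected it with date = @date.

I know that some RDBMSs (such as PostgreSQL) have special data types for this (Range Types). But SQL Server doesn't have anything like that.

Am I missing something, or is there no way to do this properly in SQL Server?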

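【Comments】:

If you care about performance, don't use a UDF. Just join to the table directly.
@GordonLinoff I don't agree with that, but it's beside the point here, so let's not start a heated debate.
Just a random thought: an index on begin_date that includes end_date, plus SELECT TOP 1 ... in the UDF? Would that produce a better execution plan?
"These ranges do not overlap", except that they do. Your second row should presumably start at 2017-01-01.
I think you're barking up the wrong tree by wanting to somehow tell the optimizer that your result rows are unique. What you should be concerned with first is letting it retrieve the rows you're interested in efficiently. (And no, as of 2018, SQL Server still has no separate support for ranges.) Without "knowledge" of the uniqueness, the optimizer will decide on a join type based on the cardinality of the other table you're joining with, which ought to be fine. CREATE UNIQUE CLUSTERED INDEX IX_Products ON products([code], [begin_date], [end_date]) ought to be all you need...

Tags: sql sql-server intervals sql-server-2016 date-range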

【Solution 1】:

A solution without gaps might look like this:

DECLARE @tbl TABLE(ID INT IDENTITY,[start_date] DATE);
INSERT INTO @tbl VALUES({d'2016-10-01'}),({d'2016-09-01'}),({d'2016-08-01'}),({d'2016-07-01'}),({d'2016-06-01'});

SELECT * FROM @tbl;

DECLARE @DateFilter DATE={d'2016-08-13'};

SELECT TOP 1 * 
FROM @tbl
WHERE [start_date]<=@DateFilter
ORDER BY [start_date] DESC

Important: make sure there is a (unique) index on start_date.

UPDATE: for multiple products
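
The index recommended here could be declared roughly as follows. This is only a sketch: dbo.RangeStart and the index name are hypothetical, standing in for the @tbl table variable used above.

```sql
-- Hypothetical permanent table standing in for the @tbl table variable.
CREATE TABLE dbo.RangeStart
(
    ID INT IDENTITY NOT NULL,
    [start_date] DATE NOT NULL
);

-- With a unique index on start_date, TOP 1 ... ORDER BY [start_date] DESC
-- should be able to seek to the filter date and read one row without sorting.
CREATE UNIQUE NONCLUSTERED INDEX UX_RangeStart_start_date
    ON dbo.RangeStart ([start_date]);

DECLARE @DateFilter DATE = '2016-08-13';

SELECT TOP (1) ID, [start_date]
FROM dbo.RangeStart
WHERE [start_date] <= @DateFilter
ORDER BY [start_date] DESC;
```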

DECLARE @tbl TABLE(ID INT IDENTITY,ProductID INT,[start_date] DATE);
INSERT INTO @tbl VALUES
--product 1
(1,{d'2016-10-01'}),(1,{d'2016-09-01'}),(1,{d'2016-08-01'}),(1,{d'2016-07-01'}),(1,{d'2016-06-01'})
--product 2
,(2,{d'2016-10-17'}),(2,{d'2016-09-16'}),(2,{d'2016-08-15'}),(2,{d'2016-07-10'}),(2,{d'2016-06-11'});

DECLARE @DateFilter DATE={d'2016-08-13'};

WITH PartitionedCount AS
(
    SELECT ROW_NUMBER() OVER(PARTITION BY ProductID ORDER BY [start_date] DESC) AS Nr
          ,*
    FROM @tbl
    WHERE [start_date]<=@DateFilter
)
SELECT *
FROM PartitionedCount
WHERE Nr=1

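【Comments】:

Yes, that's a good way to solve the task. However, the performance is still not optimal. What I'd like is an approach that gives me basically the same performance as if I stored the list of products for each date and selected it with date = @date.
@RomanPekar Whether an index lookup is based on equality or has to locate a value within an interval is bound to make some difference. Still: if there is an index (and if it is actually used; check the execution plan!), WHERE [start_date]<=@DateFilter should be lightning fast, and TOP 1 ORDER BY [start_date] DESC should pick the relevant row immediately... This makes me wonder why you aren't seeing better performance...
Well, for a single row it works fine, but what if I want the list of valid products on a given date? I could of course expand my table into individual dates and use an equality check, but I'd really like to know whether there is a prettier solution.
@RomanPekar If all your products share the same start dates (first of the month), just use SELECT TOP 1 WITH TIES... Does this help?
@RomanPekar I just added an approach using ROW_NUMBER that gets all products in one go. Just join this result set to your main query...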

【Solution 2】:

First you need to create a unique clustered index on (begin_date, end_date, code).

Then the SQL engine will be able to do an INDEX SEEK.
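
As a minimal sketch, the index described above could be created like this. The index name is a placeholder, and the statement assumes dbo.Products does not already have a clustered index or a clustered primary key.

```sql
-- Assumes dbo.Products has no clustered index yet; a table can have only one.
-- The index name is a placeholder.
CREATE UNIQUE CLUSTERED INDEX UCX_Products_begin_end_code
    ON dbo.Products (begin_date, end_date, code);
```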

Additionally, you can try creating a view over the dbo.Products table that joins it to a pre-populated dbo.Dates table.

select p.code, p.value, p.begin_date, p.end_date, d.[date]
    from dbo.Products as p
        inner join dbo.Dates d on p.begin_date <= d.[date] and d.[date] <= p.end_date

Then, in your function, use that view with "where @date = view.date". The result may be better or slightly worse... it depends on the actual data.
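
Put together, the view and the function might look something like the sketch below. The names dbo.v_ProductDates and dbo.f_ProductsByDate are hypothetical, dbo.Dates is assumed to hold one row per calendar date in a [date] column, and a null end_date is treated as open-ended (an assumption; DATEFROMPARTS needs SQL Server 2012 or later).

```sql
-- Hypothetical view: one row per product and calendar date inside its range.
CREATE VIEW dbo.v_ProductDates
AS
SELECT p.code, p.[value], p.begin_date, p.end_date, d.[date]
FROM dbo.Products AS p
    INNER JOIN dbo.Dates AS d
        ON p.begin_date <= d.[date]
       -- assumption: a NULL end_date means the range is still open
       AND d.[date] <= ISNULL(p.end_date, DATEFROMPARTS(9999, 12, 31));
GO

-- Hypothetical inline function that filters the view by equality on the date.
CREATE FUNCTION dbo.f_ProductsByDate (@date date)
RETURNS TABLE
AS
RETURN
(
    SELECT v.code, v.[value]
    FROM dbo.v_ProductDates AS v
    WHERE v.[date] = @date
);
GO
```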

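You can also try making that view indexed (depending on how often it is updated).

Alternatively, you could get better performance by populating the dbo.Products table with a row for every date in the [begin_date] .. [end_date] range.

【Comments】:

【Solution 3】:

The approach with ROW_NUMBER scans the whole Products table once. It is the best method if the Products table has a lot of product codes and only a few validity ranges per code.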

WITH
CTE_rn
AS
(
    SELECT
        code
        ,value
        ,ROW_NUMBER() OVER (PARTITION BY code ORDER BY begin_date DESC) AS rn
    FROM Products
    WHERE begin_date <= @date
)
SELECT *
FROM
    <some table with product codes> as t
    LEFT JOIN CTE_rn ON CTE_rn.code = t.product_code AND CTE_rn.rn = 1
;

If you have few product codes and many validity ranges per code in the Products table, it is better to seek into the Products table for each code using OUTER APPLY.

SELECT *
FROM
    <some table with product codes> as t
    OUTER APPLY
    (
        SELECT TOP(1)
            Products.value
        FROM Products
        WHERE
            Products.code = t.product_code
            AND Products.begin_date <= @date
        ORDER BY Products.begin_date DESC
    ) AS A
;

Both variants need a unique index on (code, begin_date DESC) include (value).
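
Written out as DDL, that index could look like the following sketch. The index name is a placeholder, and the table is assumed to live in the dbo schema.

```sql
-- Key on (code, begin_date DESC) so the latest range per code is found first;
-- value is included so both queries can be answered from the index alone.
CREATE UNIQUE NONCLUSTERED INDEX UX_Products_code_begin_date
    ON dbo.Products (code, begin_date DESC)
    INCLUDE ([value]);
```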

Note that the queries don't even look at end_date, because they assume that the intervals have no gaps. They will work in SQL Server 2008.
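
If gaps between intervals are possible, one way to adapt the OUTER APPLY variant is to fetch end_date as well and discard the value when the latest range has already ended. This is a sketch, not part of the original answer; dbo.ProductCodes is a hypothetical stand-in for the table of product codes.

```sql
DECLARE @date date = '2016-06-30';

SELECT
    t.product_code
    -- Keep the value only if the latest range starting on or before @date
    -- is still open (NULL end_date) or ends on or after @date.
    ,CASE WHEN A.end_date IS NULL OR A.end_date >= @date THEN A.[value] END AS [value]
FROM
    dbo.ProductCodes AS t  -- hypothetical table of product codes
    OUTER APPLY
    (
        SELECT TOP (1)
            p.[value]
            ,p.end_date
        FROM dbo.Products AS p
        WHERE
            p.code = t.product_code
            AND p.begin_date <= @date
        ORDER BY p.begin_date DESC
    ) AS A
;
```

For this variant to stay covered by the index, end_date would need to be added to the INCLUDE list.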

【Comments】:

【Solution 4】:

Edit: my original answer used an INNER JOIN, but the asker wants a LEFT JOIN.

CREATE TABLE Products
  (
  [Code] INT NOT NULL
  , [Value] VARCHAR(30) NOT NULL
  , Begin_Date DATETIME NOT NULL
  , End_Date DATETIME NULL
  )

/*
Products
code    value   begin_date  end_date
10905   13      2005-01-01  2016-12-31
10905   11      2017-01-01  null
*/
INSERT INTO Products ([Code], [Value], Begin_Date, End_Date) VALUES (10905, 13, '2005-01-01', '2016-12-31')
INSERT INTO Products ([Code], [Value], Begin_Date, End_Date) VALUES (10905, 11, '2017-01-01', NULL)

CREATE NONCLUSTERED INDEX SK_ProductDate ON Products ([Code], Begin_Date, End_Date) INCLUDE ([Value])

CREATE TABLE SomeTableWithProductCodes
  (
  [CODE] INT NOT NULL 
  )

 INSERT INTO SomeTableWithProductCodes ([Code]) VALUES (10905)

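Here is a prototype query with a date predicate. Note that there are more optimal ways to do this in a bulletproof manner, using a "less than" operator on the upper bound, but that is a different discussion.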
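
For reference, the "less than on the upper bound" idea is usually written as a half-open interval. The sketch below assumes a different storage convention from the sample data: End_Date is exclusive, so the first range would be stored as 2005-01-01 to 2017-01-01. The @AsOf variable is introduced only for illustration.

```sql
-- Assumption: End_Date is exclusive here, i.e. the first range is stored as
-- ('2005-01-01', '2017-01-01') rather than ('2005-01-01', '2016-12-31').
DECLARE @AsOf DATETIME = '2016-06-30';

SELECT
  ST.[Code]
  , P.[Value]
  , P.[Begin_Date]
  , P.[End_Date]
FROM
   SomeTableWithProductCodes AS ST
   LEFT JOIN Products AS P ON
     ST.[Code] = P.[Code]
     AND P.[Begin_Date] <= @AsOf
     -- half-open interval: Begin_Date <= @AsOf < End_Date, NULL means open-ended
     AND (P.[End_Date] IS NULL OR @AsOf < P.[End_Date])
```

With DATETIME columns this avoids missing timestamps that fall during the last day of a range, which an inclusive end date of midnight would exclude.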

SELECT
  P.[Code]
  , P.[Value]
  , P.[Begin_Date]
  , P.[End_Date]
FROM
   SomeTableWithProductCodes ST
   LEFT JOIN Products AS P ON
     ST.[Code] = P.[Code]
     AND '2016-06-30' BETWEEN P.[Begin_Date] AND ISNULL(P.[End_Date], '9999-12-31')

This query will perform an index seek on the Products table.
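
One way to check which operator the optimizer actually picks is to request the estimated plan as text. This is a sketch using the tables created above; on a table with only a couple of rows the optimizer may well choose a scan instead.

```sql
SET SHOWPLAN_TEXT ON;
GO

-- While SHOWPLAN_TEXT is ON the query is not executed;
-- SQL Server returns the estimated plan as text instead.
SELECT
  P.[Code]
  , P.[Value]
FROM
   SomeTableWithProductCodes ST
   LEFT JOIN Products AS P ON
     ST.[Code] = P.[Code]
     AND '2016-06-30' BETWEEN P.[Begin_Date] AND ISNULL(P.[End_Date], '9999-12-31')
GO

SET SHOWPLAN_TEXT OFF;
GO
```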

Here is a SQL Fiddle: SQL Fiddle - Products and Dates

【Comments】:

【Solution 5】:

You can create an indexed view that contains one row for each code/date in the range.

ProductDate (indexed view)
code    value   date
10905   13      2005-01-01
10905   13      2005-01-02
10905   13      ...
10905   13      2016-12-31
10905   11      2017-01-01
10905   11      2017-01-02
10905   11      ...
10905   11      Today

Like this:

create schema digits
go

create table digits.Ones (digit tinyint not null primary key)
insert into digits.Ones (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)

create table digits.Tens (digit tinyint not null primary key)
insert into digits.Tens (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)

create table digits.Hundreds (digit tinyint not null primary key)
insert into digits.Hundreds (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)

create table digits.Thousands (digit tinyint not null primary key)
insert into digits.Thousands (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)

create table digits.TenThousands (digit tinyint not null primary key)
insert into digits.TenThousands (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
go

create schema info
go

create table info.Products (code int not null, [value] int not null, begin_date date not null, end_date date null, primary key (code, begin_date))
insert into info.Products (code, [value], begin_date, end_date) values 
(10905, 13, '2005-01-01', '2016-12-31'),
(10905, 11, '2017-01-01', null)

create table info.DateRange ([begin] date not null, [end] date not null, [singleton] bit not null default(1) check ([singleton] = 1))
insert into info.DateRange ([begin], [end]) values ((select min(begin_date) from info.Products), getdate())
go

create view info.ProductDate with schemabinding 
as
select
    p.code,
    p.value,
    dateadd(day, ones.digit + tens.digit*10 + huns.digit*100 + thos.digit*1000 + tthos.digit*10000, dr.[begin]) as [date]
from
    info.DateRange as dr
cross join
    digits.Ones as ones
cross join
    digits.Tens as tens
cross join
    digits.Hundreds as huns
cross join
    digits.Thousands as thos
cross join
    digits.TenThousands as tthos
join
    info.Products as p on
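    dateadd(day, ones.digit + tens.digit*10 + huns.digit*100 + thos.digit*1000 + tthos.digit*10000, dr.[begin]) between p.begin_date and isnull(p.end_date, datefromparts(9999, 12, 31))
    -- stop at the configured upper bound so the view does not extend past dr.[end]
    and dateadd(day, ones.digit + tens.digit*10 + huns.digit*100 + thos.digit*1000 + tthos.digit*10000, dr.[begin]) <= dr.[end]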
go

create unique clustered index idx_ProductDate on info.ProductDate ([date], code)
go

select *
from info.ProductDate with (noexpand)
where 
    date = '2014-01-01'

drop view info.ProductDate
drop table info.Products
drop table info.DateRange
drop table digits.Ones
drop table digits.Tens
drop table digits.Hundreds
drop table digits.Thousands
drop table digits.TenThousands
drop schema digits
drop schema info
go
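
To tie this back to the question, the indexed view could be joined to the table of product codes with a plain equality predicate on the date. This is a sketch that would have to run before the drop statements above; dbo.ProductCodes is a hypothetical table of product codes.

```sql
declare @date date = '2014-01-01'

select
    t.product_code,
    pd.[value]
from
    dbo.ProductCodes as t  -- hypothetical table of product codes
left join
    info.ProductDate as pd with (noexpand) on
    pd.code = t.product_code and
    pd.[date] = @date
```

Because the clustered index on the view is keyed on ([date], code), both join predicates can be used for a seek.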

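【Comments】:

That's a nice one, and I was thinking about it myself. The problem is that it will take up a lot of space over time.
@RomanPekar What exactly are you trying to accomplish? Do you want all product and date combinations, with no gaps, or do you just want to efficiently query the table you already have, i.e. Code, Value, Begin_Date and End_Date? If it's the latter, these solutions are getting out of hand. If it's the former, you need to edit your question to make that clear, because it isn't.
@RomanPekar - I can't imagine it would take up that much space... How many products are there? What is the average time span?
@RomanPekar Search for "space-time tradeoff". You can't minimize both at the same time.
@Aducci If we want this to be easy to use for the end user, it should basically work for any date in the past (and periods can be open-ended). OK, let's say there are 6000 products. If we cover 1990-01-01 to 2100-01-01, that is 110 * 365 * 6000, which is already about 241M rows. And it still wouldn't work for any date (e.g. 9999-12-31). So it's a fine solution, but a good "range" solution would still be better.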