Slow TSQL Query: Answers

【Question title】: Slow TSQL Query
【Posted】: 2013-05-29 21:31:41
【Question】:

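Any ideas on how to improve the performance of this query?

The PK of [ftsIndex] is (sID, wordPos).
There is also an index on (wordID, sID, wordPos).
All of them are int.

The query ends with a distinct.
Most sIDs have only a few matches.
Some sIDs can have more than 10,000 matches, and those kill the query.

I have a query where the first 27,749 rows return in 11 seconds.
Not a single one of those sIDs has more than 500 matches.
The sum of the individual matches is 65,615.

The 27,750th row alone takes 2 minutes and has 15,000 matches.

That is not surprising, since the join at the end is on [sID].

Since the query ends with a distinct anyway, is there a way to have it find the first positive match on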

on [wXright].[sID] = [wXleft].[sID]
    and [wXright].[wordPos] >  [wXleft].[wordPos]
    and [wXright].[wordPos] <= [wXleft].[wordPos] + 10

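and then move on to the next sID?

I know this is asking a lot of the query optimizer, but it would be really cool.

In real life, the problem document is a parts list in which the supplier is repeated many times.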
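One way to state the "first positive match is enough" intent in the query text is an EXISTS semi-join. The following is only a sketch, assuming the [ftsIndex] and [FTSwordDef] tables as used in this question; it has not been run against this data, and whether the engine actually stops at the first qualifying row depends on the plan it chooses. Note that it still probes every 'Brown' row of an sID; it does not skip to the next sID after the first hit.

```sql
-- Sketch only: EXISTS needs a single qualifying 'Fox' row per 'Brown' row,
-- so the inner probe can stop early instead of producing every pair.
select distinct l.[sID]
from [ftsIndex] as l
where l.[wordID] in (select [id] from [FTSwordDef] where [word] like 'Brown')
  and exists (
        select 1
        from [ftsIndex] as r
        where r.[sID] = l.[sID]
          and r.[wordPos] >  l.[wordPos]
          and r.[wordPos] <= l.[wordPos] + 10
          and r.[wordID] in (select [id] from [FTSwordDef] where [word] like 'Fox')
      );
```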

select distinct [wXleft].[sID] 
 FROM 
 ( -- begin [wXleft]
   ( -- start term
      select [ftsIndex].[sID], [ftsIndex].[wordPos]
      from [ftsIndex] with (nolock)
      where [ftsIndex].[wordID] in 
              (select [id] from [FTSwordDef] with (nolock) 
                             where [word] like 'Brown') 
   ) -- end term
 ) [wXleft]
 join 
 ( -- begin [wXright]
   ( -- start term
      select [ftsIndex].[sID], [ftsIndex].[wordPos]
      from [ftsIndex] with (nolock)
      where [ftsIndex].[wordID] in 
              (select [id] from [FTSwordDef] with (nolock) 
                             where [word] like 'Fox')
   ) -- end term
 ) [wXright]
 on [wXright].[sID] = [wXleft].[sID]
and [wXright].[wordPos] >  [wXleft].[wordPos]
and [wXright].[wordPos] <= [wXleft].[wordPos] + 10

This brought it down to 1:40:

inner loop join

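I did it just to try it out, and it completely changed the query plan.
I don't know how long the problem query would take without it. I gave up at 20:00.
I am not even going to post this as an answer, as I don't think it has value to anyone else.
Hoping for a better answer.
If I don't get one in the next two days, I will delete the question.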
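For readers unfamiliar with join hints, this is roughly where the hint sits. It is a sketch assuming the hint simply replaces the plain join in the original query; the nolock table hints are left out for brevity. In SQL Server, a join hint also enforces the join order as written, which may be part of why the plan changed so much.

```sql
select distinct [wXleft].[sID]
from (
        select [sID], [wordPos]
        from [ftsIndex]
        where [wordID] in (select [id] from [FTSwordDef] where [word] like 'Brown')
     ) [wXleft]
inner loop join   -- join hint: use nested loops for this join
     (
        select [sID], [wordPos]
        from [ftsIndex]
        where [wordID] in (select [id] from [FTSwordDef] where [word] like 'Fox')
     ) [wXright]
  on [wXright].[sID] = [wXleft].[sID]
 and [wXright].[wordPos] >  [wXleft].[wordPos]
 and [wXright].[wordPos] <= [wXleft].[wordPos] + 10;
```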

This does NOT fix it:

  select distinct [ft1].[sID]
  from [ftsIndex] as [ft1] with (nolock)
  join [ftsIndex] as [ft2] with (nolock)
    on [ft2].[sID] = [ft1].[sID]
   and [ft1].[wordID] in (select [id] from [FTSwordDef] with (nolock) where [word] like 'brown')
   and [ft2].[wordID] in (select [id] from [FTSwordDef] with (nolock) where [word] like 'fox')
   and [ft2].[wordPos] >  [ft1].[wordPos]
   and [ft2].[wordPos] <= [ft1].[wordPos] + 10

The search also supports queries such as "quick brown" within 10 words of "fox" or "coyote", so joining with aliases is not a good path.
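To illustrate why the derived-table layout matters for such queries, here is a hypothetical sketch of "quick brown" within 10 words of "fox" or "coyote". It is not the generated SQL from the real application; the phrase handling (using the position of the last word of the phrase as the left position) is an assumption made for the example.

```sql
select distinct [wXleft].[sID]
from (  -- phrase "quick brown": 'brown' immediately follows 'quick'
        select b.[sID], b.[wordPos]
        from [ftsIndex] q
        join [ftsIndex] b
          on b.[sID] = q.[sID]
         and b.[wordPos] = q.[wordPos] + 1
        where q.[wordID] in (select [id] from [FTSwordDef] where [word] like 'quick')
          and b.[wordID] in (select [id] from [FTSwordDef] where [word] like 'brown')
     ) [wXleft]
join (  -- either "fox" or "coyote"
        select [sID], [wordPos]
        from [ftsIndex]
        where [wordID] in (select [id] from [FTSwordDef]
                           where [word] like 'fox' or [word] like 'coyote')
     ) [wXright]
  on [wXright].[sID] = [wXleft].[sID]
 and [wXright].[wordPos] >  [wXleft].[wordPos]
 and [wXright].[wordPos] <= [wXleft].[wordPos] + 10;
```

Each side is an arbitrary sub-query that yields (sID, wordPos), which is hard to keep when the two sides are flattened into aliases of the same table.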

The following takes 14 minutes (but at least it runs).
Again, this format is not conducive to the more advanced queries.

 IF OBJECT_ID(N'tempdb..#tempMatch1', N'U') IS NOT NULL   DROP TABLE #tempMatch1 
 CREATE TABLE #tempMatch1(
    [sID] [int] NOT NULL,
    [wordPos] [int] NOT NULL,
 CONSTRAINT [PK1] PRIMARY KEY CLUSTERED 
(
    [sID] ASC,
    [wordPos] ASC
))
 IF OBJECT_ID(N'tempdb..#tempMatch2', N'U') IS NOT NULL   DROP TABLE #tempMatch2 
 CREATE TABLE #tempMatch2(
    [sID] [int] NOT NULL,
    [wordPos] [int] NOT NULL,
 CONSTRAINT [PK2] PRIMARY KEY CLUSTERED 
(
    [sID] ASC,
    [wordPos] ASC
))
insert into #tempMatch1 
select [ftsIndex].[sID], [ftsIndex].[wordPos]
      from [ftsIndex] with (nolock)
      where [ftsIndex].[wordID] in 
              (select [id] from [FTSwordDef] with (nolock) 
                             where [word] like 'Brown')
        --and [wordPos] < 100000; 
   order by [ftsIndex].[sID], [ftsIndex].[wordPos]                      
insert into #tempMatch2 
select [ftsIndex].[sID], [ftsIndex].[wordPos]
      from [ftsIndex] with (nolock)
      where [ftsIndex].[wordID] in 
              (select [id] from [FTSwordDef] with (nolock) 
                             where [word] like 'Fox')
        --and [wordPos] < 100000;
   order by [ftsIndex].[sID], [ftsIndex].[wordPos]
select count(distinct(#tempMatch1.[sID]))
from #tempMatch1 
join #tempMatch2
  on #tempMatch2.[sID] = #tempMatch1.[sID]
 and #tempMatch2.[wordPos] >  #tempMatch1.[wordPos]
 and #tempMatch2.[wordPos] <= #tempMatch1.[wordPos] + 10

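A slightly different join runs in 5 seconds (and has a different query plan).
But I cannot fix the slow query with hints, because the plan moves where it performs the join.
Even with +1, there are more than 10 documents with more than 7,000 matches.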

on [wXright].[sID] = [wXleft].[sID]
and [wXright].[wordPos] =  [wXleft].[wordPos] + 1

Full table definition

CREATE TABLE [dbo].[FTSindex](
    [sID] [int] NOT NULL,
    [wordPos] [int] NOT NULL,
    [wordID] [int] NOT NULL,
    [charPos] [int] NOT NULL,
 CONSTRAINT [PK_FTSindex] PRIMARY KEY CLUSTERED 
(
    [sID] ASC,
    [wordPos] ASC
)WITH (PAD_INDEX  = ON, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 100) ON [PRIMARY]
) ON [PRIMARY]

GO

ALTER TABLE [dbo].[FTSindex]  WITH CHECK ADD  CONSTRAINT [FK_FTSindex_FTSwordDef] FOREIGN KEY([wordID])
REFERENCES [dbo].[FTSwordDef] ([ID])
GO

ALTER TABLE [dbo].[FTSindex] CHECK CONSTRAINT [FK_FTSindex_FTSwordDef]
GO

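【Question discussion】:

I don't know all of your data, but have you thought about inserting into the temp tables first and then creating the clustered indexes on them? Insert first, then create the index. That is usually faster than creating the index up front. It may help you or it may not, which is why I am adding it as a comment.
@djangojazz The insert only takes 5 seconds. If I add a sort so that the records are inserted in PK order, it is still 5 seconds.
We will need the table/key/index definitions and the (actual) query plan. Also, is there a reason for this design/approach rather than just using SQL Server Full-Text Search?
SQL Server Full-Text Search does not do "quick brown" within 10 words of "fox" or "coyote". I will add the table def. How do you post a query plan?
You have to put it on some online share and link to it. (It is the thing I like least about StackOverflow.)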
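The "insert first, then create the index" suggestion from the first comment could look like the sketch below. The index name is made up for the example, and since the OP reports that the insert itself takes only 5 seconds, this is unlikely to be where the time goes.

```sql
IF OBJECT_ID(N'tempdb..#tempMatch1', N'U') IS NOT NULL DROP TABLE #tempMatch1;

-- Load a heap first (SELECT ... INTO creates the temp table) ...
select [sID], [wordPos]
into #tempMatch1
from [ftsIndex]
where [wordID] in (select [id] from [FTSwordDef] where [word] like 'Brown');

-- ... then build the clustered index in one pass over the loaded rows.
-- (sID, wordPos) is the PK of the source table, so the index can be unique.
create unique clustered index [IX_tempMatch1] on #tempMatch1 ([sID], [wordPos]);
```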

Tags: tsql join sql-server-2008-r2

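【Solution 1】:

UPDATE:

You can still use union all if you delay filtering the "L" and "R" sides until the last part of the process, which helps the optimizer preserve the index ordering. Unfortunately, you need to retrieve all the wordIDs beforehand and use them in equality conditions. On my machine this cuts execution time to 2/3: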

  ; with o as (
    select sID, wordPos, wordID
      from FTSindex 
     where wordID = 1
   union all
    select sID, wordPos, wordID
      from FTSindex 
     where wordID = 4
   union all
    select sID, wordPos, wordID
      from FTSindex 
     where wordID = 2
 ),
 g as (
    select sID, wordPos, wordID,
           ROW_NUMBER() over (partition by [sID] order by wordPos) rn
      from o
 )
 select count(distinct(g1.sID))   --   26919 00:02 
      from g g1
      join g g2
        on g1.sID = g2.sID 
       and g1.rn  = g2.rn - 1
       and g1.wordPos >= g2.wordPos - 10 
    -- Now is the time to repartition the stream
       and g1.wordID in (1, 4)
       and g2.wordID = 2

Oh, does it really take just two seconds now?

UPDATE - 2:

; with o as (
 -- Union all resolves costly sort
    select sid, wordpos, wordid
      from FTSindex 
     where wordID = 1
     union all
    select sid, wordpos, wordID
      from FTSindex 
     where wordID = 2
),
g as (
    select sid, wordid, wordpos,
           ROW_NUMBER() over(order by sid, wordpos) rn
      from o
)
select count(distinct g1.sid)
  from g g1
 inner join g g2
    on g1.sID = g2.sID 
   and g1.rn = g2.rn - 1
 where g1.wordID = 1
   and g2.wordID = 2
   and g1.wordPos >= g2.wordpos - 10

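1 and 2 stand for the IDs of the selected words. When there are multiple hits within 10 words, the results differ from the original query: the original reports all of them, while this query shows only the closest one.

The idea is to extract only the searched words and compare the distance between two adjacent ones, where wordID 1 comes first and wordID 2 comes second.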
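A small self-contained illustration of that difference, using a table variable with made-up data (one document, wordID 1 = 'brown', wordID 2 = 'fox'). The names and values are hypothetical and only meant to show the behaviour described above.

```sql
declare @idx table (sID int not null, wordPos int not null, wordID int not null,
                    primary key (sID, wordPos));

insert into @idx (sID, wordPos, wordID)
select 1, 1, 1 union all   -- brown at position 1
select 1, 5, 1 union all   -- brown at position 5
select 1, 8, 2;            -- fox   at position 8

-- Original style: every brown/fox pair within 10 words.
-- Returns two rows: (1, 1, 8) and (1, 5, 8).
select l.sID, l.wordPos as brownPos, r.wordPos as foxPos
from @idx l
join @idx r
  on r.sID = l.sID
 and r.wordPos >  l.wordPos
 and r.wordPos <= l.wordPos + 10
where l.wordID = 1
  and r.wordID = 2;

-- Adjacent-row style: only neighbouring rows of the filtered stream are compared.
-- Returns one row: (1, 5, 8), the closest pair.
;with g as (
    select sID, wordPos, wordID,
           row_number() over (order by sID, wordPos) rn
    from @idx
    where wordID in (1, 2)
)
select g1.sID, g1.wordPos as brownPos, g2.wordPos as foxPos
from g g1
join g g2
  on g1.sID = g2.sID
 and g1.rn  = g2.rn - 1
where g1.wordID = 1
  and g2.wordID = 2
  and g1.wordPos >= g2.wordPos - 10;
```

Both forms report sID 1, so a count(distinct sID) over either gives the same answer here; only the number of matching pairs differs.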

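UPDATE - 1:

I had deleted this post because it did not perform as well as I thought. However, it fits the OP's needs better than the optimized query, as it allows searching for several words at once (a list of words found near another word, specified in the where clause).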

; with g as (
    select sid, wordid, wordpos,
           ROW_NUMBER() over(order by sid, wordpos) rn
      from FTSindex
     where wordID in (1, 2)
)
select count(distinct g1.sid)
  from g g1
 inner join g g2
    on g1.sID = g2.sID 
   and g1.rn = g2.rn - 1
 where g1.wordID = 1
   and g2.wordID = 2
   and g1.wordPos >= g2.wordpos - 10

First attempt:

There might be a way to combine cross apply with top 1.

select [wXleft].[sID], [wXleft].[wordPos]
  from [ftsIndex] wXleft with (nolock)
 cross apply 
 (
    select top 1 r.sID 
      from [ftsIndex] r 
     where r.sID = wXleft.sID 
       and r.wordPos > wxLeft.wordPos 
       and r.wordPos <= wxLeft.wordPos + 10 
       and r.wordID in
           (select [id]
              from [FTSwordDef] with (nolock) 
             where [word] like 'Fox') 
 ) wXright
 where [wXleft].[wordID] in 
       (select [id] 
          from [FTSwordDef] with (nolock) 
         where [word] like 'Brown')

Bonus pivot attempt:

; with o as (
    select sid, wordpos, wordid
      from FTSindex 
     where wordID = 1
     union all
    select sid, wordpos, wordID
      from FTSindex 
     where wordID = 2
),
g as (
    select sid, wordid, wordpos,
           ROW_NUMBER() over(order by sid, wordpos) rn
    from o
)
select sid, rn, [1], [2]
from
(
-- Collapse rns belonging to wordid 2 to ones belonging to wordid 1
-- so they appear in the same row
   select sid, wordpos, wordid, rn - case when wordid = 1 then 0 else 1 end rn
   from g
) g1
pivot (max(wordpos) for wordid in ([1], [2])) u
where [2] - [1] <= 10

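【Discussion】:

Returns the same answer as the inner loop join in 2/3 of the time. I will wait a couple of days for a miracle answer before accepting this. Thanks.
Why did you remove the other option? It was faster. I have been trying to tweak it to get more out of it. Oddly, the sorts produced by the CTE are the dominant part of the cost.
@Blam Because my timing was wrong; it took as long as my original attempt. Meanwhile I have solved the sort part, but it bothers me that Sql Server needs to execute the CTE once per reference, and there are two references. I will post a new version in a minute.
It was a bit faster in my tests, and I have been trying to figure out how to reduce that cost. The two sorts are 70% of the cost, even though the index is already in that sort order, so it seems to me it should be a cheap sort.
@Blam Please take a look at the new version.

【Solution 2】:

Well, I wish I had more information or a way to test this, but failing that, here is what I would probably try: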

 IF OBJECT_ID(N'tempdb..#tempMatch', N'U') IS NOT NULL   DROP TABLE #tempMatch
 CREATE TABLE #tempMatch(
    [sID] [int] NOT NULL,
    [wordPos] [int] NOT NULL,
    [wordID] [int] NOT NULL,
 CONSTRAINT [PK2] PRIMARY KEY CLUSTERED 
(
    [sID] ASC,
    [wordPos] ASC
))

--
;WITH cteWords As 
(
            SELECT 'Brown' as [word]
  UNION ALL SELECT 'Fox'
)
INSERT INTO #tempMatch ([sID],[wordPos],[wordID])
SELECT sID, wordPos, wordID
FROM    ftsIndex
WHERE   EXISTS
        (Select * From FTSWordDef s1
         inner join cteWords s2 ON s1.word = s2.word
         Where ftsIndex.wordID = s1.id)
;

select count(distinct(s1.[sID]))
    from #tempMatch s1
    join #tempMatch s2
        on  s2.[sID] = s1.[sID]
        and s2.[wordPos] >  s1.[wordPos]
        and s2.[wordPos] <= s1.[wordPos] + 10
    where s1.wordID = (select id from FTSWordDef w where w.word = 'Brown')
      and s2.wordID = (select id from FTSWordDef w where w.word = 'Fox')

I came up with an alternative version last night. It is the same as the query above, but with the CREATE statement changed to:

 IF OBJECT_ID(N'tempdb..#tempMatch', N'U') IS NOT NULL   DROP TABLE #tempMatch
 CREATE TABLE #tempMatch(
    [sID] [int] NOT NULL,
    [wordID] [int] NOT NULL,
    [wordPos] [int] NOT NULL,
 CONSTRAINT [PK0] PRIMARY KEY CLUSTERED 
(
    [wordID] ASC,
    [sID] ASC,
    [wordPos] ASC
))

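Let me know if any of these help.

【Discussion】:

I had to add wordID to the first constraint, and both versions throw an error at the join to cteWords.
@Blam What error? I cannot test-compile because we don't have the table definitions.
@Blam Why did you have to add wordID to the first constraint? According to your post, (sID, wordPos) should be enough, since they are the primary key of the only table I am drawing from in the INSERT..SELECT... (In fact, now that I look at it, I realize the DISTINCT is redundant and should not be there.)
All I know is that I got a PK violation. It confuses me too, since sID and wordPos are the PK on the table. The error is Msg 208, Level 16, State 1, Line 40 Invalid object name 'cteWords'.
@Blam "Invalid object name 'cteWords'" doesn't make any sense either. Something is off here.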