Find records with an empty full-text index

【Question title】: Find records with an empty full text index
【Posted】: 2014-07-28 23:26:47
【Question】:

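I store document binaries (mostly PDF files) in a SQL Server database and use the Acrobat IFilter together with full-text indexing to make the contents of the files searchable.

However, some of these PDFs were scanned with very cheap software that does not perform OCR, so they are images of documents rather than proper documents with searchable text. I would like to determine which records in the database have no searchable text so that they can be OCRed and re-uploaded.

I can get the IDs of the documents that do have at least one full-text entry by using sys.dm_fts_index_keywords_by_document. I tried joining the distinct list of IDs against the document table to find the records that don't match, but this turned out to be incredibly slow. I have about 20,000 documents (some of them hundreds of pages long), and the query ran for over 20 minutes before I cancelled it.

Is there a better way to do this?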

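【Question comments】:

I've been through this and couldn't find a better answer... my record set wasn't as large, but it still took a while. Execute it and leave for the day... I'd suggest running it as an insert statement so that all the rows are dumped into a table you can query later. Bonus points to anyone who has an answer for this.
I'm not connected to a suitable database at the moment, but I often find I can extract the SQL code for MS-supplied procs and the like. Perhaps if you do that, you can identify a useful subset of the full query that runs faster.
Just remembered I can RDP to a suitable machine. sys.dm_fts_index_keywords_by_document is under master, System, Table-valued Functions, but it can't be exported to a create function script, so no help there.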

Tags: sql-server full-text-indexing

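【Solution 1】:

I managed to come up with a solution that takes only about 2 minutes to run on a set of 40,000 documents.

1) Create a temp table to store the document_id values from sys.dm_fts_index_keywords_by_document.

2) Populate it by grouping by document_id. Nearly all documents will have at least some entries, so choose a keyword-count threshold below which the full-text index holds no meaningful information (I used 30, but most "bad" documents had only 3-5). In my particular case, the table storing the PDF binaries is PhysicalFile.

3) If needed, join the temp table to whatever other tables hold the information you want listed. In my particular case, MasterDocument contains the document title, and I also included a couple of lookup tables.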

-- Temp table for files whose full-text index holds very few keywords
create table #PhysicalFileIDs (PhysicalFileID int, KeywordCount int);

-- Count keyword rows per document and keep only those below the threshold (30 here)
insert into #PhysicalFileIDs (PhysicalFileID, KeywordCount)
    select document_id, count(keyword)
    from sys.dm_fts_index_keywords_by_document(db_id(), object_id('PhysicalFile'))
    group by document_id
    having count(keyword) < 30;

-- Join to the tables that describe each document
select MasterDocument.DocumentID, MasterDocument.Title, ProfileType.ProfileTypeDisplayName, #PhysicalFileIDs.KeywordCount
    from MasterDocument
    inner join #PhysicalFileIDs on MasterDocument.PhysicalFileID = #PhysicalFileIDs.PhysicalFileID
    inner join DocumentType on MasterDocument.DocumentTypeID = DocumentType.DocumentTypeID
    inner join ProfileType on ProfileType.ProfileTypeID = DocumentType.ProfileTypeID;

drop table #PhysicalFileIDs;
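
The query above only reports files that appear in the index with a low keyword count. If some files might be missing from the index altogether (for example, because population has not finished), a variant along the following lines may help. This is only a sketch: it assumes PhysicalFile.PhysicalFileID is the integer full-text key column, and #KeywordCounts and IndexedKeywordCount are hypothetical names chosen for illustration.

```sql
-- Sketch only: assumes PhysicalFile.PhysicalFileID is the integer full-text key column.
-- #KeywordCounts is a hypothetical temp table name.
create table #KeywordCounts (PhysicalFileID int primary key, KeywordCount int not null);

-- Read the DMV once and keep one row per indexed document.
insert into #KeywordCounts (PhysicalFileID, KeywordCount)
    select document_id, count(*)
    from sys.dm_fts_index_keywords_by_document(db_id(), object_id('PhysicalFile'))
    group by document_id;

-- The left join keeps files with no index rows at all; they are reported with a count of 0.
select pf.PhysicalFileID, coalesce(kc.KeywordCount, 0) as IndexedKeywordCount
    from PhysicalFile as pf
    left join #KeywordCounts as kc on kc.PhysicalFileID = pf.PhysicalFileID
    where coalesce(kc.KeywordCount, 0) < 30
    order by IndexedKeywordCount;

drop table #KeywordCounts;
```

The threshold of 30 is carried over from the answer and would need tuning for other data sets.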

【Comments】: