Cannot create index in mongodb, "key too large to index": answers

【Question title】: Cannot create index in mongodb, "key too large to index"
【Posted】: 2015-03-03 18:26:48
【Question】:

I am creating an index in mongodb on a collection with 10 million records, but I get the following error:

db.logcollection.ensureIndex({"Module":1})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 3,
        "ok" : 0,
        "errmsg" : "Btree::insert: key too large to index, failing play.logcollection.$Module_1 1100 { : \"RezGainUISystem.Net.WebException: The request was aborted: The request was canceled.\r\n   at System.Net.ConnectStream.InternalWrite(Boolean async, Byte...\" }",
        "code" : 17282
}

Please help me create this index in mongodb.

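【Comments】:

Try dropping the index on "Module". I think your content is too large for a normal index.
This can also be caused by having both a text index and a standard index on the same field. Dropping one of them may resolve the issue.

Tags: mongodb mongodb-query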

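【Solution 1】:

MongoDB will not create an index on a collection if the index entry for an existing document exceeds the index key limit (1024 bytes). You can, however, create a hashed index or a text index instead: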

db.logcollection.createIndex({"Module":"hashed"})

or

db.logcollection.createIndex({"Module":"text"})

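【Comments】:

Thanks, that worked for me, but the hashed index is very slow. My query is db.logcollection.find({"Module":"RezGainUI"}).count() and the count takes about 18 seconds.
Find the overly long values and shorten them where possible. Then you can create a normal index.
Sorry, I'm new to mongodb. Please guide me on how to do that.
This has nothing to do with mongodb. You need to find the text in the Module field that is longer than 1024 bytes and trim it.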
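
The trimming step suggested in the last comment could look roughly like this in Python 3. This is only a sketch: it assumes the limit is measured on the UTF-8 encoding of the value, utf8_size and truncate_utf8 are hypothetical helper names (not part of any MongoDB driver), and the sample values are made up. Reading the documents and writing the trimmed values back is left out. Because the limit covers the whole index entry, the sketch trims to 1000 bytes, which is an assumed safety margin rather than a documented number.

```python
LIMIT = 1024  # index key limit in bytes, as quoted in the answer above


def utf8_size(value):
    """Return the number of bytes the string occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))


def truncate_utf8(value, max_bytes):
    """Trim value so that its UTF-8 encoding is at most max_bytes long.

    Cutting the encoded bytes may land in the middle of a multi-byte
    character; errors="ignore" drops that incomplete trailing character.
    """
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Hypothetical sample values standing in for the Module field.
modules = ["RezGainUI", "x" * 2000, "\u00e9" * 600]

# Values that would hit the limit on a normal index.
too_long = [m for m in modules if utf8_size(m) >= LIMIT]

# Trim everything to 1000 bytes (assumed headroom below the limit).
trimmed = [truncate_utf8(m, 1000) for m in modules]
```

Note that the third sample has only 600 characters but 1200 bytes, so checking the character count alone would miss it.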

【Solution 2】:

You can silence this behavior by starting the mongod instance with the following command:

mongod --setParameter failIndexKeyTooLong=false

or by running the following command from the mongo shell:

db.getSiblingDB('admin').runCommand( { setParameter: 1, failIndexKeyTooLong: false } )

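If you are sure that your field will only rarely exceed the limit, one way to solve this is to split the offending field into parts by byte length, each part below the limit. For example, for a field val I would split it into a tuple of fields val_1, val_2 and so on. Mongo stores text as valid UTF-8, which means you need a function that can split UTF-8 strings correctly.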

def split_utf8(s, n):
    """
    Split s into chunks of at most n bytes of UTF-8 without cutting a character.
    Yields bytes; call .decode('utf-8') on a chunk to get text back.

    (s[k] & 0xc0) == 0x80 checks whether byte k is a continuation byte (part of a multi-byte character) rather than a single-byte value or the header byte that indicates how many bytes there are in a multi-byte sequence.

    An interesting aside, by the way. You can classify bytes in a UTF-8 stream as follows:

    With the high bit set to 0, it's a single byte value.
    With the two high bits set to 10, it's a continuation byte.
    Otherwise, it's the first byte of a multi-byte sequence and the number of leading 1 bits indicates how many bytes there are in total for this sequence (110... means two bytes, 1110... means three bytes, etc).
    """
    s = s.encode('utf-8')
    while len(s) > n:
        k = n
        # In Python 3, indexing bytes already gives an int, so ord() is not needed.
        # Step back until s[k] starts a character; this needs n >= 4.
        while (s[k] & 0xc0) == 0x80:
            k -= 1
        yield s[:k]
        s = s[k:]
    yield s
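
To show how the split values might end up as val_1, val_2, ... fields, here is a self-contained Python 3 sketch. The function name split_into_fields, the prefix parameter and the sample string are assumptions made for illustration; it uses the same continuation-byte check as above but returns a dict of text fields instead of yielding bytes.

```python
def split_into_fields(value, n, prefix="val"):
    """Return {prefix_1: ..., prefix_2: ...}, each part at most n UTF-8 bytes."""
    if n < 4:
        # A UTF-8 character is at most 4 bytes; a smaller n could never fit one.
        raise ValueError("n must be at least 4")
    data = value.encode("utf-8")
    parts = []
    while len(data) > n:
        k = n
        # Move the cut left while it points at a continuation byte (10xxxxxx).
        while (data[k] & 0xC0) == 0x80:
            k -= 1
        parts.append(data[:k])
        data = data[k:]
    parts.append(data)
    return {
        f"{prefix}_{i}": part.decode("utf-8")
        for i, part in enumerate(parts, start=1)
    }


# Five 3-byte characters split with a 7-byte budget per field.
fields = split_into_fields("\u952e" * 5, 7)
```

Here each field holds at most two characters, because a third one would not fit into 7 bytes, and joining the field values in order gives back the original string.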

Then you can define your compound index:

db.coll.ensureIndex({val_1: 1, val_2: 1, ...}, {background: true})

or multiple indexes, one per val_i:

db.coll.ensureIndex({val_1: 1}, {background: true})
db.coll.ensureIndex({val_2: 1}, {background: true})
...
db.coll.ensureIndex({val_i: 1}, {background: true})

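Important: if you plan to use your field in a compound index, be careful with the second argument of the split_utf8 function. For each document you need to subtract the byte sizes of all the other field values that make up the index key, e.g. for the index (a:1, b:1, val: 1) the budget for val is 1024 - sizeof(value(a)) - sizeof(value(b)).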
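
The subtraction in that note can be sketched in Python 3. This assumes the other fields in the index hold strings and counts only the UTF-8 bytes of their values. The structural overhead per entry is not known here, so it is exposed as an overhead parameter whose default of 0 is a placeholder, not a real figure. chunk_budget is a hypothetical helper name.

```python
INDEX_KEY_LIMIT = 1024  # bytes, the limit quoted in this thread


def chunk_budget(other_values, limit=INDEX_KEY_LIMIT, overhead=0):
    """Bytes left for the split field after the other index fields are counted.

    other_values: string values of the other fields in the same index key.
    overhead: assumed extra bytes per entry (placeholder, unknown here).
    """
    used = sum(len(v.encode("utf-8")) for v in other_values)
    budget = limit - used - overhead
    if budget < 4:
        # Too little room left to hold even one 4-byte UTF-8 character.
        raise ValueError("other index fields leave no room for the split field")
    return budget


# Index (a:1, b:1, val:1) with a = "abc" and b = "de".
n = chunk_budget(["abc", "de"])
```

The result would then be passed as the second argument of the splitting function, computed per document, since a and b differ from one document to the next.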

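In any other case, use a hash or text index.

【Comments】:

Creating a compound index for this does not work, because the 1024-byte limit applies to the size of the whole index key, not to each field in it.
@JohnnyHK You are right. See the Important note. I've improved it.
In my project I have 4-5 dimensional indexes, and this approach works well :)
Guys, shall we make this the accepted answer?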

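【Solution 3】:

As various people have pointed out in the answers, the error key too large to index means that you are trying to create an index on a field whose value is longer than 1024 bytes.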

In ASCII terms, 1024 bytes usually translates to about 1024 characters.
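
The ASCII qualifier matters because the limit counts bytes, not characters. A small Python 3 illustration, assuming the value is measured by its UTF-8 encoding (the sample strings are arbitrary):

```python
def utf8_bytes(text):
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


# Three strings of 1024 characters each.
samples = {
    "ascii": "a" * 1024,       # 1 byte per character
    "latin": "\u00e9" * 1024,  # e with acute accent, 2 bytes per character
    "cjk": "\u952e" * 1024,    # a CJK character, 3 bytes per character
}

sizes = {name: utf8_bytes(text) for name, text in samples.items()}
# sizes == {"ascii": 1024, "latin": 2048, "cjk": 3072}
```

So a non-ASCII value can exceed 1024 bytes with far fewer than 1024 characters.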

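There is no workaround for this, since it is an intrinsic limit set by MongoDB, as mentioned on the MongoDB Limits and Thresholds page:

The total size of an index entry, which can include structural overhead depending on the BSON type, must be less than 1024 bytes.

Switching off the failIndexKeyTooLong error (setting the parameter to false) is not a solution, as mentioned on the server parameters manual page:

...these operations would successfully insert or modify a document but the index or indexes would not include references to the document.

What that sentence means is that the offending document will not be included in the index, and may be missing from query results.

For example: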

> db.test.insert({_id: 0, a: "abc"})

> db.test.insert({_id: 1, a: "def"})

> db.test.insert({_id: 2, a: <string more than 1024 characters long>})

> db.adminCommand( { setParameter: 1, failIndexKeyTooLong: false } )

> db.test.find()
{"_id": 0, "a": "abc"}
{"_id": 1, "a": "def"}
{"_id": 2, "a": <string more than 1024 characters long>}
Fetched 3 record(s) in 2ms

> db.test.find({a: {$ne: "abc"}})
{"_id": 1, "a": "def"}
Fetched 1 record(s) in 1ms

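By forcing MongoDB to ignore the failIndexKeyTooLong error, the last query does not contain the offending document (i.e. the document with _id: 2 is missing from the result), so the query returns a wrong result set.

【Comments】:

【Solution 4】:

When you hit the "index key limit", the solution depends on the needs of your schema. Only in extremely rare cases is key matching on a value larger than 1024 bytes a design requirement. In fact, nearly all databases impose an index key limit, but it is typically somewhat configurable in legacy relational databases (Oracle/MySQL/PostgreSQL), so you can easily shoot yourself in the foot.

For quick search, a "text" index is designed to optimize searching and pattern matching on long text fields, and is well suited to that use case. More commonly, however, the requirement is a uniqueness constraint on long text values. And a "text" index does not behave like a unique scalar value with the unique flag set { unique: true } (it behaves more like an array of all the text strings in the field).

Taking inspiration from MongoDb's GridFS, uniqueness checks can easily be implemented by adding an "md5" field to the document and creating a unique scalar index on that field. It is a bit like a custom unique hashed index. This allows a virtually unlimited (~16 MB) text field length that is indexed for search and unique across the collection.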

const md5 = require('md5');
const mongoose = require('mongoose');

let Schema = new mongoose.Schema({
  text: {
    type: String,
    required: true,
    trim: true,
    set: function(v) {
        this.md5 = md5(v);
        return v;
    }
  },
  md5: {
    type: String,
    required: true,
    trim: true
  }
});

Schema.index({ md5: 1 }, { unique: true });
Schema.index({ text: "text" }, { background: true });
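
For readers not using mongoose, the same idea can be sketched in Python 3 with hashlib from the standard library. The names with_md5 and UniqueMd5Collection are hypothetical, and the class is only an in-memory stand-in for a collection with a unique index on md5, not a MongoDB API. The point it illustrates is that the indexed key is always a 32-character hex digest, however long the text is.

```python
import hashlib


def with_md5(doc):
    """Return a copy of doc with an 'md5' field derived from doc['text']."""
    digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
    return {**doc, "md5": digest}


class UniqueMd5Collection:
    """In-memory stand-in for a collection with a unique index on 'md5'."""

    def __init__(self):
        self._by_md5 = {}

    def insert(self, doc):
        doc = with_md5(doc)
        if doc["md5"] in self._by_md5:
            # Mimics the duplicate key rejection of a unique index.
            raise ValueError("duplicate text: md5 already present")
        self._by_md5[doc["md5"]] = doc
        return doc


coll = UniqueMd5Collection()
stored = coll.insert({"text": "x" * 5000})  # far longer than 1024 bytes
```

Inserting the same text a second time raises an error in this sketch, while a different text is accepted.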

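【Comments】:

【Solution 5】:

In my case I was trying to index on a large subdocument array. When I went and looked at my query, it was actually for a subproperty of a subproperty, so I changed the index to focus on just that sub-subproperty and it worked fine.

In my case, goals was the large subdocument array, the failing "key too large" index looked like {"goals": 1, "emailsDisabled": 1, "priorityEmailsDisabled": 1}, and the query looked like this: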

emailsDisabled: {$ne: true},
priorityEmailsDisabled: {$ne: true},
goals: {
  $elemMatch: {
    "topPriority.ymd": ymd,
  }
}

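Once I changed the index to {"goals.topPriority.ymd": 1, "emailsDisabled": 1, "priorityEmailsDisabled": 1}, it worked fine.

To be clear, all I am certain worked here is that it allowed me to create the index. Whether that index actually works for that query is a separate question that I have not yet answered.

【Comments】: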