pymongo - Message length is larger than server max message size: answers

【Question title】: pymongo - Message length is larger than server max message size
【Posted】: 2018-02-27 17:35:01
【Question】:

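The line for doc in collection.find({'is_timeline_valid': True}): raises the message length error. How can I fetch every document in the collection without hitting it? I know about find().limit(), but I don't know how to use it here.

Code: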

from openpyxl import load_workbook
import pymongo
import os

wb = load_workbook('concilia.xlsx')
ws = wb.active
client = pymongo.MongoClient('...')
db = client['...']
collection = db['...']

r = 2
for doc in collection.find({'is_timeline_valid': True}):
    for dic in doc['timeline']['datas']:
        if 'concilia' in dic['tramite'].lower():
            ws.cell(row=r, column=1).value = doc['id_process_unformatted']
            ws.cell(row=r, column=2).value = dic['data']
            ws.cell(row=r, column=3).value = dic['tramite']
            wb.save('concilia.xlsx')
            print('*****************************')
            print(dic['tramite'])
            # print('check!')
            r += 1

【Comments】:

Tags: python mongodb pymongo

【Answer 1】:

Here is a simple paginator that splits the query into a series of paged queries.

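from itertools import count


class PaginatedCursor(object):
    """Iterate over a cursor's results one page (skip/limit window) at a time."""

    def __init__(self, cur, limit=100):
        self.cur = cur
        self.limit = limit
        # Cursor.count() was removed in PyMongo 4.0, so this needs PyMongo 3.x.
        self.count = cur.count()

    def __iter__(self):
        # Offsets 0, limit, 2 * limit, ...
        skipper = count(start=0, step=self.limit)

        for skip in skipper:
            if skip >= self.count:
                break

            for document in self.cur.skip(skip).limit(self.limit):
                yield document

            # Reset the cursor so skip() and limit() can be applied again.
            self.cur.rewind()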

...
cur = collection.find({'is_timeline_valid': True})
...
for doc in PaginatedCursor(cur, limit=100):
   ...
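
On drivers where Cursor.count() is no longer available (it was removed in PyMongo 4.0), the same paging idea can be sketched without a count by stopping as soon as a page comes back short. This is a minimal sketch and not the answer author's code: iter_paginated and page_size are hypothetical names, and the sort on _id is an assumption added so that the skip offsets refer to a stable ordering between queries.

```python
def iter_paginated(collection, query, page_size=100):
    """Yield documents matching `query`, fetching them one page at a time.

    Hypothetical helper (not part of pymongo); `collection` is expected to
    behave like a pymongo Collection.
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    skip = 0
    while True:
        # Sort on _id so every page sees the documents in the same order.
        cursor = collection.find(query).sort("_id", 1).skip(skip).limit(page_size)
        page = list(cursor)
        for document in page:
            yield document
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            break
        skip += page_size
```

With this helper the loop header would become for doc in iter_paginated(collection, {'is_timeline_valid': True}, page_size=100):. Keep in mind that skip() gets slower as the offset grows, because the server still has to walk over the skipped documents.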

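【Comments】:

【Answer 2】:

I ran into this problem today, and it turned out to be caused by one particular document in the collection whose size exceeded the max_bson_size limit. When adding documents to a collection, make sure the document size does not exceed max_bson_size.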

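import json  # needed for json.dumps below

document_size_limit = client.max_bson_size  # client is the pymongo.MongoClient
assert len(json.dumps(data)) < document_size_limit  # JSON length only approximates BSON size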

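I'm currently investigating why the collection accepted a document larger than max_bson_size in the first place.

【Comments】:

I had the same problem, but found that client.max_message_size gave me the correct upper bound (~4MB, whereas max_bson_size is ~16MB).
Did you find an explanation? I'm hitting the same problem...
Yes. In my case one of the fields in the document was too large. Inserting worked without problems, but querying failed, possibly because the query path has an assertion that the insert path lacks (odd). Possible fixes are 1) compression, or 2) storing the large text field in blob storage and keeping a reference to it in the document. I started with 1 but recently switched to 2, since I don't actually read the text blobs very often.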
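
The compression option mentioned in the comment above could look roughly like this. It is a sketch under assumptions: the helper names and the field name big_text are hypothetical, and zlib is just one possible codec (the commenter did not say which one they used). In Python 3 the resulting bytes value can be stored in the document as binary data.

```python
import zlib


def compress_field(document, field):
    """Return a copy of `document` with the text in `field` zlib-compressed to bytes."""
    packed = dict(document)
    packed[field] = zlib.compress(document[field].encode("utf-8"))
    return packed


def decompress_field(document, field):
    """Reverse compress_field: return a copy with `field` restored to str."""
    unpacked = dict(document)
    unpacked[field] = zlib.decompress(document[field]).decode("utf-8")
    return unpacked


# Hypothetical document with one oversized, repetitive text field.
original = {
    "id_process_unformatted": "123",
    "big_text": "audiencia de conciliacao " * 10000,
}
stored = compress_field(original, "big_text")      # what would be inserted
restored = decompress_field(stored, "big_text")    # what the reader gets back
```

How much this helps depends on the data: repetitive text shrinks a lot, while content that is already compressed barely changes.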

【Answer 3】:

We can pass batch_size to find() to reduce the message size.

for doc in collection.find({'is_timeline_valid': True}):

becomes

for doc in collection.find({'is_timeline_valid': True}, batch_size=1):

【Comments】:

Thanks. In my case I could push batch_size up to 60.
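
Instead of finding a workable batch_size purely by trial and error, a rough starting value can be estimated from the size of the largest document. This is only a sketch: estimate_batch_size and its parameters are hypothetical names, it assumes that a reply carries at most batch_size documents and that the largest document size is known or can be guessed, and the headroom factor of 0.5 is arbitrary.

```python
def estimate_batch_size(max_message_bytes, largest_doc_bytes, headroom=0.5):
    """Return a batch size whose worst-case payload stays under the message limit.

    Hypothetical helper: `headroom` is the fraction of the limit we allow
    ourselves to use, leaving the rest as a safety margin.
    """
    if max_message_bytes <= 0 or largest_doc_bytes <= 0:
        raise ValueError("sizes must be positive")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    budget = int(max_message_bytes * headroom)
    # Never go below 1: a batch must contain at least one document.
    return max(1, budget // largest_doc_bytes)


# Example numbers only: a 4 MiB limit and documents of up to 32 KiB.
batch_size = estimate_batch_size(4 * 1024 * 1024, 32 * 1024)
```

The result would then be passed as collection.find({'is_timeline_valid': True}, batch_size=batch_size), with client.max_message_size as a possible source for the first argument. Treat it as a starting point to adjust, not a guarantee.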