[Title]: Dockerized MongoDB keeps crashing during long writes to a capped collection (SEGFAULT)
[Posted]: 2017-12-28 12:33:18
[Question]:

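We run a large MongoDB instance (about 500 GB of data) holding logs (non-critical data, JSON documents with a variable format), and we need to delete the oldest records periodically. We decided to move the data into a capped collection with a fixed size. So we set up a new mongo instance (MongoDB version 3.2.14, for compatibility reasons), created the collection and its indexes there, and started a job that copies documents from the old mongo to the new one in chronological order.

The script that copies the data looks like this: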

import traceback

import pymongo
from pymongo.errors import BulkWriteError

src_mongo = pymongo.MongoClient('old_mongo_ip')
src = src_mongo['db_name']['collection_name']

dst_mongo = pymongo.MongoClient('localhost')
dst = dst_mongo['db_name']['collection_name']

bulk = dst.initialize_unordered_bulk_op()
count = 0
total = 0
# Read oldest first so the capped collection ends up holding the newest documents.
for doc in src.find().sort([("collector_tstamp", 1)]):
    bulk.insert(doc)
    count += 1
    if count > 1000:
        try:
            result = bulk.execute()
            total += result['nInserted']
        except BulkWriteError as err:
            # print_exc() reports the exception being handled;
            # print_last() only works in an interactive session.
            traceback.print_exc()
            total += err.details['nInserted']
        finally:
            # A bulk object cannot be reused after execute(), so start a new one.
            bulk = dst.initialize_unordered_bulk_op()
            count = 0
    print(str(total) + "\r", end="")
# Flush the final, partially filled batch.
if count > 0:
    try:
        result = bulk.execute()
        total += result['nInserted']
    except BulkWriteError as err:
        traceback.print_exc()
        total += err.details['nInserted']
print(total)
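
For readers on a newer PyMongo, the same batching idea can be sketched with `insert_many(..., ordered=False)` instead of the bulk-operation builder. This is only an illustration, not the asker's code: the helper names `batched` and `copy_in_batches` are made up here, and error handling is left out for brevity.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items taken from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def copy_in_batches(docs, dst, batch_size=1000):
    """Insert `docs` into `dst` in unordered batches and return the count.

    `dst` is assumed to offer a PyMongo-style
    insert_many(documents, ordered=...) method, e.g. a pymongo Collection.
    """
    total = 0
    for batch in batched(docs, batch_size):
        result = dst.insert_many(batch, ordered=False)
        total += len(result.inserted_ids)
    return total
```

With the collections from the script above, this would be called as `copy_in_batches(src.find().sort([("collector_tstamp", 1)]), dst)`. Taking the source as a plain iterable keeps the helper independent of where the documents come from.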

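The problem is that this job takes a very long time (not surprising given the setup), and the new mongo keeps crashing with a SEGFAULT after a few hours of copying.

The new mongo runs in a docker container on an EC2 instance (m4.large, the same instance that runs the script above) and stores its data on an EBS volume (GP2 SSD). There is no hint as to the cause of the crash apart from the stack trace in the mongod.log file: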

2017-07-22T01:50:34.452Z I COMMAND  [conn5] command db_name.collection_name command: insert { insert: "collection_name", ordered: false, documents: 1000 } ninserted:1000 keyUpdates:0 writeConflicts:0 numYields:0 reslen:40 locks:{ Global: { acquireCount: { r: 16, w: 16 } }, Database: { acquireCount: { w: 16 } }, Collection: { acquireCount: { w: 16 } }, Metadata: { acquireCount: { w: 1000, W: 1000 } } } protocol:op_query 318ms
2017-07-22T01:50:34.930Z F -        [thread1] Invalid access at address: 0x78
2017-07-22T01:50:34.994Z F -        [thread1] Got signal: 11 (Segmentation fault).

 0x154e4f2 0x154d499 0x154de77 0x7f22f6862390 0x7f22f685c4c0 0x1bbe09b 0x1bc2305 0x1c10f7a 0x1c0bd53 0x1c0c0d7 0x1c0d9e0 0x1c74406 0x7f22f68586ba 0x7f22f658e82d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"114E4F2","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"114D499"},{"b":"400000","o":"114DE77"},{"b":"7F22F6851000","o":"11390"},{"b":"7F22F6851000","o":"B4C0","s":"__pthread_mutex_unlock"},{"b":"400000","o":"17BE09B"},{"b":"400000","o":"17C2305","s":"__wt_split_multi"},{"b":"400000","o":"1810F7A","s":"__wt_evict"},{"b":"400000","o":"180BD53"},{"b":"400000","o":"180C0D7"},{"b":"400000","o":"180D9E0","s":"__wt_evict_thread_run"},{"b":"400000","o":"1874406","s":"__wt_thread_run"},{"b":"7F22F6851000","o":"76BA"},{"b":"7F22F6488000","o":"10682D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.14", "gitVersion" : "92f6668a768ebf294bd4f494c50f48459198e6a3", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.0-1022-aws", "version" : "#31-Ubuntu SMP Tue Jun 27 11:27:55 UTC 2017", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "B04D4C2514E2C891B5791D71A8F4246ECADF157D" }, { "b" : "7FFF43146000", "elfType" : 3, "buildId" : "1AD367D8FF756A82AA298AB1CC9CD893BB5C997C" }, { "b" : "7F22F77DD000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "675F454AD6FD0B6CA2E41127C7B98079DA37F7B6" }, { "b" : "7F22F7399000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "2DA08A7E5BF610030DD33B70DB951399626B7496" }, { "b" : "7F22F7191000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "0DBB8C21FC5D977098CA718BA2BFD6C4C21172E9" }, { "b" : "7F22F6F8D000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C0C5B7F18348654040534B050B110D32A19EA38D" }, { "b" : "7F22F6C84000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "05451CB4D66C321691F64F253880B7CE5B8812A6" }, { "b" : "7F22F6A6E000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "68220AE2C65D65C1B6AAA12FA6765A6EC2F5F434" }, { "b" : "7F22F6851000", "path" : 
"/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "84538E3C6CFCD5D4E3C0D2B6C3373F802915A498" }, { "b" : "7F22F6488000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "CBFA941A8EB7A11E4F90E81B66FCD5A820995D7C" }, { "b" : "7F22F7A46000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "A7D5A820B802049276B1FC26C8E845A3E194EB6B" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x154e4f2]
 mongod(+0x114D499) [0x154d499]
 mongod(+0x114DE77) [0x154de77]
 libpthread.so.0(+0x11390) [0x7f22f6862390]
 libpthread.so.0(__pthread_mutex_unlock+0x0) [0x7f22f685c4c0]
 mongod(+0x17BE09B) [0x1bbe09b]
 mongod(__wt_split_multi+0x85) [0x1bc2305]
 mongod(__wt_evict+0x8FA) [0x1c10f7a]
 mongod(+0x180BD53) [0x1c0bd53]
 mongod(+0x180C0D7) [0x1c0c0d7]
 mongod(__wt_evict_thread_run+0xC0) [0x1c0d9e0]
 mongod(__wt_thread_run+0x16) [0x1c74406]
 libpthread.so.0(+0x76BA) [0x7f22f68586ba]
 libc.so.6(clone+0x6D) [0x7f22f658e82d]
-----  END BACKTRACE  -----

I tried searching around but could not find any possible solution. Has anyone run into a similar problem, and did you find the cause?

[Comments]:

    Tags: mongodb docker amazon-ec2


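    [Answer 1]:

    This looks a lot like the known MongoDB bug SERVER-29850, which describes exactly this behavior and was fixed in 3.2.15:

    A bug in the algorithm that performs page splits in the WiredTiger storage engine may trigger a segmentation fault, causing the node to shut down defensively to protect user data. [...]

    The bug shows up in the logs as a message similar to the following: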

    2017-06-23T19:03:29.043+0000 F -        [thread1] Invalid access at address: 0x78
    2017-06-23T19:03:29.073+0000 F -        [thread1] Got signal: 11 (Segmentation fault).
    
    ----- BEGIN BACKTRACE -----
    [...]
     mongod(+0x160C2BB) [0x1a0c2bb]
     mongod(__wt_split_multi+0x85) [0x1a105e5]
     mongod(__wt_evict+0xA55) [0x1a5eac5]
    [...]
    -----  END BACKTRACE  -----
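
A log can be checked for this pattern mechanically. The sketch below pulls the symbol names out of backtrace lines in the format shown above and reports whether both `__wt_split_multi` and `__wt_evict` appear. The function names and the regular expression are illustrative assumptions, and a match is only a hint that the crash resembles the one in the ticket, not proof of the cause.

```python
import re

# Matches frames such as " mongod(__wt_evict+0x8FA) [0x1c10f7a]".
# The symbol part may be empty, as in " mongod(+0x114D499) [0x154d499]".
FRAME_RE = re.compile(
    r"^\s*(?P<module>[\w.\-]+)\((?P<symbol>\w*)\+0x[0-9A-Fa-f]+\)"
    r"\s+\[0x[0-9A-Fa-f]+\]"
)


def frame_symbols(log_lines):
    """Return the named symbols found in backtrace frame lines, in order."""
    symbols = []
    for line in log_lines:
        match = FRAME_RE.match(line)
        if match and match.group("symbol"):
            symbols.append(match.group("symbol"))
    return symbols


def looks_like_split_crash(log_lines):
    """True if the frames include both __wt_split_multi and __wt_evict."""
    symbols = frame_symbols(log_lines)
    return "__wt_split_multi" in symbols and "__wt_evict" in symbols
```

Reading the file with `open("mongod.log")` and passing the file object to `looks_like_split_crash` is enough, since the function only iterates over lines.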
    

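    My recommendation is to upgrade from MongoDB 3.2.14 to 3.2.15.

    [Comments]:

    • I updated mongo and the process has not crashed since. It looks like that was indeed the problem. Thanks again!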
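
To find out whether a given server still needs that upgrade, its version string can be compared against 3.2.15. This is a minimal sketch with made-up helper names. It only covers the 3.2 series, because that is the only fix version this answer mentions, and it treats a suffixed build such as `3.2.15-rc0` as 3.2.15, which is an assumption.

```python
FIXED_IN_3_2 = (3, 2, 15)  # fix version for the 3.2 series, per this answer


def parse_version(version):
    """Turn '3.2.14' into (3, 2, 14), ignoring a suffix such as '-rc0'."""
    core = version.split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def needs_upgrade(version):
    """True if `version` is a 3.2.x release older than 3.2.15.

    Other release series are outside the scope of this answer,
    so the function returns False for them.
    """
    parsed = parse_version(version)
    return parsed[:2] == (3, 2) and parsed < FIXED_IN_3_2
```

With PyMongo, the version string is available as `pymongo.MongoClient('localhost').server_info()['version']`, so the check becomes `needs_upgrade(client.server_info()['version'])`.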