【问题标题】:Troubleshooting mongodb service crashmongodb服务崩溃问题排查
【发布时间】:2019-01-16 07:35:16
【问题描述】:

我在树莓派 (Linux) 上运行 python 程序,它将数据记录到 mongodb(使用 pymongo 模块)。我无法理解 mongodb 服务何时停止运行或它为什么会停止。

现在,我已经设置了我的程序功能,如果它们无法访问 mongodb(获取 pymongo 连接异常),它们会尝试重新启动服务,等待十秒钟然后重新尝试操作。这些函数是这样递归的:

def get_database_collection():
     try:
          # code to get document
          return document
     except Exception:
          # code to log exception in my log files
          start_mongo_service()
          get_database_collection()

这就是 start_mongo_service() 函数的样子:

def start_mongo_service():
     try:
          subprocess.call(["sudo", "service", "mongodb", "start"])
          time.sleep(10)
          return True
     except Exception:
          # code to log exception in my log files (Could not start_mongo_service)
          database_logger = logging.getLogger('database_thread')
          database_logger.exception("Could not start_mongo_service")
          time.sleep(10)
          return False 

现在,我知道捕获所有异常是不好的做法,但我这样做是因为我不希望我的代码崩溃,并且我会记录发生的任何异常,以便检查代码的行为。

所以昨天我的程序崩溃了,控制台上给出的错误是:MaximumRecursionDepth 超出,我假设这意味着它循环了 1000 次,但仍然无法逃脱它的异常。这是程序日志的样子:

2019-01-15 18:12:50,000 - ERROR - database_thread - Could not start_mongo_service
Traceback (most recent call last):
  File "gateway-embedded-code/database.py", line 89, in update_status_collection
AttributeError: 'NoneType' object has no attribute 'update'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pymongo/mongo_client.py", line 266, in __init__
  File "pymongo/mongo_client.py", line 641, in __find_node
pymongo.errors.AutoReconnect: could not connect to localhost:27017: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "gateway-embedded-code/database.py", line 35, in get_database_collection
  File "pymongo/mongo_client.py", line 269, in __init__
pymongo.errors.ConnectionFailure: could not connect to localhost:27017: [Errno 110] Connection timed out

上面的pymongo异常发生了很多次,我把它们缩短了在这里发布。然后我尝试在 /var/log/mongodb 中查看 mongodb 自己的日志,最后一个条目是在 1 月 15 日 17:51:54!在那之后,那天的日志中没有任何内容......我想服务停止了,我的程序也无法重新启动它,所以它在 18:12:50 崩溃了......

Tue Jan 15 17:51:54.391 [conn11125] end connection 127.0.0.1:53052 (1 connection now open)
Tue Jan 15 17:51:54.393 [initandlisten] connection accepted from 127.0.0.1:53054 #11126 (2 connections now open)
Tue Jan 15 17:51:54.408 [conn11126] end connection 127.0.0.1:53054 (1 connection now open)
Tue Jan 15 17:51:54.410 [initandlisten] connection accepted from 127.0.0.1:53056 #11127 (2 connections now open)
Wed Jan 16 04:33:04.994 [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
Wed Jan 16 04:33:04.994 [signalProcessingThread] now exiting
Wed Jan 16 04:33:04.994 dbexit:
Wed Jan 16 04:33:04.994 [signalProcessingThread] shutdown: going to close listening sockets...
Wed Jan 16 04:33:04.994 [signalProcessingThread] closing listening socket: 9
Wed Jan 16 04:33:04.994 [signalProcessingThread] closing listening socket: 10
Wed Jan 16 04:33:04.994 [signalProcessingThread] closing listening socket: 11
Wed Jan 16 04:33:04.994 [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock
Wed Jan 16 04:33:04.994 [signalProcessingThread] shutdown: going to flush diaglog...
Wed Jan 16 04:33:04.994 [signalProcessingThread] shutdown: going to close sockets...
Wed Jan 16 04:33:04.994 [signalProcessingThread] shutdown: waiting for fs preallocator...
Wed Jan 16 04:33:04.994 [signalProcessingThread] shutdown: lock for final commit...
Wed Jan 16 04:33:04.994 [signalProcessingThread] shutdown: final commit...
Wed Jan 16 04:33:04.995 [signalProcessingThread] shutdown: closing all files...
Wed Jan 16 04:33:04.995 [conn11127] end connection 127.0.0.1:53056 (1 connection now open)
Wed Jan 16 04:33:04.997 [signalProcessingThread] closeAllFiles() finished
Wed Jan 16 04:33:04.997 [signalProcessingThread] journalCleanup...
Wed Jan 16 04:33:04.997 [signalProcessingThread] removeJournalFiles
Wed Jan 16 04:33:05.053 [conn4] end connection 127.0.0.1:56470 (0 connections now open)
Wed Jan 16 04:33:05.223 [signalProcessingThread] shutdown: removing fs lock...
Wed Jan 16 04:33:05.224 dbexit: really exiting now


***** SERVER RESTARTED *****
      # Everything works fine from this point onwards

今天是 1 月 16 日,并且在我重新启动树莓派时记录了 1 月 16 日的所有消息,程序现在完全正常运行...但是当我让它继续运行并在第二天检查时,这个问题仍然存在。

我的问题是,为什么会发生这种情况? mongo服务什么时候停止?为什么我不能用我的功能重新启动它?谁能解释从日志中可以推断出什么?意外的电源断开是否会导致 mongodb 服务在启动时无法运行?请帮助我解决可能发生的问题以及如何处理它,我不希望我的程序崩溃!

抱歉,帖子太长了,我可以提供您需要的更多详细信息。

感谢您的阅读。

编辑:只是想澄清 AttributeError: Nonetype 的来源。请记住,我所说的集合只是指数据库中的一个文档。 我有一个名为 update_status_collection() 的函数:

 def update_status_collection(the_update):
       try:
            document = get_database_collection(collection_name='status_collection')
            document.update(the_update)
       except Exception:
            database_logger = logging.getLogger('database_thread')
            database_logger.exception('Could not update_status_collection')
            start_mongo_service()
            update_status_collection(the_update)

现在由于某种原因,get_database_collection 函数将 Nonetype 返回给文档变量,这就是引发 AttributeError 异常的地方,因为您无法更新 Nonetype。虽然我很好奇当 get_database_collection() 也是递归时如何返回 Nonetype ......它必须在达到 MaximumRecursionDepth 后返回 None 对吧?这是我还没有研究的东西。

更新:好的,我正在搜索系统日志以查找可疑内容,我想我找到了 linux 停止 mongo 的点,以下是 1 月 15 日的系统日志 (/var/log/syslog):

Jan 15 12:12:07 raspberrypi systemd[1]: Stopping An object/document-oriented database...
Jan 15 12:12:08 raspberrypi systemd[1]: Stopped An object/document-oriented database.
Jan 15 12:12:08 raspberrypi rc.local[463]: [967] Failed to execute script __main__
Jan 15 12:12:11 raspberrypi systemd[1]: Started An object/document-oriented database.
Jan 15 12:12:11 raspberrypi mongod[2336]: all output going to: /var/log/mongodb/mongodb.log
Jan 15 12:14:22 raspberrypi systemd[1]: Stopped target Timers.
Jan 15 12:14:22 raspberrypi systemd[1]: Stopped Daily apt upgrade and clean activities.
Jan 15 12:14:22 raspberrypi systemd[1]: Stopped target Bluetooth.
Jan 15 12:14:22 raspberrypi systemd[1]: Stopped Daily apt download activities.
Jan 15 12:14:22 raspberrypi systemd[1]: Stopping User Manager for UID 1000...
Jan 15 12:14:22 raspberrypi systemd[1]: Stopped target System Time Synchronized.
Jan 15 12:14:22 raspberrypi vncserver-x11-serviced[453]: XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
Jan 15 12:14:22 raspberrypi vncserver-x11-serviced[453]:       after 14426 requests (14426 known processed) with 0 events remaining.
Jan 15 12:14:22 raspberrypi bluetoothd[524]: Terminating
Jan 15 12:14:22 raspberrypi systemd[1]: Stopping Bluetooth service...
Jan 15 12:14:22 raspberrypi watchdog[562]: stopping daemon (5.15)
Jan 15 12:14:22 raspberrypi systemd[1]: Stopping Disk Manager...
Jan 15 12:14:22 raspberrypi udisksd[883]: udisks daemon version 2.1.8 exiting
Jan 15 12:14:22 raspberrypi systemd[1]: Stopped target Sound Card.
Jan 15 12:14:22 raspberrypi bluetoothd[524]: Stopping SDP server
Jan 15 12:14:22 raspberrypi systemd[1]: Closed Load/Save RF Kill Switch Status /dev/rfkill Watch.
Jan 15 12:14:22 raspberrypi bluetoothd[524]: Exit
Jan 15 12:14:22 raspberrypi systemd[1]: Stopping watchdog daemon...
Jan 15 12:14:22 raspberrypi systemd[1]: Stopping Save/Restore Sound Card State...
Jan 15 12:14:22 raspberrypi systemd[1]: Unmounting RPC Pipe File System...
Jan 15 12:14:22 raspberrypi systemd[1]: Stopping Authorization Manager...
Jan 15 12:14:22 raspberrypi systemd[1]: Stopped Daily Cleanup of Temporary Directories.
Jan 15 12:14:22 raspberrypi systemd[1]: Stopping Session c1 of user pi.
Jan 15 12:14:22 raspberrypi systemd[1]: Stopped Getty on tty1.
Jan 15 12:14:22 raspberrypi vncserver-x11-serviced[453]: XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
Jan 15 12:14:28 raspberrypi systemd-modules-load[111]: Inserted module 'i2c_dev'
Jan 15 12:14:28 raspberrypi systemd[1]: Started Apply Kernel Variables.
Jan 15 12:14:28 raspberrypi fake-hwclock[112]: Tue 15 Jan 12:14:24 UTC 2019
Jan 15 12:14:28 raspberrypi systemd[1]: Time has been changed
Jan 15 12:14:28 raspberrypi systemd[1]: Started Restore / save the current clock.
Jan 15 12:14:28 raspberrypi systemd-fsck[113]: e2fsck 1.43.4 (31-Jan-2017)
Jan 15 12:14:28 raspberrypi systemd[1]: Started Create Static Device Nodes in /dev.
Jan 15 12:14:28 raspberrypi systemd[1]: Starting udev Kernel Device Manager...
Jan 15 12:14:28 raspberrypi systemd-fsck[113]: /dev/mmcblk0p2: clean, 137720/939744 files, 1384356/3809792 blocks
Jan 15 12:14:28 raspberrypi systemd[1]: Started File System Check on Root Device.
Jan 15 12:14:28 raspberrypi systemd[1]: Starting Remount Root and Kernel File Systems...
Jan 15 12:14:28 raspberrypi systemd[1]: Started Remount Root and Kernel File Systems.
Jan 15 12:14:28 raspberrypi systemd[1]: Starting Load/Save Random Seed...
Jan 15 12:14:28 raspberrypi systemd[1]: Starting udev Coldplug all Devices...
Jan 15 12:14:28 raspberrypi systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 15 12:14:28 raspberrypi systemd[1]: Started Load/Save Random Seed.
Jan 15 12:14:28 raspberrypi systemd[1]: Started Flush Journal to Persistent Storage.
Jan 15 12:14:28 raspberrypi systemd[1]: Started Set the console keyboard layout.
Jan 15 12:14:28 raspberrypi systemd[1]: Reached target Local File Systems (Pre).
Jan 15 12:14:28 raspberrypi systemd[1]: Started udev Kernel Device Manager.
Jan 15 12:14:28 raspberrypi systemd[1]: Started udev Coldplug all Devices.
Jan 15 12:14:28 raspberrypi systemd[1]: Starting Show Plymouth Boot Screen...
Jan 15 12:14:28 raspberrypi systemd[1]: Found device /dev/serial1.
Jan 15 12:14:28 raspberrypi systemd[1]: Started Show Plymouth Boot Screen.
Jan 15 12:14:28 raspberrypi systemd[1]: Reached target Encrypted Volumes.
Jan 15 12:14:28 raspberrypi systemd[1]: Reached target Paths.
Jan 15 12:14:28 raspberrypi systemd[1]: Started Forward Password Requests to Plymouth Directory Watch.
Jan 15 12:14:28 raspberrypi mtp-probe: checking bus 1, device 3: "/sys/devices/platform/soc/3f980000.usb/usb1/1-1/1-1.1"
Jan 15 12:14:28 raspberrypi mtp-probe: checking bus 1, device 4: "/sys/devices/platform/soc/3f980000.usb/usb1/1-1/1-1.2"
Jan 15 12:14:28 raspberrypi mtp-probe: checking bus 1, device 5: "/sys/devices/platform/soc/3f980000.usb/usb1/1-1/1-1.4"
Jan 15 12:14:28 raspberrypi mtp-probe: bus: 1, device: 4 was not an MTP device
Jan 15 12:14:28 raspberrypi mtp-probe: bus: 1, device: 5 was not an MTP device
Jan 15 12:14:28 raspberrypi mtp-probe: bus: 1, device: 3 was not an MTP device
Jan 15 12:14:28 raspberrypi systemd[1]: Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
Jan 15 12:14:28 raspberrypi systemd[1]: Reached target Sound Card.
Jan 15 12:14:28 raspberrypi systemd[1]: Found device /dev/disk/by-partuuid/f143b93d-01.
Jan 15 12:14:28 raspberrypi systemd[1]: Starting File System Check on /dev/disk/by-partuuid/f143b93d-01...

前两行停止数据库,第三行“Failed to execute script ma​​in”来自我的程序!但随后它立即重新启动数据库......有人能理解发生了什么吗?好像有很多服务被停止然后重新启动...

【问题讨论】:

  • subprocess.call(["sudo", "service", "mongodb", "start"]) -> 这是正确的命令吗?在我的机器上,我执行“sudo service mongod start”(不是 mongodb)
  • 其实,你可以在终端中运行启动命令并检查输出吗?
  • @Neil 嘿,我已经在终端上使用了该命令并验证了它是否有效,我也在 stackoverflow 上找到了它,我会尝试找到该线程。
  • 好的。你知道这个文件是什么 ""gateway-embedded-code/database.py" 吗?你能更新你的帖子来显示你是如何记录这个异常的吗?
  • 你检查过/var/log/messages吗?我的第一个想法可能是 OOM 杀手,但它看起来像 SIGTERM 而不是 SIGKILL ......但服务不会任意获取 SIGTERM,所以我希望系统日志能有一些解释。

标签: python database mongodb raspberry-pi pymongo


【解决方案1】:

这个问题现在已经解决了,当使用 SPI 读取我在树莓派上的温度传感器时,spi.close() 没有被正确调用,因此每个 spi.open() 都会生成一个新的 spidev 文件打开。一个进程可以打开多少文件是有限制的,在我的情况下,这个限制在大约 6-7 小时内达到,之后脚本崩溃了。

同样重要的是要注意,在代码的这个区域没有实现正确的日志记录,也没有正确捕获异常,因此没有捕获或记录异常“OSError: Too many open files”,这使得这是一个非常神秘的问题。

一旦捕获到异常,就很容易修复,只需在每个打开 spi 的地方正确添加 spi.close()。

您可以检查您的进程打开了多少文件:

lsof -p <pid_of_your_process>

我的 linux 发行版上默认没有安装 lsof 命令,所以我使用了:

sudo apt install lsof

现在这与我的 pymongo 连接异常有关的原因是,为了建立连接,pymongo 模块在套接字等各种事物上调用 open() ,从而打开一个文件。我假设我的程序在六个小时后无法打开这些文件,因为它已经达到了可以打开的最大文件数,因此引发了 pymongo 异常。

希望这对某人有所帮助!

【讨论】:

    猜你喜欢
    • 2022-11-11
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2023-01-18
    • 2021-08-30
    • 2012-11-21
    • 2015-09-22
    • 1970-01-01
    相关资源
    最近更新 更多