How do I ensure that only one copy of a daemon is running? Answers

【Question title】: How do I ensure that only one copy of a daemon is running?
【Posted】: 2011-08-20 11:53:20
【Question】:

The code for my daemon is:

static int daemonize( const char *lockfile )
{
    pid_t pid, sid, parent;
    int lfp = -1;
    char buf[16];

    /* already a daemon */
    if ( getppid() == 1 ) return 1;

    /* Each copy of the daemon will try to 
     * create a file and write its process ID 
     * in it. This will allow administrators 
     * to identify the process easily
     */ 
    /* Create the lock file as the current user */
    if ( lockfile && lockfile[0] ) {
        lfp = open(lockfile,O_RDWR|O_CREAT,LOCKMODE); 
        if ( lfp < 0 ) {
            syslog( LOG_ERR, "unable to create lock file %s, code=%d (%s)",
                    lockfile, errno, strerror(errno) );
            exit(EXIT_FAILURE);
        }
    }

    /* Try to lock the file to ensure that only one copy of the
     * daemon is running. If the file is already locked, filelock()
     * fails with errno set to EACCES or EAGAIN.
     */
    if (filelock(lfp) < 0) {
        if (errno == EACCES || errno == EAGAIN) {
            close(lfp);
            //return(1);
            exit(EXIT_FAILURE);
        }
        syslog(LOG_ERR, "can't lock %s: %s", lockfile, strerror(errno));
        exit(EXIT_FAILURE);
    }
    ftruncate(lfp, 0);
    sprintf(buf, "%ld", (long)getpid());
    write(lfp, buf, strlen(buf)+1); 

    /* Drop privileges to RUN_AS_USER, if that user exists and we were started as root */
    if ( getuid() == 0 || geteuid() == 0 ) {
        struct passwd *pw = getpwnam(RUN_AS_USER);
        if ( pw ) {
            syslog( LOG_NOTICE, "setting user to " RUN_AS_USER );
            setuid( pw->pw_uid );
        }
    }

    /* Trap signals that we expect to receive */
    signal(SIGCHLD,child_handler);
    signal(SIGUSR1,child_handler);
    signal(SIGALRM,child_handler);

    /* Fork off the parent process */
    pid = fork();
    if (pid < 0) {
        syslog( LOG_ERR, "unable to fork daemon, code=%d (%s)",
                errno, strerror(errno) );
        exit(EXIT_FAILURE);
    }
    /* If we got a good PID, then we can exit the parent process. */
    if (pid > 0) {
        /* Wait for confirmation from the child via SIGUSR1 or SIGCHLD, or
           for two seconds to elapse (SIGALRM).  pause() should not return. */
        alarm(2);
        pause();

        exit(EXIT_FAILURE);
    }

    /* At this point we are executing as the child process */
    parent = getppid();

    /* Cancel certain signals */
    signal(SIGCHLD,SIG_DFL); /* A child process dies */
    signal(SIGTSTP,SIG_IGN); /* Various TTY signals */
    signal(SIGTTOU,SIG_IGN);
    signal(SIGTTIN,SIG_IGN);
    signal(SIGHUP, SIG_IGN); /* Ignore hangup signal */
    signal(SIGTERM,SIG_DFL); /* Die on SIGTERM */

    /* Change the file mode mask */
    umask(0);

    /* Create a new SID for the child process */
    sid = setsid();
    if (sid < 0) {
        syslog( LOG_ERR, "unable to create a new session, code %d (%s)",
                errno, strerror(errno) );
        exit(EXIT_FAILURE);
    }

    /* Change the current working directory.  This prevents the current
       directory from being locked; hence not being able to remove it. */
    if ((chdir("/")) < 0) {
        syslog( LOG_ERR, "unable to change directory to %s, code %d (%s)",
                "/", errno, strerror(errno) );
        exit(EXIT_FAILURE);
    }

    /* Redirect standard files to /dev/null */
    freopen( "/dev/null", "r", stdin);
    freopen( "/dev/null", "w", stdout);
    freopen( "/dev/null", "w", stderr);

    /* Tell the parent process that we are A-okay */
    kill( parent, SIGUSR1 );
    return 0;
}

I want only one instance to run at a time when I start the program with:

service [script] start

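But whenever this command is executed two or more times, it leaves the same number of daemons running. I want to get rid of this behavior. Any suggestions would be highly appreciated.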

【Comments】:

Read this answer, and be very careful about race conditions and errors.

Tags: c linux daemon

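【Answer 1】:

Don't use a file lock; instead, pass the O_EXCL flag (together with O_CREAT) to open(), which then fails with EEXIST if the file already exists. This is commonly done with the pid file, since that has to be exclusive anyway.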
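A minimal sketch of the O_EXCL approach described above. The helper name `create_pidfile_excl`, its return codes, and the 0644 mode are assumptions made for illustration; they do not come from the question's code.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: atomically create `path` and record our PID in it.
 * Returns 0 on success, 1 if the file already exists (another instance is
 * presumably running), and -1 on any other error (errno is set). */
static int create_pidfile_excl(const char *path)
{
    /* O_CREAT|O_EXCL makes "check that it does not exist" and "create it"
     * a single atomic step, so two starters cannot both succeed. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return errno == EEXIST ? 1 : -1;

    char buf[32];
    int len = snprintf(buf, sizeof buf, "%ld\n", (long)getpid());
    if (write(fd, buf, (size_t)len) != (ssize_t)len) {
        int saved = errno;
        close(fd);
        unlink(path);   /* do not leave a half-written pid file behind */
        errno = saved;
        return -1;
    }
    return close(fd) == 0 ? 0 : -1;
}
```

For the recorded PID to be the daemon's, this would have to be called in the daemon process itself, after the fork. The daemon also has to unlink() the file when it shuts down, and a file left behind by a crash will block the next start until someone removes it.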

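【Comments】:

@geekosaur But in filelock(), which uses the fcntl function, I am locking the pid file with F_WRLCK, which is already an exclusive write lock. So it should be the same as the O_EXCL mode of open().
@Sushant: That may be, but it is doing things the hard way (and is therefore more prone to bugs and race conditions). In general, I prefer the simpler approach with fewer steps that can go wrong over having to track down the odd problems you run into. Locking should be used when multiple processes interleave access to a file. (You also didn't show the filelock function, so I can't tell whether it is correct; fcntl() locking has several pitfalls.)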
@Geekosaur Here is my file-locking function:
int filelock(int fd)
{
    struct flock fl;
    fl.l_type = F_WRLCK;    /* F_RDLCK, F_WRLCK (an exclusive write lock), or F_UNLCK (unlock a region) */
    fl.l_start = 0;         /* offset in bytes, relative to l_whence */
    fl.l_whence = SEEK_SET; /* SEEK_SET, SEEK_CUR, or SEEK_END */
    fl.l_len = 0;           /* 0 means lock to EOF */
    /* fcntl() can change the properties of a file that is already open;
     * here F_SETLK sets the record lock described by the flock structure */
    return fcntl(fd, F_SETLK, &fl);
}
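@Sushant: By the way, O_EXCL is not the same thing as a file lock; it means that the file must not already exist. That is a simpler and therefore more reliable condition. And you are in fact running into one of the fcntl() locking pitfalls: locks are not inherited by child processes, so the pid file becomes unlocked when the parent exits. If you insist on file locking, you need to do it in the child, not in the parent. (You would want to do that anyway, because otherwise you save the pid of the irrelevant parent.)
Using O_EXCL is a really unreliable way to ensure that a single daemon is running. If the daemon is terminated abruptly (SIGKILL, SIGSEGV, power loss, etc.), it leaves behind a stale "lock file", and you cannot restart your daemon without manually deleting that stale lock file. I have seen this happen in production.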

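【Answer 2】:

Another alternative to a pid file is to open a TCP/UDP port from your daemon. A second instance of the daemon will then fail when it tries to open the same port.

【Comments】: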
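A rough sketch of the port-based guard, assuming a TCP socket bound to the loopback address. The helper name `claim_instance_port` and its return codes are hypothetical, and the port number would have to be chosen by the daemon's author.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: claim a loopback TCP port as a single-instance guard.
 * Returns the listening socket (keep it open for the daemon's lifetime),
 * -2 if the port is already in use, or -1 on any other error. */
static int claim_instance_port(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    /* Deliberately no SO_REUSEADDR/SO_REUSEPORT: a second bind to the
     * same address and port is supposed to fail. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 1) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return saved == EADDRINUSE ? -2 : -1;
    }
    return fd;
}
```

The kernel closes the socket when the process dies, so nothing stale is left behind after a crash. The trade-off is that any unrelated program using the same port will also prevent the daemon from starting.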