【Question title】: TCP, recv function hanging despite KEEPALIVE
【Posted】: 2015-09-13 14:59:44
【Question】:

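Does TCP keepalive (with a small timeout) keep a client from hanging in recv after the server dies?

Scenario:

The server and client run on different machines:

  1. The client connects to the server over TCP with the KEEPALIVE option set
  2. The client sends "Hello server" and waits for a response
  3. The server receives "Hello server" and responds with "Hello client"
  4. The client receives the response, sleeps for 10 seconds, and repeats steps 2-4 (step 1 is skipped from now on; the connection is kept open)

While the client is sleeping, the server is shut down. Now:

  1. The client wakes up
  2. It sends "Hello server" and waits for a response
  3. recv gives up after 20 minutes. I expected KEEPALIVE to break recv after 45 seconds

Setting the KEEPALIVE options: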

void TCPclient::setkeepalive()
{
   int optval;
   socklen_t optlen = sizeof(optval);

   /* Check the status for the keepalive option */
   if(getsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &optval, &optlen) < 0) {
        throw std::string("getsockopt");
   }

   optval = 1;
   optlen = sizeof(optval);
   if(setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &optval, optlen) < 0) {
      close(sock);
      exit(EXIT_FAILURE);
   }

    optval = 2;
    if (setsockopt(sock, SOL_TCP, TCP_KEEPCNT, &optval, optlen) < 0) {
        throw std::string("setsockopt");
    }

    optval = 15;
    if (setsockopt(sock, SOL_TCP, TCP_KEEPIDLE, &optval, optlen) < 0) {
        throw std::string("setsockopt");
    }

    optval = 15;
    if (setsockopt(sock, SOL_TCP, TCP_KEEPINTVL, &optval, optlen) < 0) {
        throw std::string("setsockopt");
    }   
}

Kernel: Linux 3.2.0-84-generic
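
Before reasoning about timing, it can help to confirm that the kernel actually accepted the keepalive settings. The following is a minimal sketch, assuming Linux and the same values as above (15 s idle, 2 probes, 15 s interval). The helper names enable_keepalive, get_int_opt and keepalive_matches are hypothetical and not part of the original code.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable keepalive on fd with the given idle time,
 * probe count and probe interval (all in seconds).
 * Returns 0 on success, -1 on error. TCP_KEEP* options are Linux-specific. */
int enable_keepalive(int fd, int idle, int cnt, int intvl)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
        return -1;
    return 0;
}

/* Hypothetical helper: read one int-valued socket option, or -1 on error. */
static int get_int_opt(int fd, int level, int name)
{
    int val = 0;
    socklen_t len = sizeof(val);

    return getsockopt(fd, level, name, &val, &len) < 0 ? -1 : val;
}

/* Returns 0 if the kernel reports exactly the expected settings, else -1. */
int keepalive_matches(int fd, int idle, int cnt, int intvl)
{
    if (get_int_opt(fd, SOL_SOCKET, SO_KEEPALIVE) <= 0)
        return -1;
    if (get_int_opt(fd, IPPROTO_TCP, TCP_KEEPIDLE) != idle ||
        get_int_opt(fd, IPPROTO_TCP, TCP_KEEPCNT) != cnt ||
        get_int_opt(fd, IPPROTO_TCP, TCP_KEEPINTVL) != intvl)
        return -1;
    return 0;
}
```

Calling keepalive_matches(sock, 15, 2, 15) right after the setup would rule out a silently rejected option as the cause of the long wait.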

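【Comments】:

  • Hmm, I would expect 45 seconds: 15 seconds before any probe is sent, then two probes 15 seconds apart, with one final probe interval (TCP_KEEPINTVL) to make sure the reply to the second probe isn't merely delayed.
  • @Joachim Pileborg Correct - edited
  • When recv "gives up", what error do you get?
  • I mean, what is the value of errno when recv fails? Use e.g. strerror to get a printable string.
  • strerror says: Connection timed out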
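
The 45-second figure in the first comment is plain arithmetic over the three keepalive settings. A small sketch of that calculation follows; the function name keepalive_detect_seconds is hypothetical, and the formula only applies while the connection is truly idle (no unacknowledged data in flight).

```c
/* Seconds from the last activity on an idle connection until keepalive
 * declares the peer dead: the idle period before the first probe, plus
 * one probe interval for each unanswered probe. */
int keepalive_detect_seconds(int idle, int probes, int interval)
{
    return idle + probes * interval;
}
```

With the values from the question, keepalive_detect_seconds(15, 2, 15) gives 45. With the usual Linux defaults (7200 s idle, 9 probes, 75 s interval) the same formula gives 7875 seconds.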

Tags: c++ linux sockets tcp network-programming


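【Solution 1】:

Keepalive becomes active only after the line has been idle for 15 seconds. In your case the keepalive idle time is 15 seconds and the sleep is 10 seconds, so the connection is never idle long enough for a probe to go out, and "Hello server" is the next thing sent after the server is killed.

Linux will then retransmit that message repeatedly. Keepalive is still not triggered, because the connection has unacknowledged data outstanding and is therefore no longer idle. The connection breaks only once the retransmission limit is reached, which takes 10-30 minutes.

【Comments】:

  • Can you provide any resources that confirm your answer?
  • This is the correct answer. I ran a test and captured it with tcpdump. I reproduced it in my answer.
  • @MichalWegorek This answer is correct. Keepalive triggers retries and retry timeouts, just like any other data transmission. Nowhere does it say that the keepalive interval equals the timeout interval, or that a single failure triggers a timeout.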
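【Solution 2】:

@MMA's answer is correct. I wrote a similar client that waits 20 seconds before writing. Once the client wakes up and sends its message, the ACK segments that keepalive had been sending are no longer sent (the connection is no longer idle).

The TCP segment is retransmitted 15 times (tcp_retries2, configured under /proc/sys/net/ipv4), with the timeout growing exponentially until it reaches ~2 minutes (in my case). After that the connection is put into an error state and the pending read or recv returns ETIMEDOUT (errno 110). In my case this took about 15 minutes; the exact time depends on the RTO. Looking at the tcpdump output, after the three-way handshake there are two ACKs (I don't know why the first of these two ACKs is there), followed by 15 segments carrying data with the PUSH flag set.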

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on p2p1, link-type EN10MB (Ethernet), capture size 65535 bytes
01:16:45.296179 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [S], seq 515423022, win 14600, options [mss 1460,sackOK,TS val 19212623 ecr 0,nop,wscale 7], length 0
E..<.a@.@......d4.....'...........9............
.%)O........
01:16:45.477983 IP ec2-52-7-150-140.compute-1.amazonaws.com.10221 > 192.168.2.100.60895: Flags [S.], seq 3672727778, ack 515423023, win 26847, options [mss 1436,sackOK,TS val 114765522 ecr 19212623,nop,wscale 7], length 0
E..<..@.-...4......d'.....`..../..h............
.....%)O....
01:16:45.478046 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [.], ack 1, win 115, options [nop,nop,TS val 19212805 ecr 114765522], length 0
E..4.b@.@......d4.....'..../..`....s.......
.%*.....
01:17:00.512812 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [.], ack 1, win 115, options [nop,nop,TS val 19227840 ecr 114765522], length 0
E..4.c@.@......d4.....'.......`....s.......
.%d.....
01:17:00.731160 IP ec2-52-7-150-140.compute-1.amazonaws.com.10221 > 192.168.2.100.60895: Flags [.], ack 1, win 210, options [nop,nop,TS val 114769336 ecr 19212805], length 0
E..4N.@.-.r.4......d'.....`..../....M......
..=..%*.
01:17:05.478933 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, ack 1, win 115, options [nop,nop,TS val 19232806 ecr 114769336], length 14
E..B.d@.@......d4.....'..../..`....s.......
.%x&..=.Hello Word :).
01:17:06.027768 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, ack 1, win 115, options [nop,nop,TS val 19233354 ecr 114769336], length 14
E..B.e@.@......d4.....'..../..`....s.......
.%zJ..=.Hello Word :).
01:17:07.120879 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, ack 1, win 115, options [nop,nop,TS val 19234448 ecr 114769336], length 14
E..B.f@.@......d4.....'..../..`....s.......
.%~...=.Hello Word :).
01:17:09.312833 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, ack 1, win 115, options [nop,nop,TS val 19236640 ecr 114769336], length 14
E..B.g@.@......d4.....'..../..`....s.......
.%. ..=.Hello Word :).
01:17:13.697663 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, ack 1, win 115, options [nop,nop,TS val 19241024 ecr 114769336], length 14
E..B.h@.@......d4.....'..../..`....s.......
.%.@..=.Hello Word :).
01:17:22.466187 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, ack 1, win 115, options [nop,nop,TS val 19249793 ecr 114769336], length 14
E..B.i@.@......d4.....'..../..`....s.......
.%....=.Hello Word :).
01:17:40.001653 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, ack 1, win 115, options [nop,nop,TS val 19267328 ecr 114769336], length 14
E..B.j@.@......d4.....'..../..`....s.......
.%....=.Hello Word :).
01:18:15.074493 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, ack 1, win 115, options [nop,nop,TS val 19302401 ecr 114769336], length 14
E..B.k@.@......d4.....'..../..`....s.......
.&....=.Hello Word :).
01:19:25.217799 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, ack 1, win 115, options [nop,nop,TS val 19372545 ecr 114769336], length 14
E..B.l@.@......d4.....'..../..`....s.......
.'....=.Hello Word :).
01:21:25.537775 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, ack 1, win 115, options [nop,nop,TS val 19492864 ecr 114769336], length 14
E..B.m@.@......d4.....'..../..`....s.......
.)p...=.Hello Word :).
01:23:25.856854 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, 69336], length 14
E..B.n@.@......d4.....'..../..`....s.......
.+F...=.Hello Word :).
01:25:26.176894 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, 69336], length 14
E..B.o@.@......d4.....'..../..`....s.......
.-....=.Hello Word :).
01:27:26.497691 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, 69336], length 14
E..B.p@.@......d4.....'..../..`....s.......
......=.Hello Word :).
01:29:26.816905 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, 69336], length 14
E..B.q@.@......d4.....'..../..`....s.......
.0....=.Hello Word :).
01:31:27.137013 IP 192.168.2.100.60895 > ec2-52-7-150-140.compute-1.amazonaws.com.10221: Flags [P.], seq 1:15, ack 1, win 115, options [nop,nop,TS val 20094464 ecr 114769336], length 14
E..B.r@.@......d4.....'..../..`....s.......
.2....=.Hello Word :).

The client code I used:

#include <sys/types.h>
#include <sys/socket.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <netinet/in.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <stdio.h>


#include <strings.h>   /* bzero */
#include <stdlib.h>
#include <netinet/tcp.h>


#define DEST_PORT 10221
#define ADDRLEN INET_ADDRSTRLEN


int main(int argc, char** argv)
{
    int sock;
    int bytesWritten;
    struct sockaddr_in their_addr;
    char buffer[] = "Hello Word :)";
    char addrstr[ADDRLEN + 1];

    if (argc != 2)
    {
       printf("ERROR - Number of args\n");
       return 10;
    }

    strncpy(addrstr, argv[1], ADDRLEN);
    addrstr[ADDRLEN] = '\0';  /* strncpy does not terminate when the source fills the buffer */

    bzero(&their_addr, sizeof(their_addr));
    their_addr.sin_family = AF_INET;
    their_addr.sin_port = htons(DEST_PORT);

    if (inet_pton(AF_INET, addrstr,(void *)&their_addr.sin_addr) != 1)
    {
        printf("ERROR - Converting Address: %d\n", errno);
        return 2;
    }

    if ((sock = socket(AF_INET, SOCK_STREAM, 0)) == -1)
    {
        printf("ERROR - Socket could not be open: %d\n", errno);
        return 1;
    }

//// Copied option setting
   int optval;
   socklen_t optlen = sizeof(optval);

   /* Check the status for the keepalive option */
   if(getsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &optval, &optlen) < 0) {
        printf("ERROR - SOL_SOCKET: %d\n", errno);
        return 19;
   }

   optval = 1;
   optlen = sizeof(optval);
   if(setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &optval, optlen) < 0) {
        printf("ERROR - SOL_SOCKET-2: %d\n", errno);
        return 20;
   }

    optval = 2;
    if (setsockopt(sock, SOL_TCP, TCP_KEEPCNT, &optval, optlen) < 0) {
        printf("ERROR - SOL_TCP: %d\n", errno);
        return 21;
    }

    optval = 15;
    if (setsockopt(sock, SOL_TCP, TCP_KEEPIDLE, &optval, optlen) < 0) {
        printf("ERROR - SOL_TCP-2: %d\n", errno);
        return 22;
    }

    optval = 15;
    if (setsockopt(sock, SOL_TCP, TCP_KEEPINTVL, &optval, optlen) < 0) {
        printf("ERROR - SOL_TCP-3: %d\n", errno);
        return 23;
    }   
/////
    if (connect(sock, (const struct sockaddr *)&their_addr, 
                (socklen_t)sizeof(their_addr)) == -1)
    {
        printf("ERROR - Could not connect to destination: %d\n", errno);
        return 3;
    }

/// Sleep 20 seconds    
    sleep(20);
    printf("About to write\n");

    if ((bytesWritten = write(sock, (const void *)buffer, sizeof(buffer))) == -1)
    {
        printf("ERROR - Sending message: %d\n", errno);
        return 4;
    }

    printf("Message Sent to Address %s, Port: %d\n\n", addrstr, DEST_PORT);

    int bytesRead;

    if ((bytesRead = read(sock, buffer, sizeof(buffer))) == -1)
    {
        printf("ERROR - Receiving message: %d\n", errno);
        return 5;
    }

    close(sock);

    return 0;
}

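I ran this test against a server hosted in AWS. To simulate the server disappearing without the client noticing, I used a public (Elastic) IP associated with the server and disassociated the Elastic IP from the server immediately after the three-way handshake. I can't paste the server code, but it is not relevant here.

Note that in this example keepalive stops because a message was sent.

【Comments】:

  • Sure - I ran into the same behavior. The question is: is keepalive disabled while recv is in progress? Please paste a link to a source that supports your answer. The experiment shows it to be true, but my understanding of keepalive is different..
  • @MichalWegorek, recv does not disable keepalive; keepalive simply stops once the connection is no longer idle. In this case keepalive stopped because write sent data. If the client had not written any data, keepalive would not have stopped.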
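
Since keepalive does not cover the case where sent data stays unacknowledged, one option worth testing is the Linux-specific TCP_USER_TIMEOUT socket option (available since Linux 2.6.37), which bounds how long transmitted data may remain unacknowledged before the kernel closes the connection with ETIMEDOUT. This option is not discussed in the thread, so treat the sketch below as an assumption to verify on your kernel. The helper name set_user_timeout is hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel to abort the connection when
 * transmitted data stays unacknowledged for `ms` milliseconds.
 * Returns 0 on success, -1 on error (errno set by setsockopt). */
int set_user_timeout(int fd, unsigned int ms)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &ms, sizeof(ms));
}
```

For the scenario in the question, set_user_timeout(sock, 45000) would be a natural value to try alongside the existing keepalive settings.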