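【Posted】: 2017-03-23 02:13:10
【Question】:
Background
I'm stress-testing my client-server application. Both ends are C++ programs that use epoll for event detection.
In this test, both run on CentOS 7 in Oracle VirtualBox 5.0.22 instances, communicating over VirtualBox's Host-Only Ethernet Adapter (type: Intel PRO/1000 MT Desktop (82540EM)).
The client opens a TCP/IP connection to the server, exchanges a few application-level handshake messages, and keeps the connection alive by sending a single ASCII space (0x20) every 10 seconds. Call this a "ping". After either side misses a certain number of expected "pings", the connection is closed.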
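The ping bookkeeping described above can be sketched roughly as follows. This is an illustration in Python, not the application's actual C++ code; the class name PingMonitor and the threshold of 3 missed pings are assumptions, since the post only says "a certain number".

```python
class PingMonitor:
    """Counts how many expected pings have been missed on one connection."""

    def __init__(self, now, interval=10.0, max_missed=3):
        self.interval = interval      # seconds between expected pings (10 s in the post)
        self.max_missed = max_missed  # assumed threshold; the post does not give one
        self.last_seen = now          # time the last ping (or the handshake) arrived

    def on_ping(self, now):
        """Record that a ping byte arrived at time `now`."""
        self.last_seen = now

    def missed(self, now):
        """Number of whole ping intervals that have elapsed without a ping."""
        return int((now - self.last_seen) // self.interval)

    def should_close(self, now):
        """True once the peer has missed `max_missed` consecutive pings."""
        return self.missed(now) >= self.max_missed


monitor = PingMonitor(now=0.0)
monitor.on_ping(10.0)
print(monitor.missed(35.0), monitor.should_close(35.0))
```

Passing the clock in as an argument keeps the logic independent of the event loop, so the same check can be driven from an epoll timeout.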
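In some situations the server may also open a connection to the client, to re-establish communication more quickly (for example after a server restart). In most configurations the client will in fact end up re-opening its own outgoing connection as well, and the server's connection will then be closed as "redundant".
This works fine on a small scale, but things fall apart when I try to simulate many clients on the network. Because the server requires each client to be on a distinct IP, for the simulation I create a number of "virtual interfaces" in 192.168.21.0/24 and use routing.
Say I'm simulating 20 clients. To set up the 12th, I run this on my client VM: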
ip link add link enp0s8 sbsim12 type macvlan
ip link set up dev sbsim12
ip addr add 192.168.21.12/24 broadcast 192.168.21.255 dev sbsim12
(enp0s8 is the VirtualBox Host-Only adapter.)
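For all 20 simulated clients, the same three commands can be generated in a loop. The sketch below is in Python and only prints the commands instead of running them. The helper name macvlan_commands is hypothetical, and it assumes that client N always gets the interface sbsimN and the address 192.168.21.N, which the post shows only for N = 12.

```python
def macvlan_commands(count, parent="enp0s8", prefix="192.168.21"):
    """Return the `ip` commands that create simulated clients 1..count."""
    if not 1 <= count <= 254:
        raise ValueError("a /24 has room for at most 254 host addresses")
    commands = []
    for n in range(1, count + 1):
        dev = f"sbsim{n}"  # assumed naming scheme, taken from the sbsim12 example
        commands.append(f"ip link add link {parent} {dev} type macvlan")
        commands.append(f"ip link set up dev {dev}")
        commands.append(
            f"ip addr add {prefix}.{n}/24 broadcast {prefix}.255 dev {dev}"
        )
    return commands


for line in macvlan_commands(20):
    print(line)
```

The printed lines can be reviewed and then piped to a root shell on the client VM.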
Then, on the server VM:
ip route add 192.168.21.0/24 dev enp0s8
My client instance can then bind to 192.168.21.12, and from then on, as far as my system is concerned, that is its IP.
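Binding an outgoing connection to one of these addresses amounts to a bind() before connect(). A minimal Python sketch is below; the real client is C++, and the function name connect_from and the 5-second timeout are assumptions for illustration.

```python
import socket


def connect_from(source_ip, server_host, server_port, timeout=5.0):
    """Open a TCP connection whose local address is `source_ip`."""
    # Port 0 lets the kernel pick an ephemeral source port,
    # as it would for an unbound client socket.
    return socket.create_connection(
        (server_host, server_port),
        timeout=timeout,
        source_address=(source_ip, 0),
    )
```

With the addresses that appear in this post, the 12th simulated client would call connect_from("192.168.21.12", "192.168.99.100", 2000).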
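The problem
This mechanism worked very well for us when our application communicated over UDP. It also works at small scale. However, as I start more and more clients at once, I begin to see strange behaviour. The symptoms vary, but the general pattern seems to be that the TCP/IP connection stalls. From copious debug output in my application, I can see the sending end correctly detecting EPOLLOUT on the socket and sending, but the receiving end occasionally never detects EPOLLIN, so the data is effectively lost. This happens every few runs, and becomes more likely as the number of clients increases.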
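For reference, the receive path in question has roughly the following shape. This is a Linux-only Python sketch, not the application's C++ code; it assumes edge-triggered registration (EPOLLET), which the post does not state, and the helper names drain and wait_and_drain are made up. With edge-triggered epoll, a reader that stops before the socket would block gets no further EPOLLIN for the data left behind, which can look like a missing event.

```python
import select
import socket


def drain(sock):
    """Read everything currently buffered on a non-blocking socket."""
    chunks = []
    while True:
        try:
            data = sock.recv(4096)
        except BlockingIOError:
            break  # kernel receive buffer is empty
        if not data:
            break  # peer closed the connection
        chunks.append(data)
    return b"".join(chunks)


def wait_and_drain(sock, timeout=1.0):
    """Wait for EPOLLIN on `sock`, then read until it would block."""
    poller = select.epoll()
    try:
        poller.register(sock.fileno(), select.EPOLLIN | select.EPOLLET)
        for _fd, events in poller.poll(timeout):
            if events & select.EPOLLIN:
                return drain(sock)
        return b""  # timed out: no EPOLLIN was reported
    finally:
        poller.close()
```

If the receiving side already drains like this and still sees no EPOLLIN, the packet capture is the better place to look.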
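Having spent a decade forensically analysing my application logic for correctness, I'm beginning to wonder whether I've hit some kind of networking bug at a lower layer, either in the macvlan realm or in the VirtualBox driver realm.
To rule that out, I need someone who knows TCP better than I do to confirm or deny that the following is indeed weird.
What on earth is going on in this packet flow?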
No. Time Source Destination Protocol Info
26496 581.345275 192.168.21.51 192.168.99.100 TCP 42551→cisco-sccp(2000) [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=377482702 TSecr=0 WS=128
26499 581.345711 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→42551 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=381905815 TSecr=377482702 WS=128
26500 581.345936 192.168.21.51 192.168.99.100 TCP 42551→cisco-sccp(2000) [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=377482703 TSecr=381905815
26516 581.349421 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→42551 [PSH, ACK] Seq=1 Ack=1 Win=29312 Len=131 TSval=381905865 TSecr=377482703
26519 581.349661 192.168.21.51 192.168.99.100 TCP 42551→cisco-sccp(2000) [ACK] Seq=1 Ack=132 Win=30336 Len=0 TSval=377482706 TSecr=381905865
26647 581.394528 192.168.21.51 192.168.99.100 TCP 42551→cisco-sccp(2000) [PSH, ACK] Seq=1 Ack=132 Win=30336 Len=131 TSval=377482751 TSecr=381905865
26648 581.394574 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→42551 [ACK] Seq=132 Ack=132 Win=30336 Len=0 TSval=381905911 TSecr=377482751
26690 581.401738 192.168.21.51 192.168.99.100 TCP 42551→cisco-sccp(2000) [PSH, ACK] Seq=132 Ack=132 Win=30336 Len=289 TSval=377482758 TSecr=381905911
26691 581.401756 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→42551 [ACK] Seq=132 Ack=421 Win=31360 Len=0 TSval=381905918 TSecr=377482758
26735 581.418696 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→42551 [PSH, ACK] Seq=132 Ack=421 Win=31360 Len=48 TSval=381905935 TSecr=377482758
26737 581.418927 192.168.21.51 192.168.99.100 TCP 42551→cisco-sccp(2000) [ACK] Seq=421 Ack=180 Win=30336 Len=0 TSval=377482776 TSecr=381905935
26749 581.432843 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→42551 [PSH, ACK] Seq=180 Ack=421 Win=31360 Len=45 TSval=381905949 TSecr=377482776
26751 581.433022 192.168.21.51 192.168.99.100 TCP 42551→cisco-sccp(2000) [ACK] Seq=421 Ack=225 Win=30336 Len=0 TSval=377482790 TSecr=381905949
26758 581.436982 192.168.21.51 192.168.99.100 TCP 42551→cisco-sccp(2000) [PSH, ACK] Seq=421 Ack=225 Win=30336 Len=819 TSval=377482793 TSecr=381905949
26793 581.476317 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→42551 [ACK] Seq=225 Ack=1240 Win=33024 Len=0 TSval=381905993 TSecr=377482793
26892 581.579434 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→42551 [PSH, ACK] Seq=225 Ack=1240 Win=33024 Len=64 TSval=381906096 TSecr=377482793
26950 581.619040 192.168.21.51 192.168.99.100 TCP 42551→cisco-sccp(2000) [ACK] Seq=1240 Ack=289 Win=30336 Len=0 TSval=377482976 TSecr=381906096
27012 581.652478 192.168.21.51 192.168.99.100 TCP 42551→cisco-sccp(2000) [PSH, ACK] Seq=1240 Ack=289 Win=30336 Len=1230 TSval=377483007 TSecr=381906096
27013 581.652520 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→42551 [ACK] Seq=289 Ack=2470 Win=35968 Len=0 TSval=381906168 TSecr=377483007
28392 590.844958 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→42551 [PSH, ACK] Seq=289 Ack=2470 Win=35968 Len=1 TSval=381915361 TSecr=377483007
28427 590.955619 192.168.21.51 192.168.99.100 TCP 42551→cisco-sccp(2000) [PSH, ACK] Seq=2470 Ack=289 Win=30336 Len=1 TSval=377492312 TSecr=381906168
28428 590.955628 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→42551 [ACK] Seq=290 Ack=2471 Win=35968 Len=0 TSval=381915472 TSecr=377492312
28457 591.077735 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive] cisco-sccp(2000)→42551 [PSH, ACK] Seq=289 Ack=2471 Win=35968 Len=1 TSval=381915594 TSecr=377492312
28494 591.161676 192.168.21.51 192.168.99.100 TCP [TCP Keep-Alive] 42551→cisco-sccp(2000) [PSH, ACK] Seq=2470 Ack=289 Win=30336 Len=1 TSval=377492518 TSecr=381906168
28495 591.161733 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive ACK] cisco-sccp(2000)→42551 [ACK] Seq=290 Ack=2471 Win=35968 Len=0 TSval=381915678 TSecr=377492518 SLE=2470 SRE=2471
28526 591.367239 192.168.21.51 192.168.99.100 TCP [TCP Keep-Alive] 42551→cisco-sccp(2000) [PSH, ACK] Seq=2470 Ack=289 Win=30336 Len=1 TSval=377492724 TSecr=381906168
28527 591.367344 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive ACK] cisco-sccp(2000)→42551 [ACK] Seq=290 Ack=2471 Win=35968 Len=0 TSval=381915883 TSecr=377492724 SLE=2470 SRE=2471
28566 591.776390 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive] cisco-sccp(2000)→42551 [PSH, ACK] Seq=289 Ack=2471 Win=35968 Len=1 TSval=381916293 TSecr=377492724
28567 591.780375 192.168.21.51 192.168.99.100 TCP [TCP Keep-Alive] 42551→cisco-sccp(2000) [PSH, ACK] Seq=2470 Ack=289 Win=30336 Len=1 TSval=377493137 TSecr=381906168
28568 591.780472 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive ACK] cisco-sccp(2000)→42551 [ACK] Seq=290 Ack=2471 Win=35968 Len=0 TSval=381916297 TSecr=377493137 SLE=2470 SRE=2471
28601 592.243918 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive] cisco-sccp(2000)→42551 [PSH, ACK] Seq=289 Ack=2471 Win=35968 Len=1 TSval=381916760 TSecr=377493137
28639 592.607472 192.168.21.51 192.168.99.100 TCP [TCP Keep-Alive] 42551→cisco-sccp(2000) [PSH, ACK] Seq=2470 Ack=289 Win=30336 Len=1 TSval=377493964 TSecr=381906168
28640 592.607575 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive ACK] cisco-sccp(2000)→42551 [ACK] Seq=290 Ack=2471 Win=35968 Len=0 TSval=381917124 TSecr=377493964 SLE=2470 SRE=2471
28729 593.177610 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive] cisco-sccp(2000)→42551 [PSH, ACK] Seq=289 Ack=2471 Win=35968 Len=1 TSval=381917694 TSecr=377493964
28826 594.259300 192.168.21.51 192.168.99.100 TCP [TCP Keep-Alive] 42551→cisco-sccp(2000) [PSH, ACK] Seq=2470 Ack=289 Win=30336 Len=1 TSval=377495616 TSecr=381906168
28827 594.259358 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive ACK] cisco-sccp(2000)→42551 [ACK] Seq=290 Ack=2471 Win=35968 Len=0 TSval=381918776 TSecr=377495616 SLE=2470 SRE=2471
28863 595.043696 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive] cisco-sccp(2000)→42551 [PSH, ACK] Seq=289 Ack=2471 Win=35968 Len=1 TSval=381919560 TSecr=377495616
29669 597.563164 192.168.21.51 192.168.99.100 TCP [TCP Keep-Alive] 42551→cisco-sccp(2000) [PSH, ACK] Seq=2470 Ack=289 Win=30336 Len=1 TSval=377498920 TSecr=381906168
29670 597.563296 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive ACK] cisco-sccp(2000)→42551 [ACK] Seq=290 Ack=2471 Win=35968 Len=0 TSval=381922079 TSecr=377498920 SLE=2470 SRE=2471
30012 598.779594 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive] cisco-sccp(2000)→42551 [PSH, ACK] Seq=289 Ack=2471 Win=35968 Len=1 TSval=381923296 TSecr=377498920
30485 604.179630 192.168.21.51 192.168.99.100 TCP [TCP Keep-Alive] 42551→cisco-sccp(2000) [PSH, ACK] Seq=2470 Ack=289 Win=30336 Len=1 TSval=377505536 TSecr=381906168
30486 604.179745 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive ACK] cisco-sccp(2000)→42551 [ACK] Seq=290 Ack=2471 Win=35968 Len=0 TSval=381928696 TSecr=377505536 SLE=2470 SRE=2471
30679 606.251285 192.168.99.100 192.168.21.51 TCP [TCP Keep-Alive] cisco-sccp(2000)→42551 [PSH, ACK] Seq=289 Ack=2471 Win=35968 Len=1 TSval=381930768 TSecr=377505536
30824 610.881089 192.168.21.51 192.168.99.100 TCP 42551→cisco-sccp(2000) [FIN, PSH, ACK] Seq=2471 Ack=289 Win=30336 Len=1 TSval=377512238 TSecr=381906168
30825 610.881786 192.168.21.51 192.168.99.100 TCP 45431→cisco-sccp(2000) [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=377512238 TSecr=0 WS=128
30826 610.881829 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→45431 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=381935398 TSecr=377512238 WS=128
30858 610.885132 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→42551 [FIN, PSH, ACK] Seq=290 Ack=2473 Win=35968 Len=1 TSval=381935401 TSecr=377512238
30937 611.883833 192.168.21.51 192.168.99.100 TCP [TCP Spurious Retransmission] 45431→cisco-sccp(2000) [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=377513240 TSecr=0 WS=128
30938 611.884005 192.168.99.100 192.168.21.51 TCP [TCP Retransmission] cisco-sccp(2000)→45431 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=381936400 TSecr=377512238 WS=128
30973 612.884024 192.168.99.100 192.168.21.51 TCP [TCP Retransmission] cisco-sccp(2000)→45431 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=381937400 TSecr=377512238 WS=128
30996 613.887453 192.168.21.51 192.168.99.100 TCP [TCP Spurious Retransmission] 45431→cisco-sccp(2000) [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=377515244 TSecr=0 WS=128
30997 613.887564 192.168.99.100 192.168.21.51 TCP [TCP Retransmission] cisco-sccp(2000)→45431 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=381938404 TSecr=377512238 WS=128
31123 616.083906 192.168.99.100 192.168.21.51 TCP [TCP Retransmission] cisco-sccp(2000)→45431 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=381940600 TSecr=377512238 WS=128
31195 617.395119 192.168.21.51 192.168.99.100 TCP [TCP Spurious Retransmission] 42551→cisco-sccp(2000) [FIN, PSH, ACK] Seq=2470 Ack=289 Win=30336 Len=2 TSval=377518752 TSecr=381906168
31196 617.395213 192.168.99.100 192.168.21.51 TCP [TCP Dup ACK 30858#1] cisco-sccp(2000)→42551 [ACK] Seq=292 Ack=2473 Win=35968 Len=0 TSval=381941911 TSecr=377518752 SLE=2470 SRE=2473
31197 617.891274 192.168.21.51 192.168.99.100 TCP [TCP Spurious Retransmission] 45431→cisco-sccp(2000) [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=377519248 TSecr=0 WS=128
31198 617.891377 192.168.99.100 192.168.21.51 TCP [TCP Retransmission] cisco-sccp(2000)→45431 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=381942407 TSecr=377512238 WS=128
31358 621.211512 192.168.99.100 192.168.21.51 TCP [TCP Retransmission] cisco-sccp(2000)→42551 [FIN, PSH, ACK] Seq=289 Ack=2473 Win=35968 Len=2 TSval=381945728 TSecr=377518752
31392 622.484650 192.168.99.100 192.168.21.51 TCP [TCP Retransmission] cisco-sccp(2000)→45431 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=381947001 TSecr=377512238 WS=128
31465 625.907246 192.168.21.51 192.168.99.100 TCP [TCP Spurious Retransmission] 45431→cisco-sccp(2000) [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=377527264 TSecr=0 WS=128
31466 625.907346 192.168.99.100 192.168.21.51 TCP [TCP Retransmission] cisco-sccp(2000)→45431 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=381950423 TSecr=377512238 WS=128
31847 634.085643 192.168.99.100 192.168.21.51 TCP [TCP Retransmission] cisco-sccp(2000)→45431 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=381958602 TSecr=377512238 WS=128
32326 641.938500 192.168.21.51 192.168.99.100 TCP [TCP Spurious Retransmission] 45431→cisco-sccp(2000) [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=377543296 TSecr=0 WS=128
32327 641.938568 192.168.99.100 192.168.21.51 TCP [TCP Retransmission] cisco-sccp(2000)→45431 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=381966455 TSecr=377512238 WS=128
32458 643.859279 192.168.21.51 192.168.99.100 TCP [TCP Spurious Retransmission] 42551→cisco-sccp(2000) [FIN, PSH, ACK] Seq=2470 Ack=289 Win=30336 Len=2 TSval=377545216 TSecr=381906168
32459 643.859394 192.168.99.100 192.168.21.51 TCP [TCP Dup ACK 30858#2] cisco-sccp(2000)→42551 [ACK] Seq=292 Ack=2473 Win=35968 Len=0 TSval=381968375 TSecr=377545216 SLE=2470 SRE=2473
32861 651.099614 192.168.99.100 192.168.21.51 TCP [TCP Retransmission] cisco-sccp(2000)→42551 [FIN, PSH, ACK] Seq=289 Ack=2473 Win=35968 Len=2 TSval=381975616 TSecr=377545216
33374 658.088603 192.168.99.100 192.168.21.51 TCP [TCP Retransmission] cisco-sccp(2000)→45431 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=381982605 TSecr=377512238 WS=128
34426 674.002725 192.168.21.51 192.168.99.100 TCP [TCP Spurious Retransmission] 45431→cisco-sccp(2000) [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=377575360 TSecr=0 WS=128
34433 674.004602 192.168.99.100 192.168.21.51 TCP cisco-sccp(2000)→45431 [RST, ACK] Seq=668009898 Ack=1 Win=0 Len=0
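One way to sanity-check the SEQ and ACK numbers is to redo the arithmetic on a few frames. The Python sketch below copies eight segments from the trace above and asks two questions: how many bytes has each side sent that its peer has not acknowledged, and which frame's TSval is each side still echoing in TSecr. The helper names are made up, and the TSecr lookup is only a rough proxy for "last segment the receiver's TCP stack accepted". It is a reading of the numbers, not a diagnosis.

```python
from collections import namedtuple

Segment = namedtuple("Segment", "frame sender seq length ack tsval tsecr")

# Copied from the trace above (client = 192.168.21.51, server = 192.168.99.100).
SEGMENTS = [
    Segment(27012, "client", 1240, 1230, 289, 377483007, 381906096),
    Segment(27013, "server", 289, 0, 2470, 381906168, 377483007),
    Segment(28392, "server", 289, 1, 2470, 381915361, 377483007),
    Segment(28427, "client", 2470, 1, 289, 377492312, 381906168),
    Segment(28428, "server", 290, 0, 2471, 381915472, 377492312),
    Segment(28457, "server", 289, 1, 2471, 381915594, 377492312),
    Segment(28494, "client", 2470, 1, 289, 377492518, 381906168),
    Segment(28495, "server", 290, 0, 2471, 381915678, 377492518),
]


def peer_of(sender):
    return "server" if sender == "client" else "client"


def unacked_bytes(segments, sender):
    """Bytes sent by `sender` that the peer's highest ACK does not cover.

    SYN and FIN also consume one sequence number each; none of the
    segments listed here carries either flag, so they are ignored.
    """
    highest_sent = max(s.seq + s.length for s in segments if s.sender == sender)
    highest_acked = max(s.ack for s in segments if s.sender == peer_of(sender))
    return highest_sent - highest_acked


def last_frame_echoed_by(segments, receiver):
    """Frame whose TSval `receiver` echoes in its most recent TSecr."""
    latest = [s for s in segments if s.sender == receiver][-1]
    for s in segments:
        if s.sender == peer_of(receiver) and s.tsval == latest.tsecr:
            return s.frame
    return None


print("unacked from client:", unacked_bytes(SEGMENTS, "client"))
print("unacked from server:", unacked_bytes(SEGMENTS, "server"))
print("client still echoes frame:", last_frame_echoed_by(SEGMENTS, "client"))
```

On this subset the server has acknowledged everything the client sent, while the client's Ack stays at 289 and its TSecr stays at the TSval of frame 27013. That pattern would be consistent with server-to-client segments not reaching the client's TCP stack after that frame. As far as I know, Wireshark labels a segment "TCP Keep-Alive" when it carries zero or one byte and starts one byte below the next expected sequence number, so a retransmitted 1-byte ping can receive that label without SO_KEEPALIVE being involved.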
【Discussion】:
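-
It is almost never a bug in the lower layers. I would check that epoll is being used correctly, because that API is quite tricky (why not use libevent/libev/libuv instead?). I would also look at all those retransmissions; perhaps packets are being lost.
-
@o9000: In the virtualised world it is not that uncommon. I'm now very confident that I'm using epoll correctly. I'm currently isolating the problem in code, but I was hoping someone could confirm my suspicion that some of the SEQ and ACK numbers above make no sense, and that the Keep-Alive flags should not all be showing up on their own.
-
If the TCP stream looks fine (I'm not a network engineer :P), then I can refocus my efforts.
-
"The server can also open a connection to the client" - do you mean the client also has a port open for listen and the roles switch?
-
Keep-alives should start no sooner than one second after the line goes idle. Your tcpdump shows keep-alives starting within a second of the last "ping" hitting the wire. Can you remove the keep-alive settings from your application entirely and try again?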
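Checking and clearing kernel-level keep-alive on a socket looks roughly like this in Python; the C++ equivalent is the same getsockopt/setsockopt pair. The helper names are made up, and the post does not say whether the application sets SO_KEEPALIVE at all: the "pings" it describes are ordinary application bytes, which this option does not affect.

```python
import socket


def keepalive_enabled(sock):
    """True if kernel TCP keep-alive probes are enabled on `sock`."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


def disable_keepalive(sock):
    """Turn kernel TCP keep-alive probes off for `sock`."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 0)
```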
Tags: networking tcp virtualbox epoll