send() via blocking socket timed out but the message arrived at the destination thereafter

【Question title】: send() via blocking socket timed out but the message arrived at the destination thereafter
【Posted】: 2016-11-02 11:55:31
【Question description】:

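I am dealing with a distributed "deadlock" situation in a peer-to-peer communication system (written in and running on Python 3.5). In this system, each node maintains two connections to every other node, called inconn and outconn. I use select.poll() for multiplexing. As a result, the following deadlock sometimes occurs: if two connected peers both try to send data to each other over their outconn, each peer's select.poll() loop blocks in send(), so neither peer can recv() on the inconn end of the other connection.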

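My approach to this deadlock is to call settimeout() on the outconn socket, which seems to work. Interestingly, though, the message still seems to reach its destination after the socket times out. Here are sample logs from the two nodes: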

Node A (192.168.56.109)

INFO: [2016-11-02 11:08:05,172] [COOP] Send ASK_COOP [2016-11-02 11:08:05.172643] to 192.168.56.110 for segment 2.

WARNING: [2016-11-02 11:08:06,173] [COOP] Cannot send to 192.168.56.110. Error: timed out

INFO: [2016-11-02 11:08:06,174] [COOP] Message from 192.168.56.110 will be posted on 10.

INFO: [2016-11-02 11:08:06,174] [COOP] Get HEARTBEAT [2016-11-02 11:08:04.503723] from 192.168.56.110 for segment 2.

Node B (192.168.56.110)

INFO: [2016-11-02 11:08:04,503] [COOP] Send HEARTBEAT [2016-11-02 11:08:04.503723] to 192.168.56.109 for segment 2.

WARNING: [2016-11-02 11:08:05,505] [COOP] Cannot send to 192.168.56.109. Error: timed out

INFO: [2016-11-02 11:08:05,505] [COOP] Message from 192.168.56.109 will be posted on 11.

INFO: [2016-11-02 11:08:05,505] [COOP] Get ASK_COOP [2016-11-02 11:08:05.172643] from 192.168.56.109 for segment 2.

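May I know why this happens? Also, is my way of handling this deadlock good practice? If not, what is the best practice for avoiding this kind of distributed deadlock?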
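The timeout workaround described in the question can be sketched roughly as follows. This is only an illustration under stated assumptions: the question does not show its code, so the helper names `send_with_timeout` and `drain`, the use of `sendall()`, the 0.2 second timeout, and the local `socket.socketpair()` standing in for a real outconn are all hypothetical. The sketch shows one behavior that may be relevant here: when `sendall()` times out, any bytes the kernel accepted before the timeout are still delivered to the peer. Whether that is what happened in the logs above is not something the logs alone confirm.

```python
import socket


def send_with_timeout(sock, payload, timeout_s):
    """Return True if sendall() finished, False if it timed out."""
    sock.settimeout(timeout_s)
    try:
        sock.sendall(payload)
        return True
    except socket.timeout:
        # Some of the payload may already have been handed to the kernel.
        return False


def drain(sock):
    """Read everything currently queued on sock without blocking."""
    sock.setblocking(False)
    total = 0
    while True:
        try:
            chunk = sock.recv(65536)
        except BlockingIOError:
            return total  # nothing more is queued right now
        if not chunk:
            return total  # peer closed the connection
        total += len(chunk)


# Hypothetical stand-in for outconn: a local socket pair whose peer does
# not read while we send, so the kernel buffers fill up.
outconn, peer = socket.socketpair()
payload = b"x" * (8 * 1024 * 1024)

completed = send_with_timeout(outconn, payload, 0.2)
delivered = drain(peer)
print("completed:", completed, "bytes delivered anyway:", delivered)

outconn.close()
peer.close()
```

After a timeout the sender cannot tell how much of the message went out, so the stream may be left with a truncated message in it; any framing layer on top has to cope with that.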

【Question discussion】:

Tags: python sockets network-programming deadlock p2p

【Solution 1】:

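In my experience, the best practice for avoiding this problem is to always use non-blocking I/O. If your application never blocks inside send() or recv(), the deadlock cannot occur (at least not the kind you describe).

Of course, non-blocking I/O has complications of its own; in particular, your code needs to handle partial sends and partial receives correctly. In practice, that means your application's event loop would probably look something like this (pseudocode):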

while true:
   block in select() until at least one socket is ready-for-read (or ready-for-write, if you have data you want to send on that socket)

   for each ready-for-read socket:
      read as many bytes as you can (without blocking) into a FIFO receive buffer that you have associated with that socket
      parse as many complete messages as you can out of the beginning of the FIFO buffer
      (pop the parsed bytes out of the FIFO when you're done with them)

   for each ready-for-write socket:
      send as many bytes as you can (without blocking) from a FIFO send buffer that you have associated with that socket
      (pop the sent bytes out of the FIFO when you're done with them)

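In this design, whenever your application generates new data to send on a socket, it should not call send() directly. Instead, it should append the data to the end of the FIFO send buffer associated with that socket, and the event loop above will send it as soon as possible (after any data already in the FIFO, of course), without blocking the event loop from carrying out any of its other duties.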
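As a rough Python illustration of the design described above, here is a minimal sketch. Several things in it are assumptions rather than part of the answer: the `Conn` class, the `queue_message` and `run_once` names, and the message framing (a 4-byte big-endian length prefix) are all hypothetical, and error handling for reset connections is left out. It uses `select.select()` for portability; the same structure should carry over to `select.poll()`.

```python
import select
import socket
import struct

# Assumed framing (not from the answer): 4-byte big-endian length, then payload.
HEADER = struct.Struct("!I")


class Conn:
    """A non-blocking socket plus its FIFO receive and send buffers."""

    def __init__(self, sock):
        sock.setblocking(False)
        self.sock = sock
        self.recv_buf = bytearray()  # bytes received but not yet parsed
        self.send_buf = bytearray()  # bytes queued but not yet sent
        self.inbox = []              # complete messages, in arrival order
        self.closed = False

    def fileno(self):
        return self.sock.fileno()    # lets select() accept Conn objects

    def queue_message(self, payload):
        """Append a framed message to the send FIFO; never calls send()."""
        self.send_buf += HEADER.pack(len(payload)) + payload

    def on_readable(self):
        try:
            chunk = self.sock.recv(65536)
        except BlockingIOError:
            return
        if not chunk:                # peer closed the connection
            self.closed = True
            return
        self.recv_buf += chunk
        # Pop every complete message off the front of the FIFO.
        while len(self.recv_buf) >= HEADER.size:
            (length,) = HEADER.unpack_from(self.recv_buf)
            end = HEADER.size + length
            if len(self.recv_buf) < end:
                break                # partial message, wait for more bytes
            self.inbox.append(bytes(self.recv_buf[HEADER.size:end]))
            del self.recv_buf[:end]

    def on_writable(self):
        try:
            sent = self.sock.send(self.send_buf)  # may be a partial send
        except BlockingIOError:
            return
        del self.send_buf[:sent]     # keep only the unsent tail


def run_once(conns, timeout=0.1):
    """One pass of the event loop over the given connections."""
    readers = [c for c in conns if not c.closed]
    if not readers:
        return
    writers = [c for c in readers if c.send_buf]  # only those with queued data
    readable, writable, _ = select.select(readers, writers, [], timeout)
    for c in readable:
        c.on_readable()
    for c in writable:
        if not c.closed:
            c.on_writable()


# Both peers queue a large message for each other at the same time,
# which is the situation that deadlocked with blocking send().
sock_a, sock_b = socket.socketpair()
a, b = Conn(sock_a), Conn(sock_b)
msg_a, msg_b = b"A" * (1 << 20), b"B" * (1 << 20)
a.queue_message(msg_a)
b.queue_message(msg_b)
while not (a.inbox and b.inbox):
    run_once([a, b])
print(len(a.inbox[0]), len(b.inbox[0]))
sock_a.close()
sock_b.close()
```

For brevity both endpoints live in one process and share one loop; in the peer-to-peer setting each node would run its own loop over its inconn and outconn sockets. A socket is registered for writing only while its send FIFO is non-empty, since an idle socket is almost always writable and would otherwise make `select()` return immediately on every pass.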

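In the worst case (a very slow TCP connection on which you want to send a lot of data), the FIFO may grow large (using extra memory), but it will never "deadlock".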

【Discussion】: