Stop and start Erlang tracer without losing trace events: answers

【Question title】: Stop and start Erlang tracer without losing trace events
【Posted】: 2016-10-13 18:11:50
【Question description】:

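I have a question about tracers in Erlang and how to turn them on and off without losing any trace events. Suppose I have a process P1 that is being traced with the send and 'receive' trace flags, like so: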

    erlang:trace(P1Pid, true, [set_on_spawn, send, 'receive', {tracer, T1Pid}])

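Because the set_on_spawn flag is specified, as soon as P1 spawns a (child) process P2, the same flags (i.e. set_on_spawn, send, 'receive') also apply to P2. Now suppose I want to create a new tracer for P2, so that tracer T1 handles the traces from P1 while tracer T2 handles the traces from P2. To do this (since Erlang allows only one tracer per process), I first need to unset the trace flags (i.e. set_on_spawn, send, 'receive') on P2, which were inherited automatically because of set_on_spawn, and then set them again on P2, like so: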

    % Unset trace flags on P2. 
    erlang:trace(P2Pid, false, [set_on_spawn, send, 'receive']),

    % We might lose trace events at this instant which were raised
    % by process P2 while un-setting the tracer on P2 and setting
    % it again.

    % Now set again trace flags on P2, directing the trace to 
    % a new tracer T2.
    erlang:trace(P2Pid, true, [set_on_spawn, send, 'receive', {tracer, T2Pid}]),

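In the window between unsetting the trace flags and setting them again, some trace events raised by process P2 may be lost because of the race condition there.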
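The inherited state that makes this switch necessary can be checked with erlang:trace_info/2. The following is only a minimal sketch: the module name trace_inherit_demo, the function run/0 and the throwaway processes are hypothetical names made up for illustration, not part of my real system.

```erlang
-module(trace_inherit_demo).
-export([run/0]).

%% Hypothetical demo: returns true when a child spawned by a traced
%% process reports the same tracer and flags as its parent.
run() ->
    Self = self(),
    %% T1: a tracer that simply discards every trace message it receives.
    T1 = spawn(fun Loop() -> receive _ -> Loop() end end),
    %% P1: waits for a go signal, spawns P2 and reports P2's pid to us.
    P1 = spawn(fun() ->
                   receive go -> ok end,
                   Child = spawn(fun() -> receive stop -> ok end end),
                   Self ! {child, Child},
                   receive stop -> ok end
               end),
    %% Trace P1 exactly as in the question; erlang:trace/3 returns the
    %% number of processes that matched.
    1 = erlang:trace(P1, true, [set_on_spawn, send, 'receive', {tracer, T1}]),
    P1 ! go,
    P2 = receive {child, Pid} -> Pid after 1000 -> exit(timeout) end,
    %% P2 was never passed to erlang:trace/3, yet it is traced by T1.
    {tracer, Tracer} = erlang:trace_info(P2, tracer),
    {flags, Flags} = erlang:trace_info(P2, flags),
    Result = Tracer =:= T1 andalso lists:member(send, Flags),
    [exit(P, kill) || P <- [P1, P2, T1]],
    Result.
```

Calling trace_inherit_demo:run() should return true, which confirms that P2 starts out attached to T1 and has to be detached before T2 can take over.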

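My question is: can this be achieved without losing trace events?

Does Erlang provide a way to perform this "tracer switch" (i.e. from T1 to T2) atomically?

Alternatively, is it possible to pause the Erlang VM, and tracing along with it, so that no trace events are lost?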

【Question discussion】:

Tags: erlang trace

【Solution 1】:

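I looked into this problem further and may have found a semi-acceptable (see below) partial workaround. After reading the Erlang documentation, I came across the erlang:suspend_process/1 and erlang:resume_process/1 BIFs. Using these two, I can achieve the desired behavior like this: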

    % Suspend process P2. According to the Erlang docs, this function
    % blocks the caller (i.e. the current tracer) until P2 is suspended.
    % This way, we do not lose trace events.
    erlang:suspend_process(P2Pid),

    % Unset trace flags on P2.
    erlang:trace(P2Pid, false, [set_on_spawn, send, 'receive']),

    % We should not lose any trace events from P2, since it is
    % currently suspended, and therefore cannot generate any.
    % However, we can still lose receive trace events that are
    % generated as a result of other processes sending messages
    % to P2.

    % Now set again trace flags on P2, directing the trace to
    % a new tracer T2.
    erlang:trace(P2Pid, true, [set_on_spawn, send, 'receive', {tracer, T2Pid}]),

    % Finally, resume process P2, so that we can receive any trace
    % messages generated by P2 on the new tracer T2.
    erlang:resume_process(P2Pid).
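The same sequence can be wrapped in try ... after so that the process is resumed even if one of the erlang:trace/3 calls fails. This is only a sketch under assumptions: the module name tracer_switch and the functions switch_tracer/2 and demo/0 are hypothetical, and, as in the snippet above, the caller is assumed to be the current tracer of the process.

```erlang
-module(tracer_switch).
-export([switch_tracer/2, demo/0]).

%% Sketch: move Pid's trace flags to NewTracer while Pid is suspended.
%% Assumption: the calling process is Pid's current tracer.
switch_tracer(Pid, NewTracer) ->
    Flags = [set_on_spawn, send, 'receive'],
    %% Blocks until Pid is actually suspended.
    true = erlang:suspend_process(Pid),
    try
        erlang:trace(Pid, false, Flags),
        erlang:trace(Pid, true, [{tracer, NewTracer} | Flags])
    after
        %% Runs whether or not the calls above raised an exception.
        erlang:resume_process(Pid)
    end.

%% Hypothetical usage: the caller plays the role of T1.
demo() ->
    Idle = fun() -> receive stop -> ok end end,
    P2 = spawn(Idle),
    T2 = spawn(Idle),
    %% No {tracer, _} option, so the caller becomes the tracer of P2.
    1 = erlang:trace(P2, true, [set_on_spawn, send, 'receive']),
    1 = switch_tracer(P2, T2),
    Result = erlang:trace_info(P2, tracer) =:= {tracer, T2},
    [exit(P, kill) || P <- [P2, T2]],
    Result.
```

The wrapper does not close the gap for 'receive' events caused by other processes sending messages to P2; it only guards against leaving P2 suspended.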

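My only three concerns with this approach are:

1. The Erlang documentation for erlang:suspend_process/1 and erlang:resume_process/1 states explicitly that they are intended for debugging purposes only. My question is: why can they not be used in production, given that, as the example shows, we risk losing trace events (when switching from tracer T1 to tracer T2) unless process P2 is suspended?
2. We are effectively messing with the system (i.e. we are interfering with its scheduling). Are there risks associated with this (other than the possibility of forgetting to call erlang:resume_process/1 on a previously suspended process)?
3. More importantly, even though we can stop process P2 from taking any action, we cannot stop other processes from sending messages to P2. These messages result in {trace, Pid, 'receive', ...} trace events, which may be lost while we are switching tracers. Is there a way to avoid this?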

NB: if P' (the process that called erlang:suspend_process/1) dies, the process P that was previously suspended by P' is resumed automatically.
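A small experiment illustrates this note. It is a sketch only: the module suspend_demo, the echo/0 loop and the timeout values are hypothetical choices made for the illustration.

```erlang
-module(suspend_demo).
-export([run/0]).

%% Echo loop: answers every ping with a pong.
echo() ->
    receive {ping, From} -> From ! pong, echo() end.

run() ->
    Self = self(),
    P = spawn(fun echo/0),
    %% P': suspends P, reports back, then waits until told to exit.
    Suspender = spawn(fun() ->
                          true = erlang:suspend_process(P),
                          Self ! suspended,
                          receive die -> ok end
                      end),
    receive suspended -> ok after 1000 -> exit(timeout) end,
    P ! {ping, Self},
    %% While P is suspended, the ping sits in its mailbox unanswered.
    WhileSuspended = receive pong -> replied after 200 -> no_reply end,
    %% Let P' terminate without ever calling erlang:resume_process/1.
    Suspender ! die,
    %% Once P' has exited, P runs again and answers the queued ping.
    AfterExit = receive pong -> replied after 1000 -> no_reply end,
    exit(P, kill),
    {WhileSuspended, AfterExit}.
```

I would expect suspend_demo:run() to return {no_reply, replied}: P stays silent while it is suspended and answers once P' is gone.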

【Discussion】: