OutOfMemoryException by TaskScheduler in highly concurrent async ASP.NET Core application: answers

【Question title】: OutOfMemoryException by TaskScheduler in highly concurrent async asp.net core application
【Posted】: 2019-11-20 02:05:13
【Question description】:

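In a dotnet core 2.2 REST service hosted on AWS ECS FARGATE (docker), we are seeing OutOfMemoryException crashes even though ECS reports a maximum memory usage of 11% (out of 16 GB). The crash always comes from TaskScheduler (stack trace below). It only happens in production.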

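I'm looking for suggestions on how to troubleshoot this. (Edit: I don't think this is actually an OutOfMemory problem, unless Thread:StartInternal() suddenly uses 90% of the 16 GB faster than the AWS monitoring tools can register it.)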

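The application runs fine locally on Windows 10. I have also tried to reproduce the crash on a separate ECS cluster (our test cluster) by sustaining 100 concurrent requests, but with no luck. One endpoint of the service receives more than 99% of the requests. The basic operation is:

1. Try to find some documents (based on the input) in a MongoDB database, using async/await
2. Get data from WCF (synchronous, see below)
3. For some of the results, fetch data from external URLs (sometimes slow) with System.Net.WebRequest, using async/await
4. Return the result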

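The WCF service is called synchronously because we use a client library on top of WCF that is not async-safe. However, the result is stored in a MemoryCache for 1 minute, and re-fetching on expiry is guarded with AsyncEx.AsyncMonitor so that only one caller is allowed to refresh the cache, like this: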

using( await _monitor.EnterAsync( ) )
{
    if( !Cache.TryGetValue( "UserLookup", out LookupUsers lookupUsers ) )
    {
        lookupUsers = await GetCachedUsers( ssoToken );
        Cache.Set( "UserLookup", lookupUsers, TimeSpan.FromMinutes( 1 ) );
    }
    return lookupUsers;
}

GetCachedUsers() does this:

var users = await Task.Run( ( ) => client.Proxy.ListUsers( new ListUsersInput { } ) );

It also returns a default value in case of a timeout or other problems.
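The question does not show how GetCachedUsers() handles the timeout, so the following is only a minimal sketch of one way to wrap a blocking call with a timeout and a default value. The names BlockingCallHelper and RunWithFallbackAsync are hypothetical and do not come from the original code.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper; not part of the original application.
public static class BlockingCallHelper
{
    // Runs a blocking call on the thread pool and returns `fallback` if the
    // call throws or does not finish within `timeout`.
    public static async Task<T> RunWithFallbackAsync<T>(
        Func<T> blockingCall, TimeSpan timeout, T fallback)
    {
        Task<T> work = Task.Run(blockingCall);
        Task finished = await Task.WhenAny(work, Task.Delay(timeout));

        if (finished != work)
        {
            // Timed out. The blocking call keeps running on its thread, so
            // observe a late failure to avoid an unobserved task exception.
            _ = work.ContinueWith(
                t => { var ignored = t.Exception; },
                TaskContinuationOptions.OnlyOnFaulted);
            return fallback;
        }

        try
        {
            return await work;
        }
        catch (Exception)
        {
            return fallback;
        }
    }
}
```

Note that a timeout here only stops waiting; the thread pool thread stays blocked until the underlying call returns, so slow calls can still pile up threads.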

The entry point of the action looks like this:

[Route( "get-content" )]
[HttpPost]
public async Task<RemoteGetContentResult> GetContent( [FromBody]RemoteGetContentInput input )
{
    // input validation
    var c = Interlocked.Increment( ref _concurrency );
    try
    {
        // log value of _concurrency
        return await _provider.GetContentExAsync( input );
    }
    finally
    {
        Interlocked.Decrement( ref _concurrency );
    }
}

The logged concurrency level is usually 10-30, but it can reach 100 (when there are many external HTTP fetches).

This is the stack trace I see in the AWS ECS logs:

2019-07-10T06:22:39.554Z Unhandled Exception: System.Threading.Tasks.TaskSchedulerException: An exception was thrown by a TaskScheduler. ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
2019-07-10T06:22:39.554Z    at System.Threading.Thread.StartInternal()
2019-07-10T06:22:39.554Z    at System.Threading.Tasks.Task.ScheduleAndStart(Boolean needsProtection)
2019-07-10T06:22:39.554Z    --- End of inner exception stack trace ---
2019-07-10T06:22:39.554Z    at System.Threading.Tasks.Task.ScheduleAndStart(Boolean needsProtection)
2019-07-10T06:22:39.554Z    at System.Threading.Tasks.Task.InternalStartNew(Task creatingTask, Delegate action, Object state, CancellationToken cancellationToken, TaskScheduler scheduler, TaskCreationOptions options, InternalTaskOptions internalOptions)
2019-07-10T06:22:39.554Z    at System.Runtime.IOThreadScheduler.ScheduleCallbackHelper(SendOrPostCallback callback, Object state)
2019-07-10T06:22:39.554Z    at System.Runtime.IOThreadScheduler.ScheduleCallbackNoFlow(SendOrPostCallback callback, Object state)
2019-07-10T06:22:39.554Z    at System.Runtime.CompilerServices.YieldAwaitable.YieldAwaiter.System.Runtime.CompilerServices.IStateMachineBoxAwareAwaiter.AwaitUnsafeOnCompleted(IAsyncStateMachineBox box)
2019-07-10T06:22:39.554Z    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AwaitUnsafeOnCompleted[TAwaiter,TStateMachine](TAwaiter& awaiter, TStateMachine& stateMachine)
2019-07-10T06:22:39.554Z --- End of stack trace from previous location where exception was thrown ---
2019-07-10T06:22:39.554Z    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
2019-07-10T06:22:39.554Z --- End of stack trace from previous location where exception was thrown ---
2019-07-10T06:22:39.554Z    at System.Threading.ThreadPoolWorkQueue.Dispatch()

Update: I added some extra logging about the process every 5 seconds. At 18:30:16.741Z it logged:

2019-07-10T18:30:16.741Z concurrency:   4 proc thread cnt:   29 avail worker threads: 32,766 avail compl port threads:  1,000 ws: 1,733,996,544 peak ws:      0

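So the working set is about 1.7 GB out of 16 GB. (For some reason peak WS is always 0, but the largest value I have seen is 2,053,316,608 bytes.) Four seconds later, the OOM exception was thrown: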

2019-07-10T18:30:20.630Z Unhandled Exception: System.Threading.Tasks.TaskSchedulerException: An exception was thrown by a TaskScheduler. ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
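The periodic logging described in the update could be sketched roughly as follows. This is an illustration under assumptions rather than the author's actual code: ProcessStatsLogger and its members are hypothetical names, and the output format only approximates the log line above.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper; not part of the original application.
public static class ProcessStatsLogger
{
    // Builds one log line. currentConcurrency is whatever counter the
    // application maintains (for example the _concurrency field above).
    public static string Snapshot(int currentConcurrency)
    {
        ThreadPool.GetAvailableThreads(out int workerThreads, out int completionPortThreads);
        using (Process process = Process.GetCurrentProcess())
        {
            return string.Format(
                "concurrency: {0,3} proc thread cnt: {1,4} avail worker threads: {2,6:N0} " +
                "avail compl port threads: {3,6:N0} ws: {4:N0} peak ws: {5:N0}",
                currentConcurrency, process.Threads.Count, workerThreads,
                completionPortThreads, process.WorkingSet64, process.PeakWorkingSet64);
        }
    }

    // Writes a snapshot every 5 seconds. Keep the returned Timer referenced
    // (and dispose it on shutdown) so it is not garbage collected.
    public static Timer Start(Func<int> readConcurrency)
    {
        return new Timer(
            _ => Console.WriteLine(Snapshot(readConcurrency())),
            null, TimeSpan.Zero, TimeSpan.FromSeconds(5));
    }
}
```

These counters cover memory and threads only. They would not have shown the number of open file descriptors, which is what turned out to matter in the accepted answer.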

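【Comments】:

Is there a correlation between the concurrency level and the OOM exception?
@StephenCleary There is no clear correlation, but I can't say for sure. For example, the concurrency level at the last crash was about 25, but I have seen it handle around 100 without problems.
I'm adding code to log (every 5 seconds) the general concurrency level (using middleware), the working set, the peak working set, the thread count, and the available thread pool counts. Any other suggestions?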
Possibly relevant
This might also be fixed in .NET Core 3.0.

Tags: linux amazon-web-services docker .net-core async-await

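【Solution 1】:

It turned out that a library we use did not dispose its HttpClient after using it, which caused a socket leak.

We had been using this library on Windows for some time. Apparently the sockets are eventually closed by the finalizer there, but not on Linux.

I finally ran the application on a plain Linux machine, which made it much easier to monitor the OS. It turned out that this command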

$ lsof -p <PID>

returned thousands of lines like these

dotnet  15613 ec2-user  215u     sock                0,8      0t0  4968805 protocol: TCP
dotnet  15613 ec2-user  219u     sock                0,8      0t0  4968844 protocol: TCP
dotnet  15613 ec2-user  220u     sock                0,8      0t0  4968236 protocol: TCP
dotnet  15613 ec2-user  221u     sock                0,8      0t0  4968247 protocol: TCP
...
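A similar signal can be watched from inside the process. The sketch below assumes a Linux host where /proc is mounted and simply counts the entries under /proc/self/fd; FdCounter is a hypothetical name. It counts all open descriptors, not only sockets, so it is a coarser measure than the lsof output above.

```csharp
using System.IO;

// Hypothetical helper; not part of the original application.
public static class FdCounter
{
    // Returns the number of open file descriptors of the current process,
    // or -1 when /proc/self/fd is not available (for example on Windows).
    public static int CountOpenFileDescriptors()
    {
        const string fdDirectory = "/proc/self/fd";
        if (!Directory.Exists(fdDirectory))
        {
            return -1;
        }
        return Directory.GetFileSystemEntries(fdDirectory).Length;
    }
}
```

Logging this value alongside the other process statistics would show a steadily growing count if descriptors are leaking.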

Converting the HttpClient usage to a singleton solved the problem.
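The answer does not show the changed code, so the following is only a minimal sketch of what a shared HttpClient can look like. ExternalContentFetcher, FetchAsync and the 30 second timeout are assumptions made for illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical wrapper; the class and method names are illustrative only.
public static class ExternalContentFetcher
{
    // One HttpClient for the lifetime of the process. HttpClient is meant to
    // be reused, and request methods such as GetAsync are thread-safe.
    private static readonly HttpClient SharedClient = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30) // assumed value, tune per workload
    };

    public static async Task<string> FetchAsync(Uri url)
    {
        // Dispose the response, not the client, so the underlying connection
        // can be returned to the pool and reused.
        using (HttpResponseMessage response = await SharedClient.GetAsync(url))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

One trade-off of a long-lived singleton is that it may keep using connections after DNS records change. IHttpClientFactory, available since ASP.NET Core 2.1, is a possible alternative when that matters.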

【Discussion】: