[Paper Reading] In Pursuit of Optimal Storage Performance: Hardware/Software Co-Design with Dual-Mode SSD

1.Motivation and Contributions
- 1.1 Challenges facing today's hyperscale data centers
- 1.2 Contributions
2.Background
3.Dual-Mode SSD Overview
4.Software Architecture
- 4.1 Existing open-channel SSD software stack
- 4.2 User-space storage software stack
5.Advanced I/O Scheduling
- 5.1 Problem statement
- 5.2 Precise Priority-based I/O Scheduling
6.Evaluation

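1.Motivation and Contributions

1.1 Challenges facing today's hyperscale data centers

Standardized hardware (e.g. NVMe SSDs) leaves little room for customization and cannot adapt to the many different I/O patterns.
Traditionally, software and hardware are optimized separately. The software layer does not understand the hardware's internal mechanisms, and the hardware layer does not know the I/O patterns the software needs, which prevents deep optimization. For example, running RocksDB on a standard SSD leads to write amplification and redundant functionality.
Standardized hardware is a black box to the software layer, so software cannot control its I/O performance…
Standardized SSDs cannot meet current application requirements (or rather, our requirements change frequently).
Errors occur frequently in hyperscale data centers. To respond to errors faster, data centers want more control over I/O…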

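1.2 Contributions

At the hardware level, a dual-mode SSD is designed that supports both native NVMe mode and Open Channel mode.
At the software level, a full user-space storage software stack is developed, with a range of features and a customizable FTL to suit different I/O patterns.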

2.Background

This section mainly covers how NAND flash SSDs work and the Open-Channel SSD concept (see LOCS).

3.Dual-Mode SSD Overview


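Supports two modes: native NVMe and Open Channel.
The mode is switched by updating the firmware. Note that device metadata such as bad-block information must be preserved across the switch…
The controller itself is an in-house design and supports 3D TLC NAND from multiple vendors.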
Open channel command interface

4.Software Architecture

4.1 Existing open-channel SSD software stack

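LightNVM describes a similar architecture.

Overall, this structure depends heavily on the Linux kernel and kernel-mode drivers, which has several drawbacks (the motivation for implementing SPDK in user space makes the same points):

The I/O path is long and incurs high overhead (for example, context switches between kernel space and user space).
The kernel and kernel-mode drivers are more difficult to debug and update.
The kernel is more like a black box to applications (not transparent enough).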

4.2 User-space storage software stack

The paper introduces its own software stack. It uses SPDK's user-space NVMe driver and builds its own Dual-mode base library on top of it.
[Figure: overall framework of the user-space storage software stack]

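Dual-mode base library
It mainly provides the following functions:
1 Processing Open Channel commands
2 Error and status notification
3 Device discovery and identification
4 Common media management functionalities
5 Communication with other SPDK components
Block Device/Customized FTL
There is both a common block device FTL and a customizable FTL. The former also has custom features, such as the advanced I/O scheduling described in Section 5. Note that it is based on the OCSSD 1.2 specification.
Threading models
Two threading models are supported:
1 Run-to-Completion Model: minimizes I/O path overhead. The application is responsible for scheduling all I/O itself, so this model suits customized FTLs.
2 Pipeline Model: requires more resource overhead. This model is suitable for cases where the application does not want to handle background tasks itself. A good example is the generic block device FTL.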

5.Advanced I/O Scheduling

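5.1 Problem statement

To improve storage utilization, pooling is a common way to manage storage resources. In a storage pool, each SSD receives I/O requests from different application workloads, and different applications have different QoS requirements. I/O traffic is therefore divided into three classes by usage scenario:

critical: on the critical path. For critical read traffic, for example, read latency is the most important performance metric. Therefore, read requests from on-line services are often classified as critical.
high priority: for example, requests from off-line data processing services are often labeled as high-priority.
low priority: I/O traffic from background tasks is often treated as low-priority, for example RocksDB's background compaction.

A conventional NVMe SSD lacks a way to pass the priority of an individual I/O to the controller… The paper therefore proposes its own I/O scheduling methods to support differentiated QoS…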

5.2 Precise Priority-based I/O Scheduling

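PU-based Read I/O Scheduling (PRS)
This relies on the fact that, in Open Channel mode, the host can know the state (busy/idle) of a specific PU, and that every read must first be resolved to a specific PU. Continuing to dispatch requests to a busy PU clearly hurts I/O concurrency. Each PU has its own host-side read queue, and a PU Map tracks the number of active read requests dispatched to the SSD. If the number of active read requests exceeds a threshold (testing showed that 2 works well), the PU is considered busy…
[Figure: logic block diagram of PRS]

Read requests are scheduled according to the following rules:

If the PU is idle, a new read request is dispatched directly to the PU on the SSD. If the PU is busy, the new read request is put into the host-side per-PU read queue.

When a read request is enqueued to the per-PU queue, it is sorted by priority. Requests of the same priority are served in FIFO order.

When a read request completes, if the per-PU queue is not empty, the host driver dispatches the request at the front of the per-PU queue.
A time-out mechanism is employed: if a request goes unserved for a long time, its priority is changed to critical…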
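
The rules above can be sketched as a small host-side scheduler. This is only an illustration, not the paper's implementation: the class name `PuReadScheduler`, the `dispatch` callback, the numeric priority encoding, and the `timeout` value are all assumptions made for this example. "Busy" is taken literally from the notes as "active count exceeds the threshold".

```python
import heapq
import itertools
from collections import defaultdict

# Assumed encoding: a smaller number means a more urgent request.
CRITICAL, HIGH, LOW = 0, 1, 2


class PuReadScheduler:
    """Per-PU host-side read queues plus a PU map of in-flight reads."""

    def __init__(self, dispatch, threshold=2, timeout=1.0):
        self.dispatch = dispatch          # hypothetical callback: send a read to the SSD
        self.threshold = threshold        # a PU is busy when active reads exceed this
        self.timeout = timeout            # hypothetical wait time before promotion
        self.active = defaultdict(int)    # PU map: pu -> active read count
        self.queues = defaultdict(list)   # pu -> heap of (priority, seq, enqueue_time, request)
        self._seq = itertools.count()     # tie-breaker: FIFO within one priority

    def is_busy(self, pu):
        return self.active[pu] > self.threshold

    def submit(self, pu, request, priority, now):
        """Dispatch directly to an idle PU, otherwise enqueue by priority."""
        if not self.is_busy(pu):
            self._send(pu, request)
        else:
            heapq.heappush(self.queues[pu], (priority, next(self._seq), now, request))

    def complete(self, pu, now):
        """On completion, dispatch the front request of the per-PU queue, if any."""
        self.active[pu] -= 1
        self._promote_stale(pu, now)
        if self.queues[pu]:
            _, _, _, request = heapq.heappop(self.queues[pu])
            self._send(pu, request)

    def _promote_stale(self, pu, now):
        # Time-out mechanism: requests that waited too long become critical.
        promoted = [
            (CRITICAL if now - enq >= self.timeout else prio, seq, enq, req)
            for prio, seq, enq, req in self.queues[pu]
        ]
        heapq.heapify(promoted)
        self.queues[pu] = promoted

    def _send(self, pu, request):
        self.active[pu] += 1
        self.dispatch(pu, request)
```

A caller would invoke `submit` for every incoming read and `complete` from the completion path. Keeping the sequence number in the heap key is what gives FIFO order among requests of equal priority, including promoted ones.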

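PU-based Read Throttling (PRT)

Problem: low-priority read requests are sometimes much larger in I/O size than many higher-priority requests, which delays the processing of the higher-priority requests.
Solution: limit the maximum number of active read requests for each priority, thereby controlling the bandwidth allocated to low-priority requests.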
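
A minimal sketch of such a per-priority cap follows. The class name `ReadThrottle` and the limit values are hypothetical; the notes do not give the actual limits used in the paper.

```python
from collections import Counter

# Hypothetical caps on in-flight reads per priority class.
DEFAULT_LIMITS = {"critical": 8, "high": 4, "low": 1}


class ReadThrottle:
    """Caps the number of active read requests for each priority."""

    def __init__(self, limits=None):
        self.limits = dict(limits or DEFAULT_LIMITS)
        self.active = Counter()

    def try_acquire(self, priority):
        """Return True and count the read as active if its class is under its cap."""
        if self.active[priority] >= self.limits[priority]:
            return False
        self.active[priority] += 1
        return True

    def release(self, priority):
        """Call when a read of this priority completes."""
        if self.active[priority] == 0:
            raise ValueError(f"no active {priority} reads to release")
        self.active[priority] -= 1
```

A scheduler would call `try_acquire` before dispatching a read and keep the request queued when it returns `False`, so a large low-priority read can occupy at most its own small share of in-flight slots.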

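Priority-based Write Scheduling (PWS)
Writes are also scheduled by priority…
Unlike reads, write requests are first cached in a write buffer to achieve low latency. Once the data is in the buffer, the host considers the write complete… The SSD controller then periodically flushes the data to the PUs in the background.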
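
One way to picture the buffer-then-flush behaviour is the sketch below. It assumes that the background flush drains the buffer in priority order, which the notes do not spell out, and the names `PriorityWriteBuffer` and `flush_to_pu` are hypothetical.

```python
import heapq
import itertools


class PriorityWriteBuffer:
    """Write buffer that acknowledges on buffering and flushes by priority."""

    def __init__(self, capacity_bytes, flush_to_pu):
        self.capacity = capacity_bytes
        self.flush_to_pu = flush_to_pu   # hypothetical callback that programs data to a PU
        self.used = 0
        self._heap = []                  # entries are (priority, seq, data)
        self._seq = itertools.count()    # FIFO order within one priority

    def write(self, data, priority):
        """Buffer the write; True means the host can treat it as complete."""
        if self.used + len(data) > self.capacity:
            return False                 # buffer full: caller must wait for a flush
        heapq.heappush(self._heap, (priority, next(self._seq), data))
        self.used += len(data)
        return True

    def flush(self, max_bytes):
        """Background step: drain up to max_bytes, most urgent priority first."""
        flushed = 0
        while self._heap and flushed + len(self._heap[0][2]) <= max_bytes:
            _, _, data = heapq.heappop(self._heap)
            self.flush_to_pu(data)
            self.used -= len(data)
            flushed += len(data)
        return flushed
```

Here a smaller `priority` number is more urgent. The low write latency comes from `write` returning as soon as the data is buffered, while `flush` would be driven by a periodic background task.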
Adaptive Write Throttling (AWT)

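Goal: as with reads, ensure that high-priority requests have enough bandwidth…
Solution: time is divided into fixed intervals. Given a target write bandwidth, the total amount of data that may be written per interval can be computed. Within each interval, the amount of data written at each priority level is also measured. The write quota for low-priority writes is the difference between that total and the amount written by high-priority requests. The quota is applied to low-priority writes in the next interval, so that the total write bandwidth stays below the target.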
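
In other words, the low-priority quota for the next interval is (target bandwidth × interval length) minus the high-priority bytes written in the current interval. The sketch below illustrates that bookkeeping. The class name, the clamping of the quota at zero, and the choice to start the first interval with the full budget are assumptions made for this example.

```python
class AdaptiveWriteThrottle:
    """Per-interval write quota for low-priority writes."""

    def __init__(self, target_bw_bytes_per_s, interval_s):
        # Total bytes that may be written in one interval.
        self.budget = int(target_bw_bytes_per_s * interval_s)
        self.high_written = 0          # high-priority bytes in the current interval
        self.low_written = 0           # low-priority bytes in the current interval
        self.low_quota = self.budget   # assumption: no history yet, so full budget

    def admit(self, nbytes, low_priority):
        """High-priority writes always pass; low-priority writes consume quota."""
        if not low_priority:
            self.high_written += nbytes
            return True
        if self.low_written + nbytes > self.low_quota:
            return False
        self.low_written += nbytes
        return True

    def end_interval(self):
        """Compute the next interval's low-priority quota and reset the counters."""
        self.low_quota = max(0, self.budget - self.high_written)
        self.high_written = 0
        self.low_written = 0
        return self.low_quota
```

Because the quota is derived from the previous interval's high-priority traffic, the throttle adapts with one interval of lag: a burst of high-priority writes shrinks the low-priority share in the following interval.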

Coordinated GC Scheduling
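Garbage collection generates extra reads and writes that compete for bandwidth with application I/O, which is undesirable. In practice there are burst-write phases, and the goal is to avoid garbage collection during them so that system performance under burst writes improves…
The following mechanisms are therefore introduced:
1 Read/write requests from GC are typically tagged with Medium priority, in order to balance between high-priority reads and GC needs.
2 In an emergency (free space below a critical threshold, which is rare), garbage collection is temporarily prioritized.
3 GC can be disabled if there are burst writes.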
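
The three mechanisms can be read as a small decision function, sketched below. The names, the `critical_free_ratio` default of 5%, and the choice to let the emergency case override burst-write detection are assumptions for illustration; the notes do not state how the two conditions interact.

```python
from enum import Enum


class GcAction(Enum):
    DISABLED = "disabled"   # skip GC during burst writes
    MEDIUM = "medium"       # normal case: GC I/O tagged with medium priority
    URGENT = "urgent"       # emergency: GC is temporarily prioritized


def gc_decision(free_ratio, burst_write, critical_free_ratio=0.05):
    """Pick how GC should run given free space and the current write pattern."""
    if free_ratio < critical_free_ratio:
        # Assumed precedence: running out of space outranks burst-write avoidance.
        return GcAction.URGENT
    if burst_write:
        return GcAction.DISABLED
    return GcAction.MEDIUM
```

For example, `gc_decision(0.5, burst_write=True)` returns `GcAction.DISABLED`, while `gc_decision(0.01, burst_write=True)` returns `GcAction.URGENT`.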

6.Evaluation

…Not covered in these notes.

Reference: In Pursuit of Optimal Storage Performance: Hardware/Software Co-Design with Dual-Mode SSD
