Paper download: link
UNET 3+: A FULL-SCALE CONNECTED UNET FOR MEDICAL IMAGE SEGMENTATION
ABSTRACT
Recently, a growing interest has been seen in deep learning-based semantic segmentation. UNet, one of the deep learning networks with an encoder-decoder architecture, is widely used in medical image segmentation. Combining multi-scale features is one of the important factors for accurate segmentation. UNet++ was developed as a modified UNet by designing an architecture with nested and dense skip connections. However, it does not explore sufficient information from full scales and there is still large room for improvement. In this paper, we propose a novel UNet 3+, which takes advantage of full-scale skip connections and deep supervision. The full-scale skip connections incorporate low-level details with high-level semantics from feature maps at different scales, while the deep supervision learns hierarchical representations from the full-scale aggregated feature maps. The proposed method is especially beneficial for organs that appear at varying scales. In addition to accuracy improvements, the proposed UNet 3+ can reduce the network parameters to improve computation efficiency. We further propose a hybrid loss function and devise a classification-guided module to enhance the organ boundary and reduce over-segmentation in non-organ images, yielding more accurate segmentation results. The effectiveness of the proposed method is demonstrated on two datasets. The code is available at: ZJUGiveLab/UNet-Version
Index Terms—Segmentation, Full-scale skip connection, Deep supervision, Hybrid loss function, Classification.
1. INTRODUCTION
Automatic organ segmentation in medical images is a critical step in many clinical applications. Recently, convolutional neural networks (CNNs) have greatly promoted the development of a variety of segmentation models, e.g. fully convolutional networks (FCNs) [1], UNet [2], PSPNet [3] and the series of DeepLab versions [4-6]. In particular, UNet, which is based on an encoder-decoder architecture, is widely used in medical image segmentation. It uses skip connections to combine the high-level semantic feature maps from the decoder with the corresponding low-level detailed feature maps from the encoder. To reduce the fusion of semantically dissimilar features caused by the plain skip connections in UNet, UNet++ [7] further strengthened these connections by introducing nested and dense skip connections, aiming at reducing the semantic gap between the encoder and decoder. Despite achieving good performance, this type of approach is still incapable of exploring sufficient information from full scales.
As witnessed in many segmentation studies [1-7], feature maps at different scales explore distinctive information. Low-level detailed feature maps capture rich spatial information, which highlights the boundaries of organs, while high-level semantic feature maps embody position information, which locates where the organs are. Nevertheless, these exquisite signals may be gradually diluted during progressive down- and up-sampling. To make full use of the multi-scale features, we propose a novel U-shape-based architecture, named UNet 3+, in which we re-design the inter-connection between the encoder and the decoder as well as the intra-connection between the decoders to capture fine-grained details and coarse-grained semantics from full scales. To further learn hierarchical representations from the full-scale aggregated feature maps, each side output is connected with a hybrid loss function, which contributes to accurate segmentation, especially for organs that appear at varying scales in the medical image volume. In addition to accuracy improvements, we also show that the proposed UNet 3+ can reduce the network parameters to improve computation efficiency.
To address the demand for more accurate segmentation in medical images, we further investigate how to effectively reduce the false positives in non-organ images. Existing methods solve the problem by introducing attention mechanisms [8] or conducting a pre-defined refinement approach such as CRF [4] at inference. Different from these methods, we extend a classification task to predict whether the input image has an organ, providing guidance to the segmentation task.
In summary, our main contributions are four-fold: (i) devising a novel UNet 3+ to make full use of multi-scale features by introducing full-scale skip connections, which incorporate low-level details with high-level semantics from feature maps at full scales, but with fewer parameters; (ii) developing a deep supervision to learn hierarchical representations from the full-scale aggregated feature maps, which optimizes a hybrid loss function to enhance the organ boundary; (iii) proposing a classification-guided module to reduce over-segmentation on non-organ images by jointly training with an image-level classification; (iv) conducting extensive experiments on liver and spleen datasets, where UNet 3+ yields consistent improvements over a number of baselines.
2. METHODS
Fig. 1 gives simplified overviews of UNet, UNet++ and the proposed UNet 3+. Compared with UNet and UNet++, UNet 3+ combines the multi-scale features by re-designing skip connections as well as utilizing a full-scale deep supervision, which provides fewer parameters but yields a more accurate position-aware and boundary-enhanced segmentation map.
2.1. Full-scale Skip Connections
The proposed full-scale skip connections convert the inter-connection between the encoder and decoder as well as the intra-connection between the decoder sub-networks. Both UNet with plain connections and UNet++ with nested and dense connections fall short of exploring sufficient information from full scales, failing to explicitly learn the position and boundary of an organ. To remedy this defect in UNet and UNet++, each decoder layer in UNet 3+ incorporates both smaller- and same-scale feature maps from the encoder and larger-scale feature maps from the decoder, capturing fine-grained details and coarse-grained semantics at full scales.
As an example, Fig. 2 illustrates how to construct the feature map of X_De^3. Similar to UNet, the feature map from the same-scale encoder layer X_En^3 is directly received in the decoder. In contrast to UNet, a set of inter encoder-decoder skip connections delivers the low-level detailed information from the smaller-scale encoder layers X_En^1 and X_En^2, by applying non-overlapping max pooling operations; while a chain of intra decoder skip connections transmits the high-level semantic information from the larger-scale decoder layers X_De^4 and X_De^5, by utilizing bilinear interpolation. With the five same-resolution feature maps in hand, we need to further unify the number of channels, as well as reduce the superfluous information. We found that convolution with 64 filters of size 3 × 3 is a satisfying choice. To seamlessly merge the shallow exquisite information with the deep semantic information, we further perform a feature aggregation mechanism on the concatenated feature map from the five scales, which consists of 320 filters of size 3 × 3, a batch normalization and a ReLU activation function. Formally, we formulate the skip connections as follows: let i index the down-sampling layer along the encoder, and N refer to the total number of encoder stages. The stack of feature maps represented by X_De^i is computed as:

X_De^i = X_En^i,  if i = N
X_De^i = H([C(D(X_En^k))_{k=1}^{i-1}, C(X_En^i), C(U(X_De^k))_{k=i+1}^{N}]),  if i = 1, …, N−1
For example, Fig. 2 shows how the feature map X_De^3 is constructed. As in UNet, the feature map from the same-scale encoder layer X_En^3 is received directly. The difference is that there is more than one skip connection. The two connections from the encoder side down-sample the smaller-scale layers X_En^1 and X_En^2 with max pooling, so as to pass on their low-level detailed information. The pooling is needed to unify the feature-map resolutions: as the figure shows, X_En^1 must be shrunk by a factor of 4 and X_En^2 by a factor of 2. The other two skip connections up-sample the decoder layers X_De^4 and X_De^5 (which equals X_En^5) by bilinear interpolation to enlarge their resolution: X_De^5 is enlarged 4× and X_De^4 2×. After the resolutions are unified, the maps still cannot be combined directly; the number of channels must also be unified to cut down superfluous information. The authors found that convolving with 64 filters of size 3 × 3 works well, each branch then yielding a 64-channel feature map (the number of convolution kernels equals the number of output feature maps).

Once the resolution and channel count of the feature maps agree, the shallow fine-grained information can be fused with the deep semantic information. Feature fusion is generally done in one of two ways, FCN-style element-wise addition or U-Net-style channel-wise concatenation; this paper uses the latter. Concatenating the five scales yields 320 feature maps of the same resolution, which are then convolved with 320 filters of size 3 × 3 and passed through BN + ReLU to obtain X_De^3.

Formally, the full-scale skip connections can be expressed as follows: i denotes the i-th down-sampling layer along the encoder and N is the number of encoder stages (5 in the paper). The feature map X_De^i is computed as:

X_De^i = X_En^i,  if i = N
X_De^i = H([C(D(X_En^k))_{k=1}^{i-1}, C(X_En^i), C(U(X_De^k))_{k=i+1}^{N}]),  if i = 1, …, N−1

where C denotes a convolution operation, H the feature aggregation mechanism (a convolution layer + BN + ReLU), D and U the down- and up-sampling operations respectively, and [·] channel-wise concatenation.
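To make the wiring concrete, here is a minimal NumPy shape sketch of how X_De^3 is assembled, assuming a 160×160 input. Nearest-neighbour repetition stands in for bilinear interpolation, and `conv_channels` is a stub that only models the channel change of the 3×3 convolutions (it returns zeros, so only the shapes are meaningful):

```python
import numpy as np

def conv_channels(x, out_ch):
    # stand-in for a 3x3 conv: only models the channel change (N, C, H, W) -> (N, out_ch, H, W)
    n, _, h, w = x.shape
    return np.zeros((n, out_ch, h, w), dtype=x.dtype)

def down(x, factor):
    # non-overlapping max pooling
    n, c, h, w = x.shape
    return x.reshape(n, c, h // factor, factor, w // factor, factor).max(axis=(3, 5))

def up(x, factor):
    # nearest-neighbour stand-in for bilinear interpolation
    return x.repeat(factor, axis=2).repeat(factor, axis=3)

def decoder_stage3(enc, dec):
    # enc[i] is X_En^{i+1}, dec[i] is X_De^{i+1}; build X_De^3 at the scale of X_En^3
    parts = [
        conv_channels(down(enc[0], 4), 64),   # X_En^1, pooled 4x
        conv_channels(down(enc[1], 2), 64),   # X_En^2, pooled 2x
        conv_channels(enc[2], 64),            # X_En^3, same scale
        conv_channels(up(dec[3], 2), 64),     # X_De^4, upsampled 2x
        conv_channels(up(dec[4], 4), 64),     # X_De^5 (= X_En^5), upsampled 4x
    ]
    fused = np.concatenate(parts, axis=1)     # 5 * 64 = 320 channels
    return conv_channels(fused, 320)          # aggregation conv (+ BN + ReLU in the real net)
```

Running it with encoder depths 64, 128, …, 1024 confirms that all five branches land at 40×40 and the output is a 320-channel map.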
It is also worth mentioning that UNet 3+ has fewer parameters than UNet and UNet++; the reasoning is as follows. All three share the same encoder structure, whose channel depth doubles at each stage (64, 128, 256, 512, 1024 for five stages), so the encoder parameters are identical. UNet's decoder mirrors its encoder, so each X_De^i has the same channel depth as X_En^i, and the parameter count of the i-th decoder stage is determined by the depths of the up-sampled X_De^{i+1}, the skipped X_En^i and the output X_De^i (K denotes the kernel size in the paper's formula). UNet++ additionally uses a dense conv block on every skip path, so, as one can see, its parameter count is even larger than UNet's. In UNet 3+, by contrast, every decoder stage is formed from N scale connections, each reduced to 64 channels, producing 64 × N = 320 channels at every stage.

This reduction of the decoder channel depth is what makes UNet 3+ lighter than both UNet and UNet++.

(PS: I can't quite work out how these formulas operate; if you know, please explain in the comments below.)
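As a rough sanity check on the parameter claim, the sketch below counts only one 3×3 fusion convolution per decoder stage (ignoring biases, BN, the per-branch channel-reduction convolutions and everything else, which is a deliberate simplification): the doubling decoder depths of UNet already outweigh the constant 320-channel decoders of UNet 3+.

```python
K = 3  # kernel size

def conv_params(c_in, c_out, k=K):
    # parameters of a single k x k convolution, ignoring biases
    return c_in * c_out * k * k

enc = [64, 128, 256, 512, 1024]  # encoder channel depths, doubling per stage

# UNet decoder stage i: input = upsampled deeper decoder map + skipped encoder map,
# output depth mirrors the encoder
unet = sum(conv_params(enc[i] + enc[i + 1], enc[i]) for i in range(4))

# UNet 3+ decoder stage: five 64-channel branches concatenated -> 320 -> 320
unet3p = sum(conv_params(5 * 64, 320) for _ in range(4))

print(unet, unet3p)  # the UNet count is larger
```

Even under these crude assumptions the gap is clear; the real comparison in the paper accounts for all convolutions per stage.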
2.2. Full-scale Deep Supervision
In order to learn hierarchical representations from the full-scale aggregated feature maps, full-scale deep supervision is further adopted in UNet 3+. Compared with the deep supervision performed on the generated full-resolution feature maps in UNet++, the proposed UNet 3+ yields a side output from each decoder stage, which is supervised by the ground truth. To realize deep supervision, the last layer of each decoder stage is fed into a plain 3 × 3 convolution layer followed by a bilinear up-sampling and a sigmoid function.
UNet++ also uses deep supervision: it supervises the generated full-resolution feature maps (full resolution = the resolution of the final segmentation map). Concretely, a 1×1 convolution is attached after X^{0,1}, X^{0,2}, X^{0,3} and X^{0,4}, which amounts to supervising each level, in other words the output of each nested UNet branch.
(The bilinear up-sampling here enlarges each side output back to full resolution.)
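Each side-output head (convolution, up-sampling, sigmoid) can be sketched as follows. For brevity a 1×1 convolution, represented by the weight vector `w`, stands in for the paper's 3×3 convolution, and nearest-neighbour repetition stands in for bilinear up-sampling:

```python
import numpy as np

def side_output(decoder_feat, up_factor, w):
    # decoder_feat: (C, H, W) feature map of one decoder stage
    # w: (C,) weights of a 1-channel 1x1 conv (stand-in for the paper's 3x3 conv)
    logit = np.tensordot(w, decoder_feat, axes=1)             # (H, W) logit map
    logit = logit.repeat(up_factor, 0).repeat(up_factor, 1)   # nearest stand-in for bilinear
    return 1 / (1 + np.exp(-logit))                           # sigmoid -> probability map
```

Each such probability map is then compared against the ground truth by the hybrid loss.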
To further enhance the boundary of organs, we propose a multi-scale structural similarity index (MS-SSIM) [9] loss function that assigns higher weights to the fuzzy boundary. Benefiting from it, UNet 3+ keeps an eye on fuzzy boundaries, since the greater the regional distribution difference, the higher the MS-SSIM value. Two corresponding N×N-sized patches are cropped from the segmentation result P and the ground-truth mask G, denoted as p = {p_j : j = 1, …, N²} and g = {g_j : j = 1, …, N²}, respectively. The MS-SSIM loss function of p and g is defined as:

ℓ_ms-ssim = 1 − ∏_{m=1}^{M} ((2 μ_p μ_g + C1) / (μ_p² + μ_g² + C1))^{β_m} · ((2 σ_pg + C2) / (σ_p² + σ_g² + C2))^{γ_m}
where M is the total number of scales; μ_p, μ_g and σ_p², σ_g² are the means and variances of p and g, and σ_pg denotes their covariance. β_m and γ_m define the relative importance of the two components at each scale; their settings can follow [9]. The two small constants C1 and C2 avoid division by zero. In the paper, the number of scales is set to 5 (consistent with UNet and UNet++), based on [9].
By combining focal loss [10], MS-SSIM loss and IoU loss [11], we develop a hybrid loss for segmentation in a three-level hierarchy (pixel-, patch- and map-level), which is able to capture both large-scale and fine structures with clear boundaries. The hybrid segmentation loss is defined as:

ℓ_seg = ℓ_fl + ℓ_ms-ssim + ℓ_iou
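The pixel-level focal term and the map-level IoU term can be sketched as below (un-weighted focal loss without the α balancing factor, a simplifying assumption); `hybrid_loss` simply sums the three terms, taking the patch-level MS-SSIM value as a precomputed argument:

```python
import numpy as np

def focal_loss(p, g, gamma=2.0, eps=1e-7):
    # pixel-level term: down-weights easy, well-classified pixels
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(g * (1 - p) ** gamma * np.log(p)
                    + (1 - g) * p ** gamma * np.log(1 - p))

def iou_loss(p, g, eps=1e-7):
    # map-level term: 1 - soft intersection-over-union
    inter = (p * g).sum()
    union = p.sum() + g.sum() - inter
    return 1 - (inter + eps) / (union + eps)

def hybrid_loss(p, g, ms_ssim_term):
    # ms_ssim_term: precomputed patch-level MS-SSIM loss value
    return focal_loss(p, g) + ms_ssim_term + iou_loss(p, g)
```

A perfect prediction drives both terms to zero, while boundary blur is penalized mainly through the patch-level MS-SSIM term.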
2.3. Classification-guided Module (CGM)
In most medical image segmentation tasks, the appearance of false positives in a non-organ image is inevitable. In all probability it is caused by noisy background information remaining in the shallower layers, leading to the phenomenon of over-segmentation. To achieve more accurate segmentation, we attempt to solve this problem by adding an extra classification task, designed to predict whether the input image has the organ or not.
As depicted in Fig. 3, after passing through a series of operations including dropout, convolution, max pooling and sigmoid, a 2-dimensional tensor is produced from the deepest level, each element of which represents the probability of the image containing organs or not. Benefiting from the richest semantic information, the classification result can further guide each segmentation side-output in two steps. First, with the help of the argmax function, the 2-dimensional tensor is transformed into a single output in {0, 1}, which denotes with/without organs. Subsequently, we multiply this single classification output with the side segmentation output. Owing to the simplicity of the binary classification task, the module effortlessly achieves accurate classification results under the optimization of the binary cross-entropy loss function [12], which realizes the guidance for remedying the drawback of over-segmentation on non-organ images.
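The two guidance steps can be sketched as follows. The paper's dropout, 1×1 convolution and pooling pipeline is collapsed here into a global max pool plus one linear layer (`w`, `b`), an assumption for brevity, and index 1 is taken to mean "organ present":

```python
import numpy as np

def classification_guided_module(deepest_feat, side_outputs, w, b):
    # deepest_feat: (C, H, W) feature map from the deepest level
    # w: (2, C), b: (2,) parameters of a stand-in linear classification head
    pooled = deepest_feat.max(axis=(1, 2))   # global max pooling -> (C,)
    logits = w @ pooled + b                  # (2,) scores: [no organ, organ]
    probs = 1 / (1 + np.exp(-logits))        # sigmoid
    cls = np.argmax(probs)                   # step 1: argmax -> {0, 1}
    return [cls * s for s in side_outputs]   # step 2: multiply with each side output
```

When the classifier says "no organ" (cls = 0) every side-output mask is zeroed out, which is exactly how the module suppresses over-segmentation on non-organ images.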
3. EXPERIMENTS AND RESULTS
(Omitted.)