Part 1: Paper and code
Paper: https://pjreddie.com/media/files/papers/YOLOv3.pdf
Translation (Chinese): https://zhuanlan.zhihu.com/p/34945787
Code: https://github.com/pjreddie/darknet
Training and testing: https://pjreddie.com/darknet/yolo/
Part 2: How to train on your own data
Note: this article assumes Linux and the author's official darknet code.
Training on your own data breaks down into the following steps:
(0) Dataset preparation: this is straightforward; if you are unsure, see my earlier article on building a VOC-format dataset.
(1) The .data and .names files: the changes are trivial; just follow the official site.
(2) The cfg file: taking 6-class detection as an example, the main adjustments are shown below (the stock default values are kept after the ### markers):
[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=8
......
[convolutional]
size=1
stride=1
pad=1
filters=33###75
activation=linear
[yolo]
mask = 6,7,8
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=6###20
num=9
jitter=.3
ignore_thresh = .5
truth_thresh = 1
random=0###1
......
[convolutional]
size=1
stride=1
pad=1
filters=33###75
activation=linear
[yolo]
mask = 3,4,5
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=6###20
num=9
jitter=.3
ignore_thresh = .5
truth_thresh = 1
random=0###1
......
[convolutional]
size=1
stride=1
pad=1
filters=33###75
activation=linear
[yolo]
mask = 0,1,2
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=6###20
num=9
jitter=.3
ignore_thresh = .5
truth_thresh = 1
random=0###1
A. How filters is computed: 3 x (classes + 5), where 3 is the number of anchors assigned to that scale; the relation to the clustered anchor distribution is explained in the paper.
B. To change the default anchors, run k-means on your own labels.
C. If GPU memory is tight, set random=0 to disable multi-scale training.
D. The remaining parameters are basically the same as in v2 and are not repeated here.
E. The loss is large for the first ~100 iterations, then converges quickly.
F. Model testing: accuracy is good; speed is slightly slower than v2.
G. About the training log:
To answer a common question about whether the training-time statistics look normal (the trends of Avg IOU, Class, .5R, .75R, etc.), I retrained today and captured the log of the first 200 iterations for reference.
Download: https://download.csdn.net/download/lilai619/10317560 (if the link does not open, it is still under review; try again later)
H. If training still fails or you have other questions, see Part 3 or search online.
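Note B above mentions k-means. As a minimal sketch (not darknet's scripts; function names are mine), custom anchors can be clustered from your labels' (width, height) pairs using 1 - IoU as the distance, as proposed in the YOLOv2 paper:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between boxes and centroids, comparing shapes only (both anchored at the origin)."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = boxes[:, 0:1] * boxes[:, 1:2] + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs with 1 - IoU as the distance metric."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # nearest = highest IoU
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i) else centroids[i]
                        for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]  # smallest anchors first
```

For YOLOv3 the (w, h) pairs should be scaled to the network input size (e.g. 416x416) before clustering, and the sorted result pasted into every anchors= line of the cfg.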
Part 3: YOLOv3 FAQ
Tips 0: out-of-memory and resizing errors
GPU memory is insufficient: reduce batch (or increase subdivisions) and disable multi-scale training (random=0).
Tips 1: The loss is huge in the first 100 iterations and looks like it is diverging?
In my case the targets are small, so the loss started out very large but converged quickly afterwards. Tune the learning rate to fit your own situation.
Tips 2: What does mask do in YOLOv3?
See issues #558 and #567:
Every layer has to know about all of the anchor boxes but is only predicting some subset of them. This could probably be named something better but the mask tells the layer which of the bounding boxes it is responsible for predicting. The first yolo layer predicts 6,7,8 because those are the largest boxes and it's at the coarsest scale. The 2nd yolo layer predicts some smallers ones, etc.
The layer assumes if it isn't passed a mask that it is responsible for all the bounding boxes, hence the if-statement thing.
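The mask mechanism can be sketched in a few lines; the anchor list below is the one from the cfg above:

```python
# All 9 anchors from the cfg, as (w, h) pairs in network-input pixels.
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]

def layer_anchors(mask):
    """Each [yolo] layer only predicts the anchors whose indices appear in its mask=."""
    return [ANCHORS[i] for i in mask]

# The coarsest scale takes the largest boxes, the finest scale the smallest:
print(layer_anchors([6, 7, 8]))  # [(116, 90), (156, 198), (373, 326)]
print(layer_anchors([0, 1, 2]))  # [(10, 13), (16, 30), (33, 23)]
```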
Tips 3: What does num do in YOLOv3?
See issue #567:
num is 9 but each yolo layer is only actually looking at 3 (that's what the mask thing does). So it's (20+1+4)*3 = 75. If you use a different number of anchors you have to figure out which layer you want to predict which anchors, and the number of filters will depend on that distribution.
According to the paper, each yolo (detection) layer gets the 3 anchors associated with its scale; mask is the list of selected anchor indices.
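The arithmetic in the quote can be written out directly (the function name is mine, not darknet's):

```python
def yolo_filters(classes, coords=4, masks_per_layer=3):
    """filters = (classes + objectness + box coords) * anchors handled by this layer."""
    return (classes + 1 + coords) * masks_per_layer

print(yolo_filters(20))  # 75  -> the VOC default in the stock cfg
print(yolo_filters(6))   # 33  -> the 6-class example above
print(yolo_filters(80))  # 255 -> COCO
```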
Tips 4: Training produces nan values?
See issue #566:
You must be training on a lot of small objects! nan's appear when there are no objects in a batch of images since i definitely divide by zero. For example, Avg IOU is the sum of IOUs for all objects at that level / # of objects, if that is zero you get nan. I could probably change this so it just does a check for zero 1st, just wasn't a priority.
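The divide-by-zero described above is easy to see in miniature; a sketch with names of my own choosing, not darknet's:

```python
import math

def avg_iou(iou_sum, count):
    """Mimics the unguarded statistic: sum of IoUs / number of matched objects.
    When a batch contains no objects, count is 0 and the C code produces nan."""
    return iou_sum / count if count != 0 else float('nan')

def avg_iou_guarded(iou_sum, count):
    """The zero-check the author mentions: report 0 instead of nan for empty batches."""
    return iou_sum / count if count != 0 else 0.0

print(avg_iou(0.0, 0))          # nan -> what appears in the log
print(avg_iou_guarded(0.0, 0))  # 0.0
```

Occasional nan statistics from object-free batches are therefore harmless; nan in the overall loss is what actually signals divergence.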
Tips 5: What are anchor boxes for?
See issue #568:
Here's a quick explanation based on what I understand (which might be wrong but hopefully gets the gist of it). After doing some clustering studies on ground truth labels, it turns out that most bounding boxes have certain height-width ratios. So instead of directly predicting a bounding box, YOLOv2 (and v3) predict off-sets from a predetermined set of boxes with particular height-width ratios - those predetermined set of boxes are the anchor boxes.
Anchors are initial sizes (width, height) some of which (the closest to the object size) will be resized to the object size - using some outputs from the neural network (final feature map):
(lines 88 to 89 at commit 6f6e475)
x[...] - outputs of the neural network
biases[...] - anchors
b.w and b.h - resulting width and height of the bounding box shown on the result image
Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.
In Yolo v3 anchors (width, height) - are sizes of objects on the image that resized to the network size (width= and height= in the cfg-file).
In Yolo v2 anchors (width, height) - are sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).
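The referenced decode step (get_yolo_box() in yolo_layer.c) can be sketched in Python. This is an illustration, not the actual C code: in darknet the logistic activation of the x/y outputs happens earlier in the forward pass, but here it is folded into the decode.

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, grid_w, grid_h, net_w, net_h):
    """Sketch of the YOLOv3 box decode: the network output only shifts the cell
    offset and rescales the nearest anchor; all results are normalized to [0, 1]."""
    bx = (cx + 1 / (1 + math.exp(-tx))) / grid_w   # sigmoid keeps the center inside its cell
    by = (cy + 1 / (1 + math.exp(-ty))) / grid_h
    bw = anchor_w * math.exp(tw) / net_w           # v3 anchors are in network-input pixels
    bh = anchor_h * math.exp(th) / net_h
    return bx, by, bw, bh

# With zero offsets the predicted box is exactly the anchor, centered in its cell:
bx, by, bw, bh = decode_box(0, 0, 0, 0, 6, 6, 13, 13, 116, 90, 416, 416)
```

With tw = th = 0 the box width is anchor_w / net_w, i.e. the network "should not predict the final size of the object, but only adjust the size of the nearest anchor", as the quote says.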
Tips 6: Why do the anchor values differ so much between YOLOv2 and YOLOv3?
See issues #562 and #555:
Anchors now depend on the size of the network input rather than the size of the network output (final feature map): #555 (comment)
So the anchor values are 32 times larger.
Now filters=(classes+1+coords)*anchors_num, where anchors_num is the number of masks for this layer.
If mask is absent, then anchors_num = num for this layer:
(lines 31 to 37 at commit e4acba6)
Each [yolo] layer uses only those anchors whose indices are specified in the mask=
In YOLOv2 I made some design-choice errors: I made the anchor box size relative to the feature size in the last layer. Since the network downsamples by 32, this means it was relative to 32 pixels, so an anchor of 9x9 was actually 288px x 288px.
In YOLOv3 anchor sizes are actual pixel values. This simplifies a lot of stuff and was only a little bit harder to implement.
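The factor-of-32 relationship can be checked directly. The v2 anchors below are the defaults from yolov2-voc.cfg; multiplying by the stride gives their v3-style pixel equivalents:

```python
# YOLOv2 anchors are relative to the final feature map (stride 32);
# YOLOv3 anchors are in network-input pixels, i.e. 32x larger.
STRIDE = 32

v2_anchors = [(1.3221, 1.73145), (3.19275, 4.00944), (5.05587, 8.09892),
              (9.47112, 4.84053), (11.2364, 10.0071)]  # yolov2-voc.cfg defaults

v3_equivalent = [(round(w * STRIDE), round(h * STRIDE)) for w, h in v2_anchors]
print(v3_equivalent)  # e.g. the first pair (1.3221, 1.73145) becomes (42, 55)
```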
Tips 7: What do the parameters printed during training mean?
See the forward_yolo_layer function in yolo_layer.c:
printf("Region %d Avg IOU: %f, Class: %f, Obj: %f, No Obj: %f, .5R: %f, .75R: %f, count: %d\n", net.index, avg_iou/count, avg_cat/class_count, avg_obj/count, avg_anyobj/(l.w*l.h*l.n*l.batch), recall/count, recall75/count, count);
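For convenience, such a line can be parsed back into its fields; the sample LINE below is made up for illustration, and the field meanings follow from the printf arguments above:

```python
import re

# A made-up example of one log line (values are illustrative, not real output).
LINE = ("Region 82 Avg IOU: 0.639104, Class: 0.793667, Obj: 0.562290, "
        "No Obj: 0.004417, .5R: 0.750000, .75R: 0.250000, count: 8")

# Field meanings (per forward_yolo_layer):
#   Avg IOU      - mean IoU of predicted vs. ground-truth boxes at this scale
#   Class        - mean probability assigned to the correct class
#   Obj / No Obj - mean objectness on matched / all locations
#   .5R / .75R   - recall at IoU thresholds 0.5 / 0.75
#   count        - number of ground-truth objects matched at this scale
pattern = (r"Region (\d+) Avg IOU: ([\d.]+), Class: ([\d.]+), Obj: ([\d.]+), "
           r"No Obj: ([\d.]+), \.5R: ([\d.]+), \.75R: ([\d.]+), count: (\d+)")
m = re.match(pattern, LINE)
layer, *stats, count = m.groups()
print(layer, stats, count)
```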
The network is now 106 layers deep; it is still fast, and accuracy has improved.
Small-object detection is improved, medium and large objects are somewhat weakened, and missed detections under occlusion remain (see the figure below).