Object detection models fall mainly into two-stage and one-stage families; the main one-stage representatives are the YOLO series and SSD. These are brief notes from studying the YOLO series.
1 YOLO v1
YOLO v1 was proposed in the 2015 paper "You Only Look Once: Unified, Real-Time Object Detection" and is the pioneering work of one-stage object detection. Its network architecture consists of 24 convolutional layers followed by two fully connected layers (note that the last fully connected layer can be understood as a linear transformation from 1*4096 to 1*1470, i.e. 7*7*30).
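The backbone itself is not reproduced here, but as a minimal sketch of just that last step (assuming the flattened convolutional features entering the final fully connected layer are 4096-dimensional), the 1*4096 -> 1*1470 mapping and the reshape to 7*7*30 look roughly like this:

import torch
import torch.nn as nn

# Minimal sketch of only the final YOLO v1 layer; the 24 conv layers and the first FC layer are omitted.
S, B, C = 7, 2, 20                                 # grid size, boxes per cell, number of classes
head = nn.Linear(4096, S * S * (B * 5 + C))        # 1*4096 -> 1*1470 linear transformation

features = torch.randn(1, 4096)                    # stand-in for one image's flattened conv features
out = head(features).view(-1, S, S, B * 5 + C)     # reshape to (batch, 7, 7, 30)
print(out.shape)                                   # torch.Size([1, 7, 7, 30])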
Understanding YOLO v1 comes down to three main points:
1.1 Grid division: the input image is 448*448, and YOLO divides it into 49 (7*7) cells. Each cell is responsible for predicting only one object: if an object's center point falls inside a cell, that cell is responsible for predicting that object.
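As a small illustration of this assignment rule (a toy sketch, not the paper's code; the box center below is made up):

# which cell "owns" a ground-truth object, assuming a 448*448 input and a 7*7 grid
img_size, S = 448, 7
cell_size = img_size / S            # 64 pixels per cell

cx, cy = 200.0, 300.0               # hypothetical ground-truth box center, in pixels

col = int(cx // cell_size)          # grid column, 0..6
row = int(cy // cell_size)          # grid row, 0..6
print(row, col)                     # 4 3 -> cell (4, 3) is responsible for this object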
1.2 Prediction output: the final network output is 7*7*30, which can also be viewed as 49 vectors of size 1*30. Each vector is laid out as (x, y, w, h, confidence) * 2 + 20, i.e. each vector predicts two bounding boxes with their confidences, plus the probabilities of the object belonging to each of the 20 classes (the VOC dataset has 20 classes).
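To make this layout concrete, here is a small sketch (my own, with a random tensor standing in for a real prediction) that splits one cell's 1*30 vector into the two boxes, their confidences, and the 20 class probabilities, and forms the class-specific score (confidence * class probability) that YOLO v1 uses at test time:

import torch

B, C = 2, 20
vec = torch.rand(B * 5 + C)                  # one cell's 1*30 prediction (random stand-in)

boxes = vec[:B * 5].view(B, 5)               # two rows of (x, y, w, h, confidence)
xywh = boxes[:, :4]                          # the two predicted box geometries
conf = boxes[:, 4]                           # the two box confidences
class_prob = vec[B * 5:]                     # 20 class probabilities, shared by both boxes

scores = conf.unsqueeze(1) * class_prob.unsqueeze(0)   # class-specific confidence, shape [2, 20]
print(scores.shape)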
1.3 Understanding the loss function: the loss function is written out below; a few concepts need to be clear first:
S^2: the final network output is 7*7*30, so there are 49 cells;
B: each cell (1*30) predicts two bboxes, so B = 2; only the bbox with the largest IOU against the ground truth takes part in the loss computation;
7*7 positive mask (the obj indicator): when the grid was laid out, if a ground-truth object's center point fell inside a cell, that cell's value is 1; only cells with value 1 take part in the computation;
7*7 negative mask (the noobj indicator): the complement of the positive mask.
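With these definitions, the full loss from the paper can be written out as:

$$
\begin{aligned}
L ={}& \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$

where lambda_coord = 5 and lambda_noobj = 0.5; the first two rows are part (1) below, the third row is part (2), and the last row is part (3).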
(1) Coordinate loss: the first part of the loss above is the coordinate loss of the predicted bboxes. Two things to note: first, the square root is taken of the width and height, which suppresses the loss of large objects and balances the loss contribution of small and large objects (a quick arithmetic check follows after part (3)); second, a weight coefficient of 5 (lambda_coord) is used because very few positive samples take part in the computation (e.g. in the 7*7 mask example above only three cells' coordinates are involved), so their weight is increased.
(2) Confidence loss: the second part is the confidence loss of the positive and negative bboxes. Pay attention to the ground-truth confidence: for positive samples the target confidence is the IOU between the predicted box and the ground truth (IOU * 1); for negative samples it is IOU * 0 = 0. Also, because negative samples far outnumber positive ones, the weight coefficient for negatives is set to 0.5 (lambda_noobj).
(3) Classification loss: the third part is the loss on the predicted class. The prediction is the 20-dimensional class-probability part of the network output, and the ground truth is the one-hot encoding of the labeled class (it can be viewed as a 20-way classification task: if the object is the fifth class, the encoding is 00001000000000000000).
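As a quick arithmetic check of the square-root trick in part (1): suppose a 300-pixel-wide box and a 20-pixel-wide box are each predicted 10 pixels too narrow. Without the square root both would contribute the same squared error, 10^2 = 100; with it,

$$
(\sqrt{300}-\sqrt{290})^2 \approx 0.08, \qquad (\sqrt{20}-\sqrt{10})^2 \approx 1.72,
$$

so the same absolute error is penalized roughly 20 times more heavily on the small box, which is exactly the balancing effect described above.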
Main characteristics of YOLO v1
Advantages:
(1) One-stage, so it is fast.
Disadvantages:
(1) It handles crowded objects poorly (when the grid is divided, each cell predicts only one object).
(2) Detection of small objects is poor, and it generalizes badly to objects with unusual aspect ratios.
(3) The network does not use batch normalization.
Below is a PyTorch implementation of the YOLO v1 loss computation and a small training/test driver (the network itself is imported from own_yolo_v1.network; not verified experimentally, for understanding only):
import torch
import torch.nn as nn
from torch.nn import functional
from torch.autograd import Variable
from torch.utils.data import DataLoader
import torchvision.models as models
import torchvision.transforms as transforms
import deepdish as dd

# Note: this code targets an older PyTorch API (Variable, ByteTensor masks, size_average).


class YoloLoss(nn.Module):
    def __init__(self, n_batch, B, C, lambda_coord, lambda_noobj, use_gpu=False):
        """
        :param n_batch: number of batches
        :param B: number of bounding boxes
        :param C: number of bounding classes
        :param lambda_coord: factor for loss which contain objects
        :param lambda_noobj: factor for loss which do not contain objects
        """
        super(YoloLoss, self).__init__()
        self.n_batch = n_batch
        self.B = B  # assume there are two bounding boxes
        self.C = C
        self.lambda_coord = lambda_coord
        self.lambda_noobj = lambda_noobj
        self.use_gpu = use_gpu

    def compute_iou(self, bbox1, bbox2):
        """
        Compute the intersection over union of two sets of boxes, each box is [x1,y1,w,h]
        :param bbox1: (tensor) bounding boxes, size [N,4]
        :param bbox2: (tensor) bounding boxes, size [M,4]
        :return:
        """
        # compute [x1,y1,x2,y2] w.r.t. top left and bottom right coordinates separately
        b1x1y1 = bbox1[:, :2] - bbox1[:, 2:]**2  # [N, (x1,y1)=2]
        b1x2y2 = bbox1[:, :2] + bbox1[:, 2:]**2  # [N, (x2,y2)=2]
        b2x1y1 = bbox2[:, :2] - bbox2[:, 2:]**2  # [M, (x1,y1)=2]
        b2x2y2 = bbox2[:, :2] + bbox2[:, 2:]**2  # [M, (x2,y2)=2]
        box1 = torch.cat((b1x1y1.view(-1, 2), b1x2y2.view(-1, 2)), dim=1)  # [N,4], 4=[x1,y1,x2,y2]
        box2 = torch.cat((b2x1y1.view(-1, 2), b2x2y2.view(-1, 2)), dim=1)  # [M,4], 4=[x1,y1,x2,y2]
        N = box1.size(0)
        M = box2.size(0)

        tl = torch.max(
            box1[:, :2].unsqueeze(1).expand(N, M, 2),  # [N,2] -> [N,1,2] -> [N,M,2]
            box2[:, :2].unsqueeze(0).expand(N, M, 2),  # [M,2] -> [1,M,2] -> [N,M,2]
        )
        br = torch.min(
            box1[:, 2:].unsqueeze(1).expand(N, M, 2),  # [N,2] -> [N,1,2] -> [N,M,2]
            box2[:, 2:].unsqueeze(0).expand(N, M, 2),  # [M,2] -> [1,M,2] -> [N,M,2]
        )

        wh = br - tl  # [N,M,2]
        wh[(wh < 0).detach()] = 0
        # wh[wh<0] = 0
        inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

        area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])  # [N,]
        area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])  # [M,]
        area1 = area1.unsqueeze(1).expand_as(inter)  # [N,] -> [N,1] -> [N,M]
        area2 = area2.unsqueeze(0).expand_as(inter)  # [M,] -> [1,M] -> [N,M]

        iou = inter / (area1 + area2 - inter)
        return iou

    def forward(self, pred_tensor, target_tensor):
        """
        :param pred_tensor: [batch,SxSx(Bx5+20))]
        :param target_tensor: [batch,S,S,Bx5+20]
        :return: total loss
        """
        n_elements = self.B * 5 + self.C
        batch = target_tensor.size(0)
        target_tensor = target_tensor.view(batch, -1, n_elements)
        # print(target_tensor.size())
        # print(pred_tensor.size())
        pred_tensor = pred_tensor.view(batch, -1, n_elements)

        # cells that contain / do not contain an object (the obj / noobj masks)
        coord_mask = target_tensor[:, :, 5] > 0
        noobj_mask = target_tensor[:, :, 5] == 0
        coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor)
        noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor)

        coord_target = target_tensor[coord_mask].view(-1, n_elements)
        coord_pred = pred_tensor[coord_mask].view(-1, n_elements)
        class_pred = coord_pred[:, self.B * 5:]
        class_target = coord_target[:, self.B * 5:]
        box_pred = coord_pred[:, :self.B * 5].contiguous().view(-1, 5)
        box_target = coord_target[:, :self.B * 5].contiguous().view(-1, 5)

        noobj_target = target_tensor[noobj_mask].view(-1, n_elements)
        noobj_pred = pred_tensor[noobj_mask].view(-1, n_elements)

        # compute loss which do not contain objects
        if self.use_gpu:
            noobj_target_mask = torch.cuda.ByteTensor(noobj_target.size())
        else:
            noobj_target_mask = torch.ByteTensor(noobj_target.size())
        noobj_target_mask.zero_()
        for i in range(self.B):
            noobj_target_mask[:, i * 5 + 4] = 1
        noobj_target_c = noobj_target[noobj_target_mask]  # only compute loss of c, size [2*B*noobj_target.size(0)]
        noobj_pred_c = noobj_pred[noobj_target_mask]
        noobj_loss = functional.mse_loss(noobj_pred_c, noobj_target_c, size_average=False)

        # compute loss which contain objects
        if self.use_gpu:
            coord_response_mask = torch.cuda.ByteTensor(box_target.size())
            coord_not_response_mask = torch.cuda.ByteTensor(box_target.size())
        else:
            coord_response_mask = torch.ByteTensor(box_target.size())
            coord_not_response_mask = torch.ByteTensor(box_target.size())
        coord_response_mask.zero_()
        coord_not_response_mask = ~coord_not_response_mask.zero_()
        for i in range(0, box_target.size()[0], self.B):
            # of the B predicted boxes in a cell, only the one with the largest IOU is "responsible"
            box1 = box_pred[i:i + self.B]
            box2 = box_target[i:i + self.B]
            iou = self.compute_iou(box1[:, :4], box2[:, :4])
            max_iou, max_index = iou.max(0)
            if self.use_gpu:
                max_index = max_index.data.cuda()
            else:
                max_index = max_index.data
            coord_response_mask[i + max_index] = 1
            coord_not_response_mask[i + max_index] = 0

        # 1. response loss
        box_pred_response = box_pred[coord_response_mask].view(-1, 5)
        box_target_response = box_target[coord_response_mask].view(-1, 5)
        contain_loss = functional.mse_loss(box_pred_response[:, 4], box_target_response[:, 4], size_average=False)
        loc_loss = functional.mse_loss(box_pred_response[:, :2], box_target_response[:, :2], size_average=False) + \
                   functional.mse_loss(box_pred_response[:, 2:4], box_target_response[:, 2:4], size_average=False)

        # 2. not response loss (computed but not used in the total below)
        box_pred_not_response = box_pred[coord_not_response_mask].view(-1, 5)
        box_target_not_response = box_target[coord_not_response_mask].view(-1, 5)

        # compute class prediction loss
        class_loss = functional.mse_loss(class_pred, class_target, size_average=False)

        # compute total loss
        total_loss = self.lambda_coord * loc_loss + contain_loss + self.lambda_noobj * noobj_loss + class_loss
        return total_loss


def test():
    voc = False
    vot = 1 - voc
    if voc:
        img_folder = '../codedata/voc2012train/JPEGImages'
        file = '../voc2012.txt'
        img_size = 448
        train_dataset = YoloDataset(img_folder=img_folder, file=file, img_size=img_size, S=7, B=2, C=20,
                                    transforms=[transforms.ToTensor()])
        train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, num_workers=0)
        train_iter = iter(train_loader)
        img, target = next(train_iter)
        print(target.size())
        target = Variable(target)
        img = Variable(img)
        net = YOLO_V1()
        pred = net(img)
        yololoss = YoloLoss(n_batch=2, B=2, C=20, lambda_coord=5, lambda_noobj=0.5)
        print(pred.size())
        print(target.size())
        loss = yololoss(pred, target)
        print(loss)

    if vot:
        img_folder = './small_train_dataset'
        bboxes = dd.io.load('girl_bbox_4dim.h5')
        learning_rate = 0.0005
        img_size = 224
        num_epochs = 2
        lambda_coord = 5
        lambda_noobj = .5
        n_batch = 5
        S = 7
        B = 2
        C = 1
        train_dataset = VotDataset(img_folder=img_folder, bboxes=bboxes, img_size=img_size, S=S, B=B, C=C,
                                   transforms=[transforms.ToTensor()])
        train_loader = DataLoader(train_dataset, batch_size=n_batch, shuffle=False, num_workers=2)
        yololoss = YoloLoss(n_batch=n_batch, B=B, C=C, lambda_coord=5, lambda_noobj=0.5)
        train_iter = iter(train_loader)
        img, target = next(train_iter)
        target = Variable(target)
        img = Variable(img)

        # use a pretrained VGG16 backbone with a YOLO-style head as a small stand-in network
        model = models.vgg16(pretrained=True)
        model.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 11 * 7 * 7),
            nn.Sigmoid(),
        )
        model.train()
        loss_fn = YoloLoss(n_batch, B, C, lambda_coord, lambda_noobj)
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-4)
        use_gpu = False
        for epoch in range(num_epochs):
            for i, (images, target) in enumerate(train_loader):
                images = Variable(images)
                target = Variable(target)
                if use_gpu:
                    images, target = images.cuda(), target.cuda()
                pred = model(images)
                print(pred.size())
                print(target.size())
                loss = loss_fn(pred, target)
                print(i + 1, loss)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                if i == 10:
                    break
            break


if __name__ == '__main__':
    from own_yolo_v1.network import *
    from own_yolo_v1.load_dataset import *
    test()