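This paper differs from the ones I have read before. It is built on traditional machine learning, statistics, and frequency-domain analysis, whereas the earlier papers were all based on deep learning, specifically convolutional neural networks. The main reason is that this paper dates from 2013.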
1 Abstract
- Our approach relies on multiple sources such as low confidence head detections, repetition of texture elements (using SIFT), and frequency-domain analysis to estimate counts, along with confidence associated with observing individuals, in an image region.
- We employ a global consistency constraint on counts using Markov Random Field.
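Unlike earlier methods, we train our method on a new dataset. The new dataset contains 50 images with about 64,000 people, and the number of people per image ranges from 94 to 4543, whereas the images used by earlier methods contain only a dozen or so people.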
3. Framework
3.1. Counting in Patches
We count people using three different and complementary sources. The three sources are later combined to obtain a single estimate of count for that patch using the individual counts and confidences.
3.1.1 HOG based Head Detections
For each patch, we use the number of detections, ηH, the mean and variance of scale, µH,s and σH,s, and the mean and variance of confidence, µH,c and σH,c. The consistency in scale and confidence is a measure of how reliable head detections are in that patch. There are many false negatives and positives since the images are inherently difficult (see Fig. 2).
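As a rough illustration of the five head-detection features listed above, the sketch below computes them from the detections in one patch. The input format (a list of `(scale, confidence)` pairs), the function name, the use of population variance, and the all-zero result for an empty patch are assumptions of this sketch, not details taken from the paper.

```python
from statistics import mean, pvariance


def head_detection_features(detections):
    """Summarize the head detections of one patch.

    `detections` is a hypothetical list of (scale, confidence) pairs,
    one pair per detected head.
    """
    count = len(detections)
    if count == 0:
        # Assumption: a patch without detections yields all-zero features.
        return {"count": 0, "scale_mean": 0.0, "scale_var": 0.0,
                "conf_mean": 0.0, "conf_var": 0.0}
    scales = [scale for scale, _ in detections]
    confidences = [conf for _, conf in detections]
    return {
        "count": count,                      # eta_H
        "scale_mean": mean(scales),          # mu_{H,s}
        "scale_var": pvariance(scales),      # sigma_{H,s}
        "conf_mean": mean(confidences),      # mu_{H,c}
        "conf_var": pvariance(confidences),  # sigma_{H,c}
    }
```

A small variance in scale and confidence then signals that the detections in the patch agree with each other, which is the reliability cue the text describes.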
3.1.2 Fourier Analysis
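In a very dense crowd image, a head may occupy only a few pixels. Add some distortion and, seen from a distance, individuals can no longer be told apart, so the image looks a bit like one person repeated many times (A crowd is inherently repetitive in nature, since all humans appear the same from a distance). In the frequency domain, this periodic occurrence of heads shows up as peaks, as illustrated in Fig. 3 (When crowd density in the patch is uniform, it can be captured by Fourier Transform, f(ξ), where the periodic occurrence of heads shows as peaks in the frequency domain).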
For a given patch P, we compute the gradient image, ∇(P), and then pass it through a low-pass filter to remove the very high-frequency content.
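To make the gradient and low-pass steps concrete, here is a minimal one-dimensional sketch using only the standard library. It works on a single intensity profile instead of a 2D patch, uses a circular central difference as the gradient, and counts local maxima of the filtered signal as a stand-in for heads. All function names, the 1D simplification, the cutoff parameter, and the peak-counting step are assumptions of this sketch, not the paper's implementation.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform, O(n^2), fine for short signals."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning only the real part."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]


def gradient(signal):
    """Circular central difference, a 1D stand-in for the gradient image."""
    n = len(signal)
    return [(signal[(t + 1) % n] - signal[t - 1]) / 2 for t in range(n)]


def low_pass(signal, cutoff):
    """Zero every frequency bin above `cutoff` cycles per signal length."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        # Bins k and n - k carry the same frequency magnitude.
        if min(k, n - k) > cutoff:
            spectrum[k] = 0
    return idft(spectrum)


def count_peaks(signal):
    """Count strict local maxima, treating the signal as circular."""
    n = len(signal)
    return sum(1 for t in range(n)
               if signal[t] > signal[t - 1] and signal[t] > signal[(t + 1) % n])


# Hypothetical profile: 8 repetitions of a "head" pattern plus fine texture.
n = 64
profile = [math.cos(2 * math.pi * 8 * t / n)
           + 0.3 * math.cos(2 * math.pi * 25 * t / n) for t in range(n)]
smooth = low_pass(gradient(profile), cutoff=10)
print(count_peaks(smooth))
```

The high-frequency texture term is removed by the filter, so only the repeating pattern contributes peaks.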
3.1.3 Interest Points based Counting
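Irrelevant content such as sky, buildings, and trees often appears in outdoor images, and Fourier analysis is crowd-blind, so this content distorts the estimated head positions. It is therefore necessary to discard it and count only in the regions of interest. To obtain counts or densities from the sparse SIFT features, we use Support Vector Regression (In order to obtain counts or densities using sparse SIFT features, we use Support Vector Regression using the counts computed at each patch from ground truth).
Poisson distribution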
N(I) = N(P1 ∪ P2 . . . Pn) = N(P1) + N(P2) + . . . + N(Pn), (1)
(2)
The above equation gives us a confidence for presence of crowd in a patch. The resulting confidence maps are shown in Fig. 4 for two images.
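Eq. 1 above treats the count of an image as the sum of the counts of its disjoint patches. The sketch below illustrates that additivity by binning annotated head positions into a regular grid. The grid layout, the function name, and the `(x, y)` point format are assumptions made for illustration only.

```python
import math


def count_per_patch(points, width, height, patch):
    """Count annotated points falling in each cell of a regular grid.

    `points` is a hypothetical list of (x, y) head annotations that lie
    strictly inside a `width` x `height` image; `patch` is the cell size.
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    counts = [[0] * cols for _ in range(rows)]
    for x, y in points:
        counts[int(y) // patch][int(x) // patch] += 1
    return counts


heads = [(5, 5), (15, 5), (15, 15), (16, 17), (29, 29)]
grid = count_per_patch(heads, width=30, height=30, patch=10)
# Because the patches do not overlap, the patch counts add up to the image count.
total = sum(sum(row) for row in grid)
```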
3.2. Fusion of Three Sources
Computing counts and confidences from the three sources, we scale individual features and regress using ϵSVR, with the counts computed from the annotations.
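The sketch below covers only two ingredients of this fusion step: scaling each feature column, and the ε-insensitive loss that ε-SVR minimizes. It does not train a regressor. Min-max scaling is an assumption here, since the notes do not say which scaling the paper uses, and the function names are hypothetical.

```python
def min_max_scale(rows):
    """Scale every feature column to [0, 1]; constant columns become 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(value - lo) / (hi - lo) if hi > lo else 0.0
             for value, lo, hi in zip(row, lows, highs)]
            for row in rows]


def epsilon_insensitive_loss(predicted, target, epsilon):
    """Errors smaller than `epsilon` cost nothing; larger ones grow linearly."""
    return max(0.0, abs(predicted - target) - epsilon)
```

Scaling matters because the three sources produce features on very different ranges (detection counts, confidences, frequency-domain counts), and an unscaled feature with a large range would dominate the regression.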
3.3. Counting in Images
In the graph, N denotes the four neighbors at the same level together with the intermediate nodes that connect a patch to the layers above and below it.
energy function
where labeling ℓ assigns a label ℓp ∈ L = {0, 1, 2, ..., Cmax} for every patch p ∈ P.
The inference starts by sweeping in four directions at the bottom level using Eq. 4.
The beliefs are then evaluated for each patch using Eq. 5.
Fig. 6 shows three instances where the estimated count of patch was improved based on neighbors (both spatial and layer).
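To give a feel for how neighbors can correct a patch estimate, here is a minimal min-sum message-passing sketch on a one-dimensional chain of patches. It is a generic formulation, not a reproduction of the paper's Eq. 4 and Eq. 5: the absolute-difference data cost, the linear smoothness cost, the chain (instead of a multi-layer grid with four sweep directions), and all names are assumptions of this sketch.

```python
def chain_min_sum(estimates, c_max, smooth_weight):
    """Relabel per-patch counts on a chain with min-sum message passing.

    Assumed costs: data cost |label - estimate| per patch, and
    smooth_weight * |label_p - label_q| between adjacent patches.
    """
    labels = range(c_max + 1)
    n = len(estimates)
    data = [[abs(label - est) for label in labels] for est in estimates]

    def pairwise(a, b):
        return smooth_weight * abs(a - b)

    # Forward sweep: fwd[i][l] is the message arriving at patch i from the left.
    fwd = [[0.0] * (c_max + 1) for _ in range(n)]
    for i in range(1, n):
        for lq in labels:
            fwd[i][lq] = min(data[i - 1][lp] + fwd[i - 1][lp] + pairwise(lp, lq)
                             for lp in labels)

    # Backward sweep: bwd[i][l] is the message arriving at patch i from the right.
    bwd = [[0.0] * (c_max + 1) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for lq in labels:
            bwd[i][lq] = min(data[i + 1][lp] + bwd[i + 1][lp] + pairwise(lp, lq)
                             for lp in labels)

    # Belief of each label = own data cost + both incoming messages.
    result = []
    for i in range(n):
        belief = [data[i][l] + fwd[i][l] + bwd[i][l] for l in labels]
        result.append(min(labels, key=lambda l: belief[l]))
    return result
```

With estimates `[5, 5, 20, 5, 5]`, `c_max=20`, and `smooth_weight=2`, the outlier in the middle is pulled down to agree with its neighbors, while `smooth_weight=0` leaves every estimate unchanged.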
——20190411