MDNet: Learning Multi-Domain Convolutional Neural Networks for Visual Tracking
1. Motivations
(1) tracking by detection/classification
(2) learning the generic properties (lower level) shared by different tracked objects
(3) large variations/inconsistencies (occlusion, deformation, etc.) in the tracked objects across different sequences
(4) no need for a large ConvNet like those used for classification tasks (tracking is just a two-class problem: target vs. background) -> use a relatively small network compared to the standard VGG-M.
2. (Pre-)Training Scheme
Given K training sequences and N iterations, train in a round-robin way: the k-th iteration uses a minibatch from sequence (k mod K), so that the shared layers learn generic properties.
For each minibatch (one per sequence), randomly sample 8 frames. For each frame, sample 4 pos (objects) and 12 neg (backgrounds). (Each sequence/minibatch therefore has 8 x 4 = 32 positive samples and 8 x 12 = 96 negative samples for training.)
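A minimal sketch of the round-robin schedule and minibatch sampling described above. The data layout (a list of sequences, each a list of frames holding "pos" and "neg" sample lists) and the function name are assumptions made for illustration, not the authors' code. 12 negatives per frame is assumed so that 8 frames give the stated 96 negatives.

```python
import random


def round_robin_minibatches(sequences, num_iterations, frames_per_batch=8,
                            pos_per_frame=4, neg_per_frame=12, seed=0):
    """Yield (sequence_index, minibatch) pairs in round-robin order.

    `sequences` is a hypothetical layout: a list of K sequences, each a list
    of frames, each frame a dict with "pos" and "neg" lists of samples.
    """
    rng = random.Random(seed)
    num_sequences = len(sequences)
    for iteration in range(num_iterations):
        # Iteration i trains on sequence (i mod K).
        k = iteration % num_sequences
        # Pick 8 distinct frames from this sequence.
        frames = rng.sample(sequences[k], frames_per_batch)
        pos, neg = [], []
        for frame in frames:
            pos.extend(rng.sample(frame["pos"], pos_per_frame))
            neg.extend(rng.sample(frame["neg"], neg_per_frame))
        yield k, {"pos": pos, "neg": neg}
```

Each sequence needs at least 8 frames, and each frame at least 4 positive and 12 negative samples, otherwise random.sample raises ValueError.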
3. Online Tracking
(1) Updating scheme: combines long-term and short-term updating; uses bounding-box regression (as in R-CNN) and hard negative mining (as in cascaded face detectors).
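(2) Long-term and short-term updating
long-term: update the parameters at predetermined time intervals, using positive samples collected over a long period.
short-term: update the parameters when a potential tracking failure is detected (the estimated target's positive score falls below 0.5), using recent positive samples and hard negatives mined in the short term (this matters a lot!)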
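The two ideas above can be sketched roughly as follows. Both function names and the `positive_score` callable are hypothetical. The rule that a short-term update fires on a detected tracking failure is an assumption of this sketch, and the interval length is left as a parameter because these notes do not state it.

```python
def mine_hard_negatives(negatives, positive_score, num_hard):
    """Keep the negatives that the current model scores highest as positive.

    `positive_score` is a hypothetical callable mapping a sample to the
    model's positive (target) score; high-scoring negatives are the hard ones.
    """
    ranked = sorted(negatives, key=positive_score, reverse=True)
    return ranked[:num_hard]


def update_kind(frame_index, tracking_failed, long_term_interval):
    """Decide which update to run at this frame: "short", "long", or None.

    Assumption: a tracking failure triggers a short-term update immediately;
    otherwise a long-term update runs every `long_term_interval` frames.
    """
    if tracking_failed:
        return "short"
    if frame_index > 0 and frame_index % long_term_interval == 0:
        return "long"
    return None
```

In a tracker loop, `update_kind` would be called once per frame, and `mine_hard_negatives` would build the negative part of each update minibatch.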
(3) Determining the bounding box among the candidates for the next frame
- its positive score should be larger than a given threshold (0.5)
- find the optimal target state x* with the largest positive score.
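The selection rule above can be sketched as an argmax over candidate scores followed by a threshold check. The function name and the (x, y, w, h) box tuples in the usage note are illustrative assumptions. How the candidates are generated and scored is outside this sketch.

```python
def select_target(candidates, positive_scores, threshold=0.5):
    """Return (x_star, best_score, is_confident) for one frame.

    x_star is the candidate with the largest positive score; is_confident
    tells whether that score exceeds the threshold (0.5 in these notes).
    """
    if not candidates or len(candidates) != len(positive_scores):
        raise ValueError("need one score per candidate, and at least one candidate")
    best_index = max(range(len(candidates)), key=lambda i: positive_scores[i])
    best_score = positive_scores[best_index]
    return candidates[best_index], best_score, best_score > threshold
```

For example, with boxes [(10, 10, 40, 40), (12, 11, 40, 40), (30, 5, 40, 40)] and scores [0.2, 0.8, 0.6], the second box is returned as x* and the result is marked confident.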