MDNet: Learning Multi-Domain Convolutional Neural Networks for Visual Tracking

1. Motivations

(1) tracking by detection/classification
(2) learning the generic (lower-level) properties shared by different tracked objects
(3) large variations/inconsistencies (occlusion, deformation, etc.) for single tracked objects across different sequences
(4) no need for large ConvNets like those used for classification tasks (tracking is just a two-category classification problem) -> use a relatively small network compared to the standard VGG-M.

2. (Pre-)Training Scheme

Given K sequences, train for N iterations, visiting the sequences in a round-robin way to learn the generic properties.
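
A minimal sketch of the round-robin order, assuming that iteration i draws its minibatch from sequence i mod K. The helper name `round_robin_schedule` is hypothetical and only illustrates the visiting order, not the actual network update.

```python
def round_robin_schedule(num_sequences, num_iterations):
    """Yield (iteration, sequence_index) pairs in round-robin order.

    Assumption: iteration i trains on sequence i % K, so every sequence
    is visited once before any sequence is visited a second time.
    """
    for i in range(num_iterations):
        yield i, i % num_sequences


# Example: K = 3 sequences, N = 7 iterations.
schedule = list(round_robin_schedule(3, 7))
print(schedule)
```

With K = 3 the sequence indices cycle 0, 1, 2, 0, 1, 2, 0, so no single sequence dominates the updates.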

For each minibatch (one sequence), randomly sample 8 frames. For each frame, sample 4 positives (objects) and 12 negatives (backgrounds). Each minibatch therefore has 8 x 4 = 32 positive samples and 8 x 12 = 96 negative samples for training.
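
The sampling step could be sketched as follows. This is an illustration only: `sample_minibatch` and the frame layout (a dict with "pos" and "neg" lists) are hypothetical, the boxes are placeholders, and the per-frame negative count is taken as 12 so that the totals match the 32 positive / 96 negative figures.

```python
import random


def sample_minibatch(sequence, frames_per_batch=8, pos_per_frame=4,
                     neg_per_frame=12, rng=random):
    """Draw one minibatch from a single sequence.

    sequence: list of frames; each frame is a dict with "pos" and "neg"
    lists of candidate boxes (hypothetical layout for this sketch).
    """
    frames = rng.sample(sequence, frames_per_batch)  # 8 distinct frames
    pos, neg = [], []
    for frame in frames:
        pos.extend(rng.sample(frame["pos"], pos_per_frame))
        neg.extend(rng.sample(frame["neg"], neg_per_frame))
    return pos, neg


# Toy sequence with arbitrary placeholder counts per frame.
toy_sequence = [
    {"pos": [("pos", f, i) for i in range(50)],
     "neg": [("neg", f, i) for i in range(200)]}
    for f in range(20)
]
pos, neg = sample_minibatch(toy_sequence, rng=random.Random(0))
print(len(pos), len(neg))
```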

3. Online Tracking

(1) Updating scheme: combine long-term and short-term updating; use bounding-box regression (as in F-RCNN) and hard negative mining (as in cascaded face detectors).

(2) Long-term and short-term updating

long-term: update the parameters at pre-determined time intervals, using the positive samples collected so far.

short-term: update the parameters using hard negatives mined over the short term (this matters a lot!)
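
A rough sketch of the two ingredients above, under stated assumptions: hard negatives are taken to be the negative samples that the current network scores highest as positive, and the long-term interval of 10 frames is a placeholder value. Both function names are hypothetical.

```python
def mine_hard_negatives(neg_samples, pos_scores, num_hard):
    """Return the num_hard negatives with the highest positive scores.

    Assumption: a negative is "hard" when the classifier is most
    inclined to call it a target, i.e. its positive score is high.
    """
    order = sorted(range(len(neg_samples)),
                   key=lambda i: pos_scores[i], reverse=True)
    return [neg_samples[i] for i in order[:num_hard]]


def is_long_term_update(frame_index, interval=10):
    """True on frames where the periodic long-term update should run.

    The interval value is an assumption made for this sketch.
    """
    return frame_index > 0 and frame_index % interval == 0


# Example: four negatives with their current positive scores.
hard = mine_hard_negatives(["a", "b", "c", "d"], [0.1, 0.8, 0.4, 0.6], 2)
print(hard)
```

Training on these high-scoring negatives concentrates the update on the background regions most easily confused with the target.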

(3) Determine the bounding-box candidates for the next frame

The positive score of a candidate should be larger than the given threshold (0.5).
The optimal target state x* is the candidate with the largest positive score.
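
The selection rule can be sketched as below, assuming the candidates have already been scored by the network. `select_target` and the (x, y, w, h) box layout are hypothetical names used only for illustration.

```python
def select_target(candidates, pos_scores, threshold=0.5):
    """Pick the candidate with the largest positive score.

    Returns (best_candidate, best_score, is_confident), where
    is_confident is True when the best score exceeds the threshold.
    """
    best = max(range(len(candidates)), key=lambda i: pos_scores[i])
    score = pos_scores[best]
    return candidates[best], score, score > threshold


# Example: three candidate boxes as (x, y, w, h) with their positive scores.
boxes = [(10, 10, 50, 50), (12, 11, 50, 50), (30, 30, 50, 50)]
x_star, score, ok = select_target(boxes, [0.3, 0.9, 0.2])
print(x_star, score, ok)
```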