[Paper Notes] Real-Time Rotation-Invariant Face Detection with Progressive Calibration Networks

Source paper: Real-Time Rotation-Invariant Face Detection with Progressive Calibration Networks

Introduction

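Previous face detection methods mostly assume that the face to be detected is upright and frontal. In practice faces are not necessarily upright, and few current detection methods specifically handle the rotation-invariant case.
The paper introduces an architecture for this: Progressive Calibration Networks (PCN).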

First, a term defined in the paper, the rotation-in-plane (RIP) angle:
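The RIP angle is the rotation angle measured from the Y axis to the direction of the forehead, negative to the left and positive to the right. In the paper's example figure, the RIP angle is ${-120^\circ}$.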

The paper discusses three traditional approaches to the face rotation problem and compares their strengths and weaknesses:
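Data Augmentation:
Rotate the original training data so that the model learns rotated faces. The method is simple, but as the data distribution becomes more diverse, it calls for a larger neural network and more computation time.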

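Divide-and-Conquer:
Train a separate model for each range of RIP angles, e.g. [-45,45], [-135,-45], [-180,-135], [45,135], [135,180]. With five ranges, five models are needed, each predicting the location and probability of faces under its own RIP angle condition. Because accuracy trades off against the number of ranges, this approach costs more computation time.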

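Rotation Router:
Directly estimate the RIP angle of each face candidate, rotate the candidate upright, and then run the prediction. However, face RIP angle estimation is itself a hard problem, and errors there degrade face detection performance.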

Framework

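The overall architecture is a two-stage model: after region proposal, the candidates that may contain a target are fed to PCN as input. PCN itself consists of three stages. Each stage progressively narrows the RIP angle range and rejects the candidates least likely to be faces. Each stage has three outputs: the probability that the candidate is a face, the position and size of the face bounding box, and the RIP angle estimate for the face.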

For selecting training inputs, samples are divided into three types by their IoU with the ground truth:
Positive Samples (IoU > 0.7), Suspected Samples (0.4 < IoU < 0.7), Negative Samples (IoU < 0.4)
Positive and negative samples are used to train face/non-face classification, while positive and suspected samples are used to train bounding box position and size regression and RIP angle estimation.
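
The sample assignment above can be sketched in a few lines of Python. This is only an illustration: the `(x, y, w, h)` box format, the function names, and the `USED_FOR` table are assumptions made for this example, and windows whose IoU falls exactly on a threshold are left unassigned because the notes do not say how they are handled.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x, y, w, h); format is an assumption."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def sample_type(window, gt_box):
    """Assign a window to a sample type using the thresholds quoted in the notes."""
    v = iou(window, gt_box)
    if v > 0.7:
        return "positive"
    if 0.4 < v < 0.7:
        return "suspected"
    if v < 0.4:
        return "negative"
    return None  # exactly on a threshold: not specified in the notes


# Which training objectives each sample type feeds (hypothetical helper table).
USED_FOR = {
    "positive": ("cls", "reg", "cal"),
    "suspected": ("reg", "cal"),
    "negative": ("cls",),
}
```

For example, with a ground-truth box `(0, 0, 10, 10)`, the window `(0, 0, 10, 5)` has IoU 0.5 and is treated as a suspected sample, so it contributes to regression and calibration but not to face classification.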

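The input to the network in each stage is the image at different scales. The stage output is written as:
${PCN_{i}(I) = [f, t, g]}$

$i$ is the stage number, $i \in \{1,2,3\}$.
$f$ is the face confidence score, i.e. how likely the input image is a face.
$t$ is the bounding box regression prediction, a vector holding the predicted position and size of the bounding box.
$g$ is the orientation score: the smaller the estimated RIP angle, the larger $g$.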

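The objective has three parts: $L_{cls}~,~ L_{reg}~,~ L_{cal}$

$L_{cls} = -\left[\, y\log(f) + (1-y)\log(1-f) \,\right]$

$y$ is the label: $1$ for a face, $0$ for a non-face. The leading minus sign makes this the cross-entropy loss, so that smaller values are better.

${L_{reg} = S(t,t^*)}$

$t$ and $t^*$ are the bounding box regression prediction and the ground truth, respectively. The vectors encode $(a , b)$, the top-left coordinate, and $w$, the width of the target's bounding box.
$S(\cdot)$ is the smooth L1 loss used in Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, so it is not repeated here.

$L_{cal} = -\left[\, y\log(g) + (1-y)\log(1-g) \,\right]$

$g$ is an inverse indicator of the estimated RIP angle. Here $y$ is $1$ when the forehead points up and $0$ otherwise, so this objective pushes $g$ to be as large as possible for upright faces.

Training treats $L_{cls}$ as the primary term and weights $L_{reg}$ and $L_{cal}$, minimizing $L_{cls}+ \lambda _{reg}L_{reg}+ \lambda _{cal}L_{cal}$.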
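
A minimal sketch of the combined objective for a single sample, written in plain Python. The cross-entropy terms are in negated form so that smaller is better, and $S$ is taken to be the smooth L1 loss. The function names, the `eps` clamp, and the default weights of `1.0` are assumptions for illustration, not values from the paper; the sketch also applies all three terms to one sample, whereas in training each term only uses the sample types it is defined for.

```python
import math


def cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy for label y in {0, 1} and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0); eps is an assumption
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def smooth_l1(t, t_star):
    """Smooth L1 loss summed over the components of the regression vector."""
    total = 0.0
    for pred, target in zip(t, t_star):
        d = abs(pred - target)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def pcn_loss(f, t, g, y_face, t_star, y_up, lambda_reg=1.0, lambda_cal=1.0):
    """L_cls + lambda_reg * L_reg + lambda_cal * L_cal (weights are hypothetical)."""
    l_cls = cross_entropy(y_face, f)
    l_reg = smooth_l1(t, t_star)
    l_cal = cross_entropy(y_up, g)
    return l_cls + lambda_reg * l_reg + lambda_cal * l_cal
```

With `f = g = 0.5` and a perfect box prediction, both cross-entropy terms equal $\log 2$ and the regression term is zero.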

Progressive Calibration Networks

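Each stage network estimates the RIP angle of its input and corrects it. Unlike direct estimation, whose results often fall short, correcting the angle step by step with coarse-to-fine classification performs well in the experiments.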

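In stage 1, the network only decides whether the forehead points up or down, so there are two calibration classes. If $g > 0.5$, the face is treated as forehead-up: the image is not flipped and $\theta_{1}$ is recorded as $0^\circ$. Otherwise the face is treated as forehead-down: the image is rotated by $180^\circ$ and $\theta_{1}$ is recorded as $180^\circ$. After stage 1, the RIP angle range is reduced from [-180,180] to [-90,90].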
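
The stage 1 rule can be sketched as follows. The window is represented as a nested list of pixel rows purely for illustration, and the helper names (`stage1_theta`, `rotate_180`, `wrap_angle`, `calibrate_stage1`) are hypothetical, not taken from the paper's code.

```python
def stage1_theta(g):
    """Return theta_1 in degrees: 0 if the face is judged forehead-up, else 180."""
    return 0 if g > 0.5 else 180


def rotate_180(window):
    """Rotate a 2D window (list of rows) by 180 degrees."""
    return [row[::-1] for row in window[::-1]]


def wrap_angle(angle):
    """Wrap an angle in degrees into [-180, 180)."""
    return (angle + 180) % 360 - 180


def calibrate_stage1(window, g):
    """Apply the stage 1 calibration and return (calibrated_window, theta_1)."""
    theta1 = stage1_theta(g)
    calibrated = rotate_180(window) if theta1 == 180 else window
    return calibrated, theta1
```

For a face at $-120^\circ$ that stage 1 judges as forehead-down, the remaining angle after the flip is `wrap_angle(-120 - 180)`, i.e. $60^\circ$, which lies inside [-90,90].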

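In stage 2, the possible RIP angle interval is further divided into three calibration classes: [-90,-45], [-45,45], [45,90]
The decision rule is to find the index $i$ given by $\arg\max_i(g_{i})$, i.e. the class with the highest orientation score, and map it to one of three angles, $-90^\circ,0^\circ,90^\circ$, which are the three possible values of $\theta_{2}$.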
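
A small sketch of the stage 2 decision, assuming the class with the highest orientation score is selected and that the three scores are ordered as the classes [-90,-45], [-45,45], [45,90]. The names `STAGE2_ANGLES` and `stage2_theta` are made up for this example.

```python
# Calibration angle in degrees for each of the three stage 2 classes.
STAGE2_ANGLES = (-90, 0, 90)


def stage2_theta(g_scores):
    """Pick the class with the highest orientation score and return theta_2."""
    best = max(range(len(g_scores)), key=lambda i: g_scores[i])
    return STAGE2_ANGLES[best]
```

If a window still has $60^\circ$ of rotation left and the scores are `[0.1, 0.2, 0.7]`, the rule returns $\theta_{2} = 90^\circ$, and the remaining angle becomes $60^\circ - 90^\circ = -30^\circ$, inside [-45,45].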

In stage 3, the RIP angle is confined to [-45,45]. At this point the network explicitly estimates the rotation angle, together with the bounding box regression.
From this process, $\theta_{RIP}$ is obtained from $\theta_{1},\theta_{2},\theta_{3}$: $\theta_{RIP} = \theta_{1}+\theta_{2}+\theta_{3}$
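
To see how the three angles add up, the sketch below decomposes a known RIP angle the way an idealized PCN would, assuming every stage makes the correct decision. It replaces the networks with ground-truth comparisons, so it only illustrates the arithmetic; `decompose_rip` and `wrap_angle` are hypothetical helpers, and wrapping the sum back into [-180, 180) is an assumption added so that, for example, $240^\circ$ and $-120^\circ$ are treated as the same angle.

```python
def wrap_angle(angle):
    """Wrap an angle in degrees into [-180, 180)."""
    return (angle + 180) % 360 - 180


def decompose_rip(true_angle):
    """Split an RIP angle into (theta_1, theta_2, theta_3) with ideal stage decisions."""
    a = wrap_angle(true_angle)
    # Stage 1: forehead up (no flip) or forehead down (flip by 180 degrees).
    theta1 = 0 if -90 <= a <= 90 else 180
    residual = wrap_angle(a - theta1)  # now within [-90, 90]
    # Stage 2: pick one of the three coarse classes.
    if residual < -45:
        theta2 = -90
    elif residual > 45:
        theta2 = 90
    else:
        theta2 = 0
    # Stage 3: the remaining fine angle, within [-45, 45].
    theta3 = residual - theta2
    return theta1, theta2, theta3


theta1, theta2, theta3 = decompose_rip(-120)
theta_rip = wrap_angle(theta1 + theta2 + theta3)
```

For the $-120^\circ$ example, the decomposition is $180^\circ + 90^\circ - 30^\circ = 240^\circ$, which is the same orientation as $-120^\circ$.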

Evaluation Results

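The benefits of PCN come down to two points: robustness to variation and lower computation time. By reducing what was originally a $360$-class problem to a few classes, it avoids unnecessary computation. The accuracy in stage 1 and stage 2 is $95$ % and $96$ % respectively, and the mean error in stage 3 is $8^\circ$, a clear improvement over the $90$ % accuracy of the reference Rotation Invariant Neural Network-Based Face Detection. PCN is also faster and more accurate than Faster R-CNN (VGG16), SSD500 (VGG16), and R-FCN (ResNet-50).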