Logistic Regression

Class relationship diagram of the sparkliblinear library

LR with spark liblinear
1. LR

Given a set of training label-instance pairs $\{(x_i, y_i)\}_{i=1}^{l}$, $x_i \in \mathbb{R}^n$, $y_i \in \{-1, 1\}$, $\forall i$,

the LR model with L2 regularization considers the following optimization problem:

$$\min_{w} f(w) = \frac{1}{2} w^T w + C \sum_{i=1}^{l} \log\left(1 + \exp(-y_i w^T x_i)\right) \tag{1}$$
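As a sanity check on (1), here is a minimal NumPy sketch of the objective; the data `X`, `y` and the function name are illustrative, not taken from the library:

```python
import numpy as np

def lr_objective(w, X, y, C):
    """f(w) = 1/2 w^T w + C * sum_i log(1 + exp(-y_i w^T x_i))."""
    margins = y * (X @ w)                     # y_i * w^T x_i for each instance
    loss = np.logaddexp(0.0, -margins).sum()  # numerically stable log(1 + exp(-m))
    return 0.5 * (w @ w) + C * loss

# tiny example: 2 instances, 2 features; at w = 0 each loss term is log(2)
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.zeros(2)
print(lr_objective(w, X, y, 1.0))  # → 2*log(2) ≈ 1.3863
```

Using `np.logaddexp` instead of a literal `log(1 + exp(...))` avoids overflow when the margin is very negative.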

2. A Trust Region Newton Method (TRON)

TRON obtains the truncated Newton step by approximately solving
$$\min_{d} q_t(d) \quad \text{subject to} \quad \|d\| \le \Delta_t \tag{2}$$

where $\Delta_t$ is the size of the trust region, and $q_t(d) = \nabla f(w_t)^T d + \frac{1}{2} d^T \nabla^2 f(w_t) d$ is the second-order Taylor approximation of $f(w_t + d) - f(w_t)$.

CG (Conjugate Gradient) is applied to approximately solve (2).
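The idea of solving (2) with CG while respecting the trust-region boundary can be sketched as a Steihaug-style truncated CG. This is a generic illustration of the technique, assuming only Hessian-vector products are available, not the library's exact code:

```python
import numpy as np

def _to_boundary(d, p, delta):
    # solve ||d + tau*p|| = delta for tau >= 0 (quadratic in tau)
    a, b, c = p @ p, 2 * (d @ p), d @ d - delta ** 2
    tau = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return d + tau * p

def steihaug_cg(g, hess_vec, delta, tol=1e-8, max_iter=100):
    """Approximately minimize q(d) = g^T d + 0.5 d^T H d s.t. ||d|| <= delta."""
    d = np.zeros_like(g)
    r = -g.copy()           # residual of the Newton system H d = -g
    p = r.copy()
    rs = r @ r
    if np.sqrt(rs) < tol:
        return d
    for _ in range(max_iter):
        Hp = hess_vec(p)
        pHp = p @ Hp
        if pHp <= 0:                         # negative curvature: step to boundary
            return _to_boundary(d, p, delta)
        alpha = rs / pHp
        d_next = d + alpha * p
        if np.linalg.norm(d_next) >= delta:  # step leaves the trust region
            return _to_boundary(d, p, delta)
        d = d_next
        r = r - alpha * Hp
        rs_next = r @ r
        if np.sqrt(rs_next) < tol:
            break
        p = r + (rs_next / rs) * p
        rs = rs_next
    return d
```

For logistic regression the Hessian is positive definite, so the negative-curvature branch is never taken; it is kept here because the same solver applies to general objectives.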

2.1 Distributed Algorithm

We partition the data matrix $X$ and the labels $Y$ into $p$ disjoint parts:

$$X = [X_1, \dots, X_p]^T, \quad Y = \mathrm{diag}(y_1, \dots, y_l) = \mathrm{diag}(Y_1, \dots, Y_p), \quad \sigma(v) \equiv [1 + \exp(-v_1), \dots, 1 + \exp(-v_n)]^T$$

We can observe that computing (12)-(14) requires only the local data partition $X_k$. The computation can therefore be done in parallel, with the partitions stored distributedly. After the map functions are computed, the results are reduced to the machine running the TRON algorithm to obtain the summation over all partitions.
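A plain-Python sketch of this map/reduce pattern, using the Hessian-vector product $\nabla^2 f(w)\,v = v + C \sum_k X_k^T D_k X_k v$ as the example (the partition list and function names are illustrative, not the library's API):

```python
import numpy as np

def partition_piece(Xk, yk, w, v):
    """Map step: X_k^T D_k X_k v, computed from the local partition only."""
    z = yk * (Xk @ w)                 # y_i * w^T x_i for local instances
    s = 1.0 / (1.0 + np.exp(-z))
    D = s * (1.0 - s)                 # diagonal of D restricted to this partition
    return Xk.T @ (D * (Xk @ v))

def hessian_vec(partitions, w, v, C):
    """Reduce step: sum the per-partition pieces, then add the regularizer term."""
    return v + C * sum(partition_piece(Xk, yk, w, v) for Xk, yk in partitions)

# splitting the data into parts does not change the result
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.zeros(2); v = np.array([1.0, 0.0])
whole = hessian_vec([(X, y)], w, v, C=1.0)
split = hessian_vec([(X[:2], y[:2]), (X[2:], y[2:])], w, v, C=1.0)
print(np.allclose(whole, split))  # → True
```

In Spark the per-partition pieces would be produced by a map over the distributed data and the summation by a reduce back to the driver, which runs the TRON iterations.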

3. Implementation Design

1) Loop Structure: the while loop is chosen to implement the software.

2) Data Encapsulation:

Two parallel arrays store the indices and feature values of an instance:

index1 index2 index3 index4 index5 …

value1 value2 value3 value4 value5 …

3) Using mapPartitions Rather Than map

4) Not to cache $\sigma(Y_k X_k w)$

5) Using Broadcast Variables
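The two-array sparse representation from item 2 can be exercised with a small sketch; the 1-based feature indices follow the LIBSVM data format, and all names here are illustrative:

```python
def sparse_dot(indices, values, w):
    """w^T x for an instance stored as parallel index/value arrays
    (1-based feature indices, as in the LIBSVM format)."""
    return sum(w[i - 1] * v for i, v in zip(indices, values))

# instance x = (0, 3.0, 0, 1.5): only features 2 and 4 are non-zero
indices = [2, 4]
values = [3.0, 1.5]
w = [0.5, 2.0, 0.0, -1.0]
print(sparse_dot(indices, values, w))  # → 2.0*3.0 + (-1.0)*1.5 = 4.5
```

Storing only the non-zero entries makes both memory use and the dot products inside the gradient/Hessian computations proportional to the number of non-zeros rather than the full dimension.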
