zhangjie

Derivatives for L-Softmax

2017-05-03T00:00:00+08:00

This post records the derivatives for Large-Margin Softmax. The code can be found here.

Basic

$$ w^Tx = |w||x|cos\theta $$

For $x$

$$ \frac{\partial |x|}{\partial x} = \frac{x}{|x|} $$

$$ \frac{\partial cos\theta}{\partial x} = \frac{\partial}{\partial x} \left( \frac{w^Tx}{|w||x|} \right) = \frac{w}{|w||x|} - \frac{(w^Tx)x}{|w||x|^3} $$

$$ \frac{\partial sin^2\theta}{\partial x} = \frac{\partial}{\partial x} \left( 1-cos^2\theta \right) = -2cos\theta \frac{\partial cos\theta}{\partial x} $$

$$ cos(m\theta) = \sum_{n=0}^{\lfloor \frac {m} {2} \rfloor} (-1)^n {m \choose {2n}} (cos\theta)^{m-2n} (sin^2\theta)^n $$

$$ \frac{\partial cos(m\theta)}{\partial x} = m(cos\theta)^{m-1} \frac{\partial cos\theta}{\partial x} + \sum_{n=1}^{\lfloor \frac{m}{2} \rfloor} (-1)^n{m \choose {2n}} \left[ n(cos\theta)^{m-2n}(sin^2\theta)^{n-1}\frac{\partial sin^2\theta}{\partial x} + (m-2n)(cos\theta)^{m-2n-1}(sin^2\theta)^n\frac{\partial cos\theta}{\partial x} \right] $$

$$ f = (-1)^k|w||x|cos(m\theta) - 2k|w||x| = \left[ (-1)^kcos(m\theta) - 2k \right]|w||x| $$

$$ \frac{\partial f}{\partial x} = \left[ (-1)^kcos(m\theta)-2k \right] \frac{|w|}{|x|}x + (-1)^k|w||x| \frac{\partial cos(m\theta)}{\partial x} $$

$$ \begin{align} \frac{\partial J}{\partial x_i} &= \sum_{j,j \neq y_i} \frac{\partial J}{\partial f_{i,j}} \cdot \frac{\partial f_{i,j}}{\partial x_i} + \frac{\partial J}{\partial f_{i,y_i}}\cdot\frac{\partial f_{i,y_i}}{\partial x_i} \\ &= \sum_{j}\frac{\partial J}{\partial f_{i,j}}\cdot w_j + \frac{\partial J}{\partial f_{i,y_i}}\left( \frac{\partial f_{i,y_i}}{\partial x_i} - w_{y_i} \right) \end{align} $$
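The derivative of $cos(m\theta)$ with respect to $x$ is easy to get wrong, so a numerical check can help. The following is a minimal sketch, assuming NumPy and Python 3.8+ (for `math.comb`); the function names and the random test vectors are illustrative and not taken from the linked code.

```python
import numpy as np
from math import comb  # available in Python 3.8+


def cos_theta(w, x):
    return w.dot(x) / (np.linalg.norm(w) * np.linalg.norm(x))


def dcos_theta_dx(w, x):
    nw, nx = np.linalg.norm(w), np.linalg.norm(x)
    return w / (nw * nx) - w.dot(x) * x / (nw * nx ** 3)


def cos_m_theta(c, m):
    # multiple-angle expansion in terms of cos(theta) and sin^2(theta)
    s2 = 1.0 - c * c
    return sum((-1) ** n * comb(m, 2 * n) * c ** (m - 2 * n) * s2 ** n
               for n in range(m // 2 + 1))


def dcos_m_theta_dx(w, x, m):
    c = cos_theta(w, x)
    s2 = 1.0 - c * c
    dc = dcos_theta_dx(w, x)
    ds2 = -2.0 * c * dc
    grad = m * c ** (m - 1) * dc  # the n = 0 term
    for n in range(1, m // 2 + 1):
        term = n * c ** (m - 2 * n) * s2 ** (n - 1) * ds2
        if m - 2 * n > 0:  # skip the term whose coefficient (m - 2n) is zero
            term = term + (m - 2 * n) * c ** (m - 2 * n - 1) * s2 ** n * dc
        grad = grad + (-1) ** n * comb(m, 2 * n) * term
    return grad


# compare the analytic gradient with central differences
rng = np.random.RandomState(0)
w, x, m, eps = rng.randn(5), rng.randn(5), 4, 1e-6
numeric = np.array([
    (cos_m_theta(cos_theta(w, x + eps * e), m)
     - cos_m_theta(cos_theta(w, x - eps * e), m)) / (2 * eps)
    for e in np.eye(5)])
analytic = dcos_m_theta_dx(w, x, m)
max_abs_diff = np.abs(numeric - analytic).max()
```

If the formulas are implemented consistently, `max_abs_diff` should be tiny compared with the gradient itself.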

For $w$

$$ \frac{\partial |w|}{\partial w} = \frac{w}{|w|} $$

$$ \frac{\partial cos\theta}{\partial w} = \frac{\partial}{\partial w} \left( \frac{w^Tx}{|w||x|} \right) = \frac{x}{|x||w|} - \frac{(w^Tx)w}{|x||w|^3} $$

$$ \frac{\partial sin^2\theta}{\partial w} = \frac{\partial}{\partial w} \left( 1-cos^2\theta \right) = -2cos\theta \frac{\partial cos\theta}{\partial w} $$

$$ cos(m\theta) = \sum_{n=0}^{\lfloor \frac {m} {2} \rfloor} (-1)^n {m \choose {2n}} (cos\theta)^{m-2n} (sin^2\theta)^n $$

$$ \frac{\partial cos(m\theta)}{\partial w} = m(cos\theta)^{m-1} \frac{\partial cos\theta}{\partial w} + \sum_{n=1}^{\lfloor \frac{m}{2} \rfloor} (-1)^n{m \choose {2n}} \left[ n(cos\theta)^{m-2n}(sin^2\theta)^{n-1}\frac{\partial sin^2\theta}{\partial w} + (m-2n)(cos\theta)^{m-2n-1}(sin^2\theta)^n\frac{\partial cos\theta}{\partial w} \right] $$

$$ f = (-1)^k|w||x|cos(m\theta) - 2k|w||x| = \left[ (-1)^kcos(m\theta) - 2k \right]|w||x| $$

$$ \frac{\partial f}{\partial w} = \left[ (-1)^kcos(m\theta)-2k \right] \frac{|x|}{|w|}w + (-1)^k|w||x| \frac{\partial cos(m\theta)}{\partial w} $$

$$ \begin{align} \frac{\partial J}{\partial w_j} &= \sum_{i,y_i \neq j} \frac{\partial J}{\partial f_{i,j}} \cdot \frac{\partial f_{i,j}}{\partial w_j} + \sum_{i,y_i=j} \frac{\partial J}{\partial f_{i,j}} \cdot \frac{\partial f_{i,j}}{\partial w_j} \\ &= \sum_i \frac{\partial J}{\partial f_{i,j}}\cdot x_i + \sum_{i,y_i=j}\frac{\partial J}{\partial f_{i,j}}\cdot \left( \frac{\partial f_{i,j}}{\partial w_j} - x_i \right) \end{align} $$

HandWrite

Reference

Reinforcement Learning Notes

2017-03-09T00:00:00+08:00

Learning note on Markov Decision Process.

Markov Property

$$ \mathbb{P}[S_{t+1}|S_t] = \mathbb{P}[S_{t+1}|S_1, ..., S_t] $$

Markov Process or Markov Chain

$\langle S, P \rangle$

$S$ is a finite set of states.

$P$ is a state transition probability matrix. $P_{ss'} = \mathbb{P}[S_{t+1}=s'|S_t=s]$

Markov Reward Process

$\langle S, P, R, \gamma \rangle$

$S$ is a finite set of states.

$P$ is a state transition probability matrix. $P_{ss'} = \mathbb{P}[S_{t+1}=s'|S_t=s]$

$R$ is a reward function, $R_s = \mathbb{E}[R_{t+1}|S_t=s]$

$\gamma$ is a discount factor, $\gamma \in [0, 1]$

Value Function

value function $v(s)$ gives the long-term value of state $s$

$$ G_t = R_{t+1} + \gamma R_{t+2} + ... = \sum_{k=0}^\infty\gamma^k R_{t+k+1} $$

$$ \begin{eqnarray} v(s) &=& \mathbb{E}[G_t|S_t=s] \\ &=& \mathbb{E}[R_{t+1} + \gamma(R_{t+2}+\gamma R_{t+3}+...)|S_t=s] \\ &=& \mathbb{E}[R_{t+1} + \gamma G_{t+1}|S_t=s] \\ &=& \mathbb{E}[R_{t+1} + \gamma v(S_{t+1})|S_t=s] \end{eqnarray} $$

Bellman Equation for MRP

$$ \begin{eqnarray} v(s) &=& \mathbb{E}[R_{t+1} + \gamma v(S_{t+1})|S_t=s] \\ &=& R_s + \gamma \sum_{s' \in S}P_{ss'}v(s') \end{eqnarray} $$

$$ v = R + \gamma P v $$
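Because the matrix form is linear in $v$, a small MRP can be solved directly as $(I - \gamma P)v = R$. Here is a minimal sketch with NumPy; the three-state transition matrix and rewards are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical 3-state MRP; state 2 is absorbing with zero reward.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])   # each row sums to 1
R = np.array([1.0, -2.0, 0.0])
gamma = 0.9

# v = R + gamma * P v  <=>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)

# the solution should satisfy the Bellman equation up to round-off
bellman_residual = np.abs(v - (R + gamma * P.dot(v))).max()
```

The direct solve costs $O(n^3)$ in the number of states, which is why iterative methods are preferred for large problems.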

Markov Decision Process

$\langle S, A, P, R, \gamma \rangle$

$S$ is a finite set of states

$A$ is a finite set of actions

$P$ is state transition probability matrix, $P_{ss'}^a = \mathbb{P}[S_{t+1}=s'|S_t=s,A_t=a]$

$R$ is a reward function, $R_s^a = \mathbb{E}[R_{t+1}|S_t=s,A_t=a]$

$\gamma$ is a discount factor, $\gamma \in [0, 1]$

Policy

A policy $\pi$ is a distribution over actions given states

$$ \pi(a|s) = \mathbb{P}[A_t=a|S_t=s] $$

$$ A_t \sim \pi(\cdot|s), \forall t \gt 0 $$

Given $M = \langle S, A, P, R, \gamma \rangle$ and policy $\pi$

$S_1, S_2, ...$ is a Markov process $\langle S, P^\pi \rangle$

$S_1, R_2, S_2, R_3, ...$ is a Markov reward process $\langle S, P^\pi, R^\pi, \gamma \rangle$

$$ P_{ss'}^\pi = \sum_{a \in A}\pi(a|s)P_{ss'}^a $$

$$ R_s^\pi = \sum_{a \in A}\pi(a|s)R_s^a $$

state-value function $v_\pi(s)$

$$ v_\pi(s) = \mathbb{E}_\pi[G_t|S_t=s] $$

action-value function $q_\pi(s, a)$

$$ q_\pi(s, a) = \mathbb{E}_\pi[G_t|S_t=s, A_t=a] $$

Bellman Equation for value function

$$ v_\pi(s) = \mathbb{E_\pi}[R_{t+1} + \gamma v_\pi(S_{t+1})|S_t=s] $$

$$ q_\pi(s, a) = \mathbb{E_\pi}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1})|S_t=s,A_t=a] $$

$$ v_\pi(s) = \sum_{a \in A}\pi(a|s)q_\pi(s, a) $$

$$ q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S}P_{ss'}^a v_\pi(s') $$

$$ v_\pi(s) = \sum_{a \in A}\pi(a|s)(R_s^a + \gamma \sum_{s' \in S}P_{ss'}^a v_\pi(s')) $$

$$ v_\pi = R^\pi + \gamma P^\pi v_\pi $$

$$ q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S}P_{ss'}^a \sum_{a' \in A}\pi(a'|s')q_\pi(s', a') $$

Optimal Value Function

$$ v_\ast(s) = \max_{\pi}v_\pi(s) $$

$$ q_\ast(s, a) = \max_{\pi}q_\pi(s, a) $$

policy ordering

$$ \pi \ge \pi' \quad \text{if} \quad v_\pi(s) \ge v_{\pi'}(s), \forall s $$

There exists an optimal policy $\pi_\ast$ such that $\pi_\ast \ge \pi, \forall \pi$

All optimal policies achieve the optimal value function, $v_{\pi_\ast}(s) = v_\ast(s)$

All optimal policies achieve the optimal action-value function, $q_{\pi_\ast}(s, a) = q_\ast(s, a)$

$$ \begin{eqnarray} \pi_\ast(a|s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A}q_\ast(s, a) \\ 0 & \text{otherwise} \end{cases} \end{eqnarray} $$

$$ v_\ast(s) = \max_{a}q_\ast(s, a) $$

$$ q_\ast(s, a) = R_s^a + \gamma \sum_{s' \in S}P_{ss'}^av_\ast(s') $$

$$ v_\ast(s) = \max_{a}\left( R_s^a + \gamma \sum_{s' \in S}P_{ss'}^av_\ast(s') \right) $$

$$ q_\ast(s, a) = R_s^a + \gamma \sum_{s' \in S}P_{ss'}^a\max_{a'}q_\ast(s', a') $$

Summary

$$ v_\pi(s) = \sum_{a \in A}\pi(a|s)q_\pi(s, a) $$

$$ q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S}P_{ss'}^a v_\pi(s') $$

$$ v_\ast(s) = \max_{a \in A}q_\ast(s, a) $$

$$ q_\ast(s, a) = R_s^a + \gamma \sum_{s' \in S}P_{ss'}^av_\ast(s') $$

Learning note on Planning by Dynamic Programming.

Bellman Expectation Equation

$v_\pi(s)$, $q_\pi(s, a)$

$$ v_\pi(s) = \sum_{a \in A}\pi(a|s)q_\pi(s, a) $$

$$ q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S}P_{ss'}^a v_\pi(s') $$

Bellman Optimality Equation

$v_\ast(s) $, $q_\ast(s, a)$

$$ v_\ast(s) = \max_{a \in A}q_\ast(s, a) $$

$$ q_\ast(s, a) = R_s^a + \gamma \sum_{s' \in S}P_{ss'}^a v_\ast(s') $$

Iterative Policy Evaluation

Problem: evaluate a given policy $\pi$

Solution: iterative application of Bellman Expectation Equation

$v_1 \to v_2 \to ... \to v_\pi$

$$ v_{k+1}(s) = \sum_{a \in A}\pi(a|s)(R_s^a + \gamma \sum_{s' \in S}P_{ss'}^a v_k(s')) $$
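The backup above can be sketched in a few lines of NumPy. The array layout (`P[a, s, s']`, `R[s, a]`, `pi[s, a]`) and the two-state MDP are assumptions made for this example, not something fixed by the notes.

```python
import numpy as np


def policy_evaluation(P, R, pi, gamma, n_iters=500):
    """Iterative policy evaluation with synchronous backups.

    P[a, s, t]: probability of moving from s to t under action a
    R[s, a]:    expected immediate reward
    pi[s, a]:   probability of taking action a in state s
    """
    v = np.zeros(R.shape[0])
    for _ in range(n_iters):
        # q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * v[t]
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        v = (pi * q).sum(axis=1)
    return v


# Hypothetical MDP: action 0 always leads to state 0, action 1 to state 1.
P = np.array([[[1.0, 0.0], [1.0, 0.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.full((2, 2), 0.5)   # uniform random policy
gamma = 0.5

v_iter = policy_evaluation(P, R, pi, gamma)

# closed-form check: v_pi = (I - gamma * P_pi)^{-1} R_pi
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = (pi * R).sum(axis=1)
v_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
```

For a discount factor below 1 the iteration is a contraction, so `v_iter` approaches `v_exact` geometrically.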

Policy Iteration

1. Given an initial policy $\pi$
2. Evaluate the policy $\pi$

$$ v_\pi(s) = \mathbb{E_\pi}[R_{t+1}+\gamma R_{t+2} + ...|S_t=s] $$

3. Improve the policy by acting greedily with respect to $v_\pi$

$$ \pi' = greedy(v_\pi) $$

4. Repeat steps 2 and 3 until $\pi$ converges to $\pi^\ast$

for deterministic policy $a = \pi(s)$

$$ \pi'(s) = \arg\max_{a \in A}q_\pi(s, a) $$

$$ q_\pi(s, \pi'(s)) = \max_{a \in A}q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s) $$

$$ \begin{eqnarray} v_\pi(s) &\le& q_\pi(s, \pi'(s)) = \mathbb{E_{\pi'}}[R_{t+1}+\gamma v_\pi(S_{t+1})|S_t=s] \\ &\le& \mathbb{E_{\pi'}}[R_{t+1}+\gamma q_\pi(S_{t+1}, \pi'(S_{t+1}))|S_t=s] \\ &\le& \mathbb{E_{\pi'}}[R_{t+1}+\gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi'(S_{t+2}))|S_t=s] \\ &\le& \mathbb{E_{\pi'}}[R_{t+1}+\gamma R_{t+2} +...|S_t=s] = v_{\pi'}(s) \end{eqnarray} $$

$$ v_\pi(s) \le v_{\pi'}(s) $$

If improvement stops, i.e. the policy has converged

$$ q_\pi(s, \pi'(s)) = \max_{a \in A}q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s) $$

$$ v_\pi(s) = \max_{a \in A}q_\pi(s, a) $$

so $v_\pi(s)$ satisfies Bellman Optimality Equation

$$ v_\pi(s) = v_\ast(s) $$

Value Iteration

Problem: find the optimal policy $\pi$

Solution: iterative application of the Bellman Optimality Equation

$v_1 \to v_2 \to ... \to v_\ast$

Using synchronous backups

at each iteration $k+1$
for all states $s \in S$
update $v_{k+1}(s)$ from $v_k(s')$

No explicit policy

Intermediate value function $v_k$ may not correspond to any policy

$$ v_{k+1}(s) = \max_{a \in A}\left( R_s^a + \gamma \sum_{s' \in S}P_{ss'}^a v_k(s') \right) $$
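A minimal value iteration sketch with NumPy follows. The array layout (`P[a, s, s']`, `R[s, a]`), the stopping tolerance and the two-state MDP are assumptions chosen for illustration.

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-10, max_iters=10000):
    """Return the (approximately) optimal values and a greedy policy.

    P[a, s, t]: probability of moving from s to t under action a
    R[s, a]:    expected immediate reward
    """
    v = np.zeros(R.shape[0])
    for _ in range(max_iters):
        # q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * v[t]
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        v_new = q.max(axis=1)          # synchronous backup over all states
        converged = np.abs(v_new - v).max() < tol
        v = v_new
        if converged:
            break
    return v, q.argmax(axis=1)


# Hypothetical MDP: action 0 always leads to state 0, action 1 to state 1.
P = np.array([[[1.0, 0.0], [1.0, 0.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.5

v_star, greedy_policy = value_iteration(P, R, gamma)
```

In this toy problem the best behavior is to alternate between the two states, which gives $v_\ast = (8/3, 10/3)$ by solving $v_0 = 1 + \gamma v_1$ and $v_1 = 2 + \gamma v_0$ by hand.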

Summary

| Problem | Bellman Equation | Algorithm |
| --- | --- | --- |
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |

Learning note on Model-Free Prediction.

Model-Free Prediction is about estimating the value function of an unknown MDP.

Monte-Carlo Reinforcement Learning

learn $v_\pi$ from complete episodes of experience under policy $\pi$.

$$ S_1, A_1, R_2, ..., S_k \sim \pi $$

The return is the total discounted reward

$$ G_t = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-1}R_T $$

$$ v_\pi(s) = \mathbb{E_\pi}[G_t|S_t=s] $$

Monte-Carlo policy evaluation uses empirical mean return instead of expected return.

$N(s) \gets N(s) + 1$
$S(s) \gets S(s) + G_t$
$V(s) = S(s) / N(s)$
$V(s) \to v_\pi(s)$ as $N(s) \to \infty$

Incremental update of $V(S_t)$ with return $G_t$

$$ N(S_t) \gets N(S_t) + 1 $$

$$ V(S_t) \gets V(S_t) + \frac{1}{N(S_t)}(G_t - V(S_t)) $$

$$ V(S_t) \gets V(S_t) + \alpha(G_t - V(S_t)) $$
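The incremental form with step size $1/N$ reproduces the ordinary sample mean without storing the sum $S(s)$. A small sketch, with a hypothetical helper name and made-up returns:

```python
def incremental_mean(returns):
    """Running mean of the observed returns for a single state."""
    v, n = 0.0, 0
    for g in returns:
        n += 1
        v += (g - v) / n   # V <- V + (G - V) / N
    return v


observed_returns = [1.0, 2.0, 6.0]
v_incremental = incremental_mean(observed_returns)
v_batch = sum(observed_returns) / len(observed_returns)
```

Replacing `1 / n` with a constant `alpha` gives the last update above, which weights recent returns more heavily and suits non-stationary problems.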

Temporal-Difference Learning

TD learns directly from episodes of experience. Episodes can be incomplete because TD bootstraps: it updates a guess towards a guess.

learn $v_\pi$ online from experience under policy $\pi$
$V(S_t) \gets V(S_t) + \alpha(G_t - V(S_t))$
replace actual return $G_t$ with estimated return $R_{t+1}+\gamma V(S_{t+1})$
$V(S_t) \gets V(S_t) + \alpha(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$
$R_{t+1}+\gamma V(S_{t+1})$ is called TD target
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called TD error
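The TD(0) update can be tried on a small random walk. This is a sketch under stated assumptions: the 7-state chain, the reward of 1 at the right end, the step size and the episode count are all choices made for this example.

```python
import random


def td0_random_walk(n_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """TD(0) prediction for a uniform random walk on states 0..6.

    States 0 and 6 are terminal; reaching state 6 gives reward 1, otherwise 0.
    """
    rng = random.Random(seed)
    V = [0.0] * 7                       # terminal values stay at 0
    for _ in range(n_episodes):
        s = 3                           # every episode starts in the middle
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            td_target = r + gamma * V[s_next]
            V[s] += alpha * (td_target - V[s])   # move V(S_t) toward the TD target
            s = s_next
    return V


V = td0_random_walk()
```

For this chain the true value of state $s$ is $s/6$, so the learned values should end up close to that line, with some noise left over from the constant step size.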

Driving Home Example: MC vs. TD

Differences between MC and TD

Return $G_t$ is an unbiased estimate of $v_\pi(S_t)$. The true TD target $R_{t+1}+\gamma v_\pi(S_{t+1})$ is also unbiased, while the TD target $R_{t+1}+\gamma V(S_{t+1})$ is a biased estimate.

The TD target has much lower variance than the return

Return depends on many random actions, transitions, rewards
TD target depends on one random action, transition, reward

MC has high variance and zero bias

Good convergence properties
Not very sensitive to initial value
Very simple to understand and use
MC doesn't exploit Markov property, usually more efficient in non-Markov environments

TD has low variance and some bias

Usually more efficient than MC
TD(0) converges to $v_\pi(s)$
More sensitive to initial value, because of bootstrapping
TD exploits Markov property, usually more efficient in Markov environments

TD and MC both converge: $V(s) \to v_\pi(s)$ as $experience \to \infty$

Monte-Carlo Backup

$$ V(S_t) \gets V(S_t) + \alpha(G_t - V(S_t)) $$

Temporal-Difference Backup

$$ V(S_t) \gets V(S_t) + \alpha(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)) $$

Dynamic Programming Backup

$$ V(S_t) \gets \mathbb{E_\pi}[R_{t+1} + \gamma V(S_{t+1})] $$

Bootstrapping: update involves an estimate

MC doesn't bootstrap
DP, TD bootstrap

Sampling: update samples an expectation

DP doesn't sample
MC, TD sample

$TD(\lambda)$

n-step return

$n = 1$, $G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})$, TD
$n = 2$, $G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
$n = \infty$, $G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-1} R_{T}$, MC

$$ G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}) $$

$$ V(S_t) \gets V(S_t) + \alpha (G_t^{(n)} - V(S_t)) $$

$\lambda$ return combines all n-step return with weight $(1-\lambda)\lambda^{n-1}$

$$ G_t^\lambda = (1-\lambda)\sum_{n=1}^\infty \lambda^{n-1}G_t^{(n)} $$

$$ V(S_t) \gets V(S_t) + \alpha(G_t^\lambda - V(S_t)) $$

Forward view of $TD(\lambda)$

Update value function towards the $\lambda$-return
Forward-view looks into the future to compute $G_t^\lambda$
Like MC, can only be computed from complete episodes

Backward view of $TD(\lambda)$

Eligibility traces

$$ E_0(s) = 0 $$

$$ E_t(s) = \gamma \lambda E_{t-1}(s) + 1(S_t=s) $$

$$ \delta_t = R_{t+1} +\gamma V(S_{t+1}) - V(S_t) $$

$$ V(s) \gets V(s) + \alpha \delta_t E_t(s) $$

The sum of offline updates is identical for forward-view and backward view for $TD(\lambda)$

$$ \sum_{t=1}^{T}\alpha \delta_t E_t(s) = \sum_{t=1}^T \alpha (G_t^\lambda - V(S_t))1(S_t=s) $$
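The backward view is simple to implement because it only needs the current transition and the trace vector. Below is a minimal sketch with online updates; the transition tuple format `(s, r, s_next, done)` and the two-step episode are assumptions made for this example.

```python
def td_lambda_episode(V, episode, alpha, gamma, lam):
    """Backward-view TD(lambda) over one episode, updating V in place.

    V:       dict mapping non-terminal states to value estimates
    episode: list of (s, r, s_next, done) transitions
    """
    E = {s: 0.0 for s in V}                 # eligibility traces, E_0(s) = 0
    for s, r, s_next, done in episode:
        for k in E:
            E[k] *= gamma * lam             # decay every trace
        E[s] += 1.0                         # bump the trace of the visited state
        next_value = 0.0 if done else V[s_next]
        delta = r + gamma * next_value - V[s]
        for k in V:
            V[k] += alpha * delta * E[k]    # every state shares the TD error
    return V


episode = [('A', 0.0, 'B', False), ('B', 1.0, 'T', True)]
V_lam1 = td_lambda_episode({'A': 0.0, 'B': 0.0}, episode, 0.5, 1.0, 1.0)
V_lam0 = td_lambda_episode({'A': 0.0, 'B': 0.0}, episode, 0.5, 1.0, 0.0)
```

With $\lambda = 1$ the final reward is credited to both A and B, while with $\lambda = 0$ only B, the state visited just before the reward, is updated.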

Learning note on Model Free Control.

Model-Free Control is about optimizing the value function of an unknown MDP.

On-policy Monte-Carlo Control

On-policy: Learn about policy $\pi$ from experience sampled from $\pi$.

Off-policy: Learn about policy $\pi$ from experience sampled from $\mu$.

Greedy policy improvement over $Q(s, a)$ is model free

$$ \pi'(s) = \arg\max_{a \in A}Q(s, a) $$

$\epsilon$-Greedy Exploration

$$ \pi'(a|s) = \begin{cases} \frac{\epsilon}{m} + 1 - \epsilon, & a^\ast = \arg\max_{a \in A}Q(s, a) \\ \frac{\epsilon}{m}, & otherwise \end{cases} $$
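The action probabilities of an $\epsilon$-greedy policy can be computed as follows. This is a sketch with NumPy; the function name and the example action values are hypothetical.

```python
import numpy as np


def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities for one state, given its action values Q(s, .)."""
    m = len(q_row)
    probs = np.full(m, epsilon / m)                   # every action gets eps / m
    probs[int(np.argmax(q_row))] += 1.0 - epsilon     # greedy action gets the rest
    return probs


probs = epsilon_greedy_probs([0.1, 0.5, 0.2], epsilon=0.3)
# sample an action according to these probabilities
action = np.random.RandomState(0).choice(len(probs), p=probs)
```

Every action keeps a probability of at least $\epsilon / m$, which is what guarantees continued exploration.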

Every Episode

Policy evaluation: Monte-Carlo policy evaluation, $Q \approx q_\pi$
Policy improvement: $\epsilon$-greedy improvement

On-policy TD Control

SARSA

$$ Q(S,A) \gets Q(S,A) + \alpha(R+\gamma Q(S',A') - Q(S,A)) $$

Every time step

Policy evaluation: Sarsa, $Q \approx q_\pi$
Policy improvement: $\epsilon$-greedy improvement

Sarsa$(\lambda)$

n-Step Sarsa

$n=1$, $q_t^{(1)} = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$, Sarsa
$n=2$, $q_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2})$
$n=\infty$, $q_t^{(\infty)} = R_{t+1} +\gamma R_{t+2} + ... + \gamma^{T-1}R_T$

$$ q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1}R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n}) $$

$$ Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha (q_t^{(n)} - Q(S_t, A_t)) $$

$\lambda$ return as TD$(\lambda)$

$$ q_t^\lambda = (1-\lambda)\sum_{n=1}^\infty \lambda^{n-1}q_t^{(n)} $$

Forward View

$$ Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha (q_t^{\lambda} - Q(S_t, A_t)) $$

Backward View uses eligibility traces

$$ E_0(s, a) = 0 $$

$$ E_t(s, a) = \gamma \lambda E_{t-1}(s, a) + 1(S_t=s, A_t=a) $$

$Q(s, a)$ is updated for every state $s$ and action $a$

$$ \delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) $$

$$ Q(s, a) \gets Q(s, a) + \alpha \delta_t E_t(s, a) $$

Sarsa $(\lambda)$ makes reward information flow backward to the path it follows

Off-policy Learning

target policy $\pi$, behavior policy $\mu$, with importance sampling

$$ V(S_t) \gets V(S_t) + \alpha (\frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}(R_{t+1} + \gamma V(S_{t+1})) - V(S_t)) $$

Q-Learning

Consider off-policy learning of action-value $Q(s,a)$
No importance sampling required
Next action is chosen using behavior policy $A_{t+1} \sim \mu(\cdot|S_{t+1})$
But consider alternative successor action $A' \sim \pi(\cdot|S_{t+1})$
Update $Q(S_t, A_t)$ towards value of alternative action

$$ Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t)) $$

policy $\pi$ is greedy w.r.t $Q(s,a)$, policy $\mu$ is $\epsilon$-greedy w.r.t $Q(s,a)$

$$ \pi(S_{t+1}) = \arg\max_{a'}Q(S_{t+1}, a') $$

$$ R_{t+1} + \gamma Q(S_{t+1}, A') = R_{t+1} + \max_{a'}\gamma Q(S_{t+1}, a') $$

$$ Q(S,A) \gets Q(S,A) + \alpha(R + \gamma \max_{a'}Q(S', a') - Q(S,A)) $$
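The only difference between the Sarsa and Q-learning updates is the bootstrap term. A minimal sketch, assuming `Q` is stored as a list of per-state action-value lists; the numbers are made up for illustration.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action actually taken next."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: bootstrap from the greedy action in the next state."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])


# the same transition (s=0, a=0, r=1, s'=1) applied with both rules
Q_sarsa = [[0.0, 0.0], [1.0, 3.0]]
Q_qlearn = [[0.0, 0.0], [1.0, 3.0]]
sarsa_update(Q_sarsa, 0, 0, 1.0, 1, 0, alpha=0.5, gamma=0.9)
q_learning_update(Q_qlearn, 0, 0, 1.0, 1, alpha=0.5, gamma=0.9)
```

Here Sarsa uses $Q(S', A') = 1$ for the sampled next action, while Q-learning uses $\max_{a'}Q(S', a') = 3$, so the two rules move $Q(0, 0)$ to 0.95 and 1.85 respectively.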

Summary

References

RL Course by David Silver

Way to implement custom Layer for Deep Learning framework

2017-01-07T00:00:00+08:00

We often need to implement a custom operator or layer for the deep learning framework we are using, because the framework doesn't offer the operation we want. Sometimes an interesting paper introduces an unusual function that the framework doesn't support. Sometimes you want to change the behavior of a standard Layer implementation to suit your needs. Most often, you want to create a new function and plug it into a neural network in the hope of getting better results. In all these cases, you need to create a Layer or Operator for the framework you use.

Difference between Operator and Layer

DL frameworks like Caffe and Torch use Layer as their basic network component, while MXNet and TensorFlow use Operator. There is little difference between Operator and Layer if you focus only on the implementation of the Forward and Backward operations. A Layer usually holds its learnable parameters itself, while an Operator focuses only on the operation and leaves the parameters to other parts of the framework. We can sketch these two concepts with the following Python code.

class Layer(object):
    '''an example for Layer
    '''

    def __init__(self, initializer):
        '''initialize learnable parameters

        Parameters
        ----------
        initializer: way to initialize parameters
        '''
        self.params = {
            'weight': initializer('layer_weight'),
        }

    def forward(self, is_train, in_data, out_data):
        '''perform forward

        Parameters
        ----------
        is_train: train or test
        in_data: input data to this layer
        out_data: output data of this layer
        '''
        pass

    def backward(self, in_data, out_data, in_grad, out_grad):
        '''perform backward

        Parameters
        ----------
        in_data: input data to this layer
        out_data: output data of this layer
        in_grad: gradient w.r.t. to in_data, backprop to former layers
        out_grad: gradient w.r.t. to out_data, backprop from latter layers
        '''
        pass

    def update(self, updater):
        '''update learnable parameters

        Parameters
        ----------
        updater: updater using different optimize strategy to update parameters
        '''
        updater(self.params)


class Operator(object):
    '''an example for Operator
    '''

    def __init__(self):
        '''initialize operator
        '''
        pass

    def forward(self, is_train, in_data, out_data):
        '''perform forward

        Parameters
        ----------
        is_train: train or test
        in_data: input data to this layer including learnable parameters attached to this Operator
        out_data: output data of this layer
        '''
        pass

    def backward(self, in_data, out_data, in_grad, out_grad):
        '''perform backward

        Parameters
        ----------
        in_data: input data to this layer
        out_data: output data of this layer
        in_grad: gradient w.r.t. to in_data, backprop to former layers
        out_grad: gradient w.r.t. to out_data, backprop from latter layers
        '''
        pass

    def infer_shape(self, in_shape, out_shape):
        '''infer data shape of in_data and out_data
        this helps framework to collect the information about operator

        Parameters
        ----------
        in_shape: in_data shape
        out_shape: out_data shape
        '''
        pass

Since a Layer holds its parameters itself, it may need an initializer and an updater to initialize and update them. In most frameworks, however, a Layer that holds parameters doesn't need to handle the update itself; all it needs to do is put the gradients w.r.t. the parameters somewhere the framework can fetch them. Initialization can be handled the same way. So there seems to be little difference between Layer and Operator: a Layer accesses its parameters through its own class members, while an Operator accesses them through in_data. But because of this small difference, it is not easy (though still possible) for Layers to share parameters with each other, while Operators can do it easily. Parameter sharing is important for RNNs but uncommon for CNNs. That is one reason why frameworks like MXNet and TensorFlow use Operator instead of Layer as their basic network component.

Regardless of the difference between Layer and Operator, we still need to implement Forward and Backward for them, and the two are implemented the same way. So we will use Layer in the next part, but it reads its parameters from in_data like an Operator.

Write down formulas

Before we write any code, the first thing to do is work out all the formulas our Layer needs. Let's consider a function that acts like a fully connected Layer but modifies the output at some locations. This function is adapted from the paper Large-Margin Softmax Loss for Convolutional Neural Networks. The original function is rather complex, so I simplify it for this demonstration.

We define the function below.

$$ f_{i, j} = w_j^T \cdot x_i \quad j \neq y_i $$

$$ \begin{eqnarray} f_{i, y_i} = \begin{cases} w_{y_i}^T \cdot x_i & w_{y_i}^T \cdot x_i < 0 \\ k * w_{y_i}^T \cdot x_i & w_{y_i}^T \cdot x_i > 0 \end{cases} \end{eqnarray} $$

This is the same as a fully connected layer, except that $f_{i, y_i}$ is smaller than the original value when $f_{i, y_i} > 0$. Here $0 < k < 1$ is a hyperparameter of this Layer. We also omit the bias term. Next we need to calculate the derivatives w.r.t. $x$ and $w$.

$$ \frac {\partial f_{i, j}} {\partial x_i} = w_j \quad j \neq y_i $$

$$ \begin{eqnarray} \frac {\partial f_{i, y_i}} {\partial x_i} = \begin{cases} w_{y_i} & w_{y_i} \cdot x_i < 0 \\ k * w_{y_i} & w_{y_i} \cdot x_i > 0 \end{cases} \end{eqnarray} $$

$w$ goes the same way.

$$ \frac {\partial f_{i, j}} {\partial w_j} = x_i \quad j \neq y_i $$

$$ \begin{eqnarray} \frac {\partial f_{i, y_i}} {\partial w_{y_i}} = \begin{cases} x_i & w_{y_i} \cdot x_i < 0 \\ k * x_i & w_{y_i} \cdot x_i > 0 \end{cases} \end{eqnarray} $$

It's time to bring Loss $J$ in. Then we can write gradient w.r.t. $x$ and $w$.

$$ \begin{eqnarray} \frac {\partial J} {\partial x_i} &=& \sum_j \frac {\partial J} {\partial f_{i, j}} \frac {\partial f_{i, j}} {\partial x_i} \\ &=& \sum_{j, j \neq y_i} \frac {\partial J} {\partial f_{i, j}} w_j + \frac {\partial J} {\partial f_{i, y_i}} \frac {\partial f_{i, y_i}} {\partial x_i} \end{eqnarray} $$

$$ \begin{eqnarray} \frac {\partial J} {\partial w_j} &=& \sum_i \frac {\partial J} {\partial f_{i, j}} \frac {\partial f_{i, j}} {\partial w_j} \\ &=& \sum_{i, j \neq y_i} \frac {\partial J} {\partial f_{i, j}} x_i + \sum_{i, j = y_i} \frac {\partial J} {\partial f_{i, y_i}} \frac {\partial f_{i, y_i}} {\partial w_{y_i}} \end{eqnarray} $$

With the above formulas, we now know what the Forward and Backward of our Layer should do.

Implement the Layer

Let's implement Forward and Backward in Python. The choice of programming language is yours. I happen to use Python a lot, and most deep learning frameworks support Python. It's a good idea to choose a language that the framework you use supports, so you can easily wrap the implementation once it is finished. At this step we don't need to worry about performance as long as the implementation works.

def forward(self, is_train, in_data, out_data):
    X = in_data['X']
    W = in_data['W']
    label = in_data['label']
    # traditional fully connected layer
    out = X.dot(W.T)
    if is_train:
        # some modification
        for i in range(len(X)):
            yi = int(label[i])
            if out[i, yi] > 0:
                out[i, yi] *= self.k
    out_data['output'] = out

The Forward function is easy since it's normally a fully connected layer that may modify the output $f_{i,y_i}$. is_train is used to indicate whether current context is train or test. We only want to modify $f_{i, y_i}$ during training.

def backward(self, in_data, out_data, in_grad, out_grad):
    X = in_data['X']
    W = in_data['W']
    label = in_data['label']
    out = out_data['output']
    o_grad = out_grad['output']
    # traditional fully connected
    x_grad = o_grad.dot(W)
    w_grad = o_grad.T.dot(X)
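    # gradient w.r.t. X: correct the rows whose target output was scaled by k
    for i in range(X.shape[0]):
        yi = int(label[i])
        if out[i, yi] > 0:
            x_grad[i] += o_grad[i, yi] * (self.k * W[yi] - W[yi])
    # gradient w.r.t. W: only row y_i of W gets a correction from sample i
    for j in range(W.shape[0]):
        for i in range(X.shape[0]):
            yi = int(label[i])
            if yi == j and out[i, yi] > 0:
                w_grad[j] += o_grad[i, yi] * (self.k * X[i] - X[i])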
    in_grad['X'] = x_grad
    in_grad['W'] = w_grad

Backward is a little tricky. We can reuse the output out to tell whether we have modified $f_{i, y_i}$. We can also reuse the result of the fully connected layer's backward operation.

$$ \frac {\partial J} {\partial x_i} = \sum_j \frac {\partial J} {\partial f_{i, j}} w_j + \frac {\partial J} {\partial f_{i, y_i}} (\frac {\partial f_{i, y_i}} {\partial x_i} - w_{y_i}) $$

$$ \frac {\partial J} {\partial w_j} = \sum_i \frac {\partial J} {\partial f_{i, j}} x_i + \sum_{i, j = y_i} \frac {\partial J} {\partial f_{i, y_i}} (\frac {\partial f_{i, y_i}} {\partial w_{y_i}} - x_i) $$

The first part of each formula is exactly what a fully connected layer does.

Gradient Check

Once you have written the code, a gradient check is important for verifying the correctness of your implementation. The key idea is below.

$$ f'(x) = \frac {f(x + \Delta x) - f(x - \Delta x)} {2 \Delta x} $$

The formula estimates the derivative at $x$ numerically. In this way, we can estimate the gradient w.r.t. data and parameters using only the Layer's Forward. We can also calculate the same derivative with the Layer's Backward that we implemented. For example, choose one element x from X, call the Layer's Forward and Backward, and read the gradient for this element from in_grad. Next, change this element to x-eps and x+eps, call Forward twice to get two f values, and estimate the gradient from them. The calculated and estimated gradients can differ, but shouldn't differ too much.

The problem here is: what if my Layer doesn't output a single value but a multi-dimensional array, and where does the gradient w.r.t. my Layer's output come from? The key is to plug a loss function onto the output of the Layer. The simplest loss function we can choose is the L2 function.

$$ J = \frac {1} {2} \sum_i x_i^2 $$

$J$ is easy to calculate and the derivative too.

$$ \frac {\partial J} {\partial x_i} = x_i $$

$$ \frac {\partial J} {\partial X} = X $$

Thus, we can plug this loss function onto whatever your Layer outputs. The following Python code shows an easy way to do a gradient check on a Layer.

import numpy as np


def gradient_check(layer, in_data, out_data, in_grad, out_grad):
    '''do gradient check for parameter X
    '''
    # loss function J = 1/2 * sum(out^2), so dJ/d(out) = out
    loss_it = lambda x: np.square(x).sum() / 2

    # suppose X is a 2 dimension array
    eps = 1e-4
    threshold = 1e-2
    for i in range(in_data['X'].shape[0]):
        for j in range(in_data['X'].shape[1]):
            # calculate gradient with Backward
            layer.forward(True, in_data, out_data)
            out_grad['output'] = out_data['output']
            layer.backward(in_data, out_data, in_grad, out_grad)
            gradient = in_grad['X'][i, j]

            # evaluate gradient numerically with two Forward calls
            in_data['X'][i, j] -= eps
            layer.forward(True, in_data, out_data)
            J1 = loss_it(out_data['output'])
            in_data['X'][i, j] += 2 * eps
            layer.forward(True, in_data, out_data)
            J2 = loss_it(out_data['output'])
            in_data['X'][i, j] -= eps  # restore the original value
            gradient_expect = (J2 - J1) / (2 * eps)

            # calculate relative error (guard against division by zero)
            denom = max(abs(gradient) + abs(gradient_expect), 1e-12)
            error = abs(gradient_expect - gradient) / denom
            if error > threshold:
                print('gradient check failed on X[%d, %d]' % (i, j))
            else:
                print('gradient check pass')

You can refer to cs231n course note here for more information about gradient check.

Test within a toy model

After your implementation passes the gradient check, you should put your Layer into the DL framework you use. This raises another important question: how do you develop a new Layer for that framework? Most DL frameworks have documentation on how to write a custom Layer or Operator, and may also offer a demonstration of writing one in Python or C++. Besides reading the documentation and the code, you need to understand how the framework processes data and the basic idea of how it runs your Layer. If you want to write the code in C++/CUDA, the best approach is to read the source code of the Layer implementations in the framework. They are the best examples you can refer to.

It's also important to use a small network and data set to verify the effectiveness of your implementation. Passing the gradient check doesn't mean your Layer implementation is perfect, because the gradient check can't cover every situation. There may be bugs that only occur in rare cases. Or, for some inputs, your implementation may have numerical issues such as floating-point underflow that make the result wrong. Since the gradient check is not perfect, it's always a good idea to deploy your Layer in a toy model and see if it works the way you want (at least it shouldn't give you a wrong result).

Optimize your implementation

Once you have verified the correctness and effectiveness of your Layer implementation, you may want to optimize it for better performance. Since most frameworks support implementing a Layer in Python, you can stick to Python and vectorize the code. After that, you may want to implement it in CUDA so that your Layer can run on GPUs. Nowadays we depend heavily on GPUs to train neural networks, so you should learn some CUDA if you want better performance from your Layer.
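As an illustration of the vectorization step, the simplified layer from this post can be written without Python loops. This is a sketch under stated assumptions: the standalone functions `forward_vec` and `backward_vec`, the random test data and the finite-difference helper are written for this example and are not part of any framework.

```python
import numpy as np


def forward_vec(X, W, label, k):
    out = X.dot(W.T)
    rows = np.arange(X.shape[0])
    target = out[rows, label]
    # scale f_{i, y_i} by k only where it is positive
    out[rows, label] = np.where(target > 0, k * target, target)
    return out


def backward_vec(X, W, label, k, out, o_grad):
    rows = np.arange(X.shape[0])
    # scaling f_{i, y_i} by k is equivalent to scaling its incoming gradient by k
    g = o_grad.copy()
    g[rows, label] *= np.where(out[rows, label] > 0, k, 1.0)
    return g.dot(W), g.T.dot(X)      # gradients w.r.t. X and W


def numeric_grad(f, A, eps=1e-5):
    """Central differences of the scalar f() w.r.t. every element of A."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        j_plus = f()
        A[idx] = old - eps
        j_minus = f()
        A[idx] = old                 # restore the element
        grad[idx] = (j_plus - j_minus) / (2 * eps)
    return grad


rng = np.random.RandomState(0)
X, W = rng.randn(4, 3), rng.randn(5, 3)
label, k = np.array([0, 2, 4, 1]), 0.5

out = forward_vec(X, W, label, k)
x_grad, w_grad = backward_vec(X, W, label, k, out, out)   # L2 loss: dJ/d(out) = out

loss = lambda: 0.5 * np.square(forward_vec(X, W, label, k)).sum()
max_err_x = np.abs(x_grad - numeric_grad(loss, X)).max()
max_err_w = np.abs(w_grad - numeric_grad(loss, W)).max()
```

The vectorized Backward relies on the observation that the modified output is just the fully connected output times a per-sample factor (k or 1), so the fully connected backward pass can be reused with a rescaled output gradient.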

Summary

In summary, if you need to implement a custom Layer for a deep learning framework, first implement it in Python or another language that you are familiar with and that is easy to debug. Do a gradient check to verify the correctness of your implementation. Next, put the Layer into the DL framework you use; this requires you to know how the framework handles and represents Layers and data. Once your Layer works with the framework, train a toy network to verify its effectiveness. Finally, if the performance is poor, you may need to optimize the Layer using CUDA so that it runs on GPUs. You can take a look at luoyetx/mx-lsoftmax for reference. It follows the pipeline I described above.

References

Reading the Caffe Source: the Layer Loading Mechanism

2016-02-04T00:00:00+08:00

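Layer is the basic building block of a neural network Net in Caffe, and Caffe maintains an internal registry for looking up the factory function of a given Layer. A problem many people run into when using Caffe on Windows is that a Layer cannot be found when running Caffe-related code, yet this problem does not occur on Linux. The problem is related to the compiler, and also to the mechanism Caffe uses to register Layers.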

F0203 12:50:07.581297 11524 layer_factory.hpp:78] Check failed: registry.count(type) == 1 (0 vs. 1)
Unknown layer type: Convolution (known types: )

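The error above means that the factory function for the Convolution Layer cannot be found in the registry, and the program crashes immediately. Below we discuss Caffe's Layer loading mechanism and why this problem occurs under VC.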

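Caffe's Layer registry is simply a set of key-value pairs, where the key is the Layer type and the value is its factory function. The following two macros control the registration of a Layer.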

#define REGISTER_LAYER_CREATOR(type, creator)                                  \
  static LayerRegisterer<float> g_creator_f_##type(#type, creator<float>);     \
  static LayerRegisterer<double> g_creator_d_##type(#type, creator<double>)    \

#define REGISTER_LAYER_CLASS(type)                                             \
  template <typename Dtype>                                                    \
  shared_ptr<Layer<Dtype> > Creator_##type##Layer(const LayerParameter& param) \
  {                                                                            \
    return shared_ptr<Layer<Dtype> >(new type##Layer<Dtype>(param));           \
  }                                                                            \
  REGISTER_LAYER_CREATOR(type, Creator_##type##Layer)

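The REGISTER_LAYER_CLASS macro registers a specific Layer in the global registry. It first defines a factory function that creates the Layer object, and then invokes REGISTER_LAYER_CREATOR to register that factory function under the Layer's type name. Only the float and double instantiations of the Layer are registered, since these are the types actually used for network data. Of the two static variables, one corresponds to float and the other to double; their initialization, that is, their constructors, is what actually performs the registration.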

template <typename Dtype>
class LayerRegisterer {
 public:
  LayerRegisterer(const string& type,
                  shared_ptr<Layer<Dtype> > (*creator)(const LayerParameter&)) {
    LayerRegistry<Dtype>::AddCreator(type, creator);
  }
};

When a LayerRegisterer object is initialized, it in turn calls the static method AddCreator of the LayerRegistry class for the corresponding type.

typedef std::map<string, Creator> CreatorRegistry;

static CreatorRegistry& Registry() {
  static CreatorRegistry* g_registry_ = new CreatorRegistry();
  return *g_registry_;
}

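The registry has type CreatorRegistry, which is really std::map<string, Creator>. The global singleton registry is obtained through the Registry function, and registration itself is a simple map operation.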

// Adds a creator.
static void AddCreator(const string& type, Creator creator) {
  CreatorRegistry& registry = Registry();
  CHECK_EQ(registry.count(type), 0)
    << "Layer type " << type << " already registered.";
  registry[type] = creator;
}

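That is roughly the whole registration flow. Layers in Caffe register their factory functions in the global registry through the initialization of static variables, so the whole process depends on those static variables. So why can't code built with VC find a Layer's factory function in the registry? In fact, the global registry in a VC build of Caffe is empty, without a single entry. The problem is not the registry itself but the static variables that perform the registration. These variables exist only so that their constructors register the Layers, and no code ever references them. The pitfall is that VC by default optimizes these unreferenced static variables away, so none of their constructors run, no registration is ever triggered, the global registry stays empty, and the program crashes when the Net is constructed.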
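The registration pattern described above can be reproduced in a few lines. The sketch below is a simplified stand-in, not Caffe's actual code: `Layer`, `Registerer`, `ConvolutionLayer` and `CreateLayer` are hypothetical names, and the registry is a function-local static instead of a heap-allocated map.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for caffe::Layer; not Caffe's real class.
struct Layer {
  virtual ~Layer() = default;
  virtual std::string type() const = 0;
};

using Creator = std::function<std::unique_ptr<Layer>()>;

// Function-local static: the map exists before any registerer touches it.
std::map<std::string, Creator>& Registry() {
  static std::map<std::string, Creator> registry;
  return registry;
}

// Constructing a Registerer performs the registration as a side effect.
struct Registerer {
  Registerer(const std::string& type, Creator creator) {
    Registry()[type] = std::move(creator);
  }
};

struct ConvolutionLayer : Layer {
  std::string type() const override { return "Convolution"; }
};

// Nothing refers to this variable by name; only its constructor matters.
// If the toolchain drops it, the registry silently stays empty.
static Registerer g_conv_registerer("Convolution", [] {
  return std::unique_ptr<Layer>(new ConvolutionLayer());
});

// Looks up the factory function; returns nullptr for an unknown type.
std::unique_ptr<Layer> CreateLayer(const std::string& type) {
  auto it = Registry().find(type);
  if (it == Registry().end()) return nullptr;
  return it->second();
}
```

When everything is compiled into one translation unit, as here, the static variable is always initialized. The failure described in this post only shows up once such variables live in object files that nothing else references.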

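The key to fixing this under VC is to keep the toolchain from discarding these static variables. This can be done by configuring the dependency input in the Linker settings, as shown in the figure below.

With this approach, the static variables in Caffe are not optimized away when Caffe is linked as a static library. Another option is to add the Caffe sources directly to the existing project so that they are compiled as part of it (rather than built into a static library), which also keeps the static variables from being optimized away.

Caffe's Layer registration mechanism is a bit of a trap under VC, but it is not unsolvable. Once you understand Caffe's internal mechanism and some characteristics of VC, it is easy to see where the problem lies and then look for a suitable solution.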

Similarity Transform Between Face Shapes

2016-01-13T00:00:00+08:00

Many face alignment algorithms need to perform a similarity transform between training shapes and a reference shape, which is usually the mean shape over the ground-truth shapes. During training, the algorithm applies a similarity transform between the target shape residual and the mean shape, so we need to compute the transform parameters between two shapes.

When implementing the 3000fps algorithm, I didn't really understand the math formulas underneath. Today I spent some time studying the math and found that Procrustes analysis makes things work.

Let me describe the problem. We need to find a similarity transform such that $S_2 = cR(S_1)+t$. $S_2$ denotes Shape2 and $S_1$ denotes Shape1, $c$ is the scale ratio and $t$ is the translation. The most important term is $R$, which denotes the rotation.

$$ R = \begin{bmatrix} cos\theta & -sin\theta \\ sin\theta & cos\theta \end{bmatrix} $$

We first need to normalize each Shape by subtracting its mean point and dividing by its scale. The mean point is easy to calculate, but the scale is less intuitive. We can use $|S|$ or $|S|^2$ as the Shape scale; according to Procrustes analysis, the choice matters little. After this step we have the translation $t$ and the scale ratio $c$, and what is left is all about $R$.

$R$ is a rotation matrix and we have many points to rotate. The goal is to rotate the normalized Shape1 $S_1$ onto the normalized Shape2 $S_2$ so that the error between the rotated normalized $S_1$ and the normalized $S_2$ is minimal. We can write this as a formula, where $[x_1, y_1, ...]$ denotes the normalized $S_1$ and $[u_1, v_1, ...]$ denotes the normalized $S_2$.

$$ R= \begin{bmatrix} a & -b \\ b & a \end{bmatrix} $$

$$ E=\sum_{i}{\left\| \begin{bmatrix} a & -b \\ b & a \end{bmatrix} \cdot \begin{bmatrix} x_i \\ y_i \end{bmatrix} - \begin{bmatrix} u_i \\ v_i \end{bmatrix} \right\|^2} $$

To minimize $E$, we use the least squares method to compute the parameters $a$ and $b$. Taking the derivatives and setting them to zero gives the answer.

$$ \frac{\partial E}{\partial a}=\sum_{i}{2x_i(ax_i-by_i-u_i)+2y_i(bx_i+ay_i-v_i)}=0 $$

$$ \frac{\partial E}{\partial b}=\sum_{i}{-2y_i(ax_i-by_i-u_i)+2x_i(bx_i+ay_i-v_i)}=0 $$

Solving the equations above will give us $a$ and $b$.

$$ \begin{bmatrix} a \\ b \end{bmatrix}=\frac{1}{\sum{x_i^2+y_i^2}} \begin{bmatrix} \sum{x_iu_i+y_iv_i} \\ \sum{x_iv_i-y_iu_i} \end{bmatrix} $$

$$ tan\theta=\frac{b}{a}=\frac{\sum{x_iv_i-y_iu_i}}{\sum{x_iu_i+y_iv_i}} $$

With $tan\theta$ (taking the signs of $a$ and $b$ into account to pick the right quadrant), we can get $R$.
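The closed-form solution for $a$ and $b$ translates directly into code. The following is a minimal sketch, assuming both shapes are already centred and scale-normalised and have the same number of points; `Point`, `Rotation`, `FitRotation` and `Angle` are names made up for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// R = [[a, -b], [b, a]]
struct Rotation { double a, b; };

// Least-squares a and b that rotate `src` onto `dst`.
// Both shapes are assumed to be normalised and of equal length.
Rotation FitRotation(const std::vector<Point>& src,
                     const std::vector<Point>& dst) {
  double num_a = 0.0, num_b = 0.0, denom = 0.0;
  for (std::size_t i = 0; i < src.size(); ++i) {
    num_a += src[i].x * dst[i].x + src[i].y * dst[i].y;  // x_i u_i + y_i v_i
    num_b += src[i].x * dst[i].y - src[i].y * dst[i].x;  // x_i v_i - y_i u_i
    denom += src[i].x * src[i].x + src[i].y * src[i].y;  // x_i^2 + y_i^2
  }
  if (denom == 0.0) return {1.0, 0.0};  // degenerate shape: use the identity
  return {num_a / denom, num_b / denom};
}

// atan2 keeps the quadrant information that tan(theta) = b / a alone loses.
double Angle(const Rotation& r) { return std::atan2(r.b, r.a); }
```

Using `std::atan2(b, a)` instead of dividing $b$ by $a$ avoids both the division by zero at $a = 0$ and the ambiguity between $\theta$ and $\theta + \pi$.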

References

Something You Should Know About C/C++ Compilers

2015-12-07T00:00:00+08:00

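I used to ask classmates around me about how C/C++ code is compiled and linked, and whether they knew the details. Sadly, most of them could not even tell compiling from linking. The C courses taught at school are too shallow, and many students do not write much code. I am writing this post to record my own understanding of C/C++ compilers, and I hope it helps others too: when you run into a compile or link error, don't panic, analyze it calmly, and find the right direction for tracking down the problem.

We all know that building code takes four stages: preprocessing, compilation, assembly, and linking (which produces an executable or a library). First, a few points need to be made clear.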

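Header files take part only in the preprocessing stage and have nothing to do with compilation or the later stages.
Every source file is compiled independently, without interfering with the others.
The first three stages do not involve third-party library files, although the preprocessing stage does involve the libraries' header files.
Preprocessing rarely fails. Even if you use an undefined macro, nothing fails at this stage; the error is only reported during compilation, and no compilation error ever has anything to do with library files.
Assembly rarely fails either. It is just an implicit intermediate step of the compiler that translates assembly code into object code.
Linking is the real pit: the library files of the standard library and of third-party libraries all come in at this stage.
The problems we run into are basically either compilation errors or link errors. However, things are not quite that simple.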

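Preprocessing

The parts of preprocessing we meet most often are #include for including header files and #define for defining macros, plus the conventional include guards. The most common problem at this stage is a header file that cannot be found, at which point the compiler stops. This is the easiest problem to solve, but many people do not really understand how the compiler looks for headers. We first need to distinguish three kinds of header files: headers of the standard library, headers of third-party libraries referenced by the code, and headers written as part of our own code. The compiler works with a list of header directories and searches them in order. For #include "..." the first directory is the one containing the current source file. Then, with common compilers such as gcc and MSVC, come the directories we add at build time; there can be several of them, including the header directories of third-party libraries and of our own code (your own headers may not live in the same directory as the source files). The directories defined by the compiler itself (usually where the standard library headers live) are searched last.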

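Compilation and Assembly

When compilation fails, it is basically a syntax problem: something is wrong with the code and the compiler stops. The C++11 standard deserves a mention here. Not every compiler implements everything in the new standard, and the features that are implemented may differ between compilers. Pay special attention to this when writing cross-compiler or cross-platform code; there are even slight differences between versions of the same compiler.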

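Linking

Link-time problems are numerous and come in all shapes, and they can be very nasty. In the end, though, there are two main kinds: the linker cannot find a symbol, or the linker finds a symbol more than once. Many people are put off by the piles of messy function symbols printed when linking fails, but most of the time what we hit is undefined reference to xxx, i.e. a symbol that cannot be found.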

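Symbols

To understand how linking works, we first have to understand the role symbols play in the build system. A symbol can refer to a block of memory or a piece of code. The places in code that involve symbols are as follows.

A variable declaration tells the compiler that there is a variable referring to a block of memory.
A variable definition tells the compiler to allocate a block of memory for this variable.
A function declaration tells the compiler that a piece of code is available, what its inputs and outputs are, and how it should be called.
A function definition tells the compiler the logic that implements this code.
A reference to a variable or function is a place where the code uses a variable or calls a function.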

extern int a; // declare
int a; // define

void foo(); // declare
void foo() { // define
  a = 0;
}

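The compiler assigns a symbol to every variable and every function. This makes symbols (and functions) easy to reuse, and it also helps modularize a project and link multiple object files. Since each source file is compiled independently into an object file, the compiler records in the object file which symbols the file needs and which symbols it provides. When the linker then links a set of object files (those provided by libraries and those built from our own code), it can match every unresolved symbol with its definition and so produce the executable or library.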

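Symbol Not Found

Most people have probably run into this when building their own code. In the vast majority of cases the build is misconfigured: the linker was not told to link some file, so the symbol cannot be found. Sometimes, however, everything is configured correctly and the error still appears: I know that all the library's symbols are in this object file, static library, or dynamic library, yet the toolchain still reports that a symbol cannot be found. This situation exposes some deeper problems.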

C and C++ Do Not Get Along as Well as You Might Think

We all know that C++ can overload functions, as in the code below.

void foo(int x) {
  x = 0;
}

void foo(float x) {
  x = 0.f;
}

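C does not allow two functions with the same name, but C++ can overload a name through different parameter types and different orders of those types. Leaving aside the benefits and pitfalls of overloading, C++ can do this because the C++ compiler rewrites the symbol finally generated for each function. The two functions above produce different symbols after compilation, so as far as the linker is concerned, the fact that they share a name no longer matters.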

vagrant@trusty64:~/test$ cat a.cpp
void foo(int x) {
    x = 0;
}

void foo(float x) {
    x = 0.f;
}
vagrant@trusty64:~/test$ gcc -c a.cpp -o a.o
vagrant@trusty64:~/test$ nm a.o
0000000000000010 T _Z3foof
0000000000000000 T _Z3fooi
vagrant@trusty64:~/test$

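We can see that the symbols gcc produces no longer match the function name. They carry more information, such as the kind of symbol (this one is a function) and the types of the function's parameters, which makes them quite complex.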

vagrant@trusty64:~/test$ cat a.c
void foo(int x) {
    x = 0;
}
vagrant@trusty64:~/test$ gcc -c a.c -o a.o
vagrant@trusty64:~/test$ nm a.o
0000000000000000 T foo
vagrant@trusty64:~/test$

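By contrast, the symbol produced from C code is identical to the function name, and the symbol itself does not distinguish between variables and functions.

The code above contains only very simple functions. With namespaces, member functions and so on, the symbols generated by the compiler become even more complex and intimidating. This is why link errors contain such long strings of characters: C++ symbols are extremely complex and carry a lot of information.

The compiler handles C and C++ code differently, so the two use completely different symbol naming schemes. This causes many link-time problems, which is why the headers of many C libraries contain code like the following.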

#ifdef __cplusplus
extern "C" {
#endif

......
......

#ifdef __cplusplus
}
#endif

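This tells the compiler to generate all symbols declared in this header according to C rules, instead of applying C++ symbol rewriting. Without it, a symbol that is foo in the library would be rewritten into something like _Z3fooi, and linking would fail.

In essence, the problem above is that the symbol the compiler generates does not match the symbol actually in the library. The different symbol schemes of C and C++ cause this, but other things can cause it too. In practice, authors of C libraries have usually taken care of this: the macro guard in the header solves the problem, and users of the library do not need to set anything themselves.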

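Symbol Compatibility Across C++ Compilers

In most cases the third-party libraries we use were compiled in advance by their providers, and this carries a big hidden risk: the symbol for a function in the library may not match the symbol our compiler looks for. The problem is especially pronounced with MSVC. Setting aside the differences between dynamic and static libraries, different versions of the same compiler may generate different symbols for the same function, which is the most painful part. A look at the separate VC10, VC11 and VC12 directories in OpenCV shows how large the differences are. By comparison, compatibility between gcc versions seems to be much better. C code is of course much better off than C++ here and does not run into this kind of trap.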

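Summary

That was a lot of rambling. I hope it helps you understand the problems that come up when building C/C++ code, so that you can look for the right solution.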

Basic Mathematics in Neural Networks

2015-11-16T00:00:00+08:00

Recently, I was reading the paper Notes on Convolutional Neural Networks. The first part of the paper covers the traditional neural network, i.e. the multi-layer fully connected network. It discusses the basic features of multi-layer networks and gives formulas for the feedforward pass and the backpropagation pass. The formulas are all very simple, but the paper lacks details about how they are obtained. I am writing this post to record my derivation of these formulas.

A single layer in a traditional neural network is defined by the input $x$, the output $u$, the weight $W$ and the bias $b$, where $x \in R^n$, $W \in R^{m \times n}$, $b \in R^m$ and $u \in R^m$. The layer can be written as the function below.

$$ u = W \cdot x + b $$

We first need some derivatives of this function, which will be very helpful later. We start by rewriting the function for each element of $u$.

$$ u_k = W_k \cdot x + b_k $$

$W_k$ is the $k$-th row of the weight matrix $W$. We need three partial derivatives: $\frac{\partial u_k}{\partial b}$, $\frac{\partial u_k}{\partial W}$ and $\frac{\partial u_k}{\partial x}$. We also need the partial derivative $\frac{\partial u_k}{\partial x}$ of the function $u_k = W_k \cdot f(x) + b_k$.

The first term is $\frac{\partial u_k}{\partial b}$.

$$ \frac{\partial u_k}{\partial b_i} = \begin{cases} 0 & \text{if } i \neq k \\ 1 & \text{if } i = k \end{cases} $$

$$ \frac{\partial u_k}{\partial b} = \begin{bmatrix} \frac{\partial u_k}{\partial b_1} \\ \vdots \\ \frac{\partial u_k}{\partial b_k} \\ \vdots \\ \frac{\partial u_k}{\partial b_m} \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} \in R^m $$

where the $1$ is in the $k$-th row.

The second term is $\frac{\partial u_k}{\partial W}$.

$$ \frac{\partial u_k}{\partial W_i} = \begin{cases} 0 & \text{if } i \neq k \\ x^T & \text{if } i = k \end{cases} $$

$$ \frac{\partial u_k}{\partial W} = \begin{bmatrix} \frac{\partial u_k}{\partial W_1} \\ \vdots \\ \frac{\partial u_k}{\partial W_k} \\ \vdots \\ \frac{\partial u_k}{\partial W_m} \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ x^T \\ \vdots \\ 0 \end{bmatrix} \in R^{m \times n} $$

where $x^T$ is in the $k$-th row.

The third term is $\frac{\partial u_k}{\partial x}$.

$$ \frac{\partial u_k}{\partial x} = \begin{bmatrix} \frac{\partial u_k}{\partial x_1} \\ \vdots \\ \frac{\partial u_k}{\partial x_n} \end{bmatrix} = \begin{bmatrix} W_{k1} \\ \vdots \\ W_{kn} \end{bmatrix} = W_k^T \in R^{n} $$

The fourth term is the partial derivative $\frac{\partial u_k}{\partial x}$ of the function $u_k = W_k \cdot f(x) + b_k$, where $f$ is applied element-wise.

$$ \frac{\partial u_k}{\partial x_i} = W_{ki} \cdot f^\prime(x_i) $$

$$ \frac{\partial u_k}{\partial x} = W_k^T \circ f^\prime(x) $$

Here the notation $\circ$ denotes element-wise multiplication.

With the four derivatives above, we can now derive the backpropagation algorithm for a traditional neural network. For the $l$-th layer we define the input $x^{l-1}$, the output $u^l$, the weights $W^l$ and the bias $b^l$; we also define the activation function $f$. The network consists of $L$ layers, its input is $x^0$ and its output is $t = f(u^L)$. We define the loss as $E = \frac{1}{2} \cdot \| t - y\|_2^2$. The relationships between these quantities are listed below.

$$ x^{l} = f(u^l),\quad u^l = W^l \cdot x^{l-1} + b^l $$

For gradient descent, we need to calculate $\frac{\partial E}{\partial W^l}$ and $\frac{\partial E}{\partial b^l}$, which is what the backpropagation algorithm is all about. We first define a notation $\delta$ that helps calculate these two partial derivatives, and start with $\frac{\partial E}{\partial b^l}$.

$$ \delta^l = \frac{\partial E}{\partial u^l} $$

For the $l$-th layer, we do not need to care about the dimensions of the input and output; that information is carried by $x^l$ and $u^l$.

To calculate $\frac{\partial E}{\partial b^l}$, we use $\frac{\partial u_k}{\partial b}$.

$$ \frac{\partial E}{\partial b^l} = \sum_{k}\frac{\partial E}{\partial u_k^l} \cdot \frac{\partial u_k^l}{\partial b^l} $$

$$ = \sum_{k}\delta_k^l \cdot \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} $$

$$ = \sum_{k} \begin{bmatrix} 0 \\ \vdots \\ \delta_k^l \\ \vdots \\ 0 \end{bmatrix} = \delta^l $$

To calculate $\frac{\partial E}{\partial W^l}$, we use $\frac{\partial u_k}{\partial W}$.

$$ \frac{\partial E}{\partial W^l} = \sum_{k}\frac{\partial E}{\partial u^l_k} \cdot \frac{\partial u^l_k}{\partial W^l} $$

$$ = \sum_{k}\delta^l_k \cdot \begin{bmatrix} 0 \\ \vdots \\ {(x^{l-1})}^T \\ \vdots \\ 0 \end{bmatrix} $$

$$ = \sum_{k} \begin{bmatrix} 0 \\ \vdots \\ \delta^l_k \cdot {(x^{l-1})}^T \\ \vdots \\ 0 \end{bmatrix} = \delta^l \cdot {(x^{l-1})}^T $$

Now we can calculate the gradients for the parameters $W^l$ and $b^l$, but they all depend on $\delta^l$, which is the core of the backpropagation algorithm. The algorithm is all about calculating $\delta^l$ from the higher layers down to the lower layers, starting from the network's output layer, where the loss $E$ enters the algorithm.

We now calculate $\delta^l$, using $\frac{\partial u_k}{\partial x}$ for the function $u = W \cdot x + b$ and $\frac{\partial u_k}{\partial x}$ for the function $u = W \cdot f(x) + b$, together with $u^{l+1} = W^{l+1} \cdot f(u^l) + b^{l+1}$.

$$ \delta^l = \frac{\partial E}{\partial u^l} = \sum_{k}\frac{\partial E}{\partial u^{l+1}_k} \cdot \frac{\partial u^{l+1}_k}{\partial u^l} $$

$$ = \sum_{k}\delta^{l+1}_k \cdot {(W^{l+1}_k)}^T \circ f^\prime(u^l) $$

$$ = {(W^{l+1})}^T \cdot \delta^{l+1} \circ f^\prime(u^l) $$

For the output layer, we have $t = f(u^L)$ and $E = \frac{1}{2} \cdot \| t - y\|_2^2$.

$$ \delta^L = \frac{\partial E}{\partial u^L} = f^\prime(u^L) \circ (t - y) $$

Finally, all the notations are resolved. Given an input $x^0$ and its target $y$, where $x^0 \in R^n$ and $y \in R^m$, we first run the network forward to get all $u^l$, then calculate $\delta^L$ for the top layer, and then propagate the error backward from top to bottom to calculate each $\delta^l$, updating $W^l$ and $b^l$ along the way. This is how the backpropagation algorithm works.

Let me put all notations together.

feedforward

$$ x^{l} = f(u^l),\quad u^l = W^l \cdot x^{l-1} + b^l $$

backpropagation

$$ \delta^L = f^\prime(u^L) \circ (t - y) $$

$$ \delta^l = {(W^{l+1})}^T \cdot \delta^{l+1} \circ f^\prime(u^l) $$

gradient of parameters

$$ \frac{\partial E}{\partial W^l} = \delta^l \cdot {(x^{l-1})}^T $$

$$ \frac{\partial E}{\partial b^l} = \delta^l $$

That's all.
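To make the formulas concrete, here is a small sketch of a two-layer network with a sigmoid activation, written directly from the feedforward, backpropagation and gradient formulas above. It is an illustration under assumptions, not code from the paper: the sigmoid is my choice of $f$, and `Params`, `Grads`, `Forward`, `Loss`, `LayerGrads` and `Backward` are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // Mat[k] is the row W_k

struct Params { Mat W; Vec b; };   // W^l and b^l of one layer
struct Grads { Mat dW; Vec db; };  // dE/dW^l and dE/db^l of one layer

// Feedforward for one layer: x^l = f(W^l x^{l-1} + b^l), with f = sigmoid.
Vec Forward(const Params& p, const Vec& x) {
  Vec out(p.b.size());
  for (std::size_t k = 0; k < out.size(); ++k) {
    double u = p.b[k];
    for (std::size_t i = 0; i < x.size(); ++i) u += p.W[k][i] * x[i];
    out[k] = 1.0 / (1.0 + std::exp(-u));
  }
  return out;
}

// E = 1/2 ||t - y||^2 for a two-layer network.
double Loss(const Params& l1, const Params& l2, const Vec& x0, const Vec& y) {
  const Vec t = Forward(l2, Forward(l1, x0));
  double e = 0.0;
  for (std::size_t k = 0; k < t.size(); ++k) {
    e += 0.5 * (t[k] - y[k]) * (t[k] - y[k]);
  }
  return e;
}

// dE/dW^l = delta^l (x^{l-1})^T and dE/db^l = delta^l.
Grads LayerGrads(const Vec& delta, const Vec& x) {
  Grads g{Mat(delta.size(), Vec(x.size())), delta};
  for (std::size_t k = 0; k < delta.size(); ++k) {
    for (std::size_t i = 0; i < x.size(); ++i) g.dW[k][i] = delta[k] * x[i];
  }
  return g;
}

void Backward(const Params& l1, const Params& l2, const Vec& x0, const Vec& y,
              Grads* g1, Grads* g2) {
  const Vec x1 = Forward(l1, x0);
  const Vec t = Forward(l2, x1);
  // Output layer: delta^L = f'(u^L) o (t - y); for a sigmoid f' = f (1 - f).
  Vec d2(t.size());
  for (std::size_t k = 0; k < t.size(); ++k) {
    d2[k] = t[k] * (1.0 - t[k]) * (t[k] - y[k]);
  }
  // Hidden layer: delta^l = (W^{l+1})^T delta^{l+1} o f'(u^l).
  Vec d1(x1.size(), 0.0);
  for (std::size_t i = 0; i < x1.size(); ++i) {
    for (std::size_t k = 0; k < d2.size(); ++k) d1[i] += l2.W[k][i] * d2[k];
    d1[i] *= x1[i] * (1.0 - x1[i]);
  }
  *g2 = LayerGrads(d2, x1);
  *g1 = LayerGrads(d1, x0);
}
```

A quick way to gain confidence in such an implementation is to perturb one weight, recompute `Loss`, and compare the finite difference with the matching entry of `dW`.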

References

Notes on Convolutional Neural Networks

Reading the Caffe Source: Blob

2015-10-31T00:00:00+08:00

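Blob plays an important role in Caffe: it stores data and network parameters, and it also synchronizes data between the CPU and the GPU. A Blob was originally a 4-dimensional array (num x channel x height x width) in Caffe, but it can now represent arrays with more dimensions. The maximum number of dimensions is set by the constant kMaxBlobAxes, currently defined in blob.hpp as const int kMaxBlobAxes = 32;. The code of the Blob class lives mainly in blob.hpp and blob.cpp.

Data and Related Functions

The Blob class mainly contains the following members.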

shared_ptr<SyncedMemory> data_; // the data
shared_ptr<SyncedMemory> diff_; // the diff data
shared_ptr<SyncedMemory> shape_data_; // size of each dimension
vector<int> shape_; // same content as shape_data_
int count_; // number of elements currently held
int capacity_; // maximum number of elements that can be held

SyncedMemory manages the data on the CPU and the GPU. The Blob class also provides a set of functions for accessing this data.

const Dtype* cpu_data() const;
void set_cpu_data(Dtype* data);
const int* gpu_shape() const;
const Dtype* gpu_data() const;
const Dtype* cpu_diff() const;
const Dtype* gpu_diff() const;
Dtype* mutable_cpu_data();
Dtype* mutable_gpu_data();
Dtype* mutable_cpu_diff();
Dtype* mutable_gpu_diff();

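Through these functions we can read a Blob's internal data and also modify it. Dtype is a template type parameter that is fixed when a Blob variable is defined, usually float or double.

The data stored inside a Blob is one contiguous block of memory. To represent a multi-dimensional array, shape_ and shape_data_ record the size of each dimension, which makes it easy to compute the offset of a given coordinate and fetch the value at that point. Since Blob is still mostly used for 4-dimensional arrays (as it originally was), the class keeps the functions int num(); int channels(); int height(); int width();. Here num is equivalent to shape()[0], channels to shape()[1], height to shape()[2], and width to shape()[3]. The offset can be computed either from these four indices or directly from a coordinate vector.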

int offset(const int n, const int c = 0, const int h = 0, const int w = 0);
int offset(const vector<int>& indices);

With this set of functions and the previous one, we can easily work with a Blob's internal data.
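The offset computation for a contiguous, row-major multi-dimensional array can be sketched as follows. This is a standalone illustration, not Caffe's code: `Offset` is a hypothetical free function that takes the shape explicitly, and it treats missing trailing indices as 0, which is an assumption of this sketch.

```cpp
#include <cstddef>
#include <vector>

// Row-major offset of `indices` inside an array with the given `shape`.
// Trailing indices that are not supplied are treated as 0.
int Offset(const std::vector<int>& shape, const std::vector<int>& indices) {
  int offset = 0;
  for (std::size_t i = 0; i < shape.size(); ++i) {
    offset *= shape[i];  // step over everything covered by earlier axes
    if (i < indices.size()) offset += indices[i];
  }
  return offset;
}
```

For a 4-dimensional shape this reduces to the familiar ((n * channels + c) * height + h) * width + w.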

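Dynamic Multi-dimensional Arrays

A Blob can change the size of its array dynamically. When growing the array means the existing memory can no longer hold the data (count_ > capacity_), memory is reallocated. Blob provides a set of Reshape functions for this.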

void Reshape(const int num, const int channels, const int height, const int width); // Deprecated
void Reshape(const vector<int>& shape);
void Reshape(const BlobShape& shape);
void ReshapeLike(const Blob& other);

A Blob does not allocate memory when it is constructed; memory is also allocated through a call to Reshape.

template <typename Dtype>
void Blob<Dtype>::Reshape(const vector<int>& shape) {
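  CHECK_LE(shape.size(), kMaxBlobAxes); // check the number of dimensions
  count_ = 1; // accumulates the size of the new multi-dimensional array
  shape_.resize(shape.size()); // update the number of dimensions
  if (!shape_data_ || shape_data_->size() < shape.size() * sizeof(int)) {
    // shape_data_ is uninitialized or its memory is too small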
    shape_data_.reset(new SyncedMemory(shape.size() * sizeof(int)));
  }
  int* shape_data = static_cast<int*>(shape_data_->mutable_cpu_data());
  for (int i = 0; i < shape.size(); ++i) {
    CHECK_GE(shape[i], 0);
    CHECK_LE(shape[i], INT_MAX / count_) << "blob size exceeds INT_MAX";
    count_ *= shape[i];
    shape_[i] = shape[i];
    shape_data[i] = shape[i];
  }
  if (count_ > capacity_) {
    // not enough memory
    capacity_ = count_;
    data_.reset(new SyncedMemory(capacity_ * sizeof(Dtype)));
    diff_.reset(new SyncedMemory(capacity_ * sizeof(Dtype)));
  }
}

SyncedMemory

Blob is in fact a wrapper around SyncedMemory. SyncedMemory performs the actual memory operations, including synchronizing data between the CPU and the GPU.

enum SyncedHead { UNINITIALIZED, HEAD_AT_CPU, HEAD_AT_GPU, SYNCED };

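void* cpu_ptr_; // CPU data
void* gpu_ptr_; // GPU data
size_t size_; // size of the data
SyncedHead head_; // synchronization state of the data
bool own_cpu_data_; // whether this object owns the current CPU data
bool cpu_malloc_use_cuda_; // whether CUDA is used to allocate the CPU data (off by default)
bool own_gpu_data_; // whether this object owns the current GPU data
int gpu_device_; // index of the GPU that holds the GPU data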

SyncedMemory keeps two copies of the data, one on the CPU and one on the GPU, referred to by cpu_ptr_ and gpu_ptr_. It also provides a set of functions for getting and setting the actual data.

const void* cpu_data();
void set_cpu_data(void* data);
const void* gpu_data();
void set_gpu_data(void* data);
void* mutable_cpu_data();
void* mutable_gpu_data();

head_ records the synchronization state of the data, and synchronization is done by calling to_cpu() and to_gpu(). If head_ == UNINITIALIZED, the corresponding memory is allocated.

inline void SyncedMemory::to_cpu() {
  switch (head_) {
  case UNINITIALIZED:
    CaffeMallocHost(&cpu_ptr_, size_, &cpu_malloc_use_cuda_); // allocate memory
    caffe_memset(size_, 0, cpu_ptr_); // initialize to 0
    head_ = HEAD_AT_CPU;
    own_cpu_data_ = true;
    break;
  case HEAD_AT_GPU:
#ifndef CPU_ONLY
    if (cpu_ptr_ == NULL) {
      // allocate memory if not yet initialized
      CaffeMallocHost(&cpu_ptr_, size_, &cpu_malloc_use_cuda_);
      own_cpu_data_ = true;
    }
    // copy the GPU data to the CPU
    caffe_gpu_memcpy(size_, gpu_ptr_, cpu_ptr_);
    head_ = SYNCED;
#else
    NO_GPU;
#endif
    break;
  case HEAD_AT_CPU:
  case SYNCED:
    break;
  }
}

inline void SyncedMemory::to_gpu() {
#ifndef CPU_ONLY
  switch (head_) {
  case UNINITIALIZED:
    CUDA_CHECK(cudaGetDevice(&gpu_device_)); // get the current GPU index
    CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_)); // allocate memory on that GPU
    caffe_gpu_memset(size_, 0, gpu_ptr_); // initialize to 0
    head_ = HEAD_AT_GPU;
    own_gpu_data_ = true;
    break;
  case HEAD_AT_CPU:
    if (gpu_ptr_ == NULL) {
      // if not yet initialized, allocate memory on the current GPU
      CUDA_CHECK(cudaGetDevice(&gpu_device_));
      CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_));
      own_gpu_data_ = true;
    }
    caffe_gpu_memcpy(size_, cpu_ptr_, gpu_ptr_); // copy the data
    head_ = SYNCED;
    break;
  case HEAD_AT_GPU:
  case SYNCED:
    break;
  }
#else
  NO_GPU;
#endif
}
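The state machine behind head_ is easier to follow without CUDA in the way. The sketch below is a host-only mock, not Caffe's class: `MockSyncedMemory` and `Head` are hypothetical names, the "GPU" copy is an ordinary buffer, and the rule that the mutable accessors move the head to the side being written is an assumption of this sketch.

```cpp
#include <cstddef>
#include <vector>

enum class Head { kUninitialized, kAtCpu, kAtGpu, kSynced };

// Host-only mock: gpu_ is an ordinary buffer standing in for device memory.
class MockSyncedMemory {
 public:
  explicit MockSyncedMemory(std::size_t n) : size_(n) {}

  // Read-only access only synchronizes; the head may become kSynced.
  const float* cpu_data() { ToCpu(); return cpu_.data(); }
  const float* gpu_data() { ToGpu(); return gpu_.data(); }

  // Mutable access marks the other copy as stale.
  float* mutable_cpu_data() { ToCpu(); head_ = Head::kAtCpu; return cpu_.data(); }
  float* mutable_gpu_data() { ToGpu(); head_ = Head::kAtGpu; return gpu_.data(); }

  Head head() const { return head_; }

 private:
  void ToCpu() {
    if (head_ == Head::kUninitialized) {
      cpu_.assign(size_, 0.f);  // allocate lazily and zero-fill
      head_ = Head::kAtCpu;
    } else if (head_ == Head::kAtGpu) {
      cpu_ = gpu_;  // stands in for the device-to-host copy
      head_ = Head::kSynced;
    }
  }

  void ToGpu() {
    if (head_ == Head::kUninitialized) {
      gpu_.assign(size_, 0.f);
      head_ = Head::kAtGpu;
    } else if (head_ == Head::kAtCpu) {
      gpu_ = cpu_;  // stands in for the host-to-device copy
      head_ = Head::kSynced;
    }
  }

  std::size_t size_;
  Head head_ = Head::kUninitialized;
  std::vector<float> cpu_, gpu_;
};
```

The point of the design is that a copy happens only when the side being read is stale, so repeated reads on one device cost nothing extra.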

Data Serialization

Blob data can be serialized through Protobuf; ToProto and FromProto perform the serialization and deserialization.

message BlobProto {
  optional BlobShape shape = 7;
  repeated float data = 5 [packed = true];
  repeated float diff = 6 [packed = true];
  repeated double double_data = 8 [packed = true];
  repeated double double_diff = 9 [packed = true];

  // 4D dimensions -- deprecated.  Use "shape" instead.
  optional int32 num = 1 [default = 0];
  optional int32 channels = 2 [default = 0];
  optional int32 height = 3 [default = 0];
  optional int32 width = 4 [default = 0];
}

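Summary

Through SyncedMemory and Blob, Caffe wraps the underlying data and provides the most basic data abstraction for the other components of the framework. Layer parameters, Net parameters, Solver parameters and so on are all Blob data, so understanding how Blob abstracts and manages data helps with reading the rest of the Caffe source, and is the first step in doing so.

References

Caffe source code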

Getting Started with Reading the Caffe Source

2015-10-21T00:00:00+08:00

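Caffe is a deep learning framework written in C++ with excellent performance, and it now supports multi-GPU computation on a single machine. This post and the ones that follow record my process of reading and studying the Caffe source. I also hope writing them pushes me to keep going and to learn from Caffe's authors.

Deep learning has been extremely hot in recent years, and CNNs in particular have become a powerful tool for hard vision problems. Among open source CNN frameworks, Caffe has attracted a great deal of attention from developers and researchers thanks to its ease of use, excellent performance, and seamless switching between CPU and GPU. The Caffe source is hosted on Github, and anyone can get and use it for free.

Caffe Overview

In Caffe, both the description of a network model and the way it is solved are defined through protobuf rather than written as code. Model parameters are also loaded and stored through protobuf, and even the seamless switch between CPU and GPU is done through configuration instead of hard-coding. In principle, if we only use Caffe to train convolutional neural networks, we do not need to touch or understand its source at all; we only need to care about how to define the network model and set the solver parameters, and to prepare training data in the right format. Caffe itself is written in C++ and is very fast; together with GPU support, it is among the fastest of the major CNN implementations. Caffe also supports pure CPU computation: after training a network on the GPU, we can easily switch to running it on the CPU with a small change to the Caffe configuration. Caffe also has a large community that supports and maintains the code, adds new features, and fixes bugs. The community is very active, and there is a lot of discussion on Google Groups.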

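Overall Structure of Caffe

The Caffe code is very modular and consists mainly of four parts: Blob, Layer, Net and Solver.

Blob represents the data in a network. Training data, the parameters of each layer, and the data passed between layers are all held in Blobs. Blob data can be stored on both the CPU and the GPU and synchronized between the two.
Layer is an abstraction over the various kinds of layers in a neural network, including the familiar convolution and subsampling layers, fully connected layers, and the various activation layers. Each kind of Layer implements forward and backward propagation and passes data through Blobs.
Net represents the whole network. It is composed of Layers connected one after another and is the network model we build.
Solver defines how a Net model is solved. It records the training process, saves the model parameters, and can interrupt and resume training. Custom Solvers make it possible to solve the network in different ways.

The Caffe code can be read from small to large, bottom-up: first study the implementation of Blob, then look at the definition of Layer and read how the various kinds of Layer are implemented, and finally read the Net code to learn the structure of the whole network. The Solver code can be studied separately to understand how the network is solved and optimized. I will read the Caffe source in this order, record my notes, and write up some summaries. I hope I can keep it up. Fighting!!!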

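Building Up the Theory

Without a foundation in the theory, I do not think studying the Caffe source is of much use. I strongly recommend learning the basics of neural networks first and using Caffe a little before reading its source; you will get more out of it that way. If you want to practice with Caffe, see my other post, A First Taste of Caffe (Caffe 小试牛刀), which uses Caffe for handwritten digit recognition on Kaggle.

References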

Caffe

Face Alignment at 3000 FPS via Regressing Local Binary Features

2015-08-19T00:00:00+08:00

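The paper Face Alignment at 3000 FPS via Regressing Local Binary Features (3000fps for short below) achieves high-speed detection of facial landmarks with quite high accuracy. This post first explains the ideas and methods of the paper, and then discusses how to implement the method in C++.

Understanding the Paper

Overall, 3000fps combines random forests with global linear regression. Compared with deep learning methods based on convolutional neural networks, it counts as a traditional machine learning method. CUHK's Deep Convolutional Network Cascade for Facial Point Detection predicts facial landmarks with a cascade of convolutional neural networks. I have implemented that paper with the Caffe framework and trained it on the dataset released by the authors, and the predictions are quite good. The code is hosted on github here; corrections are welcome.

Back to the 3000fps paper. Its approach has much in common with two earlier papers, Face Alignment by Explicit Shape Regression (ESR for short below) and Robust face landmark estimation under occlusion (RCPR for short below). The overall idea of all three papers can be expressed by the following formula.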

$$S^{t} = S^{t-1} + R^{t}(I, S^{t-1})$$

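This formula carries a lot of information, so let us first introduce some terms. We call the set of landmarks a shape. A shape contains the position of every landmark, and this position can be expressed in two ways: relative to the whole image, or relative to the face bounding box (which marks where the face is in the image). We call the first kind an absolute shape, whose values usually lie between 0 and w or h (the image width or height), and the second kind a relative shape, whose values usually lie between 0 and 1. The two can be converted into each other through the face bounding box. In the formula, $S^t$ is the absolute shape, $R^t$ is a regressor, and $I$ is the image. Based on the image and the position information of the shape, $R^t$ predicts a shape increment and adds it to the current shape to form a new shape. $t$ is the cascade stage; the shape is usually predicted through several cascaded stages.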

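The Regressor $R^t$

ESR and RCPR use random ferns as regressors. A random fern stores a shape increment in each leaf; during prediction, when a sample falls into a leaf, the increment stored there is the output. We will not go into the details of random ferns here. 3000fps uses a more elaborate design. First, its prediction unit is not a random fern but a random tree, and prediction is done with random forests. Second, 3000fps does not directly use the increments stored in the leaves of the trees as the output; instead it assembles the outputs of the random forests into a feature (called LBF) and predicts from that. Beyond using random forests, 3000fps trains one random forest per landmark and concatenates the local features output by the forests of all landmarks into what it calls the local binary features (LBF), which are then used in a global regression to predict the shape increment.

The figure above shows the training and prediction process of the regressor $R^t$. $\Phi^{t}_{l}$ denotes the random forest of the $l$-th landmark at stage $t$. The random forests of all landmarks together form $\Phi^t$, whose output is the LBF feature. The LBF feature is then used to train the global linear regression or to predict the shape increment.

The figure above shows how $R^t$ generates the LBF feature. The lower half of the figure shows the random forest of a single landmark outputting a local binary feature. The outputs of all random forests are then concatenated into one very large but very sparse LBF feature, which consists only of 0s and 1s, most of them 0.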

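Shape-indexed Features

Every landmark has its own random forest, and every random forest consists of several mutually independent random trees. The feature used by the random trees in the paper is called the shape-indexed feature; ESR and RCPR use the same kind of feature. It is essentially the difference between the pixel values at two points in the face region. The three methods use different strategies for choosing the two pixels.

ESR randomly picks two points, each near a landmark, and uses the difference between the pixel values at these two points as the shape-indexed feature.

RCPR generates a feature point by taking the midpoint of two landmarks plus a random offset, and uses the difference between two such feature points as the shape-indexed feature.

In 3000fps, each random forest belongs to a single landmark, so the feature points used in its trees are never tied to other landmarks. Two feature points are generated randomly in the neighborhood of the current landmark only, and their pixel difference is used as the shape-indexed feature.

In 3000fps, as the cascade goes deeper (i.e. as $t$ grows), the range of the random points gradually shrinks in order to obtain more accurate local features.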

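Training the Random Trees

The previous section fixed the shape-indexed feature used when training the random trees. When training a random tree, the input is $X=\{I, S\}$ and the prediction target is $Y=\Delta S$. Every node in the tree is trained in the same way. To train a node, we pick one feature from a set $F$ of shape-indexed features that was randomly generated in advance. (You could also generate a feature set on the fly, or share one feature set across a whole tree or a whole forest; here we assume one feature set per tree.) The chosen feature maps every sample $x$ to a real number, and we then draw a random threshold to assign the samples to the left and right subtrees. The goal is for the $y$ of the samples within each subtree to share the same pattern. Feature selection can be described by the following formulas.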

$$f = \underset{f \in F}{\operatorname{argmax}}\Delta$$

$$\Delta = S(y | y \in Root) - [S(y | y \in Left) + S(y | y \in Right)]$$

$$ y \in \begin{cases} Left, & f(x) < \delta \\ Right, & f(x) \geq \delta \end{cases}$$

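In the formulas above, $F$ is the set of feature functions, $f$ is the selected feature function (which computes the shape-indexed feature from the randomly chosen feature points), $\delta$ is the randomly generated threshold, and $S$ measures the similarity between samples or the entropy of a sample set (the paper uses variance). At each node, the training data $(X, Y)$ is split into two parts $(X_1, Y_1)$ and $(X_2, Y_2)$. We want the samples in each subtree to share the same pattern (the distribution of $Y$ should be as concentrated as possible, i.e. lower entropy). The paper measures this with variance, so when choosing the feature function $f$ we want the largest reduction in variance.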

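Every node of a random tree is trained this way, and every random tree is trained independently with the same procedure. This completes the training of the random forest for a single landmark.

Training the Global Linear Regression

The usual approach would be to store a predicted shape increment in each leaf of the random trees and, at test time, to average (or take a weighted average of) the outputs of all trees in the forest. 3000fps does not do this. Instead it represents the output of each random tree as a binary feature (see the figure above for details) and concatenates the binary features of all trees into a single binary feature, the LBF feature. The paper then runs a linear regression on this feature, with the shape increment as the target, and trains a linear regressor for prediction.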

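$$W^t = \underset{W^t}{\operatorname{argmin}} \|{\Delta S - W^t \cdot lbf}\|_2^2 + \lambda \|W^t\|_2^2$$

The linear regression can be written as above, where $\Delta S$ is the target shape increment, $lbf$ is the feature, $W^t$ is the parameter of the linear regression, and $\lambda$ is the regularization weight that restrains the model and prevents overfitting. Prediction uses the following formula.

$$\Delta S = W^t \cdot lbf$$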

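In the 3000fps paper, every stage of the cascaded regression can be split into the two parts described above: random forests extract the local binary features, and a global linear regression on those features predicts the shape increment $\Delta S$.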
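Because the LBF feature contains a single 1 per tree, it can be stored as a list of active indices, and $W^t \cdot lbf$ becomes a sum of columns of $W^t$. The sketch below illustrates this under assumptions: every tree has the same number of leaves, $W^t$ is stored column by column, and `ActiveLbfIndices` and `PredictDelta` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// leaf_index[t] is the leaf reached in tree t; every tree is assumed to have
// leaves_per_tree leaves. Returns the positions of the 1s in the LBF feature.
std::vector<int> ActiveLbfIndices(const std::vector<int>& leaf_index,
                                  int leaves_per_tree) {
  std::vector<int> active;
  active.reserve(leaf_index.size());
  for (std::size_t t = 0; t < leaf_index.size(); ++t) {
    active.push_back(static_cast<int>(t) * leaves_per_tree + leaf_index[t]);
  }
  return active;
}

// Delta S = W * lbf, where lbf is 1 at the active indices and 0 elsewhere.
// w_columns[j] is column j of W, so the product is a sum of selected columns.
std::vector<double> PredictDelta(
    const std::vector<std::vector<double>>& w_columns,
    const std::vector<int>& active) {
  std::vector<double> delta(w_columns.empty() ? 0 : w_columns[0].size(), 0.0);
  for (int j : active) {
    for (std::size_t d = 0; d < delta.size(); ++d) delta[d] += w_columns[j][d];
  }
  return delta;
}
```

The cost of a prediction then scales with the number of trees rather than with the full length of the feature.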

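About $S^0$

So far we have not said whether $\Delta S$ is an increment of the absolute shape or of the relative shape. In practice we need $\Delta S$ to be a relative shape increment. Positions in an absolute shape are relative to the whole image, so we cannot constrain the distribution of absolute increments in the data (an absolute increment does remove the absolute position, but the scale of the face bounding box remains unconstrained). When extracting local binary features we need pixel information at the absolute shape, while what we predict is a relative shape increment; the two can be converted into each other through the face bounding box.

Every increment $\Delta S$ is relative to the current shape, and the increments are accumulated through the cascade. The initial shape $S_0$ is not produced by the model, yet it plays a key role in prediction. If the true shape of a test sample is $S_g$, the size of the difference between $S_0$ and $S_g$ directly affects the accuracy of the result. 3000fps uses the mean shape of the training samples as the initial shape, while RCPR picks initial shapes at random from the training samples.

In general, $S_0$ is a relative shape that is converted into an absolute shape on the current face through the face bounding box, so the scale of the bounding box also has a decisive effect on the difference between $S_0$ and $S_g$. We therefore usually need to use the same face detector to mark the bounding boxes on both training and test images, so that the box scales are consistent and $S_0$ is more accurate. Still, we have to admit that the algorithm remains strongly affected by $S_0$, and RCPR and ESR are limited by $S_0$ in the same way. Deep learning methods are comparatively less affected: a network can first predict $S_0$ so that the error between $S_0$ and $S_g$ is already very small, and then small corrections on top of $S_0$ improve the accuracy. This is the idea behind the deep convolutional network cascade for landmark prediction. Although deep learning methods can escape the limitation of $S_0$, they are still limited by the scale of the face bounding box, and the effect of scale on their predictions is not much better than for traditional methods.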

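Implementation

The authors of the paper did not release their source code, but someone has implemented the paper in Matlab and open-sourced it on github; the project is here. Someone else then reimplemented it in C++ based on the Matlab code; that project is here. I referred to both when writing my own C++ implementation of the method.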

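Data Collection and Preprocessing

Machine learning always involves working with data. Fortunately there is plenty of open data for facial landmark detection. Following the datasets used in the paper, we can download a dataset with 68 facial landmarks from here for training and testing. With ready-made data we can skip collecting data ourselves, and the data format of the dataset is quite clean.

Preprocessing generally means reworking the data we obtained; before feeding it to the model for training, we still need to polish it. Here we need to relocate the face bounding boxes. We already discussed the effect of the bounding box scale on $S_0$: the same face detector should be used to mark the bounding boxes during both training and detection. The most widely used open source option is still the VJ (Viola-Jones) detector that ships with OpenCV, even though it is not very good. The VJ detector can locate the face marked by the landmarks and give its bounding box.

Our training data consists of three things: the face image, the face bounding box, and the ground-truth face shape, i.e. $X = \{ I, BBox \}$, $Y = \{ Shape \}$. When discussing implementations with others, I found that some of them ran out of memory, mostly because they loaded all training images into memory and then processed them, which made memory usage soar until the program crashed. For the image $I$ we do not actually need the whole image as training data. It is enough to crop a region around the face and use that as $I$; this does not change our training data as long as the landmark positions are recomputed accordingly. Most images in the dataset are large, high-resolution pictures in which the face may occupy only a small part, so cropping saves a great deal of memory. We should also process the data in batches: loading everything into memory first and processing it afterwards drives peak memory usage to its maximum and can likewise crash the program. Of course, if your program is 64-bit and your machine has plenty of memory, you can ignore these optimizations.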

Data Augmentation

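When the existing dataset is not large enough, data augmentation enlarges it by transforming the data we already have, which is both economical and environmentally friendly. Different transformations suit different data and models. If the target is invariant under mirroring, flipping the image horizontally is a good choice; rotating the image is another common way to enlarge the dataset. For facial landmark localization, flipping a face left to right clearly does not change the structure of the face (the left and right eyes simply swap places), and slightly rotating the face also adds training data.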
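Flipping the landmarks along with the image takes two steps: mirror the x coordinates, then swap the indices of left/right landmark pairs so that, for example, index 0 still means "left eye". The sketch below assumes pixel coordinates in the range 0 to image_width - 1 and a caller-supplied list of mirror pairs; `FlipHorizontally` and `mirror_pairs` are hypothetical names, and the actual pairs depend on the annotation scheme of the dataset.

```cpp
#include <utility>
#include <vector>

struct Point { double x, y; };
using Shape = std::vector<Point>;

// Mirrors a shape left to right inside an image of the given width.
// mirror_pairs lists the landmark indices that trade places under mirroring.
Shape FlipHorizontally(const Shape& shape, double image_width,
                       const std::vector<std::pair<int, int>>& mirror_pairs) {
  Shape flipped = shape;
  for (Point& p : flipped) {
    p.x = image_width - 1.0 - p.x;  // pixel column c maps to width - 1 - c
  }
  for (const std::pair<int, int>& pair : mirror_pairs) {
    std::swap(flipped[pair.first], flipped[pair.second]);
  }
  return flipped;
}
```

Landmarks on the symmetry axis, such as the tip of the nose, need no entry in the pair list.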

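During training we need to give every training sample an initial shape. This initial shape can be picked at random from the samples, and by picking several initial shapes per sample we can enlarge the dataset in the same way.

Parallel Training of the Random Forests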

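There is nothing special about building and training the random forests, so here we mainly discuss how to speed up training. The random trees in a forest are trained independently, and the random forests of different landmarks are independent as well, so the trees can be trained in parallel. To keep the code simple, we do not need to write multithreaded code by hand; we can use OpenMP instead. OpenMP parallelizes our code through its own directives and uses system-level threads underneath. Current compilers have built-in OpenMP support, so the code is also easy to port. OpenMP may not perform as well as using system-level threads directly, for example in thread-switching overhead, but it is easy to learn and very convenient: a few directives turn serial code into parallel code, so the cost of changing the code is very low.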
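A sketch of what this looks like in code, under assumptions: `TrainTree` is a hypothetical placeholder for the real per-tree training routine, and every tree writes only to its own slot of the result vector, which is what makes the loop safe to parallelize.

```cpp
#include <vector>

// Hypothetical stand-in for training one random tree; it returns a fake
// "model" value so that the example stays self-contained.
double TrainTree(int landmark, int tree) { return landmark * 100.0 + tree; }

// Trains trees_per_forest trees for each landmark.
std::vector<double> TrainForests(int landmarks, int trees_per_forest) {
  const int total = landmarks * trees_per_forest;
  std::vector<double> models(total);
  // Each iteration writes only models[i], so the iterations are independent.
  #pragma omp parallel for
  for (int i = 0; i < total; ++i) {
    models[i] = TrainTree(i / trees_per_forest, i % trees_per_forest);
  }
  return models;
}
```

The directive only takes effect when OpenMP is enabled at build time (for example -fopenmp with gcc or /openmp with MSVC); without it the pragma is ignored and the loop runs serially with the same result.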

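Global Linear Regression

Once the random forests are trained, we can compute the LBF features of the training data. With these features and the relative shape increment as the target, we can train the global linear regression model. Because there is a large amount of training data and the LBF features are highly sparse, the paper uses dual coordinate descent to train this highly sparse linear model. I am not very familiar with the method, but fortunately its inventors wrote a library for solving linear models and open-sourced it on github under the name liblinear. The library is written in C++ and also provides bindings for many languages. Here we can use its C++ code directly: prepare the data as described in its documentation and call it to obtain the model parameters. It is quite convenient to use.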

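Results

The algorithm in 3000fps is a cascaded model in which every stage consists of random forests plus a global regression. Several cascaded stages produce the relative shape increments, from which the final shape is computed. Below are the predictions of the model I trained.

Summary

Besides its good accuracy, the key strength of the method in the 3000fps paper is its prediction speed. The fast model in the paper predicts 68 points at 3000 fps, which is astonishing. My own implementation reaches 300 fps on a single 3.2GHz CPU core with 8G of memory. The paper does not say how powerful the machine was, only that the implementation used a single core. I believe the implementation in the paper must have been optimized at a low level to reach such speed. Of course, we do not need to chase speed for its own sake; real-time performance is enough (although everyone would agree that faster is better).

References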

Mini-Caffe: Porting Caffe to WIN32

2015-07-14T00:00:00+08:00

Recently, Our team has slowly dived to Deep Learning. We did a lot of works relatived to Deep Learning and Computer Vision. At first, we used traditional way to deal with our cv problems but these days, we turned to Convolutional Neural Network and Deep Learning for a better performance in face of our cv problems. Actually, CNN and DL has achieved the state-of-the-art of performance at various computer vision problems, and has been heavily studied in these days. I think our team starts to use CNN and DL to face our problems is very smart, and we should stick to DL for our future cv-related problems.

As we all konw, nowadays, DL has caused a widespread concern due to its extremely good performance compared with traditional ways. DL is a part of Machine Leaning and CNN is a part of DL. CNN is very suitable for computer vision(images). Because of Convolution Layer defined in CNN, 2-dimension input is allowed in CNN and CNN is a very powerful tool to deal with image-related machine learning tasks. Local connection and shared weight simplifies the neural network structure and makes it much easier to train the network. The Pooling Layer makes the network much more robust to the input. CNN could still use SGD to train the network as other models use. Above of all is part of goodness from DL, what's more important is that there are many excellent open source projects out of the world from researchers. They are very easy to access on github or google code(will be closed in future). Caffe is one of them. It is implemented in C++ and will be every fast with CUDA and CuDNN if you have a GPU on your machine. Of cause, there are other CNN implementations such as Theano in Python, Torch7 in Lua and DeepLearning4J in Java etc. Our team adopt Caffe not only it is fast in C++ but also the Python binding and Matlab binding provided by Caffe. It is easy to use caffe-tool to train CNN model and write Python and C++ code to test or run the CNN model, it costs a few time to write code with Python or C++ to run the models.

Installing Caffe on Linux or MacOSX is very easy, we just need to follow the document from Caffe and install all dependance it need and simply modify the Makefile to compile the Caffe source code, build Python binding is extremely easy with one command line provided by Caffe's Makefile. Beacause of the various Data Layer Caffe provides, there are many third party libraries needed to be install in order to compile the source code. But any way, who cares to install open source projects on Linux, they are far to much easy with package managers like apt-get and yum or others, we can even compile the third party libraries ourselves which costs little. The developers provide a stable version of Caffe running on Linux or Unix, but doesn't provide Windows support. Acctually, I'm strongly suggest you to deploy your Caffe enviroment on Linux Server with GPU supported. But we may develop our code on Windows or we may want to run our Caffe supported projects on Windows. It's a pity that official Caffe currently can not run on Windows. The community has develop an unofficial version of Caffe that can be installed and run on Windows, the project is here. There are also other different solutions make Caffe run on Windows, but all these projects attempt to make full Caffe environment run on Windows, they need all third party libraries to be installed which they may compile themselves or the a pre-compiled version. These projects are great since they make Caffe runnable on Windows but they are not the solution to the situation that our team faced, actually, they are super solution to our situation, all works I did is base on these projects and make some little changes that make Caffe more smaller and easier to run on Windows.

As I mentioned above, the existing projects aim to provide a full Caffe environment on Windows. They most likely want to train Caffe models on Windows, so they port every data layer, since supporting various input formats (HDF5, LMDB, LevelDB and so on) matters for training a CNN. Our team only wants a test environment on Windows. We still train our CNN models on Linux, and all we need is a memory data layer (MemoryData) to read the CNN's input from memory. So I removed all the other data layers, which also reduces the number of third-party libraries we need (most of them are only there to support the various data layers). With that source code gone, we only need OpenCV, Boost, protobuf, glog and gflags, and these libraries are easy to install on Windows, whether we compile them ourselves or install pre-compiled versions. After installing the third-party libraries, we can compile the Caffe source code and get the runtime core of Caffe working.

I have created a GitHub project here. It removes all the source code that is not needed for Caffe's runtime core, including the various data layers, extra tools such as train.cpp, and some scripts for handling or preprocessing data. I also removed the Matlab binding code but kept the Python binding code, since I use Python a lot. I have not built the Python binding on Windows yet; it would still be useful, but it matters less there given how we use Caffe on Windows.

The project works, and our team can now develop and test Caffe-related projects on Windows. Before, we usually developed on Windows and tested on a Linux server, which was troublesome and inconvenient.

A First Taste of Caffe

2015-04-11T00:00:00+08:00

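With deep learning as popular as it is today, Caffe has made the barrier to entry remarkably low. Deploying Caffe on your own laptop or on a remote server is also very easy: by following the official installation guide you can install it on most Linux distributions and on Mac OS. There is no official Windows version of Caffe yet, but the community has already ported it to Windows, and the project is on GitHub. We should still prefer to deploy Caffe on a *nix platform.

Introduction to Caffe

Caffe is a clean and efficient deep learning framework for CNNs. Its author is Yangqing Jia, who received his PhD from UC Berkeley and currently works at Google. Caffe supports the command line and provides Python and Matlab interfaces for developers and researchers, and the framework can switch seamlessly between CPU and GPU, which is very convenient. Caffe's detailed documentation is here. Once the dependencies are installed according to the official documentation, you can compile Caffe. Some dependencies may be too old in your package repository, or missing altogether; in that case you can build them from source.

Digit Recognition on Kaggle

Kaggle is a data competition platform that provides problems and data to competitors all over the world. One of its competitions is Digit Recognizer, the well-known handwritten digit recognition task. LeNet, proposed by Yann LeCun, already solves this problem well, and the official Caffe examples include a LeNet implementation. We will use this CNN model together with the data provided by Kaggle to walk through a deep learning workflow, from obtaining and cleaning the data, through training the CNN model, to predicting on new data.

Solving Kaggle's Digit Recognizer with Caffe

Kaggle provides two data sets for this competition, one for training and one for testing. The data we get are not images, though, but csv files. Kaggle describes the contents in detail; see here.

Preprocessing the csv data

Caffe can take training input in various formats (one Data layer per format); see the official documentation for details. Here I use HDF5 input, and the code below converts the csv data to HDF5. The code is written in Python, using Pandas to read the csv data and h5py to write the HDF5 data.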

#!/usr/bin/env python

import os
import logging
import numpy as np
import pandas as pd
import h5py


DATA_ROOT = 'data'
join = os.path.join
TRAIN = join(DATA_ROOT, 'train.csv')
train_file = join(DATA_ROOT, 'mnist_train.h5')
test_file = join(DATA_ROOT, 'mnist_test.h5')

# logger
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
sh = logging.StreamHandler()
sh.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
sh.setFormatter(formatter)
logger.addHandler(sh)

# load data from train.csv
logger.info('Load data from %s', TRAIN)
df = pd.read_csv(TRAIN)
data = df.values

logger.info('Get %d Rows in dataset', len(data))

# random shuffle
np.random.shuffle(data)

# all dataset
labels = data[:, 0]
images = data[:, 1:]

# process data
images = images.reshape((len(images), 1, 28, 28))
images = images / 255.

# train dataset number (integer division keeps this a valid slice index)
trainset = len(labels) * 3 // 4

# train dataset
labels_train = labels[:trainset]
images_train = images[:trainset]
# test dataset
labels_test = labels[trainset:]
images_test = images[trainset:]

# write to hdf5
if os.path.exists(train_file):
    os.remove(train_file)
if os.path.exists(test_file):
    os.remove(test_file)

logger.info('Write train dataset to %s', train_file)
with h5py.File(train_file, 'w') as f:
    f['label'] = labels_train.astype(np.float32)
    f['data'] = images_train.astype(np.float32)

logger.info('Write test dataset to %s', test_file)
with h5py.File(test_file, 'w') as f:
    f['label'] = labels_test.astype(np.float32)
    f['data'] = images_test.astype(np.float32)

logger.info('Done')

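Here I split the data into two parts, a training set and a test set, so that the model's accuracy can be measured easily. Unlike the test.csv provided by Kaggle, this test set has labels.

Training the CNN

Here I use the model from Caffe's bundled example directly. The network definition can be found in the Caffe source tree, so I won't paste all of it, only the input data layers.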

layer {
  name: "mnist"
  type: "HDF5Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  hdf5_data_param {
    source: "data/mnist_train.txt"
    batch_size: 64
  }
}
layer {
  name: "mnist"
  type: "HDF5Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  hdf5_data_param {
    source: "data/mnist_test.txt"
    batch_size: 100
  }
}

data/mnist_train.txt and data/mnist_test.txt list the paths of the HDF5 data files. The figure below is the LeNet network, which I drew with Graphviz.

{% image fancybox center /assert/img/2015/04/lenet.jpg %}
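The source of an HDF5Data layer is a plain text file that lists one HDF5 file path per line, not the .h5 file itself. A minimal sketch for generating the two list files used by the layers above; the helper name write_hdf5_list is my own, and the paths assume the file names from the conversion script and that Caffe is run from the project root:

```python
def write_hdf5_list(list_path, h5_paths):
    """Write one HDF5 file path per line, the format an HDF5Data `source` expects."""
    with open(list_path, 'w') as f:
        for path in h5_paths:
            f.write(path + '\n')
```

Calling write_hdf5_list('data/mnist_train.txt', ['data/mnist_train.h5']) and write_hdf5_list('data/mnist_test.txt', ['data/mnist_test.h5']) would produce the two files; a large data set split across several .h5 files simply gets more lines.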

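With the data and the network definition in place, we still need the configuration for training the network. These settings are written in lenet_solver.prototxt; see the documentation for what each one means.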

# The train/test net protocol buffer definition
# (path to the network definition file)
net: "model/lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
# Each test phase runs 100 iterations.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
# (how the learning rate decays, and its parameters)
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
# (prints the network outputs such as the loss)
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
# (every 5000 iterations, save the network parameters and the solver state to disk)
snapshot: 5000
snapshot_prefix: "model/"
# solver mode: CPU or GPU
# (CPU is used here)
solver_mode: CPU

Run caffe train --solver=model/lenet_solver.prototxt to start training the network. Below is part of the output during training.

I0411 21:21:04.364305 31883 solver.cpp:266] Iteration 0, Testing net (#0)
I0411 21:21:07.918603 31883 solver.cpp:315]     Test net output #0: accuracy = 0.0966
I0411 21:21:07.918660 31883 solver.cpp:315]     Test net output #1: loss = 2.3217 (* 1 = 2.3217 loss)
I0411 21:21:07.979472 31883 solver.cpp:189] Iteration 0, loss = 2.39515
I0411 21:21:07.979532 31883 solver.cpp:204]     Train net output #0: loss = 2.39515 (* 1 = 2.39515 loss)
I0411 21:21:07.979560 31883 solver.cpp:464] Iteration 0, lr = 0.01
I0411 21:21:13.373544 31883 solver.cpp:189] Iteration 100, loss = 0.309522
I0411 21:21:13.373603 31883 solver.cpp:204]     Train net output #0: loss = 0.309522 (* 1 = 0.309522 loss)
I0411 21:21:13.373621 31883 solver.cpp:464] Iteration 100, lr = 0.00992565
I0411 21:21:18.770283 31883 solver.cpp:189] Iteration 200, loss = 0.342084
I0411 21:21:18.770339 31883 solver.cpp:204]     Train net output #0: loss = 0.342084 (* 1 = 0.342084 loss)
I0411 21:21:18.770357 31883 solver.cpp:464] Iteration 200, lr = 0.00985258
I0411 21:21:24.270835 31883 solver.cpp:189] Iteration 300, loss = 0.178883
I0411 21:21:24.270900 31883 solver.cpp:204]     Train net output #0: loss = 0.178883 (* 1 = 0.178883 loss)
I0411 21:21:24.270920 31883 solver.cpp:464] Iteration 300, lr = 0.00978075
I0411 21:21:29.655320 31883 solver.cpp:189] Iteration 400, loss = 0.0702766
I0411 21:21:29.655375 31883 solver.cpp:204]     Train net output #0: loss = 0.0702766 (* 1 = 0.0702766 loss)
I0411 21:21:29.655393 31883 solver.cpp:464] Iteration 400, lr = 0.00971013
I0411 21:21:35.007863 31883 solver.cpp:266] Iteration 500, Testing net (#0)
I0411 21:21:38.496989 31883 solver.cpp:315]     Test net output #0: accuracy = 0.9698
I0411 21:21:38.497042 31883 solver.cpp:315]     Test net output #1: loss = 0.0967276 (* 1 = 0.0967276 loss)
I0411 21:21:38.554558 31883 solver.cpp:189] Iteration 500, loss = 0.186758
I0411 21:21:38.554613 31883 solver.cpp:204]     Train net output #0: loss = 0.186758 (* 1 = 0.186758 loss)
I0411 21:21:38.554631 31883 solver.cpp:464] Iteration 500, lr = 0.00964069
I0411 21:21:43.980552 31883 solver.cpp:189] Iteration 600, loss = 0.112056
I0411 21:21:43.980610 31883 solver.cpp:204]     Train net output #0: loss = 0.112056 (* 1 = 0.112056 loss)
I0411 21:21:43.980628 31883 solver.cpp:464] Iteration 600, lr = 0.0095724
I0411 21:21:49.568586 31883 solver.cpp:189] Iteration 700, loss = 0.074904
I0411 21:21:49.568653 31883 solver.cpp:204]     Train net output #0: loss = 0.074904 (* 1 = 0.074904 loss)
I0411 21:21:49.568675 31883 solver.cpp:464] Iteration 700, lr = 0.00950522
I0411 21:21:54.960841 31883 solver.cpp:189] Iteration 800, loss = 0.220085
I0411 21:21:54.960911 31883 solver.cpp:204]     Train net output #0: loss = 0.220085 (* 1 = 0.220085 loss)
I0411 21:21:54.960932 31883 solver.cpp:464] Iteration 800, lr = 0.00943913
I0411 21:22:00.352416 31883 solver.cpp:189] Iteration 900, loss = 0.0172225
I0411 21:22:00.352488 31883 solver.cpp:204]     Train net output #0: loss = 0.0172226 (* 1 = 0.0172226 loss)
I0411 21:22:00.352511 31883 solver.cpp:464] Iteration 900, lr = 0.00937411
I0411 21:22:05.703879 31883 solver.cpp:266] Iteration 1000, Testing net (#0)
I0411 21:22:09.199872 31883 solver.cpp:315]     Test net output #0: accuracy = 0.9801
I0411 21:22:09.199944 31883 solver.cpp:315]     Test net output #1: loss = 0.0650562 (* 1 = 0.0650562 loss)
I0411 21:22:09.256795 31883 solver.cpp:189] Iteration 1000, loss = 0.118511
I0411 21:22:09.256847 31883 solver.cpp:204]     Train net output #0: loss = 0.118511 (* 1 = 0.118511 loss)
I0411 21:22:09.256867 31883 solver.cpp:464] Iteration 1000, lr = 0.00931012
...
...
...
I0411 21:30:26.140774 31883 solver.cpp:266] Iteration 9000, Testing net (#0)
I0411 21:30:29.663858 31883 solver.cpp:315]     Test net output #0: accuracy = 0.9898
I0411 21:30:29.663919 31883 solver.cpp:315]     Test net output #1: loss = 0.0369673 (* 1 = 0.0369673 loss)
I0411 21:30:29.715962 31883 solver.cpp:189] Iteration 9000, loss = 0.00257692
I0411 21:30:29.716016 31883 solver.cpp:204]     Train net output #0: loss = 0.00257717 (* 1 = 0.00257717 loss)
I0411 21:30:29.716032 31883 solver.cpp:464] Iteration 9000, lr = 0.00617924
I0411 21:30:35.261111 31883 solver.cpp:189] Iteration 9100, loss = 0.000706766
I0411 21:30:35.261175 31883 solver.cpp:204]     Train net output #0: loss = 0.000707015 (* 1 = 0.000707015 loss)
I0411 21:30:35.261193 31883 solver.cpp:464] Iteration 9100, lr = 0.00615496
I0411 21:30:40.733172 31883 solver.cpp:189] Iteration 9200, loss = 0.00721649
I0411 21:30:40.733232 31883 solver.cpp:204]     Train net output #0: loss = 0.00721672 (* 1 = 0.00721672 loss)
I0411 21:30:40.733252 31883 solver.cpp:464] Iteration 9200, lr = 0.0061309
I0411 21:30:46.430910 31883 solver.cpp:189] Iteration 9300, loss = 0.0106291
I0411 21:30:46.430974 31883 solver.cpp:204]     Train net output #0: loss = 0.0106294 (* 1 = 0.0106294 loss)
I0411 21:30:46.430991 31883 solver.cpp:464] Iteration 9300, lr = 0.00610706
I0411 21:30:52.084485 31883 solver.cpp:189] Iteration 9400, loss = 0.0217876
I0411 21:30:52.084548 31883 solver.cpp:204]     Train net output #0: loss = 0.0217879 (* 1 = 0.0217879 loss)
I0411 21:30:52.084563 31883 solver.cpp:464] Iteration 9400, lr = 0.00608343
I0411 21:30:57.599124 31883 solver.cpp:266] Iteration 9500, Testing net (#0)
I0411 21:31:01.165457 31883 solver.cpp:315]     Test net output #0: accuracy = 0.9908
I0411 21:31:01.165515 31883 solver.cpp:315]     Test net output #1: loss = 0.0361107 (* 1 = 0.0361107 loss)
I0411 21:31:01.221964 31883 solver.cpp:189] Iteration 9500, loss = 0.00431475
I0411 21:31:01.222023 31883 solver.cpp:204]     Train net output #0: loss = 0.00431501 (* 1 = 0.00431501 loss)
I0411 21:31:01.222040 31883 solver.cpp:464] Iteration 9500, lr = 0.00606002
I0411 21:31:06.748987 31883 solver.cpp:189] Iteration 9600, loss = 0.00301128
I0411 21:31:06.749049 31883 solver.cpp:204]     Train net output #0: loss = 0.00301154 (* 1 = 0.00301154 loss)
I0411 21:31:06.749068 31883 solver.cpp:464] Iteration 9600, lr = 0.00603682
I0411 21:31:12.305821 31883 solver.cpp:189] Iteration 9700, loss = 0.0178924
I0411 21:31:12.305883 31883 solver.cpp:204]     Train net output #0: loss = 0.0178927 (* 1 = 0.0178927 loss)
I0411 21:31:12.305903 31883 solver.cpp:464] Iteration 9700, lr = 0.00601382
I0411 21:31:18.102248 31883 solver.cpp:189] Iteration 9800, loss = 0.0116095
I0411 21:31:18.102319 31883 solver.cpp:204]     Train net output #0: loss = 0.0116097 (* 1 = 0.0116097 loss)
I0411 21:31:18.102339 31883 solver.cpp:464] Iteration 9800, lr = 0.00599102
I0411 21:31:24.297734 31883 solver.cpp:189] Iteration 9900, loss = 0.0111304
I0411 21:31:24.297801 31883 solver.cpp:204]     Train net output #0: loss = 0.0111307 (* 1 = 0.0111307 loss)
I0411 21:31:24.297826 31883 solver.cpp:464] Iteration 9900, lr = 0.00596843
I0411 21:31:29.688841 31883 solver.cpp:334] Snapshotting to model/_iter_10000.caffemodel
I0411 21:31:29.713232 31883 solver.cpp:342] Snapshotting solver state to model/_iter_10000.solverstate
I0411 21:31:29.741745 31883 solver.cpp:248] Iteration 10000, loss = 0.0402425
I0411 21:31:29.741792 31883 solver.cpp:266] Iteration 10000, Testing net (#0)
I0411 21:31:33.262156 31883 solver.cpp:315]     Test net output #0: accuracy = 0.9902
I0411 21:31:33.262218 31883 solver.cpp:315]     Test net output #1: loss = 0.0349098 (* 1 = 0.0349098 loss)
I0411 21:31:33.262231 31883 solver.cpp:253] Optimization Done.
I0411 21:31:33.262240 31883 caffe.cpp:134] Optimization Done.

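We can see that the accuracy ends up above 99%; I suspect the model is overfitting. Training on a CPU (Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz) takes only about 10 minutes, which is quite fast. This gives us trained network parameters that we can use for prediction.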
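The lr values in the training log follow from the solver settings. With lr_policy "inv", Caffe computes the rate as base_lr * (1 + gamma * iter) ^ (-power). A small sketch that recomputes it from the values in lenet_solver.prototxt (the function name inv_lr is my own):

```python
def inv_lr(iteration, base_lr=0.01, gamma=0.0001, power=0.75):
    """Learning rate under the "inv" policy:
    base_lr * (1 + gamma * iteration) ** (-power)."""
    return base_lr * (1.0 + gamma * iteration) ** (-power)


for it in (0, 100, 500, 9000):
    print('Iteration %d, lr = %.6g' % (it, inv_lr(it)))
```

The printed rates agree with the log lines for the same iterations, for example 0.00992565 at iteration 100 and 0.00617924 at iteration 9000, so the rate decays smoothly rather than in steps.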

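Predicting the data in test.csv

The format of test.csv is much the same as train.csv, except that there are no labels, because the labels are what we have to predict. I use Python for prediction. Caffe provides a Python interface, so remember to build the Python module when compiling Caffe. Of course you need to install its dependencies first, and numpy is certainly one of them. See the documentation for the details.

We first load the image data from test.csv, preprocess it, and hand it to Caffe for prediction. Initializing Caffe requires the network definition from the previous step and the trained model parameters.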

#!/usr/bin/env python

import os
import logging
import numpy as np
import pandas as pd
import caffe


DATA_ROOT = 'data'
MODEL_ROOT = 'model'
join = os.path.join
TEST = join(DATA_ROOT, 'test.csv')
OUTPUT = join(DATA_ROOT, 'result.csv')
CAFFE_MODEL = join(MODEL_ROOT, 'mnist.caffemodel')
CAFFE_SOLVER = join(MODEL_ROOT, 'lenet.prototxt')

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
sh = logging.StreamHandler()
sh.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
sh.setFormatter(formatter)
logger.addHandler(sh)

# load test dataset
logger.info('Load test dataset from %s', TEST)
df = pd.read_csv(TEST)
data = df.values

data = data.reshape((len(data), 28, 28, 1))
data = data / 255.

# set caffe net
net = caffe.Classifier(CAFFE_SOLVER, CAFFE_MODEL)

# predict
logger.info('Start predict')
BATCH_SIZE = 100
iter_k = 0
labels = []
while True:
    logger.info('ITER %d', iter_k)
    batch = data[iter_k*BATCH_SIZE: (iter_k+1)*BATCH_SIZE]
    if batch.size == 0:
        break
    result = net.predict(batch)
    for label in np.argmax(result, 1):
        labels.append(label)
    iter_k = iter_k + 1
logger.info('Prediction Done')

# write to file
logger.info('Save result to %s', OUTPUT)
if os.path.exists(OUTPUT):
    os.remove(OUTPUT)
with open(OUTPUT, 'w') as fd:
    fd.write('ImageId,Label\n')
    for idx, label in enumerate(labels):
        fd.write(str(idx+1))
        fd.write(',')
        fd.write(str(label))
        fd.write('\n')

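Here I write the predictions to a file in the format Kaggle requires and then upload it to Kaggle's scoring system.

The resulting accuracy is 95.5%, which is quite good.

Summary

DL is already very popular, and Caffe greatly lowers the barrier to entry. The experts publish their papers openly and have open-sourced a lot of code, which makes it easy for outsiders like us to learn and get started. If you are interested, give it a try. I also get the feeling that training data may now matter more in DL than the network model itself (DL's inner workings are still hard to explain, but its results outclass the alternatives).

Redis Source Code: Simple Dynamic Strings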

2015-03-14T00:00:00+08:00

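Redis does not use plain C character arrays. Instead it implements its own simple dynamic string structure, SDS. Redis implements the dynamic string on a single contiguous block of memory (the string is still terminated by '\0', so it is compatible with some of the C string functions). The sds source code is in sds.h and sds.c.

SDS structure definition

Redis defines the type sds, which is really just char*. The struct sdshdr can be thought of as the header of an sds value. sdshdr and sds form a single block in memory, allocated by zmalloc (Redis's own version of malloc).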

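typedef char *sds; // the sds type; it points at buf inside the sdshdr struct

struct sdshdr {
    unsigned int len; // number of bytes in use in buf; buf actually holds len+free+1 bytes ('\0' always takes one byte)
    unsigned int free; // number of unused bytes left in buf
    char buf[]; // flexible array member: the bytes follow the header directly
};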
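To make the layout concrete, here is a toy Python model of one contiguous block holding the header followed by buf. It is a sketch, not Redis code: it assumes a 4-byte unsigned int, and the names sds_new and sds_len are borrowed only for readability.

```python
import struct

HEADER = struct.Struct('=II')  # models sdshdr: unsigned int len, unsigned int free


def sds_new(init):
    """Return (memory, s): one contiguous block and the offset of buf inside it."""
    memory = bytearray(HEADER.size + len(init) + 1)   # header + data + '\0'
    HEADER.pack_into(memory, 0, len(init), 0)         # len = initlen, free = 0
    s = HEADER.size                                   # "sds" points at buf, not at the header
    memory[s:s + len(init)] = init
    return memory, s


def sds_len(memory, s):
    """O(1) length: step back from s to the header and read len."""
    length, _free = HEADER.unpack_from(memory, s - HEADER.size)
    return length


memory, s = sds_new(b'a\x00b')
print(sds_len(memory, s))          # 3: the length stored in the header
print(memory.index(0, s) - s)      # 1: where a strlen-style scan would stop
```

The two printed numbers differ because the data contains a zero byte in the middle: the header keeps the real length, which is what makes the string binary safe.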

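Basic SDS functions

Creating and freeing sds strings

When Redis initializes an sds string it allocates one contiguous block of memory, and the bytes of buf immediately follow the header.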

/* Create a new sds string with the content specified by the 'init' pointer
 * and 'initlen'.
 * If NULL is used for 'init' the string is initialized with zero bytes.
 *
 * The string is always null-termined (all the sds strings are, always) so
 * even if you create an sds string with:
 *
 * mystring = sdsnewlen("abc",3);
 *
 * You can print the string with printf() as there is an implicit \0 at the
 * end of the string. However the string is binary safe and can contain
 * \0 characters in the middle, as the length is stored in the sds header. */
sds sdsnewlen(const void *init, size_t initlen) {
    struct sdshdr *sh;

    // Allocate one block: the first bytes hold the sdshdr, the byte array follows
    if (init) {
        sh = zmalloc(sizeof(struct sdshdr)+initlen+1);
    } else {
        sh = zcalloc(sizeof(struct sdshdr)+initlen+1); // zero-filled
    }
    if (sh == NULL) return NULL;
    sh->len = initlen;
    sh->free = 0;
    if (initlen && init)
        memcpy(sh->buf, init, initlen); // copy the data
    sh->buf[initlen] = '\0';
    return (char*)sh->buf;
}

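Freeing an sds string is very simple. Because sdshdr and sds are one block, freeing only requires the address of the sdshdr, that is, sds - sizeof(struct sdshdr). The function that frees it, zfree, is likewise Redis's own version of free.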

/* Free an sds string. No operation is performed if 's' is NULL. */
void sdsfree(sds s) {
    if (s == NULL) return;
    zfree(s-sizeof(struct sdshdr));
}

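Dynamic resizing of sds strings

The key feature of an sds string is that its length can be adjusted dynamically, which covers both growing and shrinking the byte array.

Redis provides a function that grows the free space of an sds string. Its addlen parameter states how many free bytes are required. In practice Redis looks at the resulting length: if the required total (the bytes in use plus addlen) is less than 1 MB, the new total length is twice that; otherwise the new total length is the required total plus 1 MB. The goal is to reduce the number of memory allocations: the next time the string needs to grow, it may not need to allocate at all.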

/* Enlarge the free space at the end of the sds string so that the caller
 * is sure that after calling this function can overwrite up to addlen
 * bytes after the end of the string, plus one more byte for nul term.
 *
 * Note: this does not change the *length* of the sds string as returned
 * by sdslen(), but only the free buffer space we have. */
sds sdsMakeRoomFor(sds s, size_t addlen) {
    struct sdshdr *sh, *newsh;
    size_t free = sdsavail(s);
    size_t len, newlen;

    if (free >= addlen) return s; // enough free space already, no need to grow
    len = sdslen(s);
    sh = (void*) (s-(sizeof(struct sdshdr))); // pointer to the sdshdr
    newlen = (len+addlen);
    if (newlen < SDS_MAX_PREALLOC) // SDS_MAX_PREALLOC = 1024*1024
        newlen *= 2;
    else
        newlen += SDS_MAX_PREALLOC;
    newsh = zrealloc(sh, sizeof(struct sdshdr)+newlen+1); // one extra byte is always reserved for '\0'
    if (newsh == NULL) return NULL; // allocation failed

    newsh->free = newlen - len; // update the free length, not counting '\0'
    return newsh->buf;
}
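The preallocation rule in sdsMakeRoomFor is easy to check in isolation. A small Python restatement of just that rule (the function name grown_capacity is my own; the constant is the one from the C source above):

```python
SDS_MAX_PREALLOC = 1024 * 1024  # 1 MB, as in the C source above


def grown_capacity(length, addlen):
    """Capacity (excluding the trailing '\\0') chosen when the string must grow."""
    newlen = length + addlen
    if newlen < SDS_MAX_PREALLOC:
        return newlen * 2
    return newlen + SDS_MAX_PREALLOC


print(grown_capacity(3, 5))                 # 16: small strings double
print(grown_capacity(SDS_MAX_PREALLOC, 1))  # 2097153: large strings gain 1 MB
```

Doubling keeps the number of reallocations logarithmic for small strings, while the fixed 1 MB step caps the wasted space once strings get large.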

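Since sds strings can grow, they can of course shrink too. Redis uses sdsRemoveFreeSpace to reclaim all the free space in an sds string. It does this by reallocating the block with zrealloc, which may move the existing data to a new memory region and release the old one.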

/* Reallocate the sds string so that it has no free space at the end. The
 * contained string remains not altered, but next concatenation operations
 * will require a reallocation.
 *
 * After the call, the passed sds string is no longer valid and all the
 * references must be substituted with the new pointer returned by the call. */
sds sdsRemoveFreeSpace(sds s) {
    struct sdshdr *sh;

    sh = (void*) (s-(sizeof(struct sdshdr)));
    sh = zrealloc(sh, sizeof(struct sdshdr)+sh->len+1); // reallocate
    sh->free = 0;
    return sh->buf;
}

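sds string functions

Besides being partly compatible with C's string functions, sds strings come with functions that Redis implements itself, including the sdscat family for concatenation, integer-to-string conversion, string formatting, string splitting and many more.

The most basic function of the sdscat family is sds sdscatlen(sds s, const void *t, size_t len), which appends len bytes from the memory that t points to onto the string s. All the other cat operations are built on this function.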

/* Append the specified binary-safe string pointed by 't' of 'len' bytes to the
 * end of the specified sds string 's'.
 *
 * After the call, the passed sds string is no longer valid and all the
 * references must be substituted with the new pointer returned by the call. */
sds sdscatlen(sds s, const void *t, size_t len) {
    struct sdshdr *sh;
    size_t curlen = sdslen(s);

    s = sdsMakeRoomFor(s,len); // make sure there is enough room
    if (s == NULL) return NULL;
    sh = (void*) (s-(sizeof(struct sdshdr))); // sdshdr pointer after the possible reallocation
    memcpy(s+curlen, t, len); // copy the bytes
    sh->len = curlen+len;
    sh->free = sh->free-len;
    s[curlen+len] = '\0';
    return s;
}
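How len and free change across appends is easier to see in a toy model. The class below is a sketch in Python that mimics the bookkeeping of sdsMakeRoomFor and sdscatlen; SdsModel and its method names are my own, and a bytearray stands in for the reallocated block.

```python
SDS_MAX_PREALLOC = 1024 * 1024


class SdsModel:
    """Toy model of an sds string: tracks len, free and the bytes in buf."""

    def __init__(self, init=b''):
        self.buf = bytearray(init) + b'\x00'
        self.len = len(init)
        self.free = 0

    def make_room_for(self, addlen):
        if self.free >= addlen:
            return                      # enough spare room, no reallocation
        newlen = self.len + addlen
        newlen = newlen * 2 if newlen < SDS_MAX_PREALLOC else newlen + SDS_MAX_PREALLOC
        self.buf.extend(bytes(newlen + 1 - len(self.buf)))   # models zrealloc
        self.free = newlen - self.len

    def cat(self, t):
        self.make_room_for(len(t))
        self.buf[self.len:self.len + len(t)] = t
        self.len += len(t)
        self.free -= len(t)
        self.buf[self.len] = 0          # keep the string '\0'-terminated
        return self

    def value(self):
        return bytes(self.buf[:self.len])


s = SdsModel(b'abc')
s.cat(b'de')
print(s.value(), s.len, s.free)   # b'abcde' 5 5
s.cat(b'xyz')                     # fits in the spare room, so no reallocation
print(s.value(), s.len, s.free)   # b'abcdexyz' 8 2
```

The second append reuses the space reserved by the first one, which is exactly the saving the preallocation is meant to buy.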

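sds sdstrim(sds s, const char *cset) trims characters from both ends of a string. void sdsrange(sds s, int start, int end) takes a substring and supports negative indices. void sdstolower(sds s) and void sdstoupper(sds s) call C's built-in case conversion functions to convert the case of an sds string. Redis implements many other string operations as well, and they are well worth studying.

SDS summary

The sds string is the most basic structure in Redis. It stores a string's metadata and data in one contiguous block of memory, which cuts down on memory operations: allocating and freeing each take a single call. With the metadata available, many string operations can be optimized. The simplest example is computing the length of a string: C's strlen is O(n), while in Redis it is an O(1) operation. SDS does not depend on any other Redis data structure and its source is relatively simple, yet its design still deserves careful study.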

json vs simplejson vs ujson

2015-01-03T00:00:00+08:00

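This post is my own translation; the original article is here.

JSON has indisputably become the most commonly used data interchange format today. Python has two commonly used libraries for handling JSON: json, which ships with the standard library, and simplejson, a pure-Python implementation with its own optimizations. The purpose of this post is to introduce ultrajson, also known as Ultra JSON, a library implemented in C that runs very fast.

We benchmarked three common JSON operations: load, loads and dumps. We create a dictionary with three keys, id, name and address, encode it with json.dumps() and save it to a file. Then we load the data back from the file with json.loads() and json.load(). We measure how long the three libraries take on these operations with 10000, 50000, 100000, 200000 and 1000000 such dictionaries.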
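The original benchmark script is not reproduced in this post, so here is a rough sketch of how such a comparison could be set up. It is not the original author's code: the record values are made up, simplejson and ujson are only used if they happen to be installed, and timings will vary from machine to machine.

```python
import json
import timeit

libs = {'json': json}
for name in ('simplejson', 'ujson'):      # optional third-party libraries
    try:
        libs[name] = __import__(name)
    except ImportError:
        pass

# Hypothetical records; the post only names the keys id, name and address.
records = [{'id': i, 'name': 'name-%d' % i, 'address': 'address-%d' % i}
           for i in range(10000)]


def bench(lib):
    """Return seconds for (dumps one by one, dumps whole list, loads whole list)."""
    one_by_one = timeit.timeit(lambda: [lib.dumps(r) for r in records], number=1)
    whole = timeit.timeit(lambda: lib.dumps(records), number=1)
    text = lib.dumps(records)
    loads = timeit.timeit(lambda: lib.loads(text), number=1)
    return one_by_one, whole, loads


results = {name: bench(lib) for name, lib in libs.items()}
for name, times in sorted(results.items()):
    print('%-10s dumps(each)=%.4fs dumps(list)=%.4fs loads=%.4fs' % ((name,) + times))
```

A single run per operation is noisy; raising number in the timeit calls, or repeating with the larger record counts listed above, gives steadier figures.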

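Saving the data one item at a time with dumps

Saving the dictionaries one at a time with json.dumps() gave the following results.

We found that json outperforms simplejson, but ultrajson is nearly 4 times as fast as json.

Saving all the data at once with dumps

In this test we put all the dictionaries into one list and save the list with json.dumps().

simplejson and json perform about the same, but ultrajson is still 1.5 times faster than they are. Next, let's see how the three libraries compare on load and loads.

Loading the data with load

We use load to load the data, which is a list of dictionaries.

Surprisingly, simplejson does better than the other two libraries. ultrajson comes very close to simplejson, and both are nearly 4 times as fast as json.

Loading the data with loads

In this test we load the data from the file with json.loads().

ultrajson beats the other two libraries once again: it is nearly 6 times as fast as json and 3 times as fast as simplejson.

After these tests the verdict is clear. Use simplejson instead of json in any case, since simplejson is a well-maintained library. If you are after speed, use ultrajson, but keep in mind that it does not cope well with data that is not serializable. If you are only dealing with text data, there is nothing to worry about.

Notes on Using Scrapy, Part 2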

2014-12-31T00:00:00+08:00

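Writing a Spider

With the initial project that Scrapy created for us as a base, we can start writing spiders. Our spiders live in the modules specified in settings.py, by default the spiders module. We need to create a file there to hold our spider; when Scrapy starts, it looks for the spiders and loads them.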

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

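This is a spider example from the official Scrapy tutorial. name is the name of the spider, allowed_domains holds the domains that the spider is allowed to visit, and start_urls is used to generate the initial requests, whose responses are passed to the spider's parse method for processing. The class must inherit from scrapy.Spider so that Scrapy knows it represents a spider.

Writing an Item

An Item represents information extracted from a web page. We can extract several items from one page, possibly of several kinds, depending on what information we want from the page.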

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

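This code also comes from the Scrapy tutorial. A custom Item must inherit from scrapy.Item, and every field is of type Field, which can hold any type of Python object. In fact we can use an Item like a Python dict, with the attribute names defined here as its keys.

Combining Item and Spider

With the Item in place, we can write the page-processing logic in the Spider, that is, extract information from the page, wrap it in an Item, and hand it to the Scrapy framework for further processing. Here we won't use the example from the Scrapy tutorial; let's write our own and scrape the pages we want.

Let's scrape information about today's top 50 illustrations on Pixiv. Pixiv is a Japanese illustration-sharing site that brings together many first-rate artists; the illustrations are, of course, mostly anime-style. Enough of that, let's try scraping it. Before writing any code we need to work out how to scrape: which url to request, and how to analyze the response to extract the information we want. As it happens, Pixiv already has a url that returns the daily illustration ranking directly as json.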

http://www.pixiv.net/ranking.php?mode=daily&content=illust&format=json

We can visit the following link to see today's illustration ranking.

http://www.pixiv.net/ranking.php?mode=daily&content=illust

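This link gives us a web page directly rather than json data. For convenience, we will scrape the json data. Sometimes json is not available and we have to go through the html file and analyze the html source to get the data; Scrapy provides tools to help with that too. In simple cases we can use re, the regular expression module in Python's standard library, to extract information straight from the html. When the process gets complicated, I recommend analyzing the page with the tools Scrapy provides or with a third-party parsing library such as BeautifulSoup.

Let's write the Item first. To keep things simple, we extract only the title of each illustration and its url.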
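For the simple case, extracting data with re might look like the sketch below. The html snippet is invented for illustration and is not Pixiv's real page structure; a pattern like this only works when the markup is small and regular.

```python
import re

# Hypothetical markup for illustration only; it is not Pixiv's real page structure.
html = '''
<section class="ranking-item"><h2><a href="/works/1">First title</a></h2></section>
<section class="ranking-item"><h2><a href="/works/2">Second title</a></h2></section>
'''

# Capture the link target and the link text of every <h2><a ...> pair.
pattern = re.compile(r'<h2><a href="([^"]+)">([^<]+)</a></h2>')
items = [{'url': url, 'title': title} for url, title in pattern.findall(html)]
print(items)
```

As soon as attributes change order or tags nest differently the pattern breaks, which is why a real parser is the better choice for anything complicated.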

import scrapy
from scrapy import Field

class IllustrationItem(scrapy.Item):
    """Item for Illustration on Pixiv
    """
    title = Field()
    url = Field()

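It's that simple. The page to scrape was mentioned above. Once we have the json data, we can parse it with Python's json module, extract the entries one by one, wrap each in an IllustrationItem, and hand it to the surrounding Scrapy framework.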

import json
import scrapy
from spixiv.items import IllustrationItem

class PixivSpider(scrapy.Spider):
    """Spider for daily top illustrations on Pixiv
    """
    name = 'pixiv'
    allowed_domains = ['pixiv.net']
    start_urls = ['http://www.pixiv.net/ranking.php?mode=daily&content=illust&format=json']

    def parse(self, response):
        jsondata = json.loads(response.body)

        date = jsondata['date']
        for one in jsondata['contents']:
            item = IllustrationItem()
            item['title'] = one['title']
            item['url'] = one['url']
            yield item

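We use the json module from Python's standard library to handle the json data. The Item is used just like a Python dict, and we can also assign objects of complex types, such as sequences, to its fields; of course, any later code that processes the Item must know the type of each field. We use yield to hand each IllustrationItem to the framework so that Scrapy can process it further.

Starting the crawl

Scrapy provides a simple command to start our spider.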

$ scrapy crawl [options] spider

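Here options configures Scrapy's behavior, and spider is the name of our spider. Run scrapy crawl pixiv in the project's root directory (the one containing scrapy.cfg), and by default Scrapy prints the Item data to the console.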

2015-01-01 00:04:48+0800 [pixiv] DEBUG: Scraped from <200 http://www.pixiv.net/ranking.php?mode=daily&content=illust&format=json>
        {'title': u'\u5bb6\u65cf\u3068\u592b\u5a66\u3068',
         'url': u'http://i2.pixiv.net/c/240x480/img-master/img/2014/12/29/23/08/03/47845529_p0_master1200.jpg'}
2015-01-01 00:04:48+0800 [pixiv] DEBUG: Scraped from <200 http://www.pixiv.net/ranking.php?mode=daily&content=illust&format=json>
        {'title': u'\u843d\u66f8\u304d\u307e\u3068\u3081 No.8',
         'url': u'http://i1.pixiv.net/c/240x480/img-master/img/2014/12/29/00/44/17/47829688_p0_master1200.jpg'}
2015-01-01 00:04:48+0800 [pixiv] DEBUG: Scraped from <200 http://www.pixiv.net/ranking.php?mode=daily&content=illust&format=json>
        {'title': u'BB\u3061\u3083\u3093',
         'url': u'http://i3.pixiv.net/c/240x480/img-master/img/2014/12/29/17/16/52/47839178_p0_master1200.jpg'}
2015-01-01 00:04:48+0800 [pixiv] DEBUG: Scraped from <200 http://www.pixiv.net/ranking.php?mode=daily&content=illust&format=json>
        {'title': u'-\u6708\u306e\u60f3\u3044-',
         'url': u'http://i1.pixiv.net/c/240x480/img-master/img/2014/12/29/00/00/08/47828516_p0_master1200.jpg'}

We can also use the options argument to save the Items as json.

$ scrapy crawl pixiv -o illustration.json

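The Item data is then saved in json format to the file illustration.json.

Processing Items further with an ItemPipeline

By default Scrapy creates an ItemPipeline for us that does nothing. To use a Pipeline in the project we also have to configure it in settings.py.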

ITEM_PIPELINES = {
    'spixiv.pipelines.SpixivPipeline': 300,
}

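This gives the location of the Pipeline definition, a class in pipelines. 300 is a priority, needed because more than one Pipeline may want to process the Item. Its value ranges from 0 to 1000, and the smaller the value, the higher the priority.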

class SpixivPipeline(object):
    def process_item(self, item, spider):
        return item

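Scrapy creates this class for us when it creates the project. It also shows that our Pipeline must implement the process_item method. The item parameter is an Item instance and spider is a Spider instance; together they mean that this spider produced this item. That lets us handle the item according to the types of the spider and the item, for example by storing the item in a database for persistent storage.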
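As an illustration of the database idea, here is a sketch of a pipeline that writes each item to SQLite with the standard library. The class name SqlitePipeline, the database file name and the table layout are all my own assumptions; open_spider and close_spider are the optional hooks Scrapy calls on a pipeline when a spider starts and finishes.

```python
import sqlite3


class SqlitePipeline(object):
    """Hypothetical pipeline that stores each item's title and url in SQLite."""

    def __init__(self, path='illustrations.db'):
        self.path = path          # database file name is an assumption
        self.conn = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts.
        self.conn = sqlite3.connect(self.path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS illustration (title TEXT, url TEXT)')

    def process_item(self, item, spider):
        self.conn.execute('INSERT INTO illustration VALUES (?, ?)',
                          (item['title'], item['url']))
        self.conn.commit()
        return item               # pass the item on to the next pipeline

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes.
        self.conn.close()
```

To try it, the class would go into pipelines.py and be registered in ITEM_PIPELINES under a name such as 'spixiv.pipelines.SqlitePipeline'. Returning the item from process_item matters, because later pipelines receive whatever this one returns.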

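Qin's Moon (秦时明月): 君临天下

2014-12-29T00:00:00+08:00

After a wait of nearly 2 years, you are finally out at the end of this month. From the end of 万里长城 (The Great Wall) early last year, through the movie's missed summer release, to the premiere of 龙腾万里 this summer, 君临天下 has finally arrived. The wait was so long that I have nearly forgotten the earlier plot. That said, although Qin's Moon already has four seasons out, there really isn't much plot (the pacing is too slow ==).

Qin's Moon counts as one of the finest Chinese animations (Nano Core (纳米核心) is good too; both go the 3D route). With Japanese anime sweeping the animation world, there are few Chinese animations I still keep watching. If we can't beat Japan in 2D, then let's take them on in 3D; we lack neither technology nor stories (only money!). I hope 玄机科技 lives up to expectations and keeps making Qin's Moon, and making it well.

This season's OP is still the familiar 《月光》 (Moonlight). I won't complain about the plot of the first episode; I hear it is still released monthly......

Notes on Using Scrapy, Part 1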

2014-12-28T00:00:00+08:00

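Scrapy is a crawling framework written in Python, built on Twisted, an asynchronous event-driven library. With Scrapy we can easily write spiders to scrape pages. The official Scrapy documentation includes a simple tutorial, which gives a basic idea of how to write code within the framework.

Installing Scrapy

Installing Scrapy on Windows can be a struggle, mainly because some of the libraries Scrapy depends on are written in C. Even if you have set up gcc or the VC compiler on Windows, the build may still fail because the header files of some library are missing. The official Scrapy documentation has an installation guide for Windows, which you can follow if you don't mind the hassle. On Windows, though, I recommend the Python distribution pythonxy. Its link is probably blocked most of the time, so you can download the distribution some other way; I provide a Baidu netdisk link here. pythonxy is really a collection of Python scientific computing packages. It provides many Python libraries, a lot of which have C extensions, but pythonxy has already compiled them for us, and it also bundles other useful Python libraries, including the ones Scrapy depends on. On Linux and Unix, Scrapy can be installed easily with pip. The Python shipped with most *nix distributions already includes the libraries Scrapy depends on, so we can use pip directly. And if pythonxy is installed on Windows, we can install Scrapy with pip there as well.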

$ pip install scrapy

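That installs the Scrapy library.

Creating a Scrapy Project

We can easily create an initial project with the command line tool that Scrapy provides. This is very friendly to us developers, much like the command line tool that django provides. By looking at the directory structure and file names of the initial project, we can get a rough picture of how the whole project is organized.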

$ scrapy startproject spixiv

This command initializes a scrapy project. Here spixiv is the name of the project. Let's see what Scrapy generated for us.

  spixiv
    ├── scrapy.cfg
    └── spixiv
      ├── __init__.py
      ├── items.py
      ├── pipelines.py
      ├── settings.py
      └── spiders
        └── __init__.py

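We can see that the top-level directory is spixiv, containing a scrapy.cfg file and a spixiv directory (which is also a Python package). We can usually ignore scrapy.cfg; it only gives Scrapy the project's configuration entry point (the real configuration lives in settings.py, and scrapy.cfg merely has a field pointing to it).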

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# http://doc.scrapy.org/en/latest/topics/scrapyd.html

[settings]
default = spixiv.settings

[deploy]
#url = http://localhost:6800/
project = spixiv

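That is the entire content of scrapy.cfg. We write our code mainly under the spixiv directory.

Analyzing the initial project structure

I am very fond of frameworks that provide tools like startproject, because the result is usually the best way to organize a project under that framework. The initial project structure that django provides deserves a mention here: it is very modular, and it offers a glimpse of the framework's own organization and execution flow. Now let's analyze the initial project that Scrapy created for us.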

scrapy.cfg

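This file was mentioned in the previous section. It only provides basic configuration for the Scrapy command line tool (which is also why we run the scrapy command from the project directory that contains this file). Its default field names the project's settings module.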
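scrapy.cfg is an ordinary INI file, so the pointer to the settings module can be read with the standard library. The sketch below uses configparser on the file contents shown earlier; it illustrates the file format and is not Scrapy's own loading code.

```python
import configparser

# The scrapy.cfg contents shown earlier in this post.
SCRAPY_CFG = """
[settings]
default = spixiv.settings

[deploy]
#url = http://localhost:6800/
project = spixiv
"""

config = configparser.ConfigParser()
config.read_string(SCRAPY_CFG)
settings_module = config.get('settings', 'default')
print(settings_module)   # spixiv.settings
```

The value is a dotted Python module path, which is why the inner spixiv directory has to be an importable package.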

spixiv/settings.py

BOT_NAME = 'spixiv'

SPIDER_MODULES = ['spixiv.spiders']
NEWSPIDER_MODULE = 'spixiv.spiders'

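This file is where the real project configuration lives (not that there is much of it). BOT_NAME gives the name of our project (a crawler bot?), and SPIDER_MODULES tells Scrapy which modules to search for the spiders we write. NEWSPIDER_MODULE is optional: if you ask Scrapy to generate a Spider template for you, the generated code is written to the module set here.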

spixiv/items.py

import scrapy

class SpixivItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

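This file is for describing the information we want to scrape. We abstract the scraped information and wrap it into items one by one, and the definitions of those items go in this file. Later, when we discuss the flow of the whole Scrapy framework, we will see the benefit of doing so.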

spixiv/pipelines.py

class SpixivPipeline(object):
    def process_item(self, item, spider):
        return item

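This file defines how items are processed; by default nothing is done. To process items further, write the logic here and also add the corresponding field in settings.py so that Scrapy loads it (it is not loaded by default).

That is the project structure Scrapy created for us. It is very clean, and now we can start writing Spiders.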
