Brief Introduction to Cyclic Networks

Artificial Neural Networks (ANNs) have evolved tremendously, with a variety of network types suited to different applications according to their individual properties. An ANN has a simple structure consisting of nodes (also called processing units) connected to each other via weights. The network is stimulated by presenting input to a few or all of the nodes, and this stimulation, also called activation, spreads through the entire network.

The way in which layers are connected and fed categorizes ANNs into feed-forward networks (FFNs) or feed-back networks (FBNs). FFNs are acyclic in nature, i.e. signals travel only forward from input to output; whereas FBNs are cyclically connected, i.e. some layers receive connections from other layers recursively. The most renowned FBNs are RNNs (Recurrent Neural Networks) and Hopfield nets, while well-known FFNs include MLPs (Multi-layer Perceptrons), RBF (Radial Basis Function) networks, and Kohonen maps. Let’s begin with MLPs to progressively understand RNNs.

Multi-layer Perceptron Networks (MLP)

In an MLP, multiple layers are connected via weights, with each layer feeding its output to the next. The pattern data to be trained on is fed to the input layer and propagated through the hidden units to the output units; this constitutes the forward pass of the network. The hidden units apply a non-linear activation function. MLPs are well suited to pattern classification tasks since their output depends only on the current input: each input vector is mapped to an output vector. An MLP with a single hidden layer and a sufficient number of neurons can approximate any continuous function, hence the name universal function approximator.
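
As a concrete illustration, here is a minimal NumPy sketch of such a forward pass with a single hidden layer. The tanh activation, the layer sizes, and the linear output layer are illustrative assumptions, not anything prescribed above.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a single-hidden-layer MLP.

    x  : input vector, shape (I,)
    W1 : input-to-hidden weights, shape (H, I);  b1 : hidden biases, shape (H,)
    W2 : hidden-to-output weights, shape (K, H); b2 : output biases, shape (K,)
    """
    h = np.tanh(W1 @ x + b1)   # non-linear hidden activations
    y = W2 @ h + b2            # linear output layer (illustrative choice)
    return h, y

# Illustrative sizes: 4 inputs, 8 hidden units, 3 outputs
rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = 0.1 * rng.standard_normal((3, 8)), np.zeros(3)
h, y = mlp_forward(rng.standard_normal(4), W1, b1, W2, b2)
```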

During the backward pass, the weights are updated in the direction of the negative gradient using gradient descent. The backpropagation algorithm is used to compute the derivatives of the objective function with respect to the network weights, working backwards from the output units.
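
Continuing the sketch above, the following example computes these gradients by backpropagation and takes one gradient-descent step. It assumes a squared-error objective and the linear output layer from before; both are illustrative choices, not requirements.

```python
def mlp_backward(x, h, y, target, W2):
    """Backpropagation for the MLP sketched above, assuming a squared-error
    objective O = 0.5 * ||y - target||^2 and a linear output layer."""
    delta_out = y - target                        # dO / d(output net input)
    delta_hid = (W2.T @ delta_out) * (1 - h**2)   # chain rule through tanh
    return (np.outer(delta_hid, x), delta_hid,    # gradients for W1, b1
            np.outer(delta_out, h), delta_out)    # gradients for W2, b2

# One gradient-descent step in the direction of the negative gradient
x, target = rng.standard_normal(4), np.zeros(3)
h, y = mlp_forward(x, W1, b1, W2, b2)
gW1, gb1, gW2, gb2 = mlp_backward(x, h, y, target, W2)
lr = 0.01
W1 -= lr * gW1; b1 -= lr * gb1
W2 -= lr * gW2; b2 -= lr * gb2
```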

Figure: Multi-layer Perceptron vs. Recurrent Neural Network

Recurrent Neural Networks

An RNN can be considered an extension of the MLP that maps the entire history of previous inputs to each output. The forward and backward passes of an RNN are similar to those of an MLP. The only difference in the forward pass is that the stimulation of the hidden layer comes both from the hidden layer at the previous time step and from the current external input. Consider an input sequence of length T presented to an RNN with I input units, H hidden units, and K output units. Let f_i^t be the value of input i at time t, and let p_j^t and q_j^t be respectively the network input to unit j at time t and the activation of unit j at time t. Without going into too much mathematics, the essential equations for the hidden units in the forward pass are as follows:

p_h^t = \sum_{i=1}^I w_{ih} f_i^t + \sum_{h'=1}^H w_{h'h} q_{h'}^{t-1}

A differentiable, non-linear activation function is then applied,

q_h^t = \theta_h(p_h^t)

The hidden activations for the complete sequence can be calculated by applying the above equations recursively, starting from t = 1. The algorithm requires initial values q_h^0, which are often taken to be zero, although researchers suggest non-zero values to make the RNN more robust and stable. The network inputs to the output units can then be computed as follows:

p_k^t=\sum_{h=1}^H w_{hk} q_h^t
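
To make the forward pass concrete, here is a minimal NumPy sketch of these three equations applied over a whole sequence. The tanh non-linearity stands in for the generic activation \theta_h, and the weight-matrix shapes are one possible convention, not something fixed by the equations.

```python
import numpy as np

def rnn_forward(f, W_ih, W_hh, W_hk, q0=None):
    """Forward pass of a simple RNN, following the equations above.

    f    : input sequence, shape (T, I) -- f[t] is the input vector at time t+1
    W_ih : input-to-hidden weights,  shape (I, H)
    W_hh : hidden-to-hidden weights, shape (H, H)
    W_hk : hidden-to-output weights, shape (H, K)
    q0   : initial hidden activations q_h^0 (zeros if not given)
    """
    T, _ = f.shape
    H, K = W_hh.shape[0], W_hk.shape[1]
    q = np.zeros((T + 1, H))             # q[t] holds the hidden activations q_h^t
    if q0 is not None:
        q[0] = q0
    p_out = np.zeros((T, K))
    for t in range(1, T + 1):
        p_h = f[t - 1] @ W_ih + q[t - 1] @ W_hh   # p_h^t
        q[t] = np.tanh(p_h)                       # q_h^t = theta_h(p_h^t)
        p_out[t - 1] = q[t] @ W_hk                # p_k^t
    return q, p_out

# Illustrative usage with random weights
rng = np.random.default_rng(0)
T, I, H, K = 5, 3, 8, 2
q, p_out = rnn_forward(rng.standard_normal((T, I)),
                       0.1 * rng.standard_normal((I, H)),
                       0.1 * rng.standard_normal((H, H)),
                       0.1 * rng.standard_normal((H, K)))
```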

In the backward pass, the weight derivatives need to be computed, for which two well-known algorithms are used: real-time recurrent learning (RTRL) and backpropagation through time (BPTT). We focus on BPTT since it is both conceptually simpler and more efficient in computation time (though not in memory). As in ordinary backpropagation, BPTT involves repeated application of the chain rule, except that the objective function depends on the activation of the hidden layer not only through its influence on the output layer, but also through its influence on the hidden layer at the next time step, i.e.

\delta_h^t = \theta'(p_h^t) \left( \sum_{k=1}^K \delta_k^t w_{hk} + \sum_{h'=1}^H \delta_{h'}^{t+1} w_{hh'} \right)

where,

\delta_j^t = \frac{\partial O}{\partial p_j^t}

The entire sequence of \delta terms can be calculated by starting at t = T and applying the above equation recursively, decrementing t at each step (note that \delta_j^{T+1} = 0 \; \forall j). Finally, bearing in mind that the weights to and from each unit in the hidden layer are the same at every time step, we sum over the whole sequence to get the derivatives with respect to each of the network weights:

\frac{\partial O}{\partial w_{ij}} = \sum_{t=1}^T \frac{\partial O}{\partial p_j^t} \frac{\partial p_j^t}{\partial w_{ij}} = \sum_{t=1}^T \delta_j^t q_i^t
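
The sketch below applies these equations to the forward pass implemented earlier. It assumes tanh hidden units (so \theta'(p_h^t) = 1 - (q_h^t)^2), linear output units, and a squared-error objective; these are assumptions made for illustration, not part of the derivation above. Starting from \delta_h^{T+1} = 0 and walking backwards over t mirrors the recursion in the text, and the accumulated outer products implement the final sum over the sequence.

```python
import numpy as np

def rnn_bptt(f, q, p_out, targets, W_hh, W_hk):
    """BPTT for the RNN sketched above, assuming tanh hidden units, linear
    outputs, and a squared-error objective O = 0.5 * sum_t ||p_out[t] - targets[t]||^2.

    f       : inputs, shape (T, I);  q : hidden activations from rnn_forward, shape (T+1, H)
    p_out   : output net inputs, shape (T, K);  targets : shape (T, K)
    Returns the gradients for W_ih, W_hh, W_hk.
    """
    T, I = f.shape
    H, K = W_hh.shape[0], W_hk.shape[1]
    grad_W_ih = np.zeros((I, H))
    grad_W_hh = np.zeros((H, H))
    grad_W_hk = np.zeros((H, K))
    delta_h_next = np.zeros(H)                       # delta_h^{T+1} = 0
    for t in range(T, 0, -1):
        delta_k = p_out[t - 1] - targets[t - 1]      # output deltas delta_k^t
        # delta_h^t = theta'(p_h^t) * (sum_k delta_k^t w_hk + sum_h' delta_h'^{t+1} w_hh')
        delta_h = (1 - q[t] ** 2) * (W_hk @ delta_k + W_hh @ delta_h_next)
        grad_W_hk += np.outer(q[t], delta_k)         # sum_t delta_k^t q_h^t
        grad_W_hh += np.outer(q[t - 1], delta_h)     # sum_t delta_h^t q_h'^{t-1}
        grad_W_ih += np.outer(f[t - 1], delta_h)     # sum_t delta_h^t f_i^t
        delta_h_next = delta_h
    return grad_W_ih, grad_W_hh, grad_W_hk
```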

The coming sections will cover RNN variants, the vanishing gradient problem, and implementations.

Categories: Deep Learning