Deep Learning is often confused with AI and Machine Learning, both in how its algorithms are built and in how they are understood. This post is an attempt to give a light but conceptual description of Deep Learning (DL) and its application to day-to-day AI problems.

I was deeply (just like deep learning) inspired by the MIT 6.S191 class, whose ideas I describe here in a simpler way. I have also inserted small Python snippets to aid understanding from a programming point of view.

Evolution of Deep Learning

Deep Learning has revolutionized many areas of research and development, including:

  • Automobile industry, such as autonomous vehicles (don’t forget Tesla’s latest driverless car)
  • Medical and Healthcare including diagnostics and 
  • Reinforcement Learning and generative modelling
  • Robotics
  • Natural Language Processing
  • Finance and security

But before moving any further into DL, let us define learning from different perspectives. Intelligence is the ability to process information in order to make future decisions. With AI, we want machines to learn to do the same.

Machine Learning (ML) is a subset of AI in which machines learn without being explicitly programmed.

Deep Learning is a subset of ML that extracts patterns from data using Neural Networks.

Need of Deep Learning

Traditional hand-engineered features are not scalable, are vulnerable to discontinuities and take time to extract (and we would rather spend that time training the network than extracting information).

Instead, we can let the network learn features directly from raw data at various levels: low-level (edges and lines), mid-level (eyes and nose) and high-level (facial structure) features, as in the case of face recognition.

All of this is possible only because of past efforts by various researchers, together with large datasets, good GPUs and high-end libraries such as TensorFlow and Torch.

Let’s begin with the basic perceptron (every neuron in DL is called a perceptron).

Forward propagation

Multi Input Single Output

The individual inputs x_1, x_2, …, x_k are multiplied by their respective weights and summed. This sum is then passed through a non-linear activation function g(.) to produce the output \hat{y}.

\hat{y} = g \left( \sum_{j=1}^k{x_j w_j} \right)

The bias w_0 allows us to shift the activation function left or right.

Multi Input Single Output with bias
\hat{y} = g \left( w_0 + \sum_{j=1}^k{x_j w_j} \right)

Converting input and weights to vectors 

\hat{y} = g \left( w_0 + X^T W \right) 
where \, \, X = \begin{pmatrix} x_1 \\ \vdots \\ x_k \end{pmatrix} \, \, W = \begin{pmatrix} w_1 \\ \vdots \\ w_k \end{pmatrix}

Sigmoid function: takes any real number and returns a number between 0 and 1, which makes it useful for predicting probabilities.

g(a) = \sigma (a) = \frac{1}{1+e^{-a}}
Sigmoid Activation function
# To plot simple sigmoid activation function
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x_axis = np.linspace(-10, 10, 1000)
y_axis = 1 / (1 + np.exp(-x_axis) )
plt.figure(figsize=(10, 5))
plt.plot(x_axis, y_axis)
plt.legend(['sigmoid function'])
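As a rough illustration, the forward pass \hat{y} = g(w_0 + X^T W) with the sigmoid as g can be sketched in NumPy as below. The function names `sigmoid` and `perceptron` and the numeric values are assumptions made up for this example, not taken from the MIT class.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def perceptron(x, w, w0):
    """Forward pass of a single perceptron: g(w0 + x . w) with g = sigmoid."""
    return sigmoid(w0 + np.dot(x, w))

# Illustrative values only (assumed for this sketch)
x = np.array([1.0, 2.0])    # inputs x_1, x_2
w = np.array([3.0, -2.0])   # weights w_1, w_2
w0 = 1.0                    # bias

# w0 + x.w = 1 + 3 - 4 = 0, so the output is sigmoid(0) = 0.5
y_hat = perceptron(x, w, w0)
print(y_hat)
```

Changing only `w0` moves the point at which the output crosses 0.5, which is the left or right shift that the bias provides.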

Other activations: hyperbolic tangent (tanh) and rectified linear unit (ReLU, piecewise linear)

g(a) = \frac{e^a-e^{-a}}{e^a+e^{-a}} 
Hyperbolic Tangent Activation function
# To plot tanh activation function

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x_axis = np.linspace(-10, 10, 1000)
y_axis = ( 2 / (1 + np.exp(-2*x_axis) ) ) -1

plt.figure(figsize=(10, 5))
plt.plot(x_axis, y_axis)
plt.legend(['hyperbolic tangent'])
g(a) = max(0,a)
ReLu Activation function
# To plot ReLu activation function

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

x_axis = np.linspace(-10, 10, 1000)
y_axis = np.maximum(0, x_axis)

plt.figure(figsize=(10, 5))
plt.plot(x_axis, y_axis)

For more detailed study refer to this link.

Need of Activation Functions

The main purpose of an activation function is to introduce non-linearity into the network. Real-world data is generally non-linear. With linear activation functions the network can only produce linear decision boundaries (a single straight line in two dimensions), no matter how many layers it has. Non-linear activations allow us to approximate arbitrarily complex functions and so draw arbitrarily complex decision boundaries in the input space.
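A small numerical check may make this concrete. The sketch below, with hand-picked weights that are purely an assumption for illustration, shows that two layers with a linear activation are equivalent to one linear layer, while putting a ReLU between them gives a different function.

```python
import numpy as np

# Hand-picked illustrative values (assumed, not from the post)
x = np.array([[1.0, 2.0],
              [-1.0, 0.5]])          # two example inputs
W1 = np.array([[1.0, -1.0],
               [0.5, 1.0]])          # first layer weights
W2 = np.array([[1.0],
               [1.0]])               # second layer weights

# Two layers with a linear (identity) activation ...
two_linear_layers = (x @ W1) @ W2
# ... are the same as one linear layer with weights W1 @ W2.
one_linear_layer = x @ (W1 @ W2)
collapses = np.allclose(two_linear_layers, one_linear_layer)

# With a ReLU in between, the result is no longer that single linear map.
with_relu = np.maximum(0, x @ W1) @ W2
differs = not np.allclose(with_relu, one_linear_layer)

print(collapses, differs)
```

The second input produces a negative hidden value, which the ReLU sets to zero; that is the only place where the two networks disagree.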

Multiple Input Multi Output perceptron

Multi Input Multi Output Perceptron
y_1 = g(z_1) \, \, and \, \,y_2 =g(z_2)
z_j = w_{0,j} + \sum_{i=1}^{k} x_i w_{i,j}
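# Tensorflow Snippet

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

k, d1 = 3, 4  # example sizes (assumed): k inputs, d1 hidden units
inputs = Input(shape=(k,))
hidden = Dense(d1)(inputs)
outputs = Dense(2)(hidden)  # two outputs, y_1 and y_2
model = Model(inputs, outputs)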

Deep Neural Network

Deep Neural Network
z_{l,m} = w_{0,m}^{(l)} + \sum_{i=1}^{d_{l-1}} g(z_{l-1,i}) \, w_{i,m}^{(l)}
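To make the stacking concrete, here is a minimal NumPy sketch in which each layer computes a bias plus a weighted sum of the previous layer's activations. The layer sizes, the choice of ReLU for g and the helper name `forward` are assumptions for illustration only.

```python
import numpy as np

def relu(a):
    return np.maximum(0, a)

def forward(x, layers, g=relu):
    """Apply each layer in turn: z = b + a_prev @ W, then a = g(z).
    The last layer's pre-activation z is returned without applying g."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = b + a @ W
        a = g(z) if i < len(layers) - 1 else z
    return a

rng = np.random.default_rng(0)
sizes = [2, 4, 3, 1]   # input, two hidden layers, output (assumed sizes)
layers = [(rng.normal(size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(5, 2))   # 5 examples with 2 features each
out = forward(x, layers)
print(out.shape)  # (5, 1)
```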

Example: Let us consider an example where I need to predict which students will fail the class.

x_1 = \text{Total number of lectures attended}
x_2 = \text{Hours spent on self study and homework}

Consider a single-hidden-layer network whose output in the first pass is 0.9 for the input [5, 9], which does not match the actual outcome. This is because the network has not been trained even once, so it has not yet learned anything from the data. Hence we compute a loss to correct the wrong predictions and, indirectly, teach the network whenever it makes a mistake.

Quantifying loss: the loss \mathcal{L}(f(x^{(j)},W),y^{(j)}) measures the cost incurred by a wrong prediction when compared with the actual value,

where, \, f(x^{(j)},W) = predicted\, and \, \, y^{(j)} = actual

Empirical loss, also called the cost function, objective function or empirical risk, measures the loss averaged over the whole dataset.

J(W) = \frac{1}{p} \sum_{j=1}^{p} \mathcal{L}(f(x^{(j)},W),y^{(j)})

Losses by output task: cross-entropy loss (when the output is a class, 0 or 1, predicted as a probability) and mean squared error loss (when the output is a continuous real number).

# Tensorflow Snippet
import tensorflow as tf

# Binary Cross Entropy Loss (sigmoid variant; model.pred must hold raw logits, not probabilities)
loss = tf.reduce_mean( tf.nn.sigmoid_cross_entropy_with_logits(labels=model.y, logits=model.pred) )
# Mean Square Error Loss
loss = tf.reduce_mean( tf.square(tf.subtract(model.y, model.pred)) )
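For readers who want to see the arithmetic, the two losses can also be written out by hand. The following is a NumPy sketch with made-up labels and predictions; the function names and the `eps` clipping constant are assumptions for this example.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)]; p is clipped to avoid log(0)."""
    p = np.clip(p_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

# Illustrative data only
y_true = np.array([1.0, 0.0, 1.0])   # actual labels
p_pred = np.array([0.9, 0.2, 0.6])   # predicted probabilities

bce = binary_cross_entropy(y_true, p_pred)
mse = mean_squared_error(y_true, p_pred)   # (0.01 + 0.04 + 0.16) / 3 = 0.07
print(bce, mse)
```

Note that this hand-written cross entropy takes probabilities, whereas the TensorFlow `*_with_logits` functions expect the raw values before the sigmoid.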

Training Neural Networks

Loss Optimisation

We need to find the weights that give the minimum loss.

W^{*} = \underset{W}{\operatorname{argmin}} \, \frac{1}{p} \sum_{j=1}^{p} \mathcal{L}(f(x^{(j)},W),y^{(j)})
W^{*} = \underset{W}{\operatorname{argmin}} \,\, J(W) \\
where, W= \{W^{(0)},W^{(1)}, \dots\}

\rightarrow Loss is a function of weights of the network

Steps to minimise J(w_0,w_1)

  1. Randomly choose the initial value of (w_0,w_1)
  2. Compute the gradient \partial J(W) / \partial W
  3. Take a small step in the opposite direction of the gradient
  4. Repeat until convergence
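The four steps can be tried on a toy problem. The sketch below assumes a simple bowl-shaped loss J(w_0, w_1) = (w_0 - 3)^2 + (w_1 + 1)^2, chosen only because its minimum at (3, -1) is known in advance; a real network's loss is far more complicated.

```python
import numpy as np

def J(w):
    """Toy loss (assumed for illustration) with its minimum at w = (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def grad_J(w):
    """Gradient of the toy loss with respect to (w_0, w_1)."""
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

rng = np.random.default_rng(0)
w = rng.normal(size=2)               # 1. random initial (w_0, w_1)
eta = 0.1                            # step size (learning rate)

for step in range(1000):
    g = grad_J(w)                    # 2. compute the gradient
    if np.linalg.norm(g) < 1e-8:     # 4. stop once converged
        break
    w = w - eta * g                  # 3. small step against the gradient

print(w, J(w))
```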

A good read on weight initialisation and activation function using python is here.

Gradient Descent Algorithm


  1. Random initialisation of weights \sim N(0,\sigma^2)
  2. Continue until convergence
    1. Compute gradient, \partial J(W)/ \partial W
    2. Update weights, W \leftarrow W - \eta \, \partial J(W)/\partial W
  3. Return weights
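# Tensorflow snippet
import tensorflow as tf
# TF 2.x sketch; shape, sigma, lr and compute_loss (hypothetical) are assumed to be defined elsewhere
weights = tf.Variable(tf.random.normal(shape, stddev=sigma))
with tf.GradientTape() as tape:
    loss = compute_loss(weights)
grads = tape.gradient(loss, weights)
weights.assign_sub(lr * grads)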

When there are too many local minima, functions can be difficult to optimize.

  • Optimisation through gradient descent
W \leftarrow W- \eta \, (\partial J(W) / \partial W)
  • Learning Rates
  • Small learning rates converge slowly and can get stuck in false local minima
  • Large learning rates overshoot, become unstable and diverge
  • Stable learning rates converge smoothly and avoid local minima
  • Either tune \eta by trying several values and comparing the loss, or employ adaptive techniques that change \eta according to the gradients
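The effect of the learning rate is easy to see on a one-dimensional toy loss. The sketch below assumes J(w) = w^2, which has a single minimum at 0, so it illustrates slow convergence and divergence but not the local-minima behaviour; the three \eta values are arbitrary choices for illustration.

```python
def run_gd(eta, w0=1.0, steps=20):
    """Gradient descent on the toy loss J(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w

small = run_gd(eta=0.001)   # too small: after 20 steps w has barely moved from 1.0
good = run_gd(eta=0.3)      # stable: w shrinks towards the minimum at 0
large = run_gd(eta=1.1)     # too large: every step overshoots and |w| grows

print(small, good, large)
```

Trying a handful of values like this and comparing the resulting loss is the manual tuning approach; adaptive methods adjust the step size during training instead.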

Stochastic Gradient Descent

In the GD algorithm, the gradient \partial J(W) / \partial W is computed over the whole dataset, which can be computationally expensive. Hence we go for the SGD algorithm.


  1. Random initialisation of weights \sim N(0,\sigma^2)
  2. Continue until convergence
    • Select a single data point k
    • Compute gradient, \partial J_k(W)/ \partial W
    • Update weights, W \leftarrow W - \eta \, \partial J_k(W)/\partial W
  3. Return weights

\partial J_k(W)/\partial W is easy to compute but noisy, because it is estimated from a single data point. Using a mini-batch of D data points instead gives a better estimate:


  1. Random initialisation of weights \sim N(0,\sigma^2)
  2. Continue until convergence
    • Select a mini-batch of D data points
    • Compute gradient, \partial J(W)/ \partial W = \frac{1}{D} \sum_{i=1}^D \partial J_i(W)/\partial W
    • Update weights, W \leftarrow W - \eta \, \partial J(W)/\partial W
  3. Return weights

\partial J(W)/ \partial W = \frac{1}{D} \sum_{i=1}^D \partial J_i(W)/\partial W is faster to compute than the full gradient and gives a better gradient estimate than a single data point.

  • Mini-batches lead to faster training because the computation can be parallelised, which GPUs do efficiently.
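The mini-batch procedure can be sketched in NumPy on a small regression problem. Everything specific here is an assumption for illustration: the synthetic data y = 2x + 1, the squared-error loss, the batch size D = 32, the learning rate, and initialising the weights at zero rather than randomly to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): y = 2*x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=200)

w, b = 0.0, 0.0      # weights to learn
eta, D = 0.1, 32     # learning rate and mini-batch size

for epoch in range(200):
    order = rng.permutation(len(X))          # reshuffle the data every epoch
    for start in range(0, len(X), D):
        idx = order[start:start + D]         # select a mini-batch of D points
        xb, yb = X[idx, 0], y[idx]
        err = (w * xb + b) - yb
        grad_w = 2.0 * np.mean(err * xb)     # gradient averaged over the batch
        grad_b = 2.0 * np.mean(err)
        w -= eta * grad_w                    # update the weights
        b -= eta * grad_b

print(w, b)   # should end up close to 2 and 1
```

Setting `D = 1` gives the single-point SGD variant, and `D = len(X)` gives plain gradient descent over the whole dataset.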

Problem of Trained Neural Network

Underfit: the model does not have enough capacity to learn the data

Overfit: the model is too complex; the extra parameters do not help it generalise

Somewhere between underfitting and overfitting is the ideal fit, which we all intend to reach. For a more detailed description of this concept you may refer to this amazing article.


Regularization is a procedure that constrains the optimisation problem to discourage complex models. We use it so that our model generalises to unseen data.

Dropout and Early Stopping

  • Dropout: while training the network, randomly set some activations to zero.
  • Typically drop 50\% of the activations in the hidden layers.
  • This forces the network not to rely fully on any single node.
  • Early stopping: end the training before the network starts to overfit.
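Both ideas can be sketched in a few lines of NumPy. The details below are assumptions rather than anything stated in the post: the dropout uses the common "inverted" variant that rescales the surviving activations, and early stopping is expressed as a hypothetical `patience` rule on a list of validation losses.

```python
import numpy as np

def dropout(activations, rate=0.5, rng=None):
    """Inverted dropout (training time only): zero a random fraction `rate`
    of the activations and rescale the rest so the expected value is unchanged."""
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which to stop: the first epoch where the
    validation loss has failed to improve `patience` times in a row."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1

h = np.ones((4, 1000))                                   # dummy hidden activations
h_drop = dropout(h, rate=0.5, rng=np.random.default_rng(0))

# Validation loss falls, then rises: stop once it has risen twice in a row.
stop_at = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.8])
print(h_drop.mean(), stop_at)
```

In this made-up loss curve the best epoch is index 2, and training stops at index 4 after two epochs without improvement.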