NEURAL NETWORK TUTORIAL | Toronto 2017-08-06T04:41:42+00:00

Neural Network Tutorial | Neural Networks Fundamentals using TensorFlow

TensorFlow, Deep Learning, Machine Learning / AI



Neural Network Tutorial: This course will give you knowledge of neural networks and, more generally, of machine learning and deep learning (algorithms and applications).

This training is more focused on fundamentals, but it will help you choose the right technology: TensorFlow, Caffe, Theano, DeepDrive, Keras, etc. The examples are made in TensorFlow.

The average salary for an engineer working with neural networks and TensorFlow is $175,119.



In Class: $9,999
Next Session: 15th Jul 2017

Online: $3,999
Next Session: 15th Jul 2017


Neural Networks Fundamentals using TensorFlow

Instructor: John Doe, Lamar George




TensorFlow is an open source software library for numerical computation using data flow graphs. Its flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.

This course addresses common commercial machine learning problems using Google’s TensorFlow library. It will not only help you discover what TensorFlow is and how to use it but will also show you the unbelievable things that can be done in machine learning with the help of examples and real-world use cases. We start off with the basic installation of TensorFlow, moving on to covering the unique features of the library such as Data Flow Graphs, training, and visualization of performance with TensorBoard—all within an example-rich context using problems from multiple sources. The focus is on introducing new concepts through problems that are coded and solved over the course of each section.


TENSORFLOW BASICS

Lecture1.1 Creation, Initializing, Saving, and Restoring TensorFlow variables
Lecture1.2 Feeding, Reading and Preloading TensorFlow Data
Lecture1.3 How to use TensorFlow infrastructure to train models at scale
Lecture1.4 Visualizing and Evaluating models with TensorBoard

TENSORFLOW MECHANICS

Lecture2.1 Inputs and Placeholders
Lecture2.2 Build the Graph
Lecture2.3 Inference
Lecture2.4 Loss
Lecture2.5 Training
Lecture2.6 Train the Model
Lecture2.7 The Graph
Lecture2.8 The Session
Lecture2.9 Train Loop
Lecture2.10 Evaluate the Model
Lecture2.11 Build the Eval Graph
Lecture2.12 Eval Output

THE PERCEPTRON

Lecture3.1 Activation functions
Lecture3.2 The perceptron learning algorithm
Lecture3.3 Binary classification with the perceptron
Lecture3.4 Document classification with the perceptron
Lecture3.5 Limitations of the perceptron
Lecture3.6 Minimizing the cost function
Lecture3.7 Forward propagation
Lecture3.8 Back propagation


SUPPORT VECTOR MACHINES

Lecture4.1 Kernels and the kernel trick
Lecture4.2 Maximum margin classification and support vectors
Lecture4.3 Nonlinear decision boundaries

ARTIFICIAL NEURAL NETWORKS

Lecture5.1 Nonlinear decision boundaries
Lecture5.2 Feedforward and feedback artificial neural networks
Lecture5.3 Multilayer perceptrons
Lecture5.4 Improving the way neural networks learn


Lecture6.1 Goals
Lecture6.2 Model Architecture
Lecture6.3 Principles
Lecture6.4 Code Organization
Lecture6.5 Launching and Training the Model
Lecture6.6 Evaluating a Model

Online: $3,999
Next Batch: starts from 17th July 2017

In Class: $9,999
Locations: New York City, D.C., Bay Area
Next Batch: starts from 17th July 2017




John Doe
Learning Scientist & Master Trainer John Doe has been a professional educator for the past 20 years. He’s taught, tutored, and coached over 1000 students, and he holds degrees in Physics and Literature from Northwestern University. He has spent the last 4 years studying how people learn to code and develop applications.


Lamar George
Learning Scientist & Master Trainer Lamar George has been a professional educator for the past 20 years. He’s taught, tutored, and coached over 1000 students, and he holds degrees in Physics and Literature from Northwestern University. He has spent the last 4 years studying how people learn to code and develop applications.


Skill level: Intermediate
Language: English
Certificate: No
Assessments: Self
Prerequisites: Basic Python programming






Data Science Bootcamp
Deep Learning with TensorFlow (In-Class or Online)

Good grounding in basic machine learning. Programming skills in any language (ideally Python/R).

Instructors: John Doe, Lamar George
50 hours
Lectures:  25

Neural Networks Fundamentals using TensorFlow Training (In-Class or Online)

Good grounding in basic machine learning. Programming skills in any language (ideally Python/R).

Instructors: John Doe, Lamar George
50 hours
Lectures:  25

Deep learning tutorial

TensorFlow for Image Recognition Bootcamp (In-Class and Online)

Good grounding in basic machine learning. Programming skills in any language (ideally Python/R).

Instructors: John Doe, Lamar George
50 hours
Lectures:  25




What is the duration of the course?

The duration of an advanced course like Neural Networks Fundamentals using TensorFlow largely depends on trainee requirements; we recommend consulting one of our advisors for a specific course duration.

What If I Miss A Class?

We record each LIVE class session you attend, and we will share the recordings of each session/class with you.

Can I Request For A Support Session If I Find Difficulty In Grasping Topics?

If you have any queries, you can contact our 24/7 dedicated support team to raise a ticket. We provide email support and solutions to your queries. If a query is not resolved by email, we can arrange a one-on-one session with our trainers.

What Kind Of Projects Will I Be Working On As Part Of The Training?

You will work on real-world projects in which you can apply the knowledge and skills you acquired through our training. We have multiple projects that thoroughly test your skills and knowledge of various aspects and components, making you industry-ready.

How Will I Execute The Practical?

Our trainers will provide environment/server access to students, and we ensure practical, real-time experience by providing all the utilities required for an in-depth understanding of the course.

Are These Classes Conducted Via Live Online Streaming?

Yes. All training sessions are streamed LIVE online through either WebEx or GoToMeeting, promoting one-on-one trainer–student interaction.

Will the course fetch me a job?

The Neural Networks Fundamentals using TensorFlow training by BigDataGuys will not only strengthen your CV but will also offer you global exposure with enormous growth potential.





One of the most striking facts about neural networks is that they can compute any function at all. That is, suppose someone hands you some complicated, wiggly function, f(x):


No matter what the function, there is guaranteed to be a neural network so that for every possible input, x, the value f(x) (or some close approximation) is output from the network, e.g.:


This result holds even if the function has many inputs, f = f(x1, …, xm), and many outputs. For instance, here’s a network computing a function with m = 3 inputs and n = 2 outputs:


This result tells us that neural networks have a kind of universality. No matter what function we want to compute, we know that there is a neural network which can do the job.


What’s more, this universality theorem holds even if we restrict our networks to have just a single layer intermediate between the input and the output neurons – a so-called single hidden layer. So even very simple network architectures can be extremely powerful.


The universality theorem is well known by people who use neural networks. But why it’s true is not so widely understood. Most of the explanations available are quite technical. For instance, one of the original papers proving the result* *Approximation by superpositions of a sigmoidal function, by George Cybenko (1989). The result was very much in the air at the time, and several groups proved closely related results. Cybenko’s paper contains a useful discussion of much of that work. Another important early paper is Multilayer feedforward networks are universal approximators, by Kurt Hornik, Maxwell Stinchcombe, and Halbert White (1989). This paper uses the Stone-Weierstrass theorem to arrive at similar results.* did so using the Hahn-Banach theorem, the Riesz Representation theorem, and some Fourier analysis. If you’re a mathematician the argument is not difficult to follow, but it’s not so easy for most people. That’s a pity, since the underlying reasons for universality are simple and beautiful.


In this chapter I give a simple and mostly visual explanation of the universality theorem. We’ll go step by step through the underlying ideas. You’ll understand why it’s true that neural networks can compute any function. You’ll understand some of the limitations of the result. And you’ll understand how the result relates to deep neural networks.


To follow the material in the chapter, you do not need to have read earlier chapters in this book. Instead, the chapter is structured to be enjoyable as a self-contained essay. Provided you have just a little basic familiarity with neural networks, you should be able to follow the explanation. I will, however, provide occasional links to earlier material, to help fill in any gaps in your knowledge.


Universality theorems are a commonplace in computer science, so much so that we sometimes forget how astonishing they are. But it’s worth reminding ourselves: the ability to compute an arbitrary function is truly remarkable. Almost any process you can imagine can be thought of as function computation. Consider the problem of naming a piece of music based on a short sample of the piece. That can be thought of as computing a function. Or consider the problem of translating a Chinese text into English. Again, that can be thought of as computing a function* *Actually, computing one of many functions, since there are often many acceptable translations of a given piece of text.*. Or consider the problem of taking an mp4 movie file and generating a description of the plot of the movie, and a discussion of the quality of the acting. Again, that can be thought of as a kind of function computation* *Ditto the remark about translation and there being many possible functions.*. Universality means that, in principle, neural networks can do all these things and many more.


Of course, just because we know a neural network exists that can (say) translate Chinese text into English, that doesn’t mean we have good techniques for constructing or even recognizing such a network. This limitation applies also to traditional universality theorems for models such as Boolean circuits. But, as we’ve seen earlier in the book, neural networks have powerful algorithms for learning functions. That combination of learning algorithms + universality is an attractive mix. Up to now, the book has focused on the learning algorithms. In this chapter, we focus on universality, and what it means.


Two caveats


Before explaining why the universality theorem is true, I want to mention two caveats to the informal statement “a neural network can compute any function”.


First, this doesn’t mean that a network can be used to exactly compute any function. Rather, we can get an approximation that is as good as we want. By increasing the number of hidden neurons we can improve the approximation. For instance, earlier I illustrated a network computing some function f(x) using three hidden neurons. For most functions only a low-quality approximation will be possible using three hidden neurons. By increasing the number of hidden neurons (say, to five) we can typically get a better approximation:


And we can do still better by further increasing the number of hidden neurons.


To make this statement more precise, suppose we’re given a function f(x) which we’d like to compute to within some desired accuracy ϵ > 0. The guarantee is that by using enough hidden neurons we can always find a neural network whose output g(x) satisfies |g(x) − f(x)| < ϵ for all inputs x. In other words, the approximation will be good to within the desired accuracy for every possible input.


The second caveat is that the class of functions which can be approximated in the way described are the continuous functions. If a function is discontinuous, i.e., makes sudden, sharp jumps, then it won’t in general be possible to approximate it using a neural net. This is not surprising, since our neural networks compute continuous functions of their input. However, even if the function we’d really like to compute is discontinuous, it’s often the case that a continuous approximation is good enough. If that’s so, then we can use a neural network. In practice, this is not usually an important limitation.


Summing up, a more precise statement of the universality theorem is that neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision. In this chapter we’ll actually prove a slightly weaker version of this result, using two hidden layers instead of one. In the problems I’ll briefly outline how the explanation can, with a few tweaks, be adapted to give a proof which uses only a single hidden layer.


Universality with one input and one output


To understand why the universality theorem is true, let’s start by understanding how to construct a neural network which approximates a function with just one input and one output:


It turns out that this is the core of the problem of universality. Once we’ve understood this special case it’s actually pretty easy to extend to functions with many inputs and many outputs.


To build insight into how to construct a network to compute f, let’s start with a network containing just a single hidden layer, with two hidden neurons, and an output layer containing a single output neuron:


To get a feel for how components in the network work, let’s focus on the top hidden neuron. In the diagram below, click on the weight, w, and drag the mouse a little ways to the right to increase w. You can immediately see how the function computed by the top hidden neuron changes:


As we learnt earlier in the book, what’s being computed by the hidden neuron is σ(wx + b), where σ(z) ≡ 1/(1 + e^−z) is the sigmoid function. Up to now, we’ve made frequent use of this algebraic form. But for the proof of universality we will obtain more insight by ignoring the algebra entirely, and instead manipulating and observing the shape shown in the graph. This won’t just give us a better feel for what’s going on, it will also give us a proof* *Strictly speaking, the visual approach I’m taking isn’t what’s traditionally thought of as a proof. But I believe the visual approach gives more insight into why the result is true than a traditional proof. And, of course, that kind of insight is the real purpose behind a proof. Occasionally, there will be small gaps in the reasoning I present: places where I make a visual argument that is plausible, but not quite rigorous. If this bothers you, then consider it a challenge to fill in the missing steps. But don’t lose sight of the real purpose: to understand why the universality theorem is true.* of universality that applies to activation functions other than the sigmoid function.
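A quick numerical sketch of this neuron, in plain Python rather than the interactive diagram (the function names here are illustrative, not from the book):

```python
import math

def sigmoid(z):
    """The logistic sigmoid, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hidden_output(x, w, b):
    """Output of a single sigmoid hidden neuron: sigma(w*x + b)."""
    return sigmoid(w * x + b)

# With a modest weight the output is a smooth S-curve; at x = -b/w it is 0.5.
print(round(hidden_output(0.5, w=8, b=-4), 3))             # midpoint: 0.5

# With a very large weight the output is nearly a step at x = -b/w = 0.4.
print(round(hidden_output(0.39, w=999, b=-0.4 * 999), 3))  # just below the step: 0.0
print(round(hidden_output(0.41, w=999, b=-0.4 * 999), 3))  # just above the step: 1.0
```

The same code works unchanged for other smooth, step-like activation functions, which is part of the point of the visual argument.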


To get started on this proof, try clicking on the bias, b, in the diagram above, and dragging to the right to increase it. You’ll see that as the bias increases the graph moves to the left, but its shape doesn’t change.


Next, click and drag to the left in order to decrease the bias. You’ll see that as the bias decreases the graph moves to the right, but, again, its shape doesn’t change.


Next, decrease the weight to around 2 or 3. You’ll see that as you decrease the weight, the curve broadens out. You might need to change the bias as well, in order to keep the curve in-frame.


Finally, increase the weight up past w = 100. As you do, the curve gets steeper, until eventually it begins to look like a step function. Try to adjust the bias so the step occurs near x = 0.3. The following short clip shows what your result should look like. Click on the play button to play (or replay) the video:



We can simplify our analysis quite a bit by increasing the weight so much that the output really is a step function, to a very good approximation. Below I’ve plotted the output from the top hidden neuron when the weight is w = 999. Note that this plot is static, and you can’t change parameters such as the weight.




It’s actually quite a bit easier to work with step functions than general sigmoid functions. The reason is that in the output layer we add up contributions from all the hidden neurons. It’s easy to analyze the sum of a bunch of step functions, but rather more difficult to reason about what happens when you add up a bunch of sigmoid shaped curves. And so it makes things much easier to assume that our hidden neurons are outputting step functions. More concretely, we do this by fixing the weight w to be some very large value, and then setting the position of the step by modifying the bias. Of course, treating the output as a step function is an approximation, but it’s a very good approximation, and for now we’ll treat it as exact. I’ll come back later to discuss the impact of deviations from this approximation.


At what value of x does the step occur? Put another way, how does the position of the step depend upon the weight and bias?


To answer this question, try modifying the weight and bias in the diagram above (you may need to scroll back a bit). Can you figure out how the position of the step depends on w and b? With a little work you should be able to convince yourself that the position of the step is proportional to b, and inversely proportional to w.


In fact, the step is at position s = −b/w, as you can see by modifying the weight and bias in the following diagram:



It will greatly simplify our lives to describe hidden neurons using just a single parameter, s, which is the step position, s = −b/w. Try modifying s in the following diagram, in order to get used to the new parameterization:



As noted above, we’ve implicitly set the weight w on the input to be some large value – big enough that the step function is a very good approximation. We can easily convert a neuron parameterized in this way back into the conventional model, by choosing the bias b = −ws.
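A minimal sketch of this reparameterization (the helper name is illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step_neuron(x, s, w=1000.0):
    """A hidden neuron described by a single parameter, its step
    position s. Internally it is an ordinary sigmoid neuron with a
    large weight w and bias b = -w*s, so the step sits at x = -b/w = s."""
    b = -w * s
    return sigmoid(w * x + b)

# The step sits at s = 0.2: output ~0 just below it, ~1 just above it.
print(round(step_neuron(0.19, s=0.2), 3))  # below the step: 0.0
print(round(step_neuron(0.21, s=0.2), 3))  # above the step: 1.0
```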


Up to now we’ve been focusing on the output from just the top hidden neuron. Let’s take a look at the behavior of the entire network. In particular, we’ll suppose the hidden neurons are computing step functions parameterized by step points s1 (top neuron) and s2 (bottom neuron). And they’ll have respective output weights w1 and w2. Here’s the network:



What’s being plotted on the right is the weighted output w1a1 + w2a2 from the hidden layer. Here, a1 and a2 are the outputs from the top and bottom hidden neurons, respectively* *Note, by the way, that the output from the whole network is σ(w1a1 + w2a2 + b), where b is the bias on the output neuron. Obviously, this isn’t the same as the weighted output from the hidden layer, which is what we’re plotting here. We’re going to focus on the weighted output from the hidden layer right now, and only later will we think about how that relates to the output from the whole network.* These outputs are denoted with a’s because they’re often known as the neurons’ activations.
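The weighted output w1a1 + w2a2 is easy to sketch numerically (helper names are mine, assuming the large-weight step approximation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def activation(x, s, w=1000.0):
    """Activation a of a step-like hidden neuron with step point s."""
    return sigmoid(w * (x - s))

def weighted_output(x, s1, s2, w1, w2):
    """The quantity plotted on the right: w1*a1 + w2*a2, where a1 and
    a2 are the top and bottom hidden neurons' activations."""
    a1 = activation(x, s1)  # top hidden neuron
    a2 = activation(x, s2)  # bottom hidden neuron
    return w1 * a1 + w2 * a2

# With s1 = 0.3 and s2 = 0.7, the weighted output jumps by w1 at 0.3
# and by a further w2 at 0.7:
print(round(weighted_output(0.5, 0.3, 0.7, 0.6, 0.4), 3))  # only top active: 0.6
print(round(weighted_output(0.9, 0.3, 0.7, 0.6, 0.4), 3))  # both active: 1.0
```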


Try increasing and decreasing the step point s1 of the top hidden neuron. Get a feel for how this changes the weighted output from the hidden layer. It’s particularly worth understanding what happens when s1 goes past s2. You’ll see that the graph changes shape when this happens, since we have moved from a situation where the top hidden neuron is the first to be activated to a situation where the bottom hidden neuron is the first to be activated.


Similarly, try manipulating the step point s2 of the bottom hidden neuron, and get a feel for how this changes the combined output from the hidden neurons.


Try increasing and decreasing each of the output weights. Notice how this rescales the contribution from the respective hidden neurons. What happens when one of the weights is zero?


Finally, try setting w1 to be 0.8 and w2 to be −0.8. You get a “bump” function, which starts at point s1, ends at point s2, and has height 0.8. For instance, the weighted output might look like this:




Of course, we can rescale the bump to have any height at all. Let’s use a single parameter, h, to denote the height. To reduce clutter I’ll also remove the “s1 = …” and “w1 = …” notations.
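A sketch of the bump construction, again assuming the large-weight step approximation (function names illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step(x, s, w=1000.0):
    """Approximate step function at position s (large-weight sigmoid)."""
    return sigmoid(w * (x - s))

def bump(x, s1, s2, h):
    """Weighted output of a pair of hidden neurons with output weights
    +h and -h: a 'bump' of height h between s1 and s2."""
    return h * step(x, s1) - h * step(x, s2)

# A bump of height 0.8 from x = 0.3 to x = 0.6:
print(round(bump(0.45, 0.3, 0.6, 0.8), 3))  # inside the bump: 0.8
print(round(bump(0.10, 0.3, 0.6, 0.8), 3))  # outside the bump: 0.0
```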



Try changing the value of h up and down, to see how the height of the bump changes. Try changing the height so it’s negative, and observe what happens. And try changing the step points to see how that changes the shape of the bump.


You’ll notice, by the way, that we’re using our neurons in a way that can be thought of not just in graphical terms, but in more conventional programming terms, as a kind of if-then-else statement, e.g.:


if input >= step point:
    add 1 to the weighted output
else:
    add 0 to the weighted output

For the most part I’m going to stick with the graphical point of view. But in what follows you may sometimes find it helpful to switch points of view, and think about things in terms of if-then-else.


We can use our bump-making trick to get two bumps, by gluing two pairs of hidden neurons together into the same network:



I’ve suppressed the weights here, simply writing the h values for each pair of hidden neurons. Try increasing and decreasing both h values, and observe how it changes the graph. Move the bumps around by changing the step points.


More generally, we can use this idea to get as many peaks as we want, of any height. In particular, we can divide the interval [0, 1] up into a large number, N, of subintervals, and use N pairs of hidden neurons to set up peaks of any desired height. Let’s see how this works for N = 5. That’s quite a few neurons, so I’m going to pack things in a bit. Apologies for the complexity of the diagram: I could hide the complexity by abstracting away further, but I think it’s worth putting up with a little complexity, for the sake of getting a more concrete feel for how these networks work.



You can see that there are five pairs of hidden neurons. The step points for the respective pairs of neurons are 0, 1/5, then 1/5, 2/5, and so on, out to 4/5, 5/5. These values are fixed – they make it so we get five evenly spaced bumps on the graph.
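The five-bump construction can be sketched as follows; the heights are arbitrary design choices, and the helper names are mine:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step(x, s, w=1000.0):
    """Approximate step function at position s (large-weight sigmoid)."""
    return sigmoid(w * (x - s))

def hidden_layer_output(x, heights):
    """Weighted output from N pairs of hidden neurons, giving N evenly
    spaced bumps on [0, 1] with the given heights."""
    N = len(heights)
    total = 0.0
    for j, h in enumerate(heights):
        s1, s2 = j / N, (j + 1) / N          # step points j/N and (j+1)/N
        total += h * step(x, s1) - h * step(x, s2)
    return total

# Five bumps with freely chosen heights -- we're designing the function:
heights = [0.2, 0.9, 0.4, 0.7, 0.1]
print(round(hidden_layer_output(0.3, heights), 3))  # inside the second bump: 0.9
print(round(hidden_layer_output(0.5, heights), 3))  # inside the third bump: 0.4
```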


Each pair of neurons has a value of h associated to it. Remember, the connections output from the neurons have weights h and −h (not marked). Click on one of the h values, and drag the mouse to the right or left to change the value. As you do so, watch the function change. By changing the output weights we’re actually designing the function!


Contrariwise, try clicking on the graph, and dragging up or down to change the height of any of the bump functions. As you change the heights, you can see the corresponding change in h values. And, although it’s not shown, there is also a change in the corresponding output weights, which are +h and −h.


In other words, we can directly manipulate the function appearing in the graph on the right, and see that reflected in the h values on the left. A fun thing to do is to hold the mouse button down and drag the mouse from one side of the graph to the other. As you do this you draw out a function, and get to watch the parameters in the neural network adapt.


Time for a challenge.


Let’s think back to the function I plotted at the beginning of the chapter:


I didn’t say it at the time, but what I plotted is actually the function

f(x) = 0.2 + 0.4x^2 + 0.3 sin(15x) + 0.05 cos(50x),

plotted over x from 0 to 1, and with the y axis taking values from 0 to 1.


That’s obviously not a trivial function.


You’re going to figure out how to compute it using a neural network.


In our networks above we’ve been analyzing the weighted combination ∑j wjaj output from the hidden neurons. We now know how to get a lot of control over this quantity. But, as I noted earlier, this quantity is not what’s output from the network. What’s output from the network is σ(∑j wjaj + b), where b is the bias on the output neuron. Is there some way we can achieve control over the actual output from the network?


The solution is to design a neural network whose hidden layer has a weighted output given by σ⁻¹∘f(x), where σ⁻¹ is just the inverse of the σ function. That is, we want the weighted output from the hidden layer to be:


If we can do this, then the output from the network as a whole will be a good approximation to f(x)* *Note that I have set the bias on the output neuron to 0.*
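A sketch of this design trick: if the hidden layer produces σ⁻¹∘f(x), the final sigmoid undoes the σ⁻¹ and the network (with output bias 0) recovers f(x):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_inverse(y):
    """The inverse sigmoid (logit): sigma^{-1}(y) = ln(y / (1 - y)),
    defined for y in the open interval (0, 1)."""
    return math.log(y / (1.0 - y))

# If the hidden layer's weighted output equals sigma^{-1}(f(x)), then
# the network's final output sigma(...) recovers f(x) exactly.
f_x = 0.35                       # some target value in (0, 1)
weighted = sigmoid_inverse(f_x)  # what we design the hidden layer to produce
print(round(sigmoid(weighted), 3))  # back to 0.35
```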


Your challenge, then, is to design a neural network to approximate the goal function shown just above. To learn as much as possible, I want you to solve the problem twice. The first time, please click on the graph, directly adjusting the heights of the different bump functions. You should find it fairly easy to get a good match to the goal function. How well you’re doing is measured by the average deviation between the goal function and the function the network is actually computing. Your challenge is to drive the average deviation as low as possible. You complete the challenge when you drive the average deviation to 0.40 or below.


Once you’ve done that, click on “Reset” to randomly re-initialize the bumps. The second time you solve the problem, resist the urge to click on the graph. Instead, modify the h values on the left-hand side, and again attempt to drive the average deviation to 0.40 or below.



You’ve now figured out all the elements necessary for the network to approximately compute the function f(x)! It’s only a coarse approximation, but we could easily do much better, merely by increasing the number of pairs of hidden neurons, allowing more bumps.


In particular, it’s easy to convert all the data we have found back into the standard parameterization used for neural networks. Let me just recap quickly how that works.


The first layer of weights all have some large, constant value, say w = 1000.


The biases on the hidden neurons are just b = −ws. So, for instance, for the second hidden neuron s = 0.2 becomes b = −1000 × 0.2 = −200.


The final layer of weights are determined by the h values. So, for instance, the value you’ve chosen above for the first h, h = −1.2, means that the output weights from the top two hidden neurons are −1.2 and 1.2, respectively. And so on, for the entire layer of output weights.


Finally, the bias on the output neuron is 0.
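The recap above can be sketched as a small conversion routine (the function name and data layout are mine, not the book's):

```python
def standard_parameters(step_points, heights, w=1000.0):
    """Convert the (step point, height) description back into standard
    network parameters.  step_points: step positions s, two per bump;
    heights: one h per bump.  Returns per-neuron (weight, bias) pairs
    for the first layer, the output weights, and the output bias."""
    first_layer = [(w, -w * s) for s in step_points]  # bias b = -w*s
    output_weights = []
    for h in heights:
        output_weights += [h, -h]                     # +h then -h per pair
    output_bias = 0.0                                 # bias on the output neuron
    return first_layer, output_weights, output_bias

layer1, out_w, out_b = standard_parameters([0.0, 0.2, 0.2, 0.4], [-1.2, 0.5])
print(layer1[1])   # second hidden neuron, s = 0.2: (1000.0, -200.0)
print(out_w[:2])   # output weights for the first pair: [-1.2, 1.2]
```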


That’s everything: we now have a complete description of a neural network which does a pretty good job computing our original goal function. And we understand how to improve the quality of the approximation by increasing the number of hidden neurons.


What’s more, there was nothing special about our original goal function, f(x) = 0.2 + 0.4x^2 + 0.3 sin(15x) + 0.05 cos(50x). We could have used this procedure for any continuous function from [0, 1] to [0, 1]. In essence, we’re using our single-layer neural networks to build a lookup table for the function. And we’ll be able to build on this idea to provide a general proof of universality.
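For reference, the goal function itself is straightforward to evaluate:

```python
import math

def f(x):
    """The goal function from the chapter:
    f(x) = 0.2 + 0.4*x**2 + 0.3*sin(15*x) + 0.05*cos(50*x)."""
    return 0.2 + 0.4 * x**2 + 0.3 * math.sin(15 * x) + 0.05 * math.cos(50 * x)

print(round(f(0.0), 3))  # 0.2 + 0 + 0 + 0.05 = 0.25
```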


Many input variables


Let’s extend our results to the case of many input variables. This sounds complicated, but all the ideas we need can be understood in the case of just two inputs. So let’s address the two-input case.


We’ll start by considering what happens when we have two inputs to a neuron:


Here, we have inputs x and y, with corresponding weights w1 and w2, and a bias b on the neuron. Let’s set the weight w2 to 0, and then play around with the first weight, w1, and the bias, b, to see how they affect the output from the neuron:




As you can see, with w2 = 0 the input y makes no difference to the output from the neuron. It’s as though x is the only input.


Given this, what do you think happens when we increase the weight w1 to w1 = 100, with w2 remaining 0? If you don’t immediately see the answer, ponder the question for a bit, and see if you can figure out what happens. Then try it out and see if you’re right. I’ve shown what happens in the following movie:



Just as in our earlier discussion, as the input weight gets larger the output approaches a step function. The difference is that now the step function is in three dimensions. Also as before, we can move the location of the step point around by modifying the bias. The actual location of the step point is sx ≡ −b/w1.
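A sketch of the two-input neuron, showing that with w2 = 0 and a large w1 the output is a step in the x direction only (names illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_2d(x, y, w1, w2, b):
    """A sigmoid neuron with two inputs: sigma(w1*x + w2*y + b)."""
    return sigmoid(w1 * x + w2 * y + b)

# With w2 = 0 and a large w1, the output steps at s_x = -b/w1 = 0.4,
# regardless of y:
w1, b = 1000.0, -400.0
print(round(neuron_2d(0.5, 0.0, w1, 0.0, b), 3))  # x above the step: 1.0
print(round(neuron_2d(0.5, 9.9, w1, 0.0, b), 3))  # y is irrelevant:  1.0
print(round(neuron_2d(0.3, 0.0, w1, 0.0, b), 3))  # x below the step: 0.0
```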


Let’s redo the above using the position of the step as the parameter:




Here, we assume the weight on the x input has some large value – I’ve used w1 = 1000 – and the weight w2 = 0. The number on the neuron is the step point, and the little x above the number reminds us that the step is in the x direction. Of course, it’s also possible to get a step function in the y direction, by making the weight on the y input very large (say, w2 = 1000), and the weight on the x input equal to 0, i.e., w1 = 0:




The number on the neuron is again the step point, and in this case the little y above the number reminds us that the step is in the y direction. I could have explicitly marked the weights on the x and y inputs, but decided not to, since it would make the diagram rather cluttered. But do keep in mind that the little y marker implicitly tells us that the y weight is large, and the x weight is 0.


We can use the step functions we’ve just constructed to compute a three-dimensional bump function. To do this, we use two neurons, each computing a step function in the x direction. Then we combine those step functions with weight h and −h, respectively, where h is the desired height of the bump. It’s all illustrated in the following diagram:


[Diagram: weighted output from hidden layer]


Try changing the value of the height, h. Observe how it relates to the weights in the network. And see how it changes the height of the bump function on the right.


Also, try changing the step point 0.30 associated to the top hidden neuron. Witness how it changes the shape of the bump. What happens when you move it past the step point 0.70 associated to the bottom hidden neuron?


We’ve figured out how to make a bump function in the x direction. Of course, we can easily make a bump function in the y direction, by using two step functions in the y direction. Recall that we do this by making the weight large on the y input, and the weight 0 on the x input. Here’s the result:


[Interactive figure: weighted output from the hidden layer]


This looks nearly identical to the earlier network! The only thing explicitly shown as changing is that there are now little y markers on our hidden neurons. That reminds us that they’re producing y step functions, not x step functions, and so the weight is very large on the y input, and zero on the x input, not vice versa. As before, I decided not to show this explicitly, in order to avoid clutter.


Let’s consider what happens when we add up two bump functions, one in the x direction, the other in the y direction, both of height h:


[Interactive figure: weighted output from the hidden layer]


To simplify the diagram I’ve dropped the connections with zero weight. For now, I’ve left in the little x and y markers on the hidden neurons, to remind you in what directions the bump functions are being computed. We’ll drop even those markers later, since they’re implied by the input variable.


Try varying the parameter h. As you can see, this causes the output weights to change, and also the heights of both the x and y bump functions.


What we’ve built looks a little like a tower function:


[Interactive figure: tower function]

If we could build such tower functions, then we could use them to approximate arbitrary functions, just by adding up many towers of different heights, and in different locations:


[Interactive figure: many towers]

Of course, we haven’t yet figured out how to build a tower function. What we have constructed looks like a central tower, of height 2h, with a surrounding plateau, of height h.


But we can make a tower function. Remember that earlier we saw neurons can be used to implement a type of if-then-else statement:


if input >= threshold:
    output 1
else:
    output 0

That was for a neuron with just a single input. What we want is to apply a similar idea to the combined output from the hidden neurons:


if combined output from hidden neurons >= threshold:
    output 1
else:
    output 0

If we choose the threshold appropriately – say, a value of 3h/2, which is sandwiched between the height of the plateau and the height of the central tower – we could squash the plateau down to zero, and leave just the tower standing.


Can you see how to do this? Try experimenting with the following network to figure it out. Note that we’re now plotting the output from the entire network, not just the weighted output from the hidden layer. This means we add a bias term to the weighted output from the hidden layer, and apply the sigmoid function. Can you find values for h and b which produce a tower? This is a bit tricky, so if you think about this for a while and remain stuck, here are two hints: (1) To get the output neuron to show the right kind of if-then-else behaviour, we need the input weights (all h or −h) to be large; and (2) the value of b determines the scale of the if-then-else threshold.




With our initial parameters, the output looks like a flattened version of the earlier diagram, with its tower and plateau. To get the desired behaviour, we first increase the parameter h until it becomes large; that gives the if-then-else thresholding behaviour. Second, to get the threshold right, we choose b ≈ −3h/2. Try it, and see how it works!


Here’s what it looks like, when we use h = 10:



Even for this relatively modest value of h, we get a pretty good tower function. And, of course, we can make it as good as we want by increasing h still further, and keeping the bias as b = −3h/2.
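Here is a hedged numerical sketch of this tower construction (the step points are illustrative values of my own choosing, not from the text): four large-weight sigmoid steps form an x bump and a y bump of height h, and an output neuron with bias −3h/2 squashes the plateau while keeping the tower.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def step(x, s, w=1000.0):
    # Large-weight sigmoid approximating a step at x = s.
    return sigmoid(w * (x - s))

def tower(x, y, h=10.0):
    # Hidden layer: an x bump plus a y bump, each of height h
    # (step points 0.4 and 0.6 are assumed for illustration).
    hidden = (h * step(x, 0.4) - h * step(x, 0.6)
              + h * step(y, 0.4) - h * step(y, 0.6))
    # Output neuron with bias b = -3h/2: the threshold 3h/2 sits between
    # the plateau height h and the central tower height 2h.
    return sigmoid(hidden - 1.5 * h)

print(tower(0.5, 0.5))  # centre of the tower: close to 1
print(tower(0.5, 0.9))  # on the plateau: close to 0
print(tower(0.1, 0.9))  # far outside: close to 0
```

With h = 10 the tower value is σ(5) ≈ 0.993; increasing h pushes it as close to 1 as you like.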


Let’s try gluing two such networks together, in order to compute two different tower functions. To make the respective roles of the two sub-networks clear I’ve put them in separate boxes, below: each box computes a tower function, using the technique described above. The graph on the right shows the weighted output from the second hidden layer, that is, it’s a weighted combination of tower functions.


[Interactive figure: weighted output from the second hidden layer]


In particular, you can see that by modifying the weights in the final layer you can change the height of the output towers.


The same idea can be used to compute as many towers as we like. We can also make them as thin as we like, and whatever height we like. As a result, we can ensure that the weighted output from the second hidden layer approximates any desired function of two variables:


[Interactive figure: many towers]

In particular, by making the weighted output from the second hidden layer a good approximation to σ⁻¹∘f, we ensure the output from our network will be a good approximation to any desired function, f.


What about functions of more than two variables?


Let’s try three variables x1, x2, x3. The following network can be used to compute a tower function in four dimensions:



Here, the x1, x2, x3 denote inputs to the network. The s1, t1 and so on are step points for neurons – that is, all the weights in the first layer are large, and the biases are set to give the step points s1, t1, s2, …. The weights in the second layer alternate +h, −h, where h is some very large number. And the output bias is −5h/2.


This network computes a function which is 1 provided three conditions are met: x1 is between s1 and t1; x2 is between s2 and t2; and x3 is between s3 and t3. The network is 0 everywhere else. That is, it’s a kind of tower which is 1 in a little region of input space, and 0 everywhere else.
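A quick sketch of this three-variable tower, under the same illustrative large-weight assumptions as before (the box intervals below are made up for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def step(x, s, w=1000.0):
    # Large-weight sigmoid approximating a step at x = s.
    return sigmoid(w * (x - s))

def tower3(x1, x2, x3, intervals, h=100.0):
    # Second-layer weights alternate +h, -h, one pair per input, so the
    # hidden sum is 3h inside the box and at most 2h anywhere outside it.
    hidden = sum(h * step(x, s) - h * step(x, t)
                 for x, (s, t) in zip((x1, x2, x3), intervals))
    # Output bias -5h/2 puts the threshold between 2h and 3h.
    return sigmoid(hidden - 2.5 * h)

box = [(0.2, 0.4), (0.3, 0.5), (0.6, 0.8)]  # assumed (s_i, t_i) intervals
print(tower3(0.3, 0.4, 0.7, box))  # all three conditions met: close to 1
print(tower3(0.9, 0.4, 0.7, box))  # x1 outside its interval: close to 0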


By gluing together many such networks we can get as many towers as we want, and so approximate an arbitrary function of three variables. Exactly the same idea works in m dimensions. The only change needed is to make the output bias (−m + 1/2)h, in order to get the right kind of sandwiching behavior to level the plateau.


Okay, so we now know how to use neural networks to approximate a real-valued function of many variables. What about vector-valued functions f(x1, …, xm) ∈ R^n? Of course, such a function can be regarded as just n separate real-valued functions, f1(x1, …, xm), f2(x1, …, xm), and so on. So we create a network approximating f1, another network for f2, and so on. And then we simply glue all the networks together. So that’s also easy to cope with.




We’ve seen how to use networks with two hidden layers to approximate an arbitrary function. Can you find a proof showing that it’s possible with just a single hidden layer? As a hint, try working in the case of just two input variables, and showing that: (a) it’s possible to get step functions not just in the x or y directions, but in an arbitrary direction; (b) by adding up many of the constructions from part (a) it’s possible to approximate a tower function which is circular in shape, rather than rectangular; (c) using these circular towers, it’s possible to approximate an arbitrary function. To do part (c) it may help to use ideas from a bit later in this chapter.

Extension beyond sigmoid neurons


We’ve proved that networks made up of sigmoid neurons can compute any function. Recall that in a sigmoid neuron the inputs x1, x2, … result in the output σ(∑_j w_j x_j + b), where w_j are the weights, b is the bias, and σ is the sigmoid function:


What if we consider a different type of neuron, one using some other activation function, s(z):


That is, we’ll assume that if our neuron has inputs x1, x2, …, weights w1, w2, … and bias b, then the output is s(∑_j w_j x_j + b).


We can use this activation function to get a step function, just as we did with the sigmoid. Try ramping up the weight in the following, say to w = 100:



Just as with the sigmoid, this causes the activation function to contract, and ultimately it becomes a very good approximation to a step function. Try changing the bias, and you’ll see that we can set the position of the step to be wherever we choose. And so we can use all the same tricks as before to compute any desired function.


What properties does s(z) need to satisfy in order for this to work? We do need to assume that s(z) is well-defined as z → −∞ and z → ∞. These two limits are the two values taken on by our step function. We also need to assume that these limits are different from one another. If they weren’t, there’d be no step, simply a flat graph! But provided the activation function s(z) satisfies these properties, neurons based on such an activation function are universal for computation.
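As a quick check of these conditions (my own illustration), tanh has well-defined and distinct limits – −1 as z → −∞ and +1 as z → ∞ – so ramping up the weight gives a step just as with the sigmoid:

```python
import numpy as np

def step_tanh(x, s, w=1000.0):
    # A tanh neuron with large weight w and bias -w*s approximates a
    # step at x = s, jumping between the two limits -1 and +1.
    return np.tanh(w * (x - s))

print(step_tanh(0.9, 0.5))  # well right of the step: close to +1
print(step_tanh(0.1, 0.5))  # well left of the step: close to -1
```

The only difference from the sigmoid case is that the step jumps between −1 and +1 rather than 0 and 1, which just rescales the output weights in the constructions above.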




Earlier in the book we met another type of neuron known as a rectified linear unit. Explain why such neurons don’t satisfy the conditions just given for universality. Find a proof of universality showing that rectified linear units are universal for computation.

Suppose we consider linear neurons, i.e., neurons with the activation function s(z) = z. Explain why linear neurons don’t satisfy the conditions just given for universality. Show that such neurons can’t be used to do universal computation.

Fixing up the step functions


Up to now, we’ve been assuming that our neurons can produce step functions exactly. That’s a pretty good approximation, but it is only an approximation. In fact, there will be a narrow window of failure, illustrated in the following graph, in which the function behaves very differently from a step function:



In these windows of failure the explanation I’ve given for universality will fail.


Now, it’s not a terrible failure. By making the weights input to the neurons big enough we can make these windows of failure as small as we like. Certainly, we can make the window much narrower than I’ve shown above – narrower, indeed, than our eye could see. So perhaps we might not worry too much about this problem.


Nonetheless, it’d be nice to have some way of addressing the problem.


In fact, the problem turns out to be easy to fix. Let’s look at the fix for neural networks computing functions with just one input and one output. The same ideas work also to address the problem when there are more inputs and outputs.


In particular, suppose we want our network to compute some function, f. As before, we do this by trying to design our network so that the weighted output from our hidden layer of neurons is σ⁻¹∘f(x):


If we were to do this using the technique described earlier, we’d use the hidden neurons to produce a sequence of bump functions:


Again, I’ve exaggerated the size of the windows of failure, in order to make them easier to see. It should be pretty clear that if we add all these bump functions up we’ll end up with a reasonable approximation to σ⁻¹∘f(x), except within the windows of failure.


Suppose that instead of using the approximation just described, we use a set of hidden neurons to compute an approximation to half our original goal function, i.e., to σ⁻¹∘f(x)/2. Of course, this looks just like a scaled-down version of the last graph:


And suppose we use another set of hidden neurons to compute an approximation to σ⁻¹∘f(x)/2, but with the bases of the bumps shifted by half the width of a bump:

Now we have two different approximations to σ⁻¹∘f(x)/2. If we add up the two approximations we’ll get an overall approximation to σ⁻¹∘f(x). That overall approximation will still have failures in small windows. But the problem will be much less than before. The reason is that points in a failure window for one approximation won’t be in a failure window for the other. And so the approximation will be a factor roughly 2 better in those windows.


We could do even better by adding up a large number, M, of overlapping approximations to the function σ⁻¹∘f(x)/M. Provided the windows of failure are narrow enough, a point will only ever be in one window of failure. And provided we’re using a large enough number M of overlapping approximations, the result will be an excellent overall approximation.




The explanation for universality we’ve discussed is certainly not a practical prescription for how to compute using neural networks! In this, it’s much like proofs of universality for NAND gates and the like. For this reason, I’ve focused mostly on trying to make the construction clear and easy to follow, and not on optimizing the details of the construction. However, you may find it a fun and instructive exercise to see if you can improve the construction.


Although the result isn’t directly useful in constructing networks, it’s important because it takes off the table the question of whether any particular function is computable using a neural network. The answer to that question is always “yes”. So the right question to ask is not whether any particular function is computable, but rather what’s a good way to compute the function.


The universality construction we’ve developed uses just two hidden layers to compute an arbitrary function. Furthermore, as we’ve discussed, it’s possible to get the same result with just a single hidden layer. Given this, you might wonder why we would ever be interested in deep networks, i.e., networks with many hidden layers. Can’t we simply replace those networks with shallow, single-hidden-layer networks?


Chapter acknowledgments: Thanks to Jen Dodd and Chris Olah for many discussions about universality in neural networks. My thanks, in particular, to Chris for suggesting the use of a lookup table to prove universality. The interactive visual form of the chapter is inspired by the work of people such as Mike Bostock, Amit Patel, Bret Victor, and Steven Wittens.


While in principle that’s possible, there are good practical reasons to use deep networks. As argued in Chapter 1, deep networks have a hierarchical structure which makes them particularly well adapted to learn the hierarchies of knowledge that seem to be useful in solving real-world problems. Put more concretely, when attacking problems such as image recognition, it helps to use a system that understands not just individual pixels, but also increasingly more complex concepts: from edges to simple geometric shapes, all the way up through complex, multi-object scenes. In later chapters, we’ll see evidence suggesting that deep networks do a better job than shallow networks at learning such hierarchies of knowledge. To sum up: universality tells us that neural networks can compute any function; and empirical evidence suggests that deep networks are the networks best adapted to learn the functions useful in solving many real-world problems.

Imagine you’re an engineer who has been asked to design a computer from scratch. One day you’re working away in your office, designing logical circuits, setting out AND gates, OR gates, and so on, when your boss walks in with bad news. The customer has just added a surprising design requirement: the circuit for the entire computer must be just two layers deep:



You’re dumbfounded, and tell your boss: “The customer is crazy!”


Your boss replies: “I think they’re crazy, too. But what the customer wants, they get.”


In fact, there’s a limited sense in which the customer isn’t crazy. Suppose you’re allowed to use a special logical gate which lets you AND together as many inputs as you want. And you’re also allowed a many-input NAND gate, that is, a gate which can AND multiple inputs and then negate the output. With these special gates it turns out to be possible to compute any function at all using a circuit that’s just two layers deep.


But just because something is possible doesn’t make it a good idea. In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve sub-problems, and then gradually integrate the solutions. In other words, we build up to a solution through multiple layers of abstraction.


For instance, suppose we’re designing a logical circuit to multiply two numbers. Chances are we want to build it up out of sub-circuits doing operations like adding two numbers. The sub-circuits for adding two numbers will, in turn, be built up out of sub-sub-circuits for adding two bits. Very roughly speaking our circuit will look like:



That is, our final circuit contains at least three layers of circuit elements. In fact, it’ll probably contain more than three layers, as we break the sub-tasks down into smaller units than I’ve described. But you get the general idea.


So deep circuits make the process of design easier. But they’re not just helpful for design. There are, in fact, mathematical proofs showing that for some functions very shallow circuits require exponentially more circuit elements to compute than do deep circuits. For instance, a famous series of papers in the early 1980s* *The history is somewhat complex, so I won’t give detailed references. See Johan Håstad’s 2012 paper On the correlation of parity and small-depth circuits for an account of the early history and references. showed that computing the parity of a set of bits requires exponentially many gates, if done with a shallow circuit. On the other hand, if you use deeper circuits it’s easy to compute the parity using a small circuit: you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity. Deep circuits thus can be intrinsically much more powerful than shallow circuits.
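The pairwise scheme described above can be sketched as follows (a minimal illustration of my own, not from the text): each pass XORs adjacent pairs, halving the width of the "circuit" while the depth grows only logarithmically in the number of bits.

```python
def parity(bits):
    # Deep-circuit style: XOR adjacent pairs, then pairs of pairs, and
    # so on; each loop iteration corresponds to one layer of gates.
    layer = list(bits)
    while len(layer) > 1:
        if len(layer) % 2:  # pad odd-width layers with a harmless 0
            layer.append(0)
        layer = [a ^ b for a, b in zip(layer[0::2], layer[1::2])]
    return layer[0]

print(parity([1, 0, 1, 1]))        # three 1s -> odd parity -> 1
print(parity([1, 1, 0, 0, 1, 1]))  # four 1s -> even parity -> 0
```

For n bits this uses about n gates arranged in roughly log2(n) layers, in contrast to the exponentially many gates a two-layer circuit needs.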


Up to now, this book has approached neural networks like the crazy customer. Almost all the networks we’ve worked with have just a single hidden layer of neurons (plus the input and output layers):



These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy! Nonetheless, intuitively we’d expect networks with many more hidden layers to be more powerful:



Such networks could use the intermediate layers to build up multiple layers of abstraction, just as we do in Boolean circuits. For instance, if we’re doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges, the neurons in the second layer could learn to recognize more complex shapes, say triangles or rectangles, built up from edges. The third layer would then recognize still more complex shapes. And so on. These multiple layers of abstraction seem likely to give deep networks a compelling advantage in learning to solve complex pattern recognition problems. Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow networks* *For certain problems and network architectures this is proved in On the number of response regions of deep feed forward networks with piece-wise linear activations, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio (2014). See also the more informal discussion in section 2 of Learning deep architectures for AI, by Yoshua Bengio (2009)..


How can we train such deep networks? In this chapter, we’ll try training deep networks using our workhorse learning algorithm – stochastic gradient descent by backpropagation. But we’ll run into trouble, with our deep networks not performing much (if at all) better than shallow networks.


That failure seems surprising in the light of the discussion above. Rather than give up on deep networks, we’ll dig down and try to understand what’s making our deep networks hard to train. When we look closely, we’ll discover that the different layers in our deep network are learning at vastly different speeds. In particular, when later layers in the network are learning well, early layers often get stuck during training, learning almost nothing at all. This stuckness isn’t simply due to bad luck. Rather, we’ll discover there are fundamental reasons the learning slowdown occurs, connected to our use of gradient-based learning techniques.


As we delve into the problem more deeply, we’ll learn that the opposite phenomenon can also occur: the early layers may be learning well, but later layers can become stuck. In fact, we’ll find that there’s an intrinsic instability associated to learning by gradient descent in deep, many-layer neural networks. This instability tends to result in either the early or the later layers getting stuck during training.


This all sounds like bad news. But by delving into these difficulties, we can begin to gain insight into what’s required to train deep networks effectively. And so these investigations are good preparation for the next chapter, where we’ll use deep learning to attack image recognition problems.


The vanishing gradient problem


So, what goes wrong when we try to train a deep network?


To answer that question, let’s first revisit the case of a network with just a single hidden layer. As per usual, we’ll use the MNIST digit classification problem as our playground for learning and experimentation* *I introduced the MNIST problem and data here and here..


If you wish, you can follow along by training networks on your computer. It is also, of course, fine to just read along. If you do wish to follow live, then you’ll need Python 2.7, Numpy, and a copy of the code, which you can get by cloning the relevant repository from the command line:


git clone

If you don’t use git then you can download the data and code here. You’ll need to change into the src subdirectory.

Then, from a Python shell we load the MNIST data:


>>> import mnist_loader

>>> training_data, validation_data, test_data = \

... mnist_loader.load_data_wrapper()

We set up our network:


>>> import network2

>>> net = network2.Network([784, 30, 10])

This network has 784 neurons in the input layer, corresponding to the 28×28 = 784 pixels in the input image. We use 30 hidden neurons, as well as 10 output neurons, corresponding to the 10 possible classifications for the MNIST digits (‘0’, ‘1’, ‘2’, …, ‘9’).


Let’s try training our network for 30 complete epochs, using mini-batches of 10 training examples at a time, a learning rate η = 0.1, and regularization parameter λ = 5.0. As we train we’ll monitor the classification accuracy on the validation_data* *Note that the network is likely to take some minutes to train, depending on the speed of your machine. So if you’re running the code you may wish to continue reading and return later, not wait for the code to finish executing.:


>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,

... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

We get a classification accuracy of 96.48 percent (or thereabouts – it’ll vary a bit from run to run), comparable to our earlier results with a similar configuration.


Now, let’s add another hidden layer, also with 30 neurons in it, and try training with the same hyper-parameters:


>>> net = network2.Network([784, 30, 30, 10])

>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,

... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

This gives an improved classification accuracy, 96.90 percent. That’s encouraging: a little more depth is helping. Let’s add another 30-neuron hidden layer:


>>> net = network2.Network([784, 30, 30, 30, 10])

>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,

... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

That doesn’t help at all. In fact, the result drops back down to 96.57 percent, close to our original shallow network. And suppose we insert one further hidden layer:


>>> net = network2.Network([784, 30, 30, 30, 30, 10])

>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,

... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

The classification accuracy drops again, to 96.53 percent. That’s probably not a statistically significant drop, but it’s not encouraging, either.


This behaviour seems strange. Intuitively, extra hidden layers ought to make the network able to learn more complex classification functions, and thus do a better job classifying. Certainly, things shouldn’t get worse, since the extra layers can, in the worst case, simply do nothing* *See this later problem to understand how to build a hidden layer that does nothing.. But that’s not what’s going on.


So what is going on? Let’s assume that the extra hidden layers really could help in principle, and the problem is that our learning algorithm isn’t finding the right weights and biases. We’d like to figure out what’s going wrong in our learning algorithm, and how to do better.


To get some insight into what’s going wrong, let’s visualize how the network learns. Below, I’ve plotted part of a [784, 30, 30, 10] network, i.e., a network with two hidden layers, each containing 30 hidden neurons. Each neuron in the diagram has a little bar on it, representing how quickly that neuron is changing as the network learns. A big bar means the neuron’s weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly. More precisely, the bars denote the gradient ∂C/∂b for each neuron, i.e., the rate of change of the cost with respect to the neuron’s bias. Back in Chapter 2 we saw that this gradient quantity controlled not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change, too. Don’t worry if you don’t recall the details: the thing to keep in mind is simply that these bars show how quickly each neuron’s weights and bias are changing as the network learns.


To keep the diagram simple, I’ve shown just the top six neurons in the two hidden layers. I’ve omitted the input neurons, since they’ve got no weights or biases to learn. I’ve also omitted the output neurons, since we’re doing layer-wise comparisons, and it makes most sense to compare layers with the same number of neurons. The results are plotted at the very beginning of training, i.e., immediately after the network is initialized. Here they are* *The data plotted is generated using a short program; the same program is also used to generate the results quoted later in this section.:



The network was initialized randomly, and so it’s not surprising that there’s a lot of variation in how rapidly the neurons learn. Still, one thing that jumps out is that the bars in the second hidden layer are mostly much larger than the bars in the first hidden layer. As a result, the neurons in the second hidden layer will learn quite a bit faster than the neurons in the first hidden layer. Is this merely a coincidence, or are the neurons in the second hidden layer likely to learn faster than neurons in the first hidden layer in general?


To determine whether this is the case, it helps to have a global way of comparing the speed of learning in the first and second hidden layers. To do this, let’s denote the gradient as δ^l_j = ∂C/∂b^l_j, i.e., the gradient for the jth neuron in the lth layer* *Back in Chapter 2 we referred to this as the error, but here we’ll adopt the informal term “gradient”. I say “informal” because of course this doesn’t explicitly include the partial derivatives of the cost with respect to the weights, ∂C/∂w.. We can think of the gradient δ1 as a vector whose entries determine how quickly the first hidden layer learns, and δ2 as a vector whose entries determine how quickly the second hidden layer learns. We’ll then use the lengths of these vectors as (rough!) global measures of the speed at which the layers are learning. So, for instance, the length ‖δ1‖ measures the speed at which the first hidden layer is learning, while the length ‖δ2‖ measures the speed at which the second hidden layer is learning.


With these definitions, and in the same configuration as was plotted above, we find ‖δ1‖ = 0.07… and ‖δ2‖ = 0.31…. So this confirms our earlier suspicion: the neurons in the second hidden layer really are learning much faster than the neurons in the first hidden layer.
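You can reproduce this kind of layer-by-layer comparison with a few lines of NumPy. The sketch below is a stand-in of my own, not the book's generating program: it initializes a [784, 30, 30, 10] sigmoid network (weights scaled by 1/√n_in, a reasonable default), backpropagates the quadratic-cost error for one random example, and prints ‖δ1‖ and ‖δ2‖.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(0)
sizes = [784, 30, 30, 10]
biases = [rng.standard_normal((n, 1)) for n in sizes[1:]]
weights = [rng.standard_normal((n, m)) / np.sqrt(m)
           for m, n in zip(sizes[:-1], sizes[1:])]

# One random "image" and an arbitrary target class.
x = rng.standard_normal((784, 1))
y = np.zeros((10, 1)); y[3] = 1.0

# Forward pass, keeping the weighted inputs z^l.
activation, zs, activations = x, [], [x]
for w, b in zip(weights, biases):
    z = w @ activation + b
    zs.append(z)
    activation = sigmoid(z)
    activations.append(activation)

# Backward pass (quadratic cost): delta^L = (a^L - y) * sigma'(z^L),
# then delta^l = ((W^{l+1})^T delta^{l+1}) * sigma'(z^l).
deltas = [None] * len(weights)
deltas[-1] = (activations[-1] - y) * sigmoid_prime(zs[-1])
for l in range(len(weights) - 2, -1, -1):
    deltas[l] = (weights[l + 1].T @ deltas[l + 1]) * sigmoid_prime(zs[l])

norms = [float(np.linalg.norm(d)) for d in deltas[:-1]]
print("||delta1|| =", norms[0], " ||delta2|| =", norms[1])
```

The exact numbers depend on the random seed and input, but the first hidden layer's norm typically comes out well below the second's, matching the pattern in the text.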


What happens if we add more hidden layers? If we have three hidden layers, in a [784, 30, 30, 30, 10] network, then the respective speeds of learning turn out to be 0.012, 0.060, and 0.283. Again, earlier hidden layers are learning much slower than later hidden layers. Suppose we add yet another layer with 30 hidden neurons. In that case, the respective speeds of learning are 0.003, 0.017, 0.070, and 0.285. The pattern holds: early layers learn slower than later layers.


We’ve been looking at the speed of learning at the start of training, that is, just after the networks are initialized. How does the speed of learning change as we train our networks? Let’s return to look at the network with just two hidden layers. The speed of learning changes as follows:



To generate these results, I used batch gradient descent with just 1,000 training images, trained over 500 epochs. This is a bit different than the way we usually train – I’ve used no mini-batches, and just 1,000 training images, rather than the full 50,000 image training set. I’m not trying to do anything sneaky, or pull the wool over your eyes, but it turns out that using mini-batch stochastic gradient descent gives much noisier (albeit very similar, when you average away the noise) results. Using the parameters I’ve chosen is an easy way of smoothing the results out, so we can see what’s going on.


In any case, as you can see the two layers start out learning at very different speeds (as we already know). The speed in both layers then drops very quickly, before rebounding. But through it all, the first hidden layer learns much more slowly than the second hidden layer.


What about more complex networks? Here are the results of a similar experiment, but this time with three hidden layers (a [784, 30, 30, 30, 10] network):



Again, early hidden layers learn much more slowly than later hidden layers. Finally, let’s add a fourth hidden layer (a [784, 30, 30, 30, 30, 10] network), and see what happens when we train:



Again, early hidden layers learn much more slowly than later hidden layers. In this case, the first hidden layer is learning roughly 100 times slower than the final hidden layer. No wonder we were having trouble training these networks earlier!


We have here an important observation: in at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers. And while we’ve seen this in just a single network, there are fundamental reasons why this happens in many neural networks. The phenomenon is known as the vanishing gradient problem* *See Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber (2001). This paper studied recurrent neural nets, but the essential phenomenon is the same as in the feedforward networks we are studying. See also Sepp Hochreiter’s earlier Diploma Thesis, Untersuchungen zu dynamischen neuronalen Netzen (1991, in German)..


Why does the vanishing gradient problem occur? Are there ways we can avoid it? And how should we deal with it in training deep neural networks? In fact, we’ll learn shortly that it’s not inevitable, although the alternative is not very attractive, either: sometimes the gradient gets much larger in earlier layers! This is the exploding gradient problem, and it’s not much better news than the vanishing gradient problem. More generally, it turns out that the gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers. This instability is a fundamental problem for gradient-based learning in deep neural networks. It’s something we need to understand, and, if possible, take steps to address.


One response to vanishing (or unstable) gradients is to wonder if they’re really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function f(x) of a single variable. Wouldn’t it be good news if the derivative f′(x) was small? Wouldn’t that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don’t need to do much adjustment of the weights and biases?


Of course, this isn’t the case. Recall that we randomly initialized the weights and biases in the network. It is extremely unlikely our initial weights and biases will do a good job at whatever it is we want our network to do. To be concrete, consider the first layer of weights in a [784,30,30,30,10] network for the MNIST problem. The random initialization means the first layer throws away most information about the input image. Even if later layers have been extensively trained, they will still find it extremely difficult to identify the input image, simply because they don’t have enough information. And so it can’t possibly be the case that not much learning needs to be done in the first layer. If we’re going to train deep networks, we need to figure out how to address the vanishing gradient problem.


What’s causing the vanishing gradient problem? Unstable gradients in deep neural nets


To get insight into why the vanishing gradient problem occurs, let’s consider the simplest deep neural network: one with just a single neuron in each layer. Here’s a network with three hidden layers:



Here, w1,w2,… are the weights, b1,b2,… are the biases, and C is some cost function. Just to remind you how this works, the output aj from the jth neuron is σ(zj), where σ is the usual sigmoid activation function, and zj = wj aj−1 + bj is the weighted input to the neuron. I’ve drawn the cost C at the end to emphasize that the cost is a function of the network’s output, a4: if the actual output from the network is close to the desired output, then the cost will be low, while if it’s far away, the cost will be high.
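This chain of single-neuron layers is easy to simulate directly. Here is a minimal NumPy sketch (the function name is mine, not from the text) of the forward pass, computing each zj = wj aj−1 + bj and aj = σ(zj) in turn:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_chain(a0, weights, biases):
    """Forward pass through a chain of single-neuron layers.

    Returns the weighted inputs z_j and activations a_j, where
    z_j = w_j * a_{j-1} + b_j and a_j = sigmoid(z_j).
    """
    a = a0
    zs, activations = [], [a0]
    for w, b in zip(weights, biases):
        z = w * a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

# Example: four layers with unit weights and zero biases (illustrative values)
zs, acts = forward_chain(0.5, [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0])
```

The returned lists make it easy to inspect how a change early in the chain propagates forward, which is exactly the bookkeeping the derivation below performs symbolically.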


We’re going to study the gradient ∂C/∂b1 associated to the first hidden neuron. We’ll figure out an expression for ∂C/∂b1, and by studying that expression we’ll understand why the vanishing gradient problem occurs.


I’ll start by simply showing you the expression for ∂C/∂b1. It looks forbidding, but it’s actually got a simple structure, which I’ll describe in a moment. Here’s the expression (ignore the network, for now, and note that σ′ is just the derivative of the σ function):

∂C/∂b1 = σ′(z1) × w2σ′(z2) × w3σ′(z3) × w4σ′(z4) × ∂C/∂a4

The structure in the expression is as follows: there is a σ′(zj) term in the product for each neuron in the network; a weight wj term for each weight in the network; and a final ∂C/∂a4 term, corresponding to the cost function at the end. Notice that I’ve placed each term in the expression above the corresponding part of the network. So the network itself is a mnemonic for the expression.


You’re welcome to take this expression for granted, and skip to the discussion of how it relates to the vanishing gradient problem. There’s no harm in doing this, since the expression is a special case of our earlier discussion of backpropagation. But there’s also a simple explanation of why the expression is true, and so it’s fun (and perhaps enlightening) to take a look at that explanation.


Imagine we make a small change Δb1 in the bias b1. That will set off a cascading series of changes in the rest of the network. First, it causes a change Δa1 in the output from the first hidden neuron. That, in turn, will cause a change Δz2 in the weighted input to the second hidden neuron. Then a change Δa2 in the output from the second hidden neuron. And so on, all the way through to a change ΔC in the cost at the output. We have

ΔC ≈ (∂C/∂b1) Δb1

This suggests that we can figure out an expression for the gradient ∂C/∂b1 by carefully tracking the effect of each step in this cascade.
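The cascade can also be tracked numerically. The sketch below (all parameter values are illustrative choices of mine, and a quadratic cost C = (a4 − y)²/2 is assumed) nudges b1 by a tiny amount and records the induced changes along the chain:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative parameters for the single-neuron chain
a0 = 0.7
w = [0.8, -0.5, 0.6, 0.4]
b = [0.1, 0.2, -0.3, 0.0]
y = 0.25                     # target output, for the quadratic cost

def run(b1):
    """Forward pass with first bias b1; returns [(z_j, a_j), ...] and the cost."""
    a = a0
    outs = []
    for wj, bj in zip(w, [b1] + b[1:]):
        z = wj * a + bj
        a = sigmoid(z)
        outs.append((z, a))
    return outs, (a - y) ** 2 / 2

db1 = 1e-6
outs, cost = run(b[0])
outs2, cost2 = run(b[0] + db1)

da1 = outs2[0][1] - outs[0][1]   # change in the first activation
dz2 = outs2[1][0] - outs[1][0]   # change in the second weighted input
dC = cost2 - cost                # change in the cost
```

Each measured change matches the corresponding step of the derivation that follows: da1 ≈ σ′(z1)Δb1, dz2 = w2·Δa1, and dC/Δb1 approximates the full product of terms.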


To do this, let’s think about how Δb1 causes the output a1 from the first hidden neuron to change. We have a1 = σ(z1) = σ(w1a0 + b1), so


Δa1 ≈ (∂σ(w1a0 + b1)/∂b1) Δb1 = σ′(z1) Δb1

That σ′(z1) term should look familiar: it’s the first term in our claimed expression for the gradient ∂C/∂b1. Intuitively, this term converts a change Δb1 in the bias into a change Δa1 in the output activation. That change Δa1 in turn causes a change in the weighted input z2 = w2a1 + b2 to the second hidden neuron:

Δz2 ≈ (∂z2/∂a1) Δa1 = w2 Δa1

Combining our expressions for Δz2 and Δa1, we see how the change in the bias b1 propagates along the network to affect z2:

Δz2 ≈ σ′(z1) w2 Δb1

Again, that should look familiar: we’ve now got the first two terms in our claimed expression for the gradient ∂C/∂b1.


We can keep going in this fashion, tracking the way changes propagate through the rest of the network. At each neuron we pick up a σ′(zj) term, and through each weight we pick up a wj term. The end result is an expression relating the final change ΔC in cost to the initial change Δb1 in the bias:

ΔC ≈ σ′(z1) w2 σ′(z2) w3 σ′(z3) w4 σ′(z4) (∂C/∂a4) Δb1

Dividing by Δb1 we do indeed get the desired expression for the gradient:

∂C/∂b1 = σ′(z1) w2 σ′(z2) w3 σ′(z3) w4 σ′(z4) ∂C/∂a4


Why the vanishing gradient problem occurs: To understand why the vanishing gradient problem occurs, let’s explicitly write out the entire expression for the gradient:

∂C/∂b1 = σ′(z1) × w2σ′(z2) × w3σ′(z3) × w4σ′(z4) × ∂C/∂a4

Excepting the very last term, this expression is a product of terms of the form wjσ′(zj). To understand how each of those terms behaves, let’s look at a plot of the function σ′:

[Figure: Derivative of the sigmoid function, σ′(z)]

The derivative reaches a maximum at σ′(0) = 1/4. Now, if we use our standard approach to initializing the weights in the network, then we’ll choose the weights using a Gaussian with mean 0 and standard deviation 1. So the weights will usually satisfy |wj| < 1. Putting these observations together, we see that the terms wjσ′(zj) will usually satisfy |wjσ′(zj)| < 1/4. And when we take a product of many such terms, the product will tend to exponentially decrease: the more terms, the smaller the product will be. This is starting to smell like a possible explanation for the vanishing gradient problem.
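A quick numerical experiment makes the exponential decrease concrete. This is a sketch under my own sampling choices (standard-Gaussian weights, and standard-Gaussian weighted inputs purely for illustration): it measures the typical magnitude of the product of wj σ′(zj) terms as the number of terms grows.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
samples = 100_000

def median_product_magnitude(depth):
    """Median of |prod_j w_j * sigmoid'(z_j)| over many random draws."""
    w = rng.standard_normal((samples, depth))   # weights ~ N(0, 1)
    z = rng.standard_normal((samples, depth))   # weighted inputs, for illustration
    return np.median(np.abs(np.prod(w * sigmoid_prime(z), axis=1)))

medians = [median_product_magnitude(d) for d in (2, 4, 8)]
```

Since each factor has magnitude at most |wj|/4, the typical product shrinks roughly geometrically as the depth doubles.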


To make this all a bit more explicit, let’s compare the expression for ∂C/∂b1 to an expression for the gradient with respect to a later bias, say ∂C/∂b3. Of course, we haven’t explicitly worked out an expression for ∂C/∂b3, but it follows the same pattern described above for ∂C/∂b1. Here’s the comparison of the two expressions:

∂C/∂b1 = σ′(z1) w2σ′(z2) w3σ′(z3) w4σ′(z4) ∂C/∂a4
∂C/∂b3 =                σ′(z3) w4σ′(z4) ∂C/∂a4

The two expressions share many terms. But the gradient ∂C/∂b1 includes two extra terms, each of the form wjσ′(zj). As we’ve seen, such terms are typically less than 1/4 in magnitude. And so the gradient ∂C/∂b1 will usually be a factor of 16 (or more) smaller than ∂C/∂b3. This is the essential origin of the vanishing gradient problem.
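We can check this comparison numerically. The sketch below uses illustrative parameter values of my own and assumes a quadratic cost C = (a4 − y)²/2; it evaluates both gradients via the product formulas described in the text and confirms the first against a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(a0, w, b):
    a = a0
    zs = []
    for wj, bj in zip(w, b):
        z = wj * a + bj
        zs.append(z)
        a = sigmoid(z)
    return zs, a

a0, y = 0.8, 0.2               # input activation and target (illustrative)
w = [0.6, -0.7, 0.5, 0.9]      # modest weights, |w_j| < 1
b = [0.1, -0.2, 0.3, -0.1]

zs, a4 = forward(a0, w, b)
dC_da4 = a4 - y                # derivative of the quadratic cost

# Product formulas for the two gradients
dC_db1 = (sigmoid_prime(zs[0]) * w[1] * sigmoid_prime(zs[1])
          * w[2] * sigmoid_prime(zs[2]) * w[3] * sigmoid_prime(zs[3]) * dC_da4)
dC_db3 = sigmoid_prime(zs[2]) * w[3] * sigmoid_prime(zs[3]) * dC_da4

# Finite-difference check on b1
eps = 1e-6
_, a4_plus = forward(a0, w, [b[0] + eps] + b[1:])
numeric_db1 = ((a4_plus - y) ** 2 / 2 - (a4 - y) ** 2 / 2) / eps
```

With these values, |∂C/∂b1| comes out far smaller than |∂C/∂b3|, exactly because of the extra w2σ′(z2) and w3σ′(z3) factors.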

Of course, this is an informal argument, not a rigorous proof that the vanishing gradient problem will occur. There are several possible escape clauses. In particular, we might wonder whether the weights wj could grow during training. If they do, it’s possible the terms wjσ′(zj) in the product will no longer satisfy |wjσ′(zj)| < 1/4. Indeed, if the terms get large enough – greater than 1 – then we will no longer have a vanishing gradient problem. Instead, the gradient will actually grow exponentially as we move backward through the layers. Instead of a vanishing gradient problem, we’ll have an exploding gradient problem.


The exploding gradient problem: Let’s look at an explicit example where exploding gradients occur. The example is somewhat contrived: I’m going to fix parameters in the network in just the right way to ensure we get an exploding gradient. But even though the example is contrived, it has the virtue of firmly establishing that exploding gradients aren’t merely a hypothetical possibility, they really can happen.


There are two steps to getting an exploding gradient. First, we choose all the weights in the network to be large, say w1 = w2 = w3 = w4 = 100. Second, we’ll choose the biases so that the σ′(zj) terms are not too small. That’s actually pretty easy to do: all we need do is choose the biases to ensure that the weighted input to each neuron is zj = 0 (and so σ′(zj) = 1/4). So, for instance, we want z1 = w1a0 + b1 = 0. We can achieve this by setting b1 = −100·a0. We can use the same idea to select the other biases. When we do this, we see that all the terms wjσ′(zj) are equal to 100 × 1/4 = 25. With these choices we get an exploding gradient.
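The construction above is simple enough to verify in a few lines. A sketch (the input activation 0.5 is my own illustrative choice) that builds the chain layer by layer, choosing each bias so that zj = 0:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

a = 0.5                 # input activation (illustrative)
terms = []
for _ in range(4):
    w = 100.0
    b = -100.0 * a      # chosen so that z = w*a + b = 0
    z = w * a + b
    terms.append(w * sigmoid_prime(z))   # each term is 100 * (1/4) = 25
    a = sigmoid(z)      # sigmoid(0) = 0.5, so the same bias works at every layer
```

Each term equals exactly 25, so the four-term product is 25⁴ = 390625 – the gradient explodes as we move backward.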


The unstable gradient problem: The fundamental problem here isn’t so much the vanishing gradient problem or the exploding gradient problem. It’s that the gradient in early layers is the product of terms from all the later layers. When there are many layers, that’s an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out. Without some mechanism or underlying reason for that balancing to occur, it’s highly unlikely to happen simply by chance. In short, the real problem here is that neural networks suffer from an unstable gradient problem. As a result, if we use standard gradient-based learning techniques, different layers in the network will tend to learn at wildly different speeds.




In our discussion of the vanishing gradient problem, we made use of the fact that |σ′(z)| < 1/4. Suppose we used a different activation function, one whose derivative could be much larger. Would that help us avoid the unstable gradient problem?

The prevalence of the vanishing gradient problem: We’ve seen that the gradient can either vanish or explode in the early layers of a deep network. In fact, when using sigmoid neurons the gradient will usually vanish. To see why, consider again the expression |wσ′(z)|. To avoid the vanishing gradient problem we need |wσ′(z)| ≥ 1. You might think this could happen easily if w is very large. However, it’s more difficult than it looks. The reason is that the σ′(z) term also depends on w: σ′(z) = σ′(wa+b), where a is the input activation. So when we make w large, we need to be careful that we’re not simultaneously making σ′(wa+b) small. That turns out to be a considerable constraint. The reason is that when we make w large we tend to make wa+b very large. Looking at the graph of σ′ you can see that this puts us off in the “wings” of the σ′ function, where it takes very small values. The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below). Sometimes that will chance to happen. More often, though, it does not happen. And so in the generic case we have vanishing gradients.




Consider the product |wσ′(wa+b)|. Suppose |wσ′(wa+b)| ≥ 1. (1) Argue that this can only ever occur if |w| ≥ 4. (2) Supposing that |w| ≥ 4, consider the set of input activations a for which |wσ′(wa+b)| ≥ 1. Show that the set of a satisfying that constraint can range over an interval no greater in width than

(2/|w|) ln( |w|(1 + √(1 − 4/|w|))/2 − 1 )

(3) Show numerically that the above expression bounding the width of the range is greatest at |w| ≈ 6.9, where it takes a value ≈ 0.45. And so even given that everything lines up just perfectly, we still have a fairly narrow range of input activations which can avoid the vanishing gradient problem.
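Part (3) can be checked directly by evaluating the bound from part (2) on a fine grid. A sketch (assuming the width bound (2/|w|)·ln(|w|(1 + √(1 − 4/|w|))/2 − 1), which is only defined for |w| ≥ 4):

```python
import numpy as np

def interval_width(w):
    """Width bound on the set of admissible input activations, for |w| >= 4."""
    w = np.abs(w)
    return (2.0 / w) * np.log(w * (1.0 + np.sqrt(1.0 - 4.0 / w)) / 2.0 - 1.0)

ws = np.linspace(4.001, 20.0, 20_000)
widths = interval_width(ws)
w_star = ws[np.argmax(widths)]   # |w| at which the bound is largest
max_width = widths.max()
```

The grid search locates the maximum near |w| ≈ 6.9 with a value ≈ 0.45, matching the claim.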

Identity neuron: Consider a neuron with a single input, x, a corresponding weight, w1, a bias b, and a weight w2 on the output. Show that by choosing the weights and bias appropriately, we can ensure w2σ(w1x+b) ≈ x for x ∈ [0,1]. Such a neuron can thus be used as a kind of identity neuron, that is, a neuron whose output is the same (up to rescaling by a weight factor) as its input. Hint: It helps to rewrite x = 1/2 + Δ, to assume w1 is small, and to use a Taylor series expansion in w1Δ.
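The Taylor-expansion hint can be illustrated numerically. The sketch below is not a complete solution to the problem; it just verifies the key fact behind the hint: with b = −w1/2 (so the neuron sees z = w1Δ, with Δ = x − 1/2) and the rescaling w2 = 4/w1 (inverting the slope σ′(0) = 1/4), the neuron’s centred output reproduces Δ almost exactly when w1 is small:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w1 = 0.01              # small first-layer weight, per the hint
b = -w1 / 2.0          # centres the sigmoid: z = w1*(x - 1/2) = w1*Delta
w2 = 4.0 / w1          # inverts the slope sigmoid'(0) = 1/4

x = np.linspace(0.0, 1.0, 101)
delta = x - 0.5
# Centred, rescaled output of the neuron, compared against Delta
out = w2 * (sigmoid(w1 * x + b) - 0.5)
max_error = np.max(np.abs(out - delta))
```

Since σ(z) = 1/2 + z/4 − z³/48 + …, the leading error is of order w1²Δ³/12, which is tiny for small w1 – the neuron behaves linearly, i.e. as an identity on the centred coordinate.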

Unstable gradients in more complex networks


We’ve been studying toy networks, with just one neuron in each hidden layer. What about more complex deep networks, with many neurons in each hidden layer?



In fact, much the same behaviour occurs in such networks. In the earlier chapter on backpropagation we saw that the gradient in the lth layer of an L-layer network is given by:


δl = Σ′(zl) (wl+1)T Σ′(zl+1) (wl+2)T … Σ′(zL) ∇aC

Here, Σ′(zl) is a diagonal matrix whose entries are the σ′(z) values for the weighted inputs to the lth layer. The wl are the weight matrices for the different layers. And ∇aC is the vector of partial derivatives of C with respect to the output activations.


This is a much more complicated expression than in the single-neuron case. Still, if you look closely, the essential form is very similar, with lots of pairs of the form (wj)TΣ′(zj). What’s more, the matrices Σ′(zj) have small entries on the diagonal, none larger than 1/4. Provided the weight matrices wj aren’t too large, each additional term (wj)TΣ′(zj) tends to make the gradient vector smaller, leading to a vanishing gradient. More generally, the large number of terms in the product tends to lead to an unstable gradient, just as in our earlier example. In practice, it is typically found empirically in sigmoid networks that gradients vanish exponentially quickly in earlier layers. As a result, learning slows down in those layers. This slowdown isn’t merely an accident or an inconvenience: it’s a fundamental consequence of the approach we’re taking to learning.
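The matrix expression can be evaluated directly. Below is a sketch (layer sizes, Gaussian initialization, and the stand-in cost gradient are all my own choices) that applies δl = Σ′(zl)(wl+1)T δl+1 backwards through a random sigmoid network and records the gradient norm at each layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
sizes = [30, 30, 30, 30, 30]       # five layers of 30 neurons (illustrative)
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

# Forward pass (biases omitted for brevity)
a = rng.random(sizes[0])
zs = []
for w in weights:
    z = w @ a
    zs.append(z)
    a = sigmoid(z)

# Backward pass: delta^l = Sigma'(z^l) (w^{l+1})^T delta^{l+1}
grad_aC = np.ones(sizes[-1])       # stand-in for the cost gradient
delta = sigmoid_prime(zs[-1]) * grad_aC
norms = [np.linalg.norm(delta)]
for w, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
    delta = sigmoid_prime(z) * (w.T @ delta)
    norms.append(np.linalg.norm(delta))
norms.reverse()                    # norms[0] is now the earliest layer
```

Printing `norms` for different seeds and depths is an easy way to see the instability the text describes: the per-layer norms typically shrink (or occasionally grow) by a roughly constant factor per layer.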


Other obstacles to deep learning


In this chapter we’ve focused on vanishing gradients – and, more generally, unstable gradients – as an obstacle to deep learning. In fact, unstable gradients are just one obstacle to deep learning, albeit an important fundamental obstacle. Much ongoing research aims to better understand the challenges that can occur when training deep networks. I won’t comprehensively summarize that work here, but just want to briefly mention a couple of papers, to give you the flavor of some of the questions people are asking.


As a first example, in 2010 Glorot and Bengio* *Understanding the difficulty of training deep feedforward neural networks, by Xavier Glorot and Yoshua Bengio (2010). See also the earlier discussion of the use of sigmoids in Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998). found evidence suggesting that the use of sigmoid activation functions can cause problems training deep networks. In particular, they found evidence that the use of sigmoids will cause the activations in the final hidden layer to saturate near 0 early in training, substantially slowing down learning. They suggested some alternative activation functions, which appear not to suffer as much from this saturation problem.


As a second example, in 2013 Sutskever, Martens, Dahl and Hinton* *On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl and Geoffrey Hinton (2013). studied the impact on deep learning of both the random weight initialization and the momentum schedule in momentum-based stochastic gradient descent. In both cases, making good choices made a substantial difference in the ability to train deep networks.


These examples suggest that “What makes deep networks hard to train?” is a complex question. In this chapter, we’ve focused on the instabilities associated to gradient-based learning in deep networks. The results in the last two paragraphs suggest that there is also a role played by the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented. And, of course, choice of network architecture and other hyper-parameters is also important. Thus, many factors can play a role in making deep networks hard to train, and understanding all those factors is still a subject of ongoing research. This all seems rather downbeat and pessimism-inducing. But the good news is that in the next chapter we’ll turn that around, and develop several approaches to deep learning that to some extent manage to overcome or route around all these challenges.