So, why do ResNets work so well?
Let's go through one example that illustrates why ResNets work so well,
at least in the sense of how you can make them deeper and deeper without really
hurting your ability to at least get them to do well on the training set.
And hopefully, as you've understood from the third course in this sequence,
doing well on the training set is usually a prerequisite to doing
well on your hold-out or your dev or your test sets.
So, being able to at least train a ResNet to do well on
the training set is a good first step toward that. Let's look at an example.
What we saw on the last video was that if you make a network deeper,
it can hurt your ability to train the network to do well on the training set.
And that's why sometimes you don't want a network that is too deep.
But this is not true, or at least is much less true, when you're training a ResNet.
So let's go through an example.
Let's say you have X feeding into
some big neural network, which outputs some activation a[l].
Let's say for this example that you are going to modify
the neural network to make it a little bit deeper.
So, we use the same big NN,
which outputs a[l],
and we're going to add a couple of extra layers to this network, so
let's add one layer there and another layer there,
and have it output a[l+2].
Only let's make this a ResNet block,
a residual block with that extra short cut.
And for the sake of our argument,
let's say throughout this network we're using the ReLU activation function.
So, all the activations are going to be greater than or equal to zero,
with the possible exception of the input X.
Right. Because the ReLU activation outputs numbers that are either zero or positive.
Now, let's look at what a[l+2] will be.
To copy the expression from the previous video,
a[l+2] will be ReLU applied to z[l+2] plus a[l],
where this addition of a[l]
comes from the shortcut, from the skip connection, that we just added.
And if we expand this out,
this is equal to g of w[l+2]
times a[l+1], plus b[l+2] (that's z[l+2]), plus a[l].
Now notice something: if you are using L2 regularization, a.k.a. weight decay,
that will tend to shrink the value of w[l+2].
If you are applying weight decay to b, that will also shrink it, although
I guess in practice sometimes you do and sometimes you don't apply weight decay to b,
but w is really the key term to pay attention to here.
And if w[l+2] is equal to zero,
and let's say for the sake of argument that b[l+2] is also equal to zero,
then these terms go away because they're equal to zero,
and then g of a[l]
is just equal to a[l], because we assumed we're using the ReLU activation function,
and so all of the activations are non-negative. So,
g of a[l] is the ReLU applied to a non-negative quantity,
so you just get back a[l].
So, what this shows is that the identity function is easy for the residual block to learn.
And it's easy to get a[l+2] equal to a[l] because of this skip connection.
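To make this identity argument concrete, here's a small numpy sketch. The dimensions, the random activations, and the names are all made up for illustration; it just checks that when w[l+2] and b[l+2] shrink to zero, the residual block outputs a[l] unchanged:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
a_l = relu(rng.standard_normal(4))     # a[l]: non-negative, since it came through a ReLU
a_l1 = relu(rng.standard_normal(4))    # hypothetical a[l+1] from the first extra layer

# Suppose weight decay has shrunk W[l+2] and b[l+2] all the way to zero.
W = np.zeros((4, 4))
b = np.zeros(4)

z_l2 = W @ a_l1 + b                    # z[l+2] = W[l+2] a[l+1] + b[l+2] = 0
a_l2 = relu(z_l2 + a_l)                # a[l+2] = g(z[l+2] + a[l]) = g(a[l]) = a[l]

assert np.allclose(a_l2, a_l)          # the block computes the identity
```

Because a[l] is already non-negative, the final ReLU passes it through untouched, which is exactly the point of the argument.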
And what that means is that adding these two layers to your neural network
doesn't really hurt your neural network's ability to do as
well as the simpler network without these two extra layers,
because it's quite easy for it to learn the identity function, to just copy
a[l] to a[l+2], despite the addition of these two layers.
And this is why adding two extra layers,
adding this residual block somewhere in
the middle or at the end of this big neural network, doesn't hurt performance.
But of course our goal is not just to not hurt performance,
it's to help performance, and so you can imagine that if
these hidden units actually learn something useful, then
maybe you can do even better than learning the identity function.
And what goes wrong in very deep plain nets, in very deep networks without
these residual or skip connections, is
that when you make the network deeper and deeper,
it's actually very difficult for it to choose parameters that learn
even the identity function, which is why a lot of layers
end up making your result worse rather than making your result better.
And I think the main reason the residual network works is
that it's so easy for these extra layers to learn
the identity function that you're kind of guaranteed that it doesn't hurt
performance, and then a lot of the time you maybe get lucky and it even helps performance.
At least it's easier to start from a decent baseline of not
hurting performance, and then gradient descent can only improve the solution from there.
So, one more detail in the residual network that's
worth discussing, which is this addition here:
we're assuming that z[l+2] and a[l] have the same dimension.
And so what you see in ResNets is a lot of use of same convolutions,
so that the dimension of this is
equal to the dimension, I guess, of this layer or of the output layer.
That way we can actually do this shortcut connection,
because same convolutions preserve dimensions,
and so that makes it easier for you to carry out
this shortcut and then carry out this addition of two equal-dimension vectors.
In case the input and output have different dimensions, so for example,
if a[l] is 128-dimensional and z[l+2], and therefore
a[l+2], is 256-dimensional as an example,
what you would do is add an extra matrix, and call that Ws here.
Ws in this example would be a 256 by 128 dimensional matrix,
so then Ws times a[l] becomes 256-dimensional, and
this addition is now between
two 256-dimensional vectors. There are a few things you could do with Ws:
it could be a matrix of parameters that we learn,
or it could be a fixed matrix that just implements
zero padding, that takes a[l] and zero-pads
it to be 256-dimensional, and either of those versions, I guess, could work.
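Here's a sketch of those two options for Ws. The 128 and 256 dimensions follow the example above; the random values stand in for learned parameters and are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
a_l = rng.standard_normal(128)    # a[l] is 128-dimensional
z_l2 = rng.standard_normal(256)   # z[l+2] is 256-dimensional

# Option 1: Ws is a 256 x 128 matrix of learned parameters
# (random here, standing in for whatever training would find).
Ws = rng.standard_normal((256, 128))
out_learned = z_l2 + Ws @ a_l
assert out_learned.shape == (256,)

# Option 2: Ws is a fixed matrix that just zero-pads a[l] up to 256 dimensions,
# equivalent to concatenating 128 zeros onto a[l].
a_padded = np.concatenate([a_l, np.zeros(128)])
out_padded = z_l2 + a_padded
assert out_padded.shape == (256,)
```

Either way, the shortcut addition ends up between two 256-dimensional vectors, which is all the block requires.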
So finally, let's take a look at ResNets on images.
So these are images I got from the paper by He et al.
This is an example of a plain network and in which you input an image
and then have a number of conv layers
until eventually you have a softmax output at the end.
To turn this into a ResNet,
you add those extra skip connections.
And I'll just mention a few details: there are a lot of
three by three convolutions here, and most of these are
three by three same convolutions,
and that's why you're adding equal-dimension feature vectors.
So rather than fully connected layers,
these are actually convolutional layers, but because they are same convolutions,
the dimensions are preserved, and so the z[l+2] plus a[l] addition makes sense.
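To see why a same convolution keeps that addition well defined, here's the standard convolution output-size formula as a small Python helper; the 56 by 56 feature-map size is just an assumed example, not a dimension from the slide:

```python
def conv_output_size(n, f, p, s=1):
    """Output height/width of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# A 3x3 "same" convolution uses padding p = (f - 1) / 2 = 1 with stride 1,
# so spatial dimensions are preserved and z[l+2] + a[l] lines up.
assert conv_output_size(56, f=3, p=1) == 56

# Without padding, the dimensions shrink and the addition would not line up.
assert conv_output_size(56, f=3, p=0) == 54
```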
And similar to what you've seen in a lot of networks before,
you have a bunch of convolutional layers, and then there are
occasionally pooling layers, or pooling-like layers, as well.
And whenever one of those happens,
you need to make an adjustment to the dimension, which we saw on the previous slide;
you can do that with the matrix Ws.
And then, as is common in these networks,
you have a pooling layer, and then at the end you have
a fully connected layer that makes a prediction using a softmax.
So that's it for ResNet.
Next, there's a very interesting idea
behind using neural networks with one by one filters,
one by one convolutions.
So, why would you use a one by one convolution?
Let's take a look in the next video.
Very, very deep neural networks are difficult to train because
of vanishing and exploding gradient types of problems.
In this video, you'll learn about
skip connections, which allow you to take the activation from
one layer and suddenly feed it to another layer, even much deeper in the neural network.
And using that, you'll build ResNet which enables you to train very, very deep networks.
Sometimes even networks of over 100 layers. Let's take a look.
ResNets are built out of something called a residual block,
let's first describe what that is.
Here are two layers of a neural network where you
start off with some activations in layer a[l],
then go to a[l+1], and then the activation two layers later is a[l+2].
So let's go through the steps in this computation. You have a[l],
and then the first thing you do is apply this linear operator to it,
which is governed by this equation.
So you go from a[l] to compute z[l+1]
by multiplying by the weight matrix and adding that bias vector.
After that, you apply the ReLU nonlinearity, to get a[l+1].
And that's governed by this equation where a[l+1] is g(z[l+1]).
Then in the next layer,
you apply this linear step again,
so it's governed by that equation.
So this is quite similar to the equation we saw on the left.
And then finally, you apply another ReLU operation which is
now governed by that equation where G here would be the ReLU nonlinearity.
And this gives you a[l+2].
So in other words,
for information from a[l] to flow to a[l+2],
it needs to go through all of these steps which I'm going to call
the main path of this set of layers.
In a residual net,
we're going to make a change to this.
We're going to take a[l],
and fast-forward it, copy it,
much further into the neural network, to here,
and just add a[l]
before applying the non-linearity, the ReLU non-linearity.
And I'm going to call this the shortcut.
So rather than needing to follow the main path,
the information from a[l] can now follow
a shortcut to go much deeper into the neural network.
And what that means is that this last equation
goes away and we instead have that the output
a[l+2] is the ReLU non-linearity g applied to z[l+2] as before,
but now plus a[l].
So, the addition of this a[l] here,
it makes this a residual block.
And in pictures, you can also modify the picture on
top by drawing this shortcut to go here.
And we are going to draw it as going into this second layer,
because the shortcut is actually added before the ReLU non-linearity.
So each of these nodes here
applies a linear function and a ReLU,
and a[l] is being injected after the linear part but before the ReLU part.
And sometimes instead of the term shortcut,
you also hear the term skip connection,
and that refers to a[l] just skipping over a layer, or kind of skipping over
almost two layers, in order to pass information deeper into the neural network.
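Putting the main path and the shortcut together, a single residual block can be sketched in numpy like this. The layer width and random weights are illustrative, not taken from the lecture:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """Two-layer residual block: the shortcut is added before the final ReLU."""
    z1 = W1 @ a_l + b1        # linear step of the first layer
    a1 = relu(z1)             # a[l+1] = g(z[l+1])
    z2 = W2 @ a1 + b2         # linear step of the second layer
    return relu(z2 + a_l)     # a[l+2] = g(z[l+2] + a[l]), not g(z[l+2])

rng = np.random.default_rng(2)
a_l = relu(rng.standard_normal(8))
W1, W2 = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
b1, b2 = np.zeros(8), np.zeros(8)

a_l2 = residual_block(a_l, W1, b1, W2, b2)
assert a_l2.shape == a_l.shape
```

The only difference from a plain two-layer stack is the `+ a_l` inside the final ReLU, which is the whole change the residual block makes.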
So, what the inventors of ResNet,
so that will be Kaiming He, Xiangyu Zhang,
Shaoqing Ren and Jian Sun.
What they found was that using residual blocks
allows you to train much deeper neural networks.
And the way you build a ResNet is by taking many of these residual blocks,
blocks like these, and stacking them together to form a deep network.
So, let's look at this network.
This is not the residual network,
this is called as a plain network.
This is the terminology of the ResNet paper.
To turn this into a ResNet,
what you do is add all those
skip connections, all those shortcut connections, like so.
So every two layers ends up with
that additional change that we saw on
the previous slide, to turn each of these into a residual block.
So this picture shows five residual blocks stacked together,
and this is a residual network.
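As a rough sketch of stacking residual blocks this way, here is a small numpy example; the width, weight scale, and block count are all assumed for illustration, and a fixed width is used so no Ws adjustment is needed:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a, W1, b1, W2, b2):
    a1 = relu(W1 @ a + b1)
    return relu(W2 @ a1 + b2 + a)   # shortcut added before the final ReLU

rng = np.random.default_rng(3)
dim = 16   # fixed width so every shortcut addition lines up

# Five residual blocks stacked together, as in the picture.
params = [(0.1 * rng.standard_normal((dim, dim)), np.zeros(dim),
           0.1 * rng.standard_normal((dim, dim)), np.zeros(dim))
          for _ in range(5)]

a = relu(rng.standard_normal(dim))
for W1, b1, W2, b2 in params:
    a = residual_block(a, W1, b1, W2, b2)
assert a.shape == (dim,)
```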
And it turns out that if you use
a standard optimization algorithm such as
gradient descent, or one of
the fancier optimization algorithms, to train a plain network,
so without all the extra residual connections,
without all the extra shortcuts or skip connections I just drew in,
empirically, you find that as you increase the number of layers,
the training error will tend to decrease for
a while, but then it will tend to go back up.
And in theory as you make a neural network deeper,
it should only do better and better on the training set.
Right. So, in theory,
having a deeper network should only help.
But in practice or in reality,
having a plain network, so no ResNet,
having a plain network that is very deep means that
your optimization algorithm just has a much harder time training.
And so, in reality,
your training error gets worse if you pick a network that's too deep.
But what happens with ResNet is that even as the number of layers gets deeper,
you can have the performance of the training error kind of keep on going down.
Even if we train a network with over a hundred layers.
And now some people are experimenting with networks of
over a thousand layers, although I don't see those used much in practice yet.
But by taking these activations, be it x or
these intermediate activations, and allowing them to go much deeper into the neural network,
this really helps with the vanishing and exploding gradient problems
and allows you to train
much deeper neural networks without really an appreciable loss in performance.
Maybe at some point this will plateau, this will flatten out,
and it won't help that much to have deeper and deeper networks.
But ResNet really is effective at helping train very deep networks.
So you've now gotten an overview of how ResNets work.
And in fact, in this week's programming exercise,
you get to implement these ideas and see it work for yourself.
But next, I want to share with you better intuition, or
even more intuition, about why ResNets work so well.
Let's go on to the next video.