Your very own neural network

In this notebook, we're going to build a neural network using naught but pure numpy and steel nerves. It's going to be fun, I promise!

Here goes our main class: a layer that can .forward() and .backward().
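
Since the class definition itself isn't included in this export, here is a minimal sketch of what such a base Layer could look like; the method names and the dummy identity behaviour are assumptions, not the official solution.

import numpy as np

class Layer:
    # A building block: each layer can process input via .forward()
    # and propagate gradients (and update its parameters) via .backward().

    def forward(self, input):
        # Dummy layer: identity, just return the input unchanged.
        return input

    def backward(self, input, grad_output):
        # Chain rule: d loss / d input = d loss / d output * d output / d input.
        # For the identity layer the local Jacobian is the identity matrix.
        num_units = input.shape[1]
        return np.dot(grad_output, np.eye(num_units))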

The road ahead

We're going to build a neural network that classifies MNIST digits. To do so, we'll need a few building blocks:

  • Dense layer - a fully-connected layer, $f(X) = X \cdot W + \vec{b}$
  • ReLU layer (or any other nonlinearity you want)
  • Loss function - crossentropy
  • Backprop algorithm - stochastic gradient descent with backpropagated gradients

Let's approach them one at a time.

Nonlinearity layer

This is the simplest layer you can get: it simply applies a nonlinearity elementwise to its input (a sketch follows the lambda primer below).

Instant primer: lambda functions

In python, you can define functions in one line using the lambda syntax: lambda param1, param2: expression

For example: f = lambda x, y: x+y is equivalent to a normal function:

def f(x, y):
    return x + y

For more information, click here.
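
With the lambda primer out of the way, here is a hedged sketch of what the ReLU nonlinearity layer might look like, reusing the Layer base sketch above (the class name and structure are assumptions, not the course's reference code):

class ReLU(Layer):
    def forward(self, input):
        # Apply elementwise max(0, x).
        return np.maximum(0, input)

    def backward(self, input, grad_output):
        # The gradient flows only through units whose input was positive.
        relu_grad = input > 0
        return grad_output * relu_grad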

Dense layer

Now let's build something more complicated. Unlike nonlinearity, a dense layer actually has something to learn.

A dense layer applies an affine transformation. In a vectorized form, it can be described as: $f(X) = X \cdot W + \vec{b}$

Where $X$ is an object-feature matrix of shape [batch_size, num_features], $W$ is a weight matrix of shape [num_features, num_outputs], and $\vec{b}$ is a vector of num_outputs biases.

Both W and b are initialized during layer creation and updated each time backward is called.
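
The Dense layer cell itself isn't shown here; a minimal sketch under the same forward/backward interface might look like the following (the 0.01-scaled normal initialization matches the experiments described later, but the learning_rate parameter and the SGD-step-inside-backward design are assumptions):

class Dense(Layer):
    def __init__(self, input_units, output_units, learning_rate=0.1):
        # f(X) = X.dot(W) + b
        self.learning_rate = learning_rate
        self.weights = np.random.randn(input_units, output_units) * 0.01
        self.biases = np.zeros(output_units)

    def forward(self, input):
        return np.dot(input, self.weights) + self.biases

    def backward(self, input, grad_output):
        # Gradient w.r.t. the input, to pass on to the previous layer.
        grad_input = np.dot(grad_output, self.weights.T)
        # Gradients w.r.t. parameters: sum over the batch
        # (grad_output is already divided by batch size upstream).
        grad_weights = np.dot(input.T, grad_output)
        grad_biases = grad_output.sum(axis=0)
        # Plain SGD step.
        self.weights = self.weights - self.learning_rate * grad_weights
        self.biases = self.biases - self.learning_rate * grad_biases
        return grad_input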

Testing the dense layer

Here we have a few tests to make sure your dense layer works properly. You can just run them, get 3 "well done"s and forget they ever existed.

... or not get 3 "well done"s and go fix stuff. If that is the case, here are some tips for you:

  • Make sure you compute gradients for W and b as the sum of gradients over the batch, not the mean: grad_output is already divided by the batch size.
  • If you're debugging, try saving gradients in class fields (e.g. self.grad_w = grad_w) or printing the first 3-5 weights; it makes it much easier to see what's going on.
  • If nothing else helps, try ignoring the tests and proceeding to network training. If it trains alright, you may be off by something that does not affect network training.
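
If you'd rather roll your own check, one generic approach is a numerical gradient check; below is a rough sketch (the eval_numerical_gradient helper and the specific shapes are illustrative assumptions, not part of the provided tests):

import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    # Estimate d f / d x by central differences, one element at a time.
    grad = np.zeros_like(x)
    for ix in np.ndindex(*x.shape):
        old = x[ix]
        x[ix] = old + h
        f_plus = f(x)
        x[ix] = old - h
        f_minus = f(x)
        x[ix] = old
        grad[ix] = (f_plus - f_minus) / (2 * h)
    return grad

# Example: compare Dense.backward's grad_input with the numeric estimate,
# using the sum of outputs as a stand-in scalar "loss" (grad_output is all ones).
x = np.random.randn(4, 8)
layer = Dense(8, 16, learning_rate=0.0)  # lr=0 so backward() doesn't move the weights
analytic = layer.backward(x, np.ones((4, 16)))
numeric = eval_numerical_gradient(lambda x: layer.forward(x).sum(), x)
assert np.allclose(analytic, numeric, rtol=1e-3, atol=1e-6)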

The loss function

Since we want to predict probabilities, it would be logical for us to define softmax nonlinearity on top of our network and compute loss given predicted probabilities. However, there is a better way to do so.

If you write down the expression for crossentropy as a function of softmax logits (a), you'll see:

$$ loss = -\log \frac{e^{a_{correct}}}{\sum_i e^{a_i}} $$

If you take a closer look, you'll see that it can be rewritten as:

$$ loss = -a_{correct} + \log \sum_i e^{a_i} $$

It's called Log-softmax and it's better than naive log(softmax(a)) in all aspects:

  • Better numerical stability
  • Easier to get derivative right
  • Marginally faster to compute

So why not just use log-softmax throughout the computation and never actually bother to estimate probabilities?

Here you are! We've defined both loss functions for you so that you can focus on the neural network part.

Let's find a stable version of cross entropy

Let's find a stable version of cross entropy gradient
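
The actual implementations ship with the notebook; for reference, a numerically stable version could look roughly like this (a sketch, assuming logits is a [batch_size, n_classes] array and reference_answers holds integer class labels):

import numpy as np

def softmax_crossentropy_with_logits(logits, reference_answers):
    # Shift by the per-row max so exp() can't overflow, then use
    # loss = -a[correct] + log(sum_i exp(a_i)).
    a = logits - logits.max(axis=-1, keepdims=True)
    logits_for_answers = a[np.arange(len(a)), reference_answers]
    return -logits_for_answers + np.log(np.exp(a).sum(axis=-1))

def grad_softmax_crossentropy_with_logits(logits, reference_answers):
    # d loss / d logits = (softmax(a) - one_hot(correct)) / batch_size,
    # so grad_output arrives at the layers already divided by batch size.
    a = logits - logits.max(axis=-1, keepdims=True)
    softmax = np.exp(a) / np.exp(a).sum(axis=-1, keepdims=True)
    ones_for_answers = np.zeros_like(logits)
    ones_for_answers[np.arange(len(logits)), reference_answers] = 1
    return (softmax - ones_for_answers) / logits.shape[0]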

Full network

Now let's combine what we've just built into a working neural network. As we announced, we're gonna use this monster to classify handwritten digits, so let's get them loaded.

We'll define the network as a list of layers, each applied on top of the previous one. In this setting, computing predictions and training becomes trivial.
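
The corresponding cells aren't visible in this export; as a rough sketch of what forward, predict and train could look like for a network stored as a list of layers (the function names and signatures here are assumptions):

import numpy as np

def forward(network, X):
    # Run X through every layer and collect all activations;
    # activations[-1] are the output logits.
    activations = []
    input = X
    for layer in network:
        input = layer.forward(input)
        activations.append(input)
    return activations

def predict(network, X):
    # The class with the largest logit wins.
    logits = forward(network, X)[-1]
    return logits.argmax(axis=-1)

def train(network, X, y):
    # Forward pass, then backpropagate the crossentropy gradient
    # through the layers in reverse order.
    layer_activations = forward(network, X)
    layer_inputs = [X] + layer_activations  # layer_inputs[i] feeds network[i]
    logits = layer_activations[-1]
    loss = softmax_crossentropy_with_logits(logits, y)
    grad = grad_softmax_crossentropy_with_logits(logits, y)
    for i in reversed(range(len(network))):
        grad = network[i].backward(layer_inputs[i], grad)
    return np.mean(loss)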

Instead of tests, we provide you with a training loop that prints training and validation accuracies on every epoch.

If your implementation of forward and backward are correct, your accuracy should grow from 90~93% to >97% with the default network.

Training loop

As usual, we split data into minibatches, feed each such minibatch into the network and update weights.
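
The loop itself is provided by the notebook; a hedged sketch of what it boils down to (X_train, y_train, X_val, y_val are assumed to come from the MNIST loading step above, network is the list of layers, and train/predict are from the sketch in the previous section):

import numpy as np

def iterate_minibatches(X, y, batchsize, shuffle=True):
    # Yield (inputs, targets) chunks in a random order each epoch.
    indices = np.arange(len(X))
    if shuffle:
        np.random.shuffle(indices)
    for start in range(0, len(X) - batchsize + 1, batchsize):
        batch = indices[start:start + batchsize]
        yield X[batch], y[batch]

for epoch in range(25):
    for x_batch, y_batch in iterate_minibatches(X_train, y_train, batchsize=32):
        train(network, x_batch, y_batch)
    train_acc = np.mean(predict(network, X_train) == y_train)
    val_acc = np.mean(predict(network, X_val) == y_val)
    print("Epoch", epoch, "train acc:", train_acc, "val acc:", val_acc)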

Peer-reviewed assignment

Congratulations, you managed to get this far! There is just one quest left undone, and this time you'll get to choose what to do.

Option I: initialization

  • Implement a Dense layer with Xavier initialization, as explained here

To pass this assignment, you must conduct an experiment showing how Xavier initialization compares to the default initialization on deep networks (5+ layers).
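
For reference, only the weight initialization needs to change relative to the Dense sketch above; a possible version (assuming the Glorot/Xavier scale sqrt(2 / (fan_in + fan_out))):

class DenseXavier(Dense):
    def __init__(self, input_units, output_units, learning_rate=0.1):
        super().__init__(input_units, output_units, learning_rate)
        # Xavier/Glorot initialization: scale the variance by fan-in + fan-out
        # so activations and gradients keep roughly the same magnitude per layer.
        scale = np.sqrt(2.0 / (input_units + output_units))
        self.weights = np.random.randn(input_units, output_units) * scale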

Option II: regularization

  • Implement a version of Dense layer with L2 regularization penalty: when updating Dense Layer weights, adjust gradients to minimize

$$ Loss = Crossentropy + \alpha \cdot \sum_i w_i^2 $$

To pass this assignment, you must conduct an experiment showing whether regularization mitigates overfitting when the number of neurons is abundantly large. Consider tuning $\alpha$ for better results.
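
One way to plug this in, building on the Dense sketch above (the 2*alpha*W gradient term follows directly from the penalty; the alpha default and everything else is an assumption):

class DenseL2(Dense):
    def __init__(self, input_units, output_units, learning_rate=0.1, alpha=1e-3):
        super().__init__(input_units, output_units, learning_rate)
        self.alpha = alpha  # L2 penalty coefficient

    def backward(self, input, grad_output):
        grad_input = np.dot(grad_output, self.weights.T)
        # Add the gradient of alpha * sum(W**2) on top of the data gradient.
        grad_weights = np.dot(input.T, grad_output) + 2 * self.alpha * self.weights
        grad_biases = grad_output.sum(axis=0)
        self.weights = self.weights - self.learning_rate * grad_weights
        self.biases = self.biases - self.learning_rate * grad_biases
        return grad_input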

Option III: optimization

  • Implement a version of Dense layer that uses momentum/rmsprop or whatever method worked best for you last time.

Most of those methods require persistent parameters like momentum direction or moving average grad norm, but you can easily store those params inside your layers.

To pass this assignment, you must conduct an experiment showing how your chosen method performs compared to vanilla SGD.
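
Taking momentum as an example, the persistent state can indeed live inside the layer; a sketch on top of the Dense sketch above (the classical-momentum update and the 0.9 default are assumptions):

class DenseMomentum(Dense):
    def __init__(self, input_units, output_units, learning_rate=0.1, momentum=0.9):
        super().__init__(input_units, output_units, learning_rate)
        self.momentum = momentum
        # Persistent optimizer state stored inside the layer.
        self.velocity_w = np.zeros((input_units, output_units))
        self.velocity_b = np.zeros(output_units)

    def backward(self, input, grad_output):
        grad_input = np.dot(grad_output, self.weights.T)
        grad_weights = np.dot(input.T, grad_output)
        grad_biases = grad_output.sum(axis=0)
        # Classical momentum: v = mu * v - lr * grad; w = w + v
        self.velocity_w = self.momentum * self.velocity_w - self.learning_rate * grad_weights
        self.velocity_b = self.momentum * self.velocity_b - self.learning_rate * grad_biases
        self.weights = self.weights + self.velocity_w
        self.biases = self.biases + self.velocity_b
        return grad_input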

General remarks

Please read the peer-review guidelines before starting this part of the assignment.

In short, a good solution is one that:

  • is based on this notebook
  • runs in the default course environment with Run All
  • its code doesn't cause spontaneous eye bleeding
  • its report is easy to read.

Formally we can't ban you from writing boring reports, but if you bored your reviewer to death, there's no one left alive to give you the grade you want.

Bonus assignments

As a bonus assignment (no points, just swag), consider implementing Batch Normalization (guide) or Dropout (guide). Note, however, that those "layers" behave differently when training and when predicting on the test set.

  • Dropout:

    • During training: drop units randomly with probability p and multiply everything by 1/(1-p)
    • During final prediction: do nothing; pretend there's no dropout
  • Batch normalization

    • During training, it subtracts the mean-over-batch and divides by the std-over-batch, and updates running estimates of the mean and variance.
    • During final prediction, it uses accumulated mean and variance.
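
For illustration, here's a hedged sketch of how the train/test switch might be handled for Dropout, using inverted dropout so prediction needs no rescaling (the train flag and class layout are assumptions, not a reference solution):

class Dropout(Layer):
    def __init__(self, p=0.5):
        self.p = p          # probability of dropping a unit
        self.train = True   # flip to False before the final prediction

    def forward(self, input):
        if not self.train:
            # Prediction time: do nothing, pretend there's no dropout.
            return input
        # Drop with probability p, scale survivors by 1/(1-p).
        self.mask = (np.random.rand(*input.shape) >= self.p) / (1.0 - self.p)
        return input * self.mask

    def backward(self, input, grad_output):
        if not self.train:
            return grad_output
        # Only surviving units pass gradient back, with the same scaling.
        return grad_output * self.mask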

Peer-reviewed assignments

Option I: initialization

For a small network, no noticeable improvement is observed for any learning rate.

Let's try a bigger network

For low learning rates, networks learn faster and achieve higher validation accuracy when trained with Xavier initialization.

For higher learning rates, networks with 'normal' initialization simply do not learn at all, while networks with Xavier initialization exhibit very good performance (not counting the network with the highest learning rate, where a numerical overflow occurred).

Let's try an even bigger network!

For big networks the effect is even more pronounced: with 'normal' initialization, networks do not learn at all for any learning rate, while networks with Xavier initialization show very good learning curves (the last network again experienced a numerical overflow).

Conclusion

Xavier initialization allows networks to learn faster and achieve higher accuracies compared to initialization with a normal distribution with a fixed standard deviation (0.01 in our experiments). For large networks, Xavier initialization even makes it possible to successfully train networks that do not learn at all with 'normal' initialization.

Option II: regularization

Note: rerun without lr = 0.5 and with alpha = 0.003 or higher.

Option III:
