So you want to teach a computer to recognize handwritten digits? You want to code this out in Python? You understand a little about Machine Learning? You wanna build a neural network?
Let’s try and implement a simple 3-layer neural network (NN) from scratch. I won’t get into the math because I suck at math, let alone trying to teach it. I can also point to moar math resources if you read up on the details.
I assume you’re familiar with basic Machine Learning concepts like classification and regularization. Oh, and how optimization techniques like gradient descent work.
So, why not teach you Tensorflow or some other deep learning framework? I found that I learn best when I see the code, and learn the basics of the implementation. I find it helps me with intuition in choosing each part of the model. Of course, there are some AutoML solutions that could get me quicker ways to a baseline, but I still wouldn’t know anything. I’m trying to get out of just running the code like a script kiddie.
So let’s get started!
For the past few months (thanks Arvin), I have learned to appreciate both Classic Machine Learning (prior 2012) and Deep Learning techniques to model Kaggle competition data.
The handwritten digits competition was my first attempt at deep learning. So, I think it’s appropriate that it’s your first example to do deep learning. I remember this important gotcha moment. It was seeing the relationships between the data and pictures. It helped me to imagine the deep learning concepts visually.
What does the data look like?
We’re going to use the classic visual recognition challenge data set, called the MNIST data set. Kaggle competitions are awesome because you can self score your solutions and they provide data in simple clean CSV files. If successful, we should have a deep learning solution that should be the able to classify 25,000 images with a correct label. Let’s look at the CSV data.
Using a Jupyter notebook, let’s dump the data into a numpy matrix, and reshape it back into a picture. Each digit has been normalized to a 28 by 28 matrix.
The goal is to take the training data as an input (handwritten digit), pump it through the deep learning model, and predict if the data is a 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9.
Architecture of a Simple Neural Network
1. Picking the shape of the neural network. I’m gonna choose a simple NN consisting of three layers:
- First Layer: Input layer (784 neurons)
- Second Layer: Hidden layer (n = 15 neurons)
- Third Layer: Output layer
Here’s a look of the 3 layer network proposed above:
Basic Structure of the code
Data structure to hold our data
2. Picking the right matrix data structure. Nested python lists? CudaMAT? Python Dict? I’m choosing numpy because we’ll heavily use np.dot, np.reshape, np.random, np.zeros, np.argmax, and np.exp functions that I’m not really interested in implementing from scratch.
Simulating perceptrons using an Activation Function
3. Picking the activation function for our hidden layer. The activation function transforms the inputs of the hidden layer into its outputs. Common choices for activation functions are tanh, the sigmoid function, or ReLUs. We’ll use the sigmoid function.
Python Neural Network Object
Feed Forward Function
a.k.a The Forward Pass
The purpose of the feed forward function is to pass the input into the NN matrix and return the new activations.
Stochastic Gradient Descent function (SGD)
Update Mini Batch Function
Mini-batch gradient descent can work a bit faster than stochastic gradient descent. In Batch gradient descent we will use all m examples in each generation. Whereas in Stochastic gradient descent we will use a single example in each generation. What Mini-batch gradient descent does is somewhere in between. Specifically, with this algorithm we’re going to use b examples in each iteration where b is a parameter called the “mini batch size” so the idea is that this is somewhat in-between Batch gradient descent and Stochastic gradient descent.
Back Prop Function
a.k.a The Backwards Pass
Our goal with back propagation is to update each of the weights in the network so that they cause the actual output to be closer the target output, thereby minimizing the error for each output neuron and the network as a whole. Back prop is a method to stop us from overfitting our model, so the model is more generalized.
Cost Derivative Function
So in gradient descent, you follow the negative of the gradient to the point where the cost is a minimum. If someone is talking about gradient descent in a machine learning context, the cost function is probably implied (it is the function to which you are applying the gradient descent algorithm).
Putting it all together – Network.py
Flask Digits Classifier