Imagine that we want to create a neural network that can recognize a cat versus a dog. Our brains has been trained to recognize a variety of patterns differentiating a cat and a dog -- furriness, size, bone structure, etc. We can replicate this same process with neural networks, which are based off the human brain. Specifically, we do so using Convolutional Neural Networks (CNNs), which can recognize these patterns and differentiate between them. Much like other aspects of neural networks, CNNs are based off human anatomy, specifically the visual cortex of the brain.
A CNN is a Deep Learning algorithm which can take in an input image, assign importance (learnable weights and biases) to various patterns in the image and be able to differentiate one from the other. A CNN is able to use much less pre-processing compared to other classification algorithms. While a traditional neural network has to hand-determine filters, a CNN is able to learn these filters by itself.
CNNs are most commonly used for image processing, classification, segmentation and also for other auto correlated data. For example, we applied CNNs to classify images as either cats or dogs.
This guide includes:
At first glance, it's not completely clear why we need a CNN versus a traditional neural network. Hypothetically, we could do the same processes just by flattening an image out into an array of its pixels, as shown below.
However, by doing this, we would lose all the spatial recognition between pixels -- for example, the relationship between the top left 1 and the middle 2. It would be virtually impossible to have good accuracy with all this spatial recognition lost by flattening the image into an array.
A CNN is able to succcessfully capture the Spatial dependencies by applying relevant filters. The architecture performs a better fit to the image dataset by reducing the number of parameters involved and allowing reusability of weights. Across the board, the network can understand the image in general much better.
A CNN usually has 3 main layers:
In a machine learning model, one convolutional layer is paired with one pooling layer. For example, in our cats and dogs demo, this is what the model structure looks like:
The convolutional layer is the backbone of a CNN and will extract features from an input image. Convolution preserves the relationship between pixels by learning image features using small squares of input data.
Convolution will learn these features through a mathematical operation that takes 2 inputs, specifically an image matrix from the original image and a filter.
A filter is essentially a matrix of learnable weights that is learned through backpropogation. A filter is able to slice through an image and map each section one by one, learning different portions one by one. For example, if a small filter is looking for a dark edge, each time a match is found, you output that onto another matrix. By improving the filter through back-propogation each time, you gradually train the filter to recognize deep patterns in the set of images.
Imagine, for example, that we have a 5x5 matrix with image pixel values of 0, 1 and a filter matrix 3 x 3 as shown below.
We would then perform convolution by multiplying the 5x5 matrix with a 3x3 filter at each section. This will output a feature map, which is depicted below as a pink matrix.
By convoluting the image with different filters, we can perform various operations such as edge detection, blur, and sharpen.
Filters are not the only trainable variable. Strides, which are the number of pixel sihfts over the input matrix, can have a noticeable effect on the results. For example, in the image above, you can see we move the matrix over one pixel at a time. As opposed to this, the image below shows how it may look if we moved the matrix over two pixels at a time.
Sometimes, the filter doesn't perfectly fit the input image. In this case, we usually have two options.
Both are acceptable ways of making the filter and the image compatible.
The Pooling Layer is able to reduce the spatial size of the convolved feature. This allows your program to decrease the computational power required to process the data, which is far more efficient. In addition, it can isolate the dominant features of an image that don't depend on rotation or position, which we always want with image recognition.
There are 2 main types of pooling: Max Pooling and Average Pooling.
Max Pooling will return the maxiumum value from the portion of the image covered by the filter. On the other hand, Average Pooling will return the avearge of all the values from the portion of the image covered by the filter.
In general, Max Pooling performs far better than Average Pooling. Max Pooling is essentially a noise suppressant, which will discard the irrelevant activations and reduce dimensionality. Average Pooling will simply perform dimenionality reduction as a noise suppressing mechanism. We don't need all this extra noise in our system, so we typically use Max Pooling.
We typically put the convolutional layer and its associated pooling layer together to form the i-th layer of a CNN. Depending on the complexities in the image, the number of layers may be increased to further capture low-level details, but this must be weighed against the computational power available.
At this point, the model is able to understand the features of a neural network. From here, we can flatten the final output and feed it to a regular neural network for classification, which we do using the Fully Connected Layer.
A Fully Connected layer is essentially a normal neural network layer. Our model is now able to recognize and understand the specific features in our image. The Fully Connected layer is a way to determine non-linear relationships between these features and assign importance to each of them during classification.
At this point, our input image has been converted to a suitable form for our Multi-Level Perception. Therefore, we can flatten this image into a column vector. After doing so, we can feed this vector to a neural network and apply backpropogation to ever iteration of training. By doing this over numerous epochs, the model can distinguish between dominating and low-level features in images.