Deep dive into Convolutional Filters
Convolutional Neural Networks, or CNNs, are used to process images for a variety of tasks, including object detection, classification, and more. CNNs are built up of a few basic layers: convolutional and max pooling (or downsampling). These layers create a set of “features” that can be fed into fully connected (or dense) layers to find meaning in an image, allowing the dense layers to recognize the content of images. For instance, the image below contains a robot.
In this post we are going to dive into the convolution layer, and in particular into how filters work. Before we jump in, let us look at how data goes into a convolutional layer.
In machine learning we refer to this data as “tensors”. Most programmers would recognize these as multidimensional arrays. During training this is normally a 4D tensor. The highest dimension is a collection (or batch) of images. Below that we have a 3D tensor for each image, where the highest dimension is channels; think red, green, and blue. That leaves a 2D tensor, which is simply the pixels organized by width and height.
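As a quick sketch of those shapes, here is a hypothetical batch in NumPy. The batch size, image size, and channels-last layout are my own illustration (Keras/TensorFlow default to channels-last; other frameworks put channels first):

```python
import numpy as np

# A hypothetical training batch: 32 images of 64x64 pixels with 3 RGB channels,
# laid out channels-last as (batch, height, width, channels).
batch = np.zeros((32, 64, 64, 3))

print(batch.ndim)              # 4 - the whole batch is a 4D tensor
print(batch[0].shape)          # (64, 64, 3) - one image is a 3D tensor
print(batch[0][..., 0].shape)  # (64, 64) - one channel is a 2D tensor of pixels
```

Indexing into the batch peels off one dimension at a time, matching the batch → image → channel → pixels hierarchy described above.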
It is easier to train and to understand the concepts of CNNs if we pre-process our images to grayscale. This way we are only playing with 1 channel per image instead of the 3 normal RGB channels. We do lose accuracy with grayscale, so I wouldn’t recommend it for most production models.
At a high level, a convolution layer is a set of filters. These filters can be any square size, most commonly 3×3 or 5×5. The convolution layer sweeps these filters over the image. Depending on the values of the filter we can find things like vertical lines. Later in the network, filters reading the output of earlier filters can detect more complex shapes like eyes.
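To make the sweep concrete, here is a minimal sketch of the operation in plain NumPy. The `sweep` helper is my own illustration, not library code (strictly speaking it computes cross-correlation, which is what most deep learning libraries actually compute and call “convolution”):

```python
import numpy as np

# Slide a small filter over an image: at each position, multiply the
# filter element-wise with the pixels underneath and sum the result.
def sweep(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1  # output height (no padding)
    ow = image.shape[1] - kw + 1  # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A toy image with a vertical edge down the middle.
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1]], dtype=float)

# A filter that responds to vertical edges (the x Sobel filter, which
# we build again with OpenCV later in the post).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

print(sweep(img, sobel_x))  # every window straddles the edge, so all values are 4
```

Every 3×3 window of this toy image straddles the 0-to-1 edge, so the filter responds strongly everywhere; on a real photo only the windows sitting on an edge light up.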
These filters are nothing new in the world of computer vision. CNNs are essentially taking this old “classical” computer vision tool and figuring out the values of the filters through iterations of the network to find significances in the image.
The best way to understand a filter is to implement it using OpenCV. For those unfamiliar with the tool, OpenCV is a C++ library that implements a lot of computer vision algorithms. In this post we will use OpenCV’s Python wrapper and the image below to learn how filters work.
import matplotlib.pyplot as plt  # We'll use this to show images
import matplotlib.image as mpimg  # And this to load them
import cv2  # OpenCV
import numpy as np  # We'll use this to manage our data

# Load the example image (substitute the path to your own image)
image = mpimg.imread('image.jpg')
As mentioned, we are going to start by converting the image to grayscale, so let’s get that changed.
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
plt.imshow(gray, cmap='gray')
The filter we are going to build for this example is a Sobel filter. It is very commonly used to detect edges and to find patterns in busy images. We will have to do two passes, one for horizontal and one for vertical edges. Let us see how we might find lane markings using a filter.
# Create our filters
sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]])

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Apply the filters to the image
filtered_image = cv2.filter2D(gray, -1, sobel_y)
x_filtered_image = cv2.filter2D(gray, -1, sobel_x)
plt.imshow(x_filtered_image, cmap='gray')
Above you can see us using the x Sobel filter to find vertical edges. We don’t show the y Sobel here because, as you can see from the original photo, there are far more vertical lines, and for our use case horizontal lines aren’t much help in finding lane markings. A quick note: OpenCV does have a cv2.Sobel() method, but we want to see these filters for ourselves. The math happening here is simply an element-wise multiplication and sum that runs over each 3×3 set of pixels in the image. The windows overlap, and the same is true in a normal CNN. You can set a variable called ‘stride’ in most convolution layers that controls how far the filter moves, so instead of moving down 1 pixel at a time you might jump 2 pixels.
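As a small illustration of what stride does to the output, here is the usual no-padding output-size formula; the `output_size` helper is my own sketch, not part of any library:

```python
# With no padding, a filter of size f swept over an input of size n
# with stride s produces an output of size (n - f) // s + 1.
def output_size(in_size, filter_size, stride):
    return (in_size - filter_size) // stride + 1

# A 28x28 image with a 3x3 filter:
print(output_size(28, 3, 1))  # 26 - moving 1 pixel at a time
print(output_size(28, 3, 2))  # 13 - jumping 2 pixels roughly halves the output
```

A larger stride means fewer positions for the filter, so the feature map shrinks and the layer does less work.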
Filters can be larger in size. For example, let’s look at two 5×5 examples.
five_filter = np.array([[-1, 0, 0, 0, 1],
                        [-1, 0, 0, 0, 1],
                        [-2, 0, 0, 0, 2],
                        [-1, 0, 0, 0, 1],
                        [-1, 0, 0, 0, 1]])

five_filter_image = cv2.filter2D(gray, -1, five_filter)
plt.imshow(five_filter_image, cmap='gray')
Above you can see that by making the filter larger and spacing it out we got bolder lines. What if we amplify it?
amp_five_filter = np.array([[-2, -1, 0, 1, 2],
                            [-2, -1, 0, 1, 2],
                            [-3, -2, 0, 2, 3],
                            [-2, -1, 0, 1, 2],
                            [-2, -1, 0, 1, 2]])

five_filter_image = cv2.filter2D(gray, -1, amp_five_filter)
plt.imshow(five_filter_image, cmap='gray')
Well, this is probably no good; we got a lot of noise out of this image. However, at the end of the day, this is what a CNN is doing. You are in charge of setting the hyperparameters like the number of filters, the filter size, and the stride, while the CNN tries different values inside the filters to see what works best.
This is a very long process to get right, and I highly recommend using pre-trained weights, such as those trained on ImageNet, as a starting point for your CNNs. However, if you are building from scratch and can’t use pre-trained weights, you should initialize your filters deliberately. I prefer Xavier uniform, also known as Glorot uniform, as my starting point. Luckily, tools like Keras use Xavier uniform as the default weight initialization for convolution layers!
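To make that initialization concrete, here is a small NumPy sketch of Xavier (Glorot) uniform initialization. The `glorot_uniform` helper and the Keras-style kernel shape convention are my own illustration, not from the post:

```python
import numpy as np

# Xavier (Glorot) uniform draws weights from [-limit, limit], where
# limit = sqrt(6 / (fan_in + fan_out)). For a conv kernel shaped
# (height, width, in_channels, out_channels), Keras-style fans are:
#   fan_in  = height * width * in_channels
#   fan_out = height * width * out_channels
def glorot_uniform(shape, seed=0):
    kh, kw, in_ch, out_ch = shape
    fan_in = kh * kw * in_ch
    fan_out = kh * kw * out_ch
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    rng = np.random.default_rng(seed)
    return rng.uniform(-limit, limit, size=shape)

# 32 filters of size 3x3 over a single grayscale channel.
weights = glorot_uniform((3, 3, 1, 32))
print(weights.shape)  # (3, 3, 1, 32)
```

Keeping the weights in this range balances the variance of activations flowing forward and gradients flowing backward, which is why it makes a sensible starting point before training tunes the filter values.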
I hope you enjoyed this dive into the filters that build your CNN. It is crazy to think that some simple math can turn out useful data like this, although it does take a lot of filters to produce these results. For example, a simple CNN like VGG-16 has about 3,456 filters. It is quite amazing that our GPUs can train these networks so well.