
Training a Hand Detector like the OpenPose one in Tensorflow

In a previous article I presented the problem of Hand Recognition: the kinds of things we could do if we were able to detect hand movements, the different kinds of detection we can make, and some released products available out there. In case you didn't read it, you can find it on our blog.

For this second story, I'm going to jump into action and present the design, training and testing of a Hand Keypoint detector, a Neural Network able to produce results like the one shown in the heading image.

To accomplish this I'm going to replicate the hand detector used in OpenPose, as its code, the papers it is based on and its dataset are publicly available. It uses a 21-keypoint model for the hand: four points for each finger plus an additional one for the wrist.

What is the detector going to do

To keep things simple for this first approach to the problem, I'm going to accept some limitations: the detector will only work on square 368x368 images containing a single hand occupying most of the image area, so we don't have to deal with cropping or resizing.

We want the network to give us, if there's a hand in the image, the position of each of its 21 keypoints. But instead of doing this directly and outputting the exact position of each point, the network does something slightly different: it outputs the probability of each keypoint being at every position of the image.

This means the output is going to be a set of 21 maps, one for each keypoint, where each point of the map holds the probability of that keypoint being at that position of the image. These maps are called belief maps, as they express the belief that each keypoint is at a particular position of the image. We can view each of these maps as a heatmap, where the hottest point is the most likely location of the keypoint. Having these maps, we can easily recover the hand shape by joining the point with the highest value in each map.
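For example, assuming the belief maps come out as a NumPy array with one channel per keypoint, extracting the keypoint positions is just a per-channel argmax. This is a hedged sketch, not the actual OpenPose post-processing:

```python
import numpy as np

def keypoints_from_belief_maps(belief_maps):
    """Return (x, y, score) for each keypoint, taking the peak of its belief map.

    belief_maps: array of shape (height, width, num_keypoints),
    e.g. (46, 46, 21) for a 368x368 input downscaled by 8.
    """
    keypoints = []
    for k in range(belief_maps.shape[-1]):
        belief = belief_maps[:, :, k]
        # argmax over the flattened map, converted back to (row, col)
        y, x = np.unravel_index(np.argmax(belief), belief.shape)
        keypoints.append((x, y, belief[y, x]))
    return keypoints
```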

Heatmaps for each keypoint

Architecture of the Neural Network

OpenPose performs the hand keypoint detection in a similar way to what it does for the body pose, using an architecture called Convolutional Pose Machines (CPMs).

CPMs were presented in a 2016 paper by the Robotics Institute at Carnegie Mellon University. I think it is one of CMU's early contributions to the field, after which they built the Panoptic Studio, an impressive dome full of cameras able to capture human interactions from hundreds of different points of view, released datasets and even the famous OpenPose.

CPMs are Neural Networks good at detecting structured objects, objects that are actually a composition of different parts. For our problem, a hand is a composition of the keypoints we want to find: knuckles, finger joints and fingertips. These parts are movable, and a hand can take lots of different poses, but the position of each part is somehow related to the position of the others. Unless you did serious damage to your hand, the position of one finger is usually a strong prior for the position of another.

To take advantage of these dependencies between parts, CPMs consist of a sequence of stages: each stage produces a belief map for the position of each part, and each stage takes as input both the image features and the belief maps produced by the previous stage. The image features are, as is usually done in Deep Learning, the result of applying the first layers of an already trained network to the image. The first layers of a powerful network trained for image recognition on a huge dataset, like AlexNet, GoogLeNet or ResNet, do a good job of extracting features from the image, that is, converting the RGB image into a different feature space.

High level diagram of a CPM machine

It's important to note that the number of stages in the CPM is not related to the number of keypoints we want to detect. Each stage outputs the belief maps for all keypoints, 21 in our case. Adding more or fewer stages affects the accuracy and the execution time of the network, but it's a choice independent of the number of keypoints. As the paper says:

At each stage, the spatial context of part beliefs provide strong disambiguating cues to a subsequent stage. As a result, each stage of a CPM produces belief maps with increasingly refined estimates for the locations of each part.

The actual implementation of the OpenPose hand detector is presented in another paper, “Hand Keypoint Detection in Single Images using Multiview Bootstrapping”, also published by CMU. Although the paper is mostly about how they managed to generate labeled data for a hand pose dataset, it's there that the details of the network are given: how many stages, the convolutional layers used, etc.

The first layers of VGG-19 are used for feature extraction, together with three additional convolutional layers producing a 128-channel feature map. There are six stages, each made of 7 convolutional layers and taking as input the concatenation of the feature map and the belief maps of the previous stage (128 channels for features + 22 channels for belief maps = 150 channels). Note the belief maps have 22 channels and not 21, the number of keypoints in our model; this is because an extra belief map is added for the background, measuring, for each position, the probability that it belongs to the background and not to any keypoint. The authors found that this extra map helps the CPM stages better resolve confusions between parts and background, improving the detection accuracy.
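To make the stage structure concrete, here is a minimal TensorFlow sketch of the refinement stages. The channel counts (128 features + 22 belief maps = 150) and the layer count per stage follow the description above, but the layer names, kernel sizes and overall wiring are illustrative and do not match the actual graph:

```python
import tensorflow as tf

NUM_BELIEF_MAPS = 22   # 21 keypoints + 1 background map
NUM_STAGES = 6


def cpm_stage(features, prev_beliefs, stage_id):
    """One refinement stage: concatenate the 128-channel feature map with the
    previous stage's 22 belief maps (150 channels total) and run a stack of
    seven convolutions ending in one output channel per belief map."""
    x = tf.keras.layers.Concatenate(axis=-1)([features, prev_beliefs])
    for i in range(6):
        x = tf.keras.layers.Conv2D(128, 7, padding='same', activation='relu',
                                   name='stage%d_conv%d' % (stage_id, i + 1))(x)
    # Final 1x1 convolution maps down to the 22 belief maps, no activation
    return tf.keras.layers.Conv2D(NUM_BELIEF_MAPS, 1, padding='same',
                                  name='stage%d_beliefs' % stage_id)(x)


def build_cpm(features, first_stage_beliefs):
    """Chain the stages, collecting every stage's output so the loss can
    supervise all of them (see the loss discussion below)."""
    beliefs = first_stage_beliefs
    outputs = []
    for s in range(2, NUM_STAGES + 1):
        beliefs = cpm_stage(features, beliefs, s)
        outputs.append(beliefs)
    return outputs


# Example wiring, with shapes for a 368x368 input downscaled by a factor of 8
features = tf.keras.Input(shape=(46, 46, 128))
stage1_beliefs = tf.keras.Input(shape=(46, 46, NUM_BELIEF_MAPS))
model = tf.keras.Model([features, stage1_beliefs],
                       build_cpm(features, stage1_beliefs))
```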

You can have a look at the TensorFlow graph here. It's a big graph, but the same structure is repeated for every stage. The input image passes through layers conv1_1 to conv4_4, taken from the VGG-19 network, followed by the additional layers conv5_1, conv5_2 and conv5_3_CPM, the last of which produces the feature map. Each stage then starts with a concatenation node joining the conv5_3_CPM tensor with the output of the previous stage. The layers in each stage are named Mconv1_stageX to Mconv6_stageX, and the output of the last stage is finally renamed Openpose/out.

Since in the first VGG layers the input image passes through three max_pool layers with stride 2, the width and height are reduced by a factor of 8. For an input image of size w x h, the resulting belief maps have size w' x h', with w' = w / 8 and h' = h / 8. This means the belief maps will be smaller than the original image, so they have to be resized using bicubic resampling; it may sound like the name of a complex algorithm, but it's actually the same method you would use to resize any image.
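A small sketch of that resizing step, assuming the belief maps are a NumPy array with one channel per map, could be:

```python
import cv2
import numpy as np

def resize_belief_maps(belief_maps, target_size):
    """Upscale the (h/8, w/8, C) belief maps back to the original image size.

    target_size is (width, height), e.g. (368, 368). Each channel is resized
    independently with bicubic interpolation.
    """
    width, height = target_size
    channels = belief_maps.shape[-1]
    resized = np.zeros((height, width, channels), dtype=np.float32)
    for c in range(channels):
        resized[:, :, c] = cv2.resize(belief_maps[:, :, c].astype(np.float32),
                                      (width, height),
                                      interpolation=cv2.INTER_CUBIC)
    return resized
```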

As we are doing supervised learning, we train the network with a dataset of hand images and their labels: the belief maps for the locations of each of the parts. The loss function is defined to minimize the L2 distance between the predicted belief maps at each stage and the ideal ones. The authors argue that by training with a loss function defined on all stages instead of just the final one, the network is less likely to suffer from the vanishing gradients problem. Vanishing gradients is a problem large networks with many layers are prone to, as the magnitude of the back-propagated gradients decreases with the number of intermediate layers between the output layer and the input layer. Using the distance between each stage's output and the desired detection as the training loss, we feed training signal directly to the intermediate layers of the network, mitigating the vanishing gradients problem.
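As a sketch of this intermediate supervision, the loss below sums an L2 term over every stage's output; the tensor shapes are assumptions based on the architecture described above:

```python
import tensorflow as tf

def cpm_loss(stage_outputs, target_beliefs):
    """Sum over stages of the squared L2 distance between predicted and
    ground-truth belief maps, so every stage receives gradient directly.

    stage_outputs:  list of tensors, one per stage, shape (batch, h', w', 22)
    target_beliefs: tensor of shape (batch, h', w', 22)
    """
    # tf.nn.l2_loss returns half the sum of squares, which only rescales
    # the objective and does not change the minimizer.
    stage_losses = [tf.nn.l2_loss(out - target_beliefs) for out in stage_outputs]
    return tf.add_n(stage_losses)
```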

The desired belief maps are created by putting Gaussian peaks at the ground truth location of each keypoint.
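A sketch of how such ground-truth maps could be generated is shown below; the sigma value and the background-map convention (one minus the strongest keypoint response) are assumptions for illustration:

```python
import numpy as np

def make_target_maps(keypoints, map_size, sigma=1.0):
    """Build ground-truth belief maps: a 2-D Gaussian peak at each keypoint
    plus a background map that is high wherever no keypoint is.

    keypoints: list of (x, y) positions in belief-map coordinates
               (image coordinates divided by 8), or None if not visible.
    map_size:  (height, width) of the maps, e.g. (46, 46).
    sigma:     spread of the Gaussian peak, an illustrative value here.
    """
    height, width = map_size
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((height, width, len(keypoints) + 1), dtype=np.float32)
    for k, kp in enumerate(keypoints):
        if kp is None:
            continue
        x, y = kp
        maps[:, :, k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    # Background map: 1 minus the strongest keypoint response at each pixel
    maps[:, :, -1] = 1.0 - maps[:, :, :-1].max(axis=-1)
    return maps
```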

To make sure the architecture and the loss worked well together, before doing the real training I ran a short test training with only one image, plotting the loss at each step together with the detected belief maps. It can be seen how the detection gets better as the loss decreases. Of course, training with only one image is the epitome of overfitting: the network is not really learning, only reproducing the desired output “by heart”, but it served to check that everything was ready to proceed to the actual training.
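A minimal version of that sanity check could look like the snippet below; the model here is a tiny stand-in rather than the real CPM, and the data is random, just to show the overfit-one-sample idea:

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model: a few convolutions mapping a 368x368 RGB image to
# 46x46x22 belief maps. This is NOT the real CPM, only a quick sanity rig.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, strides=2, padding='same', activation='relu',
                           input_shape=(368, 368, 3)),
    tf.keras.layers.Conv2D(32, 3, strides=2, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(22, 3, strides=2, padding='same'),
])
model.compile(optimizer='adam', loss='mse')

# A single image and its target maps: the network can only memorize them
image = np.random.rand(1, 368, 368, 3).astype(np.float32)
target = np.random.rand(1, 46, 46, 22).astype(np.float32)

history = model.fit(image, target, epochs=50, verbose=0)
print(history.history['loss'][::10])  # the loss should fall steadily
```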

For each step, the loss and the belief maps, plotted together as a heatmap.

Training the network

Having defined the network architecture and the loss function, we now have to decide which dataset the network is going to be trained on. CMU has released three hand keypoint datasets on their Hand Database, which I think are the datasets used to train OpenPose; each one has a different source for its images.

The best option would of course be to train the network with all those images, but due to the limitations of this first approach, which only handles square images, I chose to use only the “Hands from Synthetic Data” dataset, and only the images in its ‘synth2’ and ‘synth3’ folders. The other folders contain bigger images with scenery, so I avoided them.

That leaves 5591 images for training, all of them the same 368x368 size, containing rendered views of a 3D hand model from different perspectives and in many possible poses.

Synthetic data is a powerful alternative for generating a dataset to train a neural network, which would otherwise be an expensive task, as it requires a lot of human work to correctly hand-label the ground truth. By using a 3D model, we can generate as many images as we want, from all possible angles and poses. Even better, we don't need a human to label the data: since we generated the image ourselves, we already know the position of each part in it. Despite its unlimited potential, if we train a detection model only on synthetic data we may find it doesn't perform well on real data, which is called the reality gap problem.

https://github.com/ildoonet/tf-pose-estimation

For the training code I customized an existing project, tf-pose-estimation. In that repository the author shows how to train the OpenPose body pose estimation algorithm in TensorFlow, using the original network architecture or alternative ones like MobileNet (a smaller and faster network). I forked it and adapted it to train the hand keypoint detector with the Synth dataset.

The training code is available at Hand Detector Train; there you will find all the information about the code and how to run it if you are interested in doing it yourself.

https://github.com/ortegatron/hand_detector_train

I trained the network on an Amazon p3.2xlarge instance, which has a Tesla V100 GPU. It took about 20 hours to reach epoch 22, at which point I stopped the training and froze the graph. Note that we are not splitting the data into training and test sets; we simply train on the whole dataset and stop the training when the loss feels small enough.

Testing the graph on an image taken from the dataset, the keypoints are detected very well. Note that in this heatmap we are also plotting the background belief map; that's why there are high white values over the whole image except for the small areas around the keypoints.
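To run such a test, the frozen graph can be loaded and a single image fed through it. This is a rough sketch: the .pb file name, the input tensor name and the pixel normalization are assumptions; only Openpose/out comes from the graph described above.

```python
import cv2
import numpy as np
import tensorflow as tf

# Assumed names: check the real graph (e.g. in TensorBoard) for the actual
# input placeholder; only 'Openpose/out' comes from the graph described above.
GRAPH_PATH = 'frozen_hand_graph.pb'
INPUT_TENSOR = 'image:0'
OUTPUT_TENSOR = 'Openpose/out:0'

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile(GRAPH_PATH, 'rb') as f:
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.compat.v1.import_graph_def(graph_def, name='')

# Preprocessing is an assumption: resize to 368x368 and scale to [0, 1]
image = cv2.imread('hand.jpg')
image = cv2.resize(image, (368, 368)).astype(np.float32) / 255.0

with tf.compat.v1.Session(graph=graph) as sess:
    beliefs = sess.run(OUTPUT_TENSOR, feed_dict={INPUT_TENSOR: image[None]})

print(beliefs.shape)  # expected (1, 46, 46, 22): 21 keypoints + background
```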

Looking at the belief maps produced at each stage, we can see how the result gets better the more stages it passes through. The first stage produces a very poor output and can't really distinguish the background, while later stages make a much better detection. The small difference between stages 2 to 6 is actually surprising, and I think it may be related to the small dataset we are using for training.

But the result is not as good when we test it on a real hand image, even one with a white background. The pinky finger is totally missed, although the other keypoints are correctly detected. It seems the network is suffering from the reality gap problem: it gets good results on synthetic data but not so good ones when we evaluate it on real data.

Conclusions

It seems clear that although the network managed to perform well on the training dataset, it doesn't do the same on real-world images.

To get better results, the network should be trained on a bigger dataset. After all, we trained on just a small portion (5,500 images) of the entire dataset available at CMU (30,000 images, containing both synthetic and real data). Training with those images won't be that hard anyway: we just need to extract the patch containing the hand and resize it to fit the network's input size, for example as sketched below.
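Here is a rough sketch of that preprocessing step; the square-crop strategy (side equal to the larger of the box's width and height) is just one illustrative choice:

```python
import cv2

def crop_hand_patch(image, bbox, target_size=368):
    """Cut a square patch around a hand bounding box and resize it to the
    network input size. bbox = (x, y, w, h) in pixel coordinates.
    """
    x, y, w, h = bbox
    side = max(w, h)
    cx, cy = x + w // 2, y + h // 2
    x0 = max(cx - side // 2, 0)
    y0 = max(cy - side // 2, 0)
    patch = image[y0:y0 + side, x0:x0 + side]
    return cv2.resize(patch, (target_size, target_size),
                      interpolation=cv2.INTER_CUBIC)
```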

Once we get better results by training on a bigger dataset, we can turn this into a working hand keypoint detector by adding some extra logic, like resizing the belief maps to the original image size and returning the most likely position of each keypoint. With that done, we could integrate it with a hand detector and get a result similar to what the hand_standalone detector does now, but using our own TensorFlow implementation without having to rely on OpenPose.

https://github.com/ortegatron/hand_standalone

So stay tuned for the next chapter of this Hand Detection series, and don't hesitate to contact us if you liked the post or have any problems running or understanding it.
