From jderobot

Project Card

Project Name: Convolutional Neural Networks for Human Pose Estimation

Author: David Pascual Hernández [d.pascualhe@gmail.com]

Academic Year: 2017/2018

Degree: Computer Vision Master (URJC)

GitHub Repositories: 2017-tfm-david-pascual

Tags: Deep Learning, Tensorflow, Convolutional Pose Machines, Human Pose Estimation

State: Developing

Weeks 22-25: LSP + MPII datasets

In order to increase the number of samples available for future training and evaluation, the LSP and MPII datasets have been combined. A script has been developed that parses the labels of both datasets and stores them in a consistent structure. More specifically, the script performs the following tasks:

  1. Images and labels from both datasets are parsed.
  2. HDF5 files for train, validation and test datasets are initialized. The percentage of samples that each of these datasets contains is defined by the user. Each dataset is divided into two subsets: one contains the images of the humans labeled in MPII and LSP, whose size depends on the boxsize chosen by the user; the other contains the labels for those images. Each label is composed of the following fields:
    • dataset: LSP or MPII.
    • fname: original image file name. It works as an identifier of each sample.
    • scale: size of the human with respect to the image.
    • center: center coordinates of the human in the image (x, y).
    • joints: each joint coordinates (x, y).
    • headsize: size of the diagonal within the head bounding box of the human in the image. It is later used for evaluation.
  3. The samples and labels previously parsed are shuffled and stored in those HDF5 datasets.
  4. HDF5 files are saved.
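The packing steps above can be sketched with h5py roughly like this (array shapes, dataset names and the split logic are illustrative stand-ins, not the actual script's code):

```python
import numpy as np
import h5py

boxsize = 192
rng = np.random.default_rng(0)

# Dummy parsed data standing in for the real LSP + MPII samples
images = rng.integers(0, 255, size=(10, boxsize, boxsize, 3), dtype=np.uint8)
joints = rng.random((10, 14, 2)).astype(np.float32)  # (x, y) per joint

# Shuffle samples and labels together before splitting (step 3)
order = rng.permutation(len(images))
images, joints = images[order], joints[order]

# User-defined split percentages
splits = {"train": 0.7, "val": 0.2, "test": 0.1}
start = 0
for name, frac in splits.items():
    end = start + int(len(images) * frac)
    with h5py.File("%s.h5" % name, "w") as f:   # step 4: save HDF5 files
        f.create_dataset("images", data=images[start:end])
        f.create_dataset("joints", data=joints[start:end])
    start = end
```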

The new HDF5 files are much lighter than the images folder + annotations provided by MPII and LSP. Besides that, the evaluation is faster now because the datasets contain ready-to-go images, i.e. humans cropped from the image with the appropriate boxsize, instead of the original ones. Hopefully, this will also decrease computational cost during training.

LSP headsize

As mentioned before, one of the attributes stored for each sample is its headsize. This parameter is required for computing the PCKh metric and is already provided in the MPII annotations. However, LSP images don't have an associated headsize. To solve this issue, the heads present in the LSP images have been annotated. This task has been accomplished with the on-line annotation tool LabelMe.

After annotating every image, the resulting labels can be downloaded in .xml files. These files are then parsed and stored in a NumPy .npy file which is read later to build our datasets.
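The parsing step can be sketched as below, computing the headsize as the diagonal of each annotated head bounding box. This is a minimal sketch: it assumes one head polygon per XML file and follows LabelMe's polygon/point tag layout; file and directory names are hypothetical.

```python
import glob
import math
import xml.etree.ElementTree as ET

import numpy as np

def head_diagonal(xml_path):
    """Return the diagonal of the head bounding box in one annotation."""
    root = ET.parse(xml_path).getroot()
    # LabelMe stores each annotated region as a polygon of (x, y) points
    pts = [(int(pt.find("x").text), int(pt.find("y").text))
           for pt in root.iter("pt")]
    xs, ys = zip(*pts)
    return math.hypot(max(xs) - min(xs), max(ys) - min(ys))

# Parse every downloaded annotation and store the result for later use
headsizes = {p: head_diagonal(p) for p in sorted(glob.glob("lsp_heads/*.xml"))}
np.save("lsp_headsizes.npy", headsizes)
```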

Weeks 20 & 21: Evaluation methodology

Now that we have integrated different CPM implementations, we must define an evaluation methodology that allows not just the qualitative comparisons we've already made, but quantitative ones. This methodology must enable fast experimentation in order to analyze the effect that different techniques, architectures, parameters and learning processes have on CPM performance. It must also be compatible with previous work (same metrics, same datasets).


According to the original CPMs paper and other state-of-the-art works, the most popular datasets for human pose estimation are the following:

  • Max Planck Institut Informatik (MPII) Human Pose: Includes around 20k images from YouTube containing more than 40k humans. Annotations include human center and scale within the image, 16 body joint coordinates, occlusions and activities.
  • Leeds Sports Pose (LSP): Includes 2k images from Flickr, each containing one human performing sports. Annotations include 14 body joint coordinates. It does not provide any information about human scale or location, as subjects are already roughly centered in the images.
  • Frames Labeled in Cinema (FLIC): Includes 5k images from popular Hollywood films containing one human each. Annotations are only provided for upper body parts.

Because of its number of samples, as well as its exhaustive annotation, we're going to start working with the MPII Human Pose dataset. Besides that, it is the main dataset used for training in the original paper (although the authors also train a model with the MPII and LSP datasets combined).


Most used metrics for evaluating models on the datasets that have just been mentioned are:

  • Percentage of Correct Parts (PCP): a body part (limb) is considered correctly detected if its segment endpoints lie within half of the length of the ground-truth segment from their annotated location (Ferrari et al., 2008). It is used when the algorithm estimates body limbs instead of joints.
  • Percentage of Correct Keypoints (PCK): a keypoint (joint) is considered correctly detected if it lies within a certain distance from its annotated location (Yang and Ramanan, 2013). That distance is usually related to the size of the human in the image.
  • PCKh: it's the same metric as PCK, but setting the acceptance distance to half of the length of the subject's head. This is the evaluation measure used in the MPII Human Pose dataset, so it's the one that we're going to use as well.
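The PCKh metric just described can be condensed into a few lines (array shapes here are illustrative; a joint counts as correct if its prediction falls within half the head size of the ground truth):

```python
import numpy as np

def pckh(pred, gt, headsize, alpha=0.5):
    """pred, gt: (n_samples, n_joints, 2) arrays of (x, y); headsize: (n_samples,)."""
    dists = np.linalg.norm(pred - gt, axis=-1)       # (n_samples, n_joints)
    correct = dists <= alpha * headsize[:, None]     # per-sample threshold
    return correct.mean(axis=0)                      # per-joint PCKh

gt = np.zeros((1, 2, 2))
pred = np.array([[[1.0, 0.0], [10.0, 0.0]]])         # 1 px and 10 px off
print(pckh(pred, gt, headsize=np.array([4.0])))      # -> [1. 0.]
```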


Aided by the original CPMs paper and its repo, the dataset and evaluation toolkit provided by MPII, and the eval-mpii-pose repo by @anibali, I've managed to evaluate the Caffe model with the following procedure:

  1. Reading annotations. The most direct option would be parsing the annotations provided on the MPII dataset webpage, as samples are divided into train and test. However, I have found out that test samples don't have joint location annotations, because MPII withholds them for official evaluation. The only option left is to divide the train samples into train/validation/test subsets. This has already been done, and HDF5 files of the mentioned subsets are provided in the eval-mpii-pose repo.
  2. Preprocessing. For this task I have revisited the original paper and repo. In a nutshell, each human in the image is cropped, based on their location and scale, into a resulting image of size boxsize x boxsize.
  3. Estimation. It is performed with the human pose estimation model.
  4. Storing results. The human pose estimation model regresses the coordinates for 14 body joints. These 14 coordinates are reordered to match MPII evaluation specifications and then we store them in an HDF5 file.
  5. Evaluation. Once all the predictions have been stored in the HDF5 file, the evaluation is performed by a Matlab script provided in eval-mpii-pose, which is based on the MPII evaluation toolkit. It computes and stores the PCKh metric for each joint and plots the results.
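The preprocessing in step 2 can be sketched as follows. This is a NumPy-only approximation (the real pipeline uses OpenCV resizing); the only dataset fact assumed is MPII's convention that `scale` is person height in units of 200 px, and the exact resize/crop policy is a guess, not the original repo's code:

```python
import numpy as np

def crop_person(img, center, scale, boxsize=368):
    # Resize (nearest neighbour) so the person roughly fills the box
    factor = boxsize / (scale * 200.0)
    h, w = img.shape[:2]
    rows = (np.arange(int(h * factor)) / factor).astype(int)
    cols = (np.arange(int(w * factor)) / factor).astype(int)
    resized = img[rows][:, cols]
    # Crop a boxsize x boxsize window around the annotated center
    cx, cy = int(center[0] * factor), int(center[1] * factor)
    half = boxsize // 2
    # Zero-pad so crops near the border keep the full box size
    padded = np.pad(resized, ((half, half), (half, half), (0, 0)))
    return padded[cy:cy + boxsize, cx:cx + boxsize]

img = np.zeros((480, 640, 3), np.uint8)
crop = crop_person(img, center=(320, 240), scale=1.2)
print(crop.shape)  # (368, 368, 3)
```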

In the following images, Caffe model results for the validation subset using different boxsizes are shown.

I'm having trouble with TensorFlow, so whenever I manage to solve it I will evaluate the TF model as well and properly discuss the results obtained. Regarding the Caffe results, they're slightly worse than the ones reported in the original paper, which could be explained by a refinement process the authors used. It's also important to remember that the samples we have used for evaluation have probably been seen by the model during training, so that's a problem we have to solve.

Week 19: Extended TensorFlow implementation

A few weeks ago we found another repo with a CPMs' implementation in TensorFlow:

What makes this repo interesting is that it extends the official release with new visualizations, models trained for hand pose estimation and Kalman filters. It's important to note that it doesn't include any model for human detection, which forces the user to provide input images with properly centered and sized humans (or hands).


The test scripts available in the repo let you choose between three different visualization modes during live execution:

  • Single: just input image with limbs drawn over it.
  • HM: it displays the heatmap of each joint after the last stage. We've already obtained very similar pictures.
  • MULTI: combined heatmaps for each body joint between stages are shown. Displaying these heatmaps is really useful to understand how CPMs get a more refined estimation after each stage. An example can be seen in the image below.

Hand pose estimation

The CPMs architecture can be applied to estimate the joints of any articulated object, as long as its parts are spatially dependent on each other. Following that line of thought, a model trained with hands and a script to test it are provided within this repo. Live hand pose estimation is shown in the following picture:

Kalman filters

Probably the main improvement with respect to the original release is the introduction of Kalman filtering on the output of the pose estimation model. Kalman filtering is an algorithm that fuses information about the state of a system and (possibly) its environment and produces an estimation for the next state. It basically combines the system's expected and measured values (e.g. position and velocity) with their uncertainties. That fusion of information allows inferring the current state more reliably than simply using the measurements, especially when they're noisy. Kalman filtering has been applied for smoothing the output of laptop trackpads, positioning systems, etc. It makes sense to apply this technique to CPMs output, as it is usually very noisy, with trembling joint locations. When Kalman filtering is applied, that effect is removed to a large extent. Besides that, the algorithm for Kalman filtering is really fast, so it can be used in real-time applications like ours.
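The smoothing idea can be illustrated with a toy 1D constant-position Kalman filter applied to a noisy joint coordinate (the repo's actual filter and its parameters are not reproduced here; `q` and `r` below are made-up values):

```python
import numpy as np

def kalman_1d(measurements, q=1e-3, r=1.0):
    """q: process noise, r: measurement noise. Returns smoothed values."""
    x, p = measurements[0], 1.0        # initial state estimate and variance
    out = []
    for z in measurements:
        p = p + q                      # predict: uncertainty grows
        k = p / (p + r)                # Kalman gain
        x = x + k * (z - x)            # update with measurement z
        p = (1 - k) * p
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(0)
noisy_x = 100 + rng.normal(0, 3, size=50)   # trembling joint x-coordinate
smooth_x = kalman_1d(noisy_x)
print(smooth_x.std() < noisy_x.std())       # trembling is largely removed
```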


Inference time when 192 px boxsizes are evaluated is about 60 ms, although it should be noted that it's only doing pose estimation (not human detection) and evaluating one human per frame. Inference times for the previous Caffe and TensorFlow implementations, under the same restrictions, are 40 ms and 100 ms, respectively. That means the new repo is a bit slower than the original Caffe release, but significantly improves on the TensorFlow implementation we're already using. After taking a look at both TensorFlow repos' code, I've not yet been able to spot what makes the difference.

Weeks 14-18: Tensorflow & Caffe working with GPU - Comparison

These weeks I have finally integrated the CPMs TensorFlow implementation. Now the humanpose component can estimate poses with both frameworks, Caffe and Tensorflow. Switching between frameworks is as easy as changing the Framework parameter in the brand new humanpose.yml configuration file. The configuration file format has been changed to YAML to keep up with the latest JdeRobot updates. This change only affects the Camera object code, which now depends on the comm and config libraries (installed along with JdeRobot). These libraries provide a new level of abstraction, avoiding the need to use Ice directly to establish communication with the drivers. Besides the framework to use, a bunch of parameters shared between Caffe and TensorFlow (boxsize, limb colors...), as well as the path to each model, are specified in the YAML file.
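For reference, a hypothetical humanpose.yml could look like this (every field name below is an illustrative guess based on the parameters mentioned above, not the component's actual schema):

```yaml
Framework: caffe              # or "tensorflow"
boxsize: 192
limb_colors: [[0, 0, 255], [0, 255, 0], [255, 0, 0]]
caffe_models:
  human: models/caffe/person_net.caffemodel
  pose: models/caffe/pose_net.caffemodel
tf_models:
  human: models/tf/person_net.ckpt
  pose: models/tf/pose_net.ckpt
```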

Another big step forward that has been taken in the past weeks is enabling CUDA based acceleration for both frameworks. I have also upgraded my hardware. Current hardware and software specifications:

  • Laptop: Intel Core i7-7700HQ @ 2.80GHz; NVIDIA GeForce GTX-1050.
  • CUDA: v8.0.
  • CuDNN: v7.

Before moving on to solve a real problem with the acquired knowledge, it's worth making a comparison of performance and qualitative results between the integrated models. The following test has been carried out:

  • Both models have been tested against the first ten seconds of the following video: McEwen Spin-O-Rama to the Button - 2015 World Financial Group Continental Cup of Curling. At 30 fps, the number of frames goes up to 300.
  • CPU and GPU accelerated inferences have been evaluated.
  • Each model has been tested out using four different boxsizes: 96, 128, 192, 320.
  • For each of these 2x2x4 = 16 tests, I have stored, for each of the 300 frames, the inference times of:
    • Human detector model.
    • Pose estimation model.
    • Total time. It includes human and pose inference times, as well as the time it takes to process the images and coordinates before, between and after them.
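A minimal version of such a timing harness looks like this (the two model calls are stubbed out here; only the measurement pattern is the point):

```python
import time

import numpy as np

def timed(fn, *args):
    """Run fn and return its result plus elapsed wall-clock time in ms."""
    t0 = time.time()
    out = fn(*args)
    return out, (time.time() - t0) * 1000.0

def detect_human(frame):               # stand-in for the human detector
    return (frame.shape[1] // 2, frame.shape[0] // 2)

def estimate_pose(frame, center):      # stand-in for the pose estimator
    return np.zeros((14, 2))

det_ms, pose_ms, total_ms = [], [], []
for _ in range(10):                    # 300 frames in the real test
    frame = np.zeros((360, 640, 3), np.uint8)
    t0 = time.time()
    center, t_det = timed(detect_human, frame)
    joints, t_pose = timed(estimate_pose, frame, center)
    det_ms.append(t_det)
    pose_ms.append(t_pose)
    total_ms.append((time.time() - t0) * 1000.0)  # includes overhead

print("detector %.1f ms | pose %.1f ms | total %.1f ms"
      % (np.mean(det_ms), np.mean(pose_ms), np.mean(total_ms)))
```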

Performance comparison

In terms of performance, the Caffe model (remember, the original release) is doing slightly better than its sibling TensorFlow implementation. In the following figure, the average times for human detection, pose estimation and full prediction depending on the boxsize are shown.

And here are the tabulated results for the same tests.

Human detection times (ms)

                    96 px   128 px   192 px   320 px
CPU - TensorFlow      215      378      846     2385
CPU - Caffe           328      559     1230     3378
GPU - TensorFlow       34       40       60      144
GPU - Caffe            23       28       50      153

Pose estimation times (ms)

                    96 px   128 px   192 px   320 px
CPU - TensorFlow      315      588     1335     4002
CPU - Caffe           270      451     1028     3058
GPU - TensorFlow       71       94      133      312
GPU - Caffe            26       33       48      156

Full inference times (ms)

                    96 px   128 px   192 px   320 px
CPU - TensorFlow      473      944     1841     5659
CPU - Caffe           580     1030     2056     6039
GPU - TensorFlow      119      165      204      489
GPU - Caffe            73       94      129      368

After taking a look at the results, the first thing that stands out is the great difference between CPU and GPU accelerated inference. In the case of TensorFlow, using CUDA and CuDNN makes the complete inference around 10 times faster, while the Caffe model makes predictions 15 times faster. It's worth noting that while the TensorFlow model is slightly faster than the Caffe one without GPU acceleration, Caffe performs better when the GPU is used, specifically around 1.5 times faster. For both frameworks, pose estimation generally takes longer than human detection. If we sum both times and compare them with the full inference times, we see that a little overhead is introduced when processing frames, drawing limbs and so on, but it doesn't seem worrying, at least for now. In a nutshell, we get a great improvement with the GPU, and Caffe performs a little faster than TensorFlow. With a boxsize of 192 px, which gives nice qualitative results, the Caffe model can make pose estimations at about 7-10 frames per second.

Qualitative results

Now let's take a look at the estimated poses. In the following video, comparisons between the Caffe and TensorFlow models (with GPU and boxsize = 192 px) and between different boxsizes (TensorFlow with GPU) are shown. Needless to say, the framerate has been adjusted to get a natural video and does not represent real inference times.

As can be seen in the video, it's difficult to appreciate differences between the poses estimated by the two models. Maybe it's too risky to draw any conclusion without performing a quantitative analysis, but it seems like they have been trained similarly. With regard to the different boxsizes, it's pretty obvious that bigger boxes lead to better results. A good trade-off between inference time and results is reached with a 192 px boxsize.

Weeks 12 & 13: TensorFlow model (I)

In order to adapt the CPMs TensorFlow version (https://github.com/psycharo/cpm), I've been learning TensorFlow basics with the tutorials provided on its webpage. The code is going to follow the same structure as the Caffe version I've already built and will be divided into human detection and pose estimation classes. At the moment, I've coded the human_detector class and I'm working on pose estimation. Once the code is finished, the TensorFlow model will be integrated into the humanpose component, which will be able to use both models interchangeably for pose estimation.

Weeks 8-11: Influence of image size

In order to get closer to real-time prediction, we've been testing how image size influences both execution time and performance. We can experiment with different image sizes by tuning the boxsize parameter, defined in the CPMs configuration file. Before testing different box sizes, I have analyzed how the sample that goes through the model changes until the pose estimation is reached, and how that depends on boxsize.

  1. Original image is resized according to boxsize.
  2. Human detector is fed with the resized image.
  3. The human detector outputs a heatmap eight times smaller than its input, because of the stride of its pooling and convolutional layers.
  4. The heatmap is resized back to the size of the image that fed the human detector.
  5. With the human coordinates obtained from the heatmap, a square region of size boxsize x boxsize is cropped.
  6. Human box is fed to the pose estimator.
  7. Pose estimator outputs joint coordinates over an image eight times smaller than its input.
  8. Finally, these joint coordinates are transformed to fit full size image.
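The coordinate bookkeeping in the steps above boils down to two scale factors: the resize from the original image to the boxsize, and the 8x downsampling inside the networks. A sketch with made-up numbers:

```python
boxsize = 192
orig_w, orig_h = 640, 360

# Step 1: resize factor from the original image to the network input
scale = boxsize / orig_h                      # 0.533...

# Steps 3-4: the detector heatmap is 8x smaller than its input, so a
# peak at heatmap cell (hx, hy) maps back by multiplying by 8
hx, hy = 14, 12
cx_in, cy_in = hx * 8, hy * 8                 # coords in the resized image

# Step 8: undo the initial resize to get full-image coordinates
cx, cy = cx_in / scale, cy_in / scale
print(round(cx), round(cy))                   # -> 210 180
```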

All of these steps can be seen in the following diagram:

Originally, boxsize was equal to 384. These are the results obtained with different box sizes:

Boxsize   Human detection (s)   Pose estimation (s)   Total (s)
384                     19.43                 13.08       32.51
192                      4.81                  3.04        7.65
128                      2.31                  1.43        3.74
92                       1.21                     -           -

As can be seen, when we reduce the size we get a very significant speed-up, but predictions become less accurate or even non-existent. A good trade-off is reached with a 192x192 boxsize: predictions are about 4x faster and still pretty accurate.

Week 7: HumanPose component (v0)

Last week I had trouble matching JdeRobot, Caffe and OpenCV dependencies. Finally, I've managed to build a stable environment.

  1. I installed JdeRobot from Debian packages. As ROS is automatically installed as a JdeRobot dependency, OpenCV gets installed automatically too. Last time I tried to build JdeRobot, I had already built OpenCV from source, which caused some issues.
  2. With JdeRobot and, consequently, OpenCV installed, I followed this tutorial to install Caffe library: Installing Caffe on Ubuntu (CPU-ONLY)

And that's it! Undoubtedly, much easier and cleaner than last time.

I have restructured my repo in order to develop the humanpose component, which will be based on last year's digitclassifier. Currently, the workload is divided into the following threads:

  • Camera: it is responsible for capturing live video.
  • GUI: it displays both live video and estimation results.
  • Estimator: it loads the models when the component is executed. During execution, it gets a frame from Camera and estimates human pose. Then, estimation results are sent to GUI in order to visualize it. When the current estimation is finished, Estimator gets a new frame from Camera and the process starts again.
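The three-thread layout above can be sketched in a stripped-down form (the real component uses JdeRobot's comm/config libraries and a GUI toolkit; plain queues stand in for them here):

```python
import queue
import threading
import time

frames = queue.Queue(maxsize=1)   # Camera -> Estimator (latest frame only)
results = queue.Queue()           # Estimator -> GUI

def camera():
    for i in range(5):            # pretend to grab 5 frames
        try:
            frames.put_nowait("frame-%d" % i)
        except queue.Full:        # Estimator busy: drop the frame
            pass
        time.sleep(0.01)
    frames.put(None)              # sentinel: capture finished

def estimator():
    while True:
        frame = frames.get()      # take the latest available frame
        if frame is None:
            break
        time.sleep(0.02)          # pretend inference is slow
        results.put(frame + ":pose")

threads = [threading.Thread(target=camera), threading.Thread(target=estimator)]
for t in threads:
    t.start()
for t in threads:
    t.join()

poses = []                        # GUI side: consume new estimations
while not results.empty():
    poses.append(results.get())
print(len(poses), "poses estimated")
```

Dropping frames while the Estimator is busy is what lets the GUI keep showing live video instead of blocking on every inference.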

The idea is to show live video during execution and update estimation results only when the Estimator throws new results. I'm currently working on it and have uploaded a first version to my repo, but it gets stuck while an estimation is being made, and the GUI only updates when that estimation is done. In order to get faster results, I plan to reduce frame resolution. Meanwhile, here's a screenshot taken while running the current version:

Weeks 5 & 6: Caffe implementation

These weeks I've followed the Jupyter notebook that shows how to use CPMs models trained with Caffe. I've divided the workload in two classes: PersonDetector and PoseEstimator.

  • PersonDetector is fed with an image containing one or more humans and returns a heatmap. Input images must have a size that is a multiple of eight because of the model architecture, so they are padded with zeros to meet this requirement. The returned heatmap has greater intensity values where the probability of finding a human is higher, as shown in the following image.
In order to discard false detections, a maximum filter and a threshold are applied, resulting in the human center coordinates. Once the humans have been detected, the heatmap is resized back to the original image size, which is 8 times bigger.
  • PoseEstimator first crops every human found in the original image to fit the model input. The inputs of the pose estimation model are these cropped images and a 2D Gaussian.
PoseEstimator outputs 14 heatmaps, each containing the probability of finding one joint of the human body (ankles, shoulders, head, hips...). Another heatmap containing the probability of all joints together is also returned.
Finding the coordinates of each joint is as easy as finding the maximum values in those heatmaps. The resulting points can be linked together and drawn over the original image after being resized, as shown in the next image.
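PersonDetector's post-processing (a maximum filter plus a threshold) can be sketched as below; the kernel size and threshold values are illustrative, not the ones used in the component:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def find_centers(heatmap, threshold=0.5, size=3):
    # A pixel is a local peak if it equals the max of its neighbourhood
    peaks = (heatmap == maximum_filter(heatmap, size=size))
    ys, xs = np.nonzero(peaks & (heatmap > threshold))
    return [(int(x), int(y)) for x, y in zip(xs, ys)]

heatmap = np.zeros((45, 80))      # detector output, 8x smaller than the input
heatmap[20, 30] = 0.9             # one strong detection
heatmap[10, 10] = 0.2             # discarded by the threshold
print(find_centers(heatmap))      # -> [(30, 20)]
```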

As we want to build a JdeRobot component that can predict with both TensorFlow and Caffe models, I have wrapped this whole procedure in another Python class, which will be loaded when the component starts; then, every time a new frame is captured, it will go through the prediction method.

The whole process is far from real-time, taking between 30 and 50 seconds for images with 640x360 resolution on an i5 processor without GPU acceleration.

I'm currently having issues installing JdeRobot with OpenCV and Caffe, so the next step will be setting up a stable environment in order to build the first version of the pose estimation component. Because of that, some of the pictures are missing. Whenever I manage to get Caffe working again I will upload them. Besides that, the TensorFlow version of the CPMs has to be explored and adapted as well.

Weeks 3 & 4: Caffe installation

I've had trouble trying to install Caffe and its Python bindings. After going through a lot of tutorials, I managed to install it along with Anaconda for Python 2. I followed the instructions available in this GitHub gist: CaffeInstallation.md by @arundasan91.

Besides that, I have cloned both Caffe and TensorFlow repos and I'm trying to build a simple script for testing each one. These scripts are based on the Jupyter Notebooks provided in the repos. When I am able to feed forward samples through the CPMs and I fully understand how they work, we're going to build a JdeRobot component that can feed both models indistinctly from a webcam or a video file.

Week 2: CPMs repos

This second week, we've tried to reproduce the results obtained in the CPMs paper. Two main approaches have been found:

  • Official release: this repo contains scripts for training and testing CPMs, as well as access to already trained models, both for Matlab and Python (Caffe). As we're currently working with Python, our main interest is to test the model built with Caffe, as shown in their iPython notebook. I've struggled with Caffe installation, but hopefully it will be solved next week.
  • Tensorflow version (not official): it contains a single file where the CPM model is implemented with TensorFlow and an iPython notebook that explains how to test it. It also provides pretrained models, but we're not sure how these models have been trained. I've been able to execute the code from the notebook and make predictions on some images, but they look really messy, as if more than one person was detected.

Next week, we're going to install Caffe properly and learn the TensorFlow basics in order to test both implementations and compare the results obtained.

Week 1: Literature Review

Convolutional Pose Machines (Wei et al.)

Convolutional pose machines (CPMs) try to address the human pose estimation problem using convolutional neural networks (CNNs) and 2D images. They inherit their architecture from the previously released pose machines. CPMs are formed by a sequence of CNNs that produce belief maps for the location of each part of the human body (ankle, elbow, head...). They're multistage: the image features and the belief maps generated in the previous stage are used as input to the following one. At each stage, the estimated locations of each part become more refined.

The first CPM stage predicts part beliefs from local image evidence only, as its receptive field is just a small patch of the original image. In subsequent stages, the effective receptive field gets bigger and bigger, allowing CPMs to learn complex and long-range dependencies between parts. In that way, the belief maps of the easier parts make detecting the challenging ones easier. Large receptive fields can be achieved by:

  • Pooling, at the cost of lower precision.
  • Using larger kernels, which increases the number of parameters of the model.
  • Adding more convolutional layers, at the risk of facing the vanishing gradients problem.

In the CPMs paper, this last approach is implemented. To solve the vanishing gradients problem, intermediate supervision is applied after every stage, using the L2 norm as the loss function. The ideal belief maps, against which the ones generated by the CNN are compared, are built synthetically with Gaussian peaks at the ground-truth part locations. The overall objective function is minimized using stochastic gradient descent to jointly train all stages.
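Such an ideal belief map is straightforward to build, for example (map size and sigma are illustrative, not the paper's values):

```python
import numpy as np

def gaussian_map(h, w, cx, cy, sigma=1.0):
    """Ideal belief map: a Gaussian peak at the ground-truth location."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

belief = gaussian_map(46, 46, cx=23, cy=11)   # peak value 1.0 at (23, 11)

# Intermediate supervision compares a stage's predicted belief map with
# the ideal one via the L2 norm (a zero prediction here, for illustration)
pred = np.zeros_like(belief)
loss = np.sum((pred - belief) ** 2)
```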

The source code is available at: convolutional-pose-machines-release. CPMs were originally built with Caffe, but they have been ported to TensorFlow.

Other papers & repos