
Project Card

Project Name: CNN using TensorFlow

Author: Nacho Condés

Academic Year: 2017/2018

Degree: Degree in Telecommunications Systems Engineering

GitHub Repositories: TFG-Nacho_Condes / dl-digitclassifier / dl-objectdetector

Tags: Deep Learning, TensorFlow

State: Finished

Project videos

FollowPerson (neural detection + robotic behavior)

Turtlebot2 robot PTZ Camera




Smooth response following a single person (mom)

The latest work has gone into making the system capable of following one specific person (mom). This requires a smooth output from the neural network, so that the position of each detected person is accurate. To that end, we have added a person tracker to the system, which suppresses the effect of false negatives and false positives in the detection process.

From now on, we base our control decisions on the tracked person rather than on the detected persons (the raw output of the neural network, which has a spurious component).

After person detection and tracking, we repeat the same process for faces: we look for a face inside each detected person. If a face is found (after a similar facial tracking process), it is fed into another neural network: a siamese network, which compares the detected face (after a whitening process) with the face of the person to follow (mom).

This siamese network (which feeds both inputs/patches forward through the same structure and weights) computes the distance, in feature space, between the two faces. If it is lower than a threshold, we can infer that the detected face (and thus the person it belongs to) is mom, so we can refresh her position and move the robot towards her.
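The thresholding step can be sketched in a few lines. This is only an illustration, not the code of the project: it assumes the siamese network has already turned each face into a feature vector, it assumes a Euclidean distance, and the vectors and the threshold value are made up.

```python
import math


def embedding_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_same_person(candidate, reference, threshold):
    """True when the candidate face is close enough to the reference face."""
    return embedding_distance(candidate, reference) < threshold


# Hypothetical feature vectors (real ones would come from the network).
mom_face = [0.10, 0.80, 0.30]
close_face = [0.12, 0.79, 0.33]
other_face = [0.90, 0.10, 0.70]
THRESHOLD = 0.5  # hypothetical value, it has to be tuned experimentally

print(is_same_person(close_face, mom_face, THRESHOLD))
print(is_same_person(other_face, mom_face, THRESHOLD))
```

A lower threshold makes the identification stricter (fewer strangers accepted as mom, but more missed identifications), so it is a parameter worth tuning on real footage.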

Note that a successful face detection and identification is not needed on every frame, as the person tracker can infer whether a tracked individual is the same one we saw in the previous frame. So, if we have identified mom in the past, we can keep following her even if we cannot see her face again: her current position is known thanks to the tracker. Thus, we keep following that person until we lose her or until we find another person whose face matches mom's.

A functional video:

Followperson + turtlebot

This week, I followed some ROS tutorials (up to number 12) to gain basic knowledge of its core concepts (nodes, messages, topics, etc.). This has been useful, as I have used it in my latest achievement.

I have worked on the PTZ following node to evolve it: it now supports a Turtlebot (image below), which can follow a person with the help of an ASUS Xtion Pro Live sensor (also in the image below). This sensor is equipped with two cameras: an RGB one and an infrared one, which measures depth and produces a distance image.


Asus Xtion

The main difference is that all the subprocesses are now controlled by ROS: the Xtion camera works through a ROS package (openni2). JdeRobot has an ICE equivalent, but it is considerably heavier (it needs more CPU to translate the sensor data into images). The ROS variant, in addition to being lighter, publishes this information on separate topics (/camera/rgb for the RGB image, and /camera/depth/ and /camera/depth_registered/ for the depth images).

Note that the two sensors are slightly offset on the Xtion. This means that the two images (RGB and depth) are not seen from the same point, so they are slightly different. To solve this, one of the steps performed by the openni2 driver is registering the depth image. This consists of mapping each depth pixel onto the corresponding RGB pixel by slightly warping the image. This way, we obtain the registered depth image (available on a separate topic), which is seen from the same point of view as the RGB image. This is easier to see in the images from the ROS documentation.

RGB image

Depth image (before registration)

Disparity between both

Disparity after the registration process

Another thing to mention is that, originally, the depth data (16 bits per pixel) represents the depth (in mm) measured at each pixel. Unfortunately, somewhere in the driver pipeline it is truncated to 8 bits, so we lose the absolute distance information. As a consequence, we only have relative distance data, from 0 to 255, and we will work with this relative distance.

How do we implement this? The comm class, available in JdeRobot as a Python package, lets us control both sensors just by specifying the topic in the YML configuration file. This made the jump from ICE to ROS practically immediate, which is really worth mentioning.

Functional process: the node (which is practically identical to dl-objectdetector in this respect) processes the incoming RGB image in real time with a TensorFlow CNN (in this example, we used one based on SSD, on a MobileNet architecture [link to the network file], which allowed us to achieve a processing time of 120-150 ms on an Nvidia 940M GPU). Once a person is detected, we know where the person is (as we have the bounding box which surrounds them).

We crop that region from the depth matrix, which gives us the relative depth of each pixel inside the bounding box (this is why registering the depth image matters). As the box probably contains several pixels that belong to the background, we crop it even more, discarding the outer 10% on each side, with the aim of having the highest possible "human-background" ratio inside the box we process. After that, we sample the measured depth on a 10x10 grid, which should fall inside the person zone, and compute its median (it is more robust than the mean, since it is barely affected by background samples, by small variations in the measurements over the body, or by noise in the image). This is the measured distance to the person.
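The depth measurement just described can be sketched as follows. This is a minimal illustration under some assumptions, not the node's actual code: the depth image is a plain list of rows with relative values from 0 to 255, the bounding box is given as (x_min, y_min, x_max, y_max) with exclusive maxima, and the function name is made up.

```python
from statistics import median


def person_depth(depth, box, margin=0.10, grid=10):
    """Median relative depth sampled on a grid inside a shrunken bounding box."""
    x_min, y_min, x_max, y_max = box
    width, height = x_max - x_min, y_max - y_min
    # Discard the outer 10% on each side to drop most background pixels.
    x0 = x_min + int(margin * width)
    x1 = x_max - int(margin * width)
    y0 = y_min + int(margin * height)
    y1 = y_max - int(margin * height)
    if x1 <= x0 or y1 <= y0 or grid < 2:
        raise ValueError("bounding box too small or grid too coarse")
    samples = []
    for i in range(grid):
        for j in range(grid):
            # Evenly spaced sample positions covering the cropped box.
            y = y0 + (y1 - 1 - y0) * i // (grid - 1)
            x = x0 + (x1 - 1 - x0) * j // (grid - 1)
            samples.append(depth[y][x])
    return median(samples)


# Synthetic 100x100 depth image: background at 200, a "person" at 80.
depth_image = [[200] * 100 for _ in range(100)]
for row in range(10, 90):
    for col in range(30, 70):
        depth_image[row][col] = 80

# The box is deliberately wider than the person, so some samples hit background.
print(person_depth(depth_image, (20, 10, 80, 90)))
```

In this synthetic case some of the samples still land on the background, and the median ignores them, whereas the mean would be pulled towards the background value.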

Now, we have two possible movements on the Turtlebot: rotation and straight-line movement, controlled by the RGB and depth measurements, respectively. If we apply several PID controllers (which, for now, are not fully tuned), we can perform basic navigation to follow the detected person! In addition, the robot performs a basic search behavior when the person is out of view, rotating in the direction where the person was last detected (with beeping included, thanks to having learned to use ROS topics!).
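As a rough sketch of this control scheme (not the tuned controllers of the node), the following code uses one PID for rotation, driven by the horizontal offset of the bounding box, and another one for the straight-line movement, driven by the measured relative depth. The gains, the target depth and the function names are hypothetical. The sign convention assumes the usual ROS one, where a positive angular velocity turns the robot to the left.

```python
class PID:
    """Textbook PID controller with explicit time step."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        if dt <= 0:
            raise ValueError("dt must be positive")
        self.integral += error * dt
        # No derivative term on the first call (there is no previous error yet).
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def follow_commands(box_center_x, image_width, depth, target_depth,
                    angular_pid, linear_pid, dt):
    """Return (angular, linear) velocity commands for the robot base."""
    half = image_width / 2
    # Positive when the person is on the left half of the image.
    horizontal_error = (half - box_center_x) / half
    # Positive when the person is farther away than desired.
    depth_error = depth - target_depth
    return angular_pid.update(horizontal_error, dt), linear_pid.update(depth_error, dt)


# Hypothetical gains: proportional-only, just to show the behavior.
angular = PID(kp=1.0, ki=0.0, kd=0.0)
linear = PID(kp=0.01, ki=0.0, kd=0.0)
w, v = follow_commands(160, 640, depth=150, target_depth=100,
                       angular_pid=angular, linear_pid=linear, dt=0.1)
print(w, v)
```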

We can see a basic example here:

The big advantage is the strong stability of the detection thanks to the neural network, which keeps detecting the person correctly in different poses (facing forward, facing backward, crouching, etc.). The abrupt stops occasionally seen in the video are due to imprecise tuning of the PID parameters.

Followperson: intro

This node, available in the project repository, uses a Sony Evi D100P PTZ (Pan, Tilt, Zoom) camera, which is mounted on a mechanical base that lets it pan and tilt via Evi messages (a Sony proprietary protocol). Since JdeRobot has a component (evicam_driver, thanks to @aitormf for this driver) capable of communicating with these motors over an ICE connection, we can do cool things by processing the incoming images from the camera itself (which, by the way, supports a ROS connection). Thanks a lot to @cawadall for his help handling this device.

So, as you might be wondering, yes, we have connected a Neural Network to this device.

By developing a simple position control algorithm, we can analyze the bounding boxes which the Neural Network outputs for each incoming image from the PTZ camera, and compute a movement command for it, which is sent to the PTZ motors via the evicam_driver component.
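A simple position control of this kind could look like the sketch below. It is an assumption-laden illustration, not the algorithm in the repository: the movement command is a (pan, tilt) step proportional to how far the box center is from the image center, with a dead zone to avoid jitter and a clamp on the step size. The gain, the dead zone, the step limit and the sign convention (positive pan to the right, positive tilt downwards) are all hypothetical, and sending the command through evicam_driver is left out.

```python
def ptz_step(box, image_size, gain=10.0, dead_zone=0.1, max_step=5.0):
    """Compute a (pan, tilt) step that moves the box center towards the image center."""
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    center_x = (x_min + x_max) / 2
    center_y = (y_min + y_max) / 2
    # Normalized offsets in [-1, 1]; 0 means the box is centered.
    error_x = (center_x - width / 2) / (width / 2)
    error_y = (center_y - height / 2) / (height / 2)

    def step(error):
        if abs(error) < dead_zone:
            return 0.0  # close enough, do not move
        return max(-max_step, min(max_step, gain * error))

    return step(error_x), step(error_y)


# Person at the right edge of a 640x480 image: pan right, no tilt needed.
print(ptz_step((480, 0, 640, 480), (640, 480)))
# Person already centered: no movement.
print(ptz_step((300, 220, 340, 260), (640, 480)))
```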

Images/videos coming soon.

Capability to load more models (on both TF + Keras)

At last! We can load any neural network (SSD/RCNN on TF, SSD300x300/SSD512x512 on Keras) into the component by downloading it (a .pb file for TF, from the TF model zoo, where the file is inside the desired .tar.gz archive; a .h5 file for Keras, from the original project page) and placing it in the suitable folder (Net/{Keras; TensorFlow}). In the Keras framework, you can load a .h5 file containing the full model (such as the included example full_model.h5, inside the file), or a .h5 file containing only the weights of the network (as in the previous page).

In addition, you will have to specify which dataset the model was trained on. This is because the output of the CNN is an index (e.g. 23), which is then mapped to a class (e.g. "person"). These index-class pairs are different for each dataset, so please indicate the origin of the model in the YML file (Model.Dataset field), so that the correspondence is right and you are not labeled as a chair ;)
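The reason the dataset matters can be shown with a tiny lookup sketch. The dictionaries below are only small illustrative excerpts, not complete label maps, and their index values are assumed to follow the label maps shipped with the TensorFlow Object Detection API. The function name is made up.

```python
# Illustrative excerpts only (assumed values, not the full label maps).
LABEL_MAPS = {
    "COCO": {1: "person", 15: "bench", 62: "chair"},
    "VOC": {9: "chair", 15: "person"},
}


def label_for(dataset, index):
    """Translate a class index from the network output into a readable label."""
    if dataset not in LABEL_MAPS:
        raise KeyError("unknown dataset: %s" % dataset)
    return LABEL_MAPS[dataset].get(index, "unknown")


# The same index means different things depending on the training dataset.
print(label_for("COCO", 15))
print(label_for("VOC", 15))
```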

Keras integration

As the result of a merge process (including a kind of standardization of the network output), the framework used is now completely transparent to the GUI, which renders the bounding boxes with their labels and scores either way. So, now we can use a Keras SSD detector by loading the weights.h5 or the model.h5 into the component (via the YML file). Many further details are available in the file of the repository.

Node hierarchy

At this point, the project has been completely refactored. Before this, we had some blocking calls between the components (e.g. the GUI froze until the network yielded the output image), which made it really inefficient. Now the components are implemented asynchronously: each one reads what it needs from another component when it needs it, and nothing is passed to them or awaited as the return value of any method. This makes it far more efficient and removes practically every race condition previously found. It also makes the threaded structure more meaningful.

By the way, the GUI component now includes fps (frames per second) counters for both the video capture and the CNN frame processing! Keep in mind that the fps cannot exceed the rate set by the timer that the user requests for each thread: if we want the network to predict every 100 ms, our maximum framerate will be 10 fps. To make this clear, the requested timers are printed when the program starts.
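The relation between the requested timer and the maximum framerate is just an inversion of the period. A small sketch (the function names are hypothetical, and the real counters in the GUI may be computed differently):

```python
def max_fps(period_ms):
    """Upper bound on the framerate of a thread that runs every period_ms."""
    if period_ms <= 0:
        raise ValueError("the period must be positive")
    return 1000.0 / period_ms


def measured_fps(timestamps):
    """Average framerate from a list of update timestamps (in seconds)."""
    if len(timestamps) < 2:
        return 0.0
    elapsed = timestamps[-1] - timestamps[0]
    return (len(timestamps) - 1) / elapsed if elapsed > 0 else 0.0


print(max_fps(100))                    # network timer of 100 ms
print(measured_fps([0.0, 0.5, 1.0]))   # three updates within one second
```

The measured value can be lower than the bound (for instance, when the network needs longer than the timer period to process a frame), but never higher.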

Functional real-time component

The functional version is here! It supports loading different models (thanks Vinay), which can be downloaded from the TensorFlow model zoo, or trained by yourself (in a little while). It also supports real-time operation via a three-threaded architecture, which dedicates a separate thread to the network update.

Here is a video of the functional version of the dl-objectdetector:

The version shown in this video is based on an SSD model trained on the COCO (Common Objects in COntext) dataset, which has 90 classes.

JdeRobot dl-objectdetector component

I have created the new JdeRobot component, dl-objectdetector, which, relying on a pretrained TensorFlow model (from all the models you can choose here, thanks a lot TensorFlow people), is capable of detecting up to 90 different classes of objects. The very first version, based on an SSD detection network trained on the COCO (Common Objects in COntext) dataset, can switch the detection flow on and off: a button in the component toggles the real-time detection. In addition, there is another button to process a single frame. Images/video will be released here soon.

This component requires cameraserver (or a ROS video source, thanks to the YML/comm implementation) running in the background. Also, as we are working with an imported TensorFlow model, we first need to compile its protobufs, which are the structures where the network parameters and structure are saved. The best way to do this, as usual, is to follow their installation guide.

Building a component

As we've said, we are focused on creating a component capable of detecting objects in real time (or on demand) in a given image, coming from cameraserver in this case. I have modified the classifier source code to create something similar to this. The buttons are not functional yet, because the on-demand service is not implemented yet (due to some conflicts with thread concurrency while I was developing it), but it will be soon!


As I have already reviewed the background on classification (thanks to David's and Nuria's projects), from now on we will focus on detection, which consists of locating an object in a given image and enclosing it in a bounding box. A usual approach is to use color filters (as we could do in the drones part), but we will try to do it the same way as the digit classifier: using a TensorFlow-powered CNN.

For the moment, our short-term goal is to create a component similar to the classifier, which takes video from cameraserver and feeds it to a neural network. The network will output the same image with a bounding box surrounding each detected object, together with the category into which it has been classified and the score of this classification, that is, the probability that the object belongs to the predicted class. As we can see, this behavior involves not only detection, but also classification.

For the moment, we will do this by using the pre-trained TensorFlow models available here. This repository provides a pretrained model (SSD on the COCO dataset by default, but many other models are available in this model zoo), so the only thing we have to do is to instantiate a new network and load the provided model into it, with its whole graph structure and its weights. Really easy to do! As the repository also provides a Jupyter Notebook for offline testing with images, I have taken this code and created a live detector (available soon) based on OpenCV video capturing, since I have had some threading issues when trying to embed it into a JdeRobot component, which I expect to solve soon.

Digit Classifier

Improving the training process

Node hierarchy

This component has also been completely refactored (see objectdetector). You can view the results here:

Final version

At last, I got the definitive process, by refactoring the training structure as described. This allowed me to get results similar to those of the Keras network, with equal performance, at the cost of a much longer training (about 1.5 h instead of 10 minutes). Now the network itself can be invoked as the manager (the file contains the Network class and the __main__() method, which lets you train and/or test the network by modifying some parameters, and save the results into the desired directory and results matrix, to be analyzed with the Octave script). So, the project structure is now tidier.

The results we can achieve during the training are shown below. In this particular case, the training was halted by the patience criterion, which stops the training when the loss gets worse 2 times in a row. It completed 32 of the 100 epochs. Maybe we could get even better results with a longer training; we can study it!

Also, I added YAML (a data serialization format) support, mainly to let the component know where the cameraserver proxy is created. I did this by replacing ICE with the comm library. This library allows better behavior and compatibility (by using a flag in the digitclassifier.yml file) with both ICE and ROS (so we could load this component and feed it with a YouTube video, or a live stream from a drone!). Also, we can give the proxy created by cameraserver (which relies on ICE) more flexibility (using a different IP/port than the default, which would allow us to process images from a different computer streaming images to the cameraserver proxy). I also added a beautiful JdeRobot icon for the component.

Below you can watch the video of the final version of the digit classifier, which works perfectly even over poorly visible digits:

Getting Keras' performance

The final goal for our classification tool is to get the same results that David got with his Keras network. I trained that network myself using his code, and I found that his training is much longer than mine because of its structure. I was training by passing random batches (sets of a certain size composed of labeled images from the dataset) for a given number of steps (the basic structure I followed from the first tutorials). His training is instead divided into epochs, which define how many times the entire dataset is visited. Each epoch is divided into batches, the "unit" of images that we pass to the network for each gradient update. The third parameter is the batch size, which tells how many image-label pairs each batch contains. These last parameters are related to the dataset size, since an epoch is a complete pass through the dataset: the number of batches per epoch is the dataset size divided by the batch size.
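The relation between these three parameters can be checked with a short sketch. The numbers (a 60000-image dataset, batches of 128 images, 100 epochs) are just example values, and the function names are made up.

```python
import math


def batches_per_epoch(dataset_size, batch_size):
    """Number of gradient updates needed to visit the whole dataset once."""
    return math.ceil(dataset_size / batch_size)


def iter_batches(dataset_size, batch_size):
    """Yield the (start, end) index range of every batch in one epoch."""
    for start in range(0, dataset_size, batch_size):
        yield start, min(start + batch_size, dataset_size)


# Example values, not the ones used in the project.
updates = batches_per_epoch(60000, 128)
print(updates)          # batches in one epoch
print(updates * 100)    # total gradient updates over 100 epochs
print(list(iter_batches(10, 4)))
```

Note that the last batch of an epoch can be smaller than the rest when the dataset size is not a multiple of the batch size.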

So, as this training process is far more thorough than the one I had been following, I will stick to it in order to get a better network. As a bonus, I have included a progress bar for each epoch during the training process, as Keras does :)


I have been messing around with TensorBoard, a tool that lets you view parameters related to the network (even in real time, while it is being trained!) in a browser. It requires a few changes in the code, since it works on the log files that the network itself generates during the training loop. For the moment, since we are already using the Octave script to visualize the accuracy/loss performance during training, validation and testing, we will use TensorBoard to visualize a general schema of the network:

Since it is an interactive environment, you can go deeper into the nodes. For example, if we double-click on the fc1 layer (the first fully-connected layer of the network), we can see its internal structure: the tensors which make up the weights and the biases of that layer, as follows:

Format fix, advanced version of the classifier

Finally, I caught the dataset bug: while the standard TensorFlow MNIST library (tensorflow.examples.tutorials.mnist) encodes each pixel from 0 to 1 (0 being totally black and 1 totally white), the HDF5 datasets we introduced previously were encoded from 0 to 255, following the same rule. Thus, the optimizer tuning the network's weights with its learning rate was not useful at all (practically infinite training would be needed to reach the true pixel values). To fix this, we scale every HDF5 dataset to the 0-1 range right after importing it. This fixes everything.
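The fix boils down to dividing every pixel by 255 right after loading. A minimal sketch, assuming the images are available as lists of pixel values (the real code works on the arrays read from the .h5 files, and the function name is made up):

```python
def scale_pixels(images, max_value=255.0):
    """Rescale pixel values from the 0..max_value range to the 0..1 range."""
    return [[pixel / max_value for pixel in image] for image in images]


# One tiny "image" with a black, a white and a dark gray pixel.
print(scale_pixels([[0, 255, 51]]))
```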

Now, we will be training with much larger datasets (as we said before, the 1-6 dataset contains 6 augmented images for each standard MNIST image), so the number of training steps has to grow accordingly for the training to be long enough.

Octave benchmarking and network manager

Now that I can load any previously existing HDF5 dataset (which lets me train a network with augmented MNIST datasets, or any dataset in general, which will be useful in the future for applications other than digit classification), the next step was clear: benchmarking datasets.

So, I reused the Octave (free numerical computing software) script which David and Nuria created to evaluate a given model. While you are training or testing a network with HDF5 datasets (training, validation and testing), you can obtain the precision and recall for each class (digits 0-9), and the confusion matrix, as I did here. It also collects the accuracy and the loss function value during the training process, on each step (training parameters) and every 100 steps (validation parameters). So, you can get interesting graphs.

In addition, as we just said, I had to modify the training process to implement the accuracy/loss readout on each step, so why not implement early stopping as well? This means that we can automatically stop the training process if the monitored result (accuracy/loss) does not improve over consecutive training steps. This helps us avoid overfitting our model.
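The idea of early stopping can be sketched as follows. This is a generic illustration, not the code of the network manager: it assumes the monitored value is a loss (lower is better), it counts consecutive evaluations that do not improve on the best value seen so far, and the patience value and the loss sequence are made up.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.bad_rounds = 0

    def should_stop(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_rounds = 0   # improvement: reset the counter
        else:
            self.bad_rounds += 1  # no improvement on the best loss so far
        return self.bad_rounds >= self.patience


def evaluations_until_stop(losses, patience=2):
    """Return how many evaluations are consumed before training is halted."""
    stopper = EarlyStopping(patience)
    for count, loss in enumerate(losses, start=1):
        if stopper.should_stop(loss):
            return count
    return len(losses)


# Hypothetical loss values: the last one is never reached.
print(evaluations_until_stop([0.9, 0.7, 0.6, 0.65, 0.62, 0.5]))
```

The last value of the example shows the trade-off: a small patience can halt the training just before a later improvement, so it is worth trying a few values.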

So, I have implemented a network manager which allows us to train and/or test a model by using HDF5 datasets, and tuning the training process. It yields the training results to a matrix (.mat) file, which we pass to the Octave function bencmark.m, resulting in the graphs.

However, we have a conflict with the data format: when training with the standard MNIST dataset loaded from HDF5, we get the same poor stats as last week, as shown below. Now that we have the benchmarking tool, we can work out the necessary conversion for the next update.

HDF5 datasets

This time, I have focused on using other datasets: the ones that my colleagues David Pascual and Nuria Oyaga had previously built. They are standard MNIST datasets that have been modified with a random scale and rotation factor, plus Gaussian noise, in order to augment the training data so that the resulting CNN is more robust. They are referred to as x-y datasets, meaning that they contain y modified images for every x original images.

These datasets are stored in HDF5 format, which is a kind of container for large amounts of data, independent of the kind of data you want to store. So, I have retrieved these datasets (thanks a lot, Nuria and David) and I have developed a Python module to extract the images and data from these .h5 files.

Besides that, I'm having some issues with the training process using this augmented data, so I will keep working on it.

Implementing the component on TensorFlow

Component working

I finally got the classifier to work, thanks to a refactoring: now the network is an object, which is created and loaded only once during the whole execution of the component (before this, everything was loaded again for each new image arriving from cameraserver, which was so inefficient that it blocked the whole program). So now it works well. It is accurate, taking into account that it is dealing with real images, not test images from the MNIST database. Here is a video showing it:

As it is already functional, I have begun to focus on the training process, where I will be able to test how its parameters affect the network's performance. To get started, I have retrieved the confusion matrix for all classes, where I can clearly see which classes (digits) are most correctly detected in the test process. It also lets me calculate the precision and recall for each class, which will be really useful soon, in order to improve the network. Here is a screenshot of the testing process:

In this implementation of the confusion matrix, as we can read in the TensorFlow documentation, the matrix rows map to the real labels of the test set, and the columns map to the labels predicted by the network. This way, we can easily find where the classifier fails most often. For example, in the previous image, we can see that the digit "6" is often confused with "0": it has been confused 10 times. We know this because there is a 10 in the seventh row of the first column, i.e. matrix[6][0] with zero-based indexing (the seventh row corresponds to the digit "6" because the rows and columns start at the digit "0", not "1").
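To make the layout concrete, here is a small pure-Python sketch that builds a confusion matrix and finds the most frequent confusion. It is not the TensorFlow call used in the project. In this sketch, rows index the real labels and columns index the predicted labels, and the label lists are made up.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count matrix where matrix[real][predicted] is the number of such cases."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for real, predicted in zip(true_labels, predicted_labels):
        matrix[real][predicted] += 1
    return matrix


def most_confused(matrix):
    """Return (real, predicted, count) for the largest off-diagonal entry."""
    best = (None, None, 0)
    for real, row in enumerate(matrix):
        for predicted, count in enumerate(row):
            if real != predicted and count > best[2]:
                best = (real, predicted, count)
    return best


# Hypothetical test results: two sixes are mistaken for zeros.
real = [6, 6, 6, 0, 0, 1]
predicted = [0, 0, 6, 0, 0, 1]
matrix = confusion_matrix(real, predicted, num_classes=10)
print(matrix[6][0])
print(most_confused(matrix))
```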

On the other hand, over the next weeks I will also keep trying to use the edge images which I mentioned in Week 5. I am having some trouble with this process, but I'm on it. Despite that, in the previous video we can see that, although the network was trained on normal MNIST images (no edge processing), it predicts the cameraserver images pretty well, even though, as we can see in the video, they are processed with this Sobel filter before being fed into the network. So it looks nice!

CNN building process

This week, I have focused on getting better results from the CNN which I embedded last week in the Classifier component. Before this, it was barely a convolutional layer (with a single tensor for the weights and another for the biases), so it achieved a success rate of 92% on a test set (images identical to the training ones, without any kind of noise, transformation, etc.), but it did not get very good results in real time (processing images from the camera). Now, I have implemented several more layers (convolutional -> pooling -> convolutional -> pooling -> fully connected -> dropout -> fully connected). This makes it more complex and more reliable. Also, I have modified the training dataset with a Gaussian filter and a Sobel filter. Thus, the network has been trained with the edges of the images without noise, so it is much more robust thanks to the "imperfect" training set. This is also important because it is the same preprocessing that we apply to the image provided by cameraserver before putting it into the network. It makes it possible to work with a network trained on images that are not that far from the real images it will have to classify. If we test this model, we can achieve an accuracy of 98%, which is wonderful.

However, I have had several difficulties during this process, mainly caused by memory usage problems, which made the Python programs crash with a segmentation fault. I could fix this in the training and testing processes, but I think it is still causing bad behavior in the full component: it launches and classifies fine in real time, but the image from cameraserver freezes after a few seconds of use, maybe because of the modeling of the network itself. I will keep working to eliminate this, in order to experiment with real-time images on my network and fine-tune the modeling and training parameters.

Also, I will look for more trained TensorFlow networks, in order to embed them in the component and test it classifying other kinds of objects (animals, cars, people, etc.). I will also try to train my model on other datasets (COCO, VOC, etc.), so that it can classify more than just numbers (people, cars, animals, etc.).

Beginning to embed the network

This was a little more difficult, but I have been able to embed the TensorFlow neural network in the JdeRobot component. For now it is not useful at all, because it is a really simple network that has not been trained properly, but it works! The component is now able to output a number in real time using the images served by cameraserver.

Now, I can focus on a better training for the network, without having to modify the component at all, because it loads the network variables (weights, biases, etc.) from external checkpoint files, which are generated by the training script.

On the other hand, I have been reading about the precision and recall concepts, applied to CNNs, which can be defined as:

  • Precision: how many of the images labeled with a class actually belong to that class (over all the images labeled with that class)

  • Recall: how many of the images that belong to a class the classifier finds (over all the images of that class)

Note: TP = True Positive, FP = False Positive, FN = False Negative
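In terms of these counts, the standard definitions are precision = TP / (TP + FP) and recall = TP / (TP + FN). A short sketch with made-up counts for a single class (returning 0.0 for an empty denominator is a choice of this sketch, not something taken from the project):

```python
def precision(tp, fp):
    """Fraction of the images labeled with a class that really belong to it."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Fraction of the images of a class that the classifier found."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Hypothetical counts for one digit class.
tp, fp, fn = 90, 10, 30
print(precision(tp, fp))
print(recall(tp, fn))
```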

Getting started

Keras component working

After some difficulties, I got David Pascual's component working! His number classifier, powered by Keras, is now working on my machine. I had to make a few changes to his original code (basically due to the change of machine), so I forked his original GitHub repository and committed my small changes. You can see this in the forked repo. As you can see, it is working fine!

I am also trying to find out how to connect a JdeRobot component that uses cameraserver (like David's Keras one) with a CNN trained in TensorFlow. I've just found this great tutorial, which not only explains what happens inside a neural network, but also provides an example of how to feed an image to a previously trained CNN in TensorFlow! So, I have followed the example, which you can find here. The next steps could be trying to "attach" a TensorFlow CNN to David's component, and getting a real-time image (via cameraserver) preprocessed and fed into the network (we can perform a fine training later). By the way, I will also focus on saving a network, so that I can train it and use it elsewhere.

This is an example of the heat map that you get for each class (0-9) after training a network. The color of each pixel represents the weight of that pixel for that digit (based on the training set). These weights are multiplied by the corresponding pixels of the image being classified, and the results are summed and passed to a softmax function.
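That computation can be sketched in pure Python. This is only an illustration of the idea with a toy 2-pixel "image" and 2 classes. The weights, the biases and the function names are made up, and the real network works on 784-pixel MNIST images with 10 classes.

```python
import math


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def class_probabilities(pixels, weights, biases):
    """Weighted sum of the pixels for each class, followed by a softmax."""
    logits = [
        sum(w * p for w, p in zip(class_weights, pixels)) + bias
        for class_weights, bias in zip(weights, biases)
    ]
    return softmax(logits)


# Toy example: the first class "likes" the first pixel, the second class the other.
pixels = [1.0, 0.0]
weights = [[2.0, 0.0], [0.0, 2.0]]
biases = [0.0, 0.0]
probabilities = class_probabilities(pixels, weights, biases)
print(probabilities)
print(probabilities.index(max(probabilities)))  # predicted class
```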

Beginning on TensorFlow

Once we decided to go on with TensorFlow (the Deep Learning framework developed by Google), I installed it on my machine. For the moment, I have installed the CPU-only version. As the pre-compiled versions (installable via pip/anaconda) may be less efficient, I built it from sources on my machine (not difficult, I promise). Once it was installed, I tested a few examples of how it works:

  • Getting Started With TensorFlow : this is a simple guide from the TensorFlow site itself, where you can learn how its high-level logic works, and train yourself a simple network which will find values for a linear regression.
  • TensorFlow Tutorial For Beginners : an external tutorial where you can create another network which will classify traffic sign images. You will train it with a training set, and evaluate it with another set (which the network has not ever seen during the training process). This page is useful to learn a few new concepts about how TensorFlow manages a Convolutional Neural Network.
  • MNIST For ML Beginners : this is another tutorial from the TensorFlow site, which is really useful to keep learning. It is particularly useful because it uses the MNIST dataset (composed of images of handwritten digits from 0 to 9, labeled accordingly), which will be our main dataset to train the network we are creating in a few weeks. This tutorial teaches you further concepts, such as the loss function (cross-entropy in our case), the model (softmax regression for the moment) and the optimization algorithm (gradient descent). You can write your own Python script, which you can compare later with another one located on TensorFlow's GitHub, which is linked in the tutorial. The final result of this tutorial is to train a network which you create yourself, and to evaluate it with a test set, obtaining an accuracy of about 92% correct predictions on the test set.


In the first week of the project, I installed JdeRobot on my machine. I have run several examples:

I have also read the TFG reports of Nuria Oyaga and David Pascual, based on training and testing CNNs with Caffe and Keras, and using them to develop components capable of classifying handwritten numbers in real time.