Project Card

Project Name: Object Tracker

Author: Alexandre Rodríguez Rendo

Academic Year: 2017/2018

Degree: Computer Vision Master (URJC)

GitHub Repositories: 2017-tfm-alexandre-rodriguez

Tags: Deep Learning, object segmentation, object detection, object tracking

State: Developing


Week 19: Keras models

This week the Keras network models were finally integrated into the application. The supported models include the SSD_300x300 and SSD_512x512 architectures. The next step is to test new trackers.

Week 18: Offline mode and close

Last week I solved some small pending tasks. I refactored the offline mode to adapt it to the new architecture of the Network side. This mode already worked at the very beginning of the project and gives the user the option of running the program without the GUI, saving the results of the application as .jpg images. Apart from that, the GUI can now be fully closed by clicking the Close button. In previous versions, the program kept running in the background and had to be stopped from the terminal.

The latest version is available online. The component was renamed from dl_objectsegmentator to dl_objecttracker.

Week 17: Using dl-objectdetector

With the previous stable version of the tracker working fine, I started to add more neural networks for object detection using dl-objectdetector from JdeRobot. For that purpose, I tested the component with both Keras and TensorFlow models and then integrated it into dl-objecttracker. At the moment it only uses TensorFlow models, but a Keras version will be ready soon. I tested some models from the TensorFlow detection model zoo, such as SSD and Mask R-CNN.

Apart from that, new sources are going to be added to feed the program: local camera, local video and stream (ROS/ICE). For now, the program has only been tested with local video because the other sources still have bugs that need to be fixed.

Week 16: Refinement of synchronism and tracker

Over the last few weeks I have been working on a stable tracker that gives the application better synchronization between the different branches (Camera, Net, GUI) and the tracking itself. To that end, the internal logic of the tracker was modified so that it works more flexibly with the buffer it takes as input. The result is a tracker with 3 modes: slow, normal and fast (depending on the average tracking FPS over the previous frames). For example, if the tracking is running slow, a number of frames in the buffer are skipped to keep the buffer from growing more than expected. If the tracking is running fast, the tracker slows down so that the tracking does not finish before the neural network gives a result.
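
The mode selection described above can be sketched as follows. This is only an illustration: the thresholds, the function names tracker_mode and plan_step, the number of skipped frames and the pause length are all assumptions, not the values used in dl_objecttracker.

```python
# Hypothetical thresholds in frames per second; the real values are not given in this log.
SLOW_FPS = 10.0
FAST_FPS = 25.0


def tracker_mode(recent_fps):
    """Classify the tracker as 'slow', 'normal' or 'fast' from recent FPS samples."""
    if not recent_fps:
        return "normal"
    average = sum(recent_fps) / len(recent_fps)
    if average < SLOW_FPS:
        return "slow"
    if average > FAST_FPS:
        return "fast"
    return "normal"


def plan_step(recent_fps, skip_when_slow=2, pause_when_fast=0.05):
    """Return (mode, frames_to_skip, seconds_to_pause) for the next tracking step."""
    mode = tracker_mode(recent_fps)
    if mode == "slow":
        return mode, skip_when_slow, 0.0   # skip frames so the buffer stops growing
    if mode == "fast":
        return mode, 0, pause_when_fast    # wait so the network can catch up
    return mode, 0, 0.0
```

The caller would drop frames_to_skip frames from the buffer, or sleep for seconds_to_pause, before tracking the next frame.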

Related to the multi-object tracking problem (tracking multiple objects lowers the FPS rate), I looked for new trackers following posts from LearnOpenCV and pyimagesearch. I am currently working with the TLD tracker, but it has some problems with false positives. However, it is the best option available for my purpose in version 3.3.1 of OpenCV (the one included in JdeRobot). One of the next steps is to test the MOSSE and CSRT trackers, which are available in more recent OpenCV versions and look promising, especially MOSSE because of the speed requirements.

At the same time, I am going to include the JdeRobot object detector to make use of networks other than the current Mask R-CNN.

Week 15: Camera buffer and tracker updates

Now the GUI uses the images coming from the Camera buffer directly instead of using its own buffer. The tracker has a mechanism that prevents the tracking process from running much faster than the neural network detections.

Week 14: Net result bug in GUI fixed

The neural network result view in the GUI sometimes showed the image segmented by Mask R-CNN together with a bounding box from the tracker; this was fixed. The next necessary improvement is to move the buffer completely to Cam, without keeping it in the GUI branch too.

Other future tasks include incorporating the DetectionSuite and ObjectDetector components from JdeRobot.

Week 13: Circular buffer (first version) and buffer in Cam

Once the first version of the buffer with delay is running, the next necessary step is to implement a circular buffer to keep this buffer from growing. But first, the buffer with delay (and also the instructions that control the different branches) was moved to the Cam branch too, so that the application can run without the GUI. At the moment, this GUI-off option saves the results that were displayed in the 'Combined' window as .jpg files. To execute the main program you now need to pass a final argument in the terminal, off to run without the GUI or on to start it:

python2 objectsegmentator.yml off
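
The on/off argument could be read along these lines. This is a minimal sketch: the function name parse_args and the argument order (configuration file first, GUI flag last) are assumptions based on the command above, not the project's actual code.

```python
def parse_args(argv):
    """Return (config_path, gui_enabled) from arguments like ['objectsegmentator.yml', 'off']."""
    # Assumed layout: exactly two arguments, the second being 'on' or 'off'.
    if len(argv) != 2 or argv[1] not in ("on", "off"):
        raise ValueError("usage: <config.yml> on|off")
    config_path, gui_flag = argv
    return config_path, gui_flag == "on"
```

In a real script the list would typically come from sys.argv[1:].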

The circular buffer controls the buffer size, which tends to grow because of changes in the tracker speed (FPS). For this reason, the first version of this tracker needs to handle two main situations: tracker fast and tracker slow (measured in FPS). In the first case, the old frames in the buffer are discarded for the next tracking. In the second case, some frames are skipped. With these changes, the tracking and the segmentation stay closer to the real frames captured by the camera and, as mentioned before, the buffer no longer grows without control.
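
A bounded buffer of this kind can be sketched with collections.deque from the Python standard library. The class name FrameBuffer and its methods are hypothetical and only illustrate the idea; they are not the implementation used in the component.

```python
from collections import deque


class FrameBuffer:
    """Fixed-capacity buffer: when full, appending drops the oldest frame."""

    def __init__(self, capacity):
        # deque(maxlen=...) discards items from the opposite end once it is full.
        self._frames = deque(maxlen=capacity)

    def push(self, frame_id, frame):
        self._frames.append((frame_id, frame))

    def pop_oldest(self):
        """Return the oldest (frame_id, frame) pair, or None if the buffer is empty."""
        return self._frames.popleft() if self._frames else None

    def skip(self, count):
        """Discard up to `count` of the oldest frames and return how many were dropped."""
        dropped = min(count, len(self._frames))
        for _ in range(dropped):
            self._frames.popleft()
        return dropped

    def __len__(self):
        return len(self._frames)
```

With a capacity of 3, pushing frames 0 to 4 leaves only frames 2, 3 and 4 in the buffer, so its size stays under control regardless of the tracker speed.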

But, as usually happens, this first version has some bugs that need to be fixed. For example, sometimes old frames are still 'alive' and it takes the program some time to catch up with the latest frames in the buffer.

This buffer upgrade is not available without GUI for the moment.

Week 12: Tracker fixes and GUI changes

This week I started by solving some tracker problems that affected the flow of the application (some more still need to be fixed). I also looked at the different types of tracker implemented in OpenCV (at the moment I am using TLD). For this purpose I used the LearnOpenCV website of Satya Mallick and the OpenCV documentation. I tested all the other trackers mentioned in the post (including GOTURN, which uses deep learning with an offline-trained model) but I found that the current one is the best for this purpose.

Furthermore, some bugs were fixed related to the GUI buttons and their behavior in the new buffer architecture (as with the tracker, I expect some more still need to be fixed). I included a new GUI setup with 4 images: the live input video, the combined result from tracker and neural network, and the separate results of those two. The images are now tagged with the frame number to make the application easier to understand (and to debug).

The image available at the link shows the current state of the application:

Week 11: First prototype of the delay buffer

This week I implemented a first prototype of the delay buffer, fixing some bugs of the previous version, but it still has some small failures pending. On the other hand, the Docker container has not yet been configured to allow graphic sessions, so the work in it is paused until that is done. The next steps include moving the buffer to the Cam branch, a circular buffer and an improved visualization (some GUI changes).

Week 10: Continuing with the improvements

Over the last 3 weeks I have been working on two different approaches to improve the behaviour of the application. On the one hand, I built a first prototype of the application with a delay buffer. This buffer makes it possible to show all the detections and segmentations in the GUI with a delay given by the length of the buffer at each moment, and this way all frames injected to the Net are the latest frames (this did not happen before). The current buffer has some small bugs that need to be fixed. Another proposal was a double-buffer technique, but it is not implemented yet.
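
The idea of displaying frames with a delay can be sketched as a simple delay line. This is a simplified illustration with a fixed delay, whereas the buffer described above has a delay that changes with its length; the class name DelayBuffer is hypothetical.

```python
from collections import deque


class DelayBuffer:
    """Return each frame only after `delay` newer frames have been pushed."""

    def __init__(self, delay):
        self._delay = delay
        self._frames = deque()

    def push(self, frame):
        """Store the newest frame; return the frame to display, or None while filling."""
        self._frames.append(frame)
        if len(self._frames) > self._delay:
            return self._frames.popleft()
        return None
```

The time gained by the delay is what lets the network and the tracker finish their work on a frame before it is shown in the GUI.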

On the other hand, I got access to a GPU in a Docker container (thanks to Francisco Rivas) running CUDA, where all the necessary packages were installed (TensorFlow, Keras, ...), and I launched the program without errors. The performance has not been measured yet because some features still need to be installed in the Docker container. There the program will capture from a recorded video (using the cameraserver) instead of the webcam, because the hardware has no real camera.

Some issues were fixed so that the program downloads the COCO-trained weights of the Mask R-CNN model if they are not already downloaded.
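
A check of this kind can be sketched as below. The function name ensure_weights and the download callable are hypothetical; the callable stands in for whatever routine actually fetches the weights.

```python
import os


def ensure_weights(path, download):
    """Call `download(path)` only if the weights file is not already on disk.

    Returns True if a download was triggered, False if the file already existed.
    """
    if os.path.exists(path):
        return False
    download(path)
    return True
```

Passing the download routine as an argument keeps the check independent of how and from where the file is fetched.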

Week 9: Mask R-CNN improvements

This week I tested the Mask R-CNN with different image sizes to reduce the execution time of a segmentation. The minimum size allowed by the net that obtains results and maintains the original aspect ratio is 540x404. The execution time now ranges between 23 and 25 seconds (using CPU).

I also improved the architecture of the program with 4 branches running independently: Camera, GUI, Net and Tracker. The temporary tracker uses a multitracker implemented in OpenCV. The continuous mode currently works with the detections (and segmentations) given by Mask R-CNN (after a considerable amount of time) followed by tracking-by-detection. In future improvements I expect to use a GPU to accelerate the detections.
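
Independent branches of this kind are commonly built with threads that communicate through queues. The sketch below shows only two of the four branches and uses hypothetical function names and a placeholder instead of real inference; it assumes Python 3, where the module is called queue (Queue in Python 2).

```python
import queue
import threading


def camera(frames, out_queue):
    """Camera branch: publish frames, then a None sentinel to signal the end."""
    for frame in frames:
        out_queue.put(frame)
    out_queue.put(None)


def net(in_queue, results):
    """Net branch: consume frames until the sentinel and store a dummy result."""
    while True:
        frame = in_queue.get()
        if frame is None:
            break
        results.append(("detections for", frame))  # placeholder for the real inference


def run(frames):
    """Start both branches as threads, wait for them and return the results."""
    frame_queue = queue.Queue()
    results = []
    branches = [
        threading.Thread(target=camera, args=(frames, frame_queue)),
        threading.Thread(target=net, args=(frame_queue, results)),
    ]
    for branch in branches:
        branch.start()
    for branch in branches:
        branch.join()
    return results
```

Because the branches only share the queue, a slow Net branch does not block the Camera branch from capturing new frames.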

In this video, the current functionality of the component can be seen:

Week 8: Optimization step

Once the Mask R-CNN is running, the next necessary step is to optimize it so that it can work in real-time conditions. For this purpose, we thought about two possible solutions.

The first one was to use GPU support, which could reduce the execution time considerably. I tried to use the GPU that Google provides for free in Google Colab, but I have had some problems installing the project's dependencies, so I will try the GPU available at the JdeRobot lab.

The second improvement could come from incorporating a feature-tracking algorithm into the application; this has not been implemented yet. Furthermore, I measured the execution times of the application with different input images. I compared the performance with one or more objects and with a larger or smaller area to be segmented. The conclusion is that these parameters have little influence on the final execution time. After that I tried smaller input images, but the network model seems to have problems with some image sizes.

Week 7: Run continuous mode added

This week I added the 'Run continuous' mode to the application, which allows the user to segment the video stream from the webcam in 'real-time'. This real-time is obviously conditioned by the time the computer needs to process an image using Mask R-CNN, which in my case is about 25 seconds (using CPU). The following image shows one of the results achieved with this new mode:

Week 6: Starting to build the object segmentator component

With the objective of building a visual memory for a robot, the first required step is to develop an object-segmentator component running in real time on a video stream. It will be built with a structure of 3 branches: Camera, GUI and Net. The first approach will be an application that works on the video from the camera server component, with two toggle buttons that let the user choose between passing the net a single frame or a continuous sequence of frames from the camera. For this purpose, I reuse parts of existing code, thanks to Nacho.

This week I got the Mask R-CNN model working in real time with a single frame from the camera. So far the net has recognized objects like 'person', 'cell-phone', 'bed', 'apple' and more without problems. The following video shows the application working (this is an early implementation; the Net is run on a laptop without GPU and with a poor camera, so the results could be improved).

Week 5: Mask R-CNN code review and test

The current task is to understand the available Mask R-CNN code implementation and test it on my personal laptop. To do so, I started by running a demo that uses a model pre-trained on MS COCO to segment objects in your own images. The first step to run the demo is to clone the Mask R-CNN repository mentioned before. Furthermore, this demo has the following main installation requirements: pycocotools and Keras with the TensorFlow backend. To install pycocotools (with Python 3) you need to type these instructions in the terminal:

git clone
cd coco/PythonAPI
sudo make install
sudo python3 install

and then append your local COCO folder to the system path.
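
In Python this can be done with sys.path. The folder below is a hypothetical placeholder, not the path used in the project; replace it with the location of your own clone.

```python
import sys

# Hypothetical location of the cloned COCO API; replace with your own path.
COCO_API_DIR = "/home/user/coco/PythonAPI"

# Add the folder only once so that repeated runs do not duplicate the entry.
if COCO_API_DIR not in sys.path:
    sys.path.append(COCO_API_DIR)
```

After this, an import of pycocotools from that folder can be resolved by the interpreter.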


The next image shows one of the results that can be achieved when you run this demo on test images (from the folder 'images' of the Mask R-CNN repository):

Week 4: State of the Art work

During the Christmas holidays I wrote a small part of the TFM report, which includes the introduction and the State of the Art / Related work sections. In this document I talk about robotics in Computer Vision, neural networks, and methods for tracking, detection and instance segmentation like Mask R-CNN, among other topics. The document is available at my GitHub repository.

Week 3

This week I ran David's code once I had the Keras model to feed the classifier, as can be seen in the following video:

After that I studied the code and looked at the implementation inside the digit classifier. This project follows the design below, from a high-level and a low-level point of view (images are from David's project):

Furthermore, I finished testing Marcos' project and launched it successfully. The next video shows the tracking performance over a set of frames from MOT16-04:

This project uses a hybrid tracking approach that combines neural network-based tracking with feature-based tracking. The first gives the system better detections but cannot work in real time, so it returns detections every 30 frames. Meanwhile, the feature tracking component, which can work in real time, computes the tracking between those frames. The process is shown in the next figure (image from Marcos' project):
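
Independently of the figure, the alternation between detection and tracking can be sketched as a simple loop. The function process and the detect and track callables are hypothetical stand-ins, and the sketch runs both steps sequentially, which may differ from how Marcos' project schedules them.

```python
DETECTION_INTERVAL = 30  # the text above states that detections arrive every 30 frames


def process(frames, detect, track, interval=DETECTION_INTERVAL):
    """Run `detect` every `interval` frames and `track` on the frames in between."""
    boxes = []
    results = []
    for index, frame in enumerate(frames):
        if index % interval == 0:
            boxes = detect(frame)        # slow but accurate neural network detection
        else:
            boxes = track(frame, boxes)  # fast feature-based update of the last boxes
        results.append(boxes)
    return results
```

Each detection resets the boxes, so tracking errors can only accumulate for at most interval - 1 frames.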

I also read an article about a recent object detection and segmentation technique called Mask R-CNN. The paper and a Python implementation are both available online. This framework efficiently detects objects in images and generates a high-quality segmentation mask for each instance. It can easily be generalized to other tasks such as person keypoint detection. The method extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. This decouples mask and class prediction, which allows better performance. The authors introduce a layer called RoIAlign to fix the pixel-to-pixel misalignment between network inputs and outputs in Faster R-CNN, and this way they preserve spatial locations.
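
RoIAlign avoids rounding region coordinates to whole pixels by reading the feature map at fractional positions with bilinear interpolation. The sketch below shows only that sampling step on a plain list of rows; the function name bilinear_sample is hypothetical and this is not the layer's full implementation.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map (list of rows) at fractional coordinates (y, x)."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    # Clamp the neighbouring indices so that samples on the border stay valid.
    y1 = min(y0 + 1, len(feature) - 1)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy
```

Sampling at whole-number coordinates returns the stored value unchanged, while positions in between blend the four surrounding values.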

Mask R-CNN outperforms existing state-of-the-art techniques on the COCO suite of challenges. It also gives better results in instance segmentation tasks on the Cityscapes dataset. The next images show some of these results in object segmentation (left) and keypoint detection for human pose (right):

Week 2: Code review and testing

This second week, the proposed task was to execute and study the code of the Final Master Project of Marcos Pieras and the Final Grade Project of David Pascual. The code is available at their respective GitHub repositories.

Once you have downloaded the repositories, you need to open a new terminal and type:

-Marcos' project:


-David's project:

cameraserver cameraserver.cfg
python digitclassifier.cfg 

But, as expected, it is not so easy. First, I had to install all the necessary dependencies for each project, which included Keras with the TensorFlow and Theano backends, following the corresponding installation instructions. After that I set up OpenCV and the JdeRobot tools to work properly together.

To use Marcos' project you also need to download additional content that provides the SSD VGG 300 net and other tools used in the project. Besides, you need a dataset to test the detection project, for example the MOT16-14 dataset, and the checkpoint used in the project. I had some problems loading the checkpoint, but I solved them with the help of information found online.

After this configuration, David's project should work fine, but the Keras model it uses, net_4conv_patience5.h5, was not available in his repository, so I asked him personally and he is going to update the repository soon.

I could not finish the complete task yet, but I hope to do it soon. I also appreciate the friendly help of Marcos and David in this process :)

Week 1: Getting started

In this first week, I started by installing the JdeRobot environment on my laptop following the provided installation steps. After that, I checked that my installation was working by playing with some examples provided in the Documentation section, for example the OpenCV demo. If you want to use this example, you have to open two terminals and type one of the following lines in each:

cameraserver cameraserver.cfg
opencvdemo opencvdemo.cfg

I have also read some previous work from colleagues, such as the Final Grade Projects of Nuria Oyaga ("Análisis de Aprendizaje Profundo con la plataforma Caffe") and David Pascual ("Study of Convolutional Neural Networks using Keras Framework") and the Final Master Project of Marcos Pieras ("Visual people tracking with deep learning detection and feature tracking"). These works gave me an introduction to the basic concepts of Deep Learning. Besides that, they show some of the State of the Art in detection, tracking and classification using Deep Learning techniques.