Project Card

Project Name: Object Tracker

Author: Alexandre Rodríguez Rendo

Academic Year: 2017/2018

Degree: Computer Vision Master (URJC)

GitHub Repositories: 2017-tfm-alexandre-rodriguez

Tags: Deep Learning, object segmentation, object detection, object tracking

State: Developing


Week 27: Problems with MOT, improving logger, raw images source, first concrete results

This week I solved a problem with the MOT dataset sequences I had obtained. The duration and FPS indicated on the dataset website appear to be incorrect, so I added a raw-image source in order to feed the application directly with the raw frames. I also realized that, following the instructions in the official paper of the dataset, the ground truth provided only takes the class 'person' into account, so this may need to be discussed.

Frames were not being logged when they contained no tracked object; this was solved. The logger was also changed to log the skipped frames (the previous version did the opposite).

Some frames are still not being logged (although most of them are), which needs to be fixed to obtain proper results. In the first tests using MOT (MOT17-02, to be exact) the AP obtained for the person class was 7.97%, with a precision of 81% and a recall of 9.7%. The results in other MOT sequences were similar. This means that the detections the application does report are usually correct, but it misses a lot of objects. This gives an idea of which parts of the system to adjust to improve these results.
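To make the reading of those two numbers concrete, here is a minimal Python sketch of how precision and recall follow from detection counts. The counts used are hypothetical, chosen only so that the output resembles the figures reported above; they are not the real counts from MOT17-02.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw detection counts."""
    reported = true_pos + false_pos   # everything the system reported
    actual = true_pos + false_neg     # everything present in the ground truth
    precision = true_pos / reported if reported else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts (NOT measured values): few reports, most of them correct,
# but the large majority of ground truth objects are never reported.
p, r = precision_recall(true_pos=97, false_pos=23, false_neg=903)
print(f"precision={p:.2f} recall={r:.3f}")
```

High precision with low recall points at missed objects (false negatives) rather than wrong detections, which is why the detection confidence threshold and the tracker checks are natural places to look first.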

Work in progress: ID assignment and final results on MOT17Det. Future work: check the TrackingNet dataset, which provides an evaluation tool in Python.

Regarding the dissertation, the main chapters were defined, a first draft of the Introduction (chapter 1) was written, and the State of the Art section (chapter 2) was cleaned up. The format was changed to a more appropriate template. The latest version can be seen at

Week 26: Tracker confidence, new configuration options, logger improved

This week I introduced confidence thresholding into the Tracker, so the trackers are now checked against a value (in the case of dlib tracking) or a boolean (in the case of OpenCV tracking). With this I intend to improve the results obtained, and it also allows some tuning of the tracker. Apart from that, new configuration options were added to objecttracker.yml. The confidence threshold for the Network object detections can now be selected (a number between 0 and 1), and the image input size can be changed. The latter is a little more "dangerous" because some models cannot work with different input sizes while others can. This change affects the FPS of the application in both Net and Tracker, so it is useful to be able to change this size. Here is an example of the current configuration file:

  Source: Video # Local (local camera), Video (local file), Stream (ROS/ICE)

    DeviceNo: 1 # Device number for the desired webcam

    Path: "/media/alexandre/Data/Documents/Alexandre2R/MOVA/TFM/video/MOT17Det/sequences/MOT17-11.mp4"

    Server: ROS # "ROS", "ICE", "Deactivate"
    Proxy: "cameraA:tcp -h localhost -p 9999"
    Format: RGB8
    Topic: "/usb_cam/image_raw"
    Name: cameraA

    Framework: TensorFlow  # Currently supported: "Keras" or "TensorFlow"
    #Model: VGG_coco_SSD_512x512_iter_360000.h5
    Model: frozen_inference_graph.pb
    Dataset: COCO  # available: VOC, COCO, KITTI, OID, PET
    InputSize: [700,700]  # only some models allow changing this size
    Confidence: 0.6  # confidence threshold for detections

    Lib: OpenCV  # Currently supported: "OpenCV" or "dlib"
    Type: MOSSE  # available (with OpenCV as Lib): KCF, BOOSTING, MIL, TLD, MEDIANFLOW, CSRT, MOSSE

    Status: on  # turn on/off the logging of the results: "on" or "off"

  NodeName: dl-objecttracker

The logger is now optional: the user can switch it on or off in the .yml configuration file. The average FPS of Net and Tracker is now logged too, in addition to the detections that were logged in previous versions. Logging is now cancelled automatically if frames are skipped while processing a video because of slow tracking (which throws away frames from the circular buffer). This avoids problems when computing metrics between the detections and the ground truth of the datasets.
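As an illustration of the two confidence checks introduced this week, the following Python sketch filters detections by score and decides whether a tracker update can be trusted. It is not the code of the application: the dictionary layout, the function names and the `min_score` default are assumptions made for the example. It only relies on the fact that a dlib correlation tracker update yields a numeric score, while an OpenCV tracker update yields a success flag together with the box.

```python
def filter_detections(detections, min_confidence=0.6):
    """Keep only detections whose score reaches the configured threshold."""
    return [d for d in detections if d["score"] >= min_confidence]


def tracker_is_reliable(lib, update_result, min_score=7.0):
    """Check one tracker update; min_score is a hypothetical tuning value."""
    if lib == "dlib":
        # dlib: the update result is a numeric confidence score
        return update_result >= min_score
    if lib == "OpenCV":
        # OpenCV: the update result is a (success flag, bounding box) pair
        ok, _box = update_result
        return bool(ok)
    raise ValueError("unsupported tracking library: " + lib)


# Hypothetical detections, in a layout assumed only for this example
detections = [
    {"label": "person", "score": 0.91, "box": (130, 148, 170, 340)},
    {"label": "person", "score": 0.42, "box": (300, 120, 350, 310)},
]
kept = filter_detections(detections, min_confidence=0.6)
print([d["score"] for d in kept])
```

Raising the detection threshold trades recall for precision, so the value in the configuration file is best tuned together with the metrics.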

Work in progress: ID assignment and results evaluation.

Week 25: First metrics obtained

After introducing ROS I started working on getting the first results from the application using datasets. To do this some hotfixes were made, including logging the first and last frames of a processed video (which was failing before) and rescaling the logged detections according to the resize applied to the image. The MOT dataset ground truth was parsed and some tests were performed using train sequences from this dataset. The results obtained in terms of mAP were very poor, so I suspect a problem in how I am using the metrics tool, or a mistake in how I created the detection/ground truth files.
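The rescaling step mentioned above can be sketched as follows. This is an assumed, simplified version (the function name and the box layout are not taken from the application): a box detected on the resized network input is mapped back to the coordinates of the original frame, so that it can be compared with the ground truth.

```python
def rescale_box(box, resized_size, original_size):
    """Map (left, top, right, bottom) from the resized image to the original one."""
    left, top, right, bottom = box
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    return (
        round(left * original_w / resized_w),
        round(top * original_h / resized_h),
        round(right * original_w / resized_w),
        round(bottom * original_h / resized_h),
    )


# Example: network input of 700x700, original frame of 1920x1080
print(rescale_box((70, 35, 350, 700), (700, 700), (1920, 1080)))
```

If this mapping is skipped or applied with the width and height swapped, the logged boxes no longer overlap the ground truth, which is one possible source of a very low mAP.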

Furthermore I continued to work on the writing of the dissertation introducing the state-of-the-art in tracking datasets.

Week 24: Introducing ROS

The main task this week was introducing ROS as an image source in the application. Now we can read from video and local camera using OpenCV, and from a local stream using the ROS usb_cam driver.

To use it you need to have the usb_cam driver installed. For the installation I followed a question on the ROS forum. After installing the driver, connect a V4L-compatible USB camera (any normal webcam should work) and launch the usb_cam node in a terminal with:

roslaunch usb_cam.launch

The usb_cam.launch file is available at

When launched you may see some warnings like:

[ WARN] [1556133267.311291656]: sh: 1: v4l2-ctl: not found

Solved with:

sudo apt-get install v4l-utils

Or some errors like:

VIDIOC_S_FMT error 5, Input/output error

Solved by disconnecting and reconnecting the USB camera. Most of these problems are already answered in the ROS forums if you need any help.

Once the node is running successfully you should be able to see the image topic of the camera, /usb_cam/image_raw, by typing:

rostopic list

After that, modify objecttracker.yml and run the objecttracker. You will see the ObjectTracker running with your camera through ROS.

Week 23: Fixing bugs, datasets parsing

This week some important bugs were fixed. First, logging from the TensorFlow networks now works correctly; in previous versions only logging from Keras networks worked. Also, the first and last frames from a local video source are now processed and logged. An initial version of the ground truth converters is available, which includes the OTB and NFS datasets partially parsed. The parsing is partial because the classes in those datasets do not always match the classes known to the neural network (COCO, VOC datasets...), so for the moment the class is hardcoded. Parsing the MOT dataset requires a mechanism for ID assignment, which is in progress. The ROS image source is also in progress.

Week 22: Starting to extract results, demo video, new user options

These last two weeks were mainly used to introduce a way to start obtaining statistics from the application. The results from both the Net (neural network detections) and the Tracker are now logged into .yaml files. Each entry in the file has the following structure:

 - 60   <-- frame number
 - person  <-- class
 - 0.9  <-- confidence
 - - 130  <-- left
   - 148  <-- top
 - - 170  <-- right
   - 340  <-- bottom

This log of the results is done online at the end of the application execution. To get statistics the idea is to use the Pascal VOC performance measurements: precision, recall, precision x recall curve and AP (average precision). This repo provides the metrics already calculated and you only need to adapt the format of your results and the ground truth to the format used.

Format used in the detection files (your results):

bottle 0.14981 80 1 295 500  
bus 0.12601 36 13 404 316  
horse 0.12526 430 117 500 307  
pottedplant 0.14585 212 78 292 118  
tvmonitor 0.070565 388 89 500 196 

Format used in the ground truth files:

bottle 6 234 39 128
person 1 156 102 180
person 36 111 162 305
person 91 42 247 458 

The results in the .yaml files are easily converted to the required format using a bash file (plus a Python script) made specifically for the offline format conversion. The repo also provides other formats, but I found this one the most readable and the easiest to implement in my application. In terms of datasets, some research was done to find out which are the state-of-the-art datasets in multi-object tracking. Some of the most significant are MOT, VOT, OTB, PETS and NFS. Most of these datasets, and many more, can be downloaded easily. The next step is to convert the ground truth format of the datasets used to the one required by the metrics tool in order to obtain the statistics.
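The conversion between the two formats shown above can be sketched in a few lines of Python. This is not the actual conversion script: it assumes that each parsed log entry arrives as a list of the form [frame, class, confidence, [left, top], [right, bottom]], as suggested by the .yaml structure, and the function names are hypothetical.

```python
from collections import defaultdict


def entry_to_detection_line(entry):
    """Turn one log entry into a 'class confidence left top right bottom' line."""
    _frame, label, confidence, (left, top), (right, bottom) = entry
    return f"{label} {confidence} {left} {top} {right} {bottom}"


def group_lines_by_frame(entries):
    """Collect the detection lines of each frame (assumed: one output file per frame)."""
    frames = defaultdict(list)
    for entry in entries:
        frames[entry[0]].append(entry_to_detection_line(entry))
    return dict(frames)


entry = [60, "person", 0.9, [130, 148], [170, 340]]
print(entry_to_detection_line(entry))  # person 0.9 130 148 170 340
```

The ground truth lines follow the same pattern without the confidence field.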

Also, objecttracker.yml was modified (and the code too) to allow the user to choose between OpenCV and dlib tracking. When using OpenCV tracking, the user can also select which tracker to use from KCF, BOOSTING, MIL, TLD, MEDIANFLOW, CSRT and MOSSE (the defaults are OpenCV tracking and MOSSE).

The next video provides an idea of the current state of the dl-objecttracker:

-Neural network detections: mask_rcnn_inception_v2_coco_2018_01_28
-Tracking: MOSSE OpenCV tracker

With respect to the different sources the application includes local video (as before) and live video from OpenCV local camera. I am working on introducing video using ROS.

Week 21: New trackers and some bugs solved

This week was dedicated to introducing the MOSSE and CSRT trackers included in recent OpenCV versions; the dlib tracker was also tested. To use the new OpenCV trackers I installed from source the latest OpenCV version available at the moment (4.0.1) to work alongside the JdeRobot environment. Both MOSSE and CSRT are more accurate than the trackers tested previously. MOSSE is extremely fast but not as accurate as CSRT. The dlib tracker also tested very well in terms of accuracy, but its speed seems to drop as the number of objects to track increases. For the moment, the chosen tracker is MOSSE.

Apart from that, some bugs in the bounding box coordinates were fixed. The GUI-off mode was also improved by tagging each image with its frame number. When using a local video the last frames of the video in the buffer were not being processed, so this was solved.

The next steps include a way to extract some statistics or performance measurements from the application: IoU (intersection over union) of detection and tracking in datasets, speeds in FPS... with the different configurations.
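For reference, the IoU measure mentioned above can be computed as in the following Python sketch. It assumes boxes given as (left, top, right, bottom) in pixels, which matches the order used in the logged results but is otherwise an assumption of the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when the boxes are disjoint
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```

A detection is then usually counted as a true positive when its IoU with a ground truth box exceeds a chosen threshold, for example 0.5 in the Pascal VOC measurements.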

Week 20: Revisiting the previous work on the project report

The work done in Week 4 was reviewed to check the possible updates needed according to the State of the Art. I also had a look over the structure of the final report to continue its writing in parallel with the last changes in the application.

The MOSSE and CSRT trackers could not be tested inside the application because of the OpenCV version currently available in JdeRobot, but a tracker using the dlib library is going to be reviewed and hopefully tested. dlib is an open-source library containing machine learning algorithms and tools, commonly used in both industry and academia.

Week 19: Keras models

This week the Keras network models were finally introduced into the application. The supported models include SSD_300x300 and SSD_512x512 architectures. The next step is going to be the test of new trackers.

Week 18: Offline mode and close

Last week I solved some small pending tasks. I refactored the offline mode to adapt it to the new architecture of the Network side. This mode worked at the very beginning of the project, giving the user the option of running the program without the GUI. This way, you can save the results of the application as .jpg images. Apart from that, the GUI can now be fully closed by clicking the Close button. In previous versions, the program kept running in the background and had to be closed from the terminal.

The component was renamed from dl_objectsegmentator to dl_objecttracker. The latest version is available at

Week 17: Using dl-objectdetector

With the previous stable version of the tracker working fine, I started to work on supporting a larger number of neural networks for object detection using dl-objectdetector from JdeRobot. For that purpose, I tested the component using both Keras and TensorFlow models. After that I integrated the component into dl-objecttracker. At the moment it only uses TensorFlow models, but a Keras version will be ready soon. I tested some models from the TensorFlow detection model zoo, such as SSD and Mask R-CNN.

Apart from that, new sources are going to be added to feed the program: local camera, local video and stream (ROS/ICE). For now, the program is tested with local video because there are still some bugs in the others that need to be fixed.

Week 16: Refinement of synchronism and tracker

These last weeks I have been working on a stable tracker that gives the application better synchronization between the different branches (Camera, Net, GUI) and the tracking itself. For that purpose, the internal logic of the tracker was modified so that the tracking works more flexibly with the buffer it takes as input. The result is a tracker with 3 modes: slow, normal and fast (depending on the average tracking FPS over the previous frames). For example, if the tracking is running slow, a number of frames in the buffer are skipped to keep the buffer from growing more than expected. And if the tracking is running fast, the tracker slows down so that the tracking does not finish before the neural network gives a result.
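The three modes can be pictured with the following Python sketch. It is only an illustration of the idea: the FPS limits, the target buffer length and the function names are hypothetical values chosen for the example, not the ones used in the application.

```python
def tracker_mode(recent_fps, slow_below=10.0, fast_above=25.0):
    """Classify the tracker as 'slow', 'normal' or 'fast' from its recent FPS values."""
    if not recent_fps:
        return "normal"
    average = sum(recent_fps) / len(recent_fps)
    if average < slow_below:
        return "slow"
    if average > fast_above:
        return "fast"
    return "normal"


def frames_to_skip(mode, buffer_length, target_length=5):
    """In slow mode, skip enough frames to bring the buffer back to the target length."""
    if mode == "slow":
        return max(0, buffer_length - target_length)
    return 0


mode = tracker_mode([6.0, 7.5, 8.0])
print(mode, frames_to_skip(mode, buffer_length=12))
```

In fast mode no frames are skipped; instead the tracker would wait (for example with a short sleep per frame) so that it does not run ahead of the neural network.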

Regarding the multi-object tracking problem (tracking multiple objects lowers the FPS rate), I looked for new trackers following posts from LearnOpenCV and pyimagesearch. I am currently working with the TLD tracker, but it has some problems with false positives. However, it is the best option available for my purpose in version 3.3.1 of OpenCV (included in JdeRobot). One of the next steps is to test the MOSSE and CSRT trackers, which are available in more recent OpenCV versions and look promising, especially MOSSE because of the speed requirements.

At the same time, I am going to include the JdeRobot object detector to make use of other networks apart from the current Mask R-CNN.

Week 15: Camera buffer and tracker updates

Now the GUI uses the images coming from the Camera buffer directly instead of using its own buffer. The tracker has a mechanism to keep the tracking process from running much faster than the neural network detections.

Week 14: Net result bug in GUI fixed

The neural network result in the GUI sometimes showed the image segmented by Mask R-CNN together with a bounding box from the tracker; this was fixed. The next necessary improvement is to move the buffer completely to Cam, without keeping it in the GUI branch too.

Other future tasks include the incorporation of the DetectionSuite and ObjectDetector components from JdeRobot.

Week 13: Circular buffer (first version) and buffer in Cam

Once the first version of the buffer with delay is running, the next necessary step is to implement a circular buffer to keep this buffer from growing. But first, the buffer with delay (and also the instructions which control the different branches) was moved to the Cam branch, to allow the application to run without the GUI. At the moment, this GUI-off option saves the results that were displayed in the 'Combined' window as .jpg files. Now, to execute the main program you need to type the following in the terminal, where the last argument (off / on) selects whether the GUI starts:

python2 objectsegmentator.yml off

The circular buffer controls the buffer size, which tends to grow because of changes in the tracker speed (FPS). For this reason, the first version of this tracker needs to handle two main situations: tracker fast and tracker slow (measured in FPS). In the first case, the old frames in the buffer are discarded for the next tracking. In the second case, some frames are skipped. With these changes, the tracking and the segmentation stay closer to the real frames captured by the camera and, as I said before, the buffer does not grow without control.
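A bounded frame buffer of this kind can be sketched in Python with `collections.deque`, whose `maxlen` argument silently drops the oldest element when a new one is appended to a full buffer. The class below is an assumed, minimal illustration (its name and methods are hypothetical), not the buffer implemented in the application.

```python
from collections import deque


class FrameBuffer:
    """Fixed-size buffer of (frame_number, image) pairs; oldest frames drop first."""

    def __init__(self, max_frames):
        self._frames = deque(maxlen=max_frames)

    def push(self, frame_number, image):
        # When the buffer is full, deque discards the oldest frame automatically
        self._frames.append((frame_number, image))

    def pop_next(self, skip=0):
        """Discard up to `skip` old frames (keeping at least one), then return the oldest."""
        for _ in range(min(skip, max(len(self._frames) - 1, 0))):
            self._frames.popleft()
        return self._frames.popleft() if self._frames else None

    def __len__(self):
        return len(self._frames)


buffer = FrameBuffer(max_frames=3)
for number in range(5):
    buffer.push(number, image=None)  # image left empty in this sketch
print(len(buffer), buffer.pop_next(skip=1))
```

Because the size is bounded, a slow tracker can never fall more than `max_frames` behind the camera, at the cost of some frames never being tracked.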

But, as usually happens, this first version has some bugs that need to be fixed. For example, sometimes old frames are still 'alive' and it takes the program some time to catch up with the latest frames in the buffer.

This buffer upgrade is not available without GUI for the moment.

Week 12: Tracker fixes and GUI changes

This week I started by solving some problems in the tracker that affected the flow of the application (some more still need to be fixed). I also had a look at the different types of tracker implemented in OpenCV (at the moment I am using TLD). For this purpose I used the LearnOpenCV website of Satya Mallick and the OpenCV documentation. I tested all the other trackers mentioned in the post (including GOTURN, which uses deep learning with an offline-trained model) but I found that the current one is the best for this purpose.

Furthermore, some bugs were fixed related to the buttons in the GUI and their behavior in the new buffer architecture (as with the tracker, I guess some more still need to be fixed). I included a new GUI setup with 4 images: the live input video, the combined result from tracker and neural network, and the separate results of those two. The images are now tagged with the frame number for a better understanding of the application (and some debugging too).

The next image, available at the link, shows the current state of the application:

Week 11: First prototype of the delay buffer

This week I implemented a first prototype of the delay buffer, fixing some bugs of the previous version, but it still has some small failures pending. On the other hand, the Docker container has not yet been configured to allow graphic sessions, so work in it is paused for the moment. The next steps include moving the buffer to the Cam branch, a circular buffer and an improved visualization (some GUI changes).

Week 10: Continuing with the improvements

These last 3 weeks I have been working on two different approaches to improve the behaviour of the application. On the one hand, I built a first prototype of the application using a buffer with delay. This buffer makes it possible to show all the detections and segmentations in the GUI, with a delay given by the length of the buffer at each moment, and this way all frames injected into the Net are the latest frames (this did not happen before). The buffer currently has some small bugs that need to be fixed. Another proposal was a double-buffer technique, but it is not implemented yet.

On the other hand, I got access to a GPU in a Docker container (thanks to Francisco Rivas) running CUDA, where all the necessary packages were installed (TensorFlow, Keras, ...), and I launched the program without errors. The performance has not been measured yet because some features still need to be installed in the Docker container. Here the program will read from a recorded video (using the cameraserver) instead of the webcam, because the hardware used has no real camera.

Some issues were fixed to allow the program to download the COCO trained weights of the model of the Mask R-CNN in the case they were not already downloaded.

Week 9: Mask R-CNN improvements

This week I tested Mask R-CNN with different image sizes to reduce the execution time of a segmentation. The minimum size allowed by the net that still obtains results and maintains the original aspect ratio is 540x404. The execution time now ranges between 23 and 25 seconds (using CPU).

I also improved the architecture of the program, which now has 4 branches running independently: Camera, GUI, Net and Tracker. The temporary tracker uses a multitracker implemented in OpenCV. The continuous mode currently works with the detections (and segmentations) given by Mask R-CNN (after a considerable amount of time) followed by tracking-by-detection. In future improvements I expect to use a GPU to accelerate the detections.

In this video, the current functionality of the component can be seen:

Week 8: Optimization step

Once the Mask R-CNN is running the next necessary step is to optimize it with the objective of being able to perform activities in real-time conditions. For this purpose, we thought about two possible solutions.

The first one was to use GPU support, which could reduce the execution time considerably. I tried to use the GPU that Google provides for free in Google Colab, but I have been having problems installing the necessary dependencies of the project, so I will try the GPU available at the JdeRobot lab.

The second improvement could come from incorporating a feature-tracking algorithm into the application; this has not been implemented yet. Furthermore, I measured the execution times of the application with different input images. I observed the performance with one or more objects and with a bigger or smaller area to be segmented. The conclusion is that the influence of these parameters on the final execution time is not really important. After that I tried smaller input images, but the network model seems to have problems with some image sizes.

Week 7: Run continuous mode added

This week I added the 'Run continuous' mode to the application, which allows the user to segment the video stream from the webcam in 'real-time'. This real-time is obviously conditioned by the time the computer needs to process an image using Mask R-CNN, which in my case is about 25 seconds (using CPU). The following image shows one of the results achieved with this new mode:

Week 6: Starting to build the object segmentator component

With the objective of building a visual memory for a robot, the first required step is to develop an object-segmentator component running in real-time on a video stream. It will be built with a structure of 3 branches: Camera, GUI and Net. The first approach will be an application that takes video from the camera server component, with two toggle buttons that let the user choose between passing the net a single frame or a continuous sequence of frames from the camera. For this purpose, I reuse parts of existing code, thanks to Nacho.

This week I got the Mask R-CNN model working in real time with a single frame from the camera. So far the net has recognized objects like 'person', 'cell-phone', 'bed', 'apple' and more without problems. The following video shows the application working (this is an early implementation; the Net is run on a laptop without GPU and with a poor camera, so the results could be improved).

Week 5: Mask R-CNN code review and test

The current task is to understand the Mask R-CNN code implementation and test it on my personal laptop. To do so, I started by running a demo that uses a model pre-trained on MS COCO to segment objects in your own images. The first step to run the demo is to clone the Mask R-CNN repository. This demo has the following main installation requirements: pycocotools and Keras with the TensorFlow backend. To install pycocotools (with Python 3) you need to type these instructions in a terminal:

git clone
cd coco/PythonAPI
sudo make install
sudo python3 install

and then append your COCO local folder to the system path (example):


The next image shows one of the results that can be achieved when you run this demo on test images (from the folder 'images' of the Mask R-CNN repository):

Week 4: State of the Art work

During the Christmas holidays I wrote a small part of the TFM report, which includes the introduction and State of the Art / Related Works sections. In this document I talk about robotics in Computer Vision, neural networks, and methods of tracking, detection and instance segmentation like Mask R-CNN, among other topics. The document is available at my GitHub repository

Week 3

This week I ran David's code once I had the Keras model to feed the classifier, as can be seen in the following video:

After that I studied the code and looked at the implementation of the digit classifier. The project follows this design, from a high-level and a low-level point of view (images are from David's project):

Furthermore I finished the testing of Marcos' project and I launched it successfully. The next video shows the tracking performance over a set of frames from MOT16-04:

This project uses a hybrid tracking approach with a neural network-based tracking and a feature-based tracking. The first one gives the system better detections but it is not able to work in real-time, so it returns detections every 30 frames. Meanwhile, the feature tracking component, which is able to work in real-time, computes the tracking between the frames. The process is shown in the next figure (image from Marcos' project):

I also read an article on a recent object detection and segmentation technique called Mask R-CNN. A Python implementation of the paper is also available. This framework efficiently detects objects in images and generates high-quality segmentation masks for the instances. It can easily be generalized to other tasks such as person keypoint detection. The method extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the branch for bounding box recognition. This way, it decouples mask and class prediction, which allows better performance. The authors introduce a layer called RoIAlign to fix the pixel-to-pixel misalignment between network inputs and outputs in Faster R-CNN, thereby preserving spatial locations.

Mask R-CNN outperforms existing state-of-the-art techniques on the COCO suite of challenges. It also gives better results in instance segmentation tasks on the Cityscapes dataset. The next images show some of these results in object segmentation (left) and keypoint detection for human pose (right):

Week 2: Code review and testing

This second week, the proposed task was to run and study the code of the Final Master Project of Marcos Pieras and the Final Grade Project of David Pascual. The code is available at their GitHub repositories.

Once you have downloaded the repositories, you need to open a new terminal and type:

-Marcos' project:


-David's project:

cameraserver cameraserver.cfg
python digitclassifier.cfg 

But, as expected, it is not so easy. First, I had to install all the necessary dependencies for each project, which included Keras with the TensorFlow and Theano backends. After that I set up OpenCV and the JdeRobot tools to work properly together.

To use Marcos' project you also need to download the additional content that allows you to use the SSD VGG 300 net and other tools used in the project. Besides, you need a dataset to test the detection project, for example the MOT16-14 dataset, and the checkpoint used in the project. I had some problems loading the checkpoint but I solved them using the information provided in

After this configuration, David's project should work fine, but the Keras model it uses (net_4conv_patience5.h5) was not available in his repository, so I asked him personally and he is going to update the repository soon.

I could not finish the complete task yet but I hope to do so soon. I also appreciate the friendly help of Marcos and David in this process :)

Week 1: Getting started

In this first week, I started by installing the JdeRobot environment on my laptop following the provided installation steps. After that, I checked that my installation was working by trying some examples from the Documentation section, for example the OpenCV demo. If you want to use this example, you have to open two terminals and type one of the following lines in each:

cameraserver cameraserver.cfg
opencvdemo opencvdemo.cfg

Also, I have read some of the previous work from colleagues, such as the Final Grade Projects of Nuria Oyaga ("Análisis de Aprendizaje Profundo con la plataforma Caffe") and David Pascual ("Study of Convolutional Neural Networks using Keras Framework") and the Final Master Project of Marcos Pieras ("Visual people tracking with deep learning detection and feature tracking"). These works gave me an introduction to the basic concepts of Deep Learning. Besides that, they show some of the State of the Art in detection, tracking and classification using Deep Learning techniques.