Davidbutra-tfg

From jderobot
Jump to: navigation, search

Project Card[edit]

Project Name: Deep Learning on RGBD sensors

Author: David Butragueño Palomar [david.butra23@gmail.com]

Academic Year: 2015/2016

Degree: Degree in Audiovisual Systems and Multimedia Engineering

GitHub Repositories: TFG-DavidButragueño

Tags: Deep Learning, Caffe, Python, JdeRobot, SSD

State: Developing

Comparing database accuracy[edit]

In the section Jaccard Index, I explained the script whose output is the Jaccard index, which measures the detection precision of a network. In the previous section, I checked the accuracy of the network trained with the VOC2012 database. In order to compare both databases, I ran the same script, but this time using the network trained with the COCO database. The results are shown in the following table:

Jaccard Index Average (COCO)

Class          Times detected   Average Jaccard index
AEROPLANE      152              0.854206217564
BICYCLE        109              0.838051438704
BIRD           188              0.827536403584
BOAT           93               0.832384798461
BOTTLE         85               0.792206010244
BUS            83               0.885377182
CAR            386              0.787868869306
CAT            231              0.872403110331
CHAIR          211              0.797857251987
COW            80               0.847126083675
DININGTABLE    69               0.802892001534
DOG            246              0.84336351488
HORSE          174              0.844410658171
MOTORBIKE      148              0.765441662221
PERSON         2576             0.714477792895
POTTEDPLANT    49               0.463115709356
SHEEP          77               0.812879832216
SOFA           90               0.748775322915
TRAIN          93               0.863239923723
TVMONITOR      166              0.722339485916

The following graph shows the results using both databases (VOC2012 and COCO):


The following graph shows how many times each object has been detected.



Jaccard Index[edit]

Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient (originally coined coefficient de communauté by Paul Jaccard), is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.
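
In terms of bounding boxes, the index is the area of overlap between the two boxes divided by the area of their union. A minimal sketch of this computation (illustrative names, not necessarily the exact code used in jaccard_index.py):

# Jaccard index (Intersection over Union) between two bounding boxes.
# Boxes are given as (xmin, ymin, xmax, ymax).
def jaccard_index(box_a, box_b):
    # Intersection rectangle
    inter_xmin = max(box_a[0], box_b[0])
    inter_ymin = max(box_a[1], box_b[1])
    inter_xmax = min(box_a[2], box_b[2])
    inter_ymax = min(box_a[3], box_b[3])
    inter = max(0, inter_xmax - inter_xmin) * max(0, inter_ymax - inter_ymin)
    # Areas of both boxes and of their union
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return float(inter) / union if union > 0 else 0.0

# Example: ground truth box vs. predicted box
print jaccard_index((50, 50, 200, 200), (60, 55, 210, 190))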

This coefficient can be used to measure the accuracy of the detector by comparing the ground truth bounding box of an object with the predicted one. First, I created the script image_detection_compare.py, which takes an input image and returns the same image with both the ground truth and the predicted bounding boxes drawn on it. The following image shows an example of the output of this script:



After that, in order to measure the accuracy of the detector, I created the script jaccard_index.py, which takes input images, detects the objects in them and calculates the Jaccard index for each object. It then generates two files: the first with the Jaccard index of each object in each input image, and the second with the average Jaccard index for each detected class. As input I used the first 4000 training images of the VOC2012 database. The following table shows the average Jaccard index for each class, which is the content of one of the two output files.

Jaccard Index Average (VOC2012)

Class          Times detected   Average Jaccard index
AEROPLANE      217              0.888294500116
BICYCLE        174              0.850015789784
BIRD           320              0.867387618049
BOAT           230              0.836357973071
BOTTLE         173              0.783992867619
BUS            106              0.898339001104
CAR            541              0.843371301036
CAT            314              0.913944427625
CHAIR          491              0.824212275621
COW            117              0.864286728992
DININGTABLE    138              0.813477535819
DOG            379              0.886535399874
HORSE          219              0.867706590844
MOTORBIKE      204              0.849772278624
PERSON         3292             0.798371879907
POTTEDPLANT    206              0.771373236023
SHEEP          111              0.857312412318
SOFA           163              0.790956777554
TRAIN          122              0.894330957063
TVMONITOR      233              0.802044018707

The output file, which shows the Jaccard index of each object in each input image, is shown here.


Conditional Detection[edit]

Eventually, a possible aim may be to detect only certain objects instead of all the objects for which the network's training database is prepared (labelmap.prototxt). For this reason, I made some changes to create a new label map depending on the objects you wish to identify. Note that if the trained network is not prepared to detect a specific object (because the training database does not have annotations for it), that object should not appear in the label map, since it will obviously never be detected.

Changes in Script[edit]

I created the script image_detection_conditional.py, which is a modification of image_detection.py. First of all, I created an array of strings where you can specify the objects you are looking for. For example, if you just want to detect dogs:

labels_conditional = ["dog"]

After that, I created the function get_arraylabels_conditional, which receives the array described above and looks in the chosen label map for the labels that appear in the array. This function returns an array with as many positions as there are objects in the label map, with the value 0 if the label map entry was not found in labels_conditional and 1 if it was.

def get_arraylabels_conditional(self, labelmap, array_labels): 

    array_labels_YesNo = [] 

    # For every entry of the label map, mark 1 if its display name
    # appears in the requested labels, 0 otherwise.
    for x in range(0, len(labelmap.item)): 
        found = False 
        for i in range(0, len(array_labels)): 
            if labelmap.item[x].display_name == array_labels[i]: 
                array_labels_YesNo.append(1) 
                found = True 
                break 
        if not found: 
            array_labels_YesNo.append(0) 

    return array_labels_YesNo 

Then, I used the returned array to create a new label map file containing only the labels found. For this purpose, I created the function get_labelmap_conditional. This function returns three arrays with the name, label and display name of each object, which are used to create the new label map.

def get_labelmap_conditional(self,labelmap, array_conditional): 

    names = [] 
    labels = [] 
    display = [] 
    numlabels = len(labelmap.item) 


    for i in xrange(0, numlabels): 
        if array_conditional[i] == 1: 
            names.append(labelmap.item[i].name) 
            labels.append(labelmap.item[i].label) 
            display.append(labelmap.item[i].display_name) 


    return names, labels, display

After that, the detection is performed as usual and then the function delete_labels removes the detected objects that do not appear in the new label map.

def delete_labels(self,labelmap, labels, top_indices): 

    labels_conditional = [] 
    top_indices_conditional = [] 
    for i in range(0, len(labels)): 
        for x in range(0, len(labelmap.item)): 
            if labels[i] == labelmap.item[x].label: 
                labels_conditional.append(labelmap.item[x].label) 
                top_indices_conditional.append(top_indices[i]) 
                break 


    return labels_conditional, top_indices_conditional

Finally, I used the image shown in section Studying used databases as input to test the new script, using the weights obtained by training with the VOC2012 database. First, I ran image_detection.py, which uses the original label map (20 classes) of the VOC2012 database. With this script, the output image shows 3 detections (only detections with confidence greater than 0.6 are shown):

After that, I ran image_detection_conditional.py, which uses a conditional label map. I made two tests, one to detect only dogs and another to detect only motorbikes. The following images show the output of both tests.


Changes in Component[edit]

I added one variable to the configuration file where it is possible to specify the objects that you want to detect:

DetectorSSD.Camera.Proxy=cameraA:default -h localhost -p 9999 
#Write all the objects that you want to detect, separated by spaces. 
DetectorSSD.Labels=person dog

After that, I created the script camera_conditional.py, similar to camera.py, which uses the variable DetectorSSD.Labels from the configuration file. Then I followed the same process as above, in section Changes in Script, using the same functions.
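
A minimal sketch of how camera_conditional.py can read this variable, assuming the component creates an Ice communicator and is launched with --Ice.Config pointing to the configuration file (the variable names here are illustrative):

import sys
import Ice

# The DetectorSSD.* properties become available through the communicator
# when the component is started with --Ice.Config=<configuration file>.
ic = Ice.initialize(sys.argv)
prop = ic.getProperties()

# Read the space-separated list of objects to detect, e.g. "person dog"
labels_conditional = prop.getProperty("DetectorSSD.Labels").split()
print labels_conditional    # ['person', 'dog']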

Including detection network in the component[edit]

I modified the component explained in section "Using my camera" in order to adapt it to a detection network. First of all, I changed the GUI appearance. For detection there is no need for a window with the transformed image or for the 10 small windows used for number classification; only one window is necessary, showing the detected objects in real time. For this reason, I changed the script gui.py to show only one window. I also added a button that starts or stops the detection when pressed. I used the following code:

# BUTTON 
self.button = QtGui.QPushButton('Pulsa para deteccion', self) 
self.button.clicked.connect(self.handleButton) 

# MAIN IMAGE 
self.imgPrincipal = QtGui.QLabel(self) 
self.imgPrincipal.resize(640,480) 
self.imgPrincipal.move(150,50) 
self.imgPrincipal.show() 

# BOX layout configuration 
vbox = QtGui.QVBoxLayout() 
vbox.addWidget(self.imgPrincipal) 
vbox.addWidget(self.button) 
self.setLayout(vbox)  

Now, the function getImage does the object detection, returning the image with bounding boxes drawn over the detected objects. The variable handleButtonON is a boolean whose value is toggled when the button is pressed.

def getImage(self):
     
    if self.camera: 
        self.lock.acquire() 
        image = np.zeros((self.height, self.width, 3), np.uint8) 
        image = np.frombuffer(self.image.pixelData, dtype=np.uint8) 
        image.shape = self.height, self.width, 3 
        if self.handleButtonON: 
            image = self.detectiontest(image) 
        self.lock.release() 

    return image 

The returned image is displayed on the screen using the following code.

 
img_out = QtGui.QImage(image.data, image.shape[1], image.shape[0], QtGui.QImage.Format_RGB888) 
scaledImageOut = img_out.scaled(self.imgPrincipal.size()) 
self.imgPrincipal.setPixmap(QtGui.QPixmap.fromImage(scaledImageOut)) 


The following video shows an example of this component running.


Detection with SSD[edit]

Now I deal with the detection problem, using the SSD tool provided with Caffe. Single Shot MultiBox Detector (SSD) is a unified framework for object detection with a single network, which allows both training detection models for different objects (depending on the database you use) and using already trained models.

Studying used databases[edit]

I used models trained with two different databases, which are described below:


  • VOC2012-SSD300x300
  • This database contains 11,530 training and validation images representing 27,450 different objects distributed in 20 classes. The training data consists of a set of images; each image has an annotation file giving a bounding box and an object class label for each object of the twenty classes present in the image. Therefore, the detection task consists in predicting the bounding box and label of each object of the twenty target classes in the test image. The following image shows an example of this database:



    The annotation file can be found here. This file contains the bounding box limits for each of the objects that can be detected in this image: in this case, motorbike, dog, person and chair.

    It is important to know how many images and objects there are for each of the 20 classes, in order to know whether the database is balanced or whether, on the contrary, some objects appear much more often than others. In the VOC2012 database, the distributions of images and objects by class are approximately equal across the training/validation and test sets, and they can be found here. Specifically, the distribution of objects for the detection task is shown in the following image.


    Here, you can see example images containing at least one instance of each object category.


  • COCO-SSD300x300
  • COCO is a large-scale object detection, segmentation, and captioning dataset. This database has several features:
    • Object segmentation
    • Recognition in context
    • Superpixel stuff segmentation
    • 330K images (>200K labeled)
    • 1.5 million object instances
    • 80 object categories
    • 91 stuff categories
    • 5 captions per image
    • 250,000 people with keypoints


    COCO currently has three annotation types: object instances, object keypoints, and image captions. The annotations are stored using the JSON file format. All annotations share the basic data structure below:



    For the detection task, the object instance annotations are of special interest. Each instance annotation contains a series of fields, including the category id and segmentation mask of the object. The segmentation format depends on whether the instance represents a single object (iscrowd=0, in which case polygons are used) or a collection of objects (iscrowd=1, in which case RLE is used). Note that a single object (iscrowd=0) may require multiple polygons, for example if occluded. Crowd annotations (iscrowd=1) are used to label large groups of objects (e.g. a crowd of people). In addition, an enclosing bounding box is provided for each object (box coordinates are measured from the top left image corner and are 0-indexed). Finally, the categories field of the annotation structure stores the mapping of category id to category and supercategory names.
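
    As a reference, an object instance annotation looks roughly like the following when loaded into Python (the field list comes from the public COCO documentation; the values are made up for illustration):

instance_annotation = {
    "id": 1,                      # annotation id
    "image_id": 42,               # id of the image this annotation belongs to
    "category_id": 18,            # category id, mapped to a name in the "categories" field
    "iscrowd": 0,                 # 0 = single object (polygons), 1 = crowd (RLE)
    "segmentation": [[10.0, 10.0, 60.0, 10.0, 60.0, 40.0, 10.0, 40.0]],   # polygon(s)
    "area": 1500.0,               # area of the segmentation mask in pixels
    "bbox": [10.0, 10.0, 50.0, 30.0],   # [x, y, width, height] from the top-left corner
}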



    Regarding the data, the MS COCO dataset is split into two roughly equal parts. The first half of the dataset was released in 2014, the second half will be released in 2015. The 2014 release contains 82,783 training, 40,504 validation, and 40,775 testing images (approximately 1/2 train, 1/4 val, and 1/4 test). There are nearly 270k segmented people and a total of 886k segmented object instances in the 2014 train+val data alone. The cumulative 2015 release will contain a total of 165,482 train, 81,208 val, and 81,434 test images.

    The distribution of the objects in this database can be obtained from its website. In the Explore section, it is possible to choose and combine objects and observe in how many images they appear. The distribution of each of the objects in the training/validation set is shown in the following image:



Finally, in order to better understand both databases, the following image shows some of their characteristics, comparing them with other important databases.


(a) Number of annotated instances per category for MS COCO and PASCAL VOC. (b,c) Number of annotated categories and annotated instances, respectively, per image for MS COCO, ImageNet Detection, PASCAL VOC and SUN (average number of categories and instances are shown in parentheses). (d) Number of categories vs. the number of instances per category for a number of popular object recognition datasets. (e) The distribution of instance sizes for the MS COCO, ImageNet Detection, PASCAL VOC and SUN datasets.


Script for detection[edit]

After getting the trained models, I used a script that introduces an image into the network and performs the detection on it with the chosen model. For that I used an example which is available on GitHub. This script uses some files related to the network parameters, which are used in several functions.

  • labelmap_voc.prototxt: This file contains the labels that can be assigned after detection. It is used in the function get_labelname, which returns the desired label after detection.
def get_labelname(self,labelmap, labels):
    num_labels = len(labelmap.item)
    labelnames = []
    if type(labels) is not list:
        labels = [labels]
    for label in labels:
        found = False
        for i in xrange(0, num_labels):
            if label == labelmap.item[i].label:
                found = True
                labelnames.append(labelmap.item[i].display_name)
                break
        assert found == True
    return labelnames
  • deploy.prototxt: This file contains the structure of the trained model.
  • VGG_VOC0712_SSD_300x300_iter_120000.caffemodel: This file contains the weights obtained after training the network.

The last two files are used to define the network. For this purpose, I used the following function that Caffe provides:

net = caffe.Net(caffe_root + 'models/VGGNet/VOC0712/SSD_300x300/deploy.prototxt',      # Structure of the model
                caffe_root + 'models/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300_iter_120000.caffemodel',      # Trained weights
                caffe.TEST)      # Test mode


After that, I used a function for the detection, detectiontest, which takes an input image and returns an image with the detections obtained. First, this function applies some transformations to the input image in order to adapt it to the input of the trained network. To do this, the Transformer class provided by Caffe is used. Then, the transformed image is introduced into the network as follows:

self.net.blobs['data'].data[...] = transformed_image
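
The preprocessing itself uses the caffe.io.Transformer class. A minimal sketch of the usual setup (the mean values are the ones used in the public SSD example and are an assumption here; net and image come from the surrounding script, with image loaded as RGB floats in [0,1]):

import numpy as np
import caffe

# Adapt the input image to the shape and value range of the 'data' blob.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))               # HxWxC -> CxHxW
transformer.set_mean('data', np.array([104, 117, 123]))    # per-channel mean (SSD example values)
transformer.set_raw_scale('data', 255)                     # [0,1] -> [0,255]
transformer.set_channel_swap('data', (2, 1, 0))            # RGB -> BGR
transformed_image = transformer.preprocess('data', image)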

After that, the detection part begins. For that, I used a Caffe call which returns the detections obtained.

detections = self.net.forward()['detection_out']

This output is parsed into several vectors which contain the labels, confidences and bounding box coordinates.

det_label = detections[0,0,:,1] 
det_conf = detections[0,0,:,2] 
det_xmin = detections[0,0,:,3] 
det_ymin = detections[0,0,:,4] 
det_xmax = detections[0,0,:,5] 
det_ymax = detections[0,0,:,6] 

It is important not to keep all the detections obtained, since those with a low confidence value may not be accurate enough. Therefore, a filter which keeps only detections with a confidence greater than 0.6 is used.

# Get detections with confidence higher than 0.6. 
for i in range(0, len(det_conf)): 
    if (det_conf[i] >= 0.6): 
        top_indices.append(i) 

top_conf = det_conf[top_indices] 
top_label_indices = det_label[top_indices].tolist() 
top_labels = self.get_labelname(self.labelmap, top_label_indices) 
top_xmin = det_xmin[top_indices] 
top_ymin = det_ymin[top_indices] 
top_xmax = det_xmax[top_indices] 
top_ymax = det_ymax[top_indices] 

Finally, the following code includes on the input image the bounding box for each of the objects detected.

colors = plt.cm.hsv(np.linspace(0, 1, 81)).tolist() 
font = cv2.FONT_HERSHEY_SIMPLEX 

for i in range(top_conf.shape[0]): 
    xmin = int(round(top_xmin[i] * img.shape[1])) 
    ymin = int(round(top_ymin[i] * img.shape[0])) 
    xmax = int(round(top_xmax[i] * img.shape[1])) 
    ymax = int(round(top_ymax[i] * img.shape[0])) 
    score = top_conf[i] 
    label = int(top_label_indices[i]) 
    label_name = top_labels[i] 
    color = colors[label] 
    # Scale the RGB components of the colormap colour from [0,1] to [0,255]
    for c in range(0, len(color) - 1): 
        color[c] = color[c] * 255 

    cv2.rectangle(img, (xmin, ymin), (xmax, ymax), color, 2) 
    cv2.putText(img, label_name, (xmin + 5, ymin + 10), font, 0.5, (255, 0, 0), 2) 

Later, I used the script explained above to verify the detection capacity of the models trained with the databases described earlier. The following images show the results of the detection (LEFT = VOC and RIGHT = COCO).


Using my camera[edit]

An important aim is to run the network on the images captured by a camera. I used the camera of my laptop first.

First, I built a GUI (Graphical User Interface). This GUI uses PyQt4, which is one of the two most popular Python bindings for the Qt cross-platform GUI/XML/SQL C++ framework. The GUI shows three windows:

  • The main window which shows the image that camera captures.
  • The window which shows the transformed image.
  • The window which shows 10 labels, one for each number (0-9), which change colour depending on the image shown to the camera.

For all this, I used the following code:


#Original Image Label
self.imgLabel=QtGui.QLabel(self)
self.imgLabel.resize(500,400)
self.imgLabel.move(70,50)
self.imgLabel.show()

#Transform Image Label
self.transLabel=QtGui.QLabel(self)
self.transLabel.resize(200,200)
self.transLabel.move(700,50)
self.transLabel.show()

self.numbers = []
#Numbers labels 0 to 9
lab0=QtGui.QLabel(self)
lab0.resize(30,30)
lab0.move(785,450)
lab0.setText('0')
lab0.setAlignment(QtCore.Qt.AlignCenter)
lab0.setStyleSheet("background-color: #7FFFD4; color: #000; font-size: 20px; border: 1px solid black;")
self.numbers.append(lab0)

The following image shows the GUI appearance:


After that, I need to use Caffe to identify the number shown to the camera. The images with which I trained each network are 28 pixels high and 28 pixels wide, that is, 28x28 pixels, so it is necessary to resize the input image. The transformed image is shown in the window at the top right corner. I used the following code:

def detection(self, img): #Uses caffe to detect the number we are showing
    self.net.blobs['data'].reshape(1,1,28,28)
    self.net.blobs['data'].data[...]=img
    output = self.net.forward()
    digito = output['prob'].argmax()
    return digito

I also used a function to do another transformation. The following function applies a Laplacian or Canny filter, depending on the configuration file.

def trasformImage(self, img): #Transforms the image for the network
    #Focus the image
    img_crop = img[0:480, 80:560]
    #Grayscale image
    img_gray = cv2.cvtColor(img_crop, cv2.COLOR_BGR2GRAY)
    #Resize image
    resize = cv2.resize(img_gray,(28,28))
    #Gaussian filter
    img_filt = cv2.GaussianBlur(resize, (5, 5), 0)
    if (filterMode == "0"):
        #Canny filter
        v = np.median(img_filt)
        sigma = 0.33
        lower = int(max(0, (1.0 - sigma) * v))
        upper = int(min(255, (1.0 + sigma) * v))
        edges = cv2.Canny(img_filt, lower, upper)
    else:
        #Laplacian filter
        edges = cv2.Laplacian(img_filt,-1,5)
        edges = cv2.convertScaleAbs(edges)
    kernel = np.ones((5,5),np.uint8)
    dilation = cv2.dilate(edges,kernel,iterations = 1)
    #Negative
    #neg = 255-resize
    return dilation

The following lines belong to the configuration file. If the variable Numberclassifier.Filter has a value of 0, a Canny filter is applied; if its value is 1, a Laplacian filter is applied instead.

Numberclassifier.Camera.Proxy=cameraA:default -h 0.0.0.0 -p 9999
#set 0 for a Canny filter and 1 to a Laplacian filter
Numberclassifier.Filter=0


In order to show the number detected, it is necessary to change the colour of one of the boxes in the third window. For that reason, I used the following function in the GUI script. I used yellow to identify the detected number.

def lightON(self,out): #This function turn on the light for the network output
    for number in self.numbers:
        number.setStyleSheet("background-color: #7FFFD4; color: #000; font-size: 20px; border: 1px solid black;")
    self.numbers[out].setStyleSheet("background-color: #FFFF00; color: #000; font-size: 20px; border: 1px solid black;")

Testing some networks[edit]

Once I knew how to create a database with the images I want, I created more of them in order to train my networks. From the original database, with the background in black and the number in white, I created six more: one using a Canny filter, another using a Sobel filter, and four more using a Laplacian filter with different kernel sizes (3, 5, 7, 9). The following images are some examples of the databases created.

After that, I trained a network with each of these databases. For testing them, I created one more database for each trained network. To check the accuracy of the networks, I added to each test database the negative of every image, so each test database contains 20000 images. The test databases are:

  • Database with 10000 images filtered with Canny and the negative of each of them
  • Database with 10000 images filtered with Sobel and the negative of each of them
  • Database with 10000 images filtered with Laplacian (kernel size 3) and the negative of each of them
  • Database with 10000 images filtered with Laplacian (kernel size 5) and the negative of each of them
  • Database with 10000 images filtered with Laplacian (kernel size 7) and the negative of each of them
  • Database with 10000 images filtered with Laplacian (kernel size 9) and the negative of each of them

Below are examples of these databases:

After that I tested all networks using the test databases created previously. The following graph shows the results:



In order to improve the results, I decided to apply the absolute value to the images filtered with the Sobel filter and with the Laplacian filter. In the previous graph, you can see that the kernel size of the Laplacian filter does not significantly affect the final result, so I created just one database using a single kernel size. Likewise, I created the corresponding test databases with 20000 images. Below you can see two examples of filtered images in which the absolute value has been used:
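
As a reference, this filtering with absolute value can be computed with OpenCV roughly as follows (a sketch; the file name and kernel sizes are illustrative):

import cv2

img_gray = cv2.imread('digit.png', cv2.IMREAD_GRAYSCALE)   # illustrative input image

# Sobel: take the gradient magnitude and scale its absolute value to 8 bits
sobel_x = cv2.Sobel(img_gray, cv2.CV_64F, 1, 0, ksize=3)
sobel_y = cv2.Sobel(img_gray, cv2.CV_64F, 0, 1, ksize=3)
sobel_abs = cv2.convertScaleAbs(cv2.magnitude(sobel_x, sobel_y))

# Laplacian: keep the absolute value of the response instead of clipping negatives
laplacian = cv2.Laplacian(img_gray, cv2.CV_64F, ksize=5)
laplacian_abs = cv2.convertScaleAbs(laplacian)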

After training the networks, the result of the test is the one shown in the following graph:


After that, I decided to test one network using images of numbers in different positions, with different sizes and with some added noise. I created six more test databases, each formed by 60000 images. All of them use the Sobel filter and the absolute value of each pixel. I then applied some transformations over each image (a sketch of these transformations follows the list):

  • Images rotated
  • Images translated
  • Images scaled
  • Images with gaussian noise of variance 0.2
  • Images with gaussian noise of variance 0.5
  • Images using all previous transformations
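
A sketch of how these transformations can be generated with OpenCV and NumPy (parameter values are illustrative, not the exact ones used to build the databases):

import cv2
import numpy as np

img = cv2.imread('digit.png', cv2.IMREAD_GRAYSCALE)        # illustrative 28x28 input
h, w = img.shape

# Rotation around the image centre
M_rot = cv2.getRotationMatrix2D((w / 2, h / 2), 15, 1.0)
rotated = cv2.warpAffine(img, M_rot, (w, h))

# Translation of a few pixels
M_tr = np.float32([[1, 0, 3], [0, 1, -2]])
translated = cv2.warpAffine(img, M_tr, (w, h))

# Scaling (the result would then be padded or cropped back to 28x28)
scaled = cv2.resize(img, None, fx=0.8, fy=0.8)

# Additive Gaussian noise (variance expressed on a [0,1] intensity scale)
noise = np.random.normal(0, np.sqrt(0.2), img.shape) * 255
noisy = np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)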

The results are shown in the following graph:

Creating a new LMDB database[edit]

An important aim is to train and test a neural network with the images that you want. For that, it is necessary to know how to create a database of images and their associated labels. The first LMDB database that I created used the images of the MNIST database, but applying a Canny filter to them. For that I used OpenCV, because it has a function with which it is possible to apply this kind of edge filter to any image. There is more information about this filter in the following link. Once all the images were filtered, I created the LMDB database. The whole process, from filtering the images to creating the LMDB database, is achieved with the following Python code:

import sys
sys.path.append('/usr/lib/python2.7/dist-packages')
 
import caffe
import lmdb
import cv2
import numpy as np
 
lmdb_env = lmdb.open('/home/davidbutra/Escritorio/caffe/examples/mnist/mnist_train_lmdb')
lmdb_txn = lmdb_env.begin()
lmdb_cursor = lmdb_txn.cursor()
datum = caffe.proto.caffe_pb2.Datum()
 
item_id = -1
batch_size = 256
 
 
 
new_lmdb_env = lmdb.open('/home/davidbutra/Escritorio/caffe/examples/mnist/mnist_train_images_filter/data_filter',map_size=int(1e12))
new_lmdb_txn = new_lmdb_env.begin(write=True)
new_lmdb_cursor = new_lmdb_txn.cursor()
new_datum = caffe.proto.caffe_pb2.Datum()
 
 
for key, value in lmdb_cursor:
    item_id = item_id + 1
 
    datum.ParseFromString(value)
    label = datum.label
    data = caffe.io.datum_to_array(datum)
    data_filter = cv2.Canny(data[0],100,200)
    data_filter = data_filter[np.newaxis,:, :]
 
    new_datum = caffe.io.array_to_datum(data_filter,label)
 
    keystr = '{:0>8d}'.format(item_id)
    new_lmdb_txn.put( keystr, new_datum.SerializeToString() )
 
    # write batch
    if(item_id + 1) % batch_size == 0:
        new_lmdb_txn.commit()
        new_lmdb_txn = new_lmdb_env.begin(write=True)
        print (item_id + 1)
 
# write last batch
if (item_id+1) % batch_size != 0:
    new_lmdb_txn.commit()
    print 'last batch'
    print (item_id + 1)

Additionally, I used another script, which renders the first 30 images of the new database, to verify that it was created successfully (a minimal sketch of such a check is shown below). After running it, the filtered images look like this:
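
A minimal sketch of reading the new LMDB back and dumping its first images to disk for visual inspection (the output file names are illustrative):

import caffe
import lmdb
import cv2

# Read the filtered LMDB and save the first 30 images as PNG files.
lmdb_env = lmdb.open('/home/davidbutra/Escritorio/caffe/examples/mnist/mnist_train_images_filter/data_filter')
lmdb_txn = lmdb_env.begin()
datum = caffe.proto.caffe_pb2.Datum()

for i, (key, value) in enumerate(lmdb_txn.cursor()):
    if i >= 30:
        break
    datum.ParseFromString(value)
    data = caffe.io.datum_to_array(datum)        # shape (1, 28, 28)
    cv2.imwrite('filtered_%02d_label%d.png' % (i, datum.label), data[0])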



Install JdeRobot[edit]

I installed JdeRobot following the instructions in the next link, Install JdeRobot. In order to install JdeRobot we must have a machine running Ubuntu 14.04 Trusty or Debian stable and add some sources to our machine. After that, I installed JdeRobot following the steps in the paragraph entitled "Installation for running JdeRobot".

In order to try JdeRobot, I ran example 2.2, Cameraserver + Cameraview, which appears in the following link: Cameraserver + Cameraview.

Because I am using a Virtual Machine with Ubuntu 14.04, I have to attach the laptop camera to the Virtual Machine in order to use it. For that, I used the following commands:

VBoxManage list webcams     # To see how many cameras there are in the laptop.
VBoxManage controlvm "Ubuntu 14.04" webcam list     # List of cameras associated with that virtual machine.
VBoxManage list vms     # List of virtual machines.
VBoxManage controlvm "Ubuntu 14.04" webcam attach .1     # Associating integrated camera to Virtual Machine.	

Caffe on Python[edit]

In order to run the previously trained network, I used the following Python code:

import caffe
import os

model_file = '../examples/mnist/lenet.prototxt'
pretrained_file = '../examples/mnist/lenet_iter_10000.caffemodel'
net = caffe.Classifier(model_file, pretrained_file, image_dims=(28, 28), raw_scale=255)
score = net.predict([caffe.io.load_image('Dos.bmp', color=False)], oversample=False)
print score

First of all, I created two variables with the paths of two files. I used the file lenet.prototxt, which specifies the network structure, and the file lenet_iter_10000.caffemodel, which contains the state of the weights of each layer of the network after 10000 iterations.

After that, I used the Classifier class of the Python interface of Caffe. The inputs are the two files previously explained, the parameter image_dims, which specifies the dimensions to which the input image is scaled for cropping/sampling, and the parameter raw_scale, which rescales the loaded image (in the range [0,1]) back to the 0-255 grey-level range used during training.

Finally, I used the predict function with a handwritten digit as input. Because the network was trained using white handwritten digits on a black background, we must use images with these characteristics, like this one:


The output of this function is an array of ten values, one per class, with nine '0' and one '1', the latter representing the activated neuron in the output layer. The output array looks as shown below:

 [[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]] 
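
As a usage note (not part of the original script), the predicted digit can be read from this array with NumPy's argmax:

digit = score[0].argmax()    # index of the activated output neuron, 2 for the array above
print digit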

Studying the MNIST Database[edit]

In order to understand the tutorial better, I created some scripts to learn about the images of the MNIST database.

Data Section[edit]

The images of the database are stored in LMDB (Lightning Memory-Mapped Database Manager) format. I wrote the following code, which transforms each image into a 28x28 array of arrays. Then, using PGM (Portable Graymap Format), the images can be viewed. The format of the PGM files is shown in the following listing, which represents the first number in the database, a '5'.

P2
28 28    # Width and height.
255      # Maximum grey value (number of grey levels between black and white).

0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   3  18  18  18 126 136 175  26 166 255 247 127   0   0   0   0
0   0   0   0   0   0   0   0  30  36  94 154 170 253 253 253 253 253 225 172 253 242 195  64   0   0   0   0
0   0   0   0   0   0   0  49 238 253 253 253 253 253 253 253 253 251  93  82  82  56  39   0   0   0   0   0
0   0   0   0   0   0   0  18 219 253 253 253 253 253 198 182 247 241   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0  80 156 107 253 253 205  11   0  43 154   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0  14   1 154 253  90   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0 139 253 190   2   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0  11 190 253  70   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0  35 241 225 160 108   1   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0  81 240 253 253 119  25   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0  45 186 253 253 150  27   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  16  93 252 253 187   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 249 253 249  64   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0  46 130 183 253 253 207   2   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0  39 148 229 253 253 253 250 182   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0  24 114 221 253 253 253 253 201  78   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0  23  66 213 253 253 253 253 198  81   2   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0  18 171 219 253 253 253 253 195  80   9   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0  55 172 226 253 253 253 253 244 133  11   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0 136 253 253 253 212 135 132  16   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

The code is shown below. In this case, it dumps the first 30 images of the MNIST database. The code is also available in my GitHub as Data.py.

import caffe
import lmdb

lmdb_env = lmdb.open('/home/davidbutra/Escritorio/caffe/examples/mnist/mnist_train_lmdb')
lmdb_txn = lmdb_env.begin()
lmdb_cursor = lmdb_txn.cursor()
datum = caffe.proto.caffe_pb2.Datum()
n = 0
i = 0
nImage = 1
loop = 0

width = 28
height = 28
ftype = 'P2'

for key, value in lmdb_cursor:

    n = 0
    i = 0

    if loop < 30:
        
        datum.ParseFromString(value)
        label = datum.label
        data = caffe.io.datum_to_array(datum)
        #print label
        #print data

        pgmfile=open('data' + str(nImage) + '.pgm', 'w')
        pgmfile.write("%s\n" % (ftype))
        pgmfile.write("%d %d\n" % (width,height))
        pgmfile.write("255\n")

        txtfile=open('data' + str(nImage) + '.txt', 'w')
        txtfile.write("%s\n" % (ftype))
        txtfile.write("%d %d\n" % (width,height))
        txtfile.write("255\n")

        nImage = nImage + 1
        loop = loop + 1

        while i < height:
            if n == width - 1:
                pgmfile.write("%s\n" % (data[0][i][n]))
                txtfile.write("%s\n" % (data[0][i][n]))
                i = i + 1
                n = 0   
            elif data[0][i][n + 1] < 10:
                pgmfile.write("%s   " % (data[0][i][n]))
                txtfile.write("%s   " % (data[0][i][n]))
                n = n + 1
            elif data[0][i][n + 1] < 100:
                pgmfile.write("%s  " % (data[0][i][n]))
                txtfile.write("%s  " % (data[0][i][n]))
                n = n + 1
            else:
                pgmfile.write("%s " % (data[0][i][n]))
                txtfile.write("%s " % (data[0][i][n]))
                n = n + 1

        pgmfile.close()
        txtfile.close()

Label Section[edit]

It is also important to know how many images of each number there are in the database. For that I wrote this code; a minimal counting sketch is also shown below, followed by the output of the script.
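
A minimal counting sketch, assuming the same LMDB path and Datum layout as in Data.py above (the actual script may differ):

import caffe
import lmdb

# Count how many images of each digit (0-9) the training LMDB contains.
lmdb_env = lmdb.open('/home/davidbutra/Escritorio/caffe/examples/mnist/mnist_train_lmdb')
lmdb_txn = lmdb_env.begin()
datum = caffe.proto.caffe_pb2.Datum()

counts = [0] * 10
for key, value in lmdb_txn.cursor():
    datum.ParseFromString(value)
    counts[datum.label] += 1

for digit in range(10):
    print 'Digit %d: %d images' % (digit, counts[digit])
print 'Total images: %d' % sum(counts)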

Zero: 5923
One: 6742
Two: 5958
Three: 6131
Four: 5842
Five: 5421
Six: 5918
Seven: 6265
Eight: 5851
Nine: 5949
Total images: 60000

Thanks to this, we know that there are 60000 images in the database and that they are distributed more or less evenly among the 10 digits.

Data Average[edit]

To learn more about the database, I wrote a script that computes the average intensity level of each pixel for each digit. This is important to know how uniform the images of the different digits are and to know the intensity levels of the edges of each image, because later it will be important to apply an edge filter to the images. The code can be found here. The output images are below:
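
The averaging script itself is only linked above; a minimal sketch of the idea, under the same LMDB assumptions as the previous scripts, could look like this:

import caffe
import lmdb
import numpy as np
import cv2

# Accumulate per-digit pixel sums and counts, then save the mean image of each digit.
lmdb_env = lmdb.open('/home/davidbutra/Escritorio/caffe/examples/mnist/mnist_train_lmdb')
lmdb_txn = lmdb_env.begin()
datum = caffe.proto.caffe_pb2.Datum()

sums = np.zeros((10, 28, 28), dtype=np.float64)
counts = np.zeros(10)
for key, value in lmdb_txn.cursor():
    datum.ParseFromString(value)
    data = caffe.io.datum_to_array(datum)        # shape (1, 28, 28)
    sums[datum.label] += data[0]
    counts[datum.label] += 1

for digit in range(10):
    mean_img = (sums[digit] / counts[digit]).astype(np.uint8)
    cv2.imwrite('average_%d.png' % digit, mean_img)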



LeNet MNIST Tutorial[edit]

In order to get used to working with Caffe, I did the MNIST tutorial explained on the official Caffe website [LeNet MNIST Tutorial]. MNIST is a database of handwritten digits which has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image. The four files with images and labels are available at the following site [MNIST-Database].


The aim of the tutorial is to train and test a neural network using the MNIST database so that it can correctly classify an input image, the input image being a handwritten digit between 0 and 9. The first step is to define the LeNet model for MNIST handwritten digit classification. The structure of the network is shown in the following code:

name: "LeNet"
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "examples/mnist/mnist_train_lmdb"
    batch_size: 64
    backend: LMDB
  }
}
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "examples/mnist/mnist_test_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 50
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool2"
  top: "ip1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 500
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip2"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}

The following diagram shows the network and all its connections as well as the inputs and outputs of the network and each layer:


Layer By Layer[edit]

Now we will see each layer separately. The explanation of the parameter values of each layer can also be checked here [Define the MNIST Network].


Training and Test Data

The first two sections are not considered layers; they refer to the input data, one for the training images and the other for the test images. The name of both sections is 'mnist' and the type is 'Data'. The parameters of each section specify the path to the training or test images, the batch size (64 for training images and 100 for test images) and the backend, in both cases LMDB. The transform_param scale of 0.00390625 (1/256) scales the pixel values from [0, 255] down to roughly [0, 1].

layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "examples/mnist/mnist_train_lmdb"
    batch_size: 64
    backend: LMDB
  }
}

Convolution Layer

This layer takes the data blob, which is provided by the data layer, and produces the conv1 layer. It produces outputs of 20 channels, with the convolutional kernel size 5 and carried out with stride 1.

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}


Pooling Layer

This layer takes the conv1 blob and applies max pooling with a kernel of size 2 and a stride of 2, halving the spatial resolution.

layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}


Fully Connected Layer

This layer (type InnerProduct in Caffe) takes the pool2 blob and produces ip1, a fully connected layer with 500 outputs.

layer {
  name: "ip1"
  type: "InnerProduct"
  param { lr_mult: 1 }
  param { lr_mult: 2 }
  inner_product_param {
    num_output: 500
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
  bottom: "pool2"
  top: "ip1"
}


ReLU Layer

This layer applies the ReLU activation element-wise to ip1. Since it is an in-place operation (bottom and top are both ip1), it saves memory.

layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}


Loss Layer

This layer implements both the softmax and the multinomial logistic loss, taking the ip2 predictions and the labels as inputs and producing the loss value used during training.

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
}


Define the Solver[edit]

The solver file, examples/mnist/lenet_solver.prototxt, defines how the network is trained and tested: the net to use, the learning rate policy, how often the test set is evaluated, the maximum number of iterations and where snapshots are written.

# The train/test net protocol buffer definition
net: "examples/mnist/lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
# solver mode: CPU or GPU
solver_mode: CPU

Training and Testing the Network[edit]

Training and testing are launched from the Caffe root directory with the following command:

./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt

Trained and Tested Network[edit]

When the network has been trained and tested, the final model is stored as binary protobuf files with the prefix lenet_iter_10000. Two files are created:

  • lenet_iter_10000.caffemodel: The caffemodel, which is written out at a specified interval during training, is a binary protobuf file that contains the current state of the weights of each layer of the network.
  • lenet_iter_10000.solverstate: The solverstate, which is generated alongside it, is a binary protobuf file that contains the information required to continue training the model from the point where it last stopped (see the example below).
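
For example, training can be resumed from that snapshot with the standard --snapshot option of the caffe command line tool:

./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt --snapshot=examples/mnist/lenet_iter_10000.solverstate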

Caffe Installation[edit]

The first step is to install Caffe. Caffe is a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Caffe provides a complete toolkit for training, testing, finetuning and deploying models, with well-documented examples for all of these tasks. As such, it is an ideal starting point for researchers and other developers looking to jump into state-of-the-art machine learning.

Therefore, the Caffe installation is done following the steps on the official Caffe website. Because I am going to use Ubuntu 14.04 LTS, I have to follow the steps shown in the next hyperlink [Caffe-Ubuntu Installation].

It is also interesting to visit BVLC's GitHub repository for Caffe [GitHub-BVLC/Caffe].