

Project Card

Project Name: Visual Control with DeepLearning

Author: Vanessa Fernández Martínez [vanessa_1895@msn.com]

Academic Year: 2017/2018

Degree: Computer Vision Master

GitHub Repositories: [1]

Tags: Deep Learning, Detection, JdeRobot


Week 30: Driving videos, Classification network

Driving videos

I've used the predictions of the classification network for w (7 classes), with constant v, to drive a Formula 1:

Week 29: Driving videos, Dataset coherence study, Driving analysis, New GUI

Driving videos

I've used the predictions of the classification network for w (7 classes), with constant v, to drive a Formula 1:

I've used the predictions of the classification network for w (7 classes) and v (4 classes) to drive a Formula 1:

I've used the predictions of the regression network for w, with constant v, to drive a Formula 1:

I've used the predictions of the regression network for w and v to drive a Formula 1:

Dataset coherence study

To analyze the data, I've taken two rows of each image (row 250 and row 360) and calculated the centroid of the line in each row. On the x-axis of the graph the centroid of row 360 is represented, and the y-axis represents the centroid of row 250. In the following images we can see the representation of this statistic for the training set (new dataset), for w and for v.
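
The per-row centroid used by this statistic can be sketched as follows (a minimal illustration assuming the line has already been segmented into a binary 0/1 row; the function name is mine, not the one in the repository):

```python
def row_centroid(row_mask):
    # row_mask: 1D sequence of 0/1 values marking line pixels in one image row
    xs = [i for i, v in enumerate(row_mask) if v]
    if not xs:
        return None  # the line is not visible in this row
    return sum(xs) / float(len(xs))
```

Applying this to the two chosen rows of every image gives the (x, y) point plotted in the graphs.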

In the next image the points are colored according to their w class (7 classes): "radically_left" in red, "moderately_left" in blue, "slightly_left" in green, "slight" in cyan, "slightly_right" in purple, "moderately_right" in yellow, and "radically_right" in black.

In the next image the points are colored according to their v class (4 classes): "slow" in red, "moderate" in blue, "fast" in green, and "very_fast" in purple.

Driving analysis

I've relabelled the images from https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/Failed_driving according to my own criteria, and I've compared the relabelled data with the driving data. A 72% accuracy is obtained.
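
The comparison can be sketched like this (an illustrative helper, not the actual script from the repository):

```python
def relabel_agreement(relabelled, driving):
    # fraction of samples where the hand relabelling matches the recorded
    # driving label (the 72% figure above is this value on the failed-driving set)
    matches = sum(1 for a, b in zip(relabelled, driving) if a == b)
    return matches / float(len(relabelled))
```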

New GUI

I've modified the GUI, adding LEDs. Under the image of the car camera there are 7 LEDs that correspond to the 7 classes of w; the LED that corresponds to the class predicted by the network lights up. To the right of the image there are 4 LEDs that correspond to the 4 classes of v.

Week 28: Classification Network, Regression Network, Reading information

This week, I've retrained the models with the new dataset. This new dataset is divided into 11275 image/speed pairs for training and 4833 pairs for testing.

Driving videos

I've used the predictions of the classification network for w (7 classes) and v (4 classes) to drive a Formula 1:

I've used the predictions of the regression network (223 epochs for v and 212 for w) to drive a Formula 1:

Data statistics

To analyze the data, a new statistic was created (analysis_vectors.py). I've taken two rows of each image (row 250 and row 360) and calculated the centroid of the line in each row. On the x-axis of the graph the centroid of row 360 is represented, and the y-axis represents the centroid of row 250. In the following image we can see the representation of this statistic for the training set (new dataset) (red circles) against the driving data (blue crosses).

Reading information

I've read some information about self-driving. I've read about different architectures:

Classification network for w

I've retrained the classification network for w with the new dataset. The test results are:

Classification network for v

I've retrained the classification network for v with the new dataset. The test results are:

Week 27: Data Analysis, New Dataset

Number of data for each class

At https://jderobot.org/Vmartinezf-tfm#Follow_line_with_classification_network_and_with_regression_network we saw that the car was not able to complete the entire circuit with the 7-class classification network and constant v. For this reason, we want to evaluate our dataset and see whether it is representative. To do this, we've saved the images that the F1 sees while driving with the neural network, together with some data. This data can be found at https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/Failed_driving.

I've created a script (evaluate_class.py) that shows a graph of the number of examples that exist for each class (7 classes of w). In the following images we can see first the graph for the training data and then the graph for the driving data.
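
The counting step behind such a graph can be sketched as follows (illustrative only; evaluate_class.py itself is in the repository and also plots the result):

```python
from collections import Counter

# the 7 w classes, in order from left to right
W_CLASSES = ['radically_left', 'moderately_left', 'slightly_left', 'slight',
             'slightly_right', 'moderately_right', 'radically_right']

def class_histogram(labels, classes=W_CLASSES):
    # number of examples per class, in a fixed class order
    # (classes with no examples count as 0)
    counts = Counter(labels)
    return [counts.get(c, 0) for c in classes]
```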

Data statistics

To analyze the data, a new statistic was created (analysis_vectors.py). I've taken two rows of each image (row 250 and row 360) and calculated the centroid of the line in each row. On the x-axis of the graph the centroid of row 360 is represented, and the y-axis represents the centroid of row 250. In the following image we can see the representation of this statistic for the training set (red circles) against the driving data (blue crosses).


I've used entropy as a measure of similarity. The Shannon entropy is a measure of the information in a set, defined as the expected value of the information. I've followed a Python example from the book "Machine Learning in Action" (http://www2.ift.ulaval.ca/~chaib/IFT-4102-7025/public_html/Fichiers/Machine_Learning_in_Action.pdf). This book uses the following function to calculate the Shannon entropy of a dataset:

from math import log

def calculate_shannon_entropy(dataset):
    numEntries = len(dataset)
    labelCounts = {}
    for featVec in dataset: # count the unique labels and their occurrences
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2) # log base 2
    return shannonEnt

First, you calculate a count of the number of instances in the dataset. This could have been calculated inline, but it’s used multiple times in the code, so an explicit variable is created for it. Next, you create a dictionary whose keys are the values in the final column. If a key was not encountered previously, one is created. For each key, you keep track of how many times this label occurs. Finally, you use the frequency of all the different labels to calculate the probability of that label. This probability is used to calculate the Shannon entropy, and you sum this up for all the labels.

In this example the dataset is:

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

In my case, we want to measure the entropy of the training set and the entropy of the driving set. I haven't used the labels; instead I've used the centroids of 3 rows of each image. For this reason, I've modified the function "calculate_shannon_entropy":

from math import log

def calculate_shannon_entropy(dataset):
    numEntries = len(dataset)
    labels = []
    counts = []
    for featVec in dataset: # count the unique vectors and their occurrences
        found = False
        for i in range(len(labels)):
            if featVec == labels[i]:
                found = True
                counts[i] += 1
        if not found:
            labels.append(featVec)
            counts.append(1)
    shannonEnt = 0.0
    for num in counts:
        prob = float(num)/numEntries
        shannonEnt -= prob * log(prob,2) # log base 2
    return shannonEnt

My dataset is like:

[[32, 445, 34], [43, 12, 545], [89, 67, 234]]

The entropy results are:

Shannon entropy of driving: 0.00711579828413
Shannon entropy of dataset: 0.00336038443482

SSIM and MSE

In addition, to verify the difference between the driving data and the training data I've used the SSIM and MSE measurements. The MSE value is obtained, although it isn't a very representative value of the similarity between images. Structural similarity aims to address this shortcoming by taking texture into account.

The Structural Similarity (SSIM) index is a method for measuring the similarity between two images. The SSIM index can be viewed as a quality measure of one of the images being compared, provided the other image is regarded as of perfect quality.
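
Both measures can be sketched as follows. This is a simplified, single-window SSIM computed over the whole image, rather than the usual windowed variant (e.g. the one in scikit-image); the constants follow the standard SSIM formula for 8-bit images:

```python
import numpy as np

def mse(a, b):
    # mean squared pixel error between two equally sized images
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

def global_ssim(a, b, L=255):
    # single-window SSIM over the whole image; the standard SSIM averages
    # this expression over small sliding windows instead
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
```

Identical images give an MSE of 0 and an SSIM of 1; the SSIM drops toward 0 as structure diverges.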

I've analyzed the image that the car saw just before leaving the road. I've compared this image with the whole training set, calculating the average SSIM, and in addition the minimum SSIM and the maximum SSIM. The minimum SSIM is given when we compare our image with a very disparate one, and the maximum SSIM when we compare it with the closest image in the training set. Next, the case of the minimum SSIM is shown on the left side and the case of the maximum SSIM on the right side. For each case we print the corresponding SSIM, the MSE, the average SSIM, the average MSE, the image, the image of the dataset to which the given SSIM corresponds, and the SSIM image.

In addition, the same images are provided for a driving image of each class of w.

  • 'radically_left':

  • 'moderately_left':

  • 'slightly_left':

  • 'slight':

  • 'slightly_right':

  • 'moderately_right':

  • 'radically_right':

New Dataset

I've based the new dataset on the code created for the follow-line practice of JdeRobot Academy (http://vanessavisionrobotica.blogspot.com/2018/05/practica-1-follow-line-prueba-2.html). This new dataset has been generated using 3 circuits so that the data is more varied: the circuits of monacoLine.world, f1.launch and f1-chrono.launch have been used.

Week 26: Follow line with classification network, Studying Tensorboard, Classification network for v, Regression network for w and v

Follow line with classification network and with regression network

I've used the predictions of the classification network for w (7 classes) to drive a Formula 1. Depending on the class of w, a different angle of rotation is given to the vehicle, and the linear speed remains constant. With this network part of the circuit is completed, but the car crashes when leaving a curve. Below, you can see an example:

I've used the predictions of the classification network for w (7 classes) and v (4 classes) to drive a Formula 1. Depending on the class of w, a different angle of rotation is given to the vehicle, and depending on the class of v, a different linear speed. With this network part of the circuit is completed, but the car crashes when leaving a curve. Below, you can see an example:

I've used the predictions of the regression network to drive a Formula 1 (223 epochs for v and 212 epochs for w):

Studying Tensorboard

Tensorboard (https://www.tensorflow.org/guide/summaries_and_tensorboard, https://github.com/tensorflow/tensorboard) is a suite of visualization tools that makes it easier to understand, debug, and optimize TensorFlow programs. You can use TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about the execution of your graph, and show additional data like images that pass through it. Tensorboard can also be used with Keras. I've followed some tutorials: https://www.datacamp.com/community/tutorials/tensorboard-tutorial, http://fizzylogic.nl/2017/05/08/monitor-progress-of-your-keras-based-neural-network-using-tensorboard/, https://keras.io/callbacks/.

Tensorboard is a separate tool you need to install on your computer. You can install Tensorboard using pip, the Python package manager:

pip install tensorboard

To use Tensorboard you have to modify the Keras code a bit. You need to create a new TensorBoard instance and point it to a log directory where data should be collected. Next you need to modify the fit call so that it includes the tensorboard callback.

from time import time
from keras.callbacks import TensorBoard

tensorboard = TensorBoard(log_dir="logs/{}".format(time()))
model.fit(X_train, y_train, epochs=nb_epochs, batch_size=batch_size, callbacks=[tensorboard])

We can pass different arguments to the callback:

  • log_dir: the path of the directory where to save the log files to be parsed by TensorBoard.
  • histogram_freq: frequency (in epochs) at which to compute activation and weight histograms for the layers of the model. If set to 0, histograms won't be computed. Validation data (or split) must be specified for histogram visualizations.
  • write_graph: whether to visualize the graph in TensorBoard. The log file can become quite large when write_graph is set to True.
  • write_grads: whether to visualize gradient histograms in TensorBoard. histogram_freq must be greater than 0.
  • batch_size: size of batch of inputs to feed to the network for histograms computation.
  • write_images: whether to write model weights to visualize as image in TensorBoard.
  • embeddings_freq: frequency (in epochs) at which selected embedding layers will be saved. If set to 0, embeddings won't be computed. Data to be visualized in TensorBoard's Embedding tab must be passed as embeddings_data.
  • embeddings_layer_names: a list of names of layers to keep an eye on. If None or an empty list, all embedding layers will be watched.
  • embeddings_metadata: a dictionary which maps a layer name to the file name in which the metadata for that embedding layer is saved. If the same metadata file is used for all embedding layers, a single string can be passed.
  • embeddings_data: data to be embedded at layers specified in embeddings_layer_names. Numpy array (if the model has a single input) or list of Numpy arrays (if the model has multiple inputs).

The callback raises a ValueError if histogram_freq is set and no validation data is provided. Using Tensorboard callback will work while eager execution is enabled, however outputting histogram summaries of weights and gradients is not supported, and thus histogram_freq will be ignored.

To run TensorBoard, use the following command:

tensorboard --logdir=path/to/log-directory

where logdir points to the directory where the FileWriter serialized its data. If this logdir directory contains subdirectories which contain serialized data from separate runs, then TensorBoard will visualize the data from all of those runs. For example, in our case we use:

tensorboard --logdir=logs/

Once TensorBoard is running, navigate your web browser to localhost:6006 to view the TensorBoard. When looking at TensorBoard, you will see the navigation tabs in the top right corner. Each tab represents a set of serialized data that can be visualized.

Tensorboard has different views which take inputs of different formats and display them differently. You can change them on the orange top bar. Different views of Tensorboard are:

  • Scalars: Visualize scalar values, such as classification accuracy.
  • Graph: Visualize the computational graph of your model, such as the neural network model.
  • Distributions: Visualize how data changes over time, such as the weights of a neural network.
  • Histograms: A fancier view of the distribution that shows distributions in a 3-dimensional perspective.
  • Projector: Can be used to visualize word embeddings (numerical representations of words that capture their semantic relationships).
  • Image: Visualizing image data.
  • Audio: Visualizing audio data.
  • Text: Visualizing text (string) data.

Looking at the training graphs we can check when our model is overfitting. For example, in a training of a classification model (4 classes) with a batch_size of 32 and 40 epochs we can see the point where the training stops being efficient. In the graphs of the validation set we can see that from epoch 23 the training is no longer efficient.

Classification network for v

The files data.json, train.json and test.json have been modified to add a new classification that divides the linear speed into 4 classes. The classes are the following:

  • slow: if the linear speed is v <= 7.
  • moderate: if the linear speed is 7 < v <= 9.
  • fast: if the linear speed is 9 < v <= 11.
  • very_fast: if the linear speed is v > 11.
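
The thresholds above can be expressed as a small labelling helper (illustrative; the actual labelling lives in the modified json files):

```python
def speed_class(v):
    # map a linear speed to one of the 4 classes used for training
    if v <= 7:
        return 'slow'
    elif v <= 9:
        return 'moderate'
    elif v <= 11:
        return 'fast'
    return 'very_fast'
```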

I've trained a model with the 4 classes mentioned above. The CNN architecture I am using is SmallerVGGNet, a simplified version of VGGNet. After training the network, we save the model (models/model_smaller_vgg_4classes_v.h5) and show the graphs of loss and accuracy for training and validation according to the epochs. For that, I've used Tensorboard:

In addition, I evaluate the accuracy, precision, recall, F1-score (in test set) and we paint the confusion matrix. The results are the following:

Regression network for w and v

I've trained two regression networks (one for v and one for w) following the PilotNet architecture. To get an idea of the efficiency of the training I've used Tensorboard. I've trained both networks for 1000 epochs to see how they behaved. The results can be seen below, where the red curve represents the model of v and the blue curve represents the model of w.

  • Accuracy:

  • Loss:

  • Mean squared error:

  • Mean absolute error:

Week 25: Correction of the binary classification model, correction of driver node, accuracy top2, Pilotnet network

Correction of the binary classification model

I've modified the binary classification model models/model_classification.h5, which had some errors. That model has been removed and is now called model_binary_classification.h5. After training the network, we save the model (models/model_binary_classification.h5) and evaluate it with the validation set:

In addition, we evaluate the accuracy, precision, recall, F1-score and we paint the confusion matrix. The results are the following:

Correction of driver node

The driver node had an error that caused the Formula 1 to stop sometimes. This error was due to the teleoperator interfering with the speed of the vehicle. It has been corrected and the F1 can now be driven correctly.

Accuracy top 2 of multiclass classification network

To get an idea of the results with the multiclass classification network trained previously, we have obtained a top-2 accuracy measure. With this metric a prediction counts as good if it is equal to the true label or to an adjacent label, that is, we allow at most one class jump. In this way we obtain 100% accuracy.
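
This adjacent-class top-2 accuracy can be sketched as follows (illustrative helper; the class order is the one used for w):

```python
W_CLASSES = ['radically_left', 'moderately_left', 'slightly_left', 'slight',
             'slightly_right', 'moderately_right', 'radically_right']

def top2_accuracy(true_labels, predictions, classes=W_CLASSES):
    # a prediction is counted as correct if it matches the true class or an
    # adjacent class in the ordered class list (at most one class jump)
    hits = sum(1 for t, p in zip(true_labels, predictions)
               if abs(classes.index(t) - classes.index(p)) <= 1)
    return hits / float(len(true_labels))
```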

Pilotnet network

This week, one of the objectives was to train a regression network to learn values of the linear speed and angular speed of the car. For this we've followed the article "Explaining How a Deep Neural Network Trained with End-to-End Learning Steers a Car" (https://arxiv.org/pdf/1704.07911.pdf) and we've created a network with the PilotNet architecture. NVIDIA has created a neural-network-based system, known as PilotNet, which outputs steering angles given images of the road ahead. PilotNet is trained using road images paired with the steering angles generated by a human driving a data-collection car. In our case we have pairs of images and (v, w). The PilotNet architecture is as follows:

The Pilotnet model can be seen below:

from keras.models import Sequential
from keras.layers import BatchNormalization, Conv2D, Flatten, Dense

model = Sequential()
# Normalization
model.add(BatchNormalization(epsilon=0.001, axis=-1, input_shape=img_shape))
# Convolutional layers
model.add(Conv2D(24, (5, 5), strides=(2, 2), activation="relu"))
model.add(Conv2D(36, (5, 5), strides=(2, 2), activation="relu"))
model.add(Conv2D(48, (5, 5), strides=(2, 2), activation="relu"))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation="relu"))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation="relu"))
# Flatten
model.add(Flatten())
# Fully-connected layers
model.add(Dense(1164, activation="relu"))
model.add(Dense(100, activation="relu"))
model.add(Dense(50, activation="relu"))
model.add(Dense(10, activation="relu"))
# Output: a single vehicle control value (v or w)
model.add(Dense(1))
model.compile(optimizer="adam", loss="mse", metrics=['accuracy'])

In order to learn the angular speed (w) and the linear speed (v) we train two networks, one for each value. The model for v is called model_pilotnet_v.h5 (https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/dl-driver/Net/Keras/models) and the model for w is called model_pilotnet_w.h5.

After training the networks, we save the models (models/model_pilotnet_v.h5, models/model_pilotnet_w.h5) and evaluate (loss, accuracy, mean squared error and mean absolute error) the models with the validation set:

In addition, we evaluate the networks with the test set:

The evaluation of w had a problem; it has been fixed and the result is:

Evaluation w:
('Test loss:', 0.0021951024647809555)
('Test accuracy:', 0.0)
('Test mean squared error: ', 0.0021951024647809555)
('Test mean absolute error: ', 0.0299175349963251)

Week 24: Adding new class, Classification network

Adding new class

The files data.json, train.json and test.json have been modified to add a new classification that divides the angles of rotation into 7 classes. The classes are the following:

  • radically_right: if the rotation angle is w <= -1.
  • moderately_right: if the rotation angle is -1 < w <= -0.5.
  • slightly_right: if the rotation angle is -0.5 < w <= -0.1.
  • slight: if the rotation angle is -0.1 < w < 0.1.
  • slightly_left: if the rotation angle is 0.1 <= w < 0.5.
  • moderately_left: if the rotation angle is 0.5 <= w < 1.
  • radically_left: if the rotation angle is w >= 1.
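
As a sketch, the labelling rule can be written as (illustrative function name; the actual labelling lives in the modified json files):

```python
def rotation_class(w):
    # map an angular speed to one of the 7 classes used for training
    if w <= -1:
        return 'radically_right'
    elif w <= -0.5:
        return 'moderately_right'
    elif w <= -0.1:
        return 'slightly_right'
    elif w < 0.1:
        return 'slight'
    elif w < 0.5:
        return 'slightly_left'
    elif w < 1:
        return 'moderately_left'
    return 'radically_left'
```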

Multiclass classification network

I've followed this blog https://www.pyimagesearch.com/2018/05/07/multi-label-classification-with-keras/ as an example to build the classification network. In this case I have trained a model with the 7 classes mentioned above.

The CNN architecture I am using is SmallerVGGNet, a simplified version of VGGNet. The VGGNet model was first introduced by Simonyan and Zisserman in their 2014 paper, Very Deep Convolutional Networks for Large Scale Image Recognition (https://arxiv.org/pdf/1409.1556/). In this case we save an image of the model with plot_model to see the architecture of the network. The model (classification_model.py) is as follows:

After training the network, we save the model (models/model_smaller_vgg_7classes_w.h5) and evaluate the model with the validation set:

We also show the graphs of loss and accuracy for training and validation according to the epochs:

In addition, we evaluate the accuracy, precision, recall, F1-score and we paint the confusion matrix. The results are the following:

Week 23: Improving driver node, classification network, and driver test

Driver node

The driver node has been modified to make one inference per cycle. To do this, the threadNetwork.py and classification_network.py files have been created. threadNetwork allows making a prediction per cycle by calling the predict method of the ClassificationNetwork class (classification_network.py).

Classification network

A file (add_classification_data.py) has been created to modify the data.json file and add the left/right classification. If w is positive the classification is left, while if w is negative the classification is right.

Once the dataset is complete, a file (https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/blob/master/Follow%20Line/dl-driver/Net/split_train_test.py) has been created to divide the data into train and test. It has been decided to split the dataset 70% for train and 30% for test. Since the dataset was 5006 pairs of values, we now have 3504 train pairs and 1502 test pairs. The train data is in https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/dl-driver/Net/Dataset/Train, and the test data is in https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/dl-driver/Net/Dataset/Test.
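
The split can be sketched as follows (illustrative; the real code is in the split_train_test.py file linked above). With 5006 pairs and a 70% fraction this yields exactly 3504 train and 1502 test pairs:

```python
import random

def split_train_test(pairs, train_fraction=0.7, seed=42):
    # shuffle reproducibly, then cut at the requested fraction
    data = list(pairs)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]
```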

The classification_train.py file has been created, which allows training a classification network that aims to differentiate between left and right. In this file, before training, we eliminate the pairs of values where the angle is close to 0 (with margin 0.08), because they are not very significant data. In addition, we split the train set 80% for training and 20% for validation.
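
The filtering of near-zero angles can be sketched as (illustrative helper, assuming image/angle pairs):

```python
def filter_significant(samples, margin=0.08):
    # drop image/angle pairs whose angular speed is too close to zero,
    # since they carry little left/right information
    return [(img, w) for (img, w) in samples if abs(w) > margin]
```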

In our case we will use a very small convnet with few layers and few filters per layer, together with dropout. Dropout helps reduce overfitting by preventing a layer from seeing the exact same pattern twice. Our model has a simple stack of 3 convolution layers with ReLU activations, each followed by a max-pooling layer. This is very similar to the architectures that Yann LeCun advocated in the 1990s for image classification (with the exception of ReLU). On top of it we add two fully-connected layers. We end the model with a single unit and a sigmoid activation, which is perfect for binary classification, and we use the binary_crossentropy loss to train the model. The model (classification_model.py) is as follows:

    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    # The model so far outputs 3D feature maps (height, width, features)

    model.add(Flatten()) # This converts our 3D feature maps to 1D feature vectors
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5)) # dropout to reduce overfitting
    model.add(Dense(1, activation='sigmoid')) # single unit for binary classification

    model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

After training the network, we save the model (models/model_classification.h5) and evaluate the model with the validation set:

We also show the graphs of loss and accuracy for training and validation according to the epochs:

In addition, the classification_test.py file has been created to evaluate the model on a data set that has not been seen by the network. In this file the test set is used; accuracy, precision, recall and F1-score are evaluated and the confusion matrix is painted. The results are the following:

Driver test

I've tried to go around the circuit with the Formula 1 based on the left/right predictions. To do this, a prediction is made in each iteration that tells us right or left. If the prediction is right we give a negative angular speed, and if it is left a positive one. At all times we keep a constant linear speed. The result is not good, as the car hits a curve.

Week 22: Improving driver node

Driver node

This week, I've improved the driver node. Now you can see the linear and angular speeds in the GUI, and you can save or remove the data (Dataset).

Week 21: Dataset generator and driver node

Dataset generator

The final goal of the project is to make a follow-line behaviour using Deep Learning. For this it is necessary to collect data. For this reason, I have based the dataset generator on the code created for the follow-line practice of JdeRobot Academy (http://vanessavisionrobotica.blogspot.com/2018/05/practica-1-follow-line-prueba-2.html). The created dataset (https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/Dataset) contains the input images with the corresponding linear and angular speeds (in a json file). To create this dataset I have written a Python file with the functions needed to build it (https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/blob/master/Follow%20Line/generator.py).

Driver node

In addition, I've created a driver node based on the objectdetector node, which allows connecting neural networks. For now the initial GUI looks like this:

Weeks 19,20: First steps with Follow Line

Reading some information about autonomous driving

These weeks, I've read the article "Self-Driving a Car in simulation through a CNN" and the TFG (https://ebuah.uah.es/dspace/handle/10017/33946) on which this article was based. This work researches Convolutional Neural Networks and their architecture in order to develop a new CNN capable of controlling the lateral and longitudinal movements of a vehicle in an open-source driving simulator (Udacity's Self-Driving Car Simulator), replicating human driving behavior.

Training data is collected by driving the simulator car manually (using a Logitech Force Pro steering wheel), obtaining images from a front-facing camera and synchronizing the steering angle and throttle values performed by the driver. The dataset is augmented by horizontal flipping, changing the sign of the steering angles and taking information from the left and right cameras of the car. The simulator provides a file (drive.py) in charge of establishing the communication between the vehicle and the neural network, so that the network receives the image captured by the camera and returns an acceleration value and a rotation angle.
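
The mirroring part of this augmentation can be sketched as follows (an assumption of how it is implemented; the steering angle changes sign, the throttle does not):

```python
import numpy as np

def flip_sample(image, steering, throttle):
    # mirror the frame horizontally and invert the steering sign;
    # the throttle value is kept unchanged
    return np.fliplr(image), -steering, throttle
```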

The CNNs tested in this project are described, trained and tested using Keras. The CNNs tested are:

  • TinyPilotNet: developed as a reduction of the NVIDIA PilotNet CNN used for self-driving a car. The TinyPilotNet network is composed of a 16x32x1 pixels image input (the image has a single channel formed by the saturation channel from the HSV color space), followed by two convolutional layers, a dropout layer and a flatten layer. The output of this architecture is formed by two fully connected layers that lead into a couple of neurons, each one of them dedicated to predicting the steering and throttle values respectively.

  • DeeperLSTM-TinyPilotNet: formed by more layers and a higher input resolution.
  • DeepestLSTM-TinyPilotNet: formed by three 3x3 kernel convolution layers, combined with maxpooling layers, followed by three 5x5 convolutional LSTM layers and two fully connected layers.
  • Long Short-Term Memory (LSTM) layers are added to the TinyPilotNet architecture with the aim of improving the CNN driving performance and predicting new values influenced by previous ones, and not just by the current input image. These layers are located at the end of the network, prior to the fully-connected layers. During training, the dataset is used sequentially, not shuffled.

The training data is an essential part of the correct performance of a convolutional neural network. However, collecting a large amount of data is a great difficulty. Data augmentation allows you to modify or increase the training information of the network from a previously obtained data bank, so that the CNN can more easily learn what features to extract to obtain the expected results. The training data treatments carried out in the study are:

  • RGB image: the input image of the network, which previously took only the saturation channel of the HSV color space, is replaced by an image with 3 RGB color channels. To implement this in a CNN, it is simply necessary to modify the first layer of the network.
  • Increase in resolution: it implies a modification of the input layer's dimension.
  • Image cropping: consists of extracting the specific area of the image where the information relevant to the CNN is considered to be concentrated. The image that the CNN analyzes only contains information about the road, eliminating the landscape part of the frame.
  • Edge detector filter: consists of extracting the edges of the input image to highlight them on the original image. This is achieved by using a Canny filter.

In order to compare the performance of the CNN with other networks, a frame-to-frame comparison is made between the CNN steering angle and throttle values and the human values taken as ground truth. The root mean square error (RMSE) metric is used, computed from the difference between the steering and throttle values predicted by the CNN and those given by the human driver. Driving data collected by a human driver is therefore needed to compare the values given by the CNN for each frame with the values the human used. This parameter does not evaluate the ability of the network to use previous steering and throttle values to predict the new ones, so the LSTM layers have no effect here and appear underrated in comparison with CNNs that do not use this kind of layer.
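The frame-to-frame RMSE comparison can be sketched as follows (an illustrative helper; the value lists below are made-up numbers, not project data):

```python
import math

def rmse(predicted, ground_truth):
    """Root mean square error between CNN predictions and human values."""
    assert len(predicted) == len(ground_truth)
    total = sum((p - g) ** 2 for p, g in zip(predicted, ground_truth))
    return math.sqrt(total / len(predicted))

# Example: per-frame steering angles predicted by the CNN vs. the human driver
cnn_steering = [0.10, -0.05, 0.20]
human_steering = [0.12, -0.02, 0.18]
print(rmse(cnn_steering, human_steering))
```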

To solve this problem, new quality parameters have been proposed to quantify the performance of the network driving the vehicle. These parameters are measured with the information extracted from the simulator while the network is being tested. To calculate them, center points of the road -named waypoints- are needed, spaced 2 meters apart. These new metrics are:

  • Center of road deviation: the shifting from the center of the road. The lower this parameter is, the better the performance will be, because the car will drive in the center of the road instead of at its limits. To calculate the deviation, the nearest waypoint is found and the distance is computed between the vehicle and the segment bounded by that waypoint and the previous or next one.
  • Heading angle: the lower this parameter, the better the performance will be, because lower heading angle means softer driving, knowing better the direction to follow.
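Both quality parameters rely on simple geometry over the waypoints. A minimal sketch of the center-of-road deviation, assuming 2-D waypoints and a 2-D vehicle position (the names are illustrative, not from the project code):

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment bounded by waypoints a and b."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the parameter to [0, 1]
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def center_deviation(vehicle, waypoints):
    """Deviation from the road center: distance to the nearest waypoint segment."""
    return min(point_segment_distance(vehicle, a, b)
               for a, b in zip(waypoints, waypoints[1:]))
```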

To train the network, data collection is carried out by manual driving in the simulator, traveling several laps of the circuit to obtain a large volume of information. This information contains recordings of dashboard images collected at 14 FPS by a central camera and two lateral ones located on both sides of the vehicle, together with the steering wheel angle, acceleration, brake and absolute speed linked to each image. All CNNs have been trained using the left, center and right images of the vehicle, applying a 5° offset to the steering angle for the lateral images. In addition, the training dataset has been increased by performing a horizontal image flip (mirroring) on the images, also inverting the steering wheel angle value but maintaining the acceleration value.
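The mirroring and side-camera offset described above can be sketched like this (a simplified illustration: images are plain nested lists, and the sign convention of the 5° correction is an assumption, not taken from the project code):

```python
def augment_sample(image, steering_deg, throttle):
    """Return the original sample plus a mirrored one.

    The mirrored image has the steering angle sign inverted, while the
    throttle (acceleration) value is kept unchanged.
    """
    mirrored = [row[::-1] for row in image]  # horizontal flip of each row
    return [(image, steering_deg, throttle),
            (mirrored, -steering_deg, throttle)]

def side_camera_label(steering_deg, side):
    """Apply a 5-degree correction for left/right camera images (sign assumed)."""
    offset = 5.0
    return steering_deg + offset if side == 'left' else steering_deg - offset
```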

In order to determine the performance of the CNNs using the metrics previously established, different experiments are made with different CNNs for steering wheel control:

  • Steering wheel control with a CNN: the results of these modifications are analyzed on the control of the steering wheel angle, setting an acceleration of 5% that makes the vehicle circulate at its maximum speed on the circuit. The following networks are analyzed: TinyPilotNet (16x32 pixel input image), Higher resolution TinyPilotNet (20x40 pixel input image), HD-TinyPilotNet (64x128 pixel input image), RGB-TinyPilotNet (3-channel input image), LSTM-TinyPilotNet (LSTM layers are added at the output of the network), DeeperLSTM-TinyPilotNet (combines the effects of the LSTM network and the higher resolution network, increasing the size of the input image up to 20x40 pixels), Cropping-DeeperLSTM-TinyPilotNet (similar to the DeeperLSTM-TinyPilotNet network above, but with the image cropping effect applied to its input), and Edge-DeeperLSTM-TinyPilotNet (edge detection is performed and a sum-type fusion is made, highlighting the edges in the image). It is observed that the only network that improves the RMSE is the TinyPilotNet with higher resolution. However, visually in the simulator it can be seen that driving is much better with the CNNs containing LSTM layers. The use of RGB-TinyPilotNet, as well as HD-TinyPilotNet, is discarded, as these networks aren't able to guide the vehicle without leaving the road. Following the criterion of improving the average error with respect to the center of the lane for the rest of the networks, the order of performance from best to worst is the following: 1.DeeperLSTM-TinyPilotNet, 2.Cropping-DeeperLSTM-TinyPilotNet, 3.TinyPilotNet, 4.LSTM-TinyPilotNet, 5.Higher resolution TinyPilotNet, 6.Edge-DeeperLSTM-TinyPilotNet. 
Based on the pitch angle, once the RGB network has been discarded, the order of the different networks from best to worst performance is as follows: 1.Higher resolution TinyPilotNet, 2.DeeperLSTM-TinyPilotNet, 3.Edge-DeeperLSTM-TinyPilotNet, 4.LSTM-TinyPilotNet, 5.Cropping-DeeperLSTM-TinyPilotNet, 6.TinyPilotNet.
  • Control of steering wheel angle and acceleration through disconnected CNNs: the steering wheel angle and the acceleration of the vehicle are controlled simultaneously at each moment. For this, two convolutional neural networks are configured, each specialized in one type of displacement. Based on the results obtained in the steering wheel control, the following networks are tested: TinyPilotNet, Higher resolution TinyPilotNet, DeeperLSTM-TinyPilotNet, and Edge-DeeperLSTM-TinyPilotNet. The only network that improves the RMSE is the Higher resolution TinyPilotNet. Networks that include LSTM layers aren't able to keep the vehicle inside the circuit. Higher resolution TinyPilotNet equals TinyPilotNet with respect to the deviation from the lane center. None of the trained CNNs is capable of improving the pitch factor of the simple network. In conclusion, training separate networks to control longitudinal and lateral displacement is not a reliable method.
  • Control of the steering wheel angle and acceleration through the same CNN: the vehicle is controlled through a single convolutional neural network with two outputs, one for the steering angle and one for the acceleration. For this, it is necessary to slightly modify the architecture of the previously used networks, changing the output neuron for a pair of neurons. The networks trained following this method are: TinyPilotNet, Higher resolution TinyPilotNet, DeeperLSTM-TinyPilotNet, Edge-DeeperLSTM-TinyPilotNet, DeepestLSTM-TinyPilotNet, and Edge-DeepestLSTM-TinyPilotNet. The only network that improves the RMSE is the Higher resolution TinyPilotNet. TinyPilotNet, DeeperLSTM-TinyPilotNet and the networks trained with edge detection aren't able to keep the vehicle on the road. Following the criterion of improving the average error with respect to the center of the lane for the rest of the networks, the order of performance from best to worst is the following: 1.DeepestLSTM-TinyPilotNet, 2.Higher resolution TinyPilotNet. The only networks that improve the average pitch parameter of TinyPilotNet are the ones that include a greater number of layers. The DeepestLSTM-TinyPilotNet network improves pitch by up to 36%, producing a smoother and more natural circulation without applying edge detection to the information. The use of a network with LSTM layers and a greater number of configurable parameters produces more natural driving, with less pitch and smaller deviation. However, controlling both the steering wheel and the accelerator makes the results slightly worse than those obtained when controlling the steering wheel exclusively.

In conclusion:

  • A slight increase in the resolution of the input image produces notable improvements in both quality factors without assuming a significant increase in the size of the network or the processing time.
  • The inclusion of Long Short-Term Memory (LSTM) layers at the output of a convolutional neural network makes the predictions depend on the values previously produced by it, which leads to smoother driving.
  • Using an RGB input image instead of only the saturation channel of the HSV color space results in a poorer understanding of the environment by the CNN, which leads to bad driving. When using the saturation channel, the road remains highlighted in black, while the exterior takes lighter colors, producing a simple distinction of the circuit.
  • The information about acceleration doesn't produce better control of the steering angle.
  • To obtain evaluation metric values similar to those of a CNN that only controls the steering wheel with a CNN that controls both steering wheel and accelerator, it is necessary to increase the depth of the network.
  • The greater definition of edges in the input image through the Canny filter doesn't produce significant improvement.
  • Cropping the input image to extract only the road information doesn't improve driving.
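The saturation-channel preprocessing mentioned in the conclusions can be sketched with Python's standard colorsys module (an illustrative single-frame version; pixel values are assumed to be floats in [0, 1], while the real code works on whole camera frames):

```python
import colorsys

def saturation_channel(rgb_image):
    """Replace each RGB pixel (values in [0, 1]) with its HSV saturation."""
    return [[colorsys.rgb_to_hsv(r, g, b)[1] for (r, g, b) in row]
            for row in rgb_image]

# A gray road pixel has low saturation, a green landscape pixel high saturation,
# so the road stands out clearly in the saturation channel.
frame = [[(0.5, 0.5, 0.5), (0.1, 0.8, 0.1)]]
print(saturation_channel(frame))
```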

Creation of the synthetic dataset[edit]

I have created a synthetic dataset. To build it, I created a background image and wrote a script (https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/blob/master/Follow%20Line/First%20steps/dataset_generator.py) that modifies this background and adds a road line. This code generates a dataset of 200 images with lines at different angles. The angle of the road in each image has been saved in a txt file. Next, we can see an example of the images.
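A minimal sketch of how such images can be generated (a hypothetical, simplified version of dataset_generator.py: a binary image as a nested list, with a line drawn through the center at a given angle and the angle kept as the label):

```python
import math

def draw_line(width, height, angle_deg):
    """Draw a line through the image center at the given angle (0 = vertical)."""
    img = [[0] * width for _ in range(height)]
    cx = width / 2.0
    slope = math.tan(math.radians(angle_deg))
    for y in range(height):
        x = int(round(cx + slope * (y - height / 2.0)))
        if 0 <= x < width:
            img[y][x] = 1  # line pixel
    return img

# Generate a tiny dataset of (image, angle label) pairs
angles = [-30, 0, 30]
dataset = [(draw_line(20, 20, a), a) for a in angles]
```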



Getting started with Neural Network for regression[edit]

Also, I've started to study neural networks for regression. A regression model allows us to predict a continuous value based on data that it already knows. I've followed a tutorial (https://medium.com/@rajatgupta310198/getting-started-with-neural-network-for-regression-and-tensorflow-58ad3bd75223) that creates a neural network regression model on financial data (https://in.finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI&guccounter=1). The code can be found at https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Examples%20Deep%20Learning/Neural%20Network%20Regression/Tensorflow. Next, the code is explained:

At the beginning, we preprocess the data and keep 60% of the data for training and 40% for testing.
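The sequential 60/40 split can be sketched as follows (illustrative; the tutorial performs its own preprocessing first):

```python
def train_test_split(data, train_fraction=0.6):
    """Split a dataset sequentially: the first 60% for training, the rest for test."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

train, test = train_test_split(list(range(10)))
```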

We built our neural net model:

  • tf.Variable creates a variable whose value will change during the optimization steps.
  • tf.random_uniform generates random numbers from a uniform distribution with the specified dimensions ([input_dim, number_of_nodes_in_layer]).
  • tf.zeros creates zeros of the specified dimension (a vector of (1, number_of_hidden_nodes)).
  • tf.add() adds two parameters.
  • tf.matmul() multiplies two matrices (the weight matrix and the input data matrix).
  • tf.nn.relu() is the activation function applied after the multiplication and addition of weights and biases.
  • tf.placeholder() defines a gateway for feeding data into the graph.
  • tf.reduce_mean() and tf.square() are the mean and square functions.
  • tf.train.GradientDescentOptimizer() is the class for applying gradient descent.
  • GradientDescentOptimizer() has a minimize() method to minimize the target/cost function.

We train the neural network by iterating through each sample in the dataset. Two for loops are used: one for the epochs and one to iterate over each sample. Completion of the outer loop signifies that an epoch is complete.

  • tf.Session() initiates the current session.
  • sess.run() runs elements of the graph.
  • tf.global_variables_initializer() initializes all variables.
  • The tf.train.Saver() class helps us save our model.
  • sess.run([cost,train],feed_dict={xs:X_train[j,:], ys:y_train[j]}) runs the cost and train steps, feeding data to the neural network one sample at a time.
  • sess.run(output,feed_dict={xs: X_train}) runs the neural network, feeding it only the features from the dataset.
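The two-loop training scheme (outer loop over epochs, inner loop over samples) can be sketched framework-free. This illustrative toy fits a single-weight linear model with stochastic gradient descent instead of the tutorial's TensorFlow graph:

```python
def train(xs, ys, epochs=2000, lr=0.05):
    """Minimize the squared cost (w*x + b - y)^2 one sample at a time."""
    w, b = 0.0, 0.0
    for epoch in range(epochs):      # outer loop: one pass = one epoch
        for x, y in zip(xs, ys):     # inner loop: one sample at a time
            error = (w * x + b) - y
            w -= lr * 2 * error * x  # gradient of the squared error w.r.t. w
            b -= lr * 2 * error      # gradient of the squared error w.r.t. b
    return w, b

# Fit y = 2x + 1 from a few noiseless samples; converges towards w ≈ 2, b ≈ 1
w, b = train([0, 1, 2], [1, 3, 5])
```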

So finally we have completed our neural net in Tensorflow for predicting the stock market price. The result can be seen in the following graph:

Weeks 17,18: Understanding LSTM[edit]

Understanding LSTM[edit]

Sometimes it is necessary to use previous information to process the current information. Traditional neural networks can not do this, and it seems an important deficiency. Recurrent neural networks address this problem. They are networks with loops that allow the information to persist. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. In the last few years, there have been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning ... Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version.

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task. Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large. Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are absolutely capable of handling such “long-term dependencies”. In practice, RNNs don’t seem to be able to learn them. LSTMs don’t have this problem.

LSTMs (Long Short-Term Memory networks) (http://www.bioinf.jku.at/publications/older/2604.pdf, https://colah.github.io/posts/2015-08-Understanding-LSTMs/) are a type of RNN (Recurrent Neural Network) architecture that addresses the vanishing/exploding gradient problem and allows learning of long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997). They work very well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior. All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure. LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

The main idea consists of a memory cell, a block which can maintain its state over time. The key to LSTMs is the cell state (the horizontal line running through the top of the diagram). The cell state runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through”. An LSTM has three of these gates, to protect and control the cell state.

The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer”. The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

It’s now time to update the old cell state, Ct−1, into the new cell state Ct. Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
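The gate equations walked through above can be sketched for a scalar LSTM cell (a didactic toy, not one of the project networks; the weights are made-up numbers):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step for scalar input/state; W holds (wx, wh, b) per gate."""
    def gate(name, act):
        wx, wh, b = W[name]
        return act(wx * x + wh * h_prev + b)
    f = gate('forget', sigmoid)      # what to throw away from the cell state
    i = gate('input', sigmoid)       # which values to update
    g = gate('candidate', math.tanh) # new candidate values
    o = gate('output', sigmoid)      # which parts of the cell state to output
    c = f * c_prev + i * g           # update the cell state
    h = o * math.tanh(c)             # filtered output
    return h, c

# Arbitrary illustrative weights, identical for every gate
W = {name: (0.5, 0.1, 0.0) for name in ('forget', 'input', 'candidate', 'output')}
h, c = lstm_step(1.0, 0.0, 0.0, W)
```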

First example of LSTM[edit]

To implement a first simple example of this kind of network, a tutorial (https://www.knowledgemapper.com/knowmap/knowbook/jasdeepchhabra94@gmail.comUnderstandingLSTMinTensorflow(MNISTdataset)) has been followed, in which we discover how to develop an LSTM network in Tensorflow. In this example, we use MNIST as our dataset. The result of this implementation can be found at https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/blob/master/Examples%20Deep%20Learning/LSTM/Tensorflow/lstm_mnist.py.

Week 16: Follow line dataset[edit]

The final goal of the project is to implement line-following using Deep Learning. For this it is necessary to collect data, so I have based my work on the code created for the follow-line practice of JdeRobot Academy (http://vanessavisionrobotica.blogspot.com/2018/05/practica-1-follow-line-prueba-2.html) in order to create a dataset. The created dataset contains the input images with the corresponding linear and angular speeds. In addition, a dataset has been created that contains images with a single row, also with the corresponding linear and angular speeds. The dataset is at: https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/follow%20line/dataset. In the file speeds.txt, each line contains the linear speed and the angular speed corresponding to an image in the Image folder. The same goes for the single-row images in the Images_640_1 folder.
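Reading the dataset back can be sketched as follows (assuming each line of speeds.txt holds the linear and angular speed separated by whitespace; the exact file format may differ):

```python
def load_speeds(lines):
    """Parse lines of 'linear angular' pairs into float tuples."""
    speeds = []
    for line in lines:
        v, w = line.split()[:2]
        speeds.append((float(v), float(w)))
    return speeds

# Each entry pairs with the i-th image in the Image folder
sample = ["5.0 0.12", "4.5 -0.30"]
print(load_speeds(sample))
```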

Week 15: Read papers about Deep Learning for Steering Autonomous Vehicles, CNN with Tensorflow[edit]

This week, I read some papers about Deep Learning for Steering Autonomous Vehicles. Some of these papers are:

  • End-to-End Deep Learning for Steering Autonomous Vehicles Considering Temporal Dependencies (https://arxiv.org/pdf/1710.03804.pdf): In this work, they propose a Convolutional Long Short-Term Memory Recurrent Neural Network (C-LSTM), that is end-to-end trainable, to learn both visual and dynamic temporal dependencies of driving. To train and validate their proposed methods, they used the publicly available Comma.ai dataset. The system they propose is comprised of a front-facing RGB camera and a composite neural network consisting of a CNN and an LSTM network that estimate the steering wheel angle based on the camera input. Camera images are processed frame by frame by the CNN. The CNN is pre-trained on the Imagenet dataset, which features 1.2 million images of approximately 1000 different classes and allows for recognition of a generic set of features and a variety of objects with high precision. Then, they transfer the trained neural network from that broad domain to another specific one focusing on driving scene images. The LSTM then processes a sequence of w fixed-length feature vectors (sliding window) from the CNN. In turn, the LSTM layers learn to recognize temporal dependences leading to a steering decision Yt based on the inputs from Xt−w to Xt. Small values of w lead to faster reactions, but the network learns only short-term dependences and the susceptibility to individually misclassified frames increases, whereas large values of w lead to a smoother behavior, and hence more stable steering predictions, but increase the chance of learning wrong long-term dependences. The sliding window concept allows the network to learn to recognize different steering angles from the same frame Xi but at different temporal states of the LSTM layers. For the domain-specific training, the classification layer of the CNN is re-initialized and trained on camera road data. 
Training of the LSTM layer is conducted in a many-to-one fashion; the network learns the steering decisions that are associated with intervals of driving.

  • Reactive Ground Vehicle Control via Deep Networks (https://pdfs.semanticscholar.org/ec17/ec40bb48ec396c626506b6fe5386a614d1c7.pdf): They present a deep learning based reactive controller that uses a simple network architecture requiring few training images. Despite its simple structure and small size, their network architecture, called ControlNet, outperforms more complex networks in multiple environments using different robot platforms. They evaluate ControlNet in structured indoor environments and unstructured outdoor environments. This paper focuses on the low-level task of reactive control, where the robot must avoid obstacles that were not present during map construction such as dynamic obstacles and items added to the environment after map construction. ControlNet abstracts RGB images to generate control commands: turn left, turn right, and go straight. ControlNet’s architecture consists of alternating convolutional layers with max pooling layers, followed by two fully connected layers. The convolutional and pooling layers extract geometric information about the environment while the fully connected layers act as a general classifier. A long short-term memory (LSTM) layer allows the robot to incorporate temporal information by allowing it to continue moving in the same direction over several frames. ControlNet has 63223 trainable parameters.

  • Explaining How a Deep Neural Network Trained with End-to-End Learning Steers a Car (https://arxiv.org/pdf/1704.07911.pdf): NVIDIA has created a neural-network-based system, known as PilotNet, which outputs steering angles given images of the road ahead. PilotNet is trained using road images paired with the steering angles generated by a human driving a data-collection car. It derives the necessary domain knowledge by observing human drivers. Road tests demonstrated that PilotNet can successfully perform lane keeping in a wide variety of driving conditions, regardless of whether lane markings are present or not. PilotNet training data contains single images sampled from video from a front-facing camera in the car, paired with the corresponding steering command (1/r), where r is the turning radius of the vehicle. The training data is augmented with additional image/steering-command pairs that simulate the vehicle in different off-center and off-orientation positions. The PilotNet network consists of 9 layers, including a normalization layer, 5 convolutional layers and 3 fully connected layers. The input image is split into YUV planes and passed to the network. The central idea in discerning the salient objects is finding parts of the image that correspond to locations where the feature maps have the greatest activations. 
The activations of the higher-level maps become masks for the activations of lower levels using the following algorithm: (1) in each layer, the activations of the feature maps are averaged; (2) the topmost averaged map is scaled up to the size of the map of the layer below; (3) the up-scaled averaged map from an upper level is then multiplied with the averaged map from the layer below; (4) the intermediate mask is scaled up to the size of the maps of the layer below in the same way as described in Step (2); (5) the up-scaled intermediate map is again multiplied with the averaged map from the layer below; (6) Steps (4) and (5) are repeated until the input is reached. The last mask, which is of the size of the input image, is normalized to the range from 0.0 to 1.0 and becomes the final visualization mask (showing which regions of the input image contribute most to the output of the network).

  • Deep Steering: Learning End-to-End Driving Model from Spatial and Temporal Visual Cues (https://arxiv.org/pdf/1708.03798.pdf): in this work they focus on a vision-based model that directly maps raw input images to steering angles using deep networks. First, the model is learned and evaluated on real human driving videos that are time-synchronized with other vehicle sensors. This differs from many prior models trained from synthetic data in racing games. Second, state-of-the-art models, such as PilotNet, mostly predict the wheel angles independently for each video frame, which contradicts the common understanding of driving as a stateful process. Instead, their proposed model strikes a combination of spatial and temporal cues, jointly investigating instantaneous monocular camera observations and the vehicle's historical states. This is in practice accomplished by inserting carefully-designed recurrent units (e.g., LSTM and Conv-LSTM) at proper network layers. Third, to facilitate the interpretability of the learned model, they utilize a visual back-propagation scheme for discovering and visualizing image regions crucially influencing the final steering prediction.

  • Agile Autonomous Driving using End-to-End Deep Imitation Learning (https://arxiv.org/pdf/1709.07174.pdf): they present an end-to-end imitation learning system for agile, off-road autonomous driving using only low-cost on-board sensors. By imitating a model predictive controller equipped with advanced sensors, they train a deep neural network control policy to map raw, high-dimensional observations to continuous steering and throttle commands. Compared with recent approaches to similar tasks, their method requires neither state estimation nor on-the-fly planning to navigate the vehicle. Their approach relies on, and experimentally validates, recent imitation learning theory.
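The visualization-mask algorithm from the PilotNet paper above can be sketched in one dimension (an illustrative simplification: feature maps are 1-D lists, layers are ordered bottom-to-top, and upscaling is nearest-neighbor by a factor of 2):

```python
def average_maps(maps):
    """Step (1): average the feature maps of one layer elementwise."""
    n = len(maps)
    return [sum(vals) / n for vals in zip(*maps)]

def upscale(m, factor=2):
    """Steps (2)/(4): nearest-neighbor upscaling to the size of the layer below."""
    return [v for v in m for _ in range(factor)]

def visualization_mask(layers):
    """Multiply averaged maps from the top layer down to the input resolution."""
    averaged = [average_maps(layer) for layer in layers]
    mask = averaged[-1]                      # topmost averaged map
    for lower in reversed(averaged[:-1]):
        # steps (3)/(5): up-scale and multiply with the averaged map below
        mask = [a * b for a, b in zip(upscale(mask), lower)]
    peak = max(mask) or 1.0
    return [v / peak for v in mask]          # normalize to [0, 1]

layers = [[[0.0, 1.0, 0.0, 1.0]],            # lower layer (4 activations)
          [[1.0, 2.0]]]                      # upper layer (2 activations)
print(visualization_mask(layers))
```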

In addition, I followed the Tensorflow convolutional neural networks tutorial (https://www.tensorflow.org/tutorials/layers). In this tutorial, I've learnt how to use layers to build a convolutional neural network model to recognize the handwritten digits in the MNIST data set. As the model trains, you'll see log output like the following:

INFO:tensorflow:loss = 2.36026, step = 1
INFO:tensorflow:probabilities = [[ 0.07722801  0.08618255  0.09256398, ...]]
INFO:tensorflow:loss = 2.13119, step = 101
INFO:tensorflow:global_step/sec: 5.44132
INFO:tensorflow:Saving checkpoints for 20000 into /tmp/mnist_convnet_model/model.ckpt.
INFO:tensorflow:Loss for final step: 0.14782684.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-06-01-15:31:44
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/mnist_convnet_model/model.ckpt-20000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-06-01-15:31:53
INFO:tensorflow:Saving dict for global step 20000: accuracy = 0.9695, global_step = 20000, loss = 0.10200113
{'loss': 0.10200113, 'global_step': 20000, 'accuracy': 0.9695}

Here, I've achieved an accuracy of 96.95% on our test data set.

Week 14: Testing DetectionSuite[edit]

This week, I've been testing DetectionSuite. To test it, you need to run DatasetEvaluationApp. To check the operation of this app you have to execute:

./DatasetEvaluationApp --configFile=appConfig.txt

My appConfig.txt is:







To check its operation, you have to select the following:

However, I have had some errors in execution with CUDA:

mask_scale: Using default '1,000000'
CUDA Error: no kernel image is available for execution on the device
CUDA Error: no kernel image is available for execution on the device: El archivo ya existe

Week 13: Reinstall DetectionSuite[edit]

This week, I reinstalled Darknet and DetectionSuite. You can find the installation information in the following link: https://github.com/JdeRobot/dl-DetectionSuite/blob/master/DeepLearningSuite/Dockerfile/Dockerfile.

Week 12: GUI with C++[edit]

This week, I created some GUIs with C++. I followed these tutorials: http://zetcode.com/gui/gtk2/introduction/ and https://developer.gnome.org/gtkmm-tutorial/stable/index.html.en.

Week 11: Component for object detection[edit]

This week, I created a component in Python for people detection using an SSD network. The component can be seen in the following video:

Week 10: Embedding Python in C++, SSD[edit]

This week, I searched for information on embedding Python code in C++. You can find information in the following links: https://docs.python.org/2/extending/embedding.html, https://www.codeproject.com/Articles/11805/Embedding-Python-in-C-C-Part-I, https://realmike.org/blog/2012/07/05/supercharging-c-code-with-embedded-python/, https://skebanga.github.io/embedded-python-pybind11/. I have made a simple example (hello.cpp):

#include <stdio.h>
#include <python2.7/Python.h>

int main()
{
    /* Start the embedded Python interpreter */
    Py_Initialize();

    PyRun_SimpleString("print('Hello World from Embedded Python!!!')");

    printf("\nPress any key to exit...\n");

    /* Shut the interpreter down */
    Py_Finalize();
    return 0;
}

It is possible to have some compilation errors if you use:

gcc hello.cpp -o hello

That's why I compiled in the following way:

gcc hello.cpp -o hello -L/usr/lib/python2.7/config/ -lpython2.7

The result is:

Hello World from Embedded Python!!!

Press any key to exit...

I have also made another more complex example:

  • C++ code:
#include <python2.7/Python.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    PyObject *pName, *pModule, *pFunc;
    PyObject *pArgs, *pValue;
    int i;

    if (argc < 3) {
        fprintf(stderr, "Usage: call pythonfile funcname [args]\n");
        return 1;
    }

    Py_Initialize();
    pName = PyString_FromString(argv[1]);
    /* Error checking of pName left out */

    pModule = PyImport_Import(pName);
    Py_DECREF(pName);

    if (pModule != NULL) {
        pFunc = PyObject_GetAttrString(pModule, argv[2]);
        /* pFunc is a new reference */

        if (pFunc && PyCallable_Check(pFunc)) {
            pArgs = PyTuple_New(argc - 3);
            for (i = 0; i < argc - 3; ++i) {
                pValue = PyInt_FromLong(atoi(argv[i + 3]));
                if (!pValue) {
                    fprintf(stderr, "Cannot convert argument\n");
                    return 1;
                }
                /* pValue reference stolen here: */
                PyTuple_SetItem(pArgs, i, pValue);
            }
            pValue = PyObject_CallObject(pFunc, pArgs);
            if (pValue != NULL) {
                printf("Result of call: %ld\n", PyInt_AsLong(pValue));
                Py_DECREF(pValue);
            }
            else {
                PyErr_Print();
                fprintf(stderr, "Call failed\n");
                return 1;
            }
        }
        else {
            if (PyErr_Occurred())
                PyErr_Print();
            fprintf(stderr, "Cannot find function \"%s\"\n", argv[2]);
        }
    }
    else {
        PyErr_Print();
        fprintf(stderr, "Failed to load \"%s\"\n", argv[1]);
        return 1;
    }
    Py_Finalize();
    return 0;
}
  • Python code:
def multiply(a,b):
    print "Will compute", a, "times", b
    c = 0
    for i in range(0, a):
        c = c + b
    return c

I have compiled the code by putting in the terminal:

gcc multiply.cpp -o multiply -L/usr/lib/python2.7/config/ -lpython2.7

To execute:

./multiply multiply multiply 3 4

It is possible to have some errors. If you have errors, like ImportError: No module named multiply, you have to use:

PYTHONPATH=. ./multiply multiply multiply 3 4

And the result is:

Will compute 3 times 4
Result of call: 12

Also, this week I made a simple SSD script to detect cars, based on the implementation at https://github.com/ksketo/CarND-Vehicle-Detection. This code can be found at https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Simple%20net%20Object%20detection/car_detection. To detect the cars we execute the code detectionImages.py. The result is as follows:

Also, I made a simple SSD to detect persons. The result is as follows:

Weeks 8,9: State of the art, crop datasets, and installation of DetectionSuite[edit]

These weeks, I have made the state of the art of Deep Learning for object detection. The state of the art is available in the following link: https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/State%20of%20the%20art .

Also, this week I've been looking for information on the crop dataset. Some datasets are:

  • FAOSTAT [2]: provides free access to food and agriculture data for over 245 countries and territories and covers all FAO regional groupings from 1961 to the most recent year available.
  • Covertype Data Set [3]: includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices. As for primary major tree species in these areas, Neota would have spruce/fir (type 1), while Rawah and Comanche Peak would probably have lodgepole pine (type 2) as their primary species, followed by spruce/fir and aspen (type 5). Cache la Poudre would tend to have Ponderosa pine (type 3), Douglas-fir (type 6), and cottonwood/willow (type 4).
  • Plants Data Set [4]: Data has been extracted from the USDA plants database. It contains all plants (species and genera) in the database and the states of USA and Canada where they occur.
  • Urban Land Cover Data Set [5]: Contains training and testing data for classifying a high resolution aerial image into 9 types of urban land cover. Multi-scale spectral, size, shape, and texture information is used for classification. There is a low number of training samples for each class (14-30) and a high number of classification variables (148), so it may be an interesting data set for testing feature selection methods. The testing data set is from a random sampling of the image. Class is the target classification variable. The land cover classes are: trees, grass, soil, concrete, asphalt, buildings, cars, pools, shadows.
  • Area, Yield and Production of Crops [6]: Area under crops, yield of crops in tonnes per hectare, and production of crops in tonnes, classified by type of crop.
  • Target 8 Project cultivation database [7]: has been written in Microsoft Access, and provides a valuable source of information on habitats, ecology, reproductive attributes, germination protocol and so forth (see below). Attributes can be compared, and species from similar habitats can be quickly listed. This may help you in achieving the right conditions for a species by utilising your knowledge of more familiar plants. The Target 8 dataset form presents the data for the project species only, with a sub-form giving a list of submitted protocols. A full dataset of the British and Irish flora is also available (Entire dataset form), which gives the same data for non-threatened taxa as well as many neophytes.
  • Datasets of Nelson Institute University of Wisconsin-Madison [8]: There are some datasets for crop or vegetation.
  • Global data set of monthly irrigated and rainfed crop areas around the year 2000 (MIRCA2000) [9][10]: is a global dataset with spatial resolution of 5 arc minutes (about 9.2 km at the equator) which provides both irrigated and rainfed crop areas of 26 crop classes for each month of the year. The crops include all major food crops (wheat, maize, rice, barley, rye, millet, sorghum, soybean, sunflower, potato, cassava, sugarcane, sugarbeet, oil palm, canola, groundnut, pulses, citrus, date palm, grape, cocoa, coffee, other perennials, fodder grasses, other annuals) as well as cotton. For some crops no adequate statistical information was available to retain them individually and these crops were grouped into ‘other’ categories (perennial, annual and fodder grasses).
  • Root crops and plants harvested green from arable land by area [11]: covers the areas sown with potatoes (including seed), sugar beet (excluding seed), temporary grasses for grazing, hay or silage (in the crop rotation cycle), leguminous plants grown and harvested green (as the whole plant, mainly for fodder, energy or green manuring use), and other cereals harvested green (excluding green maize): rye, wheat, triticale, annual sorghum, buckwheat, etc. This indicator uses the concept of "area under cultivation", which corresponds: • before the harvest, to the sown area; • after the harvest, to the sown area excluding the non-harvested area (e.g. area ruined by natural disasters, area not harvested for economic reasons, etc.).

Also, I installed DetectionSuite. I have followed the steps indicated in the following link https://github.com/JdeRobot/dl-DetectionSuite. It is important to install version 9 of CUDA, because other versions will probably give an error.

Weeks 5,6,7: SSD and R-CNN[edit]

These weeks, I've studied some object detection algorithms. Object detection is the process of finding instances of real-world objects in images or videos. Object detection algorithms usually use extracted features and learning algorithms to recognize instances of an object category. There are different algorithms for object detection. Some of them are:

  • R-CNN (Recurrent Convolutional Neural Network) [12]: is a special type of CNN that is able to locate and detect objects in images: the output is generally a set of bounding boxes that closely match each of the detected objects, as well as a class output for each detected object. Though the input is static, the activities of RCNN units evolve over time so that the activity of each unit is modulated by the activities of its neighboring units. This property enhances the ability of the model to integrate the context information, which is important for object recognition. Like other recurrent neural networks, unfolding the RCNN through time can result in an arbitrarily deep network with a fixed number of parameters. Furthermore, the unfolded network has multiple paths, which can facilitate the learning process.
  • Single Shot Multibox Detector (SSD) [13]: discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component.
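As an illustration of the default-box idea described above, the following sketch (my own illustrative code, not taken from the SSD paper; the scale and aspect-ratio values are arbitrary assumptions) enumerates the default boxes for a single square feature map:

```python
# Illustrative sketch of SSD-style default boxes (not the reference code).
# For each cell of an fmap_size x fmap_size feature map, one box is
# generated per aspect ratio, all sharing the same scale.

def default_boxes(fmap_size, scale, aspect_ratios):
    """Return (cx, cy, w, h) boxes in normalized [0, 1] image coordinates."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx = (j + 0.5) / fmap_size  # box centered on the cell
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                w = scale * ar ** 0.5   # wider box for ar > 1
                h = scale / ar ** 0.5   # taller box for ar < 1
                boxes.append((cx, cy, w, h))
    return boxes

# An 8x8 map with 3 aspect ratios yields 8 * 8 * 3 = 192 default boxes
boxes = default_boxes(fmap_size=8, scale=0.2, aspect_ratios=[1.0, 2.0, 0.5])
print(len(boxes))  # 192
```

At prediction time the network scores each of these boxes per class and regresses offsets to them; combining several feature maps of different resolutions gives default boxes of several sizes.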

Week 4: Starting with DetectionSuite[edit]

DeepLearning Suite (https://github.com/JdeRobot/DeepLearningSuite) is a set of tools that simplifies the evaluation of the most common object detection datasets with several object detection neural networks. It offers a generic infrastructure to evaluate object detection algorithms against a dataset and compute the most common statistics: precision and recall. DeepLearning Suite supports YOLO (darknet) and background subtraction. I've installed YOLO (darknet) following these steps:

git clone https://github.com/pjreddie/darknet
cd darknet

Downloading the pre-trained weight file:

wget https://pjreddie.com/media/files/yolo.weights

Then run the detector:

./darknet detect cfg/yolo.cfg yolo.weights data/dog.jpg

Week 3: Getting started[edit]

Install Keras[edit]

Keras is a high-level neural networks library, written in Python and capable of running on top of either TensorFlow or Theano (they're both open source libraries for numerical computation optimized for GPU and CPU). It was developed with a focus on enabling fast experimentation. I followed the next steps to install Keras: https://keras.io/#installation. Currently, I'm running Keras on top of TensorFlow (https://www.tensorflow.org/) optimized for CPU.

Keras includes a module for multiple supplementary tasks called Utils. The most important functionality for the project provided by this module is the .HDF5Matrix() method. Keras employs the HDF5 file format (http://docs.h5py.org/en/latest/build.html) to save models and read datasets. According to HDF5 documentation, it is a hierarchical data format designed for high volumes of data with complex relationships.

The commands to install Keras and HDF5 support are:

sudo pip install tensorflow
sudo pip install keras
sudo apt-get install libhdf5
sudo pip install h5py

Testing the code of a neural network[edit]

After installing Keras, I've tried the TFG of my mate David Pascual (http://jderobot.org/Dpascual-tfg). In his project, he was able to classify a well-known database of handwritten digits (MNIST) using a convolutional neural network (CNN). He created a program that gets images from live video and displays the predictions obtained from them. I've studied his code and tested its operation.

If you want to execute David's code you have to open two terminals and put in each one respectively:

cameraserver cameraserver_digitclassifier.cfg 
python digitclassifier.py digitclassifier.cfg

First neural network: MNIST example[edit]

After studying David's code a bit, I have trained a simple convolutional neural network with the MNIST dataset.

First of all, we have to load and adapt the input data. The Keras library contains a module named datasets from which we can import a few databases, including MNIST. In order to load the MNIST data, we call the mnist.load_data() function. It returns images and labels from both the training and test datasets (x_train, y_train and x_test, y_test respectively).

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

The dimensions of x_train are (60000, 28, 28). That is, we have 60,000 images with a size of 28x28. We also have to explicitly declare the depth of the samples. In this case, we're working with black and white images, so the depth will be equal to 1. We reshape the data using the reshape() method. Depending on the backend (TensorFlow or Theano), the arguments must be passed in a different order.

# Shape of image depends if you use TensorFlow or Theano
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

The next step is to convert data type from uint8 to float32 and normalize pixel values to [0,1] range.

# We normalize the data
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
('x_train shape:', (60000, 28, 28, 1))
(60000, 'train samples')
(10000, 'test samples')

We have to encode the label data as one-hot vectors using the utils.to_categorical() method.

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
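Conceptually, to_categorical() replaces each integer label with a one-hot row vector; a minimal pure-Python equivalent (illustrative only, not the Keras implementation) would be:

```python
# One-hot encoding sketch: label k becomes a row with a 1.0 in column k.
def to_one_hot(labels, num_classes):
    matrix = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        matrix.append(row)
    return matrix

print(to_one_hot([0, 3, 1], 4))
# [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0]]
```

This matches the softmax output layer: the network produces one probability per class, and training compares that vector against the one-hot row.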

Now that the data is ready, we have to define the architecture of our neural network. We use a Sequential model.

model = Sequential()

The next step is to add the input, inner and output layers. We can add layers using the add() method. Convolutional neural networks usually contain convolutional layers, pooling layers and fully connected layers. We're going to add these layers to our model.

model.add(Conv2D(nb_filters, kernel_size=nb_conv,
                 activation='relu', input_shape=input_shape))
model.add(Conv2D(64, nb_conv, activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))


We train the neural network using the fit() method.

# We train the model
history = model.fit(x_train, y_train, batch_size=batch_size,
          epochs=epochs, verbose=1,
          validation_data=(x_test, y_test))

Last step is to evaluate the model.

# We evaluate the model
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Keras displays the results obtained.

Train on 60000 samples, validate on 10000 samples
Epoch 1/12
  128/60000 [..............................] - ETA: 4:34 - loss: 2.2873 - acc: 0  
256/60000 [..............................] - ETA: 4:12 - loss: 2.2707 - acc: 0  
384/60000 [..............................] - ETA: 3:59 - loss: 2.2541 - acc: 0  
512/60000 [..............................] - ETA: 3:51 - loss: 2.2271 - acc: 0  
640/60000 [..............................] - ETA: 3:47 - loss: 2.1912 - acc: 0


59520/60000 [============================>.] - ETA: 1s - loss: 0.0382 - acc: 0.988
59648/60000 [============================>.] - ETA: 1s - loss: 0.0382 - acc: 0.988
59776/60000 [============================>.] - ETA: 0s - loss: 0.0382 - acc: 0.988
59904/60000 [============================>.] - ETA: 0s - loss: 0.0381 - acc: 0.988
60000/60000 [==============================] - 264s 4ms/step - loss: 0.0381 - acc: 0.9889 - val_loss: 0.0293 - val_acc: 0.9905
('Test loss:', 0.029295223668735708)
('Test accuracy:', 0.99050000000000005)

The fit() function returns a History object which holds a dictionary of all the metrics tracked during training. We can use the data in the history object to plot the loss and accuracy curves and check how the training process went. You can call history.history.keys() to check which metrics are present in the history. It should look like ['acc', 'loss', 'val_acc', 'val_loss'].
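For example, given a history dictionary with those keys (the numbers below are made-up values for illustration, not real training results), the curves can be inspected directly:

```python
# Hypothetical history.history dictionary; the values are invented
# for the example and do not come from a real training run.
history = {
    'loss':     [2.28, 0.50, 0.10, 0.04],
    'acc':      [0.30, 0.85, 0.97, 0.99],
    'val_loss': [1.90, 0.30, 0.06, 0.03],
    'val_acc':  [0.40, 0.90, 0.98, 0.99],
}

print(sorted(history.keys()))  # ['acc', 'loss', 'val_acc', 'val_loss']

# Epoch (1-based) with the best validation accuracy
best_epoch = history['val_acc'].index(max(history['val_acc'])) + 1
print('Best validation accuracy at epoch', best_epoch)
```

With matplotlib installed, plotting history['loss'] and history['val_loss'] on the same axes makes it easy to see whether the model is overfitting.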

Week 2: BBDD of Deep Learning and C++ Tutorials[edit]

BBDD of Deep Learning[edit]

This second week, I've studied some datasets used in Deep Learning. Datasets provide a means to train and evaluate algorithms, and they drive research in new directions. Datasets related to object recognition can be split into three groups: object classification, object detection and semantic scene labeling.

Object classification: assigning pixels in the image to categories or classes of interest. There are different datasets for image classification. Some of them are:

  • MNIST (Modified National Institute of Standards and Technology): is a large database of handwritten digits that is commonly used for training various image processing systems. It consists of a training set of 60,000 examples and a test set of 10,000 examples. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
  • Caltech 101: is intended to facilitate Computer Vision research and techniques and is most applicable to techniques involving image recognition classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories and a background category. Provided with the images are a set of annotations describing the outlines of each image, along with a Matlab script for viewing.
  • Caltech 256: is a set similar to the Caltech 101, but with some improvements: 1) the number of categories is more than doubled, 2) the minimum number of images in any category is increased from 31 to 80, 3) artifacts due to image rotation are avoided, and 4) a new and larger clutter category is introduced for testing background rejection.
  • CIFAR-10: is an established computer-vision dataset used for object recognition. It is a subset of the 80 million tiny images dataset and consists of 60,000 32x32 color images containing one of 10 object classes, with 6000 images per class. There are 50,000 training images and 10,000 test images in the official data.
  • CIFAR-100: is large, consisting of 100 image classes, with 600 images per class. Each image is 32x32x3 (3 color), and the 600 images are divided into 500 training, and 100 test for each class.
  • ImageNet: is organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet; the majority of them are nouns (80,000+). ImageNet provides on average 1,000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated.

Object detection: the process of finding instances of real-world objects in images or videos. Object detection algorithms usually use extracted features and learning algorithms to recognize instances of an object category. There are different datasets for object detection. Some of them are:

  • COCO (Microsoft Common Objects in Context): is an image dataset designed to spur object detection research with a focus on detecting objects in context. COCO has the following characteristics: multiple objects in the images, more than 300,000 images, more than 2 million instances, and 80 object categories. Training sets, test and validation are used with their corresponding annotations. The annotations include pixel-level segmentation of object belonging to 80 categories, keypoint annotations for person instances, stuff segmentations for 91 categories, and five image captions per image.
  • PASCAL VOC: In 2007, the group released two large databases: one consisting of a validation set and a training set, and the other a single test set. Both databases contain about 5,000 images each, in which approximately 12,000 different objects are represented; in total, the set contains about 10,000 images in which about 24,000 objects are represented. In 2012, this set was extended to 11,530 images containing 27,450 different objects.
  • Caltech Pedestrian Dataset: consists of approximately 10 hours of 640x480 30Hz video taken from a vehicle driving through regular traffic in an urban environment. About 250,000 frames with a total of 350,000 bounding boxes and 2300 unique pedestrians were annotated. The annotation includes temporal correspondence between bounding boxes and detailed occlusion labels.
  • TME Motorway Dataset (Toyota Motor Europe (TME) Motorway Dataset): is composed by 28 clips for a total of approximately 27 minutes (30000+ frames) with vehicle annotation. Annotation was semi-automatically generated using laser-scanner data. The dataset is divided in two sub-sets depending on lighting condition, named “daylight” (although with objects casting shadows on the road) and “sunset” (facing the sun or at dusk).

Semantic scene labeling: each pixel of an image has to be labeled as belonging to a category. There are different datasets for semantic scene labeling. Some of them are:

  • SUN dataset: provides a collection of annotated images covering a large variety of environmental scenes, places and the objects within. SUN contains 908 scene categories from the WordNet dictionary with segmented objects. The 3,819 object categories span those common to object detection datasets (person, chair, car) and to semantic scene labeling (wall, sky, floor). A few categories have a large number of instances (wall: 20,213, window: 16,080, chair: 7,971) while most have a relatively modest number of instances (boat: 349, airplane: 179, floor lamp: 276).
  • Cityscapes Dataset: contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high quality pixel-level annotations of 5,000 frames in addition to a larger set of 20,000 weakly annotated frames.

C++ Tutorials[edit]

In this week, I've been doing C++ tutorials. I followed the tutorials on the following page: https://codigofacilito.com/cursos/c-plus-plus.

Week 1: Read old works[edit]

After that, I had to learn the state of the art of Deep Learning, so I've read the old works from other students. First, I read Nuria Oyaga's Final Degree Project: "Análisis de Aprendizaje Profundo con la plataforma Caffe" ("Analysis of Deep Learning with the Caffe platform"). Then, I read David Pascual's Final Degree Project: "Study of Convolutional Neural Networks using Keras Framework". Finally, I read an introduction to Deep Learning.