Vmartinezf-tfm


Project Card[edit]

Project Name: Visual Control with Deep Learning

Author: Vanessa Fernández Martínez [vanessa_1895@msn.com]

Academic Year: 2017/2018

Degree: Master in Computer Vision

GitHub Repositories: [1]

Tags: Deep Learning, Detection, JdeRobot

Progress[edit]

Week 44: Results, Controlnet Statistics, Fixed circuit[edit]

Results table (cropped image)[edit]

Driving results (regression networks)
Manual Pilotnet v + w normal TinyPilotnet v + w Pilotnet v + w multiple (stacked) Pilotnet v + w multiple (stacked, difference images) LSTM-Tinypilotnet v + w DeepestLSTM-Tinypilotnet v + w LSTM
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 100% 1min 37s 100% 1min 41s 100% 1min 41s 100% 1min 39s 100% 1min 40s 100% 1min 37s
Simple (anti-clockwise) 100% 1min 33s 100% 1min 38s 100% 1min 41s 10% 100% 1min 38s 100% 1min 38s 100% 1min 38s
Monaco (clockwise) 100% 1min 15s 100% 1min 20s 100% 1min 19s 85% 45% 50% 55%
Monaco (anti-clockwise) 100% 1min 15s 100% 1min 19s 100% 1min 18s 15% 5% 35% 55%
Nurburgrin (clockwise) 100% 1min 02s 100% 1min 04s 100% 1min 04s 8% 8% 40% 100% 1min 04s
Nurburgrin (anti-clockwise) 100% 1min 02s 100% 1min 06s 100% 1min 05s 80% 50% 50% 80%
Curve GP (clockwise) 100% 2min 13s 100% 2min 16s 25% 25% 25% 100% 2min 17s 100% 2min 18s
Curve GP (anti-clockwise) 100% 2min 09s 100% 2min 12s 75% 75% 75% 100% 2min 04s 100% 2min 12s



Driving results (classification networks)
Manual Classification 5v+7w biased Classification 5v+7w balanced Classification 5v+7w imbalanced
Circuits Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 100% 1min 41s 75% 100% 1min 42s
Simple (anti-clockwise) 100% 1min 33s 100% 1min 39s 100% 1min 39s 100% 1min 43s
Monaco (clockwise) 100% 1min 15s 100% 1min 20s 70% 85%
Monaco (anti-clockwise) 100% 1min 15s 100% 1min 18s 8% 100% 1min 20s
Nurburgrin (clockwise) 100% 1min 02s 100% 1min 03s 100% 1min 03s 100% 1min 05s
Nurburgrin (anti-clockwise) 100% 1min 02s 100% 1min 05s 80% 80%
Curve GP (clockwise) 100% 2min 13s 100% 2min 06s 97% 100% 2min 15s
Curve GP (anti-clockwise) 100% 2min 09s 100% 2min 11s 100% 2min 05s 100% 2min 15s

Results table (whole image)[edit]

Driving results (regression networks)
Manual Pilotnet normal TinyPilotnet Pilotnet multiple (stacked) Pilotnet multiple (stacked, difference images) LSTM-Tinypilotnet DeepestLSTM-Tinypilotnet Controlnet LSTM
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 100% 1min 41s 100% 1min 39s 100% 1min 40s 100% 1min 43s 100% 1min 39s 100% 1min 39s 100% 1min 46s
Simple (anti-clockwise) 100% 1min 33s 100% 1min 39s 100% 1min 38s 100% 1min 46s 10% 100% 1min 40s 100% 1min 41s 100% 1min 37s
Monaco (clockwise) 100% 1min 15s 100% 1min 21s 100% 1min 19s 50% 5% 50% 50% 5%
Monaco (anti-clockwise) 100% 1min 15s 100% 1min 23s 100% 1min 20s 7% 5% 12% 100% 1min 21s 5%
Nurburgrin (clockwise) 100% 1min 02s 100% 1min 03s 100% 1min 05s 50% 8% 20% 100% 1min 05s 8%
Nurburgrin (anti-clockwise) 100% 1min 02s 100% 1min 06s 100% 1min 06s 80% 50% 80% 100% 1min 07s 75%
Curve GP (clockwise) 100% 2min 13s 100% 2min 20s 100% 2min 11s 25% 25% 100% 2min 20s 25% 25%
Curve GP (anti-clockwise) 100% 2min 09s 100% 2min 16s 100% 2min 06s 100% 2min 07s 75% 100% 2min 25s 100% 2min 04s 3%




Driving results (classification networks)
Manual Classification 5v+7w biased Classification 5v+7w balanced Classification 5v+7w imbalanced
Circuits Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 35% 10% 90%
Simple (anti-clockwise) 100% 1min 33s 100% 1min 49s 100% 1min 46s 90%
Monaco (clockwise) 100% 1min 15s 100% 1min 24s 5% 100% 1min 23s
Monaco (anti-clockwise) 100% 1min 15s 100% 1min 29s 8% 100% 1min 24s
Nurburgrin (clockwise) 100% 1min 02s 100% 1min 10s 8% 90%
Nurburgrin (anti-clockwise) 100% 1min 02s 100% 1min 07s 8% 100% 1min 09s
Curve GP (clockwise) 100% 2min 13s 95% 80% 25%
Curve GP (anti-clockwise) 100% 2min 09s 7% 3% 20%

Controlnet statistics[edit]

Dataset (Simple, Monaco, Nurburgrin)[edit]

  • v:
v results:

MSE: 0.5360334346978631

MAE: 0.21453057245415982


  • w:


w results:

MSE: 0.00272241766280656

MAE: 0.03184364641428021


Dataset CurveGP[edit]

  • v:
v results:

MSE: 0.3284508873436601

MAE: 0.17399486924267757


  • w:


w results:

MSE: 0.001533648088125278

MAE: 0.02667656092817528
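
These error metrics can be computed, for example, with sklearn.metrics (whether sklearn is used here is an assumption); a minimal sketch with placeholder arrays standing in for the real ground-truth and predicted velocities:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Placeholder ground-truth and predicted w values (the real ones come from the
# dataset labels and the Controlnet predictions)
y_true = np.array([0.12, -0.34, 0.05, 0.41])
y_pred = np.array([0.10, -0.30, 0.07, 0.38])

print('MSE: ', mean_squared_error(y_true, y_pred))
print('MAE: ', mean_absolute_error(y_true, y_pred))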




Fixed circuit[edit]

  • CurveGP circuit:

Week 43: Problems with the circuits, Tests with CurveGP circuit, Controlnet statistics, Circuit[edit]

Problems with circuits[edit]

  • Simple circuit:


  • Monaco circuit:


  • Nurburgrin circuit:


  • Small circuit:


  • CurveGP circuit:

Tests with CurveGP circuit[edit]

I've done tests with a CurveGP circuit:

Results table (cropped image)[edit]

Driving results (regression networks)
Manual Pilotnet normal TinyPilotnet Pilotnet multiple (stacked) Pilotnet multiple (stacked, difference images) LSTM-Tinypilotnet DeepestLSTM-Tinypilotnet LSTM
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
CurveGP (clockwise) 100% 2min 13s 50% 25% 25% 2% 50% 25%
CurveGP (anti-clockwise) 100% 2min 09s 2% 2% 2% 1% 2% 2%


Driving results (classification networks)
Manual Classification 5v+7w biased Classification 5v+7w balanced Classification 5v+7w imbalanced
Circuits Percentage Time Percentage Time Percentage Time Percentage Time
CurveGP (clockwise) 100% 2min 13s 100% 2min 11s 100% 2min 04s 100% 2min 11s
CurveGP (anti-clockwise) 100% 2min 09s 100% 2min 07s 100% 2min 03s 100% 2min 09s

Results table (whole image)[edit]

Driving results (regression networks)
Manual Pilotnet normal TinyPilotnet Pilotnet multiple (stacked) Pilotnet multiple (stacked, difference images) LSTM-Tinypilotnet DeepestLSTM-Tinypilotnet Controlnet LSTM
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
CurveGP (clockwise) 100% 2min 13s 2% 1% 1% 2% 1% 15% 1%
CurveGP (anti-clockwise) 100% 2min 09s 1% 2% 1% 1% 2% 2% 1%



Driving results (classification networks)
Manual Classification 5v+7w biased Classification 5v+7w balanced Classification 5v+7w imbalanced
Circuits Percentage Time Percentage Time Percentage Time Percentage Time
CurveGP (clockwise) 100% 2min 13s 2% 2% 1%
CurveGP (anti-clockwise) 100% 2min 09s 1% 1% 1%

Controlnet statistics[edit]

  • Train:

v results:

MSE: 1.308563

MAE: 0.433917

w results:

MSE: 0.009177

MAE: 0.053194

  • Test:

v results:

MSE: 4.017514

MAE: 0.816513

w results:

MSE: 0.055743

MAE: 0.173289

Circuit[edit]

Week 42: Circuit[edit]

I've tried to create a new circuit with Blender, but it isn't visible in Gazebo.

Blender:

Gazebo:

Week 41: Temporal difference network[edit]

Temporal difference network[edit]

I've tested a network that takes a grayscale difference image as its input, after applying the following preprocessing:

import cv2
import numpy as np

margin = 10
# Grayscale versions of the current frame and the frame (margin + 1) positions earlier
i1 = cv2.cvtColor(imgs[i], cv2.COLOR_BGR2GRAY)
i2 = cv2.cvtColor(imgs[i - (margin + 1)], cv2.COLOR_BGR2GRAY)
i1 = cv2.GaussianBlur(i1, (5, 5), 0)
i2 = cv2.GaussianBlur(i2, (5, 5), 0)
# Signed difference; keep only pixels whose absolute difference is at least 15
difference = np.zeros((i1.shape[0], i1.shape[1], 1))
difference[:, :, 0] = cv2.subtract(np.float64(i1), np.float64(i2))
mask1 = cv2.inRange(difference[:, :, 0], 15, 255)
mask2 = cv2.inRange(difference[:, :, 0], -255, -15)
mask = mask1 + mask2
difference[:, :, 0][np.where(mask == 0)] = 0
# Close small gaps in the surviving regions
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
difference[:, :, 0] = cv2.morphologyEx(difference[:, :, 0], cv2.MORPH_CLOSE, kernel)
im2 = difference
# Rescale the result to the range [-128, 128)
if np.ptp(im2) != 0:
    img_resized = 256 * (im2 - np.min(im2)) / np.ptp(im2) - 128
else:
    img_resized = 256 * (im2 - np.min(im2)) / 1 - 128


I've used a margin of 10 frames between the two images. The result is:



Driving results (Temporal difference network, whole image)
Manual Temporal difference
Circuits Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 35%
Simple (anti-clockwise) 100% 1min 33s 10%
Monaco (clockwise) 100% 1min 15s 3%
Monaco (anti-clockwise) 100% 1min 15s 3%
Nurburgrin (clockwise) 100% 1min 02s 8%
Nurburgrin (anti-clockwise) 100% 1min 02s 3%

Week 40: Tests with other circuit, Controlnet, Temporal difference network[edit]

Tests with other circuit[edit]

I've done tests with a circuit that hasn't been used for training.


Results table (cropped image)[edit]

Driving results (regression networks)
Manual Pilotnet normal TinyPilotnet Pilotnet multiple (stacked) Pilotnet multiple (stacked, difference images) LSTM-Tinypilotnet DeepestLSTM-Tinypilotnet LSTM
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
Small (clockwise) 100% 1min 00s 10% 100% 1min 14s 100% 1min 08 10% 10% 100% 1min 09s
Small (anti-clockwise) 100% 59 s 20% 100% 1min 17s 100% 1min 08s 20% 80% 100% 1min 07s


Driving results (classification networks)
Manual Classification 5v+7w biased Classification 5v+7w balanced Classification 5v+7w imbalanced
Circuits Percentage Time Percentage Time Percentage Time Percentage Time
Small (clockwise) 100% 1min 00s 100% 1min 02s 100% 1min 03s 100% 1min 07s
Small (anti-clockwise) 100% 59s 100% 1min 05s 100% 1min 02s 100% 1min 08s

Results table (whole image)[edit]

Driving results (regression networks)
Manual Pilotnet normal TinyPilotnet Pilotnet multiple (stacked) Pilotnet multiple (stacked, difference images) LSTM-Tinypilotnet DeepestLSTM-Tinypilotnet Controlnet LSTM
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
Small (clockwise) 100% 1min 00s 85% 100% 1min 09s 80% 100% 1min 03s 10% 100% 1min 01s 20%
Small (anti-clockwise) 100% 59 s 100% 1min 08s 100% 1min 13s 20% 100% 1min 04s 20% 20% 20%



Driving results (classification networks)
Manual Classification 5v+7w biased Classification 5v+7w balanced Classification 5v+7w imbalanced
Circuits Percentage Time Percentage Time Percentage Time Percentage Time
Small (clockwise) 100% 1min 00s 100% 1min 10s 80% 100% 1min 07s
Small (anti-clockwise) 100% 59s 100% 1min 07s 15% 75%

Controlnet[edit]

Driving results (Controlnet network, whole image)
Manual Controlnet
Circuits Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 100% 1min 46s
Simple (anti-clockwise) 100% 1min 33s 100% 1min 38s
Monaco (clockwise) 100% 1min 15s 5%
Monaco (anti-clockwise) 100% 1min 15s 5%
Nurburgrin (clockwise) 100% 1min 02s 8%
Nurburgrin (anti-clockwise) 100% 1min 02s 75%


Temporal difference network[edit]

I've tested a network that takes a grayscale difference image as its input, after applying the following preprocessing:

import cv2
import numpy as np

margin = 10
# Grayscale versions of the current frame and the frame (margin + 1) positions earlier
i1 = cv2.cvtColor(imgs[i], cv2.COLOR_BGR2GRAY)
i2 = cv2.cvtColor(imgs[i - (margin + 1)], cv2.COLOR_BGR2GRAY)
i1 = cv2.GaussianBlur(i1, (5, 5), 0)
i2 = cv2.GaussianBlur(i2, (5, 5), 0)
# Absolute difference, binarized with a threshold of 15 and then closed morphologically
difference = np.zeros((i1.shape[0], i1.shape[1], 1))
difference[:, :, 0] = cv2.absdiff(i1, i2)
_, difference[:, :, 0] = cv2.threshold(difference[:, :, 0], 15, 255, cv2.THRESH_BINARY)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
difference[:, :, 0] = cv2.morphologyEx(difference[:, :, 0], cv2.MORPH_CLOSE, kernel)


I've used a margin of 10 frames between the two images. The result is:



Driving results (Temporal difference network, whole image)
Manual Temporal difference
Circuits Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 25%
Simple (anti-clockwise) 100% 1min 33s 10%
Monaco (clockwise) 100% 1min 15s 5%
Monaco (anti-clockwise) 100% 1min 15s 3%
Nurburgrin (clockwise) 100% 1min 02s 8%
Nurburgrin (anti-clockwise) 100% 1min 02s 3%

Week 39: Study of difference images, Results, Controlnet[edit]

Results table (regression, cropped image)[edit]

Driving results (regression networks)
Manual Pilotnet v + w normal TinyPilotnet v + w Pilotnet v + w multiple (stacked) Pilotnet v + w multiple (stacked, difference images) LSTM-Tinypilotnet v + w DeepestLSTM-Tinypilotnet v + w LSTM
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 100% 1min 37s 100% 1min 41s 100% 1min 41s 100% 1min 39s 100% 1min 40s 100% 1min 37s
Simple (anti-clockwise) 100% 1min 33s 100% 1min 38s 100% 1min 41s 10% 100% 1min 38s 100% 1min 38s 100% 1min 38s
Monaco (clockwise) 100% 1min 15s 100% 1min 20s 100% 1min 19s 85% 45% 50% 55%
Monaco (anti-clockwise) 100% 1min 15s 100% 1min 19s 100% 1min 18s 15% 5% 35% 55%
Nurburgrin (clockwise) 100% 1min 02s 100% 1min 04s 100% 1min 04s 8% 8% 40% 100% 1min 04s
Nurburgrin (anti-clockwise) 100% 1min 02s 100% 1min 06s 100% 1min 05s 80% 50% 50% 80%

Results table (regression, whole image)[edit]

Driving results (regression networks)
Manual Pilotnet normal TinyPilotnet Pilotnet multiple (stacked) Pilotnet multiple (stacked, difference images) LSTM-Tinypilotnet DeepestLSTM-Tinypilotnet Controlnet LSTM
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 100% 1min 41s 100% 1min 39s 100% 1min 40s 100% 1min 43s 100% 1min 39s 100% 1min 39s 100% 1min 46s
Simple (anti-clockwise) 100% 1min 33s 100% 1min 39s 100% 1min 38s 100% 1min 46s 10% 10% 100% 1min 41s 100% 1min 37s
Monaco (clockwise) 100% 1min 15s 100% 1min 21s 100% 1min 19s 50% 5% 100% 1min 27s 50% 5%
Monaco (anti-clockwise) 100% 1min 15s 100% 1min 23s 100% 1min 20s 7% 5% 50% 100% 1min 21s 5%
Nurburgrin (clockwise) 100% 1min 02s 100% 1min 03s 100% 1min 05s 50% 8% 100% 1min 08s 100% 1min 05s 8%
Nurburgrin (anti-clockwise) 100% 1min 02s 100% 1min 06s 100% 1min 06s 80% 50% 50% 100% 1min 07s 8%

Study of temporal images[edit]

I've tried to create a difference image with only two channels (H and V). First, I computed the absolute difference of the two images (separated by 5 frames) for each channel. Then I normalized the difference to the range 0-255. It isn't a good solution for driving.

Straight line:

Curve:


I've created a sum image using numpy.add(x1, x2). The resulting image is:
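
A minimal sketch of how such an HV difference image and the sum image can be built (the frames here are random placeholders; the real project code may differ):

import cv2
import numpy as np

# Placeholder BGR frames, 5 frames apart
img_t = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
img_t5 = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)

hsv_t = cv2.cvtColor(img_t, cv2.COLOR_BGR2HSV)
hsv_t5 = cv2.cvtColor(img_t5, cv2.COLOR_BGR2HSV)

# Two-channel (H, V) absolute difference, normalized to [0, 255]
diff = np.zeros((img_t.shape[0], img_t.shape[1], 2), dtype=np.float64)
for k, channel in enumerate([0, 2]):          # 0 = H, 2 = V in OpenCV's HSV
    d = cv2.absdiff(hsv_t[:, :, channel], hsv_t5[:, :, channel]).astype(np.float64)
    if np.ptp(d) != 0:
        d = 255 * (d - d.min()) / np.ptp(d)
    diff[:, :, k] = d

# Sum image of the two frames
sum_image = np.add(np.float64(img_t), np.float64(img_t5))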

Controlnet[edit]


Driving results (Controlnet network, whole image)
Manual Controlnet
Circuits Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 100% 1min 46s
Simple (anti-clockwise) 100% 1min 33s 100% 1min 37s
Monaco (clockwise) 100% 1min 15s 5%
Monaco (anti-clockwise) 100% 1min 15s 5%
Nurburgrin (clockwise) 100% 1min 02s 8%
Nurburgrin (anti-clockwise) 100% 1min 02s 8%

Week 38: Reading information, Temporal difference network, Results[edit]

Results (cropped image)[edit]

Driving results (classification networks)
Manual Classification 5v+7w biased Classification 5v+7w balanced Classification 5v+7w imbalanced
Circuits Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 100% 1min 41s 75% 100% 1min 42s
Simple (anti-clockwise) 100% 1min 33s 100% 1min 39s 100% 1min 39s 100% 1min 43s
Monaco (clockwise) 100% 1min 15s 100% 1min 20s 70% 85%
Monaco (anti-clockwise) 100% 1min 15s 100% 1min 18s 8% 100% 1min 20s
Nurburgrin (clockwise) 100% 1min 02s 100% 1min 03s 100% 1min 03s 100% 1min 05s
Nurburgrin (anti-clockwise) 100% 1min 02s 100% 1min 05s 80% 80%

Results (whole image)[edit]

Driving results (classification networks)
Manual Classification 5v+7w biased Classification 5v+7w balanced Classification 5v+7w imbalanced
Circuits Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 35% 10% 90%
Simple (anti-clockwise) 100% 1min 33s 100% 1min 49s 100% 1min 46s 90%
Monaco (clockwise) 100% 1min 15s 100% 1min 24s 5% 100% 1min 23s
Monaco (anti-clockwise) 100% 1min 15s 100% 1min 29s 8% 100% 1min 24s
Nurburgrin (clockwise) 100% 1min 02s 100% 1min 10s 8% 90%
Nurburgrin (anti-clockwise) 100% 1min 02s 100% 1min 07s 8% 100% 1min 09s

Reading information[edit]

End to End Learning for Self-Driving Cars[edit]

In this paper (https://www.researchgate.net/publication/301648615_End_to_End_Learning_for_Self-Driving_Cars, https://github.com/Kejie-Wang/End-to-End-Learning-for-Self-Driving-Cars), a convolutional neural network (CNN) is trained to map raw pixels from a single front-facing camera directly to steering commands. The system automatically learns internal representations of the necessary processing steps such as detecting useful road features with only the human steering angle as the training signal.

Images are fed into a CNN which then computes a proposed steering command. The proposed command is compared to the desired command for that image and the weights of the CNN are adjusted to bring the CNN output closer to the desired output. The weight adjustment is accomplished using back propagation. Once trained, the network can generate steering from the video images of a single center camera.

Training data was collected by driving on a wide variety of roads and in a diverse set of lighting and weather conditions. Most road data was collected in central New Jersey, although highway data was also collected from Illinois, Michigan, Pennsylvania, and New York. Other road types include two-lane roads (with and without lane markings), residential roads with parked cars, tunnels, and unpaved roads. Data was collected in clear, cloudy, foggy, snowy, and rainy weather, both day and night. 72 hours of driving data was collected.

They train the weights of their network to minimize the mean squared error between the steering command output by the network and the command of either the human driver, or the adjusted steering command for off-center and rotated images. The network consists of 9 layers, including a normalization layer, 5 convolutional layers and 3 fully connected layers. The input image is split into YUV planes and passed to the network.

The first layer of the network performs image normalization. The convolutional layers were designed to perform feature extraction and were chosen empirically through a series of experiments that varied layer configurations. They use strided convolutions in the first three convolutional layers with a 2×2 stride and a 5×5 kernel and a non-strided convolution with a 3×3 kernel size in the last two convolutional layers. They follow the five convolutional layers with three fully connected layers leading to an output control value which is the inverse turning radius. The fully connected layers are designed to function as a controller for steering, but it is not possible to make a clean break between which parts of the network function primarily as feature extractor and which serve as controller.
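
This architecture can be sketched in Keras roughly as follows (the filter counts 24/36/48/64/64, the 66×200 YUV input and the 100/50/10 fully connected sizes come from the NVIDIA paper; the simple pixel-scaling normalization and the training settings are assumptions of this sketch):

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense, Lambda

def pilotnet_model(img_shape=(66, 200, 3)):
    model = Sequential()
    # Normalization layer (hard-coded pixel scaling, not learned)
    model.add(Lambda(lambda x: x / 127.5 - 1.0, input_shape=img_shape))
    # Three strided 5x5 convolutions (2x2 stride)
    model.add(Conv2D(24, (5, 5), strides=(2, 2), activation='relu'))
    model.add(Conv2D(36, (5, 5), strides=(2, 2), activation='relu'))
    model.add(Conv2D(48, (5, 5), strides=(2, 2), activation='relu'))
    # Two non-strided 3x3 convolutions
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    # Three fully connected layers leading to the control output
    model.add(Flatten())
    model.add(Dense(100, activation='relu'))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))        # inverse turning radius
    model.compile(optimizer='adam', loss='mse')
    return model

model = pilotnet_model()
model.summary()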

To train a CNN to do lane following they only select data where the driver was staying in a lane and discard the rest. They then sample that video at 10 FPS. A higher sampling rate would result in including images that are highly similar and thus not provide much useful information. After selecting the final set of frames they augment the data by adding artificial shifts and rotations to teach the network how to recover from a poor position or orientation.
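
A rough sketch of this kind of shift-and-rotation augmentation (the correction factor and the label-adjustment rule here are assumptions; the paper computes the adjusted steering command from the camera geometry):

import cv2
import numpy as np

def shift_and_rotate(image, steering, shift_px, angle_deg, steer_per_px=0.004):
    # steer_per_px is a hypothetical correction factor, not a value from the paper
    h, w = image.shape[:2]
    # Horizontal shift
    M_shift = np.float32([[1, 0, shift_px], [0, 1, 0]])
    shifted = cv2.warpAffine(image, M_shift, (w, h))
    # Small rotation around the image centre
    M_rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    augmented = cv2.warpAffine(shifted, M_rot, (w, h))
    # Adjust the steering label so the network learns to recover
    new_steering = steering + shift_px * steer_per_px
    return augmented, new_steering

# Hypothetical usage on a random frame
img = np.random.randint(0, 256, (66, 200, 3), dtype=np.uint8)
aug_img, aug_steer = shift_and_rotate(img, steering=0.0, shift_px=20, angle_deg=2.0)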

Before road-testing a trained CNN, they first evaluate the network's performance in simulation.

VisualBackProp: efficient visualization of CNNs[edit]

In this paper (https://arxiv.org/pdf/1611.05418.pdf), the authors present VisualBackProp, a method for efficiently visualizing which sets of pixels of the input image contribute most to the predictions of a CNN.


Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning[edit]

In this paper (https://arxiv.org/pdf/1609.05143.pdf, https://www.youtube.com/watch?v=SmBxMDiOrvs), the authors propose a deep reinforcement learning framework for target-driven visual navigation in indoor scenes.


Temporal difference network[edit]

In this method I test a new network with the difference image between frames it and it-5. The results are:

Driving results
Temporal_dif constant v + w [whole image] Temporal_dif v + w [whole image] Temporal_dif constant v + w [cropped image] Temporal_dif v + w [cropped image]
Circuits Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 3min 37s 100% 1min 43s 100% 3min 37s 100% 1min 39s
Simple (anti-clockwise) 100% 3min 38s 100% 1min 44s 100% 3min 38s 100% 1min 42s
Monaco (clockwise) 45% 5% 45% 5%
Monaco (anti-clockwise) 45% 5% 8% 5%
Nurburgrin (clockwise) 8% 8% 8% 8%
Nurburgrin (anti-clockwise) 90% 8% 90% 8%


Difference image:

Week 37: Driving videos, Pilotnet multiple (stacked), Metrics table, Basic LSTM[edit]

Driving videos[edit]

Pilotnet network [whole image][edit]

I've used the predictions of the Pilotnet network (regression network) to drive a formula 1 (test3):

  • Simple circuit clockwise (simulation time: 1min 41s):


  • Simple circuit anti-clockwise (simulation time: 1min 39s):


  • Monaco circuit clockwise (simulation time: 1min 21s):


  • Monaco circuit anti-clockwise (simulation time: 1min 23s):


  • Nurburgrin circuit clockwise (simulation time: 1min 03s):


  • Nurburgrin circuit anti-clockwise (simulation time: 1min 06s):


Tinypilotnet network [whole image][edit]

I've used the predictions of the Tinypilotnet network (regression network) to drive a formula 1:

  • Simple circuit clockwise (simulation time: 1min 39s):


  • Simple circuit anti-clockwise (simulation time: 1min 38s):


  • Monaco circuit clockwise (simulation time: 1min 19s):


  • Monaco circuit anti-clockwise (simulation time: 1min 20s):


  • Nurburgrin circuit clockwise (simulation time: 1min 05s):


  • Nurburgrin circuit anti-clockwise (simulation time: 1min 06s):


Biased classification network [cropped image][edit]

I've used the predictions of the classification network according to w (7 classes) and v (4 classes) to drive a formula 1 (simulation time: 1min 38s):

Results table (cropped image)[edit]

Driving results (regression networks)
Manual Pilotnet v + w normal TinyPilotnet v + w Pilotnet v + w multiple (stacked) Pilotnet v + w multiple (stacked, difference images) LSTM-Tinypilotnet v + w DeepestLSTM-Tinypilotnet v + w LSTM
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 100% 1min 40s 100% 1min 40s 100% 1min 41s 100% 1min 39s 100% 1min 40s 100% 1min 37s
Simple (anti-clockwise) 100% 1min 33s 100% 1min 45s 100% 1min 40s 10% 100% 1min 38s 100% 1min 38s 100% 1min 38s
Monaco (clockwise) 100% 1min 15s 85% 85% 85% 45% 50% 55%
Monaco (anti-clockwise) 100% 1min 15s 100% 1min 20s 100% 1min 18s 15% 5% 35% 55%
Nurburgrin (clockwise) 100% 1min 02s 100% 1min 04s 100% 1min 04s 8% 8% 40% 100% 1min 04s
Nurburgrin (anti-clockwise) 100% 1min 02s 100% 1min 05s 100% 1min 05s 80% 50% 50% 80%


Driving results (classification networks)
Manual Classification 1v+7w biased Classification 4v+7w biased Classification 1v+7w balanced Classification 4v+7w balanced Classification 1v+7w imbalanced Classification 4v+7w imbalanced
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 100% 2min 16s 100% 1min 38s 100% 2min 16s 98% 100% 2min 16s 100% 1min 42s
Simple (anti-clockwise) 100% 1min 33s 100% 2min 16s 100% 1min 38s 100% 2min 16s 100% 1min 41s 100% 2min 16s 100% 1min 39s
Monaco (clockwise) 100% 1min 15s 45% 5% 5% 5% 5% 5%
Monaco (anti-clockwise) 100% 1min 15s 15% 5% 5% 5% 5% 5%
Nurburgrin (clockwise) 100% 1min 02s 8% 8% 8% 8% 8% 8%
Nurburgrin (anti-clockwise) 100% 1min 02s 80% 90% 80% 80% 80% 80%

Results table (whole image)[edit]

Driving results (regression networks)
Manual Pilotnet v + w normal TinyPilotnet v + w Pilotnet v + w multiple (stacked) Pilotnet v + w multiple (stacked, difference images) LSTM-Tinypilotnet v + w DeepestLSTM-Tinypilotnet v + w LSTM
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 100% 1min 41s 100% 1min 39s 100% 1min 40s 100% 1min 43s 100% 1min 39s 100% 1min 39s
Simple (anti-clockwise) 100% 1min 33s 100% 1min 39s 100% 1min 38s 100% 1min 46s 10% 10% 100% 1min 41s
Monaco (clockwise) 100% 1min 15s 100% 1min 21s 100% 1min 19s 50% 5% 100% 1min 27s 50%
Monaco (anti-clockwise) 100% 1min 15s 100% 1min 23s 100% 1min 20s 7% 5% 50% 100% 1min 21s
Nurburgrin (clockwise) 100% 1min 02s 100% 1min 03s 100% 1min 05s 50% 8% 100% 1min 08s 100% 1min 05s
Nurburgrin (anti-clockwise) 100% 1min 02s 100% 1min 06s 100% 1min 06s 80% 50% 50% 100% 1min 07s


Driving results (classification networks)
Manual Classification 1v+7w biased Classification 4v+7w biased Classification 1v+7w balanced Classification 4v+7w balanced Classification 1v+7w imbalanced Classification 4v+7w imbalanced
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 100% 2min 17s 70% 75% 7% 100% 2min 17s 40%
Simple (anti-clockwise) 100% 1min 33s 100% 2min 17s 10% 100% 2min 16s 7% 100% 2min 16s 10%
Monaco (clockwise) 100% 1min 15s 5% 5% 5% 5% 5% 5%
Monaco (anti-clockwise) 100% 1min 15s 5% 5% 5% 5% 5% 5%
Nurburgrin (clockwise) 100% 1min 02s 8% 8% 8% 8% 8% 8%
Nurburgrin (anti-clockwise) 100% 1min 02s 8% 8% 8% 8% 8% 8%

Pilotnet multiple (stacked)[edit]

In this method (stacked frames), we concatenate multiple subsequent input images to create a stacked image. Then, we feed this stacked image to the network as a single input. In this case, we have stacked 2 images separated by 10 frames. The results are:

Driving results
Pilotnet constant v + w multiple (stacked) [whole image] Pilotnet v + w multiple (stacked) [whole image] Pilotnet constant v + w multiple (stacked) [cropped image] Pilotnet v + w multiple (stacked) [cropped image]
Circuits Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 3min 45s 100% 1min 40s 100% 3min 46s 100% 1min 41s
Simple (anti-clockwise) 100% 3min 44s 100% 1min 46s 100% 3min 46s 10%
Monaco (clockwise) 100% 2min 56s 50% 100% 2min 56s 85%
Monaco (anti-clockwise) 7% 7% 7% 15%
Nurburgrin (clockwise) 8% 50% 8% 8%
Nurburgrin (anti-clockwise) 100% 2min 27s 80% 90% 80%


We have also tried to stack 2 images, where one is the image at instant t (it) and the other is the difference image between it and it-10. The results are:

Driving results
Pilotnet constant v + w multiple (stacked) [whole image] Pilotnet v + w multiple (stacked) [whole image] Pilotnet constant v + w multiple (stacked) [cropped image] Pilotnet v + w multiple (stacked) [cropped image]
Circuits Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 3min 45s 100% 1min 43s 100% 3min 46s 100% 1min 39s
Simple (anti-clockwise) 100% 3min 36s 10% 100% 3min 46s 100% 1min 38s
Monaco (clockwise) 45% 5% 50% 45%
Monaco (anti-clockwise) 5% 5% 7% 5%
Nurburgrin (clockwise) 8% 8% 8% 8%
Nurburgrin (anti-clockwise) 90% 50% 90% 50%

Metrics table (cropped image)[edit]

Metrics results (Classification networks) [Train data]
Classification 7w biased Classification 4v biased Classification 7w balanced Classification 4v balanced Classification 7w imbalanced Classification 4v imbalanced
Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score
97% 99% 98% 97% 97% 98% 99% 98% 98% 98% 95% 99% 96% 95% 95% 94% 97% 95% 95% 95% 98% 99% 99% 99% 99% 98% 99% 98% 98% 98%


Metrics results (Classification networks) [Test data]
Classification 7w biased Classification 4v biased Classification 7w balanced Classification 4v balanced Classification 7w imbalanced Classification 4v imbalanced
Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score
94% 99% 95% 95% 95% 95% 98% 95% 95% 95% 93% 99% 94% 94% 94% 92% 96% 94% 93% 93% 95% 99% 95% 95% 95% 95% 97% 95% 95% 95%




Metrics results (Regression networks) [Train data]
Pilotnet w Pilotnet v Pilotnet w multiple (stacked) Pilotnet v multiple (stacked)
Mean squared error Mean absolute error Mean squared error Mean absolute error Mean squared error Mean absolute error Mean squared error Mean absolute error
0.001754 0.027871 0.626956 0.452977 0.110631 0.230633 5.215044 1.563034


Metrics results (Regression networks) [Test data]
Pilotnet w Pilotnet v Pilotnet w multiple (stacked) Pilotnet v multiple (stacked)
Mean squared error Mean absolute error Mean squared error Mean absolute error Mean squared error Mean absolute error Mean squared error Mean absolute error
0.002206 0.030515 0.849241 0.499219 0.108316 0.226848 5.272124 1.552658

Metrics table (whole image)[edit]

Metrics results (Classification networks) [Train data]
Classification 7w biased Classification 4v biased Classification 7w balanced Classification 4v balanced Classification 7w imbalanced Classification 4v imbalanced
Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score
97% 99% 97% 97% 97% 97% 99% 98% 98% 98% 95% 99% 96% 96% 96% 90% 95% 90% 90% 90% 98% 99% 98% 98% 98% 96% 98% 96% 96% 96%


Metrics results (Classification networks) [Test data]
Classification 7w biased Classification 4v biased Classification 7w balanced Classification 4v balanced Classification 7w imbalanced Classification 4v imbalanced
Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score Accuracy Accuracy top 2 Precision Recall F1-score
95% 99% 95% 95% 95% 94% 97% 95% 95% 95% 93% 99% 94% 93% 93% 89% 95% 91% 89% 90% 95% 99% 95% 95% 95% 94% 97% 95% 95% 95%




Metrics results (Regression networks) [Train data]
Pilotnet w Pilotnet v Pilotnet w multiple (stacked) Pilotnet v multiple (stacked) DeepestLSTM-Tinypilotnet w DeepestLSTM-Tinypilotnet v
Mean squared error Mean absolute error Mean squared error Mean absolute error Mean squared error Mean absolute error Mean squared error Mean absolute error Mean squared error Mean absolute error Mean squared error Mean absolute error
0.000660 0.015514 0.809848 0.548209 0.068739 0.167565 8.973208 1.997035 0.001065 0.021000 0.491759 0.383216



Metrics results (Regression networks) [Test data]
Pilotnet w Pilotnet v Pilotnet w multiple (stacked) Pilotnet v multiple (stacked) DeepestLSTM-Tinypilotnet w DeepestLSTM-Tinypilotnet v
Mean squared error Mean absolute error Mean squared error Mean absolute error Mean squared error Mean absolute error Mean squared error Mean absolute error Mean squared error Mean absolute error Mean squared error Mean absolute error
0.000938 0.017433 1.374714 0.659400 0.067305 0.164354 9.402403 2.039585 0.000982 0.020472 0.549310 0.399267

Basic CNN+LSTM[edit]

I have created a CNN + LSTM network and I have trained it with a set of 10 images. This is very little data, but it allowed me to test the network, which did not work with the original dataset. The code is:

import glob
import cv2
import numpy as np

from time import time
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

from keras.models import Sequential
from keras.layers import Flatten, Dense, Conv2D, BatchNormalization, Dropout, Reshape, MaxPooling2D, Activation
from keras.layers.recurrent import LSTM
from keras.optimizers import Adam


def get_images(list_images):
    # We read the images
    array_imgs = []
    for name in list_images:
        img = cv2.imread(name)
        # Integer division so cv2.resize receives integer dimensions
        img = cv2.resize(img, (img.shape[1] // 6, img.shape[0] // 6))
        array_imgs.append(img)

    return array_imgs


def lstm_model(img_shape):
    model = Sequential()

    model.add(Conv2D(32, (3, 3), padding='same', input_shape=img_shape, activation="relu"))
    model.add(BatchNormalization(axis=-1))
    model.add(MaxPooling2D(pool_size=(3, 3)))
    model.add(Dropout(0.25))

    model.add(Conv2D(64, (3, 3), padding='same', activation="relu"))
    model.add(BatchNormalization(axis=-1))
    model.add(Conv2D(64, (3, 3), padding='same', activation="relu"))
    model.add(BatchNormalization(axis=-1))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))

    model.add(Conv2D(128, (3, 3), padding='same', activation="relu"))
    model.add(BatchNormalization(axis=-1))
    model.add(Conv2D(128, (3, 3), padding='same', activation="relu"))
    model.add(BatchNormalization(axis=-1))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))

    model.add(Flatten())
    model.add(Dense(1024))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))

    model.add(Reshape((1024, 1)))
    model.add(LSTM(10, return_sequences = True))
    model.add(Dropout(0.2))
    model.add(LSTM(10))
    model.add(Dropout(0.2))
    model.add(Dense(5, activation="relu"))
    model.add(Dense(1))
    adam = Adam(lr=0.0001)
    model.compile(optimizer=adam, loss="mse", metrics=['accuracy', 'mse', 'mae'])
    return model


if __name__ == "__main__":

    # Load data
    list_images = glob.glob('Images/' + '*')
    images = sorted(list_images, key=lambda x: int(x.split('/')[1].split('.png')[0]))

    y = [71.71, 56.19, -44.51, 61.90, 67.86, -61.52, -75.73, 44.75, -89.51, 44.75]
    # We preprocess images
    x = get_images(images)

    X_train = x
    y_train = y
    X_t, X_val, y_t, y_val = train_test_split(x, y, test_size=0.20, random_state=42)

    # Variables
    batch_size = 8
    nb_epoch = 200
    img_shape = (39, 53, 3)


    # We adapt the data
    X_train = np.stack(X_train, axis=0)
    y_train = np.stack(y_train, axis=0)
    X_val = np.stack(X_val, axis=0)
    y_val = np.stack(y_val, axis=0)


    # Get model
    model = lstm_model(img_shape)

    model_history_v = model.fit(X_train, y_train, epochs=nb_epoch, batch_size=batch_size, verbose=2,
                              validation_data=(X_val, y_val))
    print(model.summary())


    # We evaluate the model
    score = model.evaluate(X_val, y_val, verbose=0)
    print('Evaluating')
    print('Test loss: ', score[0])
    print('Test accuracy: ', score[1])
    print('Test mean squared error: ', score[2])
    print('Test mean absolute error: ', score[3])


The results are:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 39, 53, 32)        896       
_________________________________________________________________
batch_normalization_1 (Batch (None, 39, 53, 32)        128       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 17, 32)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 13, 17, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 13, 17, 64)        18496     
_________________________________________________________________
batch_normalization_2 (Batch (None, 13, 17, 64)        256       
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 13, 17, 64)        36928     
_________________________________________________________________
batch_normalization_3 (Batch (None, 13, 17, 64)        256       
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 6, 8, 64)          0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 6, 8, 64)          0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 6, 8, 128)         73856     
_________________________________________________________________
batch_normalization_4 (Batch (None, 6, 8, 128)         512       
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 6, 8, 128)         147584    
_________________________________________________________________
batch_normalization_5 (Batch (None, 6, 8, 128)         512       
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 3, 4, 128)         0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 3, 4, 128)         0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 1536)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1024)              1573888   
_________________________________________________________________
activation_1 (Activation)    (None, 1024)              0         
_________________________________________________________________
batch_normalization_6 (Batch (None, 1024)              4096      
_________________________________________________________________
dropout_4 (Dropout)          (None, 1024)              0         
_________________________________________________________________
reshape_1 (Reshape)          (None, 1024, 1)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 1024, 10)          480       
_________________________________________________________________
dropout_5 (Dropout)          (None, 1024, 10)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 10)                840       
_________________________________________________________________
dropout_6 (Dropout)          (None, 10)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 55        
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 6         
=================================================================
Total params: 1,858,789
Trainable params: 1,855,909
Non-trainable params: 2,880
_________________________________________________________________
None
Evaluating
('Test loss: ', 5585.3828125)
('Test accuracy: ', 0.0)
('Test mean squared error: ', 5585.3828125)
('Test mean absolute error: ', 72.8495864868164)

Basic LSTM[edit]

I've followed an LSTM tutorial (https://rubikscode.net/2018/03/26/two-ways-to-implement-lstm-network-using-python-with-tensorflow-and-keras/) to create an LSTM network in Keras. We've classified reviews from the IMDB dataset (https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification). LSTM networks don't just propagate output information to the next time step; they also store and propagate the state of the so-called LSTM cell. This cell holds four neural networks inside – the gates – which decide which information will be stored in the cell state and pushed to the output. So, the output of the network at one time step does not depend only on the previous time step, but on n previous time steps.

The dataset was collected by Stanford researchers back in 2011. It contains 25000 movie reviews (good or bad) for training and the same amount of reviews for testing. Our goal is to create a network that will be able to determine which of these reviews are positive and which are negative. Words are encoded as real-valued vectors in a high dimensional space, where the similarity between words in terms of meaning translates to closeness in the vector space.

The code is the following:

from keras.preprocessing import sequence 
from keras.models import Sequential 
from keras.layers import Dense, Dropout, Embedding, LSTM 
from keras.datasets import imdb

# We load dataset of top 1000 words
num_words = 1000 
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_words)

# We need to divide this dataset and create and pad sequences (using sequence from keras.preprocessing)
# In the padding we used number 200, meaning that our sequences will be 200 words long
X_train = sequence.pad_sequences(X_train, maxlen=200) 
X_test = sequence.pad_sequences(X_test, maxlen=200)

# Define network architecture and compile 
model = Sequential() 
model.add(Embedding(num_words, 50, input_length=200)) 
model.add(Dropout(0.2)) 
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2)) 
model.add(Dense(250, activation='relu')) 
model.add(Dropout(0.2)) 
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) 

# We train the model
model.fit(X_train, y_train, batch_size=64, epochs=15) 

# We evaluate the model
score = model.evaluate(X_test, y_test)
print('Test loss:', score[0])
print('Test accuracy:', score[1])


We got an accuracy of 86.43%.

25000/25000 [==============================] - 134s 5ms/step - loss: 0.2875 - acc: 0.8776
25000/25000 [==============================] - 47s 2ms/step
('Test loss:', 0.32082191239356994)
('Test accuracy:', 0.86428)

Week 36: Results table, Data analysis, CARLA simulator, Udacity simulator[edit]

CARLA simulator[edit]

CARLA (http://proceedings.mlr.press/v78/dosovitskiy17a/dosovitskiy17a.pdf, http://carla.org/, https://carla.readthedocs.io/en/latest/, https://github.com/carla-simulator/carla) is an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support development, training, and validation of autonomous driving systems. In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. The simulation platform supports flexible specification of sensor suites and environmental conditions.

CARLA simulates a dynamic world and provides a simple interface between the world and an agent that interacts with the world. To support this functionality, CARLA is designed as a server-client system, where the server runs the simulation and renders the scene. The client API is implemented in Python and is responsible for the interaction between the autonomous agent and the server via sockets. The client sends commands and meta-commands to the server and receives sensor readings in return. Commands control the vehicle and include steering, accelerating, and braking. Meta-commands control the behavior of the server and are used for resetting the simulation, changing the properties of the environment, and modifying the sensor suite. Environmental properties include weather conditions, illumination, and density of cars and pedestrians. When the server is reset, the agent is re-initialized at a new location specified by the client.

CARLA allows for flexible configuration of the agent’s sensor suite. Sensors are limited to RGB cameras and to pseudo-sensors that provide ground-truth depth and semantic segmentation. The number of cameras and their type and position can be specified by the client. Camera parameters include 3D location, 3D orientation with respect to the car’s coordinate system, field of view, and depth of field. Its semantic segmentation pseudo-sensor provides 12 semantic classes: road, lane-marking, traffic sign, sidewalk, fence, pole, wall, building, vegetation, vehicle, pedestrian, and other.

CARLA provides a range of measurements associated with the state of the agent and compliance with traffic rules. Measurements of the agent’s state include vehicle location and orientation with respect to the world coordinate system, speed, acceleration vector, and accumulated impact from collisions. Measurements concerning traffic rules include the percentage of the vehicle’s footprint that impinges on wrong-way lanes or sidewalks, as well as states of the traffic lights and the speed limit at the current location of the vehicle. Finally, CARLA provides access to exact locations and bounding boxes of all dynamic objects in the environment. These signals play an important role in training and evaluating driving policies.

CARLA has the following features:

  • Scalability via a server multi-client architecture: multiple clients in the same or in different nodes can control different actors.
  • Flexible API: CARLA exposes a powerful API that allows users to control all aspects related to the simulation, including traffic generation, pedestrian behaviors, weathers, sensors, and much more.
  • Autonomous Driving sensor suite: users can configure diverse sensor suites including LIDARs, multiple cameras, depth sensors and GPS among others.
  • Fast simulation for planning and control: this mode disables rendering to offer a fast execution of traffic simulation and road behaviors for which graphics are not required.
  • Maps generation: users can easily create their own maps following the OpenDrive standard via tools like RoadRunner.
  • Traffic scenarios simulation: our engine ScenarioRunner allows users to define and execute different traffic situations based on modular behaviors.
  • ROS integration: CARLA is provided with integration with ROS via our ROS-bridge.
  • Autonomous Driving baselines: we provide Autonomous Driving baselines as runnable agents in CARLA, including an AutoWare agent and a Conditional Imitation Learning agent.

CARLA requires Ubuntu 16.04 or later. CARLA consists mainly of two modules, the CARLA Simulator and the CARLA Python API module. The simulator does most of the heavy work, controls the logic, physics, and rendering of all the actors and sensors in the scene; it requires a machine with a dedicated GPU to run. The CARLA Python API is a module that you can import into your Python scripts, it provides an interface for controlling the simulator and retrieving data. With this Python API you can, for instance, control any vehicle in the simulation, attach sensors to it, and read back the data these sensors generate.
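
As a rough illustration of this client API, here is a minimal sketch assuming the CARLA 0.9.x Python module and a simulator already running on localhost:2000 (this is not code from the project):

import carla

# Connect to a running CARLA server
client = carla.Client('localhost', 2000)
client.set_timeout(10.0)
world = client.get_world()

# Spawn a vehicle at one of the map's spawn points
blueprint = world.get_blueprint_library().filter('vehicle.*')[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(blueprint, spawn_point)

# Send a simple control command (throttle and steering)
vehicle.apply_control(carla.VehicleControl(throttle=0.5, steer=0.0))

# Clean up
vehicle.destroy()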

Udacity's Self-Driving Car Simulator[edit]

Udacity's Self-Driving Car Simulator (https://github.com/udacity/self-driving-car-sim, https://www.youtube.com/watch?v=EcS5JPSH-sI, https://eu.udacity.com/course/self-driving-car-engineer-nanodegree--nd013) was built for Udacity's Self-Driving Car Nanodegree, to teach students how to train cars how to navigate road courses using deep learning.

Data analysis[edit]

I've analyzed the velocity data (v and w) of the dataset. In particular, I've counted the number of samples in different velocity ranges. For the angular velocity I have used bins of width 0.3, and I have divided the linear velocities into negative values and speeds equal to 5, 9, 11 and 13.

The results (number of data) for w are:

w < -2.9             ===> 1
-2.9 <= w < -2.6     ===> 13
-2.6 <= w < -2.3     ===> 20
-2.3 <= w < -2.0     ===> 50
-2.0 <= w < -1.7     ===> 95
-1.7 <= w < -1.4     ===> 165
-1.4 <= w < -1.1     ===> 385
-1.1 <= w < -0.8     ===> 961
-0.8 <= w < -0.5     ===> 2254
-0.5 <= w < -0.2     ===> 1399
-0.2 <= w < -0.0     ===> 3225
0.0 <= w < 0.2       ===> 3399
0.2 <= w < 0.5       ===> 1495
0.5 <= w < 0.8       ===> 2357
0.8 <= w < 1.1       ===> 937
1.1 <= w < 1.4       ===> 300
1.4 <= w < 1.7       ===> 128
1.7 <= w < 2.0       ===> 76
2.0 <= w < 2.3       ===> 41
2.3 <= w < 2.6       ===> 31
2.6 <= w < 2.9       ===> 8
w >= 2.9             ===> 1


The results (number of data) for v are:

v <= 0             ===> 197
v = 5              ===> 9688
v = 9              ===> 3251
v = 11             ===> 2535
v = 13             ===> 1670
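
These counts can be reproduced with a short numpy sketch like the following (the arrays here are random placeholders for the real recorded velocities):

import numpy as np

# Placeholder velocity arrays standing in for the recorded dataset
w = np.random.uniform(-3.2, 3.2, 17341)
v = np.random.choice([-1.0, 5.0, 9.0, 11.0, 13.0], 17341)

# Angular velocity: the bin edges listed above
w_edges = [-2.9, -2.6, -2.3, -2.0, -1.7, -1.4, -1.1, -0.8, -0.5, -0.2,
           0.0, 0.2, 0.5, 0.8, 1.1, 1.4, 1.7, 2.0, 2.3, 2.6, 2.9]
w_counts, _ = np.histogram(w, bins=w_edges)
print('w < -2.9 ===>', np.sum(w < -2.9))
for low, high, count in zip(w_edges[:-1], w_edges[1:], w_counts):
    print('%.1f <= w < %.1f ===> %d' % (low, high, count))
print('w >= 2.9 ===>', np.sum(w >= 2.9))

# Linear velocity: negative speeds and the discrete values 5, 9, 11, 13
print('v <= 0 ===>', np.sum(v <= 0))
for value in [5, 9, 11, 13]:
    print('v = %d ===>' % value, np.sum(v == value))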

Results table (cropped image)[edit]

Driving results
Manual Classification 1v+7w biased Classification 4v+7w biased Classification 1v+7w balanced Classification 4v+7w balanced Classification 1v+7w imbalanced Classification 4v+7w imbalanced Pilotnet constant v + w normal Pilotnet v + w normal Pilotnet constant v + w multiple (stacked) Pilotnet v + w multiple (stacked) LSTM
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 75% 10% 80% 25% 10% 10% 100% 3min 46s 10% 100% 3min 46s 10%
Simple (anti-clockwise) 100% 1min 33s 25% 15% 65% 65% 100% 2min 16s 25% 100% 3min 46s 10% 100% 3min 46s 25%
Monaco (clockwise) 100% 1min 15s 45% 5% 45% 5% 45% 45% 45% 100% 1min 19s 100% 2min 56s 5%
Monaco (anti-clockwise) 100% 1min 15s 5% 10% 7% 5% 7% 10% 60% 100% 1min 23s 50% 5%
Nurburgrin (clockwise) 100% 1min 02s 8% 8% 8% 8% 8% 8% 8% 8% 8% 8%
Nurburgrin (anti-clockwise) 100% 1min 02s 5% 5% 80% 80% 80% 80% 60% 80% 60% 3%

Results table (whole image)[edit]

Driving results
Manual Classification 1v+7w biased Classification 4v+7w biased Classification 1v+7w balanced Classification 4v+7w balanced Classification 1v+7w imbalanced Classification 4v+7w imbalanced Pilotnet constant v + w normal Pilotnet v + w normal Pilotnet constant v + w multiple (stacked) Pilotnet v + w multiple (stacked) DeepestLSTM-Tinypilotnet constant v + w DeepestLSTM-Tinypilotnet v + w LSTM
Circuits Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time Percentage Time
Simple (clockwise) 100% 1min 35s 100% 2min 17s 75% 90% 10% 75% 25% 100% 3min 46s 25% 10% 10% 10% 10%
Simple (anti-clockwise) 100% 1min 33s 100% 2min 17s 65% 98% 10% 75% 65% 100% 3min 46s 25% 15% 12% 25% 10%
Monaco (clockwise) 100% 1min 15s 5% 5% 5% 5% 5% 5% 45% 93% 100% 2min 56s 5% 45% 100% 1min 20s
Monaco (anti-clockwise) 100% 1min 15s 5% 5% 5% 5% 5% 30% 8% 100% 1min 26s 7% 5% 100% 2min 56s 95%
Nurburgrin (clockwise) 100% 1min 02s 8% 8% 8% 8% 8% 8% 8% 8% 8% 8% 8% 8%
Nurburgrin (anti-clockwise) 100% 1min 02s 3% 10% 80% 10% 80% 80% 80% 80% 3% 3% 3% 3%

DeepestLSTM-Tinypilotnet[edit]

I've trained a new model: DeepestLSTM-Tinypilotnet:

Week 35: Driving videos, Stacked network[edit]

Driving videos[edit]

Stacked network[edit]

I've used the predictions of the stacked (pilotnet with stacked frames) network (regression network) to drive a formula 1:

I've used the predictions of the stacked (pilotnet with stacked frames) network (regression network) for w and constant v to drive a formula 1:

Pilotnet network[edit]

I've used the predictions of the pilotnet network (regression network) to drive a formula 1 (test2):

In the following video it completes one lap (simulation time: 1min 26s):


I've used the predictions of the pilotnet network (regression network) for w and constant v to drive a formula 1:

Biased classification network[edit]

I've used the predictions of the classification network according to w (7 classes) and constant v to drive a formula 1:


Stacked network[edit]

In this method (stacked frames), we concatenate multiple subsequent input images to create a stacked image. Then, we feed this stacked image to the network as a single input. We refer to this method as stacked. This means that for image it at time/frame t, images it−1, it−2, ... will be concatenated. In our case, we have stacked 3 images with 2 frames between consecutive images, i.e., for image it at time/frame t we concatenate the images it, it-3 and it-6.
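
A minimal sketch of this stacking, concatenating the selected frames along the channel axis (the frames here are random placeholders):

import numpy as np

def stack_frames(frames, t, gap=3, n_images=3):
    # Concatenate frames t, t-gap, t-2*gap, ... along the channel axis
    selected = [frames[t - k * gap] for k in range(n_images)]
    return np.concatenate(selected, axis=2)

# Placeholder buffer of BGR frames
frames = [np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8) for _ in range(20)]
stacked = stack_frames(frames, t=10)
print(stacked.shape)   # (120, 160, 9): a single 9-channel input for the network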

Week 34: Driving videos[edit]

Driving videos[edit]

Biased classification network[edit]

I've used the predictions of the classification network according to w (7 classes) and constant v to drive a formula 1 (simulation time: 2min 17s):


Balanced classification network[edit]

I've used the predictions of the classification network according to w (7 classes) and constant v to drive a formula 1:


Pilotnet network[edit]

I've used the predictions of the pilotnet network (regression network) to drive a formula 1 (test1):

I've used the predictions of the pilotnet network (regression network) to drive a formula 1 (test2):


I've used the predictions of the pilotnet network (regression network) for w and constant v to drive a formula 1 (simulation time: 3min 46s):

Reading information[edit]

This week, I read some papers about Deep Learning for Steering Autonomous Vehicles. Some of these papers are:

Interpretable Learning for Self-Driving Cars by Visualizing Causal Attention[edit]

In this paper (https://arxiv.org/pdf/1703.10631.pdf), they use a visual attention model to train a convolutional network end-to-end from images to steering angle. The attention model highlights image regions that potentially influence the network’s output. Some of these are true influences, but some are spurious. They then apply a causal filtering step to determine which input regions actually influence the output. This produces more succinct visual explanations and more accurately exposes the network’s behavior. They demonstrate the effectiveness of their model on three datasets totaling 16 hours of driving. They first show that training with attention does not degrade the performance of the end-to-end network. Then they show that the network causally cues on a variety of features that are used by humans while driving.

Their model predicts continuous steering angle commands from input raw pixels in an end-to-end manner. Their model predicts the inverse turning radius ût at every timestep t instead of steering angle commands, which depend on the vehicle’s steering geometry and also result in numerical instability when predicting near-zero steering angle commands. The relationship between the inverse turning radius ut and the steering angle command θt can be approximated by Ackermann steering geometry.
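
For a simple bicycle/Ackermann model this relationship is tan(θ) = L · u, with L the wheelbase and u the inverse turning radius; a small sketch (the wheelbase value is hypothetical, and the paper's steering-ratio and slip terms are omitted):

import numpy as np

def steering_from_inverse_radius(u, wheelbase=2.7):
    # Simple bicycle/Ackermann approximation: tan(theta) = wheelbase * u
    # wheelbase is a hypothetical value in metres
    return np.degrees(np.arctan(wheelbase * u))

print(steering_from_inverse_radius(0.05))   # small inverse radius -> small steering angle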

To reduce computational cost, each raw input image is down-sampled and resized to 80×160×3 with nearest-neighbor scaling algorithm. For images with different raw aspect ratios, they cropped the height to match the ratio before down-sampling. They also normalized pixel values to [0, 1] in HSV colorspace. They utilize a single exponential smoothing method to reduce the effect of human factors-related performance variation and the effect of measurement noise.
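
A minimal sketch of this preprocessing with OpenCV (the aspect-ratio cropping and smoothing steps are omitted, and the normalization is simplified; note that OpenCV stores H in [0, 179]):

import cv2
import numpy as np

def preprocess(image):
    # Down-sample to 80x160 with nearest-neighbour scaling and map HSV pixels to [0, 1]
    resized = cv2.resize(image, (160, 80), interpolation=cv2.INTER_NEAREST)
    hsv = cv2.cvtColor(resized, cv2.COLOR_BGR2HSV)
    return hsv.astype(np.float32) / 255.0

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)   # placeholder frame
print(preprocess(frame).shape)   # (80, 160, 3)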

They use a convolutional neural network to extract a set of encoded visual feature vectors, which they refer to as a convolutional feature cube xt. Each feature vector may contain high-level object descriptions that allow the attention model to selectively pay attention to certain parts of an input image by choosing a subset of feature vectors. They use a 5-layered convolution network that is utilized by Bojarski (https://arxiv.org/pdf/1604.07316.pdf) to learn a model for self-driving cars. They omit max-pooling layers to prevent spatial locational information loss as the strongest activation propagates through the model. They collect a three-dimensional convolutional feature cube xt from the last layer by pushing the preprocessed image through the model, and the output feature cube will be used as an input of the LSTM layers. Using this convolutional feature cube from the last layer has advantages in generating high-level object descriptions, thus increasing interpretability and reducing computational burdens for a real-time system.

They utilize a deterministic soft attention mechanism that is trainable by standard backpropagation methods, which thus has advantages over a hard stochastic attention mechanism that requires reinforcement learning. They use a long short-term memory (LSTM) network that predicts the inverse turning radius and generates attention weights at each timestep t conditioned on the previous hidden state and a current convolutional feature cube xt. More formally, let us assume a hidden layer conditioned on the previous hidden state and the current feature vectors. The attention weight for each spatial location i is then computed by a multinomial logistic regression function.

The last step of their pipeline is a fine-grained decoder, in which they refine a map of attention and detect local visual saliencies. Though an attention map from their coarse-grained decoder provides probability of importance over a 2D image space, their model needs to determine specific regions that cause a causal effect on prediction performance. To this end, they assess a decrease in performance when a local visual saliency on an input raw image is masked out. They first collect a consecutive set of attention weights and input raw images for a user-specified T timesteps. They then create a map of attention (Mt). Their 5-layer convolutional neural network uses a stack of 5×5 and 3×3 filters without any pooling layer, and therefore the input image of size 80×160 is processed to produce the output feature cube of size 10×20×64, while preserving its aspect ratio. To extract a local visual saliency, they first randomly sample N 2D particles with replacement over an input raw image conditioned on the attention map Mt. They also use the time axis as the third dimension to consider temporal features of visual saliencies. They thus store spatio-temporal 3D particles. They then apply a clustering algorithm to find a local visual saliency by grouping 3D particles into clusters. In their experiment, they use DBSCAN, a density-based clustering algorithm that has advantages to deal with a noisy dataset because they group particles together that are closely packed, while marking particles as outliers that lie alone in low-density regions. For points of each cluster and each time frame t, they compute a convex hull to find a local region of each visual saliency detected.
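
A minimal sketch of the particle sampling and DBSCAN clustering steps (the attention map here is random, the temporal dimension and the convex-hull step are omitted, and eps/min_samples are arbitrary values):

import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder 10x20 attention map (would come from the coarse-grained decoder)
attention = np.random.rand(10, 20)
attention /= attention.sum()

# Sample N 2D particles with replacement, with probability given by the attention map
N = 500
idx = np.random.choice(attention.size, size=N, p=attention.ravel())
rows, cols = np.unravel_index(idx, attention.shape)
particles = np.stack([rows, cols], axis=1).astype(np.float64)

# Group particles into local visual saliencies; points labelled -1 are outliers
labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(particles)
print('clusters found:', len(set(labels)) - (1 if -1 in labels else 0))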


Event-based Vision meets Deep Learning on Steering Prediction for Self-driving Cars[edit]

This paper (http://openaccess.thecvf.com/content_cvpr_2018/papers/Maqueda_Event-Based_Vision_Meets_CVPR_2018_paper.pdf, https://www.youtube.com/watch?v=_r_bsjkJTHA&feature=youtu.be) presents a deep neural network approach that unlocks the potential of event cameras on the prediction of a vehicle’s steering angle. They evaluate the performance of their approach on a publicly available large scale event-camera dataset (≈1000 km). They present qualitative and quantitative explanations of why event cameras (bio-inspired sensors that do not acquire full images at a fixed frame-rate but rather have independent pixels that output only intensity changes asynchronously at the time they occur) allow robust steering prediction even in cases where traditional cameras fail. Finally, they demonstrate the advantages of leveraging transfer learning from traditional to event-based vision, and show that their approach outperforms state-of-the-art algorithms based on standard cameras.

They propose a learning approach that takes as input the visual information acquired by an event camera and outputs the vehicle’s steering angle. The events are converted into event frames by pixel-wise accumulation over a constant time interval. Then, a deep neural network maps the event frames to steering angles by solving a regression task.
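
As a minimal sketch of this event-to-frame conversion (assuming events are tuples (t, x, y, polarity); this is not the authors' implementation), positive and negative events inside a fixed time interval can be accumulated pixel-wise into separate channels:

import numpy as np

def events_to_frame(events, t_start, dt, height, width):
    # Accumulate the events falling in [t_start, t_start + dt) into two channels,
    # one per polarity, by counting events at each pixel.
    frame = np.zeros((height, width, 2), dtype=np.float32)
    for t, x, y, p in events:
        if t_start <= t < t_start + dt:
            channel = 0 if p > 0 else 1
            frame[y, x, channel] += 1.0
    return frame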

They preprocess the data. The steering-angle distribution of a driving car is strongly peaked in [−5º, 5º], and this unbalanced distribution results in a biased regression. Moreover, when the vehicle stands still, only noisy events are produced. To handle these problems, they preprocessed the steering angles to allow successful learning. To cope with the first issue, only 30% of the data corresponding to a steering angle lower than 5º is used at training time. For the second issue, they filtered out data corresponding to a vehicle speed lower than 20 km/h. To remove outliers, the filtered steering angles are then trimmed at three times their standard deviation and normalized to the range [−1, 1]. At testing time, all data corresponding to a steering angle lower than 5º is considered, as well as scenarios under 20 km/h. The regressed steering angles are denormalized to output values in the range [−180º, 180º]. Finally, they scale the network input (i.e., the event images) to the range [0, 1].
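
A rough numpy sketch of this preprocessing, with the thresholds taken from the description above and all function and variable names being assumptions:

import numpy as np

def preprocess_steering(angles, speeds, keep_low_fraction=0.3):
    angles = np.asarray(angles, dtype=np.float64)
    speeds = np.asarray(speeds, dtype=np.float64)
    # Keep only ~30% of the samples with |steering angle| < 5 degrees.
    low = np.abs(angles) < 5.0
    keep = (~low) | (np.random.rand(len(angles)) < keep_low_fraction)
    # Discard samples where the vehicle moves slower than 20 km/h.
    keep &= speeds >= 20.0
    filtered = angles[keep]
    # Trim outliers at three standard deviations and normalize to [-1, 1].
    limit = 3.0 * filtered.std()
    normalized = np.clip(filtered, -limit, limit) / limit
    return normalized, keep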

Initially, they stack event frames of different polarity, creating a 2D event image. Afterwards, they deploy a series of ResNet architectures, namely ResNet18 and ResNet50, and use them as feature extractors for the regression problem, considering only their convolutional layers. To encode the image features extracted from the last convolutional layer into a vectorized descriptor, they use a global average pooling layer that returns the channel-wise mean of the features. After the global average pooling, they add a fully-connected (FC) layer (256-dimensional for ResNet18 and 1024-dimensional for ResNet50), followed by a ReLU non-linearity and a final one-dimensional fully-connected layer that outputs the predicted steering angle.
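
A possible Keras approximation of this regression head is sketched below. Keras does not ship a ResNet18, so only a ResNet50-based variant is shown, and the input shape is an assumption (the real event images have two polarity channels, so the first layer would need adapting).

from keras.applications import ResNet50
from keras.layers import Dense, Input
from keras.models import Model

inputs = Input(shape=(224, 224, 3))                                           # assumed input shape
features = ResNet50(include_top=False, weights=None, pooling='avg')(inputs)   # conv layers + global average pooling
x = Dense(1024, activation='relu')(features)                                  # 1024-d FC head for the ResNet50 variant
steering = Dense(1)(x)                                                        # predicted steering angle
model = Model(inputs, steering)
model.compile(optimizer='adam', loss='mse')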

They use the DAVIS Driving Dataset 2017 (DDD17). It contains approximately 12 hours of annotated driving recordings collected by a car under different and challenging weather, road and illumination conditions. The dataset includes asynchronous events as well as synchronous grayscale frames.

They predict steering angles using three different types of visual inputs: (1) grayscale images, (2) difference of grayscale images, and (3) images created by event accumulation. To evaluate the performance, they use the root-mean-squared error (RMSE) and the explained variance (EVA).

They analyze the performance of the network as a function of the integration time used to generate the input event images from the event stream (10, 25, 50, 100, and 200 ms). It can be observed that the larger the integration time, the longer the trace of events appearing at the contours of objects, because they moved a longer distance on the image plane during that time. They hypothesize that the network exploits such motion cues to provide a reliable steering prediction. The network performs best when it is trained on event images corresponding to 50 ms, and the performance degrades for both smaller and larger integration times.

They perform an extensive study to evaluate the advantages of event frames over grayscale-based ones for different parts of the day. For a fair comparison, they deploy the same convolutional network architectures as feature encoders for all considered inputs, but train each network independently. The average RMSE varies somewhat among the different sets. This is to be expected, since RMSE depends on the absolute value of the steering ground truth and is therefore not a good metric for cross-comparison between sequences. On the other hand, EVA gives a better way to compare the quality of the learned estimator across different sequences. They observe a very large performance gap between the grayscale-difference and the event images for the ResNet18 architecture. The main reasons they identified for this behaviour are: (i) abrupt changes in lighting conditions occasionally produce artifacts in grayscale images (and therefore also in their differences), and (ii) at high velocities, grayscale images get blurred and their difference also becomes very noisy. However, the ResNet50 architecture produced a significant performance improvement for both baselines (grayscale images and difference of grayscale images).

To evaluate the ability of the proposed methodology to cope with large variations in illumination, driving and weather conditions, they trained a single regressor over the entire dataset. They compare their approach to state-of-the-art architectures that use traditional frames as input: (i) Bojarski et al. (https://arxiv.org/pdf/1604.07316.pdf) and (ii) the CNN-LSTM architecture advocated in https://arxiv.org/pdf/1612.01079.pdf, but without the additional segmentation loss, which is not available in their dataset. All their proposed architectures based on event images largely outperform the considered baselines based on traditional frames.

Week 33: Reading information[edit]

Reading information[edit]

I've read some information about self-driving. I've read:

From Pixels to Actions: Learning to Drive a Car with Deep Neural Networks[edit]

In this paper (http://homes.esat.kuleuven.be/~jheylen/FromPixelsToActions/FromPixelsToActionsPaper_Wacv18.pdf), they analyze an end-to-end neural network that predicts a car's steering actions on a highway based on images taken from a single car-mounted camera. They focus their analysis on several aspects: the input data format, the temporal dependencies between consecutive inputs, and the origin of the data. For the task at hand, regression networks outperform their classifier counterparts. In addition, there seems to be only a small difference between networks that use coloured images and those that use grayscale images as input. For the second aspect, feeding the network three concatenated images yields a significant decrease of 30% in mean squared error. For the third aspect, using simulation data they are able to train networks whose performance is comparable to networks trained on real-life datasets. They also qualitatively demonstrate that the standard metrics used to evaluate networks do not necessarily accurately reflect a system's driving behaviour: a promising confusion matrix may result in poor driving behaviour, while a very ill-looking confusion matrix may result in good driving behaviour.

The main architecture is a variation of either the NVIDIA, AlexNet, or VGG19 architecture. For the AlexNet architecture, they removed the dropout of the final two dense layers and reduced their sizes to 500 and 200 neurons, as this resulted in better performance. The output layer of the network depends on its type (regression or classification) and, for a classification network, on the number of classes. For the classification type, they quantize the steering angle measurements into discrete values, which represent the class labels. This quantization is needed as input when training a classifier network and allows balancing the data through sample weighting. This weighting acts as a coefficient on the network's learning rate for each sample. A sample's weight is directly related to the class it belongs to when quantized. These class weights are defined as 1 divided by the number of samples in the training set that belong to that class, multiplied by a constant so that the smallest class weight is equal to 1. Sample weighting is done for both classifier networks and regression networks.
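
A small sketch of this class-weighting rule (hypothetical helper, not the paper's code): the weight of each class is one over its sample count, rescaled so that the smallest weight equals 1.

import numpy as np

def compute_class_weights(class_labels):
    classes, counts = np.unique(class_labels, return_counts=True)
    weights = 1.0 / counts               # inverse frequency
    weights = weights / weights.min()    # rescale so the smallest class weight is 1
    return dict(zip(classes, weights))

labels = [0] * 800 + [1] * 150 + [2] * 50    # toy, heavily imbalanced label set
print(compute_class_weights(labels))         # {0: 1.0, 1: 5.33..., 2: 16.0}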

They train and evaluate different networks on the Comma.ai dataset, which consists of 7.25 hours of driving, most of it on highways and during daytime. Images are captured at 20 Hz, which results in approximately 552,000 images. They discarded the few sequences that were recorded at night due to their high imbalance compared to those captured during daytime, and they limit themselves to images captured while driving on highways. The data is then split into two mutually exclusive partitions: a training set of 161,500 images and a validation set of 10,700 images.

They evaluate the performance of their networks using the following metrics: accuracy, mean class accuracy (MCA), mean absolute error (MAE) and mean squared error (MSE). They base their conclusions on the MSE metric, since it takes the magnitude of the error into account and assigns a higher loss to large errors than MAE does. This is desirable since it may lead to better driving behaviour, as they assume that it is easier for the system to recover from many small mistakes than from a few big ones; a large prediction error could result in a big sudden change of the steering wheel angle.

In the first experiment (quantization granularity), they look into the influence that the specifications of the class quantization procedure have on the system's performance. These specifications consist of the number of classes and the mapping from the input range to these classes. They compare classifier networks with varying degrees of input measurement granularity, and also compare them to regression networks, which can be seen as having infinitely many classes, although using a different loss function. They conduct this experiment by comparing a coarse-grained quantization scheme with 7 classes and a finer-grained scheme with 17 classes. For regression, the difference between 7 and 17 lies in the class weighting: each sample is given a weight based on its relative occurrence among 7 or 17 classes. Also, to be able to compare regression vs classification, the predicted regression outputs were discretized into 7 and 17 classes to calculate MCA in the same way as for the classification networks. The coarse-grained scheme scores better on the accuracy and MCA metrics. Regression networks significantly outperform classifier networks on the MAE and MSE metrics, which are the most important ones. Finally, they notice that class weighting does not have a significant impact on the performance of regression networks.

The second experiment is about the image colour scheme. They observed no significant difference in performance between networks that use coloured (RGB) images and those that use grayscale images as input. This suggests that, for the task at hand, the system is not able to take much advantage of the colour information.

They evaluate methods that enable the system to take advantage of information that co-occurs in consecutive inputs. This could lead to a significant increase in performance, since the input images are obtained from successive frames of a video, which introduces temporal consistencies.

In the first method (stacked frames), they concatenate multiple subsequent input images to create a stacked image and feed this stacked image to the network as a single input. This means that for image it at time/frame t, images it−1, it−2, ... are concatenated. To measure the influence of the stacked input, the input size must be the only variable; for this reason, the images are concatenated in the depth (channel) dimension and not in a new, 4th dimension. For example, stacking two previous images onto the current RGB image of 160x320x3 pixels changes its size to 160x320x9 pixels. By doing this, the architecture stays the same, since the first layer remains a 2D convolution layer. They expect that by taking advantage of the temporal information between consecutive inputs, the network should be able to outperform networks that perform independent predictions on single images. They compare single images to stacked frames of 2, 5 or 8 images. The results show that feeding the network stacked frames increases the performance on all metrics. Looking at MSE, there is a significant decrease of about 30% when comparing single images to stacked frames of 3 images. They assume this is because the network can make a prediction based on the averaged information of multiple images: for a single image the predicted value may be too high or too low, while for concatenated images the combined information can cancel out, giving a better 'averaged' prediction. Increasing the number of concatenated images only leads to small improvements with diminishing returns. Assuming that the network averages the images in some way, they do not want to increase this number further because the network would lose responsiveness. Based on these observations, in their setting the configuration with 3 concatenated frames is preferable: it offers a significant boost in performance while the system remains relatively responsive.
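
A minimal numpy sketch of the stacking idea described above (hypothetical helper): the current frame and the previous frames are concatenated along the channel axis, so an RGB frame of 160x320x3 stacked with two previous frames becomes 160x320x9.

import numpy as np

def stack_frames(frames, t, n=3):
    # Return frames t, t-1, ..., t-n+1 concatenated in the channel (depth) dimension;
    # the first frame of the video is repeated at the start of the sequence.
    selected = [frames[max(t - i, 0)] for i in range(n)]
    return np.concatenate(selected, axis=-1)

video = np.random.rand(100, 160, 320, 3).astype(np.float32)   # dummy video
print(stack_frames(video, t=10, n=3).shape)                   # (160, 320, 9)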

In the second technique, they modify the architecture to include recurrent neural network layers, using long short-term memory (LSTM) layers. These layers allow capturing the temporal information between consecutive inputs. The networks are trained on an input vector that consists of the input image and a number of preceding images, just like the stacked frames. Together with their training methodology, this results in a time window; due to the randomization in training, this is not a sliding window but a window at a random point in time for every input sample. They compared many variations of the NVIDIA architecture: a configuration where one or both of the two dense layers are changed to LSTM layers, one where an LSTM layer is added after the dense layers, and one where the output layer is changed to an LSTM. Training these networks from scratch led to very poor performance. This might be caused by the fact that, while the LSTM offers increased capabilities, it also has more parameters that need to be learned; they hypothesize that their dataset is too small for this, especially without data augmentation. They therefore load a pretrained network when creating an LSTM network. This pretrained network is the NVIDIA network variant from their granularity experiment with the corresponding output type. Depending on the exact architecture of the LSTM network, the weights of corresponding layers are copied; weights of non-corresponding layers are initialized as usual. The weights of the convolutional layers are frozen, as they have already been trained to detect the important features, and this reduces the training time. The results show that the incorporation of LSTM layers neither increased nor reduced the network's performance.

A last aspect they investigate is the origin of the data. They look into the advantages of a simulator over a real-world dataset and the uses of such a simulator. A simulator brings many advantages: for example, data gathering is easy, cheap and can be automated. First, the Udacity simulator is used to generate three datasets. The first dataset is gathered by manually driving around the first test track in the simulator. The second dataset consists of recovery cases only; it is gathered by diverging from the road and recording the recovery back to the middle of the road. A third, validation dataset is gathered by driving around the track in the same way as for the first dataset. The NVIDIA architecture with a regression output is used, and no sample weighting is applied during training.

The first experiment tests the performance of a network trained solely on the first dataset. The metrics are comparable to other runs on the real dataset. As the confusion matrix has a dense diagonal, good real-time driving performance is expected. When driving in the simulator, the network starts off quite well and stays nicely in the middle of the road. When it encounters a more difficult sharp turn, the network slightly mispredicts some frames; the car deviates from the middle of the road, is not able to recover from its mispredictions, and eventually goes completely off-track. They conclude that, despite promising performance on the traditional metrics, the system fails to keep the car on the road.

The second experiment evaluates the influence of adding recovery data. First, a new network is trained solely on the recovery dataset. Its confusion matrix is focused on steering sharply to the left or right. As it does not look very promising and the MCA is very low, it is expected that this network will not perform very well during real-time driving. Despite the low performance on these metrics, the network manages to keep the car on track. The car, however, does not stay exactly in the middle of the road: it diverts from the centre, recovers back towards the middle, then diverts towards the other side and back to the middle again, and so on. The car thus wobbles softly during the straight parts of the track, but handles the sharp turns surprisingly well.

A third network is trained on both datasets and has a confusion matrix similar to the first network. In the simulator it performs quite well, driving smoothly in the middle of the lane on the straight parts as well as in sharp turns. They conclude that recovery cases have a significant impact on the system's driving behaviour: by adding these recovery cases, the driving performance of the system improves while its performance on the metrics deteriorates. This again suggests that the standard metrics might not be a good tool to accurately assess a network's driving behaviour.

GTA V is integrated as a more realistic simulator platform and offers a dataset composed of 600k images, split into 430k training images and 58k validation images. An NVIDIA and an AlexNet regression network are trained on this dataset with sample weights based on 17 classes. The network shows performance metrics similar to the NVIDIA regression network trained on the real-world dataset. They evaluate real-time driving performance on an easy, non-urban road with clear lane markings. The network performs quite well and stays around the centre of the lane. When approaching a road with vague lane markings, such as a small bridge, the car deviates towards the opposite lane. When it reaches a three-way crossing, the network cannot decide whether to go left or right, as it was equally trained on both cases; because of this, it drives straight and goes off-road. In an urban environment, the network struggles with the same problem, resulting in poor real-time performance. Again, the current metrics are not always representative of real-time driving performance.

Weeks 31, 32: Driving video, New Dataset, Number of data for each class, Data statistics, Effect of image colourspace on CNN[edit]

Driving videos[edit]

Lstm_tinypilotnet[edit]

I've used the predictions of the lstm_tinypilotnet network (regression network) to drive a formula 1:

Imbalanced classification network[edit]

I've used the predictions of the classification network according to w (7 classes) and constant v to drive a formula 1:

I've used the predictions of the classification network according to w (7 classes) and v (4 classes) to drive a formula 1:


Balanced classification network[edit]

I've used the predictions of the classification network according to w (7 classes) and constant v to drive a formula 1:

I've used the predictions of the classification network according to w (7 classes) and v (4 classes) to drive a formula 1:


Biased classification network[edit]

I've used the predictions of the classification network according to w (7 classes) and constant v to drive a formula 1:

I've used the predictions of the classification network according to w (7 classes) and v (4 classes) to drive a formula 1:

New Dataset[edit]

I've based the new dataset on the code created for the follow-line practice of JdeRobot Academy (http://vanessavisionrobotica.blogspot.com/2018/05/practica-1-follow-line-prueba-2.html). This new dataset has been generated using 3 circuits so that the data is more varied: the worlds monacoLine.world, f1.launch (simpleCircuit.world) and f1-chrono.launch have been used. Once the dataset is complete, a file (https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/blob/master/Follow%20Line/dl-driver/Net/split_train_test.py) has been used to divide the data into train and test. It has been decided to split the dataset into 70% for train and 30% for test. Since the dataset had 17341 pairs of values, we now have 12138 pairs of train values and 5203 pairs of test values.
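
The real splitting logic lives in split_train_test.py in the repository; the following is only a minimal illustration of a 70/30 shuffle split, with the file layout assumed:

import json
import random

with open('data.json') as f:
    samples = json.load(f)          # assumed: a list of image/speed entries

random.seed(0)
random.shuffle(samples)
cut = int(0.7 * len(samples))       # 70% train, 30% test
with open('train.json', 'w') as f:
    json.dump(samples[:cut], f)
with open('test.json', 'w') as f:
    json.dump(samples[cut:], f)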

Number of data for each class[edit]

I've used a script (evaluate_class.py) that shows a graph of the number of examples that exist for each class (7 classes of w). In the following images we can see the graph for the training data:
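
A sketch of what such a per-class count could look like (the 'class_w' key and the data layout are assumptions, not necessarily what evaluate_class.py does):

import json
from collections import Counter
import matplotlib.pyplot as plt

with open('train.json') as f:
    data = json.load(f)                                  # assumed list of dicts

counts = Counter(sample['class_w'] for sample in data)   # count examples per w class
plt.bar(range(len(counts)), list(counts.values()))
plt.xticks(range(len(counts)), list(counts.keys()), rotation=45)
plt.ylabel('number of examples')
plt.title('Examples per w class (training set)')
plt.tight_layout()
plt.show()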



Data statistics[edit]

To analyze the data, I've analyzed two lines of each image and calculated the centroids of the line in the corresponding rows (row 250 and row 360). The x-axis of the graph represents the centroid of one of these rows and the y-axis represents the centroid of the other. In the following image we can see the representation of this statistic for the training set (new dataset) (red circles) against the driving data (blue crosses).
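
A rough sketch of how the centroid of a row can be computed (hypothetical helper, not the project code): the red-line pixels of the chosen row are segmented in HSV and their mean column is taken as the centroid; the colour thresholds are illustrative.

import cv2
import numpy as np

def row_centroid(image_bgr, row):
    # Mean column index of the red-line pixels in the given row (None if none found).
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    line = hsv[row, :, :]
    mask = ((line[:, 0] < 10) | (line[:, 0] > 170)) & (line[:, 1] > 100)   # red hue wraps around 0
    cols = np.where(mask)[0]
    return cols.mean() if len(cols) else None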

Effect of image colourspace on CNN[edit]

I've read the paper "Effect Of Image Colourspace On Performance Of Convolution Neural Networks" (https://ieeexplore.ieee.org/document/8256949). Generally, in CNNs the processing of images is done in the RGB colourspace even though many other colourspaces are available. In this paper they try to understand the effect of the image colourspace on the performance of CNN models in recognizing the objects present in the image. They evaluate this on the CIFAR10 dataset by converting all the original RGB images into four other colourspaces: HLS, HSV, LUV and YUV. To compare results they trained AlexNet (5 convolution layers and 3 fully connected layers) with a fixed set of parameters on all five colourspaces, including RGB. They observed that the LUV colourspace is the best alternative to RGB to use with CNN models, with almost equal performance on the CIFAR10 test set, while the YUV colourspace is the worst to use with CNN models.

They've used the CIFAR10 dataset, which has 60000 RGB images (32x32x3) in 10 classes, where 50000 images are for training and 10000 for testing. They converted the entire dataset of 60000 RGB images into four different colourspaces: HLS, HSV, LUV and YUV. In each dataset, the training set of 50000 images is further split into two parts, where 80% of the images are used for training and 20% for validation.

The architecture of AlexNet (5 convolution layers and 3 fully connected layers) that was used for training is: Convolution + MaxPooling + Batch Normalization, Convolution + MaxPooling + Batch Normalization, Convolution, Convolution, Convolution + MaxPooling, Fully Connected, Fully Connected, Fully Connected. They train AlexNet using the datasets separately. The training procedure for each dataset involves preprocessing, hyper-parameter selection and training.

They observed that the LUV colourspace is a very good alternative to the RGB colourspace when training a CNN model to recognize the objects in an image, and that the YUV colourspace is the worst among the others. The RGB colourspace gives a test accuracy of 67.84%, whereas the LUV colourspace gives a test accuracy of 61%. The test accuracy of the YUV colourspace is 25% less than RGB and 19% less than LUV. RGB and LUV give correct predictions with high confidence scores, but with colourspaces like YUV, HLS and HSV either the prediction is wrong or the confidence of the prediction is very low.

Week 30: Driving videos, Classification network, LSTM-Tinypilotnet[edit]

Driving videos[edit]

I've used the predictions of the classification network according to w (7 classes) and constant v to drive a formula 1:

Problems with Ubuntu and Gazebo[edit]

This week I had problems with Ubuntu; I had to reinstall it several times, and after reinstalling Gazebo the worlds look different. For this reason, the red line no longer has stripes and looks darker. Now the trained models don't work correctly for these reasons. I have tried to put several lights in the world and it seems that the line looks clearer, but since it is no longer so similar to the training data the car crashes. Probably the BGR color space is not the best to train with; HSV may be better, because it is more invariant to light changes.

  • BGR:


  • HSV:


Driving analysis[edit]

I've relabelled the images from https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/Failed_driving/4v_7w (classification network 4v+7w) according to my own criteria, and I've compared the relabelled data with the driving data. An 82% accuracy is obtained for w and 77% for v.

I've relabelled the images from https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/Failed_driving/7w (classification network 7w) according to my own criteria, and I've compared the relabelled data with the driving data. A 64% accuracy is obtained for w.


LSTM-Tinypilotnet[edit]

I've trained a new model: LSTM-Tinypilotnet:

Classification network[edit]

I've trained the classification network for w on imbalanced data. The result can be seen in the driving videos.

Week 29: Driving videos, Dataset coherence study, Driving analysis, New gui[edit]

Driving videos[edit]

I've used the predictions of the classification network according to w (7 classes) and constant v to drive a formula 1:


I've used the predictions of the classification network according to w (7 classes) and v (4 classes) to drive a formula 1:


I've used the predictions of the regression network for w and constant v to drive a formula 1:


I've used the predictions of the regression network for w and v to drive a formula 1:

Dataset coherence study[edit]

To analyze the data, I've analyzed two lines of each image and calculated the centroids of the line in the corresponding rows (row 250 and row 360). The x-axis of the graph represents the centroid of one of these rows and the y-axis represents the centroid of the other. In the following images we can see the representation of this statistic of the training set (new dataset) for w and for v.

In the next image we see how the points are divided by colors according to their class of w (7 classes). Class "radically_left" is represented in red, class "moderately_left" is represented in blue, class "slightly_left" is represented in green, class "slight" is represented in cyan, "slightly_right" is represented in purple, "moderately_right" is represented in yellow, and "radically_right" is represented in black.


In the next image we see how the points are divided by colors according to their class of v (4 classes). Class "slow" is represented in red, class "moderate" is represented in blue, class "fast" is represented in green, and "very_fast" is represented in purple.


Driving analysis[edit]

I've relabelled the images from https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/Failed_driving according to my own criteria, and I've compared the relabelled data with the driving data. A 72% accuracy is obtained.

New gui[edit]

I've modified the gui by adding LEDs. Under the image of the car camera there are 7 LEDs that correspond to the 7 classes of w; the LED that corresponds to the class predicted by the network lights up. To the right of the image there are 4 LEDs that correspond to the 4 classes of v.

Week 28: Classification Network, Regression Network, Reading information[edit]

This week, I've retrained the models with the new dataset. This new dataset is divided into 11275 pairs of images and speed data for training, and 4833 pairs of images and data for testing.

Driving videos[edit]

I've used the predictions of the classification network according to w (7 classes) and v (4 classes) to drive a formula 1:


I've used the predictions of the regression network (223 epochs for v and 212 for w) to drive a formula 1:



Data statistics[edit]

To analyze the data, a new statistic was created (analysis_vectors.py). I've analyzed two lines of each image and calculated the centroids of the line in the corresponding rows (row 250 and row 360). The x-axis of the graph represents the centroid of one of these rows and the y-axis represents the centroid of the other. In the following image we can see the representation of this statistic for the training set (new dataset) (red circles) against the driving data (blue crosses).


Reading information[edit]

I've read some information about self-driving. I've read about different architectures:


Classification network for w[edit]

I've retrained the classification network for w with the new dataset. The test results are:

Classification network for v[edit]

I've retrained the classification network for v with the new dataset. The test results are:

Week 27: Data's Analysis, New Dataset[edit]

Number of data for each class[edit]

At https://jderobot.org/Vmartinezf-tfm#Follow_line_with_classification_network_and_with_regression_network we saw that the car was not able to complete the entire circuit with the classification network of 7 classes and constant v. For this reason, we want to evaluate our dataset and see if it is representative. For this we've saved the images that the F1 sees during the driving with the neural network and some data. This data can be found at https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/Failed_driving.

I've created a script (evaluate_class.py) that shows a graph of the number of examples that exist for each class (7 classes of w). In the following images we can see first the graph for the training data and then the graph for the driving data.


Data statistics[edit]

To analyze the data, a new statistic was created (analysis_vectors.py). I've analyzed two lines of each image and calculated the centroids of the line in the corresponding rows (row 250 and row 360). The x-axis of the graph represents the centroid of one of these rows and the y-axis represents the centroid of the other. In the following image we can see the representation of this statistic for the training set (red circles) against the driving data (blue crosses).



Entropy[edit]

I've used entropy as a measure of similarity. The Shannon entropy is a measure of the information in a set; it is defined as the expected value of the information. I've followed a Python example from the book "Machine Learning In Action" (http://www2.ift.ulaval.ca/~chaib/IFT-4102-7025/public_html/Fichiers/Machine_Learning_in_Action.pdf). This book uses the following function to calculate the Shannon entropy of a dataset:

from math import log

def calculate_shannon_entropy(dataset):
    numEntries = len(dataset)
    labelCounts = {}
    for featVec in dataset:  # count the unique labels and their occurrences
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    print(labelCounts)
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt

First, you calculate a count of the number of instances in the dataset. This could have been calculated inline, but it’s used multiple times in the code, so an explicit variable is created for it. Next, you create a dictionary whose keys are the values in the final column. If a key was not encountered previously, one is created. For each key, you keep track of how many times this label occurs. Finally, you use the frequency of all the different labels to calculate the probability of that label. This probability is used to calculate the Shannon entropy, and you sum this up for all the labels.

In this example the dataset is:

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

In my case, I want to measure the entropy of the training set and the entropy of the driving set. I haven't used the labels; instead, I've used the centroids of 3 rows of each image. For this reason, I've modified the function "calculate_shannon_entropy":

from math import log

def calculate_shannon_entropy(dataset):
    numEntries = len(dataset)
    labels = []
    counts = []
    for featVec in dataset:  # count the unique feature vectors and their occurrences
        found = False
        for i in range(0, len(labels)):
            if featVec == labels[i]:
                found = True
                counts[i] += 1
                break
        if not found:
            labels.append(featVec)
            counts.append(1)
    shannonEnt = 0.0
    for num in counts:
        prob = float(num)/numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt

My dataset is like:

[[32, 445, 34], [43, 12, 545], [89, 67, 234]]


The entropy's results are:

Shannon entropy of driving: 0.00711579828413
Shannon entropy of dataset: 0.00336038443482

SSIM and MSE[edit]

In addition, to verify the difference between the driving data and the training data, I've used the SSIM and MSE measurements. The MSE value is obtained, although it isn't a very representative value of the similarity between images; structural similarity aims to address this shortcoming by taking texture into account.

The Structural Similarity (SSIM) index is a method for measuring the similarity between two images. The SSIM index can be viewed as a quality measure of one of the images being compared, provided the other image is regarded as of perfect quality.

I've analyzed the image that the car saw just before leaving the road and compared it with the whole training set. For this I've calculated the average SSIM, as well as the minimum and maximum SSIM. The minimum SSIM occurs when we compare our image with a very different one, and the maximum SSIM when we compare it with the closest image in the training set. Next, the case of the minimum SSIM is shown on the left side and the case of the maximum SSIM on the right side. For each case, the corresponding SSIM, the MSE, the average SSIM, the average MSE, the image, the image of the dataset with which the given SSIM corresponds, and the SSIM image are shown.
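
A minimal sketch of how these SSIM/MSE comparisons can be computed with scikit-image (the import path depends on the scikit-image version; older releases expose compare_ssim in skimage.measure, and the images are assumed to be same-sized RGB arrays):

import numpy as np
from skimage.metrics import structural_similarity

def mse(img_a, img_b):
    return float(np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2))

def compare_with_dataset(query, dataset_images):
    # Average, minimum and maximum SSIM of the query image against every dataset image.
    scores = [structural_similarity(query, img, multichannel=True) for img in dataset_images]
    return np.mean(scores), np.min(scores), np.max(scores)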



In addition, the same images are provided for a piloting image of each class of w.


  • 'radically_left':



  • 'moderately_left':



  • 'slightly_left':



  • 'slight':



  • 'slightly_right':



  • 'moderately_right':



  • 'radically_right':


New Dataset[edit]

I've based the new dataset on the code created for the follow-line practice of JdeRobot Academy (http://vanessavisionrobotica.blogspot.com/2018/05/practica-1-follow-line-prueba-2.html). This new dataset has been generated using 3 circuits so that the data is more varied: the worlds monacoLine.world, f1.launch and f1-chrono.launch have been used.

Week 26: Follow line with classification network, Studying Tensorboard, Classification network for v, Regression network for w and v[edit]

Follow line with classification network and with regression network[edit]

I've used the predictions of the classification network according to w (7 classes) to pilot a formula 1. Depending on the class of w, a different angle of rotation is given to the vehicle while the linear speed remains constant. With this network the car completes part of the circuit, but it crashes when exiting a curve. Below, you can see an example:

I've used the predictions of the classification network according to w (7 classes) and v (4 classes) to pilot a formula 1. Depending on the class of w, a different angle of rotation is given to the vehicle, and depending on the class of v, a different linear speed is given. With this network the car completes part of the circuit, but it crashes when exiting a curve. Below, you can see an example:

I've used the predictions of the regression network to drive a formula 1 (223 epochs for v and 212 epochs for w):

Studying Tensorboard[edit]

Tensorboard (https://www.tensorflow.org/guide/summaries_and_tensorboard, https://github.com/tensorflow/tensorboard) is a suite of visualization tools that makes it easier to understand, debug, and optimize TensorFlow programs. You can use TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about the execution of your graph, and show additional data like images that pass through it. Tensorboard can also be used with Keras. I've followed some tutorials: https://www.datacamp.com/community/tutorials/tensorboard-tutorial, http://fizzylogic.nl/2017/05/08/monitor-progress-of-your-keras-based-neural-network-using-tensorboard/, https://keras.io/callbacks/.

Tensorboard is a separate tool you need to install on your computer. You can install it using pip, the Python package manager:

pip install tensorboard

To use Tensorboard you have to modify the Keras code a bit. You need to create a new TensorBoard instance and point it to a log directory where data should be collected. Next you need to modify the fit call so that it includes the tensorboard callback.

from time import time
from keras.callbacks import TensorBoard

tensorboard = TensorBoard(log_dir="logs/{}".format(time()))
model.fit(X_train, y_train, epochs=nb_epochs, batch_size=batch_size, callbacks=[tensorboard])

We can pass different arguments to the callback:

  • log_dir: the path of the directory where to save the log files to be parsed by TensorBoard.
  • histogram_freq: frequency (in epochs) at which to compute activation and weight histograms for the layers of the model. If set to 0, histograms won't be computed. Validation data (or split) must be specified for histogram visualizations.
  • write_graph: whether to visualize the graph in TensorBoard. The log file can become quite large when write_graph is set to True.
  • write_grads: whether to visualize gradient histograms in TensorBoard. histogram_freq must be greater than 0.
  • batch_size: size of batch of inputs to feed to the network for histograms computation.
  • write_images: whether to write model weights to visualize as image in TensorBoard.
  • embeddings_freq: frequency (in epochs) at which selected embedding layers will be saved. If set to 0, embeddings won't be computed. Data to be visualized in TensorBoard's Embedding tab must be passed as embeddings_data.
  • embeddings_layer_names: a list of names of layers to keep eye on. If None or empty list all the embedding layer will be watched.
  • embeddings_metadata: a dictionary which maps layer name to a file name in which metadata for this embedding layer is saved. In case if the same metadata file is used for all embedding layers, string can be passed.
  • embeddings_data: data to be embedded at layers specified in embeddings_layer_names. Numpy array (if the model has a single input) or list of Numpy arrays (if the model has multiple inputs).

The callback raises a ValueError if histogram_freq is set and no validation data is provided. The TensorBoard callback works while eager execution is enabled; however, outputting histogram summaries of weights and gradients is not supported, so histogram_freq will be ignored.

To run TensorBoard, use the following command:

tensorboard --logdir=path/to/log-directory

where logdir points to the directory where the FileWriter serialized its data. If this logdir directory contains subdirectories which contain serialized data from separate runs, then TensorBoard will visualize the data from all of those runs. For example, in our case we use:

tensorboard --logdir=logs/

Once TensorBoard is running, navigate your web browser to localhost:6006 to view the TensorBoard. When looking at TensorBoard, you will see the navigation tabs in the top right corner. Each tab represents a set of serialized data that can be visualized.

Tensorboard has different views which take inputs of different formats and display them differently. You can change them on the orange top bar. Different views of Tensorboard are:

  • Scalars: Visualize scalar values, such as classification accuracy.
  • Graph: Visualize the computational graph of your model, such as the neural network model.
  • Distributions: Visualize how data changes over time, such as the weights of a neural network.
  • Histograms: A fancier view of the distribution that shows distributions in a 3-dimensional perspective.
  • Projector: can be used to visualize word embeddings (numerical representations of words that capture their semantic relationships).
  • Image: Visualizing image data.
  • Audio: Visualizing audio data.
  • Text: Visualizing text (string) data.

Looking at the training graphs we can check when our model starts overfitting. For example, in a training run of a classification model (4 classes) with a batch_size of 32 and 40 epochs, we can see the point where training stops being useful: the validation graphs show that from epoch 23 onward the model no longer improves.

Classification network for v[edit]

The files data.json, train.json and test.json have been modified to add a new classification that divides the linear speed into 4 classes. The classes are the following:

slow: if the linear speed is v <= 7.
moderate: if the linear speed is v > 7 and v <= 9.
fast: if the linear speed is v > 9 and v <= 11.
very_fast: if the linear speed is v > 11.

I've trained a model with the 4 classes mentioned above. The CNN architecture I am using is SmallerVGGNet, a simplified version of VGGNet. After training the network, we save the model (models/model_smaller_vgg_4classes_v.h5) and show the graphs of loss and accuracy for training and validation according to the epochs. For that, I've used Tensorboard:

In addition, I evaluate the accuracy, precision, recall and F1-score on the test set, and plot the confusion matrix. The results are the following:


Regression network for w and v[edit]

I've trained two regression networks (one for v and one for w) following the Pilotnet architecture. To get an idea of the efficiency of the training I've used Tensorboard. I've trained both networks for 1000 epochs to see how they behave. The results can be seen below, where the red curve represents the model for v and the blue curve represents the model for w.


  • Accuracy:

  • Loss:

  • Mean squared error:

  • Mean absolute error:

Week 25: Correction of the binary classification model, correction of driver node, accuracy top2, Pilotnet network[edit]

Correction of the binary classification model[edit]

I modified the binary classification model models/model_classification.h5, which had an error. That model has been removed and the corrected one is now called model_binary_classification.h5. After training the network, we save the model (models/model_binary_classification.h5) and evaluate it with the validation set:

In addition, we evaluate the accuracy, precision, recall and F1-score, and plot the confusion matrix. The results are the following:


Correction of driver node[edit]

The driver node had an error that caused the Formula 1 to stop sometimes. This error was due to the teleoperator interfering with the speed commands sent to the vehicle. It has been corrected and the F1 can now be driven correctly.


Accuracy top 2 of multiclass classification network[edit]

To get an idea of the results of the multiclass classification network that we trained previously, we have computed a top-2 accuracy measure. For this metric, a prediction is counted as correct if it is equal to the true label or to an adjacent label, i.e., we allow an error of at most one class. In this way we obtain a 100% accuracy.
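
A small sketch of this top-2 (adjacent-class) accuracy (hypothetical helper, with the 7 w classes ordered from left to right):

def adjacent_accuracy(y_true, y_pred, classes):
    # A prediction counts as correct if it matches the true class or an adjacent one.
    index = {c: i for i, c in enumerate(classes)}
    hits = sum(abs(index[t] - index[p]) <= 1 for t, p in zip(y_true, y_pred))
    return hits / float(len(y_true))

classes_w = ['radically_left', 'moderately_left', 'slightly_left', 'slight',
             'slightly_right', 'moderately_right', 'radically_right']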

Pilotnet network[edit]

This week, one of the objectives was to train a regression network to learn the linear and angular speed values of the car. For this we've followed the article "Explaining How a Deep Neural Network Trained with End-to-End Learning Steers a Car" (https://arxiv.org/pdf/1704.07911.pdf) and created a network with the PilotNet architecture. NVIDIA has created a neural-network-based system, known as PilotNet, which outputs steering angles given images of the road ahead. PilotNet is trained using road images paired with the steering angles generated by a human driving a data-collection car. In our case we have pairs of images and (v, w). The PilotNet architecture is as follows:


The Pilotnet model can be seen below:

from keras.models import Sequential
from keras.layers import BatchNormalization, Conv2D, Flatten, Dense

# img_shape is the input image shape, e.g. (height, width, 3)
model = Sequential()
# Normalization
model.add(BatchNormalization(epsilon=0.001, axis=-1, input_shape=img_shape))
# Convolutional layers
model.add(Conv2D(24, (5, 5), strides=(2, 2), activation="relu"))
model.add(Conv2D(36, (5, 5), strides=(2, 2), activation="relu"))
model.add(Conv2D(48, (5, 5), strides=(2, 2), activation="relu"))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation="relu"))
model.add(Conv2D(64, (3, 3), strides=(1, 1), activation="relu"))
# Flatten
model.add(Flatten())
# Fully-connected layers
model.add(Dense(1164, activation="relu"))
model.add(Dense(100, activation="relu"))
model.add(Dense(50, activation="relu"))
model.add(Dense(10, activation="relu"))
# Output: vehicle control (a single value, v or w)
model.add(Dense(1))
model.compile(optimizer="adam", loss="mse", metrics=['accuracy'])


In order to learn the angular speed (w) and the linear speed (v) we train two networks, one for each value. The model for v is called model_pilotnet_v.h5 (https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/dl-driver/Net/Keras/models) and the model for w is called model_pilotnet_w.h5.

After training the networks, we save the models (models/model_pilotnet_v.h5, models/model_pilotnet_w.h5) and evaluate (loss, accuracy, mean squared error and mean absolute error) the models with the validation set:

In addition, we evaluate the networks with the test set:

The evaluation of w had a problem. This problem has been fixed and the result is:

Evaluation w:
('Test loss:', 0.0021951024647809555)
('Test accuracy:', 0.0)
('Test mean squared error: ', 0.0021951024647809555)
('Test mean absolute error: ', 0.0299175349963251)

Week 24: Adding new class, Classification network[edit]

Adding new class[edit]

The files data.json, train.json and test.json have been modified to add a new classification that divides the angles of rotation into 7 classes. The classes are the following:

radically_right: if the rotation's angle is w <= -1.
moderately_right: if the rotation's angle is -1 < w <= -0.5.
slightly_right: if the rotation's angle is -0.5 < w <= -0.1.
slight: if the rotation's angle is -0.1 < w < 0.1.
slightly_left: if the rotation's angle is 0.1 <= w < 0.5.
moderately_left: if the rotation's angle is 0.5 <= w < 1.
radically_left: if the rotation's angle is w >= 1.


Multiclass classification network[edit]

I've followed this blog https://www.pyimagesearch.com/2018/05/07/multi-label-classification-with-keras/ as an example to build the classification network. In this case I have trained a model with the 7 classes mentioned above.

The CNN architecture I am using is SmallerVGGNet, a simplified version of VGGNet. The VGGNet model was first introduced by Simonyan and Zisserman in their 2014 paper, Very Deep Convolutional Networks for Large Scale Image Recognition (https://arxiv.org/pdf/1409.1556/). In this case we save an image of the model with plot_model to see the architecture of the network. The model (classification_model.py) is as follows:

After training the network, we save the model (models/model_smaller_vgg_7classes_w.h5) and evaluate the model with the validation set:

We also show the graphs of loss and accuracy for training and validation according to the epochs:


In addition, we evaluate the accuracy, precision, recall and F1-score, and plot the confusion matrix. The results are the following:


Week 23: Improving driver node, classification network, and driver test[edit]

Driver node[edit]

The driver node has been modified to make one inference per cycle. To do this, the threadNetwork.py and classification_network.py files have been created. threadNetwork allows making a prediction per cycle by calling the predict method of the ClassificationNetwork class (classification_network.py).

Classification network[edit]

A file (add_classification_data.py) has been created to modify the data.json file and add the left/right classification. If w is positive then the classification will be left, while if w is negative the classification will be right.

Once the dataset is complete, a file (https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/blob/master/Follow%20Line/dl-driver/Net/split_train_test.py) has been created to divide the data into train and test. It has been decided to split the dataset into 70% for train and 30% for test. Since the dataset had 5006 pairs of values, we now have 3504 pairs of train values and 1502 pairs of test values. The train data is in https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/dl-driver/Net/Dataset/Train, and the test data is in https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/dl-driver/Net/Dataset/Test.

The classification_train.py file has been created, which allows training a classification network. This classification network aims to differentiate between left and right. In this file, before training we eliminate the pairs of values where the angle is close to 0 (with a margin of 0.08), because they are not very significant data. In addition, we split the train set into 80% for training and 20% for validation.
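
An illustrative sketch of these two steps (assumed data layout with a 'w' field; the real code is in classification_train.py): samples with |w| below the 0.08 margin are dropped and the remainder is split 80/20.

import random

def filter_and_split(samples, margin=0.08, val_fraction=0.2, seed=0):
    kept = [s for s in samples if abs(s['w']) > margin]   # drop near-zero angles
    random.Random(seed).shuffle(kept)
    cut = int(len(kept) * (1.0 - val_fraction))
    return kept[:cut], kept[cut:]                         # 80% train, 20% validation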

In our case we use a very small convnet with few layers and few filters per layer, alongside dropout. Dropout also helps reduce overfitting by preventing a layer from seeing the exact same pattern twice. Our model has a simple stack of 3 convolution layers with ReLU activations, each followed by a max-pooling layer. This is very similar to the architectures that Yann LeCun advocated in the 1990s for image classification (with the exception of ReLU). On top of it we add two fully-connected layers. We end the model with a single unit and a sigmoid activation, which is suitable for binary classification, and we use the binary_crossentropy loss to train the model. The model (classification_model.py) is as follows:

from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense, Dropout

# input_shape is the image shape, e.g. (height, width, 3)
model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# The model so far outputs 3D feature maps (height, width, features)

model.add(Flatten())  # this converts our 3D feature maps to 1D feature vectors
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])


After training the network, we save the model (models/model_classification.h5) and evaluate the model with the validation set:



We also show the graphs of loss and accuracy for training and validation according to the epochs:



In addition, the classification_test.py file has been created to evaluate the model on a data set that has not been seen by the network. In this file the test set is used; accuracy, precision, recall and F1-score are evaluated and the confusion matrix is plotted. The results are the following:
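
A possible way to compute these metrics with scikit-learn is sketched below (not necessarily identical to classification_test.py):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def evaluate_predictions(y_true, y_pred):
    print('Accuracy :', accuracy_score(y_true, y_pred))
    print('Precision:', precision_score(y_true, y_pred, average='weighted'))
    print('Recall   :', recall_score(y_true, y_pred, average='weighted'))
    print('F1-score :', f1_score(y_true, y_pred, average='weighted'))
    print('Confusion matrix:')
    print(confusion_matrix(y_true, y_pred))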


Driver test[edit]

I've tried to go around the circuit with the formula 1 based on the predictions of left or right. To do this, a prediction is made in each iteration that tells us right or left. If the prediction is right we give a negative angular speed, and if it is left, a positive one. At all times we keep a constant linear speed. The result is not good, as the car crashes at a curve.

Week 22: Improving driver node[edit]

Driver node[edit]

This week, I've improved the driver node. Now you can see the linear and angular speeds in the gui, and you can save or remove the data (Dataset).

Week 21: Dataset generator and driver node[edit]

Dataset generator[edit]

The final goal of the project is to do line following using Deep Learning. For this it is necessary to collect data. For this reason, I have based the dataset generator on the code created for the follow-line practice of JdeRobot Academy (http://vanessavisionrobotica.blogspot.com/2018/05/practica-1-follow-line-prueba-2.html). The created dataset (https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Follow%20Line/Dataset) contains the input images with the corresponding linear and angular speeds (in a json file). To create this dataset I have written a Python file with the functions needed to generate it (https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/blob/master/Follow%20Line/generator.py).


Driver node[edit]

In addition, I've created a driver node based on the objectdetector node, which allows connecting neural networks. For now the initial gui looks like this:

Weeks 19,20: First steps with Follow Line[edit]

Reading some information about autonomous driving[edit]

These weeks, I've read the article "Self-Driving a Car in simulation through a CNN" and the TFG (https://ebuah.uah.es/dspace/handle/10017/33946) on which this article was based. This work researches Convolutional Neural Networks and their architecture, to develop a new CNN with the capacity to control the lateral and longitudinal movements of a vehicle in an open-source driving simulator (Udacity's Self-Driving Car Simulator), replicating human driving behavior.

Training data is collected by driving the simulator car manually (using a Logitech Force Pro steering wheel), obtaining images from a front-facing camera and synchronizing the steering angle and throttle values performed by the driver. The dataset is augmented by horizontal flipping (changing the sign of the steering angles) and by taking information from the left and right cameras of the car. The simulator provides a file (drive.py) in charge of establishing the communication between the vehicle and the neural network, so that the network receives the image captured by the camera and returns an acceleration value and a rotation angle.

The CNNs tested in this project are described, trained and tested using Keras. The CNNs tested are:

  • TinyPilotNet: developed as a reduction of the NVIDIA PilotNet CNN used for self-driving a car. The TinyPilotNet network is composed of a 16x32x1 pixel input image (a single channel formed by the saturation channel of the HSV color space), followed by two convolutional layers, a dropout layer and a flatten layer. The output of this architecture is formed by two fully connected layers that lead into a couple of neurons, each one of them dedicated to predicting the steering and throttle values respectively.

  • DeeperLSTM-TinyPilotNet: formed by more layers and a higher input resolution.
  • DeepestLSTM-TinyPilotNet: formed by three 3x3 kernel convolutional layers, combined with maxpooling layers, followed by three 5x5 convolutional LSTM layers and two fully connected layers.
  • Long Short-Term Memory (LSTM) layers are added to the TinyPilotNet architecture with the aim of improving the CNN's driving performance by predicting new values influenced by previous ones, and not just by the current input image. These layers are located at the end of the network, just before the fully connected layers. During training, the dataset is used sequentially, not shuffled.
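
A minimal Keras sketch of the TinyPilotNet idea described above (the filter counts, kernel sizes and the size of the first fully connected layer are assumptions, since the exact hyperparameters are not given here):

# Sketch of a TinyPilotNet-like model; filter counts and kernel sizes are assumed values
from keras.models import Sequential
from keras.layers import Conv2D, Dropout, Flatten, Dense

model = Sequential()
# Input: 16x32 single-channel image (saturation channel of the HSV color space)
model.add(Conv2D(8, (3, 3), activation='relu', input_shape=(16, 32, 1)))
model.add(Conv2D(16, (3, 3), activation='relu'))
model.add(Dropout(0.3))
model.add(Flatten())
# Two fully connected layers leading into two output neurons: steering and throttle
model.add(Dense(50, activation='relu'))
model.add(Dense(2))
model.compile(optimizer='adam', loss='mse')
model.summary()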

Training data is an essential part of achieving correct performance from the convolutional neural network. However, collecting a large amount of data is a major difficulty. Data augmentation allows the training information to be modified or increased from a previously obtained data bank, so that the CNN can more easily learn what to extract in order to obtain the expected results. The training data treatments studied are:

  • RGB image: the input image of the network, which previously used only the saturation channel of the HSV color space, is replaced by a 3-channel RGB image. To implement this in a CNN it is only necessary to modify the first layer of the network.
  • Increase in resolution: implies a modification of the input layer's dimensions.
  • Image cropping: consists of extracting the specific area of the image where the information relevant to the CNN is considered to be concentrated. The image the CNN analyzes then only contains information about the road, eliminating the landscape part of the frame.
  • Edge detector filter: consists of extracting the edges of the input image and highlighting them on the original image. This is achieved with a Canny filter.

In order to compare the performance of the CNN with other networks, a frame-to-frame comparison is made between the CNN steering angle and throttle values and the human values taken as ground truth. The root mean square error (RMSE) metric is used, obtained from the difference between the steering and throttle values predicted by the CNN and those given by the human. Driving data collected by a human driver is therefore needed to compare, for each frame, the values given by the CNN with the values used by the human driver. This parameter does not evaluate the ability of the network to use previous steering and throttle values to predict the new ones, so the LSTM layers have no effect and appear underrated in comparison with CNNs that do not use this kind of layer.

To solve this problem, new quality parameters have been proposed to quantify the performance of the network while driving the vehicle. These parameters are measured with the information extracted from the simulator while the network is being tested. To calculate them, center points of the road, named waypoints and separated by 2 meters, are needed. These new metrics are:

  • Center of road deviation: the shifting from the center of the road. The lower this parameter is, the better the performance, because the car drives in the center of the road instead of near its limits. To calculate the deviation, the nearest waypoint is found and the distance is computed between the vehicle and the segment bounded by that waypoint and the previous or next one (a small sketch of this computation is given after this list).
  • Heading angle: the lower this parameter, the better the performance, because a lower heading angle means smoother driving and a better knowledge of the direction to follow.
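
The center-of-road deviation can be computed as the distance from the vehicle position to the segment between the nearest waypoint and its neighbour. A minimal sketch of that point-to-segment distance (names and the 2D simplification are illustrative):

# Point-to-segment distance used for the center-of-road deviation (illustrative)
import numpy as np

def road_deviation(vehicle, wp_a, wp_b):
    """Distance from the vehicle position to the segment bounded by two waypoints."""
    p, a, b = (np.asarray(x, dtype=float) for x in (vehicle, wp_a, wp_b))
    ab = b - a
    # Projection of the vehicle onto the segment, clamped to its endpoints
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

# Example: vehicle at (1.0, 0.5), two waypoints 2 m apart on the x axis -> deviation 0.5
print(road_deviation((1.0, 0.5), (0.0, 0.0), (2.0, 0.0)))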

To train the networks, data is collected by driving manually in the simulator, traveling several laps around the circuit to obtain a large volume of information. This information contains images recorded at 14 FPS by a central camera and two lateral cameras located on both sides of the vehicle, together with the steering wheel angle, acceleration, brake and absolute speed linked to each image. All CNNs are trained using the left, center and right images of the vehicle, applying a 5° offset to the steering angle for the lateral cameras. In addition, the training dataset is augmented by horizontally flipping (mirroring) the images, inverting the steering wheel angle but keeping the acceleration value.
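
A minimal sketch of the two augmentation steps described above, i.e. the 5° correction for the lateral cameras and the horizontal flip with steering sign inversion (the data layout and the sign convention for the offset are assumptions):

# Illustrative augmentation: lateral-camera offset and horizontal flip (assumed conventions)
import numpy as np

CAMERA_OFFSET = 5.0   # degrees added/subtracted for the left/right cameras (sign is an assumption)

def augment_frame(center_img, left_img, right_img, steering, throttle):
    """Return a list of (image, steering, throttle) samples built from one recorded frame."""
    samples = [
        (center_img, steering, throttle),
        (left_img, steering + CAMERA_OFFSET, throttle),
        (right_img, steering - CAMERA_OFFSET, throttle),
    ]
    # Mirror every image and invert the steering angle; the throttle value is kept
    flipped = [(np.fliplr(img), -angle, thr) for img, angle, thr in samples]
    return samples + flipped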

In order to determine the performance of the CNNs using the previously established metrics, different experiments are carried out:

  • Steering wheel control with a CNN: the effect of these modifications is analyzed on the control of the steering wheel angle, setting an acceleration of 5% that makes the vehicle circulate at its maximum speed on the circuit. The following networks are analyzed: TinyPilotNet (16x32 pixel input image), Higher resolution TinyPilotNet (20x40 pixel input image), HD-TinyPilotNet (64x128 pixel input image), RGB-TinyPilotNet (3-channel input image), LSTM-TinyPilotNet (LSTM layers added at the output of the network), DeeperLSTM-TinyPilotNet (combines the effects of the LSTM layers and the higher resolution, increasing the size of the input image up to 20x40 pixels), Cropping-DeeperLSTM-TinyPilotNet (similar to the DeeperLSTM-TinyPilotNet network above, but with image cropping applied to its input) and Edge-DeeperLSTM-TinyPilotNet (edge detection is applied and fused with the original image by addition, highlighting the edges). It is observed that the only network that improves the RMSE is the higher resolution TinyPilotNet. However, visually in the simulator the driving is much better with the CNNs containing LSTM layers. The use of RGB-TinyPilotNet and HD-TinyPilotNet is discarded, as these networks aren't able to guide the vehicle without leaving the road. Following the criterion of the average error with respect to the center of the lane, the order of the remaining networks from best to worst is: 1. DeeperLSTM-TinyPilotNet, 2. Cropping-DeeperLSTM-TinyPilotNet, 3. TinyPilotNet, 4. LSTM-TinyPilotNet, 5. Higher resolution TinyPilotNet, 6. Edge-DeeperLSTM-TinyPilotNet. Based on the pitch angle, once the RGB network has been discarded, the order from best to worst performance is: 1. Higher resolution TinyPilotNet, 2. DeeperLSTM-TinyPilotNet, 3. Edge-DeeperLSTM-TinyPilotNet, 4. LSTM-TinyPilotNet, 5. Cropping-DeeperLSTM-TinyPilotNet, 6. TinyPilotNet.
  • Control of steering wheel angle and acceleration through disconnected CNNs: the steering wheel angle and the acceleration of the vehicle are controlled simultaneously at each moment. For this, two convolutional neural networks are configured, each specialized in one type of movement. Based on the results obtained in the steering wheel control, the following networks are tested: TinyPilotNet, Higher resolution TinyPilotNet, DeeperLSTM-TinyPilotNet and Edge-DeeperLSTM-TinyPilotNet. The only network that improves the RMSE is the higher resolution TinyPilotNet. Networks that include LSTM layers aren't able to keep the vehicle inside the circuit. The higher resolution TinyPilotNet equals TinyPilotNet with respect to the deviation from the lane center. None of the trained CNNs is capable of improving the pitch factor of the simple network. In conclusion, training separate networks to control the longitudinal and lateral movements is not a reliable method.
  • Control of the steering wheel angle and acceleration through the same CNN: the vehicle is controlled by a single convolutional neural network with two outputs, which hold the angle and acceleration values. For this it is necessary to slightly modify the architecture of the previously used networks, changing the output neuron for a pair of neurons. The networks trained following this method are: TinyPilotNet, Higher resolution TinyPilotNet, DeeperLSTM-TinyPilotNet, Edge-DeeperLSTM-TinyPilotNet, DeepestLSTM-TinyPilotNet and Edge-DeepestLSTM-TinyPilotNet. The only network that improves the RMSE is the higher resolution TinyPilotNet. TinyPilotNet, DeeperLSTM-TinyPilotNet and the networks trained with edge detection aren't able to keep the vehicle on the road. Following the criterion of the average error with respect to the center of the lane, the order of the remaining networks from best to worst is: 1. DeepestLSTM-TinyPilotNet, 2. Higher resolution TinyPilotNet. The only networks that improve the average pitch parameter of TinyPilotNet are the ones that include a greater number of layers. The DeepestLSTM-TinyPilotNet network improves pitch by up to 36%, producing a smoother and more natural driving without applying edge detection to the input. The use of a network with LSTM layers and a greater number of trainable parameters produces more natural driving, with less pitch and a smaller deviation. However, controlling both the steering wheel and the accelerator makes the results slightly worse than those obtained when controlling the steering wheel exclusively.

In conclusion:

  • A slight increase in the resolution of the input image produces notable improvements in both quality factors without a significant increase in the size of the network or the processing time.
  • The inclusion of Long Short-Term Memory (LSTM) layers at the output of a convolutional neural network makes the predictions depend on the values previously produced, which leads to smoother driving.
  • Using an RGB input image instead of only the saturation channel of the HSV color space results in a poorer understanding of the environment by the CNN, which leads to bad driving. When using the saturation channel, the road remains highlighted in black while its surroundings get lighter colors, making the circuit easy to distinguish.
  • The information about acceleration doesn't produce better control of the steering angle.
  • To obtain evaluation metric values similar to those of a CNN that only controls the steering wheel, a CNN that controls both steering wheel and accelerator needs to be deeper.
  • Sharpening the edges of the input image with the Canny filter doesn't produce a significant improvement.
  • Cropping the input image to keep only the road information doesn't improve driving.

Creation of the synthetic dataset[edit]

I have created a synthetic dataset. To do so, I created a background image and a script (https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/blob/master/Follow%20Line/First%20steps/dataset_generator.py) that modifies this background and adds a road line. This code generates a dataset of 200 images with lines at different angles. The angle of the road in each image is saved in a txt file. Below we can see an example of the images.

Background:

Image:
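
A minimal sketch of the generation idea, drawing a line at a random angle on the background and logging the angle to a text file (sizes, colours and file names are assumptions; the real dataset_generator.py may differ):

# Illustrative synthetic line generator (sizes, colours and file names are assumptions)
import math
import random
import cv2

def generate_dataset(background_path, n_images=200, out_dir='.'):
    background = cv2.imread(background_path)
    h, w = background.shape[:2]
    with open('%s/angles.txt' % out_dir, 'w') as f:
        for i in range(n_images):
            img = background.copy()
            angle = random.uniform(-45, 45)   # road angle in degrees
            # Line from the bottom center towards the top, tilted by the chosen angle
            x_top = int(w / 2 + (h / 2) * math.tan(math.radians(angle)))
            cv2.line(img, (w // 2, h - 1), (x_top, h // 2), (255, 255, 255), 10)
            cv2.imwrite('%s/%d.png' % (out_dir, i), img)
            f.write('%d.png %.2f\n' % (i, angle))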


Getting started with Neural Network for regression[edit]

Also, I've started to study neural networks for regression. A regression model allows us to predict a continuous value based on data it already knows. I've followed a tutorial (https://medium.com/@rajatgupta310198/getting-started-with-neural-network-for-regression-and-tensorflow-58ad3bd75223) that builds a neural network regression model on financial data (https://in.finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI&guccounter=1). The code can be found at https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Examples%20Deep%20Learning/Neural%20Network%20Regression/Tensorflow. The code works as follows:

At the beginning, we preprocess the data and keep 60% of it for training and 40% for testing.

We build our neural network model using the following TensorFlow primitives:

  • tf.Variable creates a variable whose value will change during the optimization steps.
  • tf.random_uniform generates random numbers from a uniform distribution with the specified dimensions ([input_dim, number_of_nodes_in_layer]).
  • tf.zeros creates a tensor of zeros with the specified dimensions (a vector of shape (1, number_of_hidden_nodes)).
  • tf.add() adds two tensors.
  • tf.matmul() multiplies two matrices (the weight matrix and the input data matrix).
  • tf.nn.relu() is the activation function applied after the multiplication by the weights and the addition of the biases.
  • tf.placeholder() defines the gateway for feeding data into the graph.
  • tf.reduce_mean() and tf.square() are the mean and square functions.
  • tf.train.GradientDescentOptimizer() is the class for applying gradient descent.
  • GradientDescentOptimizer() has a minimize() method to minimize the target/cost function.

We train the neural network by iterating through each sample in the dataset. Two for loops are used: one for the epochs and another to iterate over each sample. Each completion of the outer loop signifies that an epoch has finished.

  • tf.Session() initiates the current session.
  • sess.run() runs elements of the graph.
  • tf.global_variables_initializer() initializes all variables.
  • tf.train.Saver() is the class that helps us save the model.
  • sess.run([cost,train], feed_dict={xs: X_train[j,:], ys: y_train[j]}) runs the cost and the training step, feeding the neural network one sample at a time.
  • sess.run(output, feed_dict={xs: X_train}) runs the neural network, feeding it only the input features (without labels) to obtain predictions (a condensed, runnable version of this walkthrough is sketched below).
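
A condensed, runnable sketch of the model and training loop described in the two lists above, using the same TF 1.x primitives (the layer sizes and the synthetic data are assumptions, not the tutorial's exact values):

# Condensed TF 1.x regression sketch (layer sizes and synthetic data are assumptions)
import numpy as np
import tensorflow as tf  # TF 1.x API (tf.placeholder, tf.Session, ...)

input_dim, hidden, epochs, lr = 3, 8, 50, 0.01

# Synthetic data standing in for the financial dataset: y is a noisy sum of the features
X_train = np.random.rand(200, input_dim).astype(np.float32)
y_train = X_train.sum(axis=1, keepdims=True) + 0.01 * np.random.randn(200, 1).astype(np.float32)

xs = tf.placeholder(tf.float32, [None, input_dim])   # gateway for the input features
ys = tf.placeholder(tf.float32, [None, 1])           # gateway for the targets

W1 = tf.Variable(tf.random_uniform([input_dim, hidden]))   # hidden layer weights
b1 = tf.Variable(tf.zeros([1, hidden]))
h1 = tf.nn.relu(tf.add(tf.matmul(xs, W1), b1))

W2 = tf.Variable(tf.random_uniform([hidden, 1]))           # output layer weights
b2 = tf.Variable(tf.zeros([1, 1]))
output = tf.add(tf.matmul(h1, W2), b2)

cost = tf.reduce_mean(tf.square(output - ys))              # mean squared error
train = tf.train.GradientDescentOptimizer(lr).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(epochs):                 # outer loop: epochs
        for j in range(X_train.shape[0]):       # inner loop: one sample at a time
            sess.run([cost, train],
                     feed_dict={xs: X_train[j:j + 1], ys: y_train[j:j + 1]})
    print('final training cost:', sess.run(cost, feed_dict={xs: X_train, ys: y_train}))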

So finally we have completed our neural network in TensorFlow for predicting the stock market price. The result can be seen in the following graph:

Weeks 17,18: Understanding LSTM[edit]

Understanding LSTM[edit]

Sometimes it is necessary to use previous information to process the current information. Traditional neural networks cannot do this, which seems like an important shortcoming. Recurrent neural networks address this problem. They are networks with loops that allow information to persist. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning... Essential to these successes is the use of "LSTMs", a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version.

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task. Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large. Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are absolutely capable of handling such “long-term dependencies”. In practice, RNNs don’t seem to be able to learn them. LSTMs don’t have this problem.

LSTMs (Long Short-Term Memory networks) (http://www.bioinf.jku.at/publications/older/2604.pdf, https://colah.github.io/posts/2015-08-Understanding-LSTMs/) are a type of RNN (Recurrent Neural Network) architecture that addresses the vanishing/exploding gradient problem and allows learning of long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997). They work very well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior. All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a very simple structure. LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.

The main idea is a memory cell, a block that can maintain its state over time. The key to LSTMs is the cell state (the horizontal line running through the top of the diagram). The cell state runs straight down the entire chain, with only some minor linear interactions. It's very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through”. An LSTM has three of these gates, to protect and control the cell state.


The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer". The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

It’s now time to update the old cell state, Ct−1, into the new cell state Ct. Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
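
Written out, these steps correspond to the standard LSTM formulation (in the notation of the post linked above, with \sigma the sigmoid function, \odot element-wise multiplication and [h_{t-1}, x_t] the concatenation of the previous output and the current input):

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)            (forget gate)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)            (input gate)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)     (candidate values)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t         (cell state update)
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)            (output gate)
h_t = o_t \odot \tanh(C_t)                              (new output)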


First example of LSTM[edit]

To implement a first simple example of this kind of network, I followed a tutorial (https://www.knowledgemapper.com/knowmap/knowbook/jasdeepchhabra94@gmail.comUnderstandingLSTMinTensorflow(MNISTdataset)) in which we see how to develop an LSTM network in TensorFlow. In this example, MNIST is used as the dataset. The result of this implementation can be found in https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/blob/master/Examples%20Deep%20Learning/LSTM/Tensorflow/lstm_mnist.py.

Week 16: Follow line dataset[edit]

The final goal of the project is to do line following using Deep Learning. For this it is necessary to collect data, so I have based my code on the follow-line practice of JdeRobot Academy (http://vanessavisionrobotica.blogspot.com/2018/05/practica-1-follow-line-prueba-2.html) in order to create a dataset. The created dataset contains the input images with the corresponding linear and angular speeds. In addition, a dataset has been created that contains single-row images with the corresponding linear and angular speeds. The dataset is in: https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/follow%20line/dataset. In the file speeds.txt, each line contains the linear speed and the angular speed corresponding to the image in the Image folder. The same applies to the single-row images in the Images_640_1 folder.

Week 15: Read papers about Deep Learning for Steering Autonomous Vehicles, CNN with Tensorflow[edit]

This week, I read some papers about Deep Learning for Steering Autonomous Vehicles. Some of these papers are:


  • End-to-End Deep Learning for Steering Autonomous Vehicles Considering Temporal Dependencies (https://arxiv.org/pdf/1710.03804.pdf): In this work, they propose a Convolutional Long Short-Term Memory Recurrent Neural Network (C-LSTM), which is end-to-end trainable, to learn both visual and dynamic temporal dependencies of driving. To train and validate their proposed methods, they used the publicly available Comma.ai dataset. The system they propose is comprised of a front-facing RGB camera and a composite neural network consisting of a CNN and an LSTM network that estimates the steering wheel angle based on the camera input. Camera images are processed frame by frame by the CNN. The CNN is pre-trained on the ImageNet dataset, which features 1.2 million images of approximately 1000 different classes and allows for recognition of a generic set of features and a variety of objects with high precision. Then, they transfer the trained neural network from that broad domain to another specific one focusing on driving scene images. The LSTM then processes a sequence of w fixed-length feature vectors (sliding window) from the CNN. In turn, the LSTM layers learn to recognize temporal dependencies leading to a steering decision Yt based on the inputs from Xt−w to Xt. Small values of w lead to faster reactions, but the network learns only short-term dependencies and the susceptibility to individually misclassified frames increases, whereas large values of w lead to a smoother behavior, and hence more stable steering predictions, but increase the chance of learning wrong long-term dependencies. The sliding window concept allows the network to learn to recognize different steering angles from the same frame Xi but at different temporal states of the LSTM layers. For the domain-specific training, the classification layer of the CNN is re-initialized and trained on camera road data. Training of the LSTM layer is conducted in a many-to-one fashion; the network learns the steering decisions that are associated with intervals of driving.


  • Reactive Ground Vehicle Control via Deep Networks (https://pdfs.semanticscholar.org/ec17/ec40bb48ec396c626506b6fe5386a614d1c7.pdf): They present a deep learning based reactive controller that uses a simple network architecture requiring few training images. Despite its simple structure and small size, their network architecture, called ControlNet, outperforms more complex networks in multiple environments using different robot platforms. They evaluate ControlNet in structured indoor environments and unstructured outdoor environments. This paper focuses on the low-level task of reactive control, where the robot must avoid obstacles that were not present during map construction such as dynamic obstacles and items added to the environment after map construction. ControlNet abstracts RGB images to generate control commands: turn left, turn right, and go straight. ControlNet’s architecture consists of alternating convolutional layers with max pooling layers, followed by two fully connected layers. The convolutional and pooling layers extract geometric information about the environment while the fully connected layers act as a general classifier. A long short-term memory (LSTM) layer allows the robot to incorporate temporal information by allowing it to continue moving in the same direction over several frames. ControlNet has 63223 trainable parameters.


  • Explaining How a Deep Neural Network Trained with End-to-End Learning Steers a Car (https://arxiv.org/pdf/1704.07911.pdf): NVIDIA has created a neural-network-based system, known as PilotNet, which outputs steering angles given images of the road ahead. PilotNet is trained using road images paired with the steering angles generated by a human driving a data-collection car. It derives the necessary domain knowledge by observing human drivers. Road tests demonstrated that PilotNet can successfully perform lane keeping in a wide variety of driving conditions, regardless of whether lane markings are present or not. PilotNet training data contains single images sampled from video from a front-facing camera in the car, paired with the corresponding steering command (1/r), where r is the turning radius of the vehicle. The training data is augmented with additional image/steering-command pairs that simulate the vehicle in different off-center and off-orientation positions. The PilotNet network consists of 9 layers, including a normalization layer, 5 convolutional layers and 3 fully connected layers. The input image is split into YUV planes and passed to the network. The central idea in discerning the salient objects is finding parts of the image that correspond to locations where the feature maps have the greatest activations. The activations of the higher-level maps become masks for the activations of lower levels using the following algorithm: (1) in each layer, the activations of the feature maps are averaged; (2) the topmost averaged map is scaled up to the size of the map of the layer below; (3) the up-scaled averaged map from an upper level is then multiplied with the averaged map from the layer below; (4) the intermediate mask is scaled up to the size of the maps of the layer below in the same way as described in Step (2); (5) the up-scaled intermediate map is again multiplied with the averaged map from the layer below; (6) Steps (4) and (5) above are repeated until the input is reached. The last mask, which is of the size of the input image, is normalized to the range from 0.0 to 1.0 and becomes the final visualization mask (showing which regions of the input image contribute most to the output of the network).


  • Deep Steering: Learning End-to-End Driving Model from Spatial and Temporal Visual Cues (https://arxiv.org/pdf/1708.03798.pdf): in this work they focus on a vision-based model that directly maps raw input images to steering angles using deep networks. First, the model is learned and evaluated on real human driving videos that are time-synchronized with other vehicle sensors. This differs from many prior models trained on synthetic data from racing games. Second, state-of-the-art models, such as PilotNet, mostly predict the wheel angles independently for each video frame, which contradicts the common understanding of driving as a stateful process. Instead, their proposed model combines spatial and temporal cues, jointly investigating instantaneous monocular camera observations and the vehicle's historical states. In practice this is accomplished by inserting carefully-designed recurrent units (e.g., LSTM and Conv-LSTM) at proper network layers. Third, to facilitate the interpretability of the learned model, they utilize a visual back-propagation scheme for discovering and visualizing the image regions crucially influencing the final steering prediction.


  • Agile Autonomous Driving using End-to-End Deep Imitation Learning (https://arxiv.org/pdf/1709.07174.pdf): they present an end-to-end imitation learning system for agile, off-road autonomous driving using only low-cost on-board sensors. By imitating a model predictive controller equipped with advanced sensors, they train a deep neural network control policy to map raw, high-dimensional observations to continuous steering and throttle commands. Compared with recent approaches to similar tasks, their method requires neither state estimation nor on-the-fly planning to navigate the vehicle. Their approach relies on, and experimentally validates, recent imitation learning theory.


In addition, I followed the Tensorflow convolutional neural networks tutorial (https://www.tensorflow.org/tutorials/layers). In this tutorial, I've learnt how to use layers to build a convolutional neural network model to recognize the handwritten digits in the MNIST data set. As the model trains, you'll see log output like the following:

INFO:tensorflow:loss = 2.36026, step = 1
INFO:tensorflow:probabilities = [[ 0.07722801  0.08618255  0.09256398, ...]]
...
INFO:tensorflow:loss = 2.13119, step = 101
INFO:tensorflow:global_step/sec: 5.44132
...
INFO:tensorflow:Saving checkpoints for 20000 into /tmp/mnist_convnet_model/model.ckpt.
INFO:tensorflow:Loss for final step: 0.14782684.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-06-01-15:31:44
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/mnist_convnet_model/model.ckpt-20000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-06-01-15:31:53
INFO:tensorflow:Saving dict for global step 20000: accuracy = 0.9695, global_step = 20000, loss = 0.10200113
{'loss': 0.10200113, 'global_step': 20000, 'accuracy': 0.9695}

Here, I've achieved an accuracy of 96.95% on our test data set.

Week 14: Testing DetectionSuite[edit]

This week, I've been testing DetectionSuite. To test it, you need to try the DatasetEvaluationApp. To check how this app works, you have to execute:

./DatasetEvaluationApp --configFile=appConfig.txt

My appConfig.txt is:

--datasetPath
/home/vanejessi/DeepLearningSuite/DeepLearningSuite/DatasetEvaluationApp/sampleFiles/datasets/

--evaluationsPath
/home/vanejessi/DeepLearningSuite/DeepLearningSuite/DatasetEvaluationApp/sampleFiles/evaluations

--weightsPath
/home/vanejessi/DeepLearningSuite/DeepLearningSuite/DatasetEvaluationApp/sampleFiles/weights/yolo_2017_07

--netCfgPath
/home/vanejessi/DeepLearningSuite/DeepLearningSuite/DatasetEvaluationApp/sampleFiles/cfg/darknet

--namesPath
/home/vanejessi/DeepLearningSuite/DeepLearningSuite/DatasetEvaluationApp/sampleFiles/cfg/SampleGenerator

--inferencesPath
/home/vanejessi/DeepLearningSuite/DeepLearningSuite/DatasetEvaluationApp/sampleFiles/evaluations


To check its operation, you have to select the following:

However, I have had some errors in execution with CUDA:

mask_scale: Using default '1,000000'
CUDA Error: no kernel image is available for execution on the device
CUDA Error: no kernel image is available for execution on the device: El archivo ya existe

Week 13: Reinstall DetectionSuite[edit]

This week, I reinstalled Darknet and DetectionSuite. You can find the installation information in the following link: https://github.com/JdeRobot/dl-DetectionSuite/blob/master/DeepLearningSuite/Dockerfile/Dockerfile.

Week 12: GUI with C++[edit]

This week, I created some GUIs following these tutorials: http://zetcode.com/gui/gtk2/introduction/ and https://developer.gnome.org/gtkmm-tutorial/stable/index.html.en.

Week 11: Component for object detection[edit]

This week, I created a component in Python for people detection using an SSD network. The component can be seen in the following video:

Week 10: Embedding Python in C++, SSD[edit]

This week, I searched for information on embedding Python code in C++. You can find information in the following links: https://docs.python.org/2/extending/embedding.html, https://www.codeproject.com/Articles/11805/Embedding-Python-in-C-C-Part-I, https://realmike.org/blog/2012/07/05/supercharging-c-code-with-embedded-python/, https://skebanga.github.io/embedded-python-pybind11/. I have made a simple example (hello.cpp):

#include <stdio.h>
/* Use the same Python version that is linked below (python2.7) */
#include <python2.7/Python.h>

int main()
{
    /* Start the Python interpreter embedded in this process */
    Py_Initialize();

    PyRun_SimpleString("print('Hello World from Embedded Python!!!')");

    /* Shut the interpreter down before exiting */
    Py_Finalize();

    printf("\nPress any key to exit...\n");
    return 0;
}

It is possible to get linker errors (undefined references to the Python symbols) if you compile with:

gcc hello.cpp -o hello

That's why I compiled it in the following way, linking the Python library explicitly:

gcc hello.cpp -o hello -L/usr/lib/python2.7/config/ -lpython2.7

The result is:

Hello World from Embedded Python!!!

Press any key to exit...


I have also made another more complex example:

  • C++ code:
#include <python2.7/Python.h>

int
main(int argc, char *argv[])
{
    PyObject *pName, *pModule, *pDict, *pFunc;
    PyObject *pArgs, *pValue;
    int i;

    if (argc < 3) {
        fprintf(stderr,"Usage: call pythonfile funcname [args]\n");
        return 1;
    }

    Py_Initialize();
    pName = PyString_FromString(argv[1]);
    /* Error checking of pName left out */

    pModule = PyImport_Import(pName);
    Py_DECREF(pName);

    if (pModule != NULL) {
        pFunc = PyObject_GetAttrString(pModule, argv[2]);
        /* pFunc is a new reference */

        if (pFunc && PyCallable_Check(pFunc)) {
            pArgs = PyTuple_New(argc - 3);
            for (i = 0; i < argc - 3; ++i) {
                pValue = PyInt_FromLong(atoi(argv[i + 3]));
                if (!pValue) {
                    Py_DECREF(pArgs);
                    Py_DECREF(pModule);
                    fprintf(stderr, "Cannot convert argument\n");
                    return 1;
                }
                /* pValue reference stolen here: */
                PyTuple_SetItem(pArgs, i, pValue);
            }
            pValue = PyObject_CallObject(pFunc, pArgs);
            Py_DECREF(pArgs);
            if (pValue != NULL) {
                printf("Result of call: %ld\n", PyInt_AsLong(pValue));
                Py_DECREF(pValue);
            }
            else {
                Py_DECREF(pFunc);
                Py_DECREF(pModule);
                PyErr_Print();
                fprintf(stderr,"Call failed\n");
                return 1;
            }
        }
        else {
            if (PyErr_Occurred())
                PyErr_Print();
            fprintf(stderr, "Cannot find function \"%s\"\n", argv[2]);
        }
        Py_XDECREF(pFunc);
        Py_DECREF(pModule);
    }
    else {
        PyErr_Print();
        fprintf(stderr, "Failed to load \"%s\"\n", argv[1]);
        return 1;
    }
    Py_Finalize();
    return 0;
}
  • Python code:
def multiply(a,b):
    print "Will compute", a, "times", b
    c = 0
    for i in range(0, a):
        c = c + b
    return c

I have compiled the code by running the following in the terminal:

gcc multiply.cpp -o multiply -L/usr/lib/python2.7/config/ -lpython2.7

To execute:

./multiply multiply multiply 3 4

It is possible to get some errors. If you see an error like ImportError: No module named multiply, you have to run it with the current directory in PYTHONPATH:

PYTHONPATH=. ./multiply multiply multiply 3 4

And the result is:

Will compute 3 times 4
Result of call: 12


Also, this week I made a simple SSD code to detect cars, based on the implementation in https://github.com/ksketo/CarND-Vehicle-Detection. This code can be found at https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/Simple%20net%20Object%20detection/car_detection. To detect the cars we execute the code detectionImages.py. The result is as follows:


Also, I made a simple SSD to detect persons. The result is as follows:

Weeks 8,9: State of the art, crop datasets, and installation of DetectionSuite[edit]

These weeks, I have written the state of the art of Deep Learning for object detection. The state of the art is available at the following link: https://github.com/RoboticsURJC-students/2017-tfm-vanessa-fernandez/tree/master/State%20of%20the%20art .

Also, this week I've been looking for information on crop datasets. Some datasets are:

  • FAOSTAT [2]: provides free access to food and agriculture data for over 245 countries and territories and covers all FAO regional groupings from 1961 to the most recent year available.
  • Covertype Data Set [3]: includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices. As for primary major tree species in these areas, Neota would have spruce/fir (type 1), while Rawah and Comanche Peak would probably have lodgepole pine (type 2) as their primary species, followed by spruce/fir and aspen (type 5). Cache la Poudre would tend to have Ponderosa pine (type 3), Douglas-fir (type 6), and cottonwood/willow (type 4).
  • Plants Data Set [4]: Data has been extracted from the USDA plants database. It contains all plants (species and genera) in the database and the states of USA and Canada where they occur.
  • Urban Land Cover Data Set [5]: Contains training and testing data for classifying a high resolution aerial image into 9 types of urban land cover. Multi-scale spectral, size, shape, and texture information are used for classification. There are a low number of training samples for each class (14-30) and a high number of classification variables (148), so it may be an interesting data set for testing feature selection methods. The testing data set is from a random sampling of the image. Class is the target classification variable. The land cover classes are: trees, grass, soil, concrete, asphalt, buildings, cars, pools, shadows.
  • Area, Yield and Production of Crops [6]: Area under crops, yield of crops in tonnes per hectare, and production of crops in tonnes, classified by type of crop.
  • Target 8 Project cultivation database [7]: has been written in Microsoft Access, and provides a valuable source of information on habitats, ecology, reproductive attributes, germination protocol and so forth (see below). Attributes can be compared, and species from similar habitats can be quickly listed. This may help in achieving the right conditions for a species by utilising your knowledge of more familiar plants. The Target 8 dataset form presents the data for the project species only, with a sub-form giving a list of submitted protocols. A full dataset of the British and Irish flora is also available (Entire dataset form), which gives the same data for non-threatened taxa as well as many neophytes.
  • Datasets of Nelson Institute University of Wisconsin-Madison [8]: There are some datasets for crop or vegetation.
  • Global data set of monthly irrigated and rainfed crop areas around the year 2000 (MIRCA2000) [9][10]: is a global dataset with spatial resolution of 5 arc minutes (about 9.2 km at the equator) which provides both irrigated and rainfed crop areas of 26 crop classes for each month of the year. The crops include all major food crops (wheat, maize, rice, barley, rye, millet, sorghum, soybean, sunflower, potato, cassava, sugarcane, sugarbeet, oil palm, canola, groundnut, pulses, citrus, date palm, grape, cocoa, coffee, other perennials, fodder grasses, other annuals) as well as cotton. For some crops no adequate statistical information was available to retain them individually and these crops were grouped into ‘other’ categories (perennial, annual and fodder grasses).
  • Root crops and plants harvested green from arable land by area [11]: includes the areas sown with potatoes (including seed), sugar beet (excluding seed), temporary grasses for grazing, hay or silage (in the crop rotation cycle), leguminous plants grown and harvested green (as the whole plant, mainly for fodder, energy or green manuring use), and other cereals harvested green (excluding green maize): rye, wheat, triticale, annual sorghum, buckwheat, etc. This indicator uses the concept of "area under cultivation", which corresponds: • before the harvest, to the sown area; • after the harvest, to the sown area excluding the non-harvested area (e.g. area ruined by natural disasters, area not harvested for economic reasons, etc.).


Also, I installed DetectionSuite. I have followed the steps indicated in the following link https://github.com/JdeRobot/dl-DetectionSuite. It is important to install version 9 of CUDA, because other versions will probably give an error.

Weeks 5,6,7: SSD and R-CNN[edit]

These weeks, I've studied some object detection algorithms. Object detection is the process of finding instances of real-world objects in images or videos. Object detection algorithms usually use extracted features and learning algorithms to recognize instances of an object category. There are different algorithms for object detection. Some of them are:

  • R-CNN (Recurrent Convolutional Neural Network) [12]: is a special type of CNN that is able to locate and detect objects in images: the output is generally a set of bounding boxes that closely match each of the detected objects, as well as a class output for each detected object. Though the input is static, the activities of RCNN units evolve over time so that the activity of each unit is modulated by the activities of its neighboring units. This property enhances the ability of the model to integrate the context information, which is important for object recognition. Like other recurrent neural networks, unfolding the RCNN through time can result in an arbitrarily deep network with a fixed number of parameters. Furthermore, the unfolded network has multiple paths, which can facilitate the learning process.
  • Single Shot Multibox Detector (SSD) [13]: discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component.

Week 4: Starting with DetectionSuite[edit]

DeepLearning Suite (https://github.com/JdeRobot/DeepLearningSuite) is a set of tools that simplifies the evaluation of the most common object detection datasets with several object detection neural networks. It offers a generic infrastructure to evaluate object detection algorithms against a dataset and compute the most common statistics: precision and recall. DeepLearning Suite supports YOLO (darknet) and background subtraction. I've installed YOLO (darknet), following these steps:

git clone https://github.com/pjreddie/darknet
cd darknet
make

Downloading the pre-trained weight file:

wget https://pjreddie.com/media/files/yolo.weights

Then run the detector:

./darknet detect cfg/yolo.cfg yolo.weights data/dog.jpg

Week 3: Getting started[edit]

Install Keras[edit]

Keras is a high-level neural networks library, written in Python and capable of running on top of either TensorFlow or Theano (both open-source libraries for numerical computation optimized for GPU and CPU). It was developed with a focus on enabling fast experimentation. I followed the installation steps at https://keras.io/#installation. Currently, I'm running Keras on top of TensorFlow (https://www.tensorflow.org/) optimized for CPU.

Keras includes a module for multiple supplementary tasks called Utils. The most important functionality this module provides for the project is the HDF5Matrix() method. Keras uses the HDF5 file format (http://docs.h5py.org/en/latest/build.html) to save models and read datasets. According to the HDF5 documentation, it is a hierarchical data format designed for high volumes of data with complex relationships.
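
A minimal sketch of reading a dataset through this method, assuming Keras 2.x and hypothetical file/dataset names:

# Illustrative use of HDF5Matrix (file name and dataset keys are assumptions)
from keras.utils import HDF5Matrix

x_train = HDF5Matrix('dataset.h5', 'images')   # lazily reads the 'images' dataset from the HDF5 file
y_train = HDF5Matrix('dataset.h5', 'labels')
# These objects can be passed directly to model.fit(x_train, y_train, shuffle='batch', ...)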

The commands to install Keras and HDF5 are:

sudo pip install tensorflow
sudo pip install keras
sudo apt-get install libhdf5-dev
sudo pip install h5py

Testing the code of a neural network[edit]

After installing Keras, I've tried the TFG of my mate David Pascual (http://jderobot.org/Dpascual-tfg). In his project, he was able to classify a well-known database of handwritten digits (MNIST) using a convolutional neural network (CNN). He created a program that is able to get images from live video and display the predictions obtained from them. I've studied his code and tested its operation.

If you want to execute David's code you have to open two terminals and run, respectively:

cameraserver cameraserver_digitclassifier.cfg 
python digitclassifier.py digitclassifier.cfg


First neural network: MNIST example[edit]

After studying David's code a bit, I have trained a simple convolutional neural network with the MNIST dataset.

First of all, we have to load and adapt the input data. The Keras library contains a module named datasets from which we can import a few databases, including MNIST. In order to load the MNIST data, we call the mnist.load_data() function. It returns images and labels from both the training and test datasets (x_train, y_train and x_test, y_test respectively).

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

The dimensions of x_train are (60000, 28, 28). That is, we have 60,000 images with a size of 28x28. We also have to explicitly declare the depth of the samples. In this case, we're working with black and white images, so the depth will be equal to 1. We reshape the data using the reshape() method. Depending on the backend (TensorFlow or Theano), the arguments must be passed in a different order.

# Shape of image depends if you use TensorFlow or Theano
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

The next step is to convert the data type from uint8 to float32 and normalize the pixel values to the [0,1] range.

# We normalize the data
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
('x_train shape:', (60000, 28, 28, 1))
(60000, 'train samples')
(10000, 'test samples')

We have to convert the label vectors to one-hot (categorical) form using the utils.to_categorical() method.

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

Now that the data is ready, we have to define the architecture of our neural network. We use the Sequential model.

model = Sequential()

The next step is to add the input, inner and output layers. We can add layers using the add() method. Convolutional neural networks usually contain convolutional, pooling and fully connected layers. We're going to add these layers to our model (the variables nb_filters, nb_conv, nb_pool and num_classes used below are defined earlier in the script).

model.add(Conv2D(nb_filters, kernel_size=nb_conv,
                     activation='relu',
                     input_shape=input_shape))
model.add(Conv2D(64, nb_conv, activation='relu'))
model.add(MaxPooling2D(pool_size=nb_pool))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=keras.optimizers.Adadelta(),
                  metrics=['accuracy'])

We have to train the neural network. We must use the fit() method.

# We train the model
history = model.fit(x_train, y_train, batch_size=batch_size,
          epochs=epochs, verbose=1,
          validation_data=(x_test, y_test))

Last step is to evaluate the model.

# We evaluate the model
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Keras displays the results obtained.

Train on 60000 samples, validate on 10000 samples
Epoch 1/12
  128/60000 [..............................] - ETA: 4:34 - loss: 2.2873 - acc: 0  
256/60000 [..............................] - ETA: 4:12 - loss: 2.2707 - acc: 0  
384/60000 [..............................] - ETA: 3:59 - loss: 2.2541 - acc: 0  
512/60000 [..............................] - ETA: 3:51 - loss: 2.2271 - acc: 0  
640/60000 [..............................] - ETA: 3:47 - loss: 2.1912 - acc: 0

...

59520/60000 [============================>.] - ETA: 1s - loss: 0.0382 - acc: 0.988
59648/60000 [============================>.] - ETA: 1s - loss: 0.0382 - acc: 0.988
59776/60000 [============================>.] - ETA: 0s - loss: 0.0382 - acc: 0.988
59904/60000 [============================>.] - ETA: 0s - loss: 0.0381 - acc: 0.988
60000/60000 [==============================] - 264s 4ms/step - loss: 0.0381 - acc: 0.9889 - val_loss: 0.0293 - val_acc: 0.9905
('Test loss:', 0.029295223668735708)
('Test accuracy:', 0.99050000000000005)

The fit() function returns a history object which contains a dictionary of all the metrics tracked during training. We can use the data in the history object to plot the loss and accuracy curves and check how the training process went. You can call history.history.keys() to check which metrics are present in the history; it should look like ['acc', 'loss', 'val_acc', 'val_loss'].
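
For example, the curves can be plotted from that dictionary (a minimal sketch, assuming the history object returned by fit() above):

# Plot the training curves stored in the history object (minimal sketch)
import matplotlib.pyplot as plt

plt.figure()
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.plot(history.history['acc'], label='train accuracy')
plt.plot(history.history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()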

Week 2: BBDD of Deep Learning and C++ Tutorials[edit]

BBDD of Deep Learning[edit]

This second week, I've studied some datasets used in Deep Learning. Datasets provide a means to train and evaluate algorithms, and they drive research in new directions. Datasets related to object recognition can be split into three groups: object classification, object detection and semantic scene labeling.

Object classification: assigning pixels in the image to categories or classes of interest. There are different datasets for image classification. Some of them are:

  • MNIST (Modified National Institute of Standards and Technology): is a large database of handwritten digits that is commonly used for training various image processing systems. It consists of a training set of 60,000 examples and a test set of 10,000 examples. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
  • Caltech 101: is intended to facilitate Computer Vision research and techniques and is most applicable to techniques involving image recognition classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories and a background category. Provided with the images are a set of annotations describing the outlines of each image, along with a Matlab script for viewing.
  • Caltech 256: is a set similar to the Caltech 101, but with some improvements: 1) the number of categories is more than doubled, 2) the minimum number of images in any category is increased from 31 to 80, 3) artifacts due to image rotation are avoided, and 4) a new and larger clutter category is introduced for testing background rejection.
  • CIFAR-10: is an established computer-vision dataset used for object recognition. It is a subset of the 80 million tiny images dataset and consists of 60,000 32x32 color images containing one of 10 object classes, with 6000 images per class. There are 50,000 training images and 10,000 test images in the official data.
  • CIFAR-100: is large, consisting of 100 image classes, with 600 images per class. Each image is 32x32x3 (3 color), and the 600 images are divided into 500 training, and 100 test for each class.
  • ImageNet: is organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of them nouns (80,000+). ImageNet provides on average 1,000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated.


Object detection: the process of finding instances of real-world objects in images or videos. Object detection algorithms usually use extracted features and learning algorithms to recognize instances of an object category. There are different datasets for object detection. Some of them are:

  • COCO (Microsoft Common Objects in Context): is an image dataset designed to spur object detection research with a focus on detecting objects in context. COCO has the following characteristics: multiple objects in the images, more than 300,000 images, more than 2 million instances, and 80 object categories. Training sets, test and validation are used with their corresponding annotations. The annotations include pixel-level segmentation of object belonging to 80 categories, keypoint annotations for person instances, stuff segmentations for 91 categories, and five image captions per image.
  • PASCAL VOC: in 2007, the challenge provided two large databases, one consisting of a validation set and a training set, and the other of a single test set. Both databases contain about 5,000 images each, representing approximately 12,000 objects, so in total the 2007 set has about 10,000 images in which about 24,000 objects are represented. In 2012 this set was extended, increasing the number of annotated images to 11,530, containing 27,450 different objects.
  • Caltech Pedestrian Dataset: consists of approximately 10 hours of 640x480 30Hz video taken from a vehicle driving through regular traffic in an urban environment. About 250,000 frames with a total of 350,000 bounding boxes and 2300 unique pedestrians were annotated. The annotation includes temporal correspondence between bounding boxes and detailed occlusion labels.
  • TME Motorway Dataset (Toyota Motor Europe (TME) Motorway Dataset): is composed by 28 clips for a total of approximately 27 minutes (30000+ frames) with vehicle annotation. Annotation was semi-automatically generated using laser-scanner data. The dataset is divided in two sub-sets depending on lighting condition, named “daylight” (although with objects casting shadows on the road) and “sunset” (facing the sun or at dusk).


Semantic scene labeling: each pixel of an image has to be labeled as belonging to a category. There are different datasets for semantic scene labeling. Some of them are:

  • SUN dataset: provides a collection of annotated images covering a large variety of environmental scenes, places and the objects within them. SUN contains 908 scene categories from the WordNet dictionary with segmented objects. The 3,819 object categories span those common to object detection datasets (person, chair, car) and to semantic scene labeling (wall, sky, floor). A few categories have a large number of instances (wall: 20,213, window: 16,080, chair: 7,971) while most have a relatively modest number of instances (boat: 349, airplane: 179, floor lamp: 276).
  • Cityscapes Dataset: contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high-quality pixel-level annotations of 5,000 frames in addition to a larger set of 20,000 weakly annotated frames.


C++ Tutorials[edit]

This week, I've been doing C++ tutorials. I followed the tutorials on the following page: https://codigofacilito.com/cursos/c-plus-plus.

Week 1: Read old works[edit]

After that, I needed to know the state of the art of Deep Learning, so I've read previous works by other students. First, I read Nuria Oyaga's Final Degree Project: "Análisis de Aprendizaje Profundo con la plataforma Caffe" (Analysis of Deep Learning with the Caffe platform). Then, I read David Pascual's Final Degree Project: "Study of Convolutional Neural Networks using Keras Framework". Finally, I read an introduction to Deep Learning.