State-of-the-art segmentation architectures do not yet achieve a good trade-off between high quality and computational resources.
Recently, the residual layers proposed in [6] have set a new trend in ConvNet design. Their reformulation of convolutional layers to avoid the degradation problem of deep architectures has allowed recent works to achieve very high accuracies with networks that stack large numbers of layers.
Our proposal aims to solve an efficiency limitation that is inherent in commonly adopted versions of the residual layer.
Residual layers [6] have the property of allowing convolutional layers to approximate residual functions, as the output vector $\mathbf{y}$ of a layer with input vector $\mathbf{x}$ becomes:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s\,\mathbf{x},$$

where $W_s$ is usually an identity mapping and $\mathcal{F}(\mathbf{x}, \{W_i\})$ represents the residual mapping to be learned. This residual formulation facilitates learning and significantly reduces the degradation problem present in architectures that stack a large number of layers.

The original work proposes two instances of this residual layer: the non-bottleneck design, with two 3x3 convolutions, and the bottleneck design, which stacks 1x1, 3x3 and 1x1 convolutions; both are sketched below. Both versions have a similar number of parameters and achieve almost equivalent accuracy.
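For concreteness, the following is a minimal PyTorch sketch of the two designs. The channel counts, BatchNorm placement and the `reduction` factor are illustrative assumptions, not the exact configuration of [6]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonBottleneck(nn.Module):
    """Non-bottleneck residual layer: two 3x3 convolutions plus identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # identity shortcut (W_s = identity)

class Bottleneck(nn.Module):
    """Bottleneck residual layer: 1x1 reduce, 3x3, 1x1 expand, plus identity shortcut."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction  # reduced internal width
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv2 = nn.Conv2d(mid, mid, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid)
        self.conv3 = nn.Conv2d(mid, channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + x)
```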
We propose to redesign the non-bottleneck residual module in a more efficient way by using only convolutions with 1D filters.

By leveraging this decomposition, in which each 3x3 convolution is replaced by a 3x1 convolution followed by a 1x3 one, we propose a new implementation of the residual layer that uses the described 1D factorization to accelerate the original non-bottleneck layer and to reduce its parameters. This module is faster (in computation time) and has fewer parameters than the bottleneck design, while keeping a learning capacity and accuracy equivalent to the non-bottleneck one.
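A minimal sketch of the proposed non-bottleneck-1D layer in the same PyTorch style as above: the exact placement of BatchNorm and ReLU between the 1D convolutions is an assumption for illustration, and the `dilation` argument anticipates the dilated variant used later in the encoder.

```python
import torch.nn as nn
import torch.nn.functional as F

class NonBottleneck1D(nn.Module):
    """Non-bottleneck-1D (non-bt-1D) residual layer: each 3x3 convolution of
    the non-bottleneck design is replaced by a 3x1 followed by a 1x3 convolution."""
    def __init__(self, channels, dilation=1):
        super().__init__()
        # first factorized pair (replaces the first 3x3 convolution)
        self.conv1a = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.conv1b = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.bn1 = nn.BatchNorm2d(channels)
        # second factorized pair, optionally dilated to enlarge the receptive field
        self.conv2a = nn.Conv2d(channels, channels, (3, 1),
                                padding=(dilation, 0), dilation=(dilation, 1))
        self.conv2b = nn.Conv2d(channels, channels, (1, 3),
                                padding=(0, dilation), dilation=(1, dilation))
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.conv1a(x))
        out = F.relu(self.bn1(self.conv1b(out)))
        out = F.relu(self.conv2a(out))
        out = self.bn2(self.conv2b(out))
        return F.relu(out + x)  # residual connection
```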

Both the non-bottleneck and bottleneck implementations can be factorized into 1D kernels. However, the non-bottleneck design clearly benefits more, receiving a direct 33% reduction in the parameters of both of its convolutions and a great acceleration of its execution time.
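The 33% figure follows directly from the kernel sizes: for a convolution with $C$ input and $C$ output channels,

$$\underbrace{9C^2}_{\text{one } 3\times3} \;\longrightarrow\; \underbrace{3C^2 + 3C^2 = 6C^2}_{3\times1 \text{ followed by } 1\times3},$$

i.e., $(9-6)/9 \approx 33\%$ fewer weights per factorized convolution.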
Our main motivation is to obtain an architecture that achieves the best possible trade-off between accuracy and efficiency.
We follow the current trend of using convolutions with residual connections as the core elements of our architecture: the proposed non-bottleneck-1D (non-bt-1D) layer is its basic building block. Our network is designed by sequentially stacking non-bt-1D layers in a way that best leverages their learning performance and efficiency.
We follow an encoder-decoder architecture.

The layers from 1 to 16 in our architecture form the encoder, composed of residual blocks and downsampling blocks. Downsampling (reducing the spatial resolution) has the drawback of reducing the pixel precision (coarser outputs), but it also has two benefits: it lets the deeper layers gather more context (to improve classification) and it helps to reduce computation. Therefore, to keep a good balance we perform three downsamplings: at layers 1, 2 and 8. Our downsampler block, inspired by the initial block of ENet [11], performs downsampling by concatenating the parallel outputs of a single 3x3 convolution with stride 2 and a Max-Pooling module.
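A sketch of this downsampler block, under the same illustrative PyTorch assumptions as the modules above (the convolution supplies the channels that remain after concatenating the pooled input; even input dimensions are assumed so that the two branches match in size):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsamplerBlock(nn.Module):
    """Downsampling by concatenating, along the channel axis, the parallel
    outputs of a 3x3 stride-2 convolution and a 2x2 max-pooling of the input."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # the convolution outputs the channels not covered by the pooling branch
        self.conv = nn.Conv2d(in_channels, out_channels - in_channels,
                              3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        out = torch.cat([self.conv(x), self.pool(x)], dim=1)
        return F.relu(self.bn(out))
```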
We use this block in all the downsampling layers present in our architecture. Additionally, we interleave some dilated convolutions [27] in our non-bt-1D layers to gather more context, which led to an improvement in accuracy in our experiments.
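Putting the pieces together, the encoder can be expressed by reusing the DownsamplerBlock and NonBottleneck1D sketches above. The channel widths and the exact dilation schedule are assumptions for illustration; only the downsampling positions at layers 1, 2 and 8 are fixed by the description above.

```python
encoder = nn.Sequential(
    DownsamplerBlock(3, 16),                    # layer 1
    DownsamplerBlock(16, 64),                   # layer 2
    *[NonBottleneck1D(64) for _ in range(5)],   # layers 3-7
    DownsamplerBlock(64, 128),                  # layer 8
    # layers 9-16: non-bt-1D layers with interleaved dilated convolutions
    *[NonBottleneck1D(128, dilation=d)
      for d in (2, 4, 8, 16, 2, 4, 8, 16)],
)
```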