An overview of the basic architecture of our proposed Proportional-Integral-Derivative Network (PIDNet).

S and B denote semantic and boundary, respectively; Add and Up refer to element-wise summation and the bilinear upsampling operation; BAS-Loss represents the boundary-awareness CE loss. Dashed lines and their associated blocks are ignored at the inference stage.

Pag: Selective Learning High-level Semantics:

A lateral connection enhances information transmission between different feature maps and improves the representation ability of the model. In PIDNet, the rich and accurate semantic information provided by the I branch is crucial for the detail parsing of the P branch, which contains relatively few layers and channels. Thus, the I branch can be treated as a backup for the other two branches, providing them with the required information. Unlike the D branch, which directly adds the provided feature maps, we introduce a Pixel-attention-guided fusion module (Pag) that lets the P branch selectively learn useful semantic features from the I branch without being overwhelmed. The underlying concept of Pag is borrowed from the self-attention mechanism [46], but Pag computes the attention locally to meet real-time requirements.

Illustration of Pag module in lateral connection.

Define the vectors for the corresponding pixels in the feature maps provided by the P branch and the I branch as v⃗p and v⃗i, respectively; the output of the Sigmoid function then becomes:

$$\sigma = \mathrm{Sigmoid}\!\left(f_p(\vec{v}_p)^{\top} f_i(\vec{v}_i)\right)$$

where fp and fi are embedding functions (1×1 convolutions) applied to the two branches,

and σ represents the probability that these two pixels come from the same object. If σ is high, we trust v⃗i more, since the I branch is semantically accurate, and vice versa. Thus, the output of the Pag module can be written as:

$$\mathrm{Out}_{\mathrm{Pag}} = \sigma\,\vec{v}_i + (1-\sigma)\,\vec{v}_p$$
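As a concrete illustration, here is a minimal PyTorch sketch of the Pag fusion above. The 1×1-convolution embeddings f_p and f_i and the channel sizes are assumptions (BN/ReLU layers are omitted); this is a reading of the two equations, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pag(nn.Module):
    """Pixel-attention-guided fusion (sketch): P branch selectively absorbs
    I-branch semantics via a per-pixel attention weight sigma."""

    def __init__(self, in_ch, embed_ch):
        super().__init__()
        # f_p and f_i: 1x1-conv embeddings for the P- and I-branch features
        self.f_p = nn.Conv2d(in_ch, embed_ch, kernel_size=1, bias=False)
        self.f_i = nn.Conv2d(in_ch, embed_ch, kernel_size=1, bias=False)

    def forward(self, p, i):
        # bring the (coarser) I-branch features to the P-branch resolution
        i = F.interpolate(i, size=p.shape[2:], mode='bilinear',
                          align_corners=False)
        # local attention: per-pixel dot product of the two embeddings
        sigma = torch.sigmoid(
            torch.sum(self.f_p(p) * self.f_i(i), dim=1, keepdim=True))
        # Out = sigma * v_i + (1 - sigma) * v_p: trust I where sigma is high
        return sigma * i + (1 - sigma) * p
```

Computing the attention as a per-pixel dot product (rather than full self-attention over all positions) is what keeps this cheap enough for real-time use.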

PAPPM: Fast Aggregation of Contexts:

History:

For a better global scene prior, Spatial Pyramid Pooling (SPP) was adopted in SwiftNet to parse global dependencies.

PSPNet introduced the Pyramid Pooling Module (PPM), which concatenates multi-scale pooling maps before a convolution layer to form local and global context representations.

The Deep Aggregation PPM (DAPPM), proposed in Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation, further improved the context-embedding ability of PPM and showed superior performance. However, the computation of DAPPM cannot be parallelized across its depth, which is time-consuming, and DAPPM contains too many channels per scale, exceeding the representation ability of lightweight models.

The detailed architecture of Deep Aggregation Pyramid Pooling Module

Taking feature maps at 1/64 image resolution as input, DAPPM applies large pooling kernels with exponential strides to generate feature maps at 1/128, 1/256, and 1/512 image resolution. The input feature maps and the image-level information generated by global average pooling are also utilized. The authors argue that blending all the multi-scale contextual information with a single 3×3 or 1×1 convolution is inadequate.

Inspired by Res2Net, the feature maps are first upsampled, and then more 3×3 convolutions fuse the contextual information of different scales in a hierarchical-residual way. Given an input x, each scale yi can be written as:

$$
y_i =
\begin{cases}
C_{1\times1}(x), & i = 1;\\[2pt]
C_{3\times3}\!\left(U\!\left(C_{1\times1}\!\left(P_{j,k}(x)\right)\right) + y_{i-1}\right), & 1 < i < n;\\[2pt]
C_{3\times3}\!\left(U\!\left(C_{1\times1}\!\left(P_{\mathrm{global}}(x)\right)\right) + y_{n-1}\right), & i = n,
\end{cases}
$$

where C1×1 is a 1×1 convolution, C3×3 is a 3×3 convolution, U denotes the upsampling operation, Pj,k denotes a pooling layer with kernel size j and stride k, and Pglobal denotes global average pooling. In the end, all feature maps are concatenated and compressed using a 1×1 convolution, and a 1×1 projection shortcut is added for easier optimization.

Inside DAPPM, contexts extracted by larger pooling kernels are integrated into deeper information flows, and its multi-scale nature comes from combining different depths with different pooling-kernel sizes.
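The hierarchical-residual recurrence can be sketched in PyTorch as follows. This is a simplified reading of the equation, assuming average pooling with kernel/stride pairs 5/2, 9/4, 17/8 and omitting the BN/ReLU layers; channel counts are illustrative. Note the loop: each y_i needs y_{i-1}, which is exactly the sequential dependency that prevents parallelization across depth.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DAPPM(nn.Module):
    """Sketch of DAPPM's hierarchical-residual fusion.
    The scales MUST be computed sequentially: y_i depends on y_{i-1}."""

    def __init__(self, in_ch, branch_ch=128, out_ch=256):
        super().__init__()
        self.pools = nn.ModuleList([
            nn.AvgPool2d(5, stride=2, padding=2),    # -> 1/128 image resolution
            nn.AvgPool2d(9, stride=4, padding=4),    # -> 1/256
            nn.AvgPool2d(17, stride=8, padding=8),   # -> 1/512
            nn.AdaptiveAvgPool2d(1),                 # P_global (image-level)
        ])
        # C_1x1 per scale (index 0 produces y_1 from the raw input)
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, 1, bias=False) for _ in range(5)])
        # C_3x3 per pooled scale
        self.fuse = nn.ModuleList(
            [nn.Conv2d(branch_ch, branch_ch, 3, padding=1, bias=False)
             for _ in range(4)])
        self.compress = nn.Conv2d(5 * branch_ch, out_ch, 1, bias=False)
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        size = x.shape[2:]
        ys = [self.reduce[0](x)]                       # y_1 = C_1x1(x)
        for pool, red, conv in zip(self.pools, self.reduce[1:], self.fuse):
            up = F.interpolate(red(pool(x)), size=size, mode='bilinear',
                               align_corners=False)    # U(C_1x1(P_{j,k}(x)))
            ys.append(conv(up + ys[-1]))               # y_i = C_3x3(... + y_{i-1})
        out = self.compress(torch.cat(ys, dim=1))      # concat + 1x1 compression
        return out + self.shortcut(x)                  # 1x1 projection shortcut
```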

In PIDNet, the authors slightly change the connections in DAPPM to make it parallelizable and reduce the number of channels for each scale from 128 to 96.

The parallel structure of PAPPM.

This new context-harvesting module, called Parallel Aggregation PPM (PAPPM), is applied in PIDNet-M and PIDNet-S to improve their speed. For the deep model, PIDNet-L, they still choose DAPPM, considering its depth, but reduce its number of channels for each scale from 128 to 112.
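A sketch of the parallel variant, under the same pooling-scale assumptions as above: every pooled scale is added to a common base map y0 instead of to the previous scale, and all scales are then processed by a single grouped 3×3 convolution. Removing the y_{i-1} chain lets the scales be computed in parallel. Channel counts and the BN/ReLU-free layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PAPPM(nn.Module):
    """Parallel Aggregation PPM (sketch): no cross-scale sequential chain."""

    def __init__(self, in_ch, branch_ch=96, out_ch=128):
        super().__init__()
        self.pools = nn.ModuleList([
            nn.AvgPool2d(5, stride=2, padding=2),
            nn.AvgPool2d(9, stride=4, padding=4),
            nn.AvgPool2d(17, stride=8, padding=8),
            nn.AdaptiveAvgPool2d(1),                 # image-level context
        ])
        self.scale0 = nn.Conv2d(in_ch, branch_ch, 1, bias=False)
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, 1, bias=False) for _ in range(4)])
        # one grouped 3x3 conv processes all four pooled scales at once,
        # replacing DAPPM's chain of per-scale 3x3 convs
        self.scale_process = nn.Conv2d(4 * branch_ch, 4 * branch_ch, 3,
                                       padding=1, groups=4, bias=False)
        self.compress = nn.Conv2d(5 * branch_ch, out_ch, 1, bias=False)
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        size = x.shape[2:]
        y0 = self.scale0(x)
        # each scale depends only on x and y0 -> fully parallelizable
        ys = [F.interpolate(r(p(x)), size=size, mode='bilinear',
                            align_corners=False) + y0
              for p, r in zip(self.pools, self.reduce)]
        fused = self.scale_process(torch.cat(ys, dim=1))
        out = self.compress(torch.cat([y0, fused], dim=1))
        return out + self.shortcut(x)
```

The grouped convolution keeps each scale's channels separate while letting one kernel launch cover all scales, which is where the speed-up over the sequential DAPPM comes from.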

Bag: Balancing the Details and Contexts:

Given the boundary features extracted by the ADB, our proposal is to employ boundary attention to guide the fusion of the detailed (P) and context (I) representations. Therefore, we design a Boundary-attention-guided fusion module (Bag) to fuse the features provided by the three branches. Note that the context branch is semantically rich and presents more accurate semantics, but it loses too much spatial and geometric detail, especially around boundary regions and small objects. Since the detailed branch preserves spatial details better, we force the model to trust the detailed branch more along boundary regions and utilize the context features to fill the areas inside objects, which is accomplished by Bag.
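A minimal sketch of the Bag blend, assuming the attention is σ = Sigmoid(D) computed per pixel from the boundary features D; the trailing 3×3 convolution (to smooth the fused map) is an assumption, not necessarily the authors' exact layout.

```python
import torch
import torch.nn as nn

class Bag(nn.Module):
    """Boundary-attention-guided fusion (sketch): trust the detail branch (P)
    along boundaries and the context branch (I) inside objects."""

    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1, bias=False)

    def forward(self, p, i, d):
        # sigma is near 1 where the ADB features indicate a boundary
        sigma = torch.sigmoid(d)
        # sigma * P dominates on boundaries; (1 - sigma) * I fills interiors
        return self.conv(sigma * p + (1 - sigma) * i)
```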