Team Code Autopilot:

Expert driver:

Its performance is an upper bound for the learning-based TransFuser agent.

The autopilot has access to the complete state of the environment, including vehicle and pedestrian locations and actions. The expert also has access to a dense set of waypoints along the route to be followed, terminating at the agent's destination.

Nav Planner:

Dependency classes:

Contains the PIDController, Plotter and RoutePlanner classes, plus some helper functions.

PIDController → Simple PID controller class.
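A minimal sketch of such a PID controller; the gains and the error-window size here are illustrative defaults, not the values used in the actual codebase:

```python
from collections import deque

class PIDController:
    """Simple PID controller: keeps a short window of recent errors and
    approximates the integral term by the window mean and the derivative
    term by the last difference."""

    def __init__(self, k_p=1.0, k_i=0.0, k_d=0.0, n=20):
        self.k_p, self.k_i, self.k_d = k_p, k_i, k_d
        self._window = deque(maxlen=n)  # recent errors

    def step(self, error):
        self._window.append(error)
        if len(self._window) >= 2:
            integral = sum(self._window) / len(self._window)
            derivative = self._window[-1] - self._window[-2]
        else:
            integral = derivative = 0.0
        return self.k_p * error + self.k_i * integral + self.k_d * derivative
```

The controller is called once per control step with the current tracking error (e.g. heading or speed error) and returns the control output (e.g. steering or throttle).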

Transformer Part:

Multi-Modal Fusion Transformer

The transformer architecture takes as input a sequence consisting of discrete tokens, each represented by a feature vector. The feature vector is supplemented by a positional encoding to incorporate spatial inductive biases. Formally, we denote the input sequence as Fin ∈ R^(N ×Df), where N is the number of tokens in the sequence, and each token is represented by a feature vector of dimensionality Df.

The transformer uses linear projections for computing a set of queries, keys, and values (Q, K, and V),

Q = Fin · Mq,   K = Fin · Mk,   V = Fin · Mv

where Mq ∈ R^(Df ×Dq) , Mk ∈ R^(Df ×Dk) and Mv ∈ R^(Df ×Dv) are weight matrices.

It uses the scaled dot products between Q and K to compute the attention weights and then aggregates the values for each query,

A = softmax((Q · K^T) / √Dk) · V

Finally, the transformer uses a non-linear transformation to calculate the output features, Fout, which are of the same shape as the input features, Fin.

Fout = MLP(A) + Fin

The transformer applies the attention mechanism multiple times throughout the architecture, resulting in L attention layers. Each layer in a standard transformer has multiple parallel attention ‘heads’, which involve generating several Q, K and V values per Fin and concatenating the resulting values of A.
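The three equations above can be sketched in NumPy. This is a single-head simplification with the MLP reduced to one linear map; shapes and weight names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(F_in, M_q, M_k, M_v, W_out):
    """One attention layer: project to Q, K, V, compute scaled dot-product
    attention weights, aggregate the values, then apply a linear output map
    with a residual connection so Fout has the same shape as Fin."""
    Q, K, V = F_in @ M_q, F_in @ M_k, F_in @ M_v       # linear projections
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V    # scaled dot-product
    return A @ W_out + F_in                            # residual output

rng = np.random.default_rng(0)
N, Df, Dk = 6, 16, 8                 # 6 tokens, feature dim 16
F_in = rng.normal(size=(N, Df))
out = attention_layer(F_in,
                      rng.normal(size=(Df, Dk)),
                      rng.normal(size=(Df, Dk)),
                      rng.normal(size=(Df, Dk)),
                      rng.normal(size=(Dk, Df)))
print(out.shape)  # (6, 16), same shape as F_in
```

A multi-head layer would repeat the Q/K/V projections per head and concatenate the per-head aggregations before the output map, as described above.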

The image data is encoded with a pretrained ResNet34 model and the LiDAR data with a pretrained ResNet18 model; after each ResNet stage, a transformer layer fuses the intermediate features of the two branches.
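A minimal sketch of one such fusion step between the two branches, assuming each branch's feature map is flattened into tokens and a single attention head is applied jointly over both token sets (the real model pools the maps to a fixed resolution, adds learned positional embeddings, uses multi-head layers, and upsamples the fused tokens back into the feature maps):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(img_feat, lidar_feat, W_q, W_k, W_v):
    """Flatten each (C, H, W) feature map into tokens, run one joint
    self-attention pass over the concatenated token set, add the attended
    tokens back residually, and split the result per branch.
    W_v.T serves as a stand-in output projection for brevity."""
    def to_tokens(f):                       # (C, H, W) -> (H*W, C)
        return f.reshape(f.shape[0], -1).T
    tokens = np.concatenate([to_tokens(img_feat), to_tokens(lidar_feat)])
    Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    fused = tokens + A @ W_v.T              # residual connection
    n = img_feat.shape[1] * img_feat.shape[2]
    return fused[:n], fused[n:]             # image tokens, LiDAR tokens

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 4, 4))      # (C, H, W) image-branch features
lidar = rng.normal(size=(8, 4, 4))    # (C, H, W) LiDAR-branch features
W = rng.normal(size=(8, 8)) * 0.1
img_tok, lidar_tok = fuse(img, lidar, W, W, W)
print(img_tok.shape, lidar_tok.shape)  # (16, 8) (16, 8)
```

Because both modalities attend over the concatenated token set, each branch's output tokens can aggregate information from the other branch.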

For semantic segmentation and depth prediction, the encoded features are passed directly into a decoder.