This system uses the behaviors of other agents to create more diverse driving scenarios without collecting additional data.
The main difficulty in learning from other vehicles is that we have no sensor data from their viewpoint; we only ever observe them from the controlling vehicle. We use a set of supervisory tasks to learn an intermediate representation that is invariant to the viewpoint of the controlling vehicle. This not only provides a richer signal at training time but also allows more complex reasoning during inference. Learning how all vehicles drive helps predict their behavior at test time and helps avoid collisions.

We opt for an end-to-end differentiable, three-stage modular pipeline: a perception module, a motion planner, and a low-level controller.
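As a rough illustration of how the three stages connect, the following PyTorch-style sketch wires the modules together. The class names and interfaces here are hypothetical placeholders, not the actual implementation.

```python
import torch.nn as nn

class DrivingPipeline(nn.Module):
    """Illustrative wiring of the three-stage pipeline (assumed interfaces)."""

    def __init__(self, perception, planner, controller):
        super().__init__()
        self.perception = perception    # images + LiDAR -> map-view features
        self.planner = planner          # features + command + goal -> waypoints
        self.controller = controller    # waypoints -> steer / throttle / brake

    def forward(self, images, lidar, command, goal):
        features = self.perception(images, lidar)
        waypoints = self.planner(features, command, goal)
        controls = self.controller(waypoints)
        return controls
```

Because every stage is differentiable, gradients from downstream losses can in principle flow back into the perception features.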
Vehicle-Independent Perception Module
The perception module has two goals: (1) build an intermediate representation that readily generalizes from training to test conditions, and (2) produce input features for the motion planner that are indistinguishable between the controlling vehicle and nearby vehicles.
Here, we opt for a metric map-based output representation. In a metric map, rotated ROI pooling extracts a fixed-size feature representation for each training vehicle (see the sketch below).
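One way to realize rotated ROI pooling on a map-view feature tensor is to sample each rotated box with an affine grid. The sketch below assumes a (cx, cy, w, h, yaw) ROI parameterization in feature-map pixels and radians; this convention and the helper name are ours, for illustration only.

```python
import torch
import torch.nn.functional as F

def rotated_roi_pool(feat, rois, out_size=8):
    """Extract a fixed-size crop for each rotated ROI via grid sampling.

    feat: (C, H, W) map-view feature tensor.
    rois: (N, 5) tensor of (cx, cy, w, h, yaw) in feature-map pixels / radians.
    Returns (N, C, out_size, out_size).
    """
    C, H, W = feat.shape
    crops = []
    for cx, cy, w, h, yaw in rois.unbind(0):
        cos, sin = torch.cos(yaw), torch.sin(yaw)
        # Affine matrix mapping the output grid in [-1, 1]^2 onto the rotated
        # box, expressed in normalized feature-map coordinates.
        theta = torch.stack([
            torch.stack([cos * w / W, -sin * h / W, 2 * cx / W - 1]),
            torch.stack([sin * w / H,  cos * h / H, 2 * cy / H - 1]),
        ]).unsqueeze(0)
        grid = F.affine_grid(theta, (1, C, out_size, out_size), align_corners=False)
        crops.append(F.grid_sample(feat.unsqueeze(0), grid, align_corners=False))
    return torch.cat(crops, dim=0)
```

Because the same pooling is applied to the ego vehicle and to every other detected vehicle, the motion planner receives features that do not depend on which vehicle is being controlled.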
As input, we use three RGB cameras surrounding the vehicle and one LiDAR sensor. We fuse the color and LiDAR inputs using PointPainting from the RGB inputs, together with a lightweight CenterPoint on a PointPillars 3D backbone. The backbone provides us with a map-view feature representation f ∈ R^{W×H×C} of width W, height H, and C channels.
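A minimal sketch of PointPainting-style fusion for one camera is given below: LiDAR points are projected into the image and the per-pixel semantic scores are appended to each point. The projection-matrix shape, tensor layouts, and function name are assumptions for illustration, not the actual code.

```python
import torch

def point_paint(points, sem_logits, lidar_to_img):
    """Append image semantic scores to LiDAR points (PointPainting-style).

    points:       (N, 3) LiDAR points in the vehicle frame.
    sem_logits:   (K, H, W) per-pixel semantic scores from one RGB camera.
    lidar_to_img: (3, 4) projection matrix from LiDAR to image pixel coords.
    Returns (M, 3 + K) painted points that project inside the image.
    """
    K, H, W = sem_logits.shape
    homo = torch.cat([points, torch.ones(len(points), 1)], dim=1)   # (N, 4)
    proj = homo @ lidar_to_img.T                                     # (N, 3)
    u = torch.floor(proj[:, 0] / proj[:, 2]).long()                  # pixel column
    v = torch.floor(proj[:, 1] / proj[:, 2]).long()                  # pixel row
    valid = (proj[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    scores = sem_logits[:, v[valid], u[valid]].T                     # (M, K)
    return torch.cat([points[valid], scores], dim=1)
```

With three cameras, the same projection is repeated per camera before the painted point cloud is voxelized by the PointPillars backbone.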

We train the 3D perception model using detection and semantic mapping as the supervision signals. Both tasks help learn a viewpoint-invariant spatial representation. Detection additionally predicts other vehicles' poses, which we use to forecast their future trajectories at inference time.
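The two supervision signals can be read as lightweight heads on the shared map-view features. The head architectures, box parameterization, and class counts below are assumptions sketched for illustration.

```python
import torch.nn as nn

class PerceptionHeads(nn.Module):
    """Two supervision heads on the shared map-view features f (C channels)."""

    def __init__(self, c_feat=64, n_map_classes=3):
        super().__init__()
        # CenterPoint-style detection: a center heatmap plus box regression.
        self.center_heatmap = nn.Conv2d(c_feat, 1, 1)
        self.box_regression = nn.Conv2d(c_feat, 6, 1)   # e.g. dx, dy, w, l, sin, cos
        # Semantic map head: per-cell class scores (e.g. road, lane marking).
        self.semantic_map = nn.Conv2d(c_feat, n_map_classes, 1)

    def forward(self, f):                                # f: (B, C, H, W)
        return {
            "heatmap": self.center_heatmap(f),
            "boxes": self.box_regression(f),
            "semantics": self.semantic_map(f),
        }
```

Both heads supervise the same spatial feature map, which is what encourages a representation that is independent of the controlling vehicle.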
Motion Planner
We propose a novel two-stage motion planner that combines geometric GPS targets and discrete high-level commands. We use a standard RNN formulation [28, 38] to predict n = 10 future waypoints y_1, ..., y_n ∈ R^2. The motion planner uses a high-level command c and an intermediate GNSS coordinate goal g ∈ R^2 to perform different driving maneuvers. Possible high-level commands c include turn-left, turn-right, go-straight, follow-lane, change-lane-to-left, and change-lane-to-right.
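A minimal sketch of the RNN waypoint roll-out is given below, assuming a GRU cell that predicts n = 10 displacement steps conditioned on an embedding of the command c and the goal g. Hidden sizes and the way the conditioning is injected are assumptions, and only the waypoint roll-out is sketched, not the full two-stage planner.

```python
import torch
import torch.nn as nn

class WaypointPlanner(nn.Module):
    """Autoregressive waypoint roll-out, a sketch of the RNN formulation."""

    N_COMMANDS = 6   # turn-left, turn-right, go-straight, follow-lane,
                     # change-lane-to-left, change-lane-to-right

    def __init__(self, c_feat=64, hidden=256, n_waypoints=10):
        super().__init__()
        self.n_waypoints = n_waypoints
        self.cmd_embed = nn.Embedding(self.N_COMMANDS, 16)
        self.init_h = nn.Linear(c_feat + 16 + 2, hidden)   # features, command, goal
        self.cell = nn.GRUCell(2, hidden)                   # input: previous waypoint
        self.to_xy = nn.Linear(hidden, 2)

    def forward(self, feat, command, goal):
        # feat: (B, c_feat) pooled vehicle features, command: (B,) long, goal: (B, 2)
        h = self.init_h(torch.cat([feat, self.cmd_embed(command), goal], dim=1))
        y = torch.zeros(goal.size(0), 2, device=goal.device)   # start at the vehicle origin
        waypoints = []
        for _ in range(self.n_waypoints):
            h = self.cell(y, h)
            y = y + self.to_xy(h)       # predict a displacement from the last waypoint
            waypoints.append(y)
        return torch.stack(waypoints, dim=1)                   # (B, n, 2)
```

Because the features are vehicle-independent, the same planner can be rolled out for nearby vehicles at inference time to forecast their trajectories and avoid collisions.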