This system uses the behaviors of other agents to create more diverse driving scenarios without collecting additional data.
The main difficulty in learning from other vehicles is that we have no sensor data from their viewpoint; we only ever observe them from the controlling vehicle. We use a set of supervisory tasks to learn an intermediate representation that is invariant to the viewpoint of the controlling vehicle. This not only provides a richer signal at training time but also allows more complex reasoning during inference. Learning how all vehicles drive helps predict their behavior at test time and helps avoid collisions.

We opt for an end-to-end differentiable, three-stage modular pipeline: a perception module, a motion planner, and a low-level controller.
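As a rough illustration of how the three stages connect, the following PyTorch-style sketch wires the modules together. The class names and interfaces here are hypothetical placeholders, not the actual implementation.

```python
import torch.nn as nn

class DrivingPipeline(nn.Module):
    """Illustrative wiring of the three-stage pipeline (assumed interfaces)."""

    def __init__(self, perception, planner, controller):
        super().__init__()
        self.perception = perception    # images + LiDAR -> map-view features
        self.planner = planner          # features + command + goal -> waypoints
        self.controller = controller    # waypoints -> steer / throttle / brake

    def forward(self, images, lidar, command, goal):
        features = self.perception(images, lidar)
        waypoints = self.planner(features, command, goal)
        controls = self.controller(waypoints)
        return controls
```

Because every stage is differentiable, gradients from downstream losses can in principle flow back into the perception features.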
Vehicle-Independent Perception Module
The perception module has two goals: (1) build an intermediate representation that readily generalizes from training to test conditions, and (2) produce input features for the motion planner that are indistinguishable between the controlling vehicle and nearby vehicles.
Here, we opt for a metric map-based output representation. In a metric map, rotated ROI pooling extracts a fixed-size feature representation for each training vehicle (see the sketch below).
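One way to realize rotated ROI pooling on a map-view feature tensor is to sample each rotated box with an affine grid. The sketch below assumes a (cx, cy, w, h, yaw) ROI parameterization in feature-map pixels and radians; this convention and the helper name are ours, for illustration only.

```python
import torch
import torch.nn.functional as F

def rotated_roi_pool(feat, rois, out_size=8):
    """Extract a fixed-size crop for each rotated ROI via grid sampling.

    feat: (C, H, W) map-view feature tensor.
    rois: (N, 5) tensor of (cx, cy, w, h, yaw) in feature-map pixels / radians.
    Returns (N, C, out_size, out_size).
    """
    C, H, W = feat.shape
    crops = []
    for cx, cy, w, h, yaw in rois.unbind(0):
        cos, sin = torch.cos(yaw), torch.sin(yaw)
        # Affine matrix mapping the output grid in [-1, 1]^2 onto the rotated
        # box, expressed in normalized feature-map coordinates.
        theta = torch.stack([
            torch.stack([cos * w / W, -sin * h / W, 2 * cx / W - 1]),
            torch.stack([sin * w / H,  cos * h / H, 2 * cy / H - 1]),
        ]).unsqueeze(0)
        grid = F.affine_grid(theta, (1, C, out_size, out_size), align_corners=False)
        crops.append(F.grid_sample(feat.unsqueeze(0), grid, align_corners=False))
    return torch.cat(crops, dim=0)
```

Because the same pooling is applied to the ego vehicle and to every other detected vehicle, the motion planner receives features that do not depend on which vehicle is being controlled.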
As input, we use three RGB cameras surrounding the vehicle and one LiDAR sensor. We fuse the color and LiDAR inputs using PointPainting from the RGB inputs, together with a lightweight CenterPoint on a PointPillars 3D backbone. The backbone provides us with a map-view feature representation f ∈ R^{W×H×C} of width W, height H, and C channels.
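A minimal sketch of PointPainting-style fusion for one camera is given below: LiDAR points are projected into the image and the per-pixel semantic scores are appended to each point. The projection-matrix shape, tensor layouts, and function name are assumptions for illustration, not the actual code.

```python
import torch

def point_paint(points, sem_logits, lidar_to_img):
    """Append image semantic scores to LiDAR points (PointPainting-style).

    points:       (N, 3) LiDAR points in the vehicle frame.
    sem_logits:   (K, H, W) per-pixel semantic scores from one RGB camera.
    lidar_to_img: (3, 4) projection matrix from LiDAR to image pixel coords.
    Returns (M, 3 + K) painted points that project inside the image.
    """
    K, H, W = sem_logits.shape
    homo = torch.cat([points, torch.ones(len(points), 1)], dim=1)   # (N, 4)
    proj = homo @ lidar_to_img.T                                     # (N, 3)
    u = torch.floor(proj[:, 0] / proj[:, 2]).long()                  # pixel column
    v = torch.floor(proj[:, 1] / proj[:, 2]).long()                  # pixel row
    valid = (proj[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    scores = sem_logits[:, v[valid], u[valid]].T                     # (M, K)
    return torch.cat([points[valid], scores], dim=1)
```

With three cameras, the same projection is repeated per camera before the painted point cloud is voxelized by the PointPillars backbone.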

We train the 3D perception model using detection and semantic mapping as the supervision signals. Both tasks help learn a viewpoint-invariant spatial representation. Detection additionally predicts other vehicles' poses, which we use to forecast their future trajectories at inference time.
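The two supervision signals can be read as lightweight heads on the shared map-view features. The head architectures, box parameterization, and class counts below are assumptions sketched for illustration.

```python
import torch.nn as nn

class PerceptionHeads(nn.Module):
    """Two supervision heads on the shared map-view features f (C channels)."""

    def __init__(self, c_feat=64, n_map_classes=3):
        super().__init__()
        # CenterPoint-style detection: a center heatmap plus box regression.
        self.center_heatmap = nn.Conv2d(c_feat, 1, 1)
        self.box_regression = nn.Conv2d(c_feat, 6, 1)   # e.g. dx, dy, w, l, sin, cos
        # Semantic map head: per-cell class scores (e.g. road, lane marking).
        self.semantic_map = nn.Conv2d(c_feat, n_map_classes, 1)

    def forward(self, f):                                # f: (B, C, H, W)
        return {
            "heatmap": self.center_heatmap(f),
            "boxes": self.box_regression(f),
            "semantics": self.semantic_map(f),
        }
```

Both heads supervise the same spatial feature map, which is what encourages a representation that is independent of the controlling vehicle.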
Motion Planner
We propose a novel two-stage motion planner that combines geometric GPS targets and discrete high-level commands. We use a standard RNN formulation [28, 38] to predict n = 10 future waypoints y_1, ..., y_n ∈ R^2. The motion planner uses a high-level command c and an intermediate GNSS coordinate goal g ∈ R^2 to perform different driving maneuvers. Possible high-level commands c include turn-left, turn-right, go-straight, follow-lane, change-lane-to-left, and change-lane-to-right.
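A minimal sketch of the RNN waypoint roll-out is given below, assuming a GRU cell that predicts n = 10 displacement steps conditioned on an embedding of the command c and the goal g. Hidden sizes and the way the conditioning is injected are assumptions, and only the waypoint roll-out is sketched, not the full two-stage planner.

```python
import torch
import torch.nn as nn

class WaypointPlanner(nn.Module):
    """Autoregressive waypoint roll-out, a sketch of the RNN formulation."""

    N_COMMANDS = 6   # turn-left, turn-right, go-straight, follow-lane,
                     # change-lane-to-left, change-lane-to-right

    def __init__(self, c_feat=64, hidden=256, n_waypoints=10):
        super().__init__()
        self.n_waypoints = n_waypoints
        self.cmd_embed = nn.Embedding(self.N_COMMANDS, 16)
        self.init_h = nn.Linear(c_feat + 16 + 2, hidden)   # features, command, goal
        self.cell = nn.GRUCell(2, hidden)                   # input: previous waypoint
        self.to_xy = nn.Linear(hidden, 2)

    def forward(self, feat, command, goal):
        # feat: (B, c_feat) pooled vehicle features, command: (B,) long, goal: (B, 2)
        h = self.init_h(torch.cat([feat, self.cmd_embed(command), goal], dim=1))
        y = torch.zeros(goal.size(0), 2, device=goal.device)   # start at the vehicle origin
        waypoints = []
        for _ in range(self.n_waypoints):
            h = self.cell(y, h)
            y = y + self.to_xy(h)       # predict a displacement from the last waypoint
            waypoints.append(y)
        return torch.stack(waypoints, dim=1)                   # (B, n, 2)
```

Because the features are vehicle-independent, the same planner can be rolled out for nearby vehicles at inference time to forecast their trajectories and avoid collisions.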