What we are doing:
We are generating semantic segmentation output from camera and LiDAR sensor data. This requires a fusion method that combines the LiDAR and camera measurements.
Most LiDAR-camera fusion methods work as follows: there are two separate branches, one for LiDAR and one for the camera. The LiDAR branch carries BEV (bird's-eye-view) features and the camera branch carries image features. The two branches exchange features at intermediate stages to improve each other's representations.
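The two-branch exchange described above can be sketched as a toy example. This is a minimal illustration, not any particular published method: the "backbone stages" are plain linear layers with ReLU, the feature arrays are random stand-ins, and the exchange rule (each branch mixing in a scaled copy of the other's features) is an assumption standing in for the learned fusion modules real methods use.

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone_stage(x, w):
    # Toy stand-in for one backbone stage: linear map + ReLU.
    return np.maximum(x @ w, 0.0)

def exchange(a, b):
    # Toy cross-branch sharing: each branch mixes in the other's
    # features (real methods use attention or learned fusion here).
    return a + 0.5 * b, b + 0.5 * a

d, n_stages = 16, 3                   # toy feature width / stage count
bev = rng.standard_normal((8, d))     # LiDAR BEV features (stand-in)
cam = rng.standard_normal((8, d))     # camera features (stand-in)
w_bev = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_stages)]
w_cam = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_stages)]

for s in range(n_stages):
    bev = backbone_stage(bev, w_bev[s])
    cam = backbone_stage(cam, w_cam[s])
    bev, cam = exchange(bev, cam)     # share features after each stage

print(bev.shape, cam.shape)  # both branches keep shape (8, 16)
```

The point of the sketch is only the control flow: each branch runs its own stage, then the two exchange features, repeated per stage.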
The problem with these methods: the LiDAR branch carries only the BEV features, and the BEV representation discards some properties of the raw LiDAR data (for example, fine-grained geometry collapsed by the ground-plane projection). We therefore propose a method with three branches: one for LiDAR BEV features, one for camera features, and one for camera features combined with LiDAR points projected onto the camera view. These three branches exchange features with each other at intermediate stages.
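The third branch needs LiDAR points projected onto the camera image. A minimal sketch of that projection step, assuming a standard pinhole model: the names K (3x3 intrinsics), R and t (LiDAR-to-camera extrinsics), and the toy calibration values below are illustrative assumptions, not values from our setup.

```python
import numpy as np

def project_lidar_to_camera(points, K, R, t):
    """Project Nx3 LiDAR points into the image plane.

    K: 3x3 camera intrinsics; R, t: LiDAR-to-camera extrinsics.
    Returns pixel coordinates (u, v) and a mask of points that
    lie in front of the camera.
    """
    cam = points @ R.T + t            # LiDAR frame -> camera frame
    in_front = cam[:, 2] > 0.0        # keep points with positive depth
    uvw = cam @ K.T                   # apply intrinsics
    uv = uvw[:, :2] / uvw[:, 2:3]     # perspective divide
    return uv, in_front

# Toy calibration: identity extrinsics, simple intrinsics.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R, t = np.eye(3), np.zeros(3)
pts = np.array([[0.0, 0.0, 10.0],    # straight ahead of the camera
                [1.0, 0.0, 10.0]])   # 1 m to the right
uv, mask = project_lidar_to_camera(pts, K, R, t)
print(uv[0])  # -> [320. 240.]  (the principal point)
```

Points that fall inside the image bounds can then be scattered into a per-pixel depth or feature map, which is what the third branch consumes alongside the camera features.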
Model architecture:

What we have done so far: