What we are doing:
We are generating semantic segmentation output from camera and LiDAR sensor data. This requires a fusion method that combines the LiDAR and camera measurements.
Most LiDAR-camera fusion methods work as follows: there are two separate branches, one for LiDAR and one for the camera. The LiDAR branch carries BEV (bird's-eye-view) features and the camera branch carries image features. The two branches exchange features at intermediate stages to improve each other's representations.
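The two-branch exchange described above can be sketched as a toy example. This is a minimal illustration, not any particular published method: the "backbone stages" are plain linear layers with ReLU, the feature arrays are random stand-ins, and the exchange rule (each branch mixing in a scaled copy of the other's features) is an assumption standing in for the learned fusion modules real methods use.

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone_stage(x, w):
    # Toy stand-in for one backbone stage: linear map + ReLU.
    return np.maximum(x @ w, 0.0)

def exchange(a, b):
    # Toy cross-branch sharing: each branch mixes in the other's
    # features (real methods use attention or learned fusion here).
    return a + 0.5 * b, b + 0.5 * a

d, n_stages = 16, 3                   # toy feature width / stage count
bev = rng.standard_normal((8, d))     # LiDAR BEV features (stand-in)
cam = rng.standard_normal((8, d))     # camera features (stand-in)
w_bev = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_stages)]
w_cam = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_stages)]

for s in range(n_stages):
    bev = backbone_stage(bev, w_bev[s])
    cam = backbone_stage(cam, w_cam[s])
    bev, cam = exchange(bev, cam)     # share features after each stage

print(bev.shape, cam.shape)  # both branches keep shape (8, 16)
```

The point of the sketch is only the control flow: each branch runs its own stage, then the two exchange features, repeated per stage.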
The problem with these methods: the LiDAR branch carries only the BEV features, and the BEV representation discards some properties of the raw LiDAR data (for example, fine-grained geometry collapsed by the ground-plane projection). We therefore propose a method with three branches: one for LiDAR BEV features, one for camera features, and one for camera features combined with LiDAR points projected onto the camera view. These three branches exchange features with each other at intermediate stages.
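The third branch needs LiDAR points projected onto the camera image. A minimal sketch of that projection step, assuming a standard pinhole model: the names K (3x3 intrinsics), R and t (LiDAR-to-camera extrinsics), and the toy calibration values below are illustrative assumptions, not values from our setup.

```python
import numpy as np

def project_lidar_to_camera(points, K, R, t):
    """Project Nx3 LiDAR points into the image plane.

    K: 3x3 camera intrinsics; R, t: LiDAR-to-camera extrinsics.
    Returns pixel coordinates (u, v) and a mask of points that
    lie in front of the camera.
    """
    cam = points @ R.T + t            # LiDAR frame -> camera frame
    in_front = cam[:, 2] > 0.0        # keep points with positive depth
    uvw = cam @ K.T                   # apply intrinsics
    uv = uvw[:, :2] / uvw[:, 2:3]     # perspective divide
    return uv, in_front

# Toy calibration: identity extrinsics, simple intrinsics.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R, t = np.eye(3), np.zeros(3)
pts = np.array([[0.0, 0.0, 10.0],    # straight ahead of the camera
                [1.0, 0.0, 10.0]])   # 1 m to the right
uv, mask = project_lidar_to_camera(pts, K, R, t)
print(uv[0])  # -> [320. 240.]  (the principal point)
```

Points that fall inside the image bounds can then be scattered into a per-pixel depth or feature map, which is what the third branch consumes alongside the camera features.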
Model architecture:

What we have done so far: