Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor
🎉 The 18th European Conference on Computer Vision (ECCV 2024) 🎉
Andrea Conti · Matteo Poggi · Valerio Cambareri · Stefano Mattoccia
Overview
In the last decade, RGB-D camera systems have become prominent in robotics, automotive, and augmented reality. Moreover, they are now available on mobile handheld devices, usually coupled with RGB cameras. However, such sensors either do not provide high frame rate and high resolution, or they suffer from heating and high energy consumption, which is particularly critical in the mobile use case. These limitations prevent pairing them with today's cheap, high-resolution, and fast RGB cameras.
We propose Depth on Demand (DoD), a framework addressing the three major issues that active depth sensors face when streaming dense depth maps: spatial sparsity, limited frame rate, and energy consumption.
DoD streams high-resolution depth from an RGB camera and a depth sensor, without requiring the depth sensor to be dense or to match the frame rate of the RGB camera.
Fig. A shows an example of indoor reconstruction with DoD: only the red frames provide sparse depth information, while the reconstruction exploits the whole RGB video stream, captured at a much higher frame rate.
Method
Depth on Demand aims to improve the temporal resolution of an active depth sensor by exploiting the higher frame rate of an RGB camera. It estimates depth for each RGB frame, even for those without direct depth sensor measurements. This is achieved through multi-view geometry on the RGB video stream, combining frames both with and without depth data. The method is structured in three main steps: multi-modal encoding, iterative multi-modal integration, and depth decoding.
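As a rough illustration of this three-step structure, the sketch below wires placeholder layers into an encode, iterate, upsample loop. It is a minimal PyTorch sketch under assumed module names, channel sizes, and resolutions, not the actual DoD implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoDSketch(nn.Module):
    """Toy stand-in for the encode / integrate / decode structure."""
    def __init__(self, feat_dim=64, n_iters=3, up_factor=4):
        super().__init__()
        # 1) multi-modal encoding (placeholder for the image/monocular encoders)
        self.encoder = nn.Conv2d(3, feat_dim, 3, stride=up_factor, padding=1)
        # 2) iterative integration (placeholder for the GRU-based update stages)
        self.update = nn.Conv2d(feat_dim + 1, 1, 3, padding=1)
        self.n_iters = n_iters
        self.up_factor = up_factor

    def forward(self, rgb, coarse_depth):
        feats = self.encoder(rgb)                  # features at reduced resolution
        depth = coarse_depth
        for _ in range(self.n_iters):              # refine the depth estimate iteratively
            depth = depth + self.update(torch.cat([feats, depth], dim=1))
        # 3) depth decoding: bring the refined depth back to input resolution
        return F.interpolate(depth, scale_factor=self.up_factor, mode="bilinear")

rgb = torch.rand(1, 3, 240, 320)    # RGB frame without a direct depth measurement
coarse = torch.rand(1, 1, 60, 80)   # coarse depth at 1/4 resolution (e.g. from projected sparse points)
dense = DoDSketch()(rgb, coarse)    # -> (1, 1, 240, 320)
```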
Multi-modal encoding
Depth on Demand processes input data from three different sources: multi-view geometry, monocular cues, and sparse depth measurements. Features from the target and source views are extracted with a ResNet18-based encoder, allowing the computation of epipolar correlation features from pixel matches between views. Additionally, monocular information is encoded separately with a ResNet34 to fill in the gaps where geometric cues are missing, particularly in the presence of motion or large pose changes. Finally, the sparse depth measured at a previous time instant is projected onto the current view, providing a coarse initial depth map for further refinement.
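A key ingredient of this step is warping the sparse measurements from the frame where depth was acquired into the current target view. Below is a hedged sketch of that projection under a standard pinhole model; tensor layouts, variable names, and the absence of occlusion handling are assumptions of the example, not the paper's exact procedure.

```python
import torch

def project_sparse_depth(depth_src, K, T_tgt_src):
    """Warp a sparse depth map into the target view.

    depth_src: (H, W) sparse depth in the source view (0 where no measurement).
    K:         (3, 3) camera intrinsics.
    T_tgt_src: (4, 4) rigid transform mapping source-camera points to the target camera.
    """
    H, W = depth_src.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    valid = depth_src > 0
    uv1 = torch.stack([u[valid], v[valid], torch.ones_like(u[valid])]).float()
    # back-project valid pixels to 3D in the source camera frame
    pts_src = torch.linalg.inv(K) @ uv1 * depth_src[valid]
    pts_src = torch.cat([pts_src, torch.ones(1, pts_src.shape[1])], dim=0)
    # move the points to the target camera and re-project them onto the image plane
    pts_tgt = (T_tgt_src @ pts_src)[:3]
    z = pts_tgt[2].clamp(min=1e-6)
    uv_tgt = (K @ pts_tgt)[:2] / z
    u_t, v_t = uv_tgt[0].round().long(), uv_tgt[1].round().long()
    inside = (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)
    # splat into a sparse target-view depth map (no occlusion / z-buffer handling)
    depth_tgt = torch.zeros(H, W)
    depth_tgt[v_t[inside], u_t[inside]] = z[inside]
    return depth_tgt
```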
Iterative multi-modal integration
Depth on Demand combines the encoded features iteratively to refine the depth map over a fixed number of iterations. The first stage, visual cues integration, merges monocular and geometric information using a Gated Recurrent Unit (GRU). The second stage, depth cues integration, adjusts depth using the sparse depth data, correcting predictions where necessary. This iterative process improves depth accuracy and manages outliers, such as occlusions or moving objects. These stages are represented in Fig. B.
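The visual cues integration stage relies on a recurrent convolutional update. The snippet below sketches a RAFT-style convolutional GRU cell of the kind this stage builds on; channel sizes and the exact inputs fed to the cell are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """RAFT-style convolutional GRU: gates computed by 3x3 convolutions."""
    def __init__(self, hidden_dim=64, input_dim=128):
        super().__init__()
        cat_dim = hidden_dim + input_dim
        self.convz = nn.Conv2d(cat_dim, hidden_dim, 3, padding=1)  # update gate
        self.convr = nn.Conv2d(cat_dim, hidden_dim, 3, padding=1)  # reset gate
        self.convq = nn.Conv2d(cat_dim, hidden_dim, 3, padding=1)  # candidate state

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))
        r = torch.sigmoid(self.convr(hx))
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q

# each iteration feeds the cell with the concatenated monocular and
# epipolar-correlation features; the hidden state is then decoded into a depth update
gru = ConvGRUCell()
h = torch.zeros(1, 64, 30, 40)   # hidden state carrying the current depth hypothesis
x = torch.rand(1, 128, 30, 40)   # monocular + geometric cues at the same resolution
h = gru(h, x)
```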
Depth decoding
Finally, Depth on Demand takes the refined depth map and upscales it to the original resolution with a learned upsampling technique inspired by convex upsampling. This upsampling is applied at each iteration, balancing efficiency and accuracy by embedding monocular cues and preserving smooth depth transitions. This final step enables efficient depth prediction with improved spatial resolution.
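For reference, the snippet below sketches convex upsampling in the RAFT style, where each fine-resolution depth value is a learned convex combination of a 3×3 coarse neighborhood. The random mask is only a stand-in: in the actual method the weights would be predicted by the network, and the upsampling factor and shapes here are assumptions.

```python
import torch
import torch.nn.functional as F

def convex_upsample(depth, mask, factor=4):
    """depth: (B, 1, H, W) coarse depth. mask: (B, 9 * factor**2, H, W) weight logits."""
    B, _, H, W = depth.shape
    mask = mask.view(B, 1, 9, factor, factor, H, W)
    mask = torch.softmax(mask, dim=2)                     # convex weights over the 3x3 neighborhood
    patches = F.unfold(depth, kernel_size=3, padding=1)   # (B, 9, H * W)
    patches = patches.view(B, 1, 9, 1, 1, H, W)
    up = (mask * patches).sum(dim=2)                      # (B, 1, factor, factor, H, W)
    return up.permute(0, 1, 4, 2, 5, 3).reshape(B, 1, factor * H, factor * W)

coarse = torch.rand(1, 1, 30, 40)
mask = torch.rand(1, 9 * 16, 30, 40)   # would normally be predicted from image / monocular features
dense = convex_upsample(coarse, mask)  # -> (1, 1, 120, 160)
```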
Qualitatives
Depth on Demand is thoroughly evaluated in many different scenarios. Fig. C shows dynamic online mesh reconstructions on ScanNetV2, representing the indoor use case.
Indoor reconstruction is an interesting task on its own; nonetheless, Depth on Demand can be applied in many other contexts, such as outdoor scenes. Fig. D shows depth predictions on the TartanAir dataset. Notice how the projected sparse points do not cover the whole target field of view, due to the low depth-to-RGB frame rate ratio; moreover, moving objects and occlusions generate outliers in the projected depth. Despite these issues, Depth on Demand handles all of them and provides a smooth reconstruction.
Finally, we show qualitative results on the KITTI dataset to demonstrate applicability to the automotive use case as well. This scenario is quite different from the previous ones, since a 360° LiDAR sensor is typically used. Thus, the gap between the last depth frame and the target view does not lead to large areas devoid of sparse depth, but rather to a sparsification of the scan lines. This can be seen in Fig. E.
Reference
@InProceedings{Conti_2024_ECCV,
author = {Conti, Andrea and Poggi, Matteo and Cambareri, Valerio and Mattoccia, Stefano},
title = {Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor},
booktitle = {European Conference on Computer Vision (ECCV)},
month = {October},
year = {2024},
}