Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor
🎉 The 18th European Conference on Computer Vision (ECCV 2024) 🎉
Andrea Conti · Matteo Poggi · Valerio Cambareri · Stefano Mattoccia
Overview
In the last decade, RGB-D camera systems have become prominent in robotics, automotive, and augmented reality. Moreover, they are now available on mobile handheld devices, usually coupled with RGB cameras. However, such sensors either do not provide high frame rate and high resolution, or they suffer from heating and high energy consumption, which is particularly critical in the mobile use case. These limitations prevent pairing them with today's cheap, high-resolution, and fast RGB cameras.
We propose Depth on Demand (DoD), a framework addressing the three major issues that active depth sensors face when streaming dense depth maps: spatial sparsity, limited frame rate, and energy consumption.
DoD streams high-resolution depth from an RGB camera and a depth sensor, without requiring the depth sensor to be dense or to match the frame rate of the RGB camera.
Fig. A shows an example of indoor reconstruction with DoD: only the red frames provide sparse depth information, while the reconstruction exploits the whole RGB video stream, captured at a much higher frame rate.
Method
Depth on Demand aims to improve the temporal resolution of an active depth sensor by exploiting the higher frame rate of an RGB camera. It estimates depth for each RGB frame, even for those without direct depth sensor measurements. This is achieved through multi-view geometry on the RGB video stream, combining frames both with and without depth data. The method is structured in three main steps: multi-modal encoding, iterative multi-modal integration, and depth decoding.
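As a rough illustration of this three-step structure, the sketch below wires placeholder layers into an encode, iterate, upsample loop. It is a minimal PyTorch sketch under assumed module names, channel sizes, and resolutions, not the actual DoD implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoDSketch(nn.Module):
    """Toy stand-in for the encode / integrate / decode structure."""
    def __init__(self, feat_dim=64, n_iters=3, up_factor=4):
        super().__init__()
        # 1) multi-modal encoding (placeholder for the image/monocular encoders)
        self.encoder = nn.Conv2d(3, feat_dim, 3, stride=up_factor, padding=1)
        # 2) iterative integration (placeholder for the GRU-based update stages)
        self.update = nn.Conv2d(feat_dim + 1, 1, 3, padding=1)
        self.n_iters = n_iters
        self.up_factor = up_factor

    def forward(self, rgb, coarse_depth):
        feats = self.encoder(rgb)                  # features at reduced resolution
        depth = coarse_depth
        for _ in range(self.n_iters):              # refine the depth estimate iteratively
            depth = depth + self.update(torch.cat([feats, depth], dim=1))
        # 3) depth decoding: bring the refined depth back to input resolution
        return F.interpolate(depth, scale_factor=self.up_factor, mode="bilinear")

rgb = torch.rand(1, 3, 240, 320)    # RGB frame without a direct depth measurement
coarse = torch.rand(1, 1, 60, 80)   # coarse depth at 1/4 resolution (e.g. from projected sparse points)
dense = DoDSketch()(rgb, coarse)    # -> (1, 1, 240, 320)
```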
Multi-modal encoding
Depth on Demand processes input data from three different sources: multi-view geometry, monocular cues, and sparse depth measurements. Features from the target and source views are extracted with a ResNet18-based encoder, allowing the computation of epipolar correlation features from pixel matches between views. Additionally, monocular information is encoded separately with a ResNet34 to fill in the gaps where geometric cues are missing, particularly in the presence of motion or large pose changes. Finally, the sparse depth measured at a previous time instant is projected onto the current view, providing a coarse initial depth map for further refinement.
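A key ingredient of this step is warping the sparse measurements from the frame where depth was acquired into the current target view. Below is a hedged sketch of that projection under a standard pinhole model; tensor layouts, variable names, and the absence of occlusion handling are assumptions of the example, not the paper's exact procedure.

```python
import torch

def project_sparse_depth(depth_src, K, T_tgt_src):
    """Warp a sparse depth map into the target view.

    depth_src: (H, W) sparse depth in the source view (0 where no measurement).
    K:         (3, 3) camera intrinsics.
    T_tgt_src: (4, 4) rigid transform mapping source-camera points to the target camera.
    """
    H, W = depth_src.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    valid = depth_src > 0
    uv1 = torch.stack([u[valid], v[valid], torch.ones_like(u[valid])]).float()
    # back-project valid pixels to 3D in the source camera frame
    pts_src = torch.linalg.inv(K) @ uv1 * depth_src[valid]
    pts_src = torch.cat([pts_src, torch.ones(1, pts_src.shape[1])], dim=0)
    # move the points to the target camera and re-project them onto the image plane
    pts_tgt = (T_tgt_src @ pts_src)[:3]
    z = pts_tgt[2].clamp(min=1e-6)
    uv_tgt = (K @ pts_tgt)[:2] / z
    u_t, v_t = uv_tgt[0].round().long(), uv_tgt[1].round().long()
    inside = (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)
    # splat into a sparse target-view depth map (no occlusion / z-buffer handling)
    depth_tgt = torch.zeros(H, W)
    depth_tgt[v_t[inside], u_t[inside]] = z[inside]
    return depth_tgt
```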
Iterative multi-modal integration
Depth on Demand combines the encoded features iteratively to refine the depth map over a fixed number of iterations. The first stage, visual cues integration, merges monocular and geometric information using a Gated Recurrent Unit (GRU). The second stage, depth cues integration, adjusts depth using the sparse depth data, correcting predictions where necessary. This iterative process improves depth accuracy and manages outliers, such as occlusions or moving objects. These stages are represented in Fig. B.
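The visual cues integration stage relies on a recurrent convolutional update. The snippet below sketches a RAFT-style convolutional GRU cell of the kind this stage builds on; channel sizes and the exact inputs fed to the cell are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """RAFT-style convolutional GRU: gates computed by 3x3 convolutions."""
    def __init__(self, hidden_dim=64, input_dim=128):
        super().__init__()
        cat_dim = hidden_dim + input_dim
        self.convz = nn.Conv2d(cat_dim, hidden_dim, 3, padding=1)  # update gate
        self.convr = nn.Conv2d(cat_dim, hidden_dim, 3, padding=1)  # reset gate
        self.convq = nn.Conv2d(cat_dim, hidden_dim, 3, padding=1)  # candidate state

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))
        r = torch.sigmoid(self.convr(hx))
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        return (1 - z) * h + z * q

# each iteration feeds the cell with the concatenated monocular and
# epipolar-correlation features; the hidden state is then decoded into a depth update
gru = ConvGRUCell()
h = torch.zeros(1, 64, 30, 40)   # hidden state carrying the current depth hypothesis
x = torch.rand(1, 128, 30, 40)   # monocular + geometric cues at the same resolution
h = gru(h, x)
```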
Depth decoding
Finally, Depth on Demand takes the refined depth map and upscales it to the original resolution with a learned upsampling technique inspired by convex upsampling. This upsampling is applied at each iteration, balancing efficiency and accuracy by embedding monocular cues and preserving smooth depth transitions. This final step enables efficient depth prediction with improved spatial resolution.
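For reference, the snippet below sketches convex upsampling in the RAFT style, where each fine-resolution depth value is a learned convex combination of a 3×3 coarse neighborhood. The random mask is only a stand-in: in the actual method the weights would be predicted by the network, and the upsampling factor and shapes here are assumptions.

```python
import torch
import torch.nn.functional as F

def convex_upsample(depth, mask, factor=4):
    """depth: (B, 1, H, W) coarse depth. mask: (B, 9 * factor**2, H, W) weight logits."""
    B, _, H, W = depth.shape
    mask = mask.view(B, 1, 9, factor, factor, H, W)
    mask = torch.softmax(mask, dim=2)                     # convex weights over the 3x3 neighborhood
    patches = F.unfold(depth, kernel_size=3, padding=1)   # (B, 9, H * W)
    patches = patches.view(B, 1, 9, 1, 1, H, W)
    up = (mask * patches).sum(dim=2)                      # (B, 1, factor, factor, H, W)
    return up.permute(0, 1, 4, 2, 5, 3).reshape(B, 1, factor * H, factor * W)

coarse = torch.rand(1, 1, 30, 40)
mask = torch.rand(1, 9 * 16, 30, 40)   # would normally be predicted from image / monocular features
dense = convex_upsample(coarse, mask)  # -> (1, 1, 120, 160)
```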
Qualitatives
Depth on Demand is thoroughly evaluated in many different scenarios. Fig. C shows dynamic online mesh reconstructions on ScanNetV2, representing the indoor use case.
Indoor reconstruction is an interesting task on its own; nonetheless, Depth on Demand can be applied in many other contexts, such as outdoor scenes. Fig. D shows depth predictions on the TartanAir dataset. Notice how the projected sparse points do not cover the whole target field of view, due to the low depth-to-RGB frame rate ratio; moreover, moving objects and occlusions generate outliers in the projected depth. Despite these issues, Depth on Demand handles all of them and provides a smooth reconstruction.
Finally, we show qualitative results on the KITTI dataset to demonstrate applicability to the automotive use case as well. This scenario is quite different from the previous ones, since a 360° LiDAR sensor is typically used. Thus, the gap between the last depth frame and the target view does not lead to large areas devoid of sparse depth, but rather to a sparsification of the scan lines. This can be seen in Fig. E.
Reference
@InProceedings{Conti_2024_ECCV,
author = {Conti, Andrea and Poggi, Matteo and Cambareri, Valerio and Mattoccia, Stefano},
title = {Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor},
booktitle = {European Conference on Computer Vision (ECCV)},
month = {October},
year = {2024},
}