CabNIR

A Benchmark for In-Vehicle Infrared Monocular Depth Estimation

WACV 2025

Ugo Leone Cavalcanti 1, Matteo Poggi 1, Fabio Tosi 1,
Valerio Cambareri 2, Vladimir Zlokolica 2, Stefano Mattoccia 1
1 Department of Computer Science and Engineering, University of Bologna, Italy
2 Sony Depthsensing Solutions, Brussels, Belgium   

Abstract


Accurate in-cabin depth estimation is critical for advancing automotive safety and occupant comfort. However, existing datasets for in-vehicle scene understanding often fall short of the information and scale needed to evaluate depth estimation methods. In this paper, we present a novel benchmark tailored for monocular depth estimation in vehicle interiors, containing both near-infrared (NIR) images and corresponding ground-truth depth data. Featuring over 41,000 frames captured across 36 distinct vehicles and 45 different passengers, it offers an unprecedented level of variability for this application domain. Evaluating cutting-edge single-view depth models in different flavors on our testbench, including zero-shot affine-invariant depth estimation and in-domain specialization, reveals that current depth estimation approaches, while promising, still have a significant performance gap to overcome before achieving the reliability required for downstream safety-critical applications. In light of its diverse range and complex scenarios, we believe this benchmark can serve as a common reference for further research on in-cabin monocular depth estimation.

Overview of the CabNIR dataset


The dataset comprises 47 scenes (29 recorded at night and 18 during the daytime) featuring 45 people (31 males and 14 females) and 36 cabins. In 10 scenes, we placed everyday objects such as bags, jackets, headphones, laptops, mobile phones, and other small personal items. In addition, 4 scenes feature drivers or passengers wearing caps, and 14 people wear glasses or sunglasses.

We collected a total of 41,226 frames and partitioned the scenes into a training set (37 scenes, 32,195 frames) and a testing set (5 scenes, 4,516 frames) featuring people and vehicles that do not appear in the training set. We also define a validation split using 5 further scenes (4,515 frames) containing vehicles and people that may appear in some of the training clips, but in a different setting.
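As a quick sanity check, the split sizes listed above add up exactly to the dataset totals:

```python
# Scene and frame counts per split, as reported above
splits = {
    'train': (37, 32_195),
    'test': (5, 4_516),
    'val': (5, 4_515),
}

total_scenes = sum(scenes for scenes, _ in splits.values())
total_frames = sum(frames for _, frames in splits.values())
print(total_scenes, total_frames)  # 47 41226
```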

Among real IR datasets with a top-frontal camera setup, CabNIR combines the largest number of cabins and participants with the widest FoV.

| Sequence Name | Model | # of Seats | Ceiling | Front Seats | Back Seats | Camera Pose |
|---|---|---|---|---|---|---|
| 500_A_1_Night | Fiat 500 | 4 | Glass | Driver Alone | 1 Passenger | Low |
| 500_B_1_Day | Fiat 500 | 4 | Soft Top | Driver+Passenger | Empty | High |
| 500_C_1_Night | Fiat 500 | 4 | Soft Top | Driver+Passenger | Empty | High |
| 500_D_1_Night | Fiat 500 | 4 | Hard Top | Driver Alone | Empty | High |
| A1_A_1_Day | Audi A1 | 5 | Hard Top | Driver Alone | Empty | High |
| A3_A_1_Day | Audi A3 | 5 | Hard Top | Driver Alone | Empty | High |
| A3_A_2_Day | Audi A3 | 5 | Hard Top | Driver+Passenger | Empty | Low |
| A3_B_1_Night | Audi A3 | 5 | Hard Top | Driver Alone | Empty | High |
| Beetle_A_1_Day | Volkswagen Beetle | 4 | Soft Top | Driver+Passenger | Empty | Low |
| Beetle_A_2_Day | Volkswagen Beetle | 4 | Soft Top | Driver+Passenger | 2 Passengers | Low |
| Beetle_A_3_Day | Volkswagen Beetle | 4 | Soft Top | Driver Alone | 1 Passenger | Low |
| C3_A_1_Night | Citroen C3 | 5 | Glass+Fabric | Driver+Passenger | 1 Passenger | Low |
| C3_A_2_Night | Citroen C3 | 5 | Glass+Fabric | Driver+Passenger | Empty | Low |
| CLA_A_1_Day | Mercedes CLA | 5 | Hard Top | Driver+Passenger | Empty | Low |
| Empty | Fiat 500 | 4 | Soft Top | Empty | Empty | High |
| Fortwo_A_1_Night | Smart Fortwo | 2 | Soft Top | Driver Alone | - | Low |
| Fortwo_B_1_Night | Smart Fortwo | 2 | Glass | Driver+Passenger | - | High |
| Fortwo_C_1_Night | Smart Fortwo | 2 | Hard Top | Driver+Passenger | - | Low |
| GX3_A_1_Day | Mazda GX3 | 5 | Hard Top | Driver Alone | Empty | Low |
| Golf_A_1_Night | Volkswagen Golf | 5 | Glass | Driver+Passenger | Empty | Low |
| Ibiza_A_1_Day | Seat Ibiza | 5 | Hard Top | Driver+Passenger | 1 Passenger | Low |
| Jimny_A_1_Day | Suzuki Jimny | 3 | Hard Top | Driver+Passenger | Empty | High |
| Mito_A_1_Day | Alfa Romeo Mito | 4 | Hard Top | Driver+Passenger | Empty | Low |
| Model3_A_1_Night | Tesla Model 3 | 5 | Glass | Driver Alone | Empty | Low |
| Model3_A_2_Night | Tesla Model 3 | 5 | Glass | Driver+Passenger | Empty | Low |
| Panda_A_1_Day | Fiat Panda | 4 | Hard Top | Driver+Passenger | Empty | High |
| Panda_B_1_Day | Fiat Panda | 5 | Hard Top | Driver Alone | Empty | High |
| Panda_C_1_Night | Fiat Panda | 5 | Hard Top | Driver Alone | Empty | Low |
| Panda_D_1_Day | Fiat Panda | 4 | Hard Top | Driver Alone | Empty | Low |
| Panda_E_1_Night | Fiat Panda | 5 | Hard Top | Driver+Passenger | Empty | Low |
| Panda_E_2_Night | Fiat Panda | 5 | Hard Top | Driver Alone | Empty | Low |
| Panda_F_1_Night | Fiat Panda | 5 | Hard Top | Driver+Passenger | Empty | Low |
| Panda_F_2_Night | Fiat Panda | 5 | Hard Top | Driver+Passenger | 1 Passenger | Low |
| Polo_A_1_Night | Volkswagen Polo | 5 | Hard Top | Driver Alone | Empty | High |
| Puma_A_1_Night | Ford Puma | 5 | Hard Top | Driver Alone | Empty | Low |
| RS3_A_1_Night | Audi RS3 | 5 | Glass | Driver+Passenger | 1 Passenger | High |
| Up_A_1_Day | Volkswagen Up | 4 | Hard Top | Driver Alone | Empty | High |
| Up_B_1_Night | Volkswagen Up | 4 | Hard Top | Driver+Passenger | Empty | Low |
| V60_A_1_Night | Volvo V60 | 5 | Hard Top | Driver Alone | Empty | Low |
| X2_A_1_Night | BMW X2 | 5 | Hard Top | Driver Alone | Empty | Low |
| Yaris_A_1_Night | Toyota Yaris | 5 | Hard Top | Driver+Passenger | Empty | Low |
| Yaris_A_2_Night | Toyota Yaris | 5 | Hard Top | Driver Alone | Empty | Low |
| Yaris_B_1_Night | Toyota Yaris | 5 | Hard Top | Driver Alone | Empty | High |
| Yaris_C_1_Night | Toyota Yaris | 5 | Hard Top | Driver Alone | Empty | Low |
| Yaris_D_1_Night | Toyota Yaris | 5 | Hard Top | Driver Alone | Empty | Low |
| Yaris_E_1_Day | Toyota Yaris | 5 | Hard Top | Driver+Passenger | Empty | Low |
| Yaris_E_2_Day | Toyota Yaris | 5 | Hard Top | Driver+Passenger | 1 Passenger | Low |

Naming and structure


Frames from each recording session are stored in a directory named according to the following scheme:

[model_name]_[version]_[sequence]_[day|night]
  • [model_name] is the car model
  • [version] is an identification letter used to distinguish between different versions of the same car model
  • [sequence] is a progressive number identifying distinct recordings of the same cabin
  • [day|night] indicates whether the recording session took place during daytime or at night

Different versions of the same model are denoted by distinct [version] letters, while scenes recorded in the same cabin share the [version] letter and differ in the [sequence] number. For example, Yaris_A_1_Night and Yaris_A_2_Night share the same cabin and camera pose, but in the first sequence the driver is accompanied by a passenger, while in the second the driver is alone.
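Following this scheme, a directory name can be split into its four components; a minimal sketch (the regex and helper below are our own illustration, not part of an official toolkit):

```python
import re

# Scene names follow [model_name]_[version]_[sequence]_[Day|Night];
# note that model names may themselves contain digits (e.g. '500', 'Model3').
SCENE_RE = re.compile(
    r'(?P<model>.+)_(?P<version>[A-Z])_(?P<sequence>\d+)_(?P<time>Day|Night)'
)

def parse_scene_name(name: str) -> dict:
    """Split a CabNIR scene directory name into its components."""
    match = SCENE_RE.fullmatch(name)
    if match is None:
        raise ValueError(f'unrecognized scene name: {name}')
    return match.groupdict()

print(parse_scene_name('Yaris_A_2_Night'))
# {'model': 'Yaris', 'version': 'A', 'sequence': '2', 'time': 'Night'}
```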

Data acquisition


Our collection campaign took place both during daytime and nighttime in urban and rural areas, with different outdoor routes and weather conditions.

We recorded videos at 15 FPS with the Microsoft Azure Kinect recorder tool provided by the Sensor SDK; the tool yields synchronized NIR and depth images at 1024 × 1024 resolution over a 120° × 120° FoV.

Image frames are encoded as PNG b16g (16-bit grayscale, big-endian format). NIR images reach a maximum intensity value below 13500, except for saturated pixels, which are flagged with a fixed value of 65535; we therefore decided to clip the NIR image range to [0, 13500], which we then normalize to [0, 1] before feeding it to the monocular networks.

Input preprocessing


Since image frames are encoded as PNG b16g (16-bit grayscale, big-endian format) and NIR intensities stay below 13,500 except for saturated pixels (flagged with a fixed value of 65535), frames can be loaded and preprocessed with the following Python snippet:

import cv2
import numpy as np
from skimage import exposure

# File loading (IMREAD_UNCHANGED preserves the 16-bit encoding)
depth_map = cv2.imread('path_depth', flags=cv2.IMREAD_UNCHANGED)
ir_image = cv2.imread('path_ir', flags=cv2.IMREAD_UNCHANGED)

# Clipping: saturated pixels are flagged with 65535
clip_value = 13500
ir_image = np.clip(ir_image, 0, clip_value)

# Histogram equalization (also rescales intensities to [0, 1])
ir_image = exposure.equalize_hist(ir_image)

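Most pretrained monocular depth networks expect three-channel RGB-like input, while CabNIR frames are single-channel NIR. One simple adaptation (our assumption here, not a prescription of the benchmark) is to replicate the normalized channel:

```python
import numpy as np

def nir_to_three_channel(ir_norm: np.ndarray) -> np.ndarray:
    """Replicate a normalized (H, W) NIR image into an (H, W, 3) float32
    array, mimicking the layout most pretrained monocular networks expect."""
    return np.repeat(ir_norm[..., np.newaxis], 3, axis=-1).astype(np.float32)
```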

Responsibility to human subjects


Since our dataset targets in-cabin monitoring, it necessarily features people whose faces are visible. Although faces represent Personally Identifiable Information (PII), we cannot hide or blur them without altering the depth estimation process. All subjects involved in our acquisitions were therefore fully informed of the data stored (NIR images and depth maps) and its purposes, and each participant signed an explicit consent form.

Furthermore, our dataset has been approved by our institution’s Institutional Review Board (IRB). According to the IRB guidance, CabNIR will be available only to registered users: they shall provide a short overview of their research goal and use the data only for scientific purposes, targeting in-cabin monitoring through depth estimation.

Please consider citing our paper



                    
                    

This project page template is inspired by ACE Zero.