Abstract  Estimating depth from images captured by camera sensors is crucial for the advancement of autonomous driving technologies and has gained significant attention in recent years. However, most previous methods rely on stacked pooling or stride convolution to extract high-level features, which can limit network performance and lead to information redundancy. This paper proposes an improved bidirectional feature pyramid module (BiFPN) and a channel attention module (Seblock: squeeze and excitation) to address these issues in existing methods based on monocular camera sensors. The Seblock redistributes channel feature weights to enhance useful information, while the improved BiFPN facilitates efficient fusion of multi-scale features. The proposed method is an end-to-end solution without any additional post-processing, resulting in efficient depth estimation. Experimental results show that the proposed method is competitive with state-of-the-art algorithms and preserves the fine-grained texture of scene depth.

Keywords  Autonomous vehicle · Camera sensor · Deep learning · Depth estimation · Self-supervised

Abbreviations
BiFPN  Bidirectional feature pyramid network
CNN  Convolutional neural network
Seblock  Squeeze-and-excitation block

1 Introduction

Depth estimation is a significant and interesting task in the field of scene perception, with a wide range of applications such as autonomous driving, intelligent transportation, 3D reconstruction, and virtual reality. However, traditional methods for acquiring depth information, such as Lidar or Kinect sensors [1], have limitations in certain situations. For example, Lidar is not suitable for medical applications, like gastroscopy, due to its large size and high cost [2], and Kinect cannot be used in bright sunlight [3]. Additionally, visible cameras are commonly used in depth estimation tasks [2, 3] as they are cost-effective and compact. Two main approaches for depth estimation using camera sensors are monocular and binocular solutions [4]. While binocular depth estimation is a possible solution, it is usually limited by the occlusion problem, and its computational load and cost are higher than those of a monocular camera [5]. Therefore, in recent years, monocular depth estimation methods have gained popularity as a promising and feasible solution [4, 5].

1.1 Traditional Machine Learning Methods

Recovering depth from camera sensors has long been a subject of research using traditional machine learning methods. There are two main branches of traditional machine learning methods in monocular depth estimation, i.e., parametric learning methods and non-parametric learning methods.

The parametric learning methods obtain the parameters of a model through training and have been widely adopted for depth estimation from monocular camera sensors [6–8]. For example, Saxena et al. [6] modeled the mapping relationship between the input image characteristics and the output depth using a Markov random field (MRF). Liu et al. [7] optimized the depth map by constructing a two-layer MRF model that used semantic tags as auxiliary information, with pixels and super-pixels as nodes.
Wang et al. [8] described the correlation between RGB images and the corresponding depth maps by adopting a kernel function in a nonlinear space, and then used image-block learning parameters for depth estimation. However, these methods all require that the relationship between sensor-collected RGB images and the inferred depths can be established by a parametric model, which is difficult to formulate reliably for the real-world mapping relationship. Therefore, the prediction accuracies of the parametric learning methods are usually limited.

The methods based on non-parametric learning are another widely adopted solution for depth estimation using camera sensors [9–11]. These methods infer depth by using existing datasets for similarity retrieval. For example, Karsch et al. [9] used depth transfer to search for the image sequence that most closely resembles the input image. Liu et al. [10] obtained the depth map using a discrete and continuous optimizer, where the continuous optimization encoded the super-pixels in the input features to generate depth and the discrete part described the relationships between adjacent super-pixels. Konrad et al. [11] performed median filtering on the retrieved similar images to generate an initial depth map and then used a bilateral cross-filtering method to smooth it. However, these methods rely heavily on retrieving image pixels, which can be computationally expensive and may pose challenges in practical applications.

1.2 Deep Learning Methods

With the rapid development of convolutional neural networks (CNNs) in recent years, various deep learning approaches have been developed to recover depth information from RGB images captured by monocular camera sensors [12–18]. These methods can be generally classified into supervised learning methods and self-supervised learning methods.

Supervised learning methods for depth estimation from RGB images mainly involve constructing a loss function to evaluate the difference or variance between the input image and the output predicted value. The loss values are then back-propagated through the neural network to update the weights. These methods typically achieve higher accuracy than unsupervised approaches. For example, in Ref. [19], a transformer-based module was proposed in which the depth range was divided into bins, and the middle values of these bins were estimated adaptively per image. In Ref. [20], the Laplacian pyramid was incorporated into a decoder architecture, and weight standardization was applied to the pre-activation convolution blocks of the decoder. Ranftl et al. [21] proposed a transformer-based method to replace the convolution structure in the backbone for depth prediction tasks. However, these supervised learning methods are highly dependent on high-quality datasets with annotated labels, which limits their adaptiveness to other scenarios.

Alternatively, self-supervised learning methods can be used to overcome the limitations of supervised learning methods. There are two main branches of self-supervised learning methods in the literature, i.e., approaches based on stereo matching and approaches based on synthetic stereo pairs or monocular video. The methods based on stereo matching aim to minimize the cost volume calculated from the matched features. For example, Zbontar et al. [22] trained a deep neural network by computing the matching cost of two different patches. Wang et al. [23] used a new structure for depth estimation by comprehensively using a pyramid voting module (PVM) and a deep convolutional neural network (DCNN). These methods can deliver accurate results in real time, but they are prone to problems such as occlusion and texture-copy artifacts [23].

Recent studies have proposed methods to obtain depth information by training models based on synthetic stereo pairs [13, 24] and monocular videos [4, 5] from camera sensors. The methods based on synthetic stereo pairs have shown promising results in monocular depth estimation; they differ from the monocular-video methods in that the model is trained using stereo images. For instance, in Ref. [13], the left image in the stereo pair was used to generate the depth map of the corresponding left image, and then a warping method was used to obtain the disparity map of the right image. Based on the generated depth map, a synthesized right image was obtained, and a loss function was designed by comparing it with the real right image.
In Ref. [24], a CNN was used to estimate the left image in the stereo pair to generate the corresponding left disparity image, which was then combined with the real right image to obtain the synthetic left image. However, these methods are less attractive than those based on monocular videos because monocular camera sensors can acquire datasets more easily and conveniently.

Given the increasing availability of public datasets, methods based on monocular camera sensors are receiving increased attention from researchers. Recently, self-supervised methods have demonstrated the ability to synthesize the RGB image of the target through the depth map estimated by a CNN [4, 15, 25]. For instance, Zhou et al. [15] trained a depth estimation model along with an ego-motion network using a self-supervised method based on video datasets from camera sensors. However, this method may cause the model to fall into a local minimum because it is challenging to simultaneously estimate depth and predict ego-motion. To address this issue, various approaches have been proposed. Vijayanarasimhan et al. [5] estimated depth by using segmentation and object motion to construct a motion field, reducing the influence of ego-motion and relative motion. Klingner et al. [26] proposed a self-supervised semantic method to guide depth estimation in dynamic scenarios. Godard et al. [4] proposed an auto-mask to handle non-rigid motion and a per-pixel minimum re-projection loss to handle occlusions in depth estimation.
The most recent approaches have primarily focused on complex structures to improve estimation performance. For example, Fu et al. [27] proposed a regression method for depth estimation to obtain a continuous high-precision depth map. Hu et al. [28] proposed fusing features extracted at different scales and used a complex model to improve estimation accuracy. Chen et al. [29] built a depth estimation model by combining a residual pyramid decoder and four residual refinement modules. However, these methods did not consider that stacking too many pooling and CNN layers may cause information redundancy.

The merits and demerits of the above-mentioned methods are summarized in Table 1.

Table 1  A brief summary of the related methods based on camera sensors

| Methods | Merits | Demerits | References |
| Monocular video-based self-supervised methods | Easy to acquire datasets | Difficult to reach the optimal solution | [4, 5] |
| Traditional machine learning methods | Easy to understand and explain | Based on the assumption that the relationship satisfies a parametric model | [6, 7] |
| Synthetic stereo pairs-based self-supervised methods | No need to solve the problem of ego-motion | Artifacts visible at occlusion boundaries | [13, 24] |
| Supervised methods | No need to process obstacles; high accuracy | Limited ground truth depth data | [19, 20] |
| Stereo matching-based self-supervised methods | Able to get the real depth rather than relative depth | Obstructed objects cause matching errors | [22, 23] |

1.3 Attention Mechanism and Feature Pyramid Network

Previous research has proved that incorporating learning mechanisms, such as attention, can significantly improve network performance without the need for additional supervision [30]. One such mechanism is the squeeze-and-excitation block (Seblock), proposed by Hu et al. [30], which increases the weight of valid information and reduces the weight of invalid information. Another example is the use of sequential channel and spatial attention maps for adaptive feature refinement by Woo et al. [31]. Additionally, self-attention, originally used in natural language processing, has been utilized in recent camera-sensor-related tasks [32]. This study leverages the Seblock module to effectively extract image features.

In deep learning, increasing the receptive field is a significant challenge. While this can be achieved by adding more CNN layers, this approach also leads to the problem of vanishing gradients [33]. Previous work has primarily focused on integrating features from the backbone network [34–36]. As one of the classical methods, Lin et al. [37] built high-level semantic feature maps at each scale using a top-down framework with lateral connections. Liu et al. [38] proposed a bottom-up augmentation method to reduce the distance between lower and higher layers. Amirul Islam et al. [39] introduced gate units to control the flow of valid information and avoid ambiguity. More recently, Ghiasi et al. [40] utilized a neural architecture search (NAS) strategy to achieve a more effective yet complex feature fusion structure. To effectively use features from different layers, this study develops an improved bidirectional feature pyramid module (BiFPN) that connects features from different layers by learning weights for the different layers rather than simply concatenating the features.
1.4 Contributions

In this study, a novel self-supervised monocular depth estimation method is proposed, inspired by ResNet [41]. The method integrates a channel attention module and an improved BiFPN for enhanced performance. The channel attention module extracts more useful information than the baseline by learning weights from different features, while the improved BiFPN is used as the decoding network, preserving fine-grained features and incorporating global information based on high-level features from multiple layers. The integration of the channel attention module and BiFPN improves the depth estimation accuracy of the developed method while reducing the number of parameters, which addresses the issue of high network complexity commonly found in stacked pooling or stride convolution.

The main contributions of this study are twofold. Firstly, a fused version of ResNet is proposed as the encoder, which effectively extracts features from input images by incorporating the channel attention mechanism in different layers of ResNet, thereby combining information from different channels and improving model performance. Secondly, an improved BiFPN, with a unique structure, is proposed as the decoder, which effectively generates high-precision depth maps of input images while preserving rich and effective details.

The proposed method is demonstrated to be effective and superior to state-of-the-art methods on two large-scale datasets, KITTI and Make3D. To the best of our knowledge, this technology has not been previously reported in studies on depth estimation based on camera sensors.

1.5 Paper Organization

The remainder of this paper is structured as follows: Sect. 2 introduces the proposed approach for depth estimation. Section 3 details the experimental setup, including the datasets and evaluation metrics used. Section 4 presents both quantitative and qualitative experimental results to demonstrate the superiority of the proposed method. Finally, Sect. 5 concludes this study.

2 Proposed Method

2.1 General Solution

A feasible solution for self-supervised training is to synthesize a new image and compare it to the original image, using this comparison to construct an L1 loss for training the network. This approach does not require ground truth labels but instead utilizes a supervised signal to guide the convergence of the loss function. By using this method, the depth $D_t$ of the target image $I_t$ and the ego-motion $T_{t \to s}$ between $I_t$ and the source image $I_s$ ($s \in \{t-1, t+1\}$) can be estimated using camera sensor data. The homogeneous coordinates of a pixel in $I_t$ are denoted as $p_t$. The projection $p_s$ of $p_t$ can be obtained by

$$p_s = K T_{t \to s} D_t(p_t) K^{-1} p_t \tag{1}$$

where $K$ is the intrinsic matrix of the camera. Then, a differentiable bilinear sampling mechanism is employed to solve the problem of non-integer pixel coordinate values being projected onto $I_s$:

$$\hat{I}_s(p_t) = I_s(p_s) = \sum_{i \in \{t,b\},\, j \in \{l,r\}} \omega^{ij} I_s\big(p_s^{ij}\big) \tag{2}$$

where $\{t, b, l, r\}$ denote the four pixel neighbors (top, bottom, left, right), and $\omega^{ij}$ is the bilinear interpolation weight, which measures the distance to the adjacent pixels and satisfies $\sum_{ij} \omega^{ij} = 1$.

The synthesized target images $\hat{I}_t$ are acquired from the above calculation. Then, the L1 loss between $I_t$ and $\hat{I}_t$ can be computed to get the photometric loss:

$$L_{\mathrm{ph}} = \min_{s} \big| I_t - \hat{I}_t^{\,s} \big| \tag{3}$$

where $\hat{I}_t^{\,s}$ is the target image synthesized from source frame $I_s$, and the per-pixel minimum re-projection is used to address the problem of occlusion [4]. Then, the structural similarity (SSIM) loss is calculated to measure the similarity between $I_t$ and the synthesized image $\hat{I}_t$:

$$L_{\mathrm{ssim}} = \frac{1 - \mathrm{SSIM}\big(I_t, \hat{I}_t\big)}{2} \tag{4}$$

To make the depth images intuitively clearer and smoother, the following loss is used:

$$L_{\mathrm{smooth}} = \big| \partial_x d_t^{*} \big|\, e^{-\left| \partial_x I_t \right|} + \big| \partial_y d_t^{*} \big|\, e^{-\left| \partial_y I_t \right|} \tag{5}$$

where $d^{*} = d / \bar{d}$, with $d$ the predicted depth value and $\bar{d}$ the mean of the predicted depth values. By employing $d^{*}$, the shrinking of the estimated depth can be efficiently prevented [42]. Then, the final loss is designed as

$$L = \mu\big(\tau L_{\mathrm{ph}} + (1 - \tau) L_{\mathrm{ssim}}\big) + \gamma L_{\mathrm{smooth}} \tag{6}$$

where the smoothness term $\gamma$ is set at 0.001 and the photometric loss term $\tau$ at 0.15, and $\mu$ denotes the auto-mask used in Ref. [4] for masking stationary pixels and object motion.
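As an illustration, the following is a minimal PyTorch sketch of the loss in Eqs. (3)–(6), assuming the warped source images have already been synthesized. The function names are illustrative rather than taken from the authors' code, the SSIM is a simplified 3×3 average-pooling variant, and the auto-mask $\mu$ of Ref. [4] is omitted for brevity.

```python
# Sketch of Eqs. (3)-(6); names (ssim, smoothness_loss, total_loss) are assumptions.
import torch
import torch.nn.functional as F

def ssim(x, y):
    """Simplified per-pixel SSIM on 3x3 neighborhoods (images assumed in [0, 1])."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def smoothness_loss(depth, image):
    """Eq. (5): edge-aware smoothness on mean-normalized depth d* = d / mean(d)."""
    d = depth / (depth.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx = (d[..., :, :-1] - d[..., :, 1:]).abs()
    dy = (d[..., :-1, :] - d[..., 1:, :]).abs()
    ix = (image[..., :, :-1] - image[..., :, 1:]).abs().mean(1, keepdim=True)
    iy = (image[..., :-1, :] - image[..., 1:, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()

def total_loss(target, warped_sources, depth, tau=0.15, gamma=0.001):
    """Eqs. (3), (4), (6): per-pixel minimum over the warped source frames."""
    per_source = []
    for warped in warped_sources:            # one warped image per source frame s
        l1 = (target - warped).abs().mean(1, keepdim=True)             # Eq. (3)
        ss = (1 - ssim(target, warped).mean(1, keepdim=True)) / 2      # Eq. (4)
        per_source.append(tau * l1 + (1 - tau) * ss)
    photometric = torch.cat(per_source, dim=1).min(dim=1)[0].mean()  # min over s
    return photometric + gamma * smoothness_loss(depth, target)        # Eq. (6)
```

The per-pixel minimum (rather than an average) over the source frames is what lets pixels occluded in one source frame be supervised by the other, which is the occlusion handling attributed to Ref. [4] above.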
2.2 Architecture of the Proposed Method

The proposed network structure, as shown in Fig. 1, comprises two main branches. The upper branch is responsible for estimating depth information (i.e., the upper part in Fig. 1), while the lower branch is utilized to estimate pose information (i.e., the lower part in Fig. 1).

In Fig. 1, the frames labeled −1, 0, and 1 represent three consecutive images in time. The frame labeled 0 is the target frame, while the frames labeled −1 and 1 are the frames immediately preceding and following the target frame, respectively. The depth map of the target image is obtained through the depth network in the upper part of the figure, and the camera's rotation and translation information is obtained through the pose network in the lower part. The depth map is then transformed into 3D space using the inverse of the camera's intrinsic parameters to generate a point cloud, and the camera rotation and translation information is used to align the point cloud with the corresponding input image. Finally, the point cloud of the target image is projected onto the 2D plane according to the camera's intrinsic parameters, and the final synthesized image is obtained through bilinear interpolation.

Both the depth network and the pose network have encoding and decoding structures, with the depth network incorporating two innovations: the use of Seblocks in the encoder to extract features from different layers and an improved BiFPN in the decoder to fuse multilayer features by learning the weights of the features. The encoder of the pose network has the same structure as the depth encoder (i.e., Seblocks are inserted in the encoder), but it receives two pictures as input to infer ego-motion, whereas the depth network needs only one picture to estimate depth.

Fig. 1  The overall structure of the proposed method
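The back-project, transform, re-project, and bilinearly sample pipeline described above can be sketched in a few lines of PyTorch. This is a generic implementation of Eqs. (1) and (2) under a pinhole-camera assumption, not the authors' code; the function name and the 4×4 pose-matrix convention are illustrative.

```python
# Sketch of the view synthesis in Eqs. (1)-(2); names and conventions are assumptions.
import torch
import torch.nn.functional as F

def inverse_warp(source, depth, T, K):
    """source: (B,3,H,W); depth: (B,1,H,W); T: (B,4,4) target->source; K: (B,3,3)."""
    b, _, h, w = source.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()       # (3, H, W)
    pix = pix.view(1, 3, -1).expand(b, -1, -1).to(source.device)      # (B, 3, HW)

    # Eq. (1): p_s = K T D(p_t) K^{-1} p_t
    cam = torch.inverse(K) @ pix * depth.view(b, 1, -1)  # back-projected 3D points
    cam_h = torch.cat([cam, torch.ones_like(cam[:, :1])], 1)          # (B, 4, HW)
    proj = K @ (T @ cam_h)[:, :3]                 # rotate/translate, then project
    uv = proj[:, :2] / (proj[:, 2:3] + 1e-7)                          # (B, 2, HW)

    # Eq. (2): the bilinear interpolation is performed by grid_sample on a
    # sampling grid normalized to [-1, 1].
    grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], -1) * 2 - 1
    return F.grid_sample(source, grid.view(b, h, w, 2), align_corners=True)
```

Because `grid_sample` is differentiable with respect to both the sampling grid and the source image, the photometric loss can propagate gradients back into the depth and pose networks, which is what makes the whole pipeline trainable end to end.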
2.3 Channel Attention Network

The Seblock [30] is applied to address the problem of information redundancy: the weights of different channels learned by the Seblock are used to extract useful information and to reduce the weights of useless information. The diagram of the Seblock module is shown in Fig. 2. The Seblock is a unit built around a given transformation $F_{tr}: X \to U$, with $X \in R^{H' \times W' \times C'}$ and $U \in R^{H \times W \times C}$. $V = [v_1, v_2, \cdots, v_C]$ denotes the learned filter kernels, and $v_c$ is the parameter of the $c$-th filter. The outputs of $F_{tr}$, $U = [u_1, u_2, \ldots, u_C]$, can then be obtained as

$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^{s} * x^{s} \tag{7}$$

where $*$ denotes convolution, $v_c = [v_c^{1}, v_c^{2}, \ldots, v_c^{C'}]$, $X = [x^{1}, x^{2}, \ldots, x^{C'}]$, and $v_c^{s}$ is a 2D spatial kernel. For simplicity, bias terms are omitted. To address the limitation that the transformation outputs cannot use global contextual information, global average pooling is applied to expand the receptive field of the transformation outputs, as shown in Eq. (8):

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j) \tag{8}$$

where $z \in R^{C}$ and $F_{sq}$ is the squeeze function that generates the statistic $z_c$ by average pooling over $u_c$. To fully capture the channel-wise dependencies, a simple but useful gating mechanism with a sigmoid function is used:

$$s = F_{ex}(z, W) = \sigma\big(g(z, W)\big) = \sigma\big(W_2\, \delta(W_1 z)\big) \tag{9}$$

where $\delta$ and $\sigma$ are the ReLU and sigmoid functions, respectively, $W_1 \in R^{\frac{C}{r} \times C}$ and $W_2 \in R^{C \times \frac{C}{r}}$. To make the module lightweight, the reduction ratio $r$ is set to 16 [30]. Finally, the outputs are obtained by rescaling:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c \tag{10}$$

where $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \cdots, \tilde{x}_C]$ and $F_{scale}(u_c, s_c)$ denotes channel-wise multiplication between $u_c \in R^{H \times W}$ and the scalar $s_c$.

Different from SeNet [30], which uses the Seblock in the backbone to train the model, in this study the Seblock is inserted into the encoders of the depth network and the pose network. As illustrated in Fig. 1, the channel attention mechanism is applied to the encoding and decoding structure.

Fig. 2  The diagram of the Seblock module [30]
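Eqs. (8)–(10) map directly onto a small module. Below is a minimal PyTorch sketch of the Seblock with reduction ratio $r = 16$, mirroring Ref. [30]; the class name and the way it would be wired into an encoder stage are illustrative.

```python
# Sketch of the Seblock (Eqs. (8)-(10)); the class name is an assumption.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)   # W1: C -> C/r
        self.fc2 = nn.Linear(channels // r, channels)   # W2: C/r -> C

    def forward(self, u):                               # u: (B, C, H, W)
        z = u.mean(dim=(2, 3))                          # Eq. (8): squeeze
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # Eq. (9): excite
        return u * s.view(*s.shape, 1, 1)               # Eq. (10): rescale

# Illustrative usage on one encoder stage's feature map:
# refined = SEBlock(256)(features)   # features: (B, 256, H, W)
```

The bottleneck `C -> C/r -> C` keeps the added parameter count small, which is consistent with the paper's aim of improving accuracy without inflating the network.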
2.4 The Improved Bidirectional Feature Pyramid Network (BiFPN)

In Ref. [33], BiFPN is proposed as a method for efficiently improving network performance through multi-scale feature fusion. Compared to other methods [37–40], BiFPN has several unique features. Firstly, it simplifies the structure by removing nodes with only one input edge. Secondly, it adds an extra edge from input to output for more feature fusion. Thirdly, it utilizes a bidirectional path to achieve high-level feature fusion. Lastly, it addresses the issue of uneven input feature contributions by introducing additional weights for each input, allowing the network to learn the importance of each input feature. Figure 3 shows the specifics of the original BiFPN.

In Fig. 3, P3–P7 represent the feature levels with a resolution of $1/2^{(i-2)}$ of the input images, where $i = 3, 4, \ldots, 7$. For example, P3 represents the feature level with a resolution of $1/2^{(3-2)}$ of the input images, which means that if the input resolution is 192 × 640, the P3 feature level has a resolution of 96 × 320, because $192/2^{(3-2)} = 96$ and $640/2^{(3-2)} = 320$.

Fig. 3  The diagram of the original BiFPN module [33]

In this study, BiFPN is used as a decoder to efficiently fuse features from multiple layers. In addition, in order to use BiFPN efficiently, channel downsampling is applied to resize the channels to fit the BiFPN's inputs and merge the features in different layers. Further, another channel downsampling (64 → 1) is applied to obtain the final depth value. Figure 4 shows the improved BiFPN module, which is novel in the literature.

In Fig. 4, the numbers in the circles represent the number of feature channels (2048, 1024, 512, 256, and 64 at P7–P3, respectively). The different colors indicate different features in different layers. The blue rectangles indicate dimension reduction from the original feature channels to 64, and the light-blue rectangles denote dimension reduction of the channels to one layer.

Fig. 4  The diagram of the improved BiFPN module used in the proposed method
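The weighted fusion that distinguishes BiFPN from plain concatenation can be sketched as one small node. The following follows the "fast normalized fusion" of Ref. [33]; the class name and the exact node wiring of the improved decoder in Fig. 4 are assumptions here, not the paper's implementation.

```python
# Sketch of BiFPN-style weighted feature fusion [33]; names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    def __init__(self, num_inputs, channels):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))   # one weight per input
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats):
        """feats: list of (B, C, H, W) maps already resized to a common size."""
        w = F.relu(self.w)                 # keep the learned weights non-negative
        w = w / (w.sum() + 1e-4)           # normalize each input's contribution
        fused = sum(wi * f for wi, f in zip(w, feats))
        return self.conv(fused)

# E.g., a node fusing a same-level input with an upsampled higher level, after
# both have been reduced to 64 channels as in Fig. 4:
# p5_td = WeightedFusion(2, 64)([p5_in, F.interpolate(p6_td, scale_factor=2)])
```

Learning the fusion weights, instead of concatenating and letting a large convolution sort the inputs out, is what allows the decoder described above to stay at roughly 64 channels per node and keep the parameter count low.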
3 Experiments

3.1 Training and Test Datasets

3.1.1 KITTI

The KITTI dataset is one of the most widely used datasets in autonomous driving and computer vision tasks (e.g., visual odometry and SLAM). The training and testing data split used in this study, as well as in Refs. [15] and [43], is the same as in Ref. [44]. As suggested by Zhou et al. [15], 39,810 monocular triplets without static images were used for model training, and 4424 camera images were used to evaluate the examined methods. Additionally, the same camera intrinsic matrix was used for all images, and the predicted depth was capped at 80 m, following the guidelines of the KITTI dataset [44].

3.1.2 Make3D

The proposed method was further evaluated for its generalizability on the Make3D dataset [45]. Make3D, which is designed specifically for depth estimation tasks, consists of monocular RGB images and ground truth data from camera sensors. However, it lacks stereo images and image sequences, making it a common test dataset for unsupervised methods [4]. Although it is not suitable for training unsupervised or stereo depth estimation methods due to its small size (only 534 images), it was used to evaluate the proposed method. Image preprocessing involved central cropping due to the varying aspect ratios of the images in the Make3D dataset.

3.2 Evaluation Metrics

To quantitatively evaluate the performance of the proposed method against other state-of-the-art methods, five commonly used evaluation metrics are utilized [46]: absolute relative error (Abs Rel), square relative error (Sq Rel), root-mean-square error (RMSE), root-mean-square logarithmic error ($\mathrm{RMSE}_{\log}$), and accuracy with threshold ($\delta < 1.25^{i}$, $i = 1, 2, 3$). These metrics are widely used in monocular depth estimation [4, 13, 24, 26]. The definitions of these metrics are given as follows:

$$\mathrm{Abs\,Rel} = \frac{1}{|T|} \sum_{y \in T} \frac{\left| y - y^{*} \right|}{y^{*}} \tag{11}$$

$$\mathrm{Sq\,Rel} = \frac{1}{|T|} \sum_{y \in T} \frac{\left| y - y^{*} \right|^{2}}{y^{*}} \tag{12}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{y \in T} \left\| y - y^{*} \right\|^{2}} \tag{13}$$

$$\mathrm{RMSE}_{\log} = \sqrt{\frac{1}{|T|} \sum_{y \in T} \left\| \log y - \log y^{*} \right\|^{2}} \tag{14}$$

$$\mathrm{Accuracy} = \%\ \text{of}\ y\ \text{s.t.}\ \max\!\left(\frac{y}{y^{*}}, \frac{y^{*}}{y}\right) = \delta < thr \tag{15}$$

where $y$ is the predicted depth, $y^{*}$ is the ground truth label, $T$ is the collection of all the pixels, $|T|$ denotes the number of pixels, and $thr$ denotes the threshold gate (i.e., $thr = 1.25^{i}$, $i = 1, 2, 3$). The unit of the predicted and ground truth depth is m, while the evaluation metrics themselves are dimensionless.
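For reference, the metrics in Eqs. (11)–(15) can be computed as in the following minimal NumPy sketch, evaluated on valid pixels with the 80 m cap from Sect. 3.1.1. The function name is illustrative, and clipping predictions to the evaluation range follows common practice rather than a detail stated in the paper.

```python
# Sketch of Eqs. (11)-(15); the function name and clipping are assumptions.
import numpy as np

def depth_metrics(pred, gt, max_depth=80.0):
    mask = (gt > 0) & (gt <= max_depth)        # valid ground-truth pixels only
    y_star = gt[mask]
    y = np.clip(pred[mask], 1e-3, max_depth)   # cap predicted depth at 80 m
    abs_rel = np.mean(np.abs(y - y_star) / y_star)                 # Eq. (11)
    sq_rel = np.mean((y - y_star) ** 2 / y_star)                   # Eq. (12)
    rmse = np.sqrt(np.mean((y - y_star) ** 2))                     # Eq. (13)
    rmse_log = np.sqrt(np.mean((np.log(y) - np.log(y_star)) ** 2)) # Eq. (14)
    ratio = np.maximum(y / y_star, y_star / y)                     # Eq. (15)
    acc = {f"delta<1.25^{i}": np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)}
    return abs_rel, sq_rel, rmse, rmse_log, acc
```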
3.3 Implementation Details

The proposed method involves determining three parameters: the smoothness parameter $\gamma$, the photometric loss term $\tau$, and the learning rate. These parameters were specified according to Ref. [4]. The Adam optimization algorithm [47] was used, and the model was trained for 20 epochs with a batch size of 16. The specific values of $\gamma$ and $\tau$ were set at 0.001 and 0.15, respectively. The learning rate was set at $10^{-4}$ in the beginning and $10^{-5}$ in the final five epochs. The patch size used for the KITTI dataset was 192 × 640, and for the Make3D dataset it was 240 × 319. Following the settings in Godard et al. [4] and Chen et al. [48], the depth range was limited to 0–80 m for evaluation. As shown in Fig. 1, each layer in the encoding network downsamples the input features once, and each downsampling reduces the resolution by half. In addition, each layer in the decoding network upsamples the input features and finally outputs a depth map with the same resolution as the input image. Following other depth estimation approaches [4, 49], the weights were pre-trained on ImageNet [50].

The depth estimation network comprises an encoding network, which includes the ResNet50 architecture with inserted Seblock modules, and a decoding network featuring an improved BiFPN with a U-Net architecture that effectively extracts useful features from the inputs to produce a depth map.

The pose estimation network was structured with a ResNet50 architecture and incorporated the Seblock module for feature extraction. To estimate the 6-DoF pose, which includes rotation and translation, the outputs were scaled by 0.01, following the approach in Wang et al. [42]. To input two images for 6-DoF estimation, the pose network was modified to accept six-channel inputs [4]. Furthermore, to prevent overfitting, online data augmentation techniques, such as random brightness, contrast, and saturation, were implemented.

All the experiments were implemented in PyTorch [51] on a 3.50 GHz Intel(R) Core(TM) i5-7300HQ CPU with 64.00 GB RAM and one NVIDIA GeForce Titan Xp GPU. The change of training loss with the number of training steps is illustrated in Fig. 5, which shows that the proposed method can effectively converge to a stable level.

Fig. 5  The change of training loss with the number of training steps. The spacing of the horizontal axis does not represent equal distance, but only serves as tick marks

4 Results and Discussion

4.1 Comparison with the State-of-the-Art (SOTA) Methods

Seventeen SOTA methods for depth estimation were compared to demonstrate the advances of the proposed method. Among them, six are supervised and eleven are self-supervised. The supervised methods include those found in Bhat et al. [19], Song et al. [20], Ranftl et al. [21], Eigen et al. [46], Liu et al. [52], and Kundu et al. [53]. The self-supervised methods include those proposed in Monodepth2 [4], Mahjourian et al. [12], Monodepth [13], Zhou et al. [15], SGDdepth [26], DDVO [42], Struct2depth [54], DualNet [55], GeoNet [56], Schellevis et al. [57], and Zhou et al. [58].

The estimation results of the different methods on the KITTI dataset are shown in Table 2. The presented results reveal that the Abs Rel, Sq Rel, RMSE, and RMSE_log of the proposed method are 0.113, 0.763, 4.645, and 0.187, respectively. These numbers are improved by 1.74%, 15.50%, 4.48%, and 3.11%, respectively, compared to Monodepth2 [4]. Additionally, the accuracies with thresholds 1.25, 1.25², and 1.25³ are 0.874, 0.960, and 0.983, respectively, when using the proposed method. The slightly weaker performance of the proposed method on δ < 1.25 and δ < 1.25² is probably because of the simpler decoder design, which contains only 8 M parameters. The proposed method demonstrates the best performance across all other evaluation metrics when compared to the other self-supervised methods.

Table 2  Quantitative comparison of the examined supervised and self-supervised methods

| Method | Supervised | Abs Rel | Sq Rel | RMSE | RMSE_log | δ<1.25 | δ<1.25² | δ<1.25³ |
| Eigen et al. [46] | Yes | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.890 |
| Liu et al. [52] | Yes | 0.201 | 1.584 | 6.471 | 0.273 | 0.680 | 0.898 | 0.967 |
| AdaDepth [53] | Yes | 0.167 | 1.257 | 5.578 | 0.237 | 0.771 | 0.922 | 0.971 |
| Lapdepth [20] | Yes | 0.059 | 0.212 | 2.446 | 0.091 | 0.962 | 0.994 | 0.999 |
| DPT-Hybrid [21] | Yes | 0.062 | – | 2.573 | 0.092 | 0.959 | 0.995 | 0.999 |
| AdaBins [19] | Yes | 0.058 | 0.190 | 2.360 | 0.088 | 0.964 | 0.995 | 0.999 |
| Zhou et al. [15] | No | 0.183 | 1.595 | 6.709 | 0.270 | 0.734 | 0.902 | 0.959 |
| Mahjourian et al. [12] | No | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
| GeoNet [56] | No | 0.155 | 1.296 | 5.857 | 0.233 | 0.793 | 0.931 | 0.973 |
| DDVO [42] | No | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974 |
| Struct2depth [54] | No | 0.141 | 1.026 | 5.291 | 0.215 | 0.816 | 0.945 | 0.979 |
| Zhou et al. [58] | No | 0.139 | 1.057 | 5.213 | 0.214 | 0.831 | 0.940 | 0.975 |
| Monodepth [13] | No | 0.133 | 1.142 | 5.533 | 0.230 | 0.830 | 0.936 | 0.970 |
| DualNet [55] | No | 0.121 | 0.837 | 4.945 | 0.197 | 0.853 | 0.955 | 0.982 |
| Monodepth2 [4] | No | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981 |
| Schellevis et al. [57] | No | 0.113 | 0.865 | 4.789 | 0.192 | 0.878 | 0.960 | 0.981 |
| SGDdepth [26] | No | 0.113 | 0.835 | 4.693 | 0.191 | 0.879 | 0.961 | 0.981 |
| Proposed method | No | 0.113 | 0.763 | 4.645 | 0.187 | 0.874 | 0.960 | 0.983 |

Many of the compared methods in Table 2 (e.g., [27–29, 33] and [58]) use stacked pooling or stride convolution to extract high-level features for depth estimation. Stacking too many pooling or stride convolution layers can lead to information redundancy [33]. For example, the VGG encoding network used in Zhou et al. [58] has 500 M parameters, five times more than the number of parameters in the proposed method. Due to the high complexity of stacked pooling and stride convolution, the performance of these compared methods is not satisfactory (as shown in Table 2). To address this issue, the proposed method utilizes a more efficient decoding network based on BiFPN and incorporates a channel attention mechanism to enhance its performance. The results in Table 2 show that the proposed method's depth estimation performance surpasses that of the methods with stacked pooling or stride convolution.

The proposed method has the best performance among the self-supervised methods, as shown in Table 2. However, the supervised learning methods Lapdepth, DPT-Hybrid, and AdaBins achieve better results than the proposed method because of the use of labelled data for training, which can address the challenges of occlusion and ego-motion. Nonetheless, the proposed method still outperforms the other three supervised learning methods, i.e., those in Eigen et al. [46], Liu et al. [52], and Kundu et al. [53], demonstrating that the proposed self-supervised method can achieve performance comparable to supervised methods. The qualitative results shown in Fig. 6 also indicate that the proposed method performs better, with sharper thin objects such as poles compared with the estimation from Monodepth2. This could be attributed to the use of the Seblock together with the improved BiFPN module for depth estimation.
Fig. 6  Qualitative results for comparisons with the examined supervised and self-supervised methods (rows: inputs, SGDdepth [26], Monodepth [13], GeoNet [56], Schellevis et al. [57], Monodepth2 [4], and the proposed method)

4.2 Ablation Study

In order to evaluate the impact of each component of the proposed method on depth estimation performance, ablation experiments were conducted. Both ResNet18 and ResNet50 were tested as the baseline encoder. As shown in Table 3, using ResNet50 as the encoder achieves better performance than using ResNet18. Then, the Seblock and the improved BiFPN module were incorporated into the ResNet50 baseline, and their impact on network performance was evaluated. As displayed in Table 3, the Sq Rel and RMSE of the ResNet50 baseline are 0.831 and 4.705, respectively. These two metrics were improved by 6.26% and 1.08%, respectively, when the Seblock was added to ResNet50, and by 6.38% and 0.32%, respectively, when the improved BiFPN was added. Furthermore, when the Seblock and the improved BiFPN were used together, Sq Rel and RMSE were further reduced to 0.763 and 4.645, respectively. Compared with the ResNet50 baseline, Sq Rel and RMSE are improved by 8.18% and 1.28%, respectively, when using the proposed ResNet50 + Seblock + BiFPN. The performances on Abs Rel, RMSE_log, and the accuracies with different thresholds when incorporating the different modules are generally on the same levels. The results on ResNet18 show similar trends to those on ResNet50. These results indicate that both the Seblock and the improved BiFPN contribute to the improved depth estimation performance.

Table 3  Ablation experiment results on KITTI

| Method | Abs Rel | Sq Rel | RMSE | RMSE_log | δ<1.25 | δ<1.25² | δ<1.25³ |
| ResNet18 | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981 |
| ResNet18 + Seblock | 0.116 | 0.885 | 4.842 | 0.194 | 0.874 | 0.959 | 0.981 |
| ResNet18 + BiFPN | 0.118 | 0.863 | 4.809 | 0.191 | 0.863 | 0.958 | 0.983 |
| ResNet18 + Seblock + BiFPN | 0.118 | 0.825 | 4.861 | 0.192 | 0.862 | 0.957 | 0.983 |
| ResNet50 | 0.113 | 0.831 | 4.705 | 0.189 | 0.878 | 0.961 | 0.982 |
| ResNet50 + Seblock | 0.112 | 0.779 | 4.654 | 0.190 | 0.880 | 0.961 | 0.982 |
| ResNet50 + BiFPN | 0.114 | 0.778 | 4.690 | 0.187 | 0.868 | 0.958 | 0.983 |
| ResNet50 + Seblock + BiFPN | 0.113 | 0.763 | 4.645 | 0.187 | 0.874 | 0.960 | 0.983 |

4.3 Robustness of the Proposed Method

The robustness of the proposed method was further evaluated by testing it on another popular depth estimation dataset, Make3D [45]. The central-crop method, as suggested in Godard et al. [4], was used to process the sensor-collected images with different aspect ratios in the dataset. To ensure fairness in comparison, the model trained on KITTI was directly used for testing on Make3D without any fine-tuning. Eight state-of-the-art supervised and self-supervised methods were used for comparison to demonstrate the robustness of the proposed method. Three of the eight methods are supervised, found in Refs. [9, 16] and [53], and the other five are self-supervised, including Monodepth2 [4], Monodepth [13], SharinGAN [59], Atapour et al. [60], and GASDA [61].

The quantitative comparison results are presented in Table 4. The proposed self-supervised method obtains better depth estimation performance than the other self-supervised methods on Make3D. When comparing the proposed method with the supervised learning methods, the proposed method shows competitive performance, similar to the results in Table 2. Only the supervised method in [16] performs better than the proposed method.

Table 4  Quantitative comparison results on Make3D

| Method | Supervised | Abs Rel | Sq Rel | RMSE |
| Kundu et al. [53] | Yes | 0.452 | 5.710 | 9.559 |
| Karsch et al. [9] | Yes | 0.398 | 4.723 | 7.801 |
| Laina et al. [16] | Yes | 0.198 | 1.665 | 5.461 |
| Monodepth [13] | No | 0.505 | 10.172 | 10.936 |
| Atapour et al. [60] | No | 0.423 | 9.343 | 9.002 |
| GASDA [61] | No | 0.403 | 6.709 | 10.424 |
| SharinGAN [59] | No | 0.377 | 4.900 | 8.388 |
| Monodepth2 [4] | No | 0.322 | 3.589 | 7.517 |
| Proposed method | No | 0.294 | 2.163 | 6.239 |
Given that supervised methods can learn from accurately annotated labels, while unsupervised methods overcome the heavy reliance on ground truth labels at the cost of degraded estimation [4], it is promising that the performance of the proposed method is close to, or even better than, that of supervised methods.

The performance of the proposed method was also compared with Monodepth2 [4], one of the most advanced methods, through qualitative analysis. The results in Fig. 7 show that the depth maps obtained using the proposed method capture more details from the input images and have more accurate depth information, indicating superior performance compared to Monodepth2.

Fig. 7  Qualitative illustration results on Make3D

4.4 Limitations and Future Work

One limitation of the proposed method is that it may produce artifacts when synthesizing images. As shown in Fig. 8, blurry boundaries can occur when the target frame is obtained by interpolating from the first frame. Another limitation is that the proposed method may induce errors when constructing the photometric loss based on images synthesized from the previous frame and the next frame. In future research, a new loss function may be considered to solve this problem. For example, the target frame could be synthesized by incorporating the previous frame in the continuous image sequence instead of the next frame, which may reduce the occurrence of artifacts.

Fig. 8  Remarkable distortions in the synthesized images, labeled with red rectangles

5 Conclusions

In this paper, an innovative approach for self-supervised monocular depth estimation is proposed, which combines the Seblock and an improved BiFPN module to process images based on ResNet50. The Seblock module improves depth map accuracy by strengthening the weights of useful features, and the improved BiFPN module effectively utilizes different levels of features from the encoder. Results on the KITTI dataset show that the proposed method outperforms current state-of-the-art self-supervised methods and even some supervised methods in terms of depth estimation. The robustness of the proposed method is further demonstrated on the Make3D dataset, where it achieves competitive performance with the examined supervised methods. The proposed method, being self-supervised, overcomes the heavy reliance on annotated labels for training, making it useful for the development of smart environment perception systems in autonomous vehicles for safe driving in intelligent transportation systems.

Acknowledgements  This study is supported by the National Natural Science Foundation of China (Grant No. 52272421) and the Shenzhen Fundamental Research Fund (Grant Nos. JCYJ20190808142613246 and 20200803015912001).

Declarations

Conflict of interest  On behalf of all the authors, the corresponding author states that there is no conflict of interest.

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Khoshelham, K., Elberink, S.O.: Accuracy and resolution of Kinect depth data for indoor mapping applications. Sensors 12(2), 1437–1454 (2012)
2. Zhang, K., Xie, J., Snavely, N., Chen, Q.: Depth sensing beyond LiDAR range. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1692–1700 (2020)
3. Zhao, C., Sun, Q., Zhang, C., Tang, Y., Qian, F.: Monocular depth estimation based on deep learning: An overview. Sci. China Technol. Sci. 63(9), 1612–1627 (2020)
4. Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3828–3838 (2019)
5. Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., Fragkiadaki, K.: SfM-Net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804 (2017)
6. Saxena, A., Chung, S., Ng, A.: Learning depth from single monocular images. Adv. Neural Inf. Process. Syst. 18 (2005)
7. Liu, B., Gould, S., Koller, D.: Single image depth estimation from predicted semantic labels. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1253–1260 (2010)
8. Wang, Y., Wang, R., Dai, Q.: A parametric model for describing the correlation between single color images and depth maps. IEEE Signal Process. Lett. 21(7), 800–803 (2013)
9. Karsch, K., Liu, C., Kang, S.B.: Depth transfer: Depth extraction from video using non-parametric sampling. IEEE Trans. Pattern Anal. Mach. Intell. 36(11), 2144–2158 (2014)
10. Liu, M., Salzmann, M., He, X.: Discrete-continuous depth estimation from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 716–723 (2014)
11. Konrad, J., Wang, M., Ishwar, P.: 2D-to-3D image conversion by learning depth from examples. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–22 (2012)
12. Mahjourian, R., Wicke, M., Angelova, A.: Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5667–5675 (2018)
13. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279 (2017)
14. Garg, R., Wadhwa, N., Ansari, S., Barron, J.T.: Learning single camera depth estimation using dual-pixels. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7628–7637 (2019)
15. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858 (2017)
16. Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth International Conference on 3D Vision (3DV), pp. 239–248 (2016)
17. Chang, J.R., Chen, Y.S.: Pyramid stereo matching network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5418 (2018)
18. Zhang, S., Wang, Z., Wang, Q., Zhang, J., Wei, G., Chu, X.: EDNet: Efficient disparity estimation with cost volume combination and attention-based spatial residual. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5433–5442 (2021)
19. Bhat, S.F., Alhashim, I., Wonka, P.: AdaBins: Depth estimation using adaptive bins. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4009–4018 (2021)
20. Song, M., Lim, S., Kim, W.: Monocular depth estimation using Laplacian pyramid-based depth residuals. IEEE Trans. Circuits Syst. Video Technol. 31(11), 4381–4393 (2021)
21. Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12179–12188 (2021)
22. Zbontar, J., LeCun, Y.: Computing the stereo matching cost with a convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1592–1599 (2015)
23. Wang, H., Fan, R., Cai, P., Liu, M.: PVStereo: Pyramid voting module for end-to-end self-supervised stereo matching. IEEE Robot. Autom. Lett. 6(3), 4353–4360 (2021)
24. Garg, R., Bg, V.K., Carneiro, G., Reid, I.: Unsupervised CNN for single view depth estimation: Geometry to the rescue. In: Computer Vision – ECCV 2016, Part VIII, pp. 740–756. Springer (2016)
25. Ma, F., Cavalheiro, G.V., Karaman, S.: Self-supervised sparse-to-dense: Self-supervised depth completion from LiDAR and monocular camera. In: 2019 IEEE International Conference on Robotics and Automation (ICRA), pp. 3288–3295 (2019)
26. Klingner, M., Termöhlen, J.A., Mikolajczyk, J., Fingscheidt, T.: Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In: Computer Vision – ECCV 2020, Part XX, pp. 582–600. Springer (2020)
27. Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2002–2011 (2018)
28. Hu, J., Ozay, M., Zhang, Y., Okatani, T.: Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1043–1051 (2019)
29. Chen, X., Chen, X., Zha, Z.J.: Structure-aware residual pyramid network for monocular depth estimation. arXiv preprint arXiv:1907.06023 (2019)
30. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
31. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: CBAM: Convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
32. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), 30 (2017)
33. Tan, M., Pang, R., Le, Q.V.: EfficientDet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790 (2020)
34. Cai, Z., Fan, Q., Feris, R.S., Vasconcelos, N.: A unified multi-scale deep convolutional neural network for fast object detection. In: Computer Vision – ECCV 2016, Part IV, pp. 354–370. Springer (2016)
35. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: Computer Vision – ECCV 2016, Part I, pp. 21–37. Springer (2016)
36. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229 (2013)
37. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
38. Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768 (2018)
39. Amirul Islam, M., Rochan, M., Bruce, N.D., Wang, Y.: Gated feedback refinement network for dense image labeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3751–3759 (2017)
40. Ghiasi, G., Lin, T.Y., Le, Q.V.: NAS-FPN: Learning scalable feature pyramid architecture for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7036–7045 (2019)
Chen, X., Chen, X., Zha, Z.J.: Structure-aware residual pyramid (NIPS), p. 27 (2014) network for monocular depth estimation. arXiv preprint arXiv: 47. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. 1907. 06023, (2019) arXiv preprint arXiv: 1412. 6980, (2014) 30. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: 48. Chen, S., Pu, Z., Fan, X., Zou, B.: Fixing defect of photometric Proceedings of the IEEE Conference on Computer Vision and loss for self-supervised monocular depth estimation. IEEE Trans. Pattern Recognition, pp. 7132–7141 (2018) Circuits Syst. Video Technol. 32(3), 1328–1338 (2021) 31. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cbam: Convolutional 49. Guo, X., Li, H., Yi, S., Ren, J., Wang, X.: Learning monocular block attention module. In: Proceedings of the European Confer- depth by distilling cross-domain stereo networks. In: Proceedings ence on Computer Vision (ECCV), pp. 3–19 (2018) of the European Conference on Computer Vision (ECCV), pp. 32. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., 484–500 (2018) Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 50. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, In: Proceedings of the 31st Annual Conference on Neural Infor- S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M.: Imagenet mation Processing Systems (NIPS), pp. 30 (2017) large scale visual recognition challenge. Int. J. Comput. Vis. 115, 33. Tan, M., Pang, R., Le, Q.V.: Efficientdet: Scalable and efficient 211–252 (2015) object detection. In: Proceedings of the IEEE/CVF Conference on 51. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Computer Vision and Pattern Recognition, pp. 10781–10790 (2020) Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differ - 34. Cai, Z., Fan, Q., Feris, R.S., Vasconcelos, N.: A unified multi- entiation in pytorch. In: Proceedings of the Conference on Neural scale deep convolutional neural network for fast object detection. Information Processing Systems, (2017) In: Computer Vision–ECCV 2016: 14th European Conference, 52. Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, monocular images using deep convolutional neural fields. IEEE Part IV 14, pp. 354–370. Springer (2016) Trans. Pattern Anal. Mach. Intell. 38(10), 2024–2039 (2015) 35. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., 53. Kundu, J.N., Uppala, P.K., Pahuja, A., Babu, R.V.: Adadepth: Berg, A.C.: Ssd: Single shot multibox detector. In: Computer Unsupervised content congruent adaptation for depth estimation. Vision–ECCV 2016: 14th European Conference, Amsterdam, In: Proceedings of the IEEE Conference on Computer Vision and The Netherlands, October 11–14, 2016, Proceedings, Part I 14, Pattern Recognition, pp. 2656–2665 (2018) pp. 21–37. Springer (2016) 54. Casser, V., Pirk, S., Mahjourian, R., Angelova, A.: Depth predic- 36. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., tion without the sensors: Leveraging structure for unsupervised LeCun, Y.: Overfeat: Integrated recognition, localization and learning from monocular videos. In: Proceedings of the AAAI detection using convolutional networks. arXiv preprint arXiv: Conference on Artificial Intelligence, pp. 8001–8008 (2019) 1312. 6229, (2013) 55. Zhou, J., Wang, Y., Qin, K., Zeng, W.: Unsupervised high-res- 37. 
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, olution depth learning from videos with dual networks. In: Pro- S.: Feature pyramid networks for object detection. In: Proceedings ceedings of the IEEE/CVF International Conference on Computer of the IEEE Conference on Computer Vision and Pattern Recogni- Vision, pp. 6872–6881 (2019) tion, pp. 2117–2125 (2017) 56. Yin, Z., Shi, J.: Geonet: Unsupervised learning of dense depth, opti- 38. Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for cal flow and camera pose. In: Proceedings of the IEEE Conference instance segmentation. In: Proceedings of the IEEE Conference on on Computer Vision and Pattern Recognition, pp. 1983–1992 (2018) Computer Vision and Pattern Recognition, pp. 8759–8768 (2018) 57. Schellevis, M.: Improving self-supervised single view depth esti- 39. Amirul Islam, M., Rochan, M., Bruce, N.D., Wang, Y.: Gated mation by masking occlusion. arXiv preprint arXiv: 1908. 11112, feedback refinement network for dense image labeling. In: Pro- (2019) ceedings of the IEEE Conference on Computer Vision and Pattern 58. Zhou, L., Ye, J., Abello, M., Wang, S., Kaess, M.: Unsupervised Recognition, pp. 3751–3759 (2017) learning of monocular depth estimation with bundle adjustment, 40. Ghiasi, G., Lin, T.Y., Le, Q.V.: Nas-fpn: Learning scalable feature super-resolution and clip loss. arXiv preprint arXiv: 1812. 03368, pyramid architecture for object detection. In: Proceedings of the (2018) IEEE/CVF Conference on Computer Vision and Pattern Recogni- 59. PNVR, K., Zhou, H., Jacobs, D.: Sharingan: Combining syn- tion, pp. 7036–7045 (2019) thetic and real data for unsupervised geometry estimation. In: 1 3 Depth Estimation Based on Monocular Camera Sensors in Autonomous Vehicles: A Self‑supervised… Proceedings of the IEEE/CVF Conference on Computer Vision Xingyu Chi received his Bachelor’s degree and Pattern Recognition, pp. 13974–13983 (2020) from North Minzu University, China, in 60. Atapour-Abarghouei, A., Breckon, T.P.: Real-time monocular 2020. He is currently pursuing a master's depth estimation using synthetic data with domain adaptation via degree in Mechanical Engineering with the image style transfer. In: Proceedings of the IEEE Conference on College of Mechatronics and Control Engi- Computer Vision and Pattern Recognition, pp. 2800–2810 (2018) neering, Shenzhen University, Shenzhen, 61. Zhao, S., Fu, H., Gong, M., Tao, D.: Geometry-aware symmetric China. His research interests focus on the domain adaptation for monocular depth estimation. In: Proceed- application of computer vision and deep ings of the IEEE/CVF Conference on Computer Vision and Pat- learning technologies in autonomous tern Recognition, pp. 9788–9798 (2019) vehicles. Xingda Qu is a Professor at the Institute of Guofa Li is a Professor at the College of Human Factors and Ergonomics, Shenzhen Mechanical and Vehicle Engineering, University, Shenzhen, China. He received his Chongqing University, Chongqing, China. Ph.D in Human Factors and Ergonomics He received his Ph.D. in Mechanical Engi- from Virginia Tech, Blacksburg, VA, USA, neering from Tsinghua University, Beijing, in 2008. His research interests include trans- China, in 2016. His research focuses on envi- portation safety, occupational safety and ronment perception, driver behavior analysis, health, and human computer interaction. and humanlike decision-making based on artificial intelligence technologies in autono- mous vehicles and intelligent transportation systems. 
He has published more than 70 papers in his research areas. He is the recipi- ent of the Young Elite Scientists Sponsorship Program in China, and he receives the Best Paper Award from the China Association for Sci- ence and Technology and Automotive Innovation. 1 3
Automotive Innovation – Springer Journals
Published: May 1, 2023
Keywords: Autonomous vehicle; Camera sensor; Deep learning; Depth estimation; Self-supervised