Semantic Scene Understanding for Autonomous Vehicles: A Comprehensive Review of Vision Transformers

S. Gnanamurthy, R. Dhivya

Abstract


Semantic scene understanding—the ability to parse visual scenes into meaningful object categories, spatial relationships, and contextual information—represents a foundational capability for autonomous vehicle perception, with vision transformers marking a paradigm shift away from convolutional neural networks in the pursuit of human-level scene comprehension. This comprehensive review examines vision transformer architectures and their applications to autonomous vehicle perception tasks including semantic segmentation, panoptic segmentation, object detection, instance segmentation, and scene graph generation. We systematically analyze the fundamental mechanisms distinguishing transformers from CNNs: self-attention operations capturing long-range dependencies across entire images, positional encodings preserving spatial information, and multi-head attention enabling diverse feature relationships. The review traces the evolution from the original Vision Transformer (ViT), which requires massive pre-training datasets, through data-efficient variants (DeiT, Swin Transformer) with hierarchical architectures and shifted windowing schemes, to specialized designs for dense prediction tasks (SegFormer, Mask2Former, OneFormer). Particular emphasis is placed on transformer adaptations addressing autonomous driving requirements: real-time inference through efficient attention mechanisms and model compression, multi-scale feature representation for detecting objects ranging from distant vehicles to nearby pedestrians, and temporal modeling for video understanding incorporating motion cues. We examine pre-training strategies that enable transformers to learn robust visual representations, including supervised pre-training on ImageNet, self-supervised methods (MAE, DINO, BEiT) learning from unlabeled data, and domain-specific pre-training on driving datasets. The review comprehensively analyzes benchmark performance on autonomous driving datasets (Cityscapes, BDD100K, nuScenes, Waymo) across semantic segmentation, object detection, and multi-task learning scenarios, comparing transformers against state-of-the-art CNN baselines. Application-specific considerations are examined for different perception challenges: urban driving requiring fine-grained segmentation of road infrastructure, highway scenarios emphasizing long-range detection, and adverse weather conditions where transformers’ global context modeling may improve robustness. We critically evaluate computational requirements, analyzing inference latency, memory consumption, and energy efficiency on automotive-grade hardware (NVIDIA Drive, Qualcomm Snapdragon Ride), identifying that while transformers achieve superior accuracy, the quadratic scaling of self-attention with the number of image tokens poses deployment challenges. The paper examines hybrid architectures combining convolutional feature extraction with transformer reasoning, offering improved efficiency-accuracy trade-offs. Advanced capabilities are reviewed, including cross-modal transformers fusing camera, LiDAR, and radar data through unified attention mechanisms, temporal transformers aggregating information across video frames for robust tracking, and query-based detection transformers (DETR, Deformable DETR) eliminating hand-crafted components such as anchor boxes and non-maximum suppression.
Emerging research directions are analyzed, including efficient transformer designs with linear attention complexity, neural architecture search for task-specific optimization, and foundation models pre-trained on massive multi-modal datasets enabling zero-shot transfer to novel driving scenarios. The review identifies critical gaps, including limited interpretability of attention mechanisms for safety validation, vulnerability to adversarial perturbations, and insufficient evaluation on rare but safety-critical edge cases.
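
To make the self-attention mechanism and its quadratic cost concrete, the sketch below is a minimal, illustrative single-head scaled dot-product attention over ViT-style patch embeddings; the variable names, shapes, and NumPy formulation are simplifying assumptions for illustration, not code from any of the reviewed works.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(patches, Wq, Wk, Wv):
    # patches: (N, d) -- N patch embeddings of dimension d.
    Q, K, V = patches @ Wq, patches @ Wk, patches @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (N, N) score matrix: quadratic in N
    return softmax(scores) @ V                # every patch attends to every other patch

# A 224x224 image with 16x16 patches yields N = (224 / 16) ** 2 = 196 tokens.
N, d = 196, 64
rng = np.random.default_rng(0)
patches = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(patches, Wq, Wk, Wv)     # output shape (196, 64)

The (N, N) score matrix is the source of the quadratic scaling noted above: at the same patch size, a 1024x1024 input gives N = 4096 and roughly 16.8 million attention scores per head per layer, which is the cost that windowed, deformable, and linear-attention designs discussed in this review aim to reduce.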

Keywords


vision transformers, semantic segmentation, autonomous vehicles, scene understanding.

References


Chen, Wuyang, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, et al. “A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation”. arXiv [Cs.CV], 2022. arXiv. http://arxiv.org/abs/2112.09747.

Chu, Xiangxiang, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. “Twins: Revisiting the Design of Spatial Attention in Vision Transformers”. arXiv [Cs.CV], 2021. arXiv. http://arxiv.org/abs/2104.13840.

Dong, Xiaoyi, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. “CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows”. arXiv [Cs.CV], 2022. arXiv. http://arxiv.org/abs/2107.00652.

Dosovitskiy, Alexey, et al. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale”. arXiv [Cs.CV], 2020. arXiv. http://arxiv.org/abs/2010.11929.

Gu, Jiaqi, Hyoukjun Kwon, Dilin Wang, Wei Ye, Meng Li, Yu-Hsin Chen, Liangzhen Lai, Vikas Chandra, and David Z. Pan. “Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation”. arXiv [Cs.CV], 2021. arXiv. http://arxiv.org/abs/2111.01236.

Lee, Youngwan, Jonghee Kim, Jeff Willette, and Sung Ju Hwang. “MPViT: Multi-Path Vision Transformer for Dense Prediction”. arXiv [Cs.CV], 2021. arXiv. http://arxiv.org/abs/2112.11010.

Liu, Ze, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 10012–22, 2021. Available at https://openaccess.thecvf.com/content/ICCV2021/html/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper.html.

Naseer, Muzammal, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. “Intriguing Properties of Vision Transformers”. arXiv [Cs.CV], 2021. arXiv. http://arxiv.org/abs/2105.10497.

Wu, Dong, Man-Wen Liao, Wei-Tian Zhang, Xing-Gang Wang, Xiang Bai, Wen-Qing Cheng, and Wen-Yu Liu. “YOLOP: You Only Look Once for Panoptic Driving Perception”. Machine Intelligence Research 19, no. 6 (1 December 2022): 550–62. https://doi.org/10.1007/s11633-022-1339-y.

Xie, Enze, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. “SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers”. In Advances in Neural Information Processing Systems, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, 34:12077–90. Curran Associates, Inc., 2021. https://proceedings.neurips.cc/paper_files/paper/2021/file/64f1f27bf1b4ec22924fd0acb550c235-Paper.pdf.

Yang, Michael. “Visual Transformer for Object Detection”. arXiv [Cs.CV], 2022. arXiv. http://arxiv.org/abs/2206.06323.

Zhu, Xizhou, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. “Deformable DETR: Deformable Transformers for End-to-End Object Detection”. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=gZ9hCDWe6ke.


