ViT-Based Future Road Image Prediction: Evaluation via VLM 


Vol. 50, No. 10, pp. 1532-1535, Oct. 2025
DOI: 10.7840/kics.2025.50.10.1532


  Abstract

This paper proposes a Vision Transformer (ViT)-based model for predicting future driving scenes. The proposed ViT architecture processes input images as patches and leverages the attention mechanism to efficiently learn global visual information, while also integrating control inputs to effectively capture correlations between visual context and driving actions. Experimental results show that the ViT-based model generates sharper images than the baseline and achieves higher semantic similarity in explanation evaluations using a Vision-Language Model (VLM). These results suggest that the ViT architecture is effective not only for future prediction but also for explainable autonomous driving control.
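The paper does not publish its code, but the architecture the abstract describes can be illustrated with a minimal sketch: a PyTorch-style ViT that splits the road image into patches, projects the control inputs (steering and throttle are assumed here) into one extra token so the attention layers can relate driving actions to visual context, and decodes the attended patch tokens back into a predicted future frame. All layer sizes, the control-fusion scheme, and the decoder are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' code) of a ViT future-frame
# predictor that fuses a control-input token with image patch tokens.
import torch
import torch.nn as nn

class ViTFuturePredictor(nn.Module):
    def __init__(self, img_size=128, patch=16, dim=256, depth=6, heads=8, ctrl_dim=2):
        super().__init__()
        self.patch = patch
        self.n_patches = (img_size // patch) ** 2
        # Patch embedding: non-overlapping patches via strided convolution.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Project control inputs (assumed: steering, throttle) to one token.
        self.ctrl_proj = nn.Linear(ctrl_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Decode each patch token back to its patch of pixels.
        self.head = nn.Linear(dim, 3 * patch * patch)

    def forward(self, img, ctrl):
        B = img.size(0)
        tokens = self.embed(img).flatten(2).transpose(1, 2)  # (B, N, dim)
        ctrl_tok = self.ctrl_proj(ctrl).unsqueeze(1)         # (B, 1, dim)
        x = torch.cat([ctrl_tok, tokens], dim=1) + self.pos
        x = self.encoder(x)                                  # global attention
        patches = self.head(x[:, 1:])                        # drop control token
        # Reassemble patch predictions into the future frame.
        side = int(self.n_patches ** 0.5)
        out = patches.view(B, side, side, 3, self.patch, self.patch)
        out = out.permute(0, 3, 1, 4, 2, 5).reshape(
            B, 3, side * self.patch, side * self.patch)
        return torch.sigmoid(out)

model = ViTFuturePredictor()
pred = model(torch.rand(1, 3, 128, 128), torch.rand(1, 2))
print(pred.shape)  # torch.Size([1, 3, 128, 128])
```

Fusing the control inputs as an extra token lets every image patch attend to the driving action directly, which is one plausible realization of the abstract's claim that the model captures correlations between visual context and driving actions.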



  Cite this article

[IEEE Style]

D. Kim, J. Kwon, H. Nam, "ViT-Based Future Road Image Prediction: Evaluation via VLM," The Journal of Korean Institute of Communications and Information Sciences, vol. 50, no. 10, pp. 1532-1535, 2025. DOI: 10.7840/kics.2025.50.10.1532.

[ACM Style]

Donghyun Kim, Jaerock Kwon, and Haewoon Nam. 2025. ViT-Based Future Road Image Prediction: Evaluation via VLM. The Journal of Korean Institute of Communications and Information Sciences, 50, 10, (2025), 1532-1535. DOI: 10.7840/kics.2025.50.10.1532.

[KICS Style]

Donghyun Kim, Jaerock Kwon, Haewoon Nam, "ViT-Based Future Road Image Prediction: Evaluation via VLM," The Journal of Korean Institute of Communications and Information Sciences, vol. 50, no. 10, pp. 1532-1535, 10. 2025. (https://doi.org/10.7840/kics.2025.50.10.1532)