EFFECTIVENESS OF VISION–LANGUAGE MODELS (VLMs) FOR GROUND-OBJECT RECOGNITION IN A MULTI-LEVEL EDGE–CLOUD UAV ARCHITECTURE
https://doi.org/10.33815/2313-4763.2025.2.31.019-029
Abstract
This paper presents a comparative analysis of the performance of vision–language models (VLMs) in detecting explosive hazards in images acquired from unmanned aerial vehicles (UAVs). The study evaluates two state-of-the-art models: OpenAI GPT-4.1 (ChatGPT) and Google Gemini 2.5 Flash. A dataset of 2,500 frames containing anti-personnel mines (specifically PFM-1, PMN-3, and RMA-2) was collected from videos recorded in Ukraine, the USA, and Italy. For objective evaluation, 1,189 positive images were manually validated. At the frame level, Gemini achieved a correct detection rate of 67.62%, while GPT-4.1 reached 63.75%. However, at the object level, GPT-4.1 detected 28 out of 29 targets, slightly outperforming Gemini (27 targets). The research supports the development of a multi-level (edge–local–cloud) architecture in which VLMs act as a semantic filter for candidate images pre-identified by lightweight onboard detectors, thereby reducing communication-bandwidth usage and system latency. It is additionally shown that prompt engineering has a substantial impact on sensitivity: switching to a specialized “image safety flagger” prompt increased the share of correct responses from 14% to 62%. Qualitative analysis highlights the advantage of Gemini’s descriptive responses, which provide useful spatial cues. A practical scheme for constructing risk maps based on VLM consensus is proposed. The main limitations noted are the insufficient representation of negative examples and the absence of full precision–recall curves.
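To make the proposed pipeline concrete, the following is a minimal sketch of how the edge–local–cloud flow and the VLM-consensus risk map described above might be wired together. It is illustrative only: the names (`edge_filter`, `cloud_consensus`, `SAFETY_FLAGGER_PROMPT`), the 0.4 onboard-detector threshold, the prompt wording, and the vote-weighting scheme are assumptions for the sketch, not details taken from the paper; the VLM calls are abstracted as callables so no particular vendor API is implied.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt in the spirit of the paper's "image safety flagger" prompt;
# the exact wording used in the study is not reproduced here.
SAFETY_FLAGGER_PROMPT = (
    "You are an image safety flagger. Decide whether the image contains an "
    "explosive hazard (e.g. an anti-personnel mine). Answer YES or NO and "
    "briefly describe where the object is located in the frame."
)

@dataclass
class Candidate:
    frame_id: str
    image_bytes: bytes
    geo_cell: Tuple[int, int]   # grid cell of the UAV ground footprint
    onboard_score: float        # confidence from the lightweight edge detector

@dataclass
class RiskMap:
    """Accumulates VLM consensus votes per geographic grid cell."""
    cells: Dict[Tuple[int, int], float] = field(default_factory=dict)

    def add_vote(self, cell: Tuple[int, int], weight: float) -> None:
        self.cells[cell] = self.cells.get(cell, 0.0) + weight

def edge_filter(candidates: List[Candidate], threshold: float = 0.4) -> List[Candidate]:
    """Edge level: keep only frames the lightweight onboard detector flagged,
    so that only a small fraction of the traffic is uplinked to the cloud."""
    return [c for c in candidates if c.onboard_score >= threshold]

def cloud_consensus(
    candidate: Candidate,
    vlm_queries: List[Callable[[bytes, str], bool]],
    risk_map: RiskMap,
) -> bool:
    """Cloud level: query each VLM with the safety-flagger prompt and record a
    consensus vote in the risk map (all models agree -> full weight, partial
    agreement -> proportional weight)."""
    votes = [query(candidate.image_bytes, SAFETY_FLAGGER_PROMPT) for query in vlm_queries]
    positives = sum(votes)
    if positives:
        risk_map.add_vote(candidate.geo_cell, positives / len(votes))
    return positives == len(votes)
```

Keeping the edge stage as a cheap threshold filter and the cloud stage as a prompt-driven semantic check mirrors the paper's intent of spending VLM compute only on candidate frames rather than on the full video stream.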
