Visual Question Answering in Mobile AI Assistants: A Benchmark of Proprietary Cloud-Based Multimodal LLMs

Evaluating Monetary Cost, Accuracy, Token Usage, Payload, and Latency

Master Thesis (2026)
Author(s)

H. Vanhuynegem (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

G. Lan – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Coordinates
51.9989, 4.3735
Graduation Date
15-06-2026
Awarding Institution
Delft University of Technology
Programme
Electrical Engineering, Embedded Systems
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Mobile AR assistants must offload visual queries to cloud multimodal large language models (MLLMs): on-device inference exceeds the power, memory, and thermal budgets of wearable hardware. This thesis measures how image preprocessing affects pipeline latency, payload size, token usage, cost, and visual question answering (VQA) accuracy when applied to the captured frame before transmission.

A controlled paired experiment compared 12 techniques across four provider-model configurations and three VQA datasets, logging all five dimensions per sample. Each preprocessed sample was paired with its unprocessed counterpart, so that variance in per-image difficulty does not confound the comparison. The techniques span JPEG compression, downsampling, grayscale conversion, gaze-based region-of-interest (ROI) cropping, and saliency- and YOLO-based cropping. Relative to unprocessed images, JPEG at quality 85 reduces latency by 25% and payload by 50% with no detectable accuracy loss. Gaze-based ROI cropping reduces latency by 38% and payload by over 85% at a 3-percentage-point accuracy cost, provided eye-tracking data are available. On Realtime-class streaming models, both techniques are recommended as deployment defaults.
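The paired design described above can be illustrated with a minimal sketch. It assumes each metric is logged as two equally long lists, one for unprocessed and one for preprocessed samples, aligned by image-question pair. The function name `paired_summary` and the latency values are hypothetical and are not taken from the thesis or from VQABench.

```python
from statistics import mean


def paired_summary(baseline, treated):
    """Summarise a paired comparison for one metric (e.g. latency in ms).

    baseline[i] and treated[i] must come from the same image-question pair.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty, equally long lists")
    # Per-sample differences cancel out how hard each individual image is.
    diffs = [t - b for b, t in zip(baseline, treated)]
    mean_diff = mean(diffs)
    # Change of the mean, expressed as a percentage of the baseline mean.
    rel_change_pct = 100.0 * mean_diff / mean(baseline)
    return {"mean_diff": mean_diff, "rel_change_pct": rel_change_pct}


# Hypothetical latencies in milliseconds (illustrative only, not thesis data).
raw_ms = [1000.0, 1200.0, 800.0, 1000.0]
jpeg_ms = [750.0, 900.0, 600.0, 750.0]
summary = paired_summary(raw_ms, jpeg_ms)
```

A negative `rel_change_pct` indicates a reduction relative to the unprocessed baseline. A full analysis would presumably add a significance test on the per-sample differences, which this sketch omits.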

This thesis introduces a taxonomy that separates two classes of preprocessing. Compression-only preprocessing reduces payload without altering image geometry and therefore cannot discard task-relevant content. Geometry-changing preprocessing crops or resizes the image and can remove information the model would otherwise receive. This distinction predicts which technique classes incur accuracy costs and which do not. The open benchmark VQABench supports replication across additional providers, models, and strategies. The results are limited to still-frame VQA and should be validated separately for video or streaming queries. The findings extend beyond AR: any multimodal pipeline that transmits images to a cloud model can apply preprocessing-first optimisations before investing in prompt compression or model-level architectural changes.
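As a rough illustration of the taxonomy, the sketch below classifies a technique by whether it changes the pixel dimensions of the frame, and computes a gaze-centred crop box as an example of a geometry-changing step. Both `roi_box` and `preprocessing_class` are hypothetical helpers, and the dimension check is only an assumed proxy for the distinction; the thesis may define the classes differently.

```python
def roi_box(width, height, gaze_x, gaze_y, roi_size):
    """Return a (left, top, right, bottom) crop box centred on the gaze point.

    The box is shifted where needed so that it stays inside the image.
    """
    side_w = min(roi_size, width)
    side_h = min(roi_size, height)
    left = min(max(gaze_x - side_w // 2, 0), width - side_w)
    top = min(max(gaze_y - side_h // 2, 0), height - side_h)
    return (left, top, left + side_w, top + side_h)


def preprocessing_class(in_size, out_size):
    """Classify a technique by whether it changed the pixel dimensions.

    Assumption: unchanged dimensions stand in for "compression-only".
    """
    return "compression-only" if in_size == out_size else "geometry-changing"


# Hypothetical 1920x1080 frame with a gaze point near the top-right corner.
frame = (1920, 1080)
box = roi_box(1920, 1080, gaze_x=1900, gaze_y=100, roi_size=512)
cropped = (box[2] - box[0], box[3] - box[1])

jpeg_class = preprocessing_class(frame, frame)    # re-encoding keeps dimensions
crop_class = preprocessing_class(frame, cropped)  # cropping changes dimensions
```

Because the gaze point lies near the image border, the box is shifted inward rather than shrunk, so the crop keeps its full 512-pixel side.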

Files

Report.pdf
(pdf | 11.9 MB)
License info not available