Visual Question Answering in Mobile AI Assistants: A Benchmark of Proprietary cloud-based multimodal LLMs

Evaluating Monetary Cost, Accuracy, Token Usage, Payload, and Latency

Master Thesis (2026)

Author(s)

H. Vanhuynegem (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

G. Lan – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Multimodal Large Language Models (MLLMs), Image Preprocessing, Visual Question Answering (VQA)

To reference this document use

https://resolver.tudelft.nl/uuid:e08e21ea-ca40-4443-85e9-850f58e79a05

Publication Year

2026

Language

English

Coordinates

51.9989, 4.3735

Graduation Date

15-06-2026

Awarding Institution

Delft University of Technology

Programme

Electrical Engineering, Embedded Systems

Downloads counter

14

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Mobile AR assistants must offload visual queries to cloud multimodal large language models (MLLMs): on-device inference exceeds the power, memory, and thermal budgets of wearable hardware. This thesis measures how image preprocessing affects pipeline latency, payload size, token usage, cost, and visual question answering (VQA) accuracy when applied to the captured frame before transmission.

A controlled paired experiment compared 12 techniques across four provider-model configurations and three VQA datasets, logging all five dimensions (latency, payload size, token usage, cost, and accuracy) per sample. Each preprocessed sample was paired with its unprocessed counterpart to eliminate confounding from per-image difficulty variance. The techniques span JPEG compression, downsampling, grayscale conversion, gaze-based region-of-interest (ROI) cropping, and saliency- and YOLO-based cropping. Relative to unprocessed images, JPEG at quality 85 reduces latency by 25% and payload by 50% with no detectable accuracy loss; gaze-based ROI cropping reduces latency by 38% and payload by over 85% at a 3-percentage-point accuracy cost, provided eye-tracking data are available. On Realtime-class streaming models, both techniques are recommended as deployment defaults.
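The paired comparison described above can be illustrated with a minimal Python sketch. It is not the thesis's analysis code. The function names, the choice of averaging per-sample relative changes (rather than, say, taking a ratio of means), and the numbers in the usage lines are all assumptions made for illustration.

```python
from statistics import mean


def paired_relative_change(baseline, processed):
    """Mean per-sample relative change (in %) of processed vs. baseline.

    Both sequences must hold the same metric (e.g. latency or payload)
    for the same samples in the same order. Hypothetical helper.
    """
    if len(baseline) != len(processed) or not baseline:
        raise ValueError("paired samples must be non-empty and equal in length")
    # Each sample is compared only with its own unprocessed counterpart,
    # so per-image difficulty does not leak into the estimate.
    return mean((p - b) / b * 100.0 for b, p in zip(baseline, processed))


def paired_accuracy_delta(baseline_correct, processed_correct):
    """Accuracy difference in percentage points over the same samples."""
    if len(baseline_correct) != len(processed_correct) or not baseline_correct:
        raise ValueError("paired samples must be non-empty and equal in length")
    return mean(int(p) - int(b)
                for b, p in zip(baseline_correct, processed_correct)) * 100.0


# Made-up illustrative values, not results from the thesis.
latency_change = paired_relative_change([2.0, 4.0], [1.5, 3.0])
accuracy_delta = paired_accuracy_delta([True, True, False, True],
                                       [True, False, False, True])
```

A negative return value means the preprocessed variant scored lower on that metric than the unprocessed baseline.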

This thesis introduces a taxonomy that distinguishes two classes of preprocessing. Compression-only preprocessing reduces payload without altering image geometry and therefore cannot discard task-relevant content. Geometry-changing preprocessing crops or resizes the image and can remove information the model would otherwise receive. This distinction predicts which technique classes incur accuracy costs and which do not. The open benchmark VQABench supports replication across additional providers, models, and strategies; the results are limited to still-frame VQA and should be validated separately for video or streaming queries. The findings extend beyond AR: any multimodal pipeline that transmits images to a cloud model can apply preprocessing-first optimisations before investing in prompt compression or model-level architectural changes.
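The taxonomy above can be read as a simple decision rule, sketched below in Python. The class names, the `Technique` record, and its `crops` and `resizes` flags are hypothetical and do not come from VQABench; only the rule itself (cropping or resizing makes a technique geometry-changing) follows the abstract's definition.

```python
from dataclasses import dataclass
from enum import Enum


class PreprocessingClass(Enum):
    COMPRESSION_ONLY = "compression-only"
    GEOMETRY_CHANGING = "geometry-changing"


@dataclass(frozen=True)
class Technique:
    """Hypothetical description of a preprocessing technique."""
    name: str
    crops: bool    # removes part of the frame
    resizes: bool  # changes the pixel dimensions


def classify(technique: Technique) -> PreprocessingClass:
    """Apply the abstract's rule: any crop or resize changes geometry."""
    if technique.crops or technique.resizes:
        return PreprocessingClass.GEOMETRY_CHANGING
    return PreprocessingClass.COMPRESSION_ONLY


# Three techniques named in the abstract, flagged per their usual meaning.
jpeg_q85 = Technique("JPEG quality 85", crops=False, resizes=False)
gaze_roi = Technique("gaze-based ROI cropping", crops=True, resizes=False)
downsample = Technique("downsampling", crops=False, resizes=True)

classes = {t.name: classify(t).value for t in (jpeg_q85, gaze_roi, downsample)}
```

Under this rule, JPEG compression falls in the compression-only class, while gaze-based ROI cropping and downsampling are geometry-changing.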

Files

Report.pdf

(pdf | 11.9 Mb)

License info not available