Coastal zones are dynamic and vulnerable regions, demanding accurate, scalable monitoring tools to inform environmental management and hazard mitigation. While satellite imagery and CNN-based classifiers have improved automated mapping, their reliance on unstructured pixel data limits contextual understanding. This study presents the first fine-tuning of a multi-modal large language model (MLLM), Qwen2.5, on 12-channel satellite input for multi-label coastal classification, demonstrating how architectural adaptation enables the integration of spectral, topographic, and derived features beyond RGB. We compare this approach to a ResNet-50 baseline and state-of-the-art prompting methods using GPT-4o and LLaMA-3.2. Our experiments on the CoastBench dataset reveal that MLLMs benefit substantially from few-shot prompting with diverse, balanced sampling, and that fine-tuning Qwen2.5 with full 12-channel input outperforms its RGB-only variant. An ablation study quantifies the importance of elevation and water-sensitive indices, while a human benchmark exposes a performance ceiling near F1 ≈ 0.70 due to label ambiguity. Our findings suggest that while MLLMs can rival traditional models and offer interpretability benefits, future gains depend on dataset quality, input diversity, and prompting strategy design.
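The abstract does not spell out the adaptation mechanism; as a minimal sketch (not the paper's confirmed method), one common way to extend a pretrained 3-channel vision encoder to 12 spectral bands is to widen its patch-embedding convolution and seed the new input channels from the pretrained RGB filters. The helper name expand_patch_embed below is hypothetical, and the 1280-dim, patch-14 encoder in the usage example is an assumed configuration.

```python
import torch
import torch.nn as nn

def expand_patch_embed(conv: nn.Conv2d, new_in_channels: int = 12) -> nn.Conv2d:
    """Widen a pretrained 3-channel patch-embedding conv to accept extra
    input bands, seeding the new channels with the mean RGB filter so that
    pretrained features are preserved at initialization."""
    new_conv = nn.Conv2d(
        new_in_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        # Copy the pretrained RGB weights into the first 3 input channels.
        new_conv.weight[:, :3] = conv.weight
        # Initialize the remaining channels with the mean of the RGB filters.
        mean_filter = conv.weight.mean(dim=1, keepdim=True)
        new_conv.weight[:, 3:] = mean_filter.repeat(1, new_in_channels - 3, 1, 1)
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Usage: widen a ViT-style patch embedding from RGB to 12 bands.
pretrained = nn.Conv2d(3, 1280, kernel_size=14, stride=14)
patch_embed_12ch = expand_patch_embed(pretrained, new_in_channels=12)
x = torch.randn(1, 12, 224, 224)   # one 12-band input tile
print(patch_embed_12ch(x).shape)   # torch.Size([1, 1280, 16, 16])
```

Mean-filter seeding is one of several reasonable initializations (zeros or small random weights are alternatives); it keeps the encoder's initial response to the new bands close to its pretrained behavior, which typically stabilizes early fine-tuning.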