Generative AI: Investigating Consistency and Neutrality in Multilingual Outputs

Master Thesis (2025)
Author(s)

A. Ibrahim (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Luciano Siebert – Mentor (TU Delft - Interactive Intelligence)

S.K. Kuilman – Mentor (TU Delft - Interactive Intelligence)

M.S. Pera – Graduation committee member (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
27-05-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This thesis investigates whether large language models (LLMs) produce consistent and neutral outputs when the same prompts are given in English and Arabic. It begins by reviewing technological, philosophical, psychological, and linguistic factors that can influence the behavior of multilingual models. Consistency is defined as stability in content and tone, while neutrality refers to the absence of biased or emotionally loaded framing.

Ten prompts (seven sensitive and three non-sensitive) were refined through an iterative English ablation process and then translated into Arabic. Six leading LLMs were queried in both languages, and their outputs were analyzed using automated sentiment analysis to measure differences in emotional tone. In parallel, a survey of bilingual English and Arabic speakers evaluated model responses on sentiment consistency, factual consistency, and perceived neutrality in each language, along with the neutral framing of the prompts.
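The thesis does not specify which sentiment tooling was used; purely as a minimal illustrative sketch, the comparison of paired English and Arabic outputs could be set up with a multilingual sentiment classifier such as the one below. The model name, label scheme, and helper function are hypothetical choices for illustration, not the thesis's actual pipeline.

```python
# Minimal sketch: compare the sentiment of paired English/Arabic LLM outputs.
# The model name and its label scheme are assumptions made for illustration;
# the thesis does not name its sentiment analysis tool.
from transformers import pipeline

# A multilingual sentiment classifier (hypothetical choice).
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
)

def tone_difference(english_output: str, arabic_output: str) -> dict:
    """Classify both outputs and flag whether their dominant tone differs."""
    en = classifier(english_output)[0]  # e.g. {"label": ..., "score": ...}
    ar = classifier(arabic_output)[0]
    return {
        "english": en,
        "arabic": ar,
        "tone_mismatch": en["label"] != ar["label"],
    }

if __name__ == "__main__":
    result = tone_difference(
        "The policy has clear benefits for most residents.",
        "لهذه السياسة فوائد واضحة لمعظم السكان.",
    )
    print(result)
```

In this sketch, a tone mismatch on a prompt pair would count as a sentiment inconsistency between the two languages; aggregating such mismatches per model and per prompt category (sensitive versus non-sensitive) is one way the automated comparison described above could be operationalized.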

Results indicate that non-sensitive prompts are rated as less neutral but exhibit fewer inconsistencies in sentiment and factuality across English and Arabic outputs. In contrast, sensitive prompts are perceived as more neutral overall but exhibit larger differences in both sentiment and factual alignment. Among the models tested, some demonstrate higher consistency across languages than others. Automated analysis shows English outputs often carry more positive or mixed tones, while Arabic outputs lean toward neutrality. Human evaluations mirror these patterns for non-sensitive topics but differ for the more politically charged prompts, highlighting that automated tools do not align well with human perception in sensitive contexts.

These findings underscore the importance of combining automated metrics with human judgment when assessing multilingual reliability and neutrality. The study suggests that balancing training data across languages, increasing transparency about language-specific behaviors, and guiding users to anticipate cross-language variation are key to developing fairer and more reliable GenAI systems.

Files

Thesis_-_Final.pdf
(pdf | 2.42 MB)
License info not available