Generative AI: Investigating Consistency and Neutrality in Multilingual Outputs

Master Thesis (2025)
Author(s)

A. Ibrahim (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Luciano Siebert – Mentor (TU Delft - Interactive Intelligence)

S.K. Kuilman – Mentor (TU Delft - Interactive Intelligence)

M.S. Pera – Graduation committee member (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
27-05-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This thesis investigates whether large language models (LLMs) produce consistent and neutral outputs when the same prompts are given in English and Arabic. It begins by reviewing technological, philosophical, psychological, and linguistic factors that can influence the behavior of multilingual models. Consistency is defined as stability in content and tone, while neutrality refers to the absence of biased or emotionally loaded framing.

Ten prompts (seven sensitive and three non-sensitive) were refined through an iterative English ablation process and then translated into Arabic. Six leading LLMs were queried in both languages, and their outputs were analyzed using automated sentiment analysis to measure differences in emotional tone. In parallel, a survey of bilingual English and Arabic speakers evaluated model responses on sentiment consistency, factual consistency, and perceived neutrality in each language, along with the neutral framing of the prompts.
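The thesis does not specify which sentiment tooling was used; purely as a minimal illustrative sketch, the comparison of paired English and Arabic outputs could be set up with a multilingual sentiment classifier such as the one below. The model name, label scheme, and helper function are hypothetical choices for illustration, not the thesis's actual pipeline.

```python
# Minimal sketch: compare the sentiment of paired English/Arabic LLM outputs.
# The model name and its label scheme are assumptions made for illustration;
# the thesis does not name its sentiment analysis tool.
from transformers import pipeline

# A multilingual sentiment classifier (hypothetical choice).
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
)

def tone_difference(english_output: str, arabic_output: str) -> dict:
    """Classify both outputs and flag whether their dominant tone differs."""
    en = classifier(english_output)[0]  # e.g. {"label": ..., "score": ...}
    ar = classifier(arabic_output)[0]
    return {
        "english": en,
        "arabic": ar,
        "tone_mismatch": en["label"] != ar["label"],
    }

if __name__ == "__main__":
    result = tone_difference(
        "The policy has clear benefits for most residents.",
        "لهذه السياسة فوائد واضحة لمعظم السكان.",
    )
    print(result)
```

In this sketch, a tone mismatch on a prompt pair would count as a sentiment inconsistency between the two languages; aggregating such mismatches per model and per prompt category (sensitive versus non-sensitive) is one way the automated comparison described above could be operationalized.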

Results indicate that non-sensitive prompts are rated as less neutral but exhibit fewer inconsistencies in sentiment and factuality across English and Arabic outputs. In contrast, sensitive prompts are perceived as more neutral overall but exhibit larger differences in both sentiment and factual alignment. Among the models tested, some demonstrate higher consistency across languages than others. Automated analysis shows English outputs often carry more positive or mixed tones, while Arabic outputs lean toward neutrality. Human evaluations mirror these patterns for non-sensitive topics but differ for the more politically charged prompts, highlighting that automated tools do not align well with human perception in sensitive contexts.

These findings underscore the importance of combining automated metrics with human judgment when assessing multilingual reliability and neutrality. The study suggests that balancing training data across languages, increasing transparency about language-specific behaviors, and guiding users to anticipate cross-language variation are key to developing fairer and more reliable GenAI systems.

Files

Thesis_-_Final.pdf
(pdf | 2.42 MB)
License info not available