This thesis investigates whether large language models (LLMs) produce consistent and neutral outputs when the same prompts are given in English and Arabic. It begins by reviewing technological, philosophical, psychological, and linguistic factors that can influence the behavior of multilingual models. Consistency is defined as stability in content and tone, while neutrality refers to the absence of biased or emotionally loaded framing.
Ten prompts (seven sensitive and three non-sensitive) were refined through an iterative English ablation process and then translated into Arabic. Six leading LLMs were queried in both languages, and their outputs were analyzed with automated sentiment analysis to measure differences in emotional tone. In parallel, a survey of bilingual English and Arabic speakers evaluated the model responses on sentiment consistency, factual consistency, and perceived neutrality in each language, as well as the neutrality of the prompts' framing.
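As a rough illustration of the automated step, the sketch below scores paired English and Arabic responses with a multilingual sentiment classifier; the specific classifier named here is an assumption for illustration only and is not necessarily the tool used in this study, and the response texts are placeholders.

```python
# Illustrative sketch: compare the sentiment of paired English/Arabic responses.
# The model name below is an assumed multilingual sentiment classifier from the
# Hugging Face hub, not necessarily the one used in this thesis.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
)

# Paired responses to the same prompt in both languages (placeholder text).
pairs = [
    {
        "prompt_id": 1,
        "en": "The policy has had mixed economic effects.",
        "ar": "كان للسياسة آثار اقتصادية متباينة.",
    },
]

for pair in pairs:
    en_result = sentiment(pair["en"])[0]  # e.g. {'label': 'neutral', 'score': 0.87}
    ar_result = sentiment(pair["ar"])[0]
    print(pair["prompt_id"], en_result["label"], ar_result["label"])
```

In practice, label or score differences between the two languages for the same prompt would then be aggregated per model and per prompt category before comparison with the survey ratings.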
Results indicate that non-sensitive prompts are rated as less neutral but show fewer inconsistencies in sentiment and factuality between English and Arabic outputs. In contrast, sensitive prompts are perceived as more neutral overall but exhibit larger differences in both sentiment and factual alignment. Among the models tested, some demonstrate higher consistency across languages than others. Automated analysis shows that English outputs often carry more positive or mixed tones, while Arabic outputs lean toward neutrality. Human evaluations mirror these patterns for non-sensitive topics but diverge for the more politically charged prompts, indicating that automated tools align poorly with human perception in sensitive contexts.
These findings underscore the importance of combining automated metrics with human judgment when assessing multilingual reliability and neutrality. The study suggests that improving the balance of training data, increasing transparency about language-specific behaviors, and guiding users to anticipate multilingual variation are key to developing fairer and more reliable GenAI systems.