P.K. Murukannaiah | TU Delft Repository

From human teams to hybrid intelligence teams

Identifying, characterizing, and evaluating foundational quality attributes

Journal article (2026) - Davide Dell’Anna, Pradeep K. Murukannaiah, Mireia Yurrita, Bernd Dudzik, Davide Grossi, Catholijn M. Jonker, Catharine Oertel, Pınar Yolum

Hybrid Intelligence (HI) is an emerging paradigm in which artificial intelligence (AI) augments human intelligence. The current literature lacks systematic models that guide the design and evaluation of HI systems. Further, discussions around HI primarily focus on technology, neglecting the holistic human-AI ensemble. In this paper, we take the initial steps toward the development of a quality model for characterizing and evaluating HI systems from a human-AI teams perspective. We first conducted a study investigating the adequacy of properties commonly associated with effective human teams to describe HI. The study features the insights of 50 HI researchers, and shows that various human team properties, including boundedness, interdependence, competency, purposefulness, initiative, normativity, and effectiveness, are important for HI systems. Based on these results, we developed a quality model for HI teams composed of seven high-level quality attributes, further refined into 16 specific ones. To evaluate the relevance and understanding of the proposed attributes, we conducted a second empirical investigation by staging competitions in which participants used the quality model to develop and analyze HI usage scenarios. Our analysis of 48 collected scenarios, which we openly release, confirms the proposed attributes’ relevance and highlights insights that emerge when designers consider the quality model in HI system design. ...

Developing Guidelines for Human-LLM Agent Teams

A Multi-Stakeholder Lens

Conference paper (2026) - Mireia Yurrita, Davide Dell’Anna, Pradeep K. Murukannaiah, Catholijn M. Jonker, Pınar Yolum

Agents based on Large Language Models (LLM agents) have the potential to work with humans as part of a team to achieve specific goals. The natural language interface of LLM agents and their high level of autonomy enable more seamless collaborations than previous technologies, allowing them to carry out tasks autonomously and engage in conversations with humans, e.g., to clarify goals, request authorizations, or double-check decisions. However, the current literature lacks systematic design guidelines for these human-LLM agent teams. This gap might foster misunderstandings, misuse of autonomy, and lack of common ground, potentially leading to collaboration pitfalls. To mitigate these risks, we develop 24 guidelines for the principled design of human-LLM agent teams. We adopt a multi-stakeholder approach and propose guidelines for LLM agents, human team members, team designers, and embedding organizations. To develop these guidelines, we distill design recommendations from an exploratory workshop with 15 experts on human-AI teaming and a literature review of 93 empirical papers in human-LLM collaboration. Drawing from literature on human teams, we conceptually categorize the recommendations across different stages of the teaming process. A user study with 10 additional experts suggests the guidelines can help prevent collaboration pitfalls in human-LLM agent teams within workplace settings. ...

Guiding Sociotechnical Systems toward Value-Norm Equilibrium (Blue Sky Ideas Track)

Conference paper (2026) - Nirav Ajmeri, Marina De Vos, Davide Dell’Anna, Pradeep K. Murukannaiah, Vivek Nallur, Luis G. Nardin, Munindar P. Singh

Values and norms are complementary constructs that undergird prosocial behavior in sociotechnical systems (STSs). Whereas values are intrinsic motivators for prosocial behavior, norms are extrinsic motivators for meeting mutual expectations. An STS is in equilibrium when the values of its member actors and the norms that govern it align with each other. Such an equilibrium is not permanent as actors join or leave the STS, and their values and norms evolve. In general, an STS must be guided toward equilibrium by systematically refining the norm specifications and influencing the values of its member actors. We formulate the challenges involved in building systematic methods to detect misalignment and guide the STS toward, and maintain, value-norm equilibrium. ...

MORL4Water

A Modular Multi-Objective Reinforcement Learning Toolkit for Water Resource Management

Conference paper (2026) - Zuzanna Osika, Roxana Rădulescu, Jazmin Zatarain-Salazar, Frans A. Oliehoek, Pradeep K. Murukannaiah

Many real-world decision problems involve conflicting objectives. Multi-objective reinforcement learning (MORL) extends standard RL to optimize multiple objectives simultaneously, producing policy sets that capture different trade-offs. However, MORL research often relies on simplified benchmarks with limited real-world relevance. We present MORL4Water, a modular toolkit for creating realistic MORL environments in water resource management. Built on MO-Gymnasium, MORL4Water enables scenario construction from real data and systematic evaluation of MORL methods. We illustrate its use on the Nile and Susquehanna rivers, benchmarking several MORL algorithms against EMODPS, a domain-specific baseline. Beyond standard performance metrics, we analyze solution sets to reveal differences in exploration, scalability, and trade-off diversity. Our results show that most state-of-the-art MORL algorithms underperform relative to EMODPS, especially in higher-dimensional settings, and highlight the value of solution-set analysis for robust, real-world applications. ...
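
The "policy sets that capture different trade-offs" mentioned above are usually understood as sets of non-dominated (Pareto-optimal) solutions. The following is a minimal sketch of that filtering step, assuming every objective is to be maximized. The function names `dominates` and `pareto_front` and the sample return vectors are hypothetical illustrations; they are not part of MORL4Water or MO-Gymnasium.

```python
def dominates(a, b):
    """Return True if vector a Pareto-dominates vector b (maximization).

    a dominates b when it is at least as good in every objective
    and strictly better in at least one.
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-objective returns of six candidate policies.
returns = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
front = pareto_front(returns)
print(front)
```

In this toy data, the last three vectors are each dominated by at least one other vector, so only the first three remain. The pairwise check is quadratic in the number of points, which is adequate for small solution sets.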

Diverse Committees with Incomplete or Inaccurate Approval Ballots

Conference paper (2026) - Feline Lindeboom, Martijn Brehm, Davide Grossi, Pradeep K. Murukannaiah

We study diversity in approval-based committee elections with incomplete or inaccurate information. We define diversity according to the Maximum Coverage problem, which is known to be NP-complete, with a best attainable polynomial time approximation ratio of 1 − 1/e. In the incomplete information setting, voters vote only on a small portion of the candidates, and we prove that getting arbitrarily close to the optimal approximation ratio with high probability requires Ω(m²) non-adaptive queries, where m is the number of candidates. This motivates studying adaptive querying algorithms that can adapt their querying strategy to information obtained from previous query outcomes. In that setting, we lower this bound to only Ω(m) queries. We propose a greedy algorithm to match this lower bound up to log-factors. We prove the same Θ(m) bound for the generalized problem of Maximum Coverage over a matroid constraint, using a local search algorithm. Specifying a matroid of valid committees lets us implement extra structural requirements on the committee, such as quotas. In the inaccurate information setting, voters’ responses are corrupted with a small probability. We prove Θ(nm) queries are required to attain a (1 − 1/e)-approximation with high probability, where n is the number of voters. While the proven bounds show that all our algorithms are viable asymptotically, they also show that some of them would still require large numbers of queries in instances of practical relevance. Using real data from Polis as well as synthetic data, we observe that our algorithms perform well also on smaller instances, both with incomplete and inaccurate information. ...
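
For readers unfamiliar with Maximum Coverage, the classic greedy heuristic that attains the 1 − 1/e approximation ratio can be sketched as follows. This sketch assumes complete and accurate approval ballots; it is not the query-based algorithm proposed in the paper. The function name `greedy_committee` and the toy ballots are hypothetical.

```python
def greedy_committee(approvals, k):
    """Greedy Maximum Coverage for committee selection.

    approvals: dict mapping each candidate to the set of voters approving it.
    k: committee size.
    Returns (committee, covered), where covered is the set of voters who
    approve at least one committee member.
    """
    remaining = dict(approvals)
    committee = []
    covered = set()
    for _ in range(min(k, len(remaining))):
        # Pick the candidate covering the most not-yet-covered voters;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda c: len(remaining[c] - covered))
        committee.append(best)
        covered |= remaining.pop(best)
    return committee, covered


# Hypothetical ballots: four candidates, six voters.
ballots = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
    "d": {1, 6},
}
committee, covered = greedy_committee(ballots, k=2)
print(committee, sorted(covered))
```

In this example, candidates "a" and "c" tie in the first round and "a" is chosen alphabetically; "c" then adds three new voters, so the two-member committee covers all six voters.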

Dementia Empowerment with Heart Health Intervention and LLM-based Health AI Research Assistant

Book chapter (2026) - Luuk P.A. Simons, Pradeep K. Murukannaiah, Mark A. Neerincx

Dementia is one of the most pressing health problems in the world. The good news is that it is far easier to prevent than to treat at an advanced stage. Over the years, a new narrative has come up: heart health = brain health. But its translation into healthcare interventions has been slow. In this design approach, we propose two empowerment options for patients, caregivers, and their health professionals. Firstly, we describe how cardiac health successes in enticing senior citizens to large lifestyle improvements may be used for treating early-stage dementia and cognitive decline. Biologically, this uses causality between blood pressure and cardiovascular health on the one hand and dementia outcomes on the other. Practically, it enables daily success feedback, which empowers patients in their health improvement experiments. Secondly, we describe and user-test an AI Health Research Assistant to extract the best available lifestyle findings from literature, to keep up with over 100,000 new health publications flooding us every year. Our user test highlights challenges and opportunities for a Health AI, especially regarding claim transparency, data quality, and risks of hallucinations. We suggest research metadata criteria to evaluate ambiguous or conflicting health science claims. ...

Exploring Human-AI Synergy for Complex Claim Verification

Journal article (2025) - Shubhalaxmi Mukherjee, Catholijn M. Jonker, Pradeep K. Murukannaiah

Combating widespread misinformation requires scalable and reliable fact-checking methods. Fact-checking involves several steps, including question generation, evidence retrieval, and veracity prediction. Importantly, fact-checking is well-suited to exploit hybrid intelligence since it requires both human expertise and AI’s large-scale information processing abilities. Thus, constructing an effective fact-checking pipeline requires a systematic understanding of the relative strengths and weaknesses of humans and AI in different steps of the fact-checking process. We investigate the ability of LLMs to perform the first step of the process, i.e., to generate pertinent questions for analyzing a claim. To evaluate the quality of the LLM-generated questions, we crowdsource a dataset in which 150 claims are annotated with questions (1) a novice fact-checker would ask and (2) a professional fact-checker would ask when fact-checking those claims. We study the effects of the human- and LLM-generated questions on evidence retrieval and veracity prediction. We find that LLMs are able to generate nuanced questions to verify a complex claim, but the final label prediction depends on the quality of the evidence corpus. However, the evidence collected by automated methods yields lower accuracy in the veracity prediction task than the evidence curated by experts. ...

Advanced RAG-LLM prototype AI on PubMed for Cardiac Health

Conference paper (2025) - L.P.A. Simons, P.K. Murukannaiah, B.S. Han, M.A. Neerincx

Healthy lifestyle behaviours are effective in preventing and treating cardiovascular disease. However, the growing body of scientific literature and the prevalence of conflicting studies make it challenging for healthcare practitioners and patients to stay informed. Large Language Models (LLMs), combined with Retrieval-Augmented Generation (RAG), enable automated claim verification and summarization. We enhanced RAG-LLM with extra modules and evaluated its performance. Inclusion-Criteria-based filtering of PubMed papers improved verdict performance. Next, for health claims (like ‘Berries reduce blood pressure’), PICO-based (Population, Intervention, Comparison, Outcome) paper mapping and summarization improved the transparency of the evidence used for verdict generation. Still, the RAG-LLM models we tested have biases towards positivity (too many foods deemed heart healthy) and neutrality (no clear direction). We discuss mechanisms at play and challenges on the route forward. ...

Gricean Norms as a Basis for Effective Collaboration

Conference paper (2025) - Fardin Saad, Pradeep K. Murukannaiah, Munindar P. Singh

Effective human-AI collaboration hinges not only on the AI agent's ability to follow explicit instructions but also on its capacity to navigate ambiguity, incompleteness, invalidity, and irrelevance in communication. Gricean conversational and inference norms facilitate collaboration by aligning unclear instructions with cooperative principles. We propose a normative framework that integrates Gricean norms and cognitive frameworks (common ground, relevance theory, and theory of mind) into large language model (LLM) based agents. The normative framework adopts the Gricean maxims of quantity, quality, relation, and manner, along with inference, as Gricean norms to interpret unclear instructions, that is, instructions that are ambiguous, incomplete, invalid, or irrelevant. Within this framework, we introduce Lamoids, GPT-4 powered agents designed to collaborate with humans. To assess the influence of Gricean norms in human-AI collaboration, we evaluate two versions of a Lamoid: one with norms and one without. In our experiments, a Lamoid collaborates with a human to achieve shared goals in a grid world (Doors, Keys, and Gems) by interpreting both clear and unclear natural language instructions. Our results reveal that the Lamoid with Gricean norms achieves higher task accuracy and generates clearer, more accurate, and contextually relevant responses than the Lamoid without norms. This improvement stems from the normative framework, which enhances the agent's pragmatic reasoning, fostering effective human-AI collaboration and enabling context-aware communication in LLM-based agents. ...

Value Preferences Estimation and Disambiguation in Hybrid Participatory Systems

Journal article (2025) - Enrico Liscio, Luciano C. Siebert, Catholijn M. Jonker, Pradeep K. Murukannaiah

Understanding citizens’ values in participatory systems is crucial for citizen-centric policy-making. We envision a hybrid participatory system where participants make choices and provide motivations for those choices, and AI agents estimate their value preferences by interacting with them. We focus on situations where a conflict is detected between participants’ choices and motivations, and propose methods for estimating value preferences while addressing detected inconsistencies by interacting with the participants. We operationalize the philosophical stance that “valuing is deliberatively consequential.” That is, if a participant’s choice is based on a deliberation of value preferences, the value preferences can be observed in the motivation the participant provides for the choice. Thus, we propose and compare value preferences estimation methods that prioritize the values estimated from motivations over the values estimated from choices alone. Then, we introduce a disambiguation strategy that combines Natural Language Processing and Active Learning to address the detected inconsistencies between choices and motivations. We evaluate the proposed methods on a dataset of a large-scale survey on energy transition. The results show that explicitly addressing inconsistencies between choices and motivations improves the estimation of an individual’s value preferences. The disambiguation strategy does not show substantial improvements when compared to similar baselines—however, we discuss how the novelty of the approach can open new research avenues and propose improvements to address the current limitations. ...

Value-Sensitive Disagreement Analysis for Online Deliberation

Conference paper (2024) - Michiel Van Der Meer, Piek Vossen, Catholijn M. Jonker, Pradeep K. Murukannaiah

Disagreements are common in online societal deliberation and may be crucial for effective collaboration, for instance in helping users understand opposing viewpoints. Although there exist automated methods for recognizing disagreement, a deeper understanding of factors that influence disagreement is currently missing. We investigate a hypothesis that differences in personal values influence disagreement in online discussions. Using Large Language Models (LLMs) for estimating both profiles of personal values and disagreement, we conduct a large-scale experiment involving 11.4M user comments. We find that the dissimilarity of value profiles correlates with disagreement only in specific cases, and that incorporating self-reported value profiles makes these results less conclusive. ...

From large language models to small logic programs

Building global explanations from disagreeing local post-hoc explainers

Journal article (2024) - Andrea Agiollo, Luciano Cavalcante Siebert, Pradeep K. Murukannaiah, Andrea Omicini

The expressive power and effectiveness of large language models (LLMs) is going to increasingly push intelligent agents towards sub-symbolic models for natural language processing (NLP) tasks in human–agent interaction. However, LLMs are characterised by a performance vs. transparency trade-off that hinders their applicability to such sensitive scenarios. This is the main reason behind many approaches focusing on local post-hoc explanations, recently proposed by the XAI community in the NLP realm. However, to the best of our knowledge, a thorough comparison among available explainability techniques is currently missing, as well as approaches for constructing global post-hoc explanations leveraging the local information. This is why we propose a novel framework for comparing state-of-the-art local post-hoc explanation mechanisms and for extracting logic programs surrogating LLMs. Our experiments—over a wide variety of text classification tasks—show how most local post-hoc explainers are loosely correlated, highlighting substantial discrepancies in their results. By relying on the proposed novel framework, we also show how it is possible to extract faithful and efficient global explanations for the original LLM over multiple tasks, enabling explainable and resource-friendly AI techniques. ...

Aggregating value systems for decision support

Journal article (2024) - Roger X. Lera-Leri, Enrico Liscio, Filippo Bistaffa, Catholijn M. Jonker, Maite Lopez-Sanchez, Pradeep K. Murukannaiah, Juan A. Rodriguez-Aguilar, Francisco Salas-Molina

We adopt an emerging and prominent vision of human-centred Artificial Intelligence that requires building trustworthy intelligent systems. Such systems should be capable of dealing with the challenges of an interconnected, globalised world by handling plurality and by abiding by human values. Within this vision, pluralistic value alignment is a core problem for AI, that is, the challenge of creating AI systems that align with a set of diverse individual value systems. So far, most literature on value alignment has considered alignment to a single value system. To address this research gap, we propose a novel method for estimating and aggregating multiple individual value systems. We rely on recent results in the social choice literature and formalise the value system aggregation problem as an optimisation problem. We then cast this problem as an ℓ_p-regression problem. Doing so provides a principled and general theoretical framework to model and solve the aggregation problem. Our aggregation method allows us to consider a range of ethical principles, from utilitarian (maximum utility) to egalitarian (maximum fairness). We illustrate the aggregation of value systems by considering real-world data from two case studies: the Participatory Value Evaluation process and the European Values Study. Our experimental evaluation shows how different consensus value systems can be obtained depending on the ethical principle of choice, leading to practical insights for a decision-maker on how to perform value system aggregation. ...
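
To give a feel for how the choice of p changes an ℓ_p consensus, the sketch below aggregates a single scalar preference per individual. It is a deliberately simplified illustration, not the paper's formulation: the scalar setting, the function names `lp_cost` and `aggregate`, the ternary-search solver, and the sample values are all assumptions. It relies only on the standard facts that p = 1 minimizes the total disagreement (yielding a median) and p = ∞ minimizes the worst-off individual's disagreement (yielding the midrange).

```python
import math


def lp_cost(x, values, p):
    """l_p disagreement between consensus x and the individual values."""
    if math.isinf(p):
        return max(abs(x - v) for v in values)
    return sum(abs(x - v) ** p for v in values)


def aggregate(values, p, iters=200):
    """Minimize the (convex, for p >= 1) l_p cost by ternary search."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if lp_cost(m1, values, p) < lp_cost(m2, values, p):
            hi = m2  # a minimizer lies in [lo, m2]
        else:
            lo = m1  # a minimizer lies in [m1, hi]
    return (lo + hi) / 2


# Hypothetical preferences of three individuals for one value, in [0, 1].
prefs = [0.0, 0.2, 1.0]
for p in (1, 2, math.inf):
    print(p, round(aggregate(prefs, p), 4))
```

With these inputs, the consensus moves from the median (0.2) at p = 1, through the mean (0.4) at p = 2, to the midrange (0.5) at p = ∞, which shows how increasing p shifts weight toward the individual who disagrees most.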

A hybrid intelligence method for argument mining

Journal article (2024) - Michiel Van Der Meer, Enrico Liscio, Catholijn M. Jonker, Aske Plaat, Piek Vossen, Pradeep K. Murukannaiah

Large-scale survey tools enable the collection of citizen feedback in opinion corpora. Extracting the key arguments from a large and noisy set of opinions helps in understanding the opinions quickly and accurately. Fully automated methods can extract arguments but (1) require large labeled datasets that induce large annotation costs and (2) work well for known viewpoints, but not for novel points of view. We propose HyEnA, a hybrid (human + AI) method for extracting arguments from opinionated texts, combining the speed of automated processing with the understanding and reasoning capabilities of humans. We evaluate HyEnA on three citizen feedback corpora. We find that, on the one hand, HyEnA achieves higher coverage and precision than a state-of-the-art automated method when compared to a common set of diverse opinions, justifying the need for human insight. On the other hand, HyEnA requires less human effort and does not compromise quality compared to (fully manual) expert analysis, demonstrating the benefit of combining human and artificial intelligence. ...

Designing and Evaluating an LLM-based Health AI Research Assistant for Hypertension Self-Management

Using Health Claims Metadata Criteria

Conference paper (2024) - L.P.A. Simons, P.K. Murukannaiah, M.A. Neerincx

Hypertension is a condition affecting most people over 45 years old. Health Self-Management offers many opportunities for prevention and cure. However, most scientific health literature is unknown to health professionals and/or patients. About 200,000 new scientific papers on cardiovascular health appear per year, which is too many for a human to read. Hence, an LLM-based Health AI research assistant was developed for mining scientific literature on blood pressure and food. A user evaluation was conducted with n=8 participants who had just completed an intensive lifestyle intervention for blood pressure self-management. They highlighted several challenges and opportunities for a Health AI, especially regarding claim transparency, data quality, and risks of hallucinations. In the discussion, we propose seven criteria using metadata and information characteristics to help evaluate ambiguous or conflicting health science claims. ...

Annotator-Centric Active Learning for Subjective NLP Tasks

Conference paper (2024) - Michiel van der Meer, Neele Falk, Pradeep K. Murukannaiah, Enrico Liscio

Active Learning (AL) addresses the high costs of collecting human annotations by strategically annotating the most informative samples. However, for subjective NLP tasks, incorporating a wide range of perspectives in the annotation process is crucial to capture the variability in human judgments. We introduce Annotator-Centric Active Learning (ACAL), which incorporates an annotator selection strategy following data sampling. Our objective is two-fold: (1) to efficiently approximate the full diversity of human judgments, and (2) to assess model performance using annotator-centric metrics, which value minority and majority perspectives equally. We experiment with multiple annotator selection strategies across seven subjective NLP tasks, employing both traditional and novel, human-centered evaluation metrics. Our findings indicate that ACAL improves data efficiency and excels in annotator-centric performance evaluations. However, its success depends on the availability of a sufficiently large and diverse pool of annotators to sample from. ...

Reflective Hybrid Intelligence for Meaningful Human Control in Decision-Support Systems

Book chapter (2024) - C.M. Jonker, L. Cavalcante Siebert, P.K. Murukannaiah

With the growing capabilities and pervasiveness of AI systems, societies must collectively choose between reduced human autonomy, endangered democracies and limited human rights, and AI that is aligned to human and social values, nurturing collaboration, resilience, knowledge and ethical behaviour. In this chapter, we introduce the notion of self-reflective AI systems for meaningful human control over AI systems. Focusing on decision support systems, we propose a framework that integrates knowledge from psychology and philosophy with formal reasoning methods and machine learning approaches to create AI systems responsive to human values and social norms. We also propose a possible research approach to design and develop self-reflective capability in AI systems. Finally, we argue that self-reflective AI systems can lead to self-reflective hybrid systems (human + AI), thus increasing meaningful human control and empowering human moral reasoning by providing comprehensible information and insights on possible human moral blind spots. ...

An Empirical Analysis of Diversity in Argument Summarization

Conference paper (2024) - Michiel van der Meer, Catholijn M. Jonker, Piek Vossen, Pradeep K. Murukannaiah

Presenting high-level arguments is a crucial task for fostering participation in online societal discussions. Current argument summarization approaches miss an important facet of this task, capturing diversity, which is important for accommodating multiple perspectives. We introduce three aspects of diversity: those of opinions, annotators, and sources. We evaluate approaches to a popular argument summarization task called Key Point Analysis (KPA), showing how these approaches struggle to (1) represent arguments shared by few people, (2) deal with data from various sources, and (3) align with subjectivity in human-provided annotations. We find that both general-purpose LLMs and dedicated KPA models exhibit this behavior, but have complementary strengths. Further, we observe that diversification of training data may ameliorate generalization. Addressing diversity in argument summarization requires a mix of strategies to deal with subjectivity. ...

The Quarrel of Local Post-hoc Explainers for Moral Values Classification in Natural Language Processing

Conference paper (2023) - Andrea Agiollo, Luciano Cavalcante Siebert, Pradeep Kumar Murukannaiah, Andrea Omicini

Although popular and effective, large language models (LLMs) are characterised by a performance vs. transparency trade-off that hinders their applicability to sensitive scenarios. This is the main reason behind many approaches focusing on local post-hoc explanations recently proposed by the XAI community. However, to the best of our knowledge, a thorough comparison among available explainability techniques is currently missing, mainly owing to the lack of a general metric to measure their benefits. We compare state-of-the-art local post-hoc explanation mechanisms for models trained over moral value classification tasks based on a measure of correlation. By relying on a novel framework for comparing global impact scores, our experiments show how most local post-hoc explainers are loosely correlated, and highlight huge discrepancies in their results, their “quarrel” about explanations. Finally, we compare the impact scores distribution obtained from each local post-hoc explainer with human-made dictionaries, and point out that there is no correlation between explanation outputs and the concepts humans consider as salient. ...

Democratic Wireless Channel Assignment

Fair Resource Allocation in Wi-Fi Networks

Journal article (2023) - Ivan Marsa Maestre, Jose Manuel Gimenez-Guzman, Marino Tejedor Romero, Enrique de la Hoz, Pradeep Murukannaiah

User experience is the ultimate quality of service criterion for modern WLAN networks. However, network configuration approaches are mainly network-centric. We envision a paradigm shift, empowering users in network management. We study how automated negotiation and collective intelligence can support the democratic configuration of a wireless network, leveraging client and provider interests. This new paradigm allows for flexible network configuration, which enables better exploitation of resources considering the clients’ real usage and needs, and a fair distribution of throughput among users. ...