S. Fitrianie | TU Delft Repository

Establishing Reference Points for Artificial Social Agent Evaluation: The ASAQ Representative Set 2025

Journal article (2026) - S. Fitrianie, A. Abdulrahman, Merijn Bruijnes, W.P. Brinkman

)), and intelligence (e.g., the Stanford-Binet Intelligence Scales (Roid and Pomplun, 2012) and a normative dataset (Stevens and Bernier, 2021)). Closer to home, in the evaluation of software, the System Usability Scale (SUS) (Brooke, 1996) also comes with a representative data set (Lewis and Sauro, 2018).

Creating a benchmark set to go along with the ASA Questionnaire (ASAQ) (Fitrianie et al., 2025b,c) allows us to benchmark people's experience with an ASA. This experience is measured on 24 constructs and dimensions covering an extensive part of our community's shared interests, such as the believability, likeability, and sociability of an ASA. The ASAQ was published alongside the norm set "ASAQ representative set 2024", which includes the experiences of 1,066 individuals with 29 agents. That set is based on a third-person perspective, i.e., filling out a questionnaire after seeing a video of someone else interacting with an agent. Although this was pragmatic for validating the questionnaire, the ASAQ authors also acknowledge that the set may be limited in how well it generalizes to experiences based on actual interaction (Fitrianie et al., 2025b).

A key question when developing a benchmark is what it should consist of: which people should be included in the sample, and which agents? For the ASAQ representative set 2024, the research platform Prolific was used, which allows data collection across the world. To use this platform for developing our benchmark set, we needed to know which publicly available agents have a global reach and a sizeable user group. Therefore, our first step in building the benchmark set was to survey contemporary ASA usage. We recruited participants for this study through the crowd-sourcing platform Prolific between November 30 and December 19, 2023.
For this, we applied the following inclusion criteria. Eligible participants were those who: (1) had not taken part in prior ASAQ validation studies, (2) had a Prolific approval rate above 95%, and (3) were proficient in English. Recruitment spanned multiple time zones, staggered in six-hour intervals to obtain a global distribution of participants. The study consisted of two sequential phases: (1) screening the population for familiarity with contemporary ASAs, and (2) establishing the ASAQ Representative Set 2025. For this study, we received approval from the university's Human Research Ethics Committee (no. 2685, dated 13 January 2023), preregistered the study (Fitrianie et al., 2023), and made the analysis script and data publicly available (Fitrianie et al., 2025a). We compensated participants according to Prolific's payment guidelines.

To develop a benchmark set based on individuals' interaction experiences with widely known agents, we started by creating an initial agent list using input from the OSF working group on Artificial Social Agent Evaluation Instrument. Twelve workgroup members from all over the world brainstormed on popular and widely used ASAs, selecting agents that people of various age groups and locations might have interacted with at home. This resulted in a pre-selection of 11 agents, namely: Amazon's Alexa, Google's Bard chatbot, Microsoft's Bing chatbot, OpenAI's ChatGPT, Microsoft's CoPilot, Android's Google Assistant, IKEA's customer service chatbot, the Replika chatbot, Apple's Siri, iRobot's Roomba vacuum cleaner, and Microsoft's Xiaoice. To further diversify the agent group, we included a dog, asking some participants to complete the questionnaire based on their interactions with a dog. Furthermore, with an eye on the future, we also incorporated an online version of the classic Eliza chatbot (Weizenbaum, 1966), which makes it possible to expose people to the same agent in the future.
Finally, we included a non-existent agent, "Xonderfloip," as a distractor check, resulting in a list of 14 agents. Participants were asked to indicate the timing of their last interaction with the agents, with options ranging from "today" to "never."

Of the 1,296 individuals initially recruited, 1,253 participants responded "never" to interactions with the distractor agent, meeting the criteria for inclusion in the subsequent phase of the study.

To allow people to compare their agent with agents in the benchmark set, we aimed for a statistical power of 0.80 to detect at least a medium-sized effect in future independent t-tests with an alpha level of 0.05 (Cohen, 1992). Consequently, the benchmark set required a minimum of 64 samples per agent. To ensure participants had interacted with the agents recently, we only included agents used within the last six months, narrowing the agent group from 14 to 10. In addition to the Eliza chatbot and the dog, we selected the following agents: Alexa, Bard, Bing, ChatGPT, CoPilot, Google Assistant, Roomba, and Siri. Participants were assigned to evaluate a single agent they were familiar with, or to interact with the Eliza chatbot for five minutes before the assessment to establish their own interaction experience with this agent. Exclusion criteria in this phase were: (1) failing more than 20% of attention checks; (2) providing incoherent responses to open-ended questions (e.g., unintelligible or nonsensical answers, or indicating no interaction with the assigned ASA); and (3) completing fewer than 10 dialogue turns for those assigned to the Eliza chatbot. Each participant was allowed to participate only once, with only their first completion included in the analysis.

Out of 1,253 available participants, we invited 777 individuals, continuing until 666 participants met the inclusion criteria (per agent: M = 66, SD = 1, range = [64 .. 68]).
Among the exclusions, 47 participants did not complete the study with their assigned ASA, five failed attention checks (providing [3 .. 7] incorrect answers out of 10), and one was removed due to an open-ended response indicating no interaction with the assigned agent. Additionally, 58 participants assigned to the Eliza chatbot were excluded for completing fewer than 10 dialogue turns. We also asked participants to describe, in their own words, their experiences with the ASA to which they were assigned, for use in future research.

The resulting dataset included participants from the two phases: Phase 1 (n = 1,253) and a subset of these participants in Phase 2 (n = 666). The majority of participants identified as male (Phase 1: 54.5%; Phase 2: 57.8%), followed by female (Phase 1: 44.9%; Phase 2: 41.9%), with a small proportion identifying as other (Phase 1: 0.6%; Phase 2: 0.3%). The mean age was similar across both phases (Phase 1: M = 30, SD = 9.2; Phase 2: M = 29.8, SD = 9.2), with the largest age groups being 18-25 (Phase 1: 38.9%; Phase 2: 39.8%) and 26-35 (Phase 1 and Phase 2: 39.6%). Education levels were comparable between groups, with the highest proportions holding an undergraduate degree (Phase 1 and Phase 2: 41.4%) or a graduate degree (Phase 1: 25%; Phase 2: 23.9%). Socioeconomic status, assessed via the MacArthur Scale (Adler et al., 2000) (1 = lowest, 10 = highest), was distributed across the scale, with the largest proportions in the middle ranges (e.g., at level 6, Phase 1: 25.9%; Phase 2: 28.4%). Geographically (based on the United Nations Regional Groups (United Nations, 2024)), most participants resided in Western Europe (Phase 1: 46.8%; Phase 2: 42.9%), followed by Africa (Phase 1: 21.1%; Phase 2: 22.8%) and Eastern Europe (Phase 1: 18.2%; Phase 2: 20.1%).
Smaller proportions were from Latin America and the Caribbean (Phase 1: 11.7%; Phase 2: 12.2%), with limited representation from the United States (Phase 1: 1.2%; Phase 2: 0.6%) and other regions. Users of this dataset might select sub-datasets based on these characteristics to study specific groups.

Table 1 provides an overview of participant interactions with 12 ASAs and a dog. ChatGPT emerged as the most widely used agent, with 89.47% of the 1,253 participants reporting interactions. Google Assistant (85.08%) and Siri (71.51%) also showed high usage rates. In contrast, less commonly used agents included Replika (10.45%), Xiaoice (7.98%), and Eliza (2.23%).

Among the ASAs, ChatGPT and Google Assistant exhibited the highest proportions of recent interactions ("today" and "this week"), reflecting their integration into daily life. For instance, 295 participants reported last interacting with ChatGPT "today" and 362 "this week". As anticipated, agents such as Eliza showed minimal recent interaction, with the large majority of participants (1,225) reporting never having engaged with Eliza.

The study generated a representative set of nine ASAs and a dog, collecting ratings from 666 unique participants on the 90 first-person perspective items of the ASAQ. Sample sizes per agent ranged from 64 to 68.

Analysis of the ASAQ long version revealed variability in the ASAQ scores across agents, ranging from -30 (Eliza) to +30 (the dog). The data set, which details the scores of the ASAs on each of the 24 constructs and dimensions of the ASAQ, is publicly accessible online (Fitrianie et al., 2025a).

The ASAQ constructs and overall item content remained consistent with the ASAQ representative set 2024; the only difference is the participants' point of view, with the 2024 set collected from a third-person perspective (watching a video of a human-ASA interaction) and the 2025 set from a first-person perspective (interacting directly with an ASA).
Items reflect the relevant perspective (e.g., "The user can rely on [the agent]" vs. "I can rely on [the agent]"). The ASAQ construct and dimension scores, derived from both the long and short versions of the ASAQ, for all agents in the Representative Set 2025 are provided in the Supplemental Data accompanying this article (see Supplementary Material, Tables S1-S4).

The ASAQ Representative Set 2025 extends the previously established ASAQ representative set 2024, offering an enhanced resource for researchers. The dataset highlights the varying experiences people have in direct interaction with well-known agents. The reported use of contemporary ASAs (e.g., ChatGPT, Google Assistant, and Siri) demonstrates how rapidly conversational agents have become embedded in daily life. The inclusion of a non-artificial social agent (a dog) adds depth to the dataset, allowing for comparisons with other social experiences. Additionally, the variability in ASAQ scores, ranging from -30 for Eliza to +30 for the dog, provides anchor points against which researchers can compare their own ASA when using the ASAQ. Furthermore, the dataset allows results on each ASAQ construct or dimension to be ranked relative to the agents included in the ASAQ Representative Set. To facilitate analysis, researchers can utilise ASAQ charts, which offer a clear, at-a-glance visualisation of their ASA's scores across all 24 constructs/dimensions, enabling direct comparisons with the representative ASAs. This resource promotes robust and standardised reporting in studies focused on human-agent interactions, which advances methodological consistency in the field. With the dataset presented here, it is possible to create similar guidelines for the first-person perspective use of the ASAQ.

Two limitations of this dataset should be noted.
First, apart from Eliza, participants evaluated ASAs based on their most recent interaction, which relies on recall and may introduce bias due to differences in time since use, ASA version, and interaction context. Second, participants were recruited through Prolific ...

Table 1. Summary of usage of the 13 ASAs by participants who took part between November 30 and December 13, 2023 (n = 1,253). Reported are the percentage of participants with any use of each ASA and when this use last occurred. We present the ASAQ score only for the ASAs we measured (n = 666).

Phase ...
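
The sample-size target stated in the abstract above (at least 64 participants per agent for a power of 0.80, a medium effect, and an alpha of 0.05 in an independent t-test) can be approximately reproduced with a few lines of Python. This is a minimal sketch, not the authors' analysis script: the function name `n_per_group` is hypothetical, the medium effect is assumed to be Cohen's d = 0.5, the test is assumed to be two-sided, and the calculation uses a normal approximation with a commonly used small-sample correction term rather than the exact noncentral t distribution.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-sided independent t-test.

    Assumption: normal approximation plus the correction term z_alpha^2 / 4;
    an exact calculation would use the noncentral t distribution instead.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    n_normal = 2 * (z_alpha + z_power) ** 2 / effect_size ** 2
    correction = z_alpha ** 2 / 4                # compensates for using z instead of t
    return math.ceil(n_normal + correction)


if __name__ == "__main__":
    # Assumed medium effect (Cohen's d = 0.5), alpha = 0.05, power = 0.80.
    print(n_per_group(0.5))  # prints 64
```

Under these assumptions the result matches the minimum of 64 samples per agent reported above. For small effect sizes the approximation can deviate by a participant or so from published tables, so a dedicated power-analysis package is the safer choice when planning a real study.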


The Artificial Social Agent Questionnaire (ASAQ) — Development and evaluation of a validated instrument for capturing human interaction experiences with artificial social agents

Journal article (2025) - Siska Fitrianie, Merijn Bruijnes, Amal Abdulrahman, Willem Paul Brinkman

Validating claims and replicating findings on the impact of artificial social agents (ASA), such as virtual agents, conversational agents, and social robots, requires a standardised measurement instrument that researchers can employ in different settings and for various agents. Such an instrument would allow researchers to evaluate their agents and establish insights beyond their specific study context. Therefore, we present the long and short versions of the ASA questionnaire (ASAQ) for evaluating human-ASA interaction on 19 constructs, such as the agent's believability, sociability, and coherence. It has been developed over multiple years by an international workgroup of more than 100 ASA researchers, who identified community-relevant constructs and associated questionnaire items and examined the questionnaire's reliability, validity, and interpretability. The result is a questionnaire that can capture more than 80% of the constructs that studies in the intelligent virtual agent community investigate, with acceptable levels of reliability, content validity, construct validity, and cross-validity. We suggest that ASA researchers use the ASAQ short version to report their agent's psychographic information and the ASAQ long version to analyse in depth any constructs that are specifically relevant to their agent or study. Finally, this paper gives instructions for practical use, such as sample size estimations, and how to interpret and present results. ...

Intelligent Mathematical Tutor Based on ChatGPT or DeepSeek

Conference paper (2025) - C. Rothkrantz, S. Fitrianie, L. Rothkrantz

Many secondary school students in the Netherlands require additional support to prepare for their mathematics exams. This paper presents the design of a Massive Open Online Course (MOOC) featuring training materials tailored for the mathematics exams, integrated with a digital intelligent tutor. The tutor, which students can activate as needed, is embedded within the course content. Two versions of the tutor were developed: one powered by ChatGPT and the other by DeepSeek. To evaluate their effectiveness, both versions were tested on a set of problems from W4Kangoeroe, an annual mathematics competition for primary and secondary school students. The results indicate that while both chatbots are capable of generating correct mathematical solutions, their approaches often differ significantly from those provided by experienced mathematics teachers. This highlights the potential of AI-driven tutors to assist students but also underscores the need for further refinement to align more closely with educational standards and teaching methodologies. ...

Fusing Trajectories of Exploring Agents in a Crisis Environment using DeepSeek

Conference paper (2025) - Leon Rothkrantz, Siska Fitrianie

During a natural disaster, when roads are damaged or blocked, rescue agents search the area to find new routes from start to destination. Their trajectories are sent to a crisis center and merged into a new map. The DeepSeek and ChatGPT algorithms help build this map by combining the agents' explored routes. This paper presents the algorithm and its application. ...

A MOOC for exam training mathematics using intelligent tutoring

Conference paper (2024) - Cyril Rothkrantz, Siska Fitrianie, Leon Rothkrantz

Many high school students require costly private lessons to prepare for their final exams. These lessons go beyond what regular schools can provide. As a proof of concept, a distance learning program for exam preparation has been created using Moodle. It offers study materials that incorporate gaming and real-life applications. Each lesson begins with a diagnostic test and follows a specific training program to address students' weaknesses. An AI-based intelligent tutoring system has been developed to take on the role of a teacher and assist students with their work. After the system analyzes errors, the tutor provides specific hints and guidelines to help students solve problems accurately. The developed study material was tested on a group of students preparing for their final exams in mathematics. ...

Corrigendum

Mandarin Chinese translation of the Artificial-Social-Agent questionnaire instrument for evaluating human-agent interaction (Frontiers in Computer Science, (2023), 5, (1149305), 10.3389/fcomp.2023.1149305)

Journal article (2024) - Fengxiang Li, Siska Fitrianie, Merijn Bruijnes, Amal Abdulrahman, Fu Guo, Willem Paul Brinkman

In the published article, there was an error in Table 5. For every second construct/dimension, the means are swapped between the Chinese and English data, which is caused by an error in the underlying R script. Consequently, the plus and minus signs for the delta and CI values are also wrong. The corrected Table 5 and its caption appear below.

Construct/dimension rating difference between mixed-international English-speaking and Chinese mother-tongue groups. Δ Scores are pairwise differences between the Chinese mother-tongue cultural background and the mixed-international cultural background, taken from the posterior distribution. M, mean; SD, standard deviation; CI, credible interval.

The authors apologize for this error and state that this does not change the scientific conclusions of the article in any way. The original article has been updated. ...

On Head Motion for Recognizing Aggression and Negative Affect during Speaking and Listening

Conference paper (2023) - Siska Fitrianie, Iulia Lefter

Affective aggression is a form of aggression characterized by impulsive reactions driven by strong negative emotions. Despite the extensive research in the area of automatic emotion recognition, affective aggression is a phenomenon that has received less attention. This study investigates the use of head motion as a potential indicator of affective aggression and negative affect. It provides an analysis of head movement patterns associated with various levels of aggression, valence, arousal and dominance, and compares behaviors and recognition performance under speaking and listening conditions. The study was conducted on the Negative Affect and Aggression database, a multimodal corpus of dyadic interactions between aggression regulation training actors and non-actors, annotated for levels of aggression, valence, arousal, and dominance. Results demonstrate that head motion features can serve as promising indicators of affect during both speaking and listening. Valence and arousal prediction achieved better performance during speaking, while aggression and dominance were better predicted during listening. Significant increases in the magnitude of pitch angular acceleration were associated with escalation along all four annotated dimensions. Interestingly, higher escalation was accompanied by a significant increase in the total number of movements during speaking, whereas the number of movements decreased significantly as escalation increased during listening intervals. These findings are particularly relevant because head motion can be used on its own or as a supplementary modality when other modalities such as speech or facial expressions are unavailable or altered. ...

Mandarin Chinese translation of the Artificial-Social-Agent questionnaire instrument for evaluating human-agent interaction

Journal article (2023) - Fengxiang Li, S. Fitrianie, Merijn Bruijnes, A. Abdulrahman, Fu Guo, W.P. Brinkman

The Artificial-Social-Agent (ASA) questionnaire is an instrument for evaluating human-ASA interaction. It consists of 19 constructs and related dimensions measured by either 24 questionnaire items (short version) or 90 questionnaire items (long version). The questionnaire was built and validated by a research community effort to make evaluation results more comparable between agents and findings more generalizable. The current questionnaire is in English, which limits its use to only a population with an adequate command of the English language. Translating the questionnaire into more languages allows for the inclusion of other populations and the possibility of comparing them. Therefore, this paper presents a Mandarin Chinese translation of the questionnaire. After three construction cycles that included forward and backward translation, we gave both the final version of the translated and original English questionnaire to 242 bilingual crowd-workers to evaluate 14 ASAs. Results show on average a good level of correlation on the construct/dimension level (ICC M = 0.79, SD = 0.09, range [0.61, 0.95]) and on the item level (ICC M = 0.62, SD = 0.14, range [0.19, 0.92]) between the two languages for the long version, and for the short version (ICC M = 0.66, SD = 0.12, range [0.41, 0.92]). The analysis also established correction values for converting questionnaire item scores between Chinese and English questionnaires. Moreover, we also found systematic differences in English questionnaire scores between the bilingual sample and a previously collected mixed-international English-speaking sample. We hope this and the Chinese questionnaire translation will motivate researchers to study human-ASA interaction among a Chinese literate population and to study cultural similarities and differences in this area. ...

The artificial-social-agent questionnaire

Establishing the long and short questionnaire versions

Conference paper (2022) - Siska Fitrianie, Merijn Bruijnes, Fengxiang Li, Amal Abdulrahman, Willem Paul Brinkman

We present the ASA Questionnaire, an instrument for evaluating human interaction with an artificial social agent (ASA), resulting from multi-year efforts involving more than 100 Intelligent Virtual Agent (IVA) researchers worldwide. It has 19 measurement constructs constituted by 90 items, which capture more than 80% of the constructs identified in empirical studies published in the IVA conference from 2013 to 2018. This paper reports on a construct validity analysis, specifically the convergent and discriminant validity of an initial 131 instrument items, which involved 532 crowd-workers who were asked to rate human interaction with 14 different ASAs. The analysis included several factor analysis models and resulted in the selection of 90 items for inclusion in the long version of the ASA questionnaire. In addition, a representative item of each construct or dimension was selected to create a 24-item short version of the ASA questionnaire. Whereas the long version is suitable for a comprehensive evaluation of human-ASA interaction, the short version allows quick analysis and description of the interaction with the ASA. To support reporting ASA questionnaire results, we also put forward an ASA chart. The chart provides a quick overview of the agent profile. ...

Questionnaire Items for Evaluating Artificial Social Agents - Expert Generated, Content Validated and Reliability Analysed

Conference paper (2021) - Siska Fitrianie, Merijn Bruijnes, Fengxiang Li, Willem Paul Brinkman

In this paper, we report on the multi-year Intelligent Virtual Agents (IVA) community effort, involving more than 90 researchers worldwide, researching the IVA community interests and practice in evaluating human interaction with an artificial social agent (ASA). The joint efforts have previously generated a unified set of 19 constructs that capture more than 80% of constructs used in empirical studies published in the IVA conference from 2013 to 2018. In this paper, we present 131 expert-content-validated questionnaire items for the constructs and their dimensions, and investigate the level of reliability. We establish this in three phases. Firstly, eight experts generated 431 potential construct items. Secondly, 20 experts rated whether items measure (only) their intended construct, resulting in 207 content-validated items. Next, a reliability analysis was conducted, involving 192 crowd-workers who were asked to rate a human interaction with an ASA, which resulted in 131 items (about 5 items per measurement, with Cronbach's alpha ranging from .60 to .87). These are the starting points for the questionnaire instrument of human-ASA interaction. ...

Factors Affecting User’s Behavioral Intention and Use of a Mobile-Phone-Delivered Cognitive Behavioral Therapy for Insomnia

A Small-Scale UTAUT Analysis

Journal article (2021) - Siska Fitrianie, Corine Horsch, Robbert Jan Beun, Fiemke Griffioen-Both, Willem Paul Brinkman

A mobile app could be a powerful medium for providing individual support for cognitive behavioral therapy (CBT), as well as facilitating therapy adherence. Little is known about factors that may explain the acceptance and uptake of such applications. This study, therefore, examines factors from an extended version of the Unified Theory of Acceptance and Use of Technology (UTAUT2) model to explain variation between people’s behavioral intention to use a CBT for insomnia (CBT-I) app and their use-behavior. The model includes eight aspects of behavioral intention: performance expectancy, effort expectancy, social influence, self-efficacy, trust, hedonic motivation, anxiety, and facilitating conditions, and investigates further the influence of the behavioral intention and facilitating conditions on app-usage behavior. Data were gathered from a field trial involving people (n = 89) with relatively mild insomnia using a CBT-I app. The analysis applied the Partial Least Squares-Structural Equation Modeling method. The results found that performance expectancy, effort expectancy, social influence, self-efficacy, trust, and facilitating conditions all explained part of the variation in behavioral intention, but not beyond the explanation provided by hedonic motivation, which accounted for R² = 0.61. Both behavioral intention and facilitating conditions could explain the use-behavior (R² = 0.32). We anticipate that the findings will help researchers and developers to focus on: (1) users’ positive feelings about the app as this was an indicator of their acceptance of the mobile app and usage; and (2) the availability of resources and support as this also correlated with the technology use. ...

The 19 Unifying Questionnaire Constructs of Artificial Social Agents

An IVA Community Analysis

Conference paper (2020) - Siska Fitrianie, Merijn Bruijnes, Deborah Richards, Andrea Bönsch, Willem Paul Brinkman

In this paper, we report on the multi-year Intelligent Virtual Agents (IVA) community effort, involving more than 80 researchers worldwide, researching the IVA community interests and practises in evaluating human interaction with an artificial social agent (ASA). The effort is driven by previous IVA workshops and plenary IVA discussions related to the methodological crisis on the evaluation of ASAs. A previous literature review showed a continuous practise of creating new questionnaires instead of reusing validated questionnaires. We address this issue by examining questionnaire measurement constructs used in empirical studies published in the IVA conference from 2013 to 2018. We identified 189 constructs used in 89 questionnaires that are reported across 81 studies. Although these constructs have different names, they often measure the same thing. In this paper, we, therefore, present a unifying set of 19 constructs that captures more than 80% of the 189 constructs initially identified. We established this set in two steps. First, 49 researchers classified the constructs in broad theoretically based categories. Next, 23 researchers grouped the constructs in each category on their similarity. The resulting 19 groups form a unifying set of constructs, which will be the basis for the future questionnaire instrument of human-ASA interaction. ...

Information System Supporting the Management of a Flooding Crisis in the City of Prague

Conference paper (2020) - Leon J.M. Rothkrantz, Siska Fitrianie

In this paper we present an information system improving situational awareness, communication and management during a flooding crisis. The system is based on the JADE agent framework and blackboard-like functionality, which enables rescue workers and services to improve communication, increase context awareness and activate rescue services. Observers in the crisis field, each modelled as an agent, report their observations using an icon-based crisis app on a smartphone. A prototype has been implemented and tested in field experiments. ...

What are we measuring anyway?

A literature survey of questionnaires used in studies reported in the intelligent virtual agent conferences

Conference paper (2019) - Siska Fitrianie, Merijn Bruijnes, Deborah Richards, A. Abdulrahman, Willem Paul Brinkman

Research into artificial social agents aims at constructing these agents and at establishing an empirically grounded understanding of them, their interaction with humans, and how they can ultimately deliver certain outcomes in areas such as health, entertainment, and education. Key for establishing such understanding is the community's ability to describe and replicate their observations on how users perceive and interact with their agents. In this paper, we address this ability by examining questionnaires and their constructs used in empirical studies reported in the intelligent virtual agent conference proceedings from 2013 to 2018. The literature survey shows the identification of 189 constructs used in 89 questionnaires that were reported across 81 papers. We found unexpectedly little repeated use of questionnaires as the vast majority of questionnaires (more than 76%) were only reported in a single paper. We expect that this finding will motivate joint effort by the IVA community towards creating a unified measurement instrument. ...

What are we measuring anyway? A literature survey of questionnaires used in studies reported in the intelligent virtual agent conferences

Conference paper (2019) - Merijn Bruijnes, Siska Fitrianie, Deborah Richards, A. Abdulrahman, Willem-Paul Brinkman

Research into artificial social agents aims at constructing these agents and at establishing an empirically grounded understanding of them, their interaction with humans, and how they can ultimately deliver certain outcomes in areas such as health, entertainment, and education. Key for establishing such understanding is the community’s ability to describe and replicate their observations on how users perceive and interact with their agents. In this paper, we address this ability by examining questionnaires and their constructs used in empirical studies reported in the intelligent virtual agent conference proceedings from 2013 to 2018. The literature survey shows the identification of 189 constructs used in 89 questionnaires that were reported across 81 papers. We found unexpectedly little repeated use of questionnaires as the vast majority of questionnaires (more than 76%) were only reported in a single paper. We expect that this finding will motivate joint effort by the IVA community towards creating a unified measurement instrument and in the broader AI community a renewed interest in replicability of our (user) studies. ...

The Multimodal Dataset of Negative Affect and Aggression

A Validation Study

Conference paper (2018) - Iulia Lefter, Siska Fitrianie

Within the affective computing and social signal processing communities, increasing efforts are being made in order to collect data with genuine (emotional) content. When it comes to negative emotions and even aggression, ethical and privacy related issues prevent the usage of many emotion elicitation methods, and most often actors are employed to act out different scenarios. Moreover, for most databases, emotional arousal is not explicitly checked, and the footage is annotated by external raters based on observable behavior. In the attempt to gather data a step closer to real-life, previous work proposed an elicitation method for collecting the database of negative affect and aggression that involved unscripted role-plays between aggression regulation training actors (actors) and naive participants (students), where only short role descriptions and goals are given to the participants. In this paper we present a validation study for the database of negative affect and aggression by investigating whether the actors' behavior (e.g. becoming more aggressive) had a real impact on the students' emotional arousal. We found significant changes in the students' heart rate variability (HRV) parameters corresponding to changes in aggression level and emotional states of the actors, and therefore conclude that this method can be considered as a good candidate for emotion elicitation. ...

Mobile Phone-Delivered Cognitive Behavioral Therapy for Insomnia

A Randomized Waitlist Controlled Trial

Journal article (2017) - Corine Horsch, Jaap Lancee, Fiemke Griffioen-Both, Sandor Spruit, Siska Fitrianie, Mark A. Neerincx, Robbert Jan Beun, Willem-Paul Brinkman

Background: This study is one of the first randomized controlled trials investigating cognitive behavioral therapy for insomnia (CBT-I) delivered by a fully automated mobile phone app. Such an app can potentially increase the accessibility of insomnia treatment for the 10% of people who have insomnia. Objective: The objective of our study was to investigate the efficacy of CBT-I delivered via the Sleepcare mobile phone app, compared with a waitlist control group, in a randomized controlled trial. Methods: We recruited participants in the Netherlands with relatively mild insomnia disorder. After answering an online pretest questionnaire, they were randomly assigned to the app (n=74) or the waitlist condition (n=77). The app packaged a sleep diary, a relaxation exercise, sleep restriction exercise, and sleep hygiene and education. The app was fully automated and adjusted itself to a participant’s progress. Program duration was 6 to 7 weeks, after which participants received posttest measurements and a 3-month follow-up. The participants in the waitlist condition received the app after they completed the posttest questionnaire. The measurements consisted of questionnaires and 7-day online diaries. The questionnaires measured insomnia severity, dysfunctional beliefs about sleep, and anxiety and depression symptoms. The diary measured sleep variables such as sleep efficiency. We performed multilevel analyses to study the interaction effects between time and condition. Results: The results showed significant interaction effects (P<.01) favoring the app condition on the primary outcome measures of insomnia severity (d=–0.66) and sleep efficiency (d=0.71). Overall, these improvements were also retained in a 3-month follow-up. Conclusions: This study demonstrated the efficacy of a fully automated mobile phone app in the treatment of relatively mild insomnia. The effects were in the range of what is found for Web-based treatment in general. 
This supports the applicability of such technical tools in the treatment of insomnia. Future work should examine the generalizability to a more diverse population. Furthermore, the separate components of such an app should be investigated. It remains to be seen how this app can best be integrated into the current health regimens ...

Talk and Tools

The best of both worlds in mobile user interfaces for E-coaching

Journal article (2017) - Robbert Jan Beun, Siska Fitrianie, Fiemke Griffioen-Both, Sandor Spruit, Corine Horsch, Jaap Lancee, Willem-Paul Brinkman

In this paper, a user interface paradigm, called Talk-and-Tools, is presented for automated e-coaching. The paradigm is based on the idea that people interact in two ways with their environment: symbolically and physically. The main goal is to show how the paradigm can be applied in the design of interactive systems that offer an acceptable coaching process. As a proof of concept, an e-coaching system is implemented that supports an insomnia therapy on a smartphone. A human coach was replaced by a cooperative virtual coach that is able to interact with a human coachee. In the interface of the system, we distinguish between a set of personalized conversations (“Talk”) and specialized modules that form a coherent structure of input and output facilities (“Tools”). Conversations contained a minimum of variation to exclude unpredictable behavior but included the necessary mechanisms for variation to offer personalized consults and support. A variety of system and user tests was conducted to validate the use of the system. After a 6-week therapy, some users spontaneously reported the experience of building a relationship with the e-coach. It is concluded that the addition of a conversational component fills an important gap in the design of current mobile systems. ...

Improving Adherence in Automated e-Coaching

A Case from Insomnia Therapy

Conference paper (2016) - RJ Beun, Willem-Paul Brinkman, Siska Fitrianie, Fiemke Griffioen-Both, Corine Horsch, Jaap Lancee, Sandor Spruit

Non-adherence is considered a problem that seriously undermines the outcome of behavior change therapies, in particular of self-help therapies delivered without human interference. This paper presents the design rationale behind a computer system in the domain of adherence enhancing strategies in automated e-coaching. A variety of persuasive strategies is introduced and implemented in a mobile e-coaching system in the domain of insomnia therapy. The system integrates two types of interface elements, i.e. dedicated tools and natural language conversation, to simplify therapy related activities and to include social strategies to improve motivation. We focus on the crucial role of communication and adaptation. ...

Towards community-based co-creation

Conference paper (2013) - A Huldtgren, CA Detweiler, HEB AL-Ers, S Fitrianie, NA Guldemond