W.P. Brinkman | TU Delft Repository

Training Child Helpline Counselors with Value-Integrated Chat Simulations

Journal article (2026) - M. Al Owayyed, W.P. Brinkman, Kathleen Guan, Loes Keijsers, M.L. Tielman

Children’s helplines train new counselors to adapt to children’s needs and values. This training typically involves roleplay, which can be resource-intensive. Interactive agents offer a promising alternative; yet, simulation-based training systems rarely model how personal values influence decision-making. We present a value-integrated belief–desire–intention (BDI) model that simulates virtual children whose behavior is guided by underlying values. The trainees’ task is to apply motivational interviewing to recognize and align with the child’s values. We conducted a between-subjects experiment (N = 193) comparing three conditions: a base BDI virtual child, a BDI virtual child with integrated values, and one with both integrated values and explanatory feedback on value-based reasoning. Results showed credible support that integrating values improves participants’ opportunities to align with a virtual child and enhances their situational awareness based on a child’s values. We also found some support that feedback improved value recognition and perceived usefulness. Additionally, integrating values improved believability and overall experience. These findings suggest that the proposed values-based model enables more targeted training, which we anticipate will better prepare counselors for value-sensitive conversations. ...

Designing and Evaluating Digital Mental Health Interventions

Scoping Review

Review (2026) - Sarah Zainab Mbawa, Roelof Anne Jelle de Vries, Luciano Cavalcante Siebert, Koen van Turnhout, Willem-Paul Brinkman

Background: The ongoing adoption and use of digital interventions offer promising opportunities to meet the growing demand for mental health support. The effectiveness, implementation, and usage of these interventions depend on how well they are designed and evaluated. However, given the emerging nature of design research in this area, there is still no clear consensus on the specific principles and guidelines for developing digital mental health interventions (DMHIs). There seems to be a lack of clarity regarding the best practices for designing and evaluating these tools. Objective: We aimed to investigate and report on the design principles and evaluation approaches used in digital interventions specific to mental health care. Additionally, we sought to outline how these principles and approaches are applied in research. Methods: This scoping review was conducted in accordance with the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for scoping reviews. The literature search was performed in 2 electronic databases, SCOPUS and Web of Science, across 3 iterations from January 2024 to January 2025. A total of 2 independent reviewers screened and selected papers based on predefined inclusion and exclusion criteria, followed by data extraction from the selected studies. The data were then synthesized by categorizing the papers according to the primary research aim of each study. The inclusion criteria covered studies involving populations with mental health challenges or users of DMHIs, any digital tools for mental health care, and principles or strategies related to the design, evaluation, or implementation of DMHIs. Results: Our search identified 401 papers, of which 17 met the inclusion criteria for this review. Among these, 11 focused on evaluation studies, while 6 covered both design and evaluation studies (mixed). 
An iterative user-centered development process, expert inclusion, usability testing, specification of design elements, and user tracking and feedback were identified as common design principles used in studies focused on DMHIs. Evaluation approaches were shaped by the evaluation goal, which influenced the chosen methodologies. We also summarize the recommendations for implementation highlighted in some studies. Based on our findings, we propose 8 guidelines emphasizing stakeholder involvement in the development process and the need for clear justifications for design decisions, among other considerations. Conclusions: Design principles used in DMHI development include user-centered development, expert inclusion, and usability testing, while evaluation approaches often rely on randomized controlled trials to assess efficacy. Qualitative and mixed-method approaches are commonly adopted by studies to capture user experience and bridge both process and outcome measures. We recommend that future research explicitly report its design justification and adopt a multiperspective approach in the research and design of DMHIs. ...

Establishing Reference Points for Artificial Social Agent Evaluation: The ASAQ Representative Set 2025

Journal article (2026) - S. Fitrianie, A. Abdulrahman, Merijn Bruijnes, W.P. Brinkman

)), and intelligence (e.g., the Stanford-Binet Intelligence Scales (Roid and Pomplun, 2012) and a normative dataset (Stevens and Bernier, 2021)). Closer to home, when it comes to the evaluation of software, the System Usability Scale (SUS) (Brooke, 1996) also comes with a representative data set (Lewis and Sauro, 2018).
Creating a benchmark set to go along with the ASA Questionnaire (ASAQ) (Fitrianie et al., 2025b,c) allows us to benchmark people's experience with an ASA. This experience is measured on 24 constructs and dimensions covering an extensive part of our community's shared interests, such as the believability, likeability, and sociability of an ASA. The ASAQ was published alongside the norm set "ASAQ representative set 2024", which includes the experience of 1066 individuals with 29 agents. That set is based on a third-person perspective, i.e., filling out a questionnaire after seeing a video of someone else interacting with an agent. Although this is pragmatic for validating the questionnaire, the ASAQ authors also acknowledge possible limitations of this set in generalizing to experiences based on actual interaction (Fitrianie et al., 2025b).
A key question when developing a benchmark is what should constitute the benchmark: which people should be included in the sample, and which agents? For the ASAQ representative set 2024, the research platform Prolific was used, which allows data collection across the world. To use this platform to develop our benchmark set, we needed to know which publicly available agents have a global reach and a sizeable user group. Therefore, our first step in building the benchmark set was to survey contemporary ASA usage. We recruited participants for this study through the crowd-sourcing platform Prolific between November 30 and December 19, 2023. 
For this, we applied the following inclusion criteria: eligible participants were those who (1) had not taken part in prior ASAQ validation studies, (2) had a Prolific approval rate above 95%, and (3) were proficient in English. Recruitment spanned multiple time zones, with a staggered approach in six-hour intervals to obtain a global participant distribution. The study consisted of two sequential phases: (1) screening the population for familiarity with contemporary ASAs, and (2) establishing the ASAQ Representative Set 2025. For this study, we received approval from the university human research ethics committee (no. 2685, dated 13 January 2023), preregistered the study (Fitrianie et al., 2023), and made the analysis script and data publicly available (Fitrianie et al., 2025a). We compensated participants according to Prolific's payment guidelines.
To develop a benchmark set based on individuals' interaction experiences with widely known agents, we started by creating an initial agent list using input from the OSF working group on Artificial Social Agent Evaluation Instrument. Twelve workgroup members from all over the world brainstormed on popular and widely used ASAs, selecting agents that a variety of people (e.g., across age groups and locations) might have interacted with at home. This resulted in a pre-selection of 11 agents, namely: Amazon's Alexa, Google's Bard chatbot, Microsoft's Bing chatbot, OpenAI's ChatGPT, Microsoft's CoPilot, Android's Google Assistant, IKEA's customer service chatbot, Replika chatbot, Apple's Siri, iRobot's Roomba vacuum cleaner, and Microsoft's Xiaoice. To further diversify the agent group, we included a dog, asking some participants to complete the questionnaire based on their interactions with a dog. Furthermore, with an eye on the future, we also incorporated an online version of the classic Eliza chatbot (Weizenbaum, 1966), making it possible to expose people in the future to the same agent. 
Finally, we included a non-existent agent, "Xonderfloip," as a distractor check, resulting in a list of 14 agents. Participants were asked to indicate the timing of their last interaction with the agents, with options ranging from "today" to "never."
Of the 1,296 individuals initially recruited, 1,253 participants responded "never" to interactions with the distractor agent, meeting the criteria for inclusion in the subsequent phase of the study.
To allow people to compare their agent with agents in the benchmark set, we aimed for a statistical power of 0.80 to detect at least a medium-sized effect in future independent t-tests with an alpha level of 0.05 (Cohen, 1992). Consequently, the benchmark set required a minimum of 64 samples per agent. To ensure participants had interacted with the agents recently, we only included agents used within the last six months, narrowing the agent group from 14 to 10. In addition to the Eliza chatbot and the dog, we selected the agents Alexa, Bard, Bing, ChatGPT, CoPilot, Google Assistant, Roomba, and Siri. Participants were assigned to evaluate a single agent they were familiar with, or to interact with the Eliza chatbot for five minutes before assessment to establish their own interaction experience with this agent. Exclusion criteria in this phase were: (1) failing more than 20% of attention checks; (2) providing incoherent responses to open-ended questions (e.g., unintelligible or nonsensical answers, or indicating no interaction with the assigned ASA); and (3) completing fewer than 10 dialogue turns for those assigned to the Eliza chatbot. Each participant was allowed to participate only once, with only their first completion included in the analysis.
Out of 1,253 available participants, we invited 777 individuals until we ended up with 666 participants who met the inclusion criteria (per agent: M = 66, SD = 1, range = [64 .. 68]). 
Among the exclusions, 47 participants did not complete the study with their assigned ASA, five failed attention checks (providing [3 .. 7] incorrect answers out of 10), and one was removed due to an open-ended response indicating no interaction with the assigned agent. Additionally, 58 participants assigned to the Eliza chatbot were excluded for completing fewer than 10 dialogue turns. We also asked participants to describe, in their own words, their experiences with the ASA to which they were assigned, for use in future research.
The resulting dataset included participants from the two phases: Phase 1 (n = 1,253) and a subset of these participants in Phase 2 (n = 666). The majority of participants identified as male (Phase 1: 54.5%; Phase 2: 57.8%), followed by female (Phase 1: 44.9%; Phase 2: 41.9%), with a small proportion identifying as other (Phase 1: 0.6%; Phase 2: 0.3%). The mean age was similar across both phases (Phase 1: M = 30, SD = 9.2; Phase 2: M = 29.8, SD = 9.2), with the largest age groups being 18-25 (Phase 1: 38.9%; Phase 2: 39.8%) and 26-35 (Phase 1 and Phase 2: 39.6%). Education levels were comparable between groups, with the highest proportions holding an undergraduate degree (Phase 1 and Phase 2: 41.4%) or a graduate degree (Phase 1: 25%; Phase 2: 23.9%). Socioeconomic status, assessed via the MacArthur Scale (Adler et al., 2000) (1 = lowest, 10 = highest), was distributed across the scale, with the largest proportions in the middle ranges (e.g., at level 6, Phase 1: 25.9%; Phase 2: 28.4%). Geographically (based on the United Nations Regional Groups (United Nations, 2024)), most participants resided in Western Europe (Phase 1: 46.8%; Phase 2: 42.9%), followed by Africa (Phase 1: 21.1%; Phase 2: 22.8%) and Eastern Europe (Phase 1: 18.2%; Phase 2: 20.1%). 
Smaller proportions were from Latin America and the Caribbean (Phase 1: 11.7%; Phase 2: 12.2%), with limited representation from the United States (Phase 1: 1.2%; Phase 2: 0.6%) and other regions. Users of this dataset might select sub-datasets based on these characteristics to study specific groups. Table 1 provides an overview of participant interactions with 12 ASAs and a dog. ChatGPT emerged as the most widely used agent, with 89.47% of 1,253 participants reporting interactions. Google Assistant (85.08%) and Siri (71.51%) also demonstrated high usage rates. In contrast, less commonly used agents included Replika (10.45%), Xiaoice (7.98%), and Eliza (2.23%).
Among the ASAs, ChatGPT and Google Assistant exhibited the highest proportions of recent interactions (today and this week), reflecting their integration into daily life. For instance, 295 participants interacted with ChatGPT today, and 362 this week. As anticipated, agents such as Eliza showed minimal recent interactions, with the majority of participants reporting never having engaged with them (1,225).
The study generated a representative set of nine ASAs and a dog, collecting 666 unique participant ratings on the 90 first-person perspective items of the ASAQ. Sample sizes per agent ranged from 64 to 68.
Analysis of the ASAQ long version revealed variability in the ASAQ scores across agents, ranging from -30 (Eliza) to +30 (the dog). The data set, showing a detailed presentation of the scores of the ASAs on each of the 24 constructs and dimensions of the ASAQ, can be accessed publicly online (Fitrianie et al., 2025a).
The ASAQ constructs and overall item content remained consistent with the ASAQ representative set 2024; the only difference is the participants' point of view, with the 2024 set collected from a third-person perspective (watching a video of a human-ASA interaction) and the 2025 set from a first-person perspective (interacting directly with an ASA). 
Items reflect the relevant perspective (e.g., "The user can rely on [the agent]" vs. "I can rely on [the agent]"). The ASAQ construct and dimension scores, derived from both the long and short versions of the ASAQ, for all agents in the Representative Set 2025 are provided in the Supplemental Data accompanying this article (see Supplementary Material, Table S1-S4).
The ASAQ Representative Set 2025 extends the previously established ASAQ representative set 2024, offering an enhanced resource for researchers. The dataset highlights the varying interaction experiences people have in direct interaction with well-known agents. The reported use of contemporary ASAs (e.g., ChatGPT, Google Assistant, and Siri) demonstrates how rapidly conversational agents have become embedded in daily life. The inclusion of a non-artificial social agent (a dog) adds depth to the dataset, allowing for comparisons to other social experiences. Additionally, the variability in ASAQ scores, ranging from -30 for Eliza to +30 for the dog, provides anchor points against which researchers can compare their own ASA when using the ASAQ. Furthermore, the dataset allows for the ranking of results across each ASAQ construct or dimension relative to the agents included in the ASAQ Representative Set. To facilitate analysis, researchers can utilise ASAQ charts, which offer a clear, at-a-glance visualisation of their ASA's scores across all 24 constructs/dimensions, enabling direct comparisons with the representative ASAs. This resource promotes robust and standardised reporting in studies focused on human-agent interactions, which advances methodological consistency in the field. With the dataset presented here, it is possible to create similar guidelines for the first-person perspective use of the ASAQ.
Two limitations of this dataset should be noted. 
First, apart from Eliza, participants evaluated ASAs based on their most recent interaction, which relies on recall and may introduce bias due to differences in time since use, ASA version, and interaction context. Second, participants were recruited through Prolific ...
Table 1. Summary of participants' usage of the 13 ASAs, surveyed between November 30 and December 13, 2023 (n = 1253): the percentage of participants reporting any use of each ASA, and when this use last occurred. We present the ASAQ score only for the ASAs we measured (n = 666). ...
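
The sample-size reasoning in this excerpt (power 0.80, medium effect, alpha 0.05, hence at least 64 per agent) and the participant flow (777 invited, 666 retained) can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' analysis script: it assumes that "medium" means Cohen's d = 0.5, uses a normal approximation with a commonly used small-sample correction term rather than an exact noncentral t computation, and the function and variable names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def per_group_sample_size(effect_size: float,
                          alpha: float = 0.05,
                          power: float = 0.80) -> int:
    """Approximate n per group for a two-sided independent-samples t-test.

    Assumption: normal approximation plus the correction term z_alpha^2 / 4,
    which brings the result close to exact t-based power tables.
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value, two-sided test
    z_power = z.inv_cdf(power)            # quantile for the desired power
    n_normal = 2 * (z_alpha + z_power) ** 2 / effect_size ** 2
    return ceil(n_normal + z_alpha ** 2 / 4)


# Assumed reading of "medium-sized effect": Cohen's d = 0.5.
n_per_agent = per_group_sample_size(0.5)

# Participant flow in Phase 2, using the counts reported in the text.
invited = 777
exclusions = {
    "did_not_complete": 47,
    "failed_attention_checks": 5,
    "no_interaction_reported": 1,
    "eliza_fewer_than_10_turns": 58,
}
included = invited - sum(exclusions.values())

print(n_per_agent, included)
```

Under these assumptions the approximation reproduces the stated minimum of 64 per agent, and the four exclusion counts account for the difference between the 777 invited and the 666 retained participants.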

Supporting adolescents’ mHealth needs

Qualitative and quantitative insights from a user survey of a mental health promoting app

Journal article (2026) - Esra Cemre Su de Groot, Lianne P. de Vries, Ujwal Gadiraju, Olya Kudina, Loes Keijsers, Manon H.J. Hillegers, Willem Paul Brinkman

While mental health apps can help to promote adolescents’ mental health, prevent mental health problems, and reduce symptoms, maintaining sufficient user engagement with these apps remains challenging. This is often caused by a mismatch between the needs and preferences of adolescents and what the apps offer. Therefore, we need a better understanding of (i) adolescents’ needs and preferences and (ii) potential differences based on user characteristics. To this end, we qualitatively and quantitatively analyzed a dataset describing the user experience of 1312 Dutch adolescents (12–25 years) from the general population after they interacted for several weeks with a gamified mHealth app (the Grow It! app) that aims to promote momentary emotional awareness, reflection, and adaptive coping. A total of 4833 free-text survey responses spanning five user experience survey questions were analyzed using an inductive and iterative coding process, while accounting for intercoder reliability. We used (i) a thematic analysis to identify adolescents’ needs and preferences related to the app, and (ii) an exploratory quantitative analysis of the subthemes to investigate potential differences in which needs and preferences were mentioned by adolescents based on demographics. Through our thematic analysis, we identified three overarching themes related to the app’s design: usability, psychological impact, and meaningful interactive features. Furthermore, we identified two overarching themes that related to the adolescents’ motivation to use the app: intrinsic (de)motivators, and social–environmental factors impacting usage. Each of these themes consisted of four subthemes. Our exploratory statistical analysis shed light on several differences in how frequently these subthemes were mentioned based on age, sex, and educational level. By synthesizing our insights, we identify five design implications that can help tailor future mHealth apps to adolescents’ needs and preferences. 
These include concrete suggestions to personalize self-monitoring, include actionable insights, align content with personal needs, implement meaningful interactive features (e.g., competitions, gamification, and social communication), and make apps appealing to the entire target group. ...

User Experiences With Digital Future-Self Interventions in the Contexts of Smoking and Physical Inactivity: Mixed Methods Multistudy Exploration

Journal article (2025) - Kristell M. Penfornis, N. Albers, W.P. Brinkman, M.A. Neerincx, Andrea W.M. Evers, Winifred A. Gebhardt, Eline Meijer

Background: Smoking and physical inactivity compromise health, especially in combination. Interventions to promote smoking cessation and increased physical activity (PA) often lack impact, especially in the long term. Digital future-self interventions (FSIs), which prompt individuals to imagine who they do and do not want to become (ie, their desired and undesired future selves), show promise in encouraging sustainable changes in both behaviors. However, knowledge of user experiences with digital FSIs is limited. A deeper understanding of these experiences could help optimize FSIs, enhancing their efficacy in supporting smoking cessation and increased PA sustainably. Objective: This study examined behavioral, cognitive, and affective experiences with digital FSIs focused on smoking, PA, or both. Potential differences in user experiences based on behavior (smoking vs PA), polarity (desired vs undesired future self), and modality (verbal vs visual description of future selves) were explored. Methods: Secondary analyses of quantitative and qualitative survey data from 3 studies using digital FSIs as a means to encourage smoking cessation or increase PA were conducted. In study 1, participants (N=144) thought about how it would be to complete the FSI. In studies 2 (N=447) and 3 (N=87), they completed an FSI. Each study highlighted different aspects of user experiences with FSIs, namely, behavioral (eg, time spent), cognitive (eg, mental effort exerted), or affective (eg, emotions) experiences. Quantitative and qualitative findings were integrated for a comprehensive interpretation. Results: Regarding behavioral experiences, participants completed future-self tasks promptly (mean 6.64, SD 8.30 minutes), spent less time completing the desired- versus undesired-future-self (P<.001; ηp²=0.227) and verbal versus visual (P=.03; ηp²=0.060; quantitative) tasks, and integrated the tasks into their lives (qualitative).
Despite tasks being preparatory and not actively encouraging behavior change, multiple participants reported implementing changes in their smoking or PA (qualitative). Regarding cognitive experiences, moderate effort (mean 5.85/10, SD 2.56) was exerted on the tasks regardless of behavior (P=.69; ηp²=0.002), modality (P=.45; ηp²=0.004), or polarity (P=.69; ηp²=0.002; quantitative). Experiences of task difficulty were inconsistent across studies, individuals, and tasks, although mental visualization and describing one’s future self using images were consistently reported as challenging (quantitative and qualitative). Future-self tasks were reported to prompt cognitive processes such as contemplating consequences of smoking and PA behavior (qualitative). Regarding affective experiences, desired- and undesired-future-self tasks elicited different emotions (P<.001; ηp²=0.630; quantitative). Desired-future-self tasks were perceived as enjoyable and happiness inducing, whereas undesired-future-self tasks were perceived as confronting and unpleasant, evoking feelings of sadness, fear, and anger (quantitative and qualitative). Conclusions: Digital FSIs appeared to be a time-efficient, feasible, and acceptable way of strengthening identities as a means to encourage smoking cessation and PA. Findings support continued implementation of digital FSIs, although further research is required to optimize their operationalization. Avenues in that regard are proposed and discussed.

A simulation-based training tool for child helpline counsellors

Abstract (2025) - M. Al Owayyed, M.L. Tielman, W.P. Brinkman

Lilobot: A Cognitive Conversational Agent to Train Counsellors at Children’s Helplines

Design and Initial Evaluation

Journal article (2025) - S.A. Grundmann, M. Al Owayyed, Merijn Bruijnes, Ellen Vroonhof, W.P. Brinkman

To equip new counsellors at a Dutch child helpline with the needed counselling skills, the helpline uses role-playing, a form of learning through simulation in which one counsellor-in-training portrays a child seeking help and the other portrays a counsellor. However, this process is time-intensive and logistically challenging, issues that a conversational agent could help address. In this paper, we propose an initial design for a computer agent that acts as a child help-seeker to be used in a role-play setting. Our agent, Lilobot, is based on a Belief-Desire-Intention (BDI) model to simulate the reasoning process of a child who is being bullied at school. Through interaction with Lilobot, counsellors-in-training can practise the Five Phase Model, a conversation strategy that underpins the helpline’s counselling principle of keeping conversations child-centred. We compared a training session with Lilobot to a text-based training, inviting experienced counsellors from the Dutch child helpline to participate in both sessions. We conducted pre- and post-measurement comparisons for both training sessions. Contrary to our expectations, the results show a decrease in counselling self-efficacy at post-measurement, particularly in Lilobot’s condition. Still, the counsellors’ qualitative feedback indicated that, with further development and refinements, they believed Lilobot could potentially serve as a useful supplementary tool for training new helpline counsellors. Our work also highlights three future research directions for training simulators in this domain: integrating emotions into the model, providing guided feedback to the counsellor, and incorporating Large Language Models (LLMs) into the conversations. ...

Controlled Yet Natural: A Hybrid BDI-LLM Conversational Agent for Child Helpline Training

Conference paper (2025) - M. Al Owayyed, A.A. Denga, W.P. Brinkman

Child helpline training often relies on human-led roleplay, which is both time- and resource-consuming. To address this, rule-based interactive agent simulations have been proposed to provide a structured training experience for new counsellors. However, these agents might suffer from limited language understanding and response variety. To overcome these limitations, we present a hybrid interactive agent that integrates Large Language Models (LLMs) into a rule-based Belief-Desire-Intention (BDI) framework, simulating more realistic virtual child chat conversations. This hybrid solution incorporates LLMs into three components: intent recognition, response generation, and a bypass mechanism. We evaluated the system through two studies: a script-based assessment comparing LLM-generated responses to human-crafted responses, and a within-subject experiment (N = 37) comparing the LLM-integrated agent with a rule-based version. The first study provided evidence that the three LLM components were non-inferior to human-crafted responses. In the second study, we found credible support for two hypotheses: participants perceived the LLM-integrated agent as more believable and reported more positive attitudes toward it than the rule-based agent. Additionally, although weaker, there was some support for increased engagement (posterior probability = 0.845, 95% HDI [-0.149, 0.465]). Our findings demonstrate the potential of integrating LLMs into rule-based systems, offering a promising direction for more flexible but controlled training systems. ...

Reinforcement learning for proposing smoking cessation activities that build competencies

Combining two worldviews in a virtual coach

Journal article (2025) - N. Albers, M.A. Neerincx, W.P. Brinkman

Background
Reaching personal goals typically requires building competencies (e.g., insights into personal strengths), but expert health professionals and non-expert clients often think differently about which competencies are needed. Just having a virtual coach advise activities for "expert-devised" competencies may not motivate clients to carry them out, while advising only "non-expert devised" activities may not result in all required competencies being built.

Methods
We integrated the client and health expert worldviews in our modeling method for informing the activity selection by a virtual coach: We created a pipeline to build a reinforcement learning model for proposing activities in the context of preparing for quitting smoking. This model considers smokers’ current and future levels for expert-devised competencies as well as their beliefs about the usefulness of different competencies when choosing activities. To train the model, we conducted a micro-randomized trial in which 542 smokers interacted with a virtual coach in five sessions spread over at least nine days and received a randomly chosen activity in each session. Using data from this study, we performed simulations to systematically assess the impact of the different model components on the competencies built by smokers. Moreover, we performed paired Bayesian t-tests to determine the effect of persuasive activities on smokers’ usefulness beliefs.

Results
Our simulations show that smokers’ current levels for the expert competencies and their usefulness beliefs are important to consider when building expert competencies. In fact, we saw improvements of up to 22% when considering current competencies, and an additional 13% when also accounting for usefulness beliefs. Furthermore, although we found credible evidence that persuasive activities changed smokers’ usefulness beliefs, the effects might be too small to contribute in an optimal strategy for building competencies.

Conclusion
The worldviews of both health experts and smokers are important to consider when proposing activities for preparing for quitting smoking. We have presented a reinforcement learning model that combines these worldviews, and we hope that our work can be an example of incorporating different worldviews in a reinforcement learning model for building competencies. Our code and dataset are publicly available.

Psychological, economic, and ethical factors in human feedback for a chatbot-based smoking cessation intervention

Journal article (2025) - N. Albers, Francisco S. Melo, M.A. Neerincx, O. Kudina, W.P. Brinkman

Integrating human support with chatbot-based behavior change interventions raises three challenges: (1) attuning the support to an individual’s state (e.g., motivation) for enhanced engagement, (2) limiting the use of the concerning human resources for enhanced efficiency, and (3) optimizing outcomes on ethical aspects (e.g., fairness). Therefore, we conducted a study in which 679 smokers and vapers had a 20% chance of receiving human feedback between five chatbot sessions. We find that having received feedback increases retention and effort spent on preparatory activities. However, analyzing a reinforcement learning (RL) model fit on the data shows there are also states where not providing feedback is better. Even this “standard” benefit-maximizing RL model is value-laden. It not only prioritizes people who would benefit most, but also those who are already doing well and want feedback. We show how four other ethical principles can be incorporated to favor other smoker subgroups, yet, interdependencies exist. ...

The Artificial Social Agent Questionnaire (ASAQ) — Development and evaluation of a validated instrument for capturing human interaction experiences with artificial social agents

Journal article (2025) - Siska Fitrianie, Merijn Bruijnes, Amal Abdulrahman, Willem Paul Brinkman

Validating claims and replicating findings on the impact of artificial social agents (ASA), such as virtual agents, conversational agents, and social robots, requires a standardised measurement instrument that researchers can employ in different settings and for various agents. Such an instrument would allow researchers to evaluate their agents and establish insights beyond their specific study context. Therefore, we present the long and short versions of the ASA questionnaire (ASAQ) for evaluating human-ASA interaction on 19 constructs, such as the agent's believability, sociability, and coherence. It has been developed by an international workgroup with more than 100 ASA-researchers over multiple years who identified community-relevant constructs and associated questionnaire items and examined the questionnaire's reliability, validity, and interpretability. The result is a questionnaire that can capture more than 80% of the constructs that studies in the intelligent virtual agent community investigate, with acceptable levels of reliability, content validity, construct validity, and cross-validity. We suggest that ASA-researchers use the ASAQ short version to report their agent's psychographic information and the ASAQ long version to analyse any constructs in-depth that are specifically relevant to their agent or study. Finally, this paper gives instructions for practical use, such as sample size estimations, and how to interpret and present results. ...

German and Dutch Translations of the Artificial-Social-Agent Questionnaire Instrument for Evaluating Human-Agent Interactions

Conference paper (2024) - N. Albers, Andrea Bönsch, Jonathan Ehret, B.A. Khodakov, W.P. Brinkman

Enabling the widespread utilization of the Artificial-Social-Agent (ASA) Questionnaire, a research instrument to comprehensively assess diverse ASA qualities while ensuring comparability, necessitates translations beyond the original English source language questionnaire. We thus present Dutch and German translations of the long and short versions of the ASA Questionnaire and describe the translation challenges we encountered. Summative assessments with 240 English-Dutch and 240 English-German bilingual participants show, on average, excellent correlations (Dutch ICC M = 0.82, SD = 0.07, range [0.58, 0.93]; German ICC M = 0.81, SD = 0.09, range [0.58, 0.94]) with the original long version on the construct and dimension level. Results for the short version show, on average, good correlations (Dutch ICC M = 0.65, SD = 0.12, range [0.39, 0.82]; German ICC M = 0.67, SD = 0.14, range [0.30, 0.91]). We hope these validated translations allow the Dutch and German-speaking populations to evaluate ASAs in their own language. ...

An Encore Abstract: Agent-based Social Skills Training Systems: The ARTES Architecture

Abstract (2024) - M. Al Owayyed, M.L. Tielman, Arno Hartholt, M.M. Specht, W.P. Brinkman

Workshop on Algorithmic Behavior Change Support

Journal article (2024) - Nele Albers, Amal Abdulrahman, Deborah Richards, Caroline Figueroa, Bibhas Chakraborty, Ananya Bhattacharjee, Linwei He, Mark A. Neerincx, Willem-Paul Brinkman, More authors...

To increase the effectiveness of behavior change applications, a large variety of algorithms has been developed to adapt what the applications offer, when, how, and with whom. Given the multitude of challenges related to the concept of algorithmic behavior change support, its development, evaluation, and impact on behavior change, this workshop aims to strengthen the community of people with diverse backgrounds (e.g., computer science, psychology, human-computer interaction) and roles in behavior change support (e.g., researcher, designer, practitioner). Combining keynotes of leading researchers with sessions in which individual workshop participants present their work and discuss problems with the audience, the workshop encouraged a lively exchange of ideas that benefits current and future research on algorithmic behavior change support. ...

Algorithmic Support for Health Behavior Change: A Scoping Review Protocol

Journal article (2024) - Diederik Heijbroek, Nele Albers, Willem-Paul Brinkman

A Cognitive Conversational Agent for Training Child Helpline Volunteers

Conference paper (2024) - Mohammed Al Owayyed, Alex Despan, Myrthe Tielman, Willem Paul Brinkman

Child helplines offer a safe and private space for children to share their thoughts and feelings with volunteers. However, training these volunteers to help can be both expensive and time-consuming. In this demo, we present Lilobot, a conversational agent designed to train volunteers for child helplines. Lilobot’s reasoning is based on the Belief-Desire-Intention (BDI) model, which simulates, for example, a bullied child who contacts the helpline through text. Users engage with Lilobot in a role-play format, taking on the volunteer’s role. Through this system, volunteers can practice applying the Five Phase Model, a conversational strategy helplines use. The training tool includes a trainer interface for monitoring and modifying Lilobot’s interactions. Trainers can also create new conversational scenarios through an authoring tool. An initial evaluation led to enhancements in Lilobot’s knowledge base and intent recognition, addressing the main issues encountered by participants. The components used to implement the system were Java Spring for the BDI model and the authoring tool, Rasa for Natural Language Understanding, PostgreSQL for the database, and Vue.js for the front-end. This tool aims to provide volunteers with consistent, interactive training, enhancing their counselling skills in a controlled environment. ...
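As a rough illustration of the Belief-Desire-Intention loop mentioned in this abstract, the following is a minimal sketch in Python. It is not Lilobot's implementation (which the abstract says uses Java Spring and Rasa); the class name `VirtualChild`, the belief names, the intent labels, the thresholds, and the replies are all hypothetical and chosen only to show how recognised counsellor intents could update beliefs, which in turn select a desire and a reply.

```python
from dataclasses import dataclass, field

# Hypothetical canned replies, one per desire (not taken from Lilobot).
REPLIES = {
    "be_heard": "You don't get it...",
    "share_problem": "Some kids at school keep picking on me.",
    "find_solution": "Maybe I could talk to my teacher about it.",
}


@dataclass
class VirtualChild:
    """Toy BDI agent; belief names and thresholds are illustrative assumptions."""

    beliefs: dict = field(default_factory=lambda: {
        "counsellor_listens": 0.2,  # how strongly the child feels listened to
        "problem_shared": 0.0,      # 1.0 once the bullying story has been told
    })

    def perceive(self, intent: str) -> None:
        # Update beliefs from the (already recognised) intent of the message.
        trust = self.beliefs["counsellor_listens"]
        if intent == "show_empathy":
            self.beliefs["counsellor_listens"] = min(1.0, trust + 0.4)
        elif intent == "give_advice_too_early":
            self.beliefs["counsellor_listens"] = max(0.0, trust - 0.3)

    def deliberate(self) -> str:
        # Select the active desire from the current beliefs.
        if self.beliefs["counsellor_listens"] < 0.5:
            return "be_heard"
        if self.beliefs["problem_shared"] < 1.0:
            return "share_problem"
        return "find_solution"

    def act(self, desire: str) -> str:
        # Commit to an intention: here simply a reply, plus a belief update.
        if desire == "share_problem":
            self.beliefs["problem_shared"] = 1.0
        return REPLIES[desire]

    def step(self, intent: str) -> str:
        self.perceive(intent)
        return self.act(self.deliberate())
```

In this toy version, a trainee who gives advice too early keeps the child in the `be_heard` state, while repeated empathic messages move the conversation towards sharing the problem and then towards a solution.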

Collaboratively Setting Daily Step Goals with a Virtual Coach: Using Reinforcement Learning to Personalize Initial Proposals

Conference paper (2024) - M. Dierikx, N. Albers, Bouke Scheltinga, W.P. Brinkman

Goal-setting is commonly used in behavior change applications for physical activity. However, for goals to be effective, they need to be tailored to a user’s situation (e.g., motivation, progress). One way to obtain such goals is a collaborative process in which a healthcare professional and client set a goal together, thus making use of the professional’s expertise and the client’s knowledge about their own situation. As healthcare professionals are not always available, we created a dialog with the virtual coach Steph to collaboratively set daily step goals. Since judgments in human decision-making processes are adjusted based on the starting point or anchor, the first step goal proposal Steph makes is likely to influence the user’s final goal and self-efficacy. Situational factors impacting physical activity (e.g., motivation, self-efficacy, available time) or how users process information (e.g., mood) may determine which initial proposals are most effective in getting users to reach their underlying previous activity-based recommended step goals. Using data from 117 people interacting with Steph for up to five days, we designed a reinforcement learning algorithm that considers users’ current and future situations when choosing an initial step goal proposal. Our simulations show that initial step goal proposals matter: choosing optimal ones based on this algorithm could make it more likely that people move to a situation with high motivation, high self-efficacy, and a favorable daily context. Then, they are more likely to achieve, but also to overachieve, their underlying recommended step goals. Our dataset is publicly available. ...
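To make the idea of weighing current and future situations concrete, here is a small value-iteration sketch in Python. It is not the algorithm or data from the paper: the two states, the two proposal types, the transition probabilities, the rewards, and the discount factor `GAMMA` are all made-up assumptions for illustration.

```python
# Hypothetical user situations and initial-proposal types.
STATES = ["low_motivation", "high_motivation"]
ACTIONS = ["modest_proposal", "ambitious_proposal"]

# P[state][action] -> list of (next_state, probability); invented numbers.
P = {
    "low_motivation": {
        "modest_proposal": [("high_motivation", 0.6), ("low_motivation", 0.4)],
        "ambitious_proposal": [("high_motivation", 0.2), ("low_motivation", 0.8)],
    },
    "high_motivation": {
        "modest_proposal": [("high_motivation", 0.7), ("low_motivation", 0.3)],
        "ambitious_proposal": [("high_motivation", 0.9), ("low_motivation", 0.1)],
    },
}

# Reward for arriving in a state (assumed: high motivation is desirable).
R = {"low_motivation": 0.0, "high_motivation": 1.0}
GAMMA = 0.85  # assumed discount factor weighting future situations


def q_value(state, action, values):
    """Expected immediate reward plus discounted value of the next situation."""
    return sum(p * (R[nxt] + GAMMA * values[nxt]) for nxt, p in P[state][action])


def value_iteration(tol=1e-8):
    """Repeat Bellman optimality updates until the values stop changing."""
    values = {s: 0.0 for s in STATES}
    while True:
        new_values = {s: max(q_value(s, a, values) for a in ACTIONS) for s in STATES}
        if max(abs(new_values[s] - values[s]) for s in STATES) < tol:
            return new_values
        values = new_values


def greedy_policy(values):
    """Pick, per state, the proposal with the highest expected long-term value."""
    return {s: max(ACTIONS, key=lambda a: q_value(s, a, values)) for s in STATES}


values = value_iteration()
policy = greedy_policy(values)
```

With these invented numbers, the resulting policy proposes a modest goal to users in the low-motivation state and an ambitious one to users in the high-motivation state; real transition and reward estimates would have to come from interaction data such as the dataset the abstract mentions.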

Corrigendum

Mandarin Chinese translation of the Artificial-Social-Agent questionnaire instrument for evaluating human-agent interaction (Frontiers in Computer Science, (2023), 5, (1149305), 10.3389/fcomp.2023.1149305)

Journal article (2024) - Fengxiang Li, Siska Fitrianie, Merijn Bruijnes, Amal Abdulrahman, Fu Guo, Willem Paul Brinkman

In the published article, there was an error in Table 5. For each second construct/dimension, the means are swapped between Chinese and English data, which is caused by an error in the underlying R script. Consequently, the plus and minus signs for the delta and CI values are also wrong. The corrected Table 5 and its caption appear below. Construct/dimension rating difference between mixed-international English-speaking and Chinese mother-tongue groups. Δ Score are pairwise differences between Chinese and mother-tongue cultural background and mixed-international cultural background taken from the posterior distribution. M, mean; SD, standard deviation; CI, credible interval. The authors apologize for this error and state that this does not change the scientific conclusions of the article in any way. The original article has been updated. ...

Technology-supported social skills training systems: A systematic literature review

Conference paper (2024) - Ding Ding, Pascal Remeijsen, Zian Song, M.A. Neerincx, W.P. Brinkman

Social interactions form an essential aspect of people’s lives; however, it is quite challenging for individuals to handle a wide range of social situations. Therefore, a variety of training systems have been developed to improve their skills. This literature review seeks to give an overview of the state of the art of technology-supported systems for social skills training. The studies eligible for inclusion described a technology-supported system with the purpose of training social skills and included an experimental or observational study to evaluate the efficacy of the system. 225 studies (224 publications) with 216 systems were identified, characterized, and analyzed in this literature review. Using the taxonomy as put forward in this study, the analysis shows that the majority of these systems were screen-based applications, with virtual reality technology being the most frequently observed. The systems most often targeted communication skills that focus on transferring information to produce greater understanding, i.e. mending general communication impairments in children with autism. In terms of functions, support for learning-by-doing was the most observed function, while focusing on job interviews provided the largest number of functions. Finally, the studies reported overwhelmingly positively regarding the systems’ impact, including 76 studies with a randomized controlled trial design. Still, most studies only used a quasi-experimental design based on self-report measures. We anticipate the proposed taxonomy to be a starting point for researchers to position their work and that the review will help them with gaining inspiration for the design and evaluation of social skills training systems. ...

Agent-based social skills training systems: the ARTES architecture, interaction characteristics, learning theories and future outlooks

Journal article (2024) - M. Al Owayyed, M.L. Tielman, Arno Hartholt, M.M. Specht, W.P. Brinkman

Agent-based training systems can enhance people's social skills. The effective development of these systems needs a comprehensive architecture that outlines their components and relationships. Such an architecture can pinpoint improvement areas and future outlooks. This paper presents ARTES: a general architecture illustrating how components of agent-based social training systems work together. We studied existing systems and architectures for training and tutoring to design ARTES and identify its essential components and interaction characteristics. ARTES comprises two core components: the agent simulation of social situations, and educational elements to provide guided learning. We link ARTES's crucial components to four primary learning theories (behaviourism, cognitivism, social cognitive theory, and constructivism) to illustrate the role of agent simulation and tutoring elements in establishing desired learning outcomes. Furthermore, we map ARTES's components against eight architectures, 43 systems and three tools to indicate the components' relevance, completeness, generalisation, and deployment potential across contexts. In addition to ARTES, the paper also contributes by identifying future improvements and research directions, such as the agent's thinking, tutoring methods, knowledge transfer, and ethical implications. We believe ARTES can help bridge the gap between virtual human simulations and impactful educational learning, offering training system developers desirable features like understandability and adaptability. ...