M. Izadi
35 records found
Human-AI experience in integrated development environments
A systematic literature review
Developer Interaction Patterns with Proactive AI
A Five-Day Field Study
HyperSeq
A Hyper-Adaptive Representation for Predictive Sequencing of States
In the rapidly evolving world of software development, the surge in developers’ reliance on AI-driven tools has transformed Integrated Development Environments into powerhouses of advanced features. This transformation, while boosting developers’ productivity to unprecedented levels, comes with a catch: increased hardware demands for software development. Moreover, the significant economic and environmental toll of using these sophisticated models necessitates mechanisms that reduce unnecessary computational burdens. We propose HyperSeq - Hyper-Adaptive Representation for Predictive Sequencing of States - a novel, resource-efficient approach designed to model developers’ cognitive states. HyperSeq facilitates precise action sequencing and enables real-time learning of user behavior. Our preliminary results show how HyperSeq excels in forecasting action sequences and achieves remarkable prediction accuracies that go beyond 70%. Notably, the model’s online-learning capability allows it to substantially enhance its predictive accuracy in a majority of cases and increases its capability in forecasting next user actions with sufficient iterations for adaptation. Ultimately, our objective is to harness these predictions to refine and elevate the user experience dynamically within the IDE.
The Impact of Generative AI on Creativity in Software Development
A Research Agenda
As GenAI becomes embedded in developer toolchains and practices, and routine code is increasingly generated, human creativity will be increasingly important for generating competitive advantage. This article uses the McLuhan tetrad alongside scenarios of how GenAI may disrupt software development more broadly, to identify potential impacts GenAI may have on creativity within software development. The impacts are discussed along with a future research agenda comprising five connected themes that consider how individual capabilities, team capabilities, the product, unintended consequences, and society can be affected.
When People Come First
A Human-Centered Approach to Computer Science Education
The rise of AI tools is reshaping computer science education, shifting the focus from coding skills to teaching students how to effectively use these technologies. Understanding students' mental models and fostering computational and metacognitive skills are now essential, as over-reliance on AI can weaken critical thinking. This panel explores how a human-centered approach can balance these challenges, sharing strategies to optimize learning while addressing the risks of cognitive offloading in an AI-driven world.
Benchmarking AI Models in Software Engineering
A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality
Benchmarks are essential for unified evaluation and reproducibility. The rapid rise of Artificial Intelligence for Software Engineering (AI4SE) has produced numerous benchmarks for tasks such as code generation and bug repair. However, this proliferation has led to major challenges: (1) fragmented knowledge across tasks, (2) difficulty in selecting contextually relevant benchmarks, (3) lack of standardization in benchmark creation, and (4) flaws that limit utility. Addressing these requires a dual approach: systematically mapping existing benchmarks for informed selection and defining unified guidelines for robust, adaptable benchmark development. We conduct a review of 247 studies, identifying 273 AI4SE benchmarks since 2014. We categorize them, analyze limitations, and expose gaps in current practices. Building on these insights, we introduce BenchScout, an extensible semantic search tool for locating suitable benchmarks. BenchScout employs automated clustering with contextual embeddings of benchmark-related studies, followed by dimensionality reduction. In a user study with 22 participants, BenchScout achieved usability, effectiveness, and intuitiveness scores of 4.5, 4.0, and 4.1 out of 5. To improve benchmarking standards, we propose BenchFrame, a unified approach to improve benchmark quality. Applying BenchFrame to HumanEval yielded HumanEvalNext, which features corrected errors, improved language conversion, higher test coverage, and greater difficulty. Evaluating 10 state-of-the-art code models on HumanEval, HumanEvalPlus, and HumanEvalNext revealed average pass-at-1 drops of 31.22% and 19.94%, respectively, underscoring the need for continuous benchmark refinement. We further examine BenchFrame's scalability through an agentic pipeline and confirm its generalizability on the MBPP dataset. Lastly, we publicly release the material of our review, user study, and the enhanced benchmark (https://github.com/AISE-TUDelft/AI4SE-benchmarks).
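The abstract reports pass-at-1 figures without saying how they were computed. As a rough illustration only, the sketch below uses the widely used unbiased pass@k estimator, which is an assumption here and not necessarily the authors' procedure. The function name `pass_at_k` and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of generated samples (assumed setup, not from the abstract)
    c: number of those samples that pass all tests
    k: number of samples the user is allowed to draw
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples, 3 of which pass.
example = pass_at_k(n=10, c=3, k=1)
print(f"pass@1 = {example:.2f}")
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n, so a benchmark-level pass-at-1 is simply the mean of that fraction over all problems.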
Prompt-with-Me
In-IDE Structured Prompt Management for LLM-Driven Software Engineering
Generative AI in Software Engineering Must Be Human-Centered
The Copenhagen Manifesto
Our analysis revealed that every dataset we examined contained license inconsistencies, despite being selected based on their associated repository licenses. We analyzed a total of 514 million code files, discovering 38 million exact duplicates present in our strong copyleft dataset. Additionally, we examined 171 million file-leading comments, identifying 16 million with strong copyleft licenses and another 11 million comments that discouraged copying without explicitly mentioning a license. Based on the findings of our study, which highlights the pervasive issue of license inconsistencies in large language models trained on code, our recommendation for both researchers and the community is to prioritize the development and adoption of best practices for dataset creation and management.
Correction to
The potential of an adaptive computerized dynamic assessment tutor in diagnosing and assessing learners’ listening comprehension (Education and Information Technologies, (2024), 29, 3, (3637-3661), 10.1007/s10639-023-11871-w)
In the PDF of this article, the pages were incorrectly numbered as ‘2303–2327’ when they should have been ‘3637–3661’. The page range was correct in the HTML version of the article. The original article has been corrected.
In-IDE Human-AI Experience in the Era of Large Language Models
A Literature Review
We conducted a literature review to study the current state of in-IDE Human-AI Experience research, bridging a gap in understanding the nuanced interactions between programmers and AI assistants within IDEs. By analyzing 36 selected papers, our study illustrates three primary research branches: Design, Impact, and Quality of Interaction.
The trends, challenges, and opportunities identified in this paper emphasize the evolving landscape of software development and inform future directions for research and development in this dynamic field. Specifically, we invite the community to investigate three aspects of these interactions: designing task-specific user interfaces, building trust, and improving readability.
In today’s environment of growing class sizes due to the prevalence of online and e-learning systems, providing one-to-one instruction and feedback has become a challenging task for teachers. However, the dialectical integration of instruction and assessment into a seamless and dynamic activity can provide a continuous flow of assessment information for teachers to boost and individualize learning. In this regard, adaptive learning technology is one way to facilitate teacher-supported learning and personalize curriculum and learning experiences. This study aimed to investigate the potential of an adaptive Computerized Dynamic Assessment (C-DA) tool applicable as a language diagnostician and assistant. The study tried to get insight into 75 Iranian EFL learners’ listening development by focusing on the learning potential exhibited through learners’ assessment and the degree of internalization of mediation. To this end, a C-DA tutor was developed comprising two dynamic listening comprehension tests of 20 items each, arranged in order of difficulty. Test takers unable to answer an item correctly were provided with graduated hints for different comprehension- and production-type items, and the overall difficulty level of the test was adapted to the test takers’ proficiency level. In order to have a full diagnosis of each individual’s listening development, the adaptive C-DA automatically generated five test scores on each learner’s performance: actual (unmediated) score, mediated score, gain score, Learning Potential Score (LPS), and transfer score. The results of paired-sample t-tests revealed a significant development from the actual to the mediated scores. Furthermore, the LPSs indicated that the tutor was capable of revealing learners’ potential for learning. Moreover, learners with high LPSs achieved the highest mean transfer scores, followed by learners with medium and low LPSs.
The results of Mann-Whitney tests revealed a significant difference in the degree of internalization of mediation of learners with mid and low range of LPSs on the easy test and high and low range of LPSs on the difficult test. The findings of this research can have important theoretical and practical implications for researchers and educationalists. The instructional value of this adaptive C-DA tool lies in its unique opportunities for individualizing learning and developing individual learning plans in accordance with learners’ needs.
Code comments are a key resource for information about software artefacts. Depending on the use case, only some types of comments are useful. Thus, automatic approaches to classify these comments have been proposed. In this work, we address this need by proposing STACC, a set of SentenceTransformers-based binary classifiers. These lightweight classifiers are trained and tested on the NLBSE Code Comment Classification tool competition dataset, and surpass the baseline by a significant margin, achieving an average F1 score of 0.74 against the baseline of 0.31, which is an improvement of 139%. A replication package, as well as the models themselves, are publicly available.
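The 139% figure follows from the two F1 scores quoted in the abstract if it is read as a relative improvement over the baseline. The short check below makes that arithmetic explicit; the helper name `relative_improvement` is hypothetical and only serves the illustration.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Return the improvement of `new` over `baseline` as a percentage."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100


# F1 scores as reported in the abstract: 0.74 (STACC) vs 0.31 (baseline).
improvement = relative_improvement(0.74, 0.31)
print(f"{improvement:.0f}%")  # prints "139%"
```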