D. Spinellis

152 records found

Journal article (2026) - Diomidis Spinellis
Residential heating offers much low-hanging fruit for automation and optimization, especially when integrated into a wider Internet of Things (IoT) ecosystem. Typical appliances work in isolation providing minimal controls, such as a thermostat, or extending them into their own domain, for example with daily and weekly settings. Here, I describe how I integrated diverse appliances to obtain intelligence and value beyond that offered by their individual controllers. ...
Journal article (2026) - Diomidis Spinellis
Performance is a critical attribute of system software since even small improvements are amplified across the countless CPU instructions devoted to it. In the two previous installments of this column, I described how I ported the Unix sed stream editor1 from C into Rust2 and the system’s design.3 Here, I describe how I optimized its input/output (I/O) performance by exposing advanced operating system (OS) facilities as Rust abstractions. ...
Journal article (2025) - Diomidis Spinellis
The Unix sed stream editor is a programmable text processing filter, first written in C in the 1970s.1 In the previous installment of this column, I described the tool and how I reimplemented it in Rust with some help from generative AI.2 Here, I describe the new implementation’s design to show how we can create a simple programmable tool. In the next “Adventures in Code” column, I’ll present optimizations that substantially increased the tool’s throughput.

What’s behind a programmable tool, such as Python, sed, or SQLite? It turns out that the key component is the data structures used to represent the code. As a student, I came across the equation “Algorithms + Data Structures = Programs.”3 I can still remember that at the time I knew what algorithms were, but data structures were to me a fuzzy concept of academic only interest. “Why would anyone need something more than the numeric and string arrays supported by the BASIC language?” I thought. Over the years, I’ve come to appreciate the value and significance of how we organize our data. Now, I believe that data structures are far more important than algorithms. First, in modern systems, bespoke sophisticated algorithms play only a minor, if any, role; data structures rule! Second, in many cases the algorithm’s choice is a clear-cut decision, whereas deciding on the data structure to use requires a deep understanding of context and a careful weighing of tradeoffs. Third, type systems have matured to offer us powerful practical aid in dealing with data; corresponding formal models for algorithms less so.

Consequently, the design of many systems should often start by considering how data will be structured: a database’s schema, key classes and their fields, types and operations on them. ...
Journal article (2025) - Diomidis Spinellis
The C preprocessor, a key element of the language, has become a liability due to its lack of integration with modern language semantics. This column describes the analysis of the C preprocessor usage in the Linux kernel, comprising 20 million lines of code, using the CScout refactoring browser. Processing limitations led to a solution leveraging a supercomputer’s parallel processing capabilities. The analysis divided the kernel’s source files across 32 supercomputer nodes and implemented a binary tournament database merging strategy. Initial efforts revealed multiple difficulties. Resolving them took several false starts with recursive SQL statements, an SQLite extension, and the GraphViz connected components tool. After a number of redesigns guided by stress-testing, the analysis finished in just 32 hours rather than a week, using 374 CPU hours and 640 GiB RAM on the supercomputer’s nodes. ...
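As a rough illustration of the binary tournament merging idea mentioned above, the following Python sketch merges partial results pairwise, round by round, until one remains. It is not the column's implementation: the function name `tournament_merge`, the use of Python sets as stand-ins for per-node databases, and set union as the merge step are all assumptions made for the example.

```python
def tournament_merge(partials, merge):
    """Merge partial results pairwise in rounds until a single one remains.

    `partials` stands in for the per-node databases; `merge` is any
    function that combines two of them (hypothetical placeholder).
    Returns the merged result and the number of rounds played.
    """
    if not partials:
        raise ValueError("nothing to merge")
    rounds = 0
    while len(partials) > 1:
        # Pair up neighbours: (0, 1), (2, 3), ...
        merged = [merge(partials[i], partials[i + 1])
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            merged.append(partials[-1])  # The odd one out advances unmerged
        partials = merged
        rounds += 1
    return partials[0], rounds


# Illustrative data: 32 "nodes", each holding two overlapping identifiers
node_results = [{f"id{n}", f"id{n + 1}"} for n in range(32)]
merged, rounds = tournament_merge(node_results, lambda a, b: a | b)
print(len(merged), rounds)
```

With 32 inputs the merge completes in five rounds, and the merges within each round are independent of each other, which is what makes the scheme attractive for parallel execution.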
Journal article (2025) - D. Spinellis
Background

The proliferation of generative artificial intelligence (AI) has facilitated the creation and publication of fraudulent scientific articles, often in predatory journals. This study investigates the extent of AI-generated content in the Global International Journal of Innovative Research (GIJIR), where a fabricated article was falsely attributed to me.
Methods

The entire GIJIR website was crawled to collect article PDFs and metadata. Automated scripts were used to extract the number of probable in-text citations, DOIs, affiliations, and contact emails. A heuristic based on the number of in-text citations was employed to identify the probability of AI-generated content. A subset of articles was manually reviewed for AI indicators such as formulaic writing and missing empirical data. Turnitin’s AI detection tool was used as an additional indicator. The extracted data were compiled into a structured dataset, which was analyzed to examine human-authored and AI-generated articles.
Results

Of the 53 examined articles with the fewest in-text citations, at least 48 appeared to be AI-generated, while five showed signs of human involvement. Turnitin’s AI detection scores confirmed high probabilities of AI-generated content in most cases, with scores reaching 100% for multiple papers. The analysis also revealed fraudulent authorship attribution, with AI-generated articles falsely assigned to researchers from prestigious institutions. The journal appears to use AI-generated content both to inflate its standing through misattributed papers and to attract authors aiming to inflate their publication record.
Conclusions

The findings highlight the risks posed by AI-generated and misattributed research articles, which threaten the credibility of academic publishing. Ways to mitigate these issues include strengthening identity verification mechanisms for DOIs and ORCIDs, enhancing AI detection methods, and reforming research assessment practices. Without effective countermeasures, the unchecked growth of AI-generated content in scientific literature could severely undermine trust in scholarly communication. ...
Journal article (2025) - Diomidis Spinellis
Science typically advances in small incremental steps, but in some rare instances it leaps forward. One discovery or invention can change how we see the world around us. Would it not be neat to be able to accurately pinpoint those moments of time in an objective way and thereby investigate science and technology’s progress? In 2016, Russell Funk of the University of Minnesota’s Carlson School of Management and Jason Owen-Smith from the University of Michigan published a measure for exactly this purpose.1 Their so-called consolidation-disruption (CD) index quantifies the extent to which published findings affect the subsequent use of the knowledge on which those findings relied. Worryingly, a widely cited subsequent study applied this measure on patents and scientific publications, finding a slowdown in disruptive progress.2 Thickening the plot, a later preprint attributed the finding to dataset artefacts.3 These studies prompt the need for an efficient way to calculate the CD index on large amounts of openly available data. [...] ...
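For readers unfamiliar with the CD index, here is a minimal Python sketch of the commonly cited definition, which the abstract above does not spell out and which is therefore an assumption here. Each later work that cites the focal work, its references, or both contributes +1 if it cites only the focal work, -1 if it cites both, and 0 if it cites only the references; the index is the average. The function name, argument layout, and sample data are hypothetical, and the time window used in published variants is ignored.

```python
def cd_index(focal, references, later_citations):
    """Compute a simplified CD index for one focal work.

    focal: identifier of the focal work
    references: set of identifiers the focal work cites
    later_citations: dict mapping each later work to the set of
        identifiers it cites
    Returns a value in [-1, 1], or None if no later work cites the
    focal work or any of its references.
    """
    total = 0
    n = 0
    for cited in later_citations.values():
        cites_focal = focal in cited
        cites_refs = bool(references & cited)
        if not (cites_focal or cites_refs):
            continue  # Unrelated work: not part of the citation network
        n += 1
        # +1: focal only; -1: focal and references; 0: references only
        total += -2 * cites_focal * cites_refs + cites_focal
    return total / n if n else None


later = {
    "a": {"F"},         # Cites only the focal work (disruptive signal)
    "b": {"F", "r1"},   # Cites focal work and a reference (consolidating)
    "c": {"r2"},        # Cites only a reference
    "d": {"x"},         # Unrelated
}
print(cd_index("F", {"r1", "r2"}, later))
```

This direct loop is only meant to make the measure concrete; it says nothing about how to compute it efficiently on large datasets, which is the problem the article addresses.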
Journal article (2025) - Diomidis Spinellis
IN CONTRAST TO physical objects and living things, software doesn’t deteriorate with the passage of time. While we age and our shoes fall apart, digital storage ensures that the software’s bits stay immutable. And yet, software needs substantial maintenance over time, owing to changes in its environment.1 Advancing technology and new requirements prompt us to modernize the software to keep it relevant. Here, I show how these changes happen in practice by describing the evolution and modernization of a burglar alarm security system I first developed a quarter-century ago. […] ...
Journal article (2024) - Diomidis Spinellis
RDBUnit is a unit testing framework designed to test relational database queries, created out of a need for unit testing them while working on software analytics tasks. It is available as a Python package on PyPI and open-source software on GitHub. RDBUnit tests consist of three parts: setup, query, and expected result, with the input and output defined as table contents. The framework utilizes a domain-specific language (DSL) for test specifications, employs a simple parsing mechanism, and uses a class hierarchy for managing database differences. It evaluates test results through SQL code generated and handled by the database engine. RDBUnit supports SQLite, MySQL, and PostgreSQL, and is implemented as a command-line tool suitable for diverse operating systems and continuous integration environments. It has proved beneficial in identifying subtle bugs and facilitating a focused and efficient approach to experimenting with SQL queries, especially in big data scenarios, signifying the assurance provided by unit testing in SQL-centric tasks. ...
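The three-part test structure described above (setup, query, expected result) can be illustrated with a small Python sketch built on the standard `sqlite3` module. This is not RDBUnit's DSL or API: the helper `run_query_test` and the sample table are hypothetical, and unlike RDBUnit, which evaluates results through generated SQL inside the database engine, this sketch compares rows in Python for brevity.

```python
import sqlite3


def run_query_test(setup_sql, query_sql, expected_rows):
    """Run one query test against a fresh in-memory SQLite database.

    setup_sql: statements that create and fill the input tables
    query_sql: the query under test
    expected_rows: list of tuples the query should return (any order)
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)                 # 1. Setup
        actual = conn.execute(query_sql).fetchall()   # 2. Query
    finally:
        conn.close()
    return sorted(actual) == sorted(expected_rows)    # 3. Expected result


setup = """
CREATE TABLE commits(author TEXT, lines INTEGER);
INSERT INTO commits VALUES ('ann', 10), ('bob', 5), ('ann', 7);
"""
query = "SELECT author, SUM(lines) FROM commits GROUP BY author"
passed = run_query_test(setup, query, [("ann", 17), ("bob", 5)])
print("ok" if passed else "FAILED")
```

Running each test against a fresh in-memory database keeps tests isolated from one another, which matches the spirit of defining both input and output as table contents.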

Insights from a Case Study at ING

Conference paper (2024) - Eileen Kapel, Luís Cruz, Diomidis Spinellis, Arie Van Deursen
An incident management process is necessary in businesses that depend strongly on software and services. A proper process is essential to guarantee that incidents are well-handled, especially in a financial software-defined business needing to adhere to guidelines and regulations. This paper aims to enhance understanding of the current state of practice through a single-case exploratory case study, at the international bank ING, by interviewing 15 subject matter experts on the incident management process. The research identifies eight core observations on tool usage, the challenges experienced and future opportunities. Core challenges include monitoring data quality, the complexity of the environment, and the balance between minimising incident resolution time and following procedural guidelines. Future opportunities can lessen these challenges by making better use of available tooling and employing machine learning approaches. This requires tight supervision on the use of best practices and good monitoring data quality. The findings emphasise the need for a strengthened focus on improving the quality of monitoring data, handling environment complexity, incident clustering, and better support for regulatory compliance. ...
Journal article (2024) - Diomidis Spinellis
Effective data processing workflows are crucial in data science, business analytics, and machine learning. Domain-specific tools can be invaluable, but often custom workflows are needed. Key to their success is splitting data and tasks into manageable chunks to enhance reliability, troubleshooting, and parallelization. Avoid monolithic programs; instead, favor modular designs that simplify data management and processing. Utilizing tools like xargs and GNU parallel can leverage multiple cores or hosts efficiently. Logging and documenting your workflow are essential for monitoring progress and understanding the process. Handling data subsets allows for quicker feedback and testing. Prepare for invalid data and system failures by designing processes that can gracefully manage exceptions and ensure results are reproducible and incremental, avoiding over-engineering. Simplify where possible, leveraging powerful, mature Unix tools and focusing optimization efforts on parts of the code responsible for the bulk of runtime costs. Adhere to software engineering practices to maintain the quality and integrity of your workflow, ensuring it remains a reliable asset to your organization. ...
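The advice on splitting work into chunks and preparing for invalid data can be sketched as follows. The article names xargs and GNU parallel; this example swaps in Python's `concurrent.futures` purely for illustration, and the record format, chunk size, and function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, size):
    """Split a list of records into consecutive chunks of at most `size`."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def process_chunk(chunk):
    """Sum the integer records of a chunk, setting invalid ones aside.

    Invalid records are collected rather than raised, so one bad value
    does not abort the whole run.
    """
    total, rejected = 0, []
    for record in chunk:
        try:
            total += int(record)
        except ValueError:
            rejected.append(record)
    return total, rejected


records = ["1", "2", "oops", "4", "5", "", "7"]
with ThreadPoolExecutor(max_workers=2) as pool:
    # Each chunk is processed independently; map preserves chunk order
    results = list(pool.map(process_chunk, split_into_chunks(records, 3)))

grand_total = sum(total for total, _ in results)
all_rejected = [r for _, rejected in results for r in rejected]
print(grand_total, all_rejected)
```

Because each chunk yields its own result, a failed or suspicious chunk can be rerun on its own, and a small subset of chunks can be used for quick feedback before processing the full dataset.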
Review (2024) - Diomidis Spinellis
Code refactoring is an essential part of software development, because it reduces technical debt, enhances long-term code sustainability, and enables the implementation of functionality that might have been incompatible with an original design. IDEs automate many refactoring tasks, but they sometimes lack support for specific operations or languages. In such cases, regular expressions offer a powerful alternative, automating tedious tasks, reducing errors, and saving time. This article shares a practical example: extending the CScout refactoring browser to collect metrics on C preprocessor usage, which required addressing widespread cyclic dependencies. The changes were facilitated by the "git-subst" Git extension, which makes global text replacements using regular expressions in Git-managed files. A series of 30 git-subst invocations were automatically generated, again using regular expression replacements. While not a cure-all, regular expressions are invaluable for many refactoring tasks, making them a key skill for software developers. ...
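To make the idea of regular-expression refactoring concrete, here is a minimal Python sketch that renames an identifier using word boundaries. It does not reproduce git-subst or its command-line interface; the function name and the sample C fragment are hypothetical.

```python
import re


def rename_identifier(source, old, new):
    """Rename whole-word occurrences of `old` to `new` in `source`.

    The \\b anchors keep longer identifiers that merely contain `old`
    untouched. This is purely textual: it also rewrites matches inside
    comments and string literals.
    """
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")
    # A function replacement avoids interpreting backslashes in `new`
    return pattern.sub(lambda match: new, source)


code = "int get_count(void); int n = get_count(); int get_counter;"
renamed = rename_identifier(code, "get_count", "item_count")
print(renamed)
```

The purely textual nature of the replacement is both the strength and the limit of the approach: it works on any language, but the result should be reviewed and compiled or tested afterwards.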
Journal article (2024) - Diomidis Spinellis
Generative AI based on large language models is significantly impacting software development through IDE assistants, cloud-based APIs, and interactive chatbots for coding assistance. It excels in generating and translating code and data, navigating APIs, and creating boilerplate content, thereby enhancing productivity. However, it is prone to generating inaccurate information (“hallucinations”), erroneous code, and potentially introducing security vulnerabilities. To counter these risks, employing automated analysis tools, conducting rigorous testing, and maintaining a deep understanding of computer science concepts are essential. While generative AI can substantially aid development tasks, it is not a replacement for human expertise, especially in understanding complex software, its requirements, and architecture. ...
Effective change management is crucial for businesses heavily reliant on software and services to minimise incidents induced by changes. Unfortunately, in practice it is often difficult to effectively use artificial intelligence for IT Operations (AIOps) to enhance service management, primarily due to inadequate data quality. Establishing reliable links between changes and the induced incidents is crucial for identifying patterns, improving change deployment, identifying high-risk changes, and enhancing incident response. In this research, we investigate the enhancement of traceability between changes and incidents through AIOps methods. Our approach involves a close examination of incident-inducing changes, the replication of methods linking incidents to the changes that caused them, introducing an adapted method, and demonstrating its results using historical data and practical evaluations. Our findings reveal that incident-inducing changes exhibit different characteristics dependent on context. Furthermore, a significant disparity exists between assessments based on historical data and real-world observation, with an increased occurrence of false positives when identifying links between unlabeled changes and incidents. This study highlights the complex nature of identifying links between changes and incidents, emphasising the contextual influence on AIOps method effectiveness. While we are actively working on improving the quality of current data through AIOps approaches, it remains apparent that further measures are necessary to address issues like data imbalances and promote a postmortem culture that brings attention to the value of properly administrating tickets. A better overview of change failure rates contributes to improved risk compliance and reliable change management. ...
Journal article (2023) - Zoe Kotti, Rafaila Galanopoulou, Diomidis Spinellis
Machine learning (ML) techniques increase the effectiveness of software engineering (SE) lifecycle activities. We systematically collected, quality-assessed, summarized, and categorized 83 reviews in ML for SE published between 2009 and 2022, covering 6,117 primary studies. The SE areas most tackled with ML are software quality and testing, while human-centered areas appear more challenging for ML. We propose a number of ML for SE research challenges and actions, including conducting further empirical validation and industrial studies on ML, reconsidering deficient SE methods, documenting and automating data collection and pipeline processes, reexamining how industrial practitioners distribute their proprietary data, and implementing incremental ML approaches. ...
Journal article (2023) - Diomidis Spinellis
Developers and data scientists often struggle to write command-line inputs, even though graphical interfaces or tools like ChatGPT can assist. The solution? "ai-cli," an open-source system inspired by GitHub Copilot that converts natural language prompts into executable commands for various Linux command-line tools. By tapping into OpenAI's API, which allows interaction through JSON HTTP requests, "ai-cli" transforms user queries into actionable command-line instructions. However, integrating AI assistance across multiple command-line tools, especially in open source settings, can be complex. Historically, operating systems could mediate, but individual tool functionality and the lack of a unified approach have made centralized integration challenging. The "ai-cli" tool, by bridging this gap through dynamic loading and linking with each program's Readline library API, makes command-line interfaces smarter and more user-friendly, opening avenues for further enhancement and cross-platform applicability. ...
Journal article (2023) - Diomidis Spinellis
Considerable scientific work involves locating, analyzing, systematizing, and synthesizing other publications, often with the help of online scientific publication databases and search engines. However, use of online sources suffers from a lack of repeatability and transparency, as well as from technical restrictions. Alexandria3k is a Python software package and an associated command-line tool that can populate embedded relational databases with slices from the complete set of several open publication metadata sets. These can then be employed for reproducible processing and analysis through versatile and performant queries. We demonstrate the software’s utility by visualizing the evolution of publications in diverse scientific fields and relationships among them, by outlining scientometric facts associated with COVID-19 research, and by replicating commonly-used bibliometric measures and findings regarding scientific productivity, impact, and disruption. ...
Context: An incident management process is necessary in businesses that depend strongly on software and services. A proper process is essential to guarantee that incidents are well-handled, especially in a software-defined financial services company needing to adhere to guidelines and regulations.

Objective: This paper aims to improve the understanding of the current state of practice through a case study of the incident management process at a software-defined company.

Method: We conduct a single-case exploratory case study by interviewing 15 subject matter experts on how tools are used, the challenges experienced and the future opportunities of the process. The findings are triangulated with documentation and data.

Results: We make nine core observations in this paper. Certain tools are prescribed to teams to be used in the incident management process, complemented with flexible support for diverse tooling that aids teams in handling incidents. Core challenges include monitoring data quality, the complexity of the environment, and the balance between minimising incident resolution time and following procedural guidelines. Future opportunities can lessen these challenges by making better use of available tooling and employing machine learning approaches. This requires tight supervision on the use of best practices and good monitoring data quality.

Conclusion: The tools, challenges, and future opportunities for incident management in software-defined businesses identified in this paper call for a strengthened focus on improving the quality of monitoring data, handling environment complexity, issue clustering, and better support for regulatory compliance. ...
Journal article (2022) - Zoe Kotti, Georgios Gousios, Diomidis Spinellis
Existing work on the practical impact of software engineering (SE) research examines industrial relevance rather than adoption of study results, hence the question of how results have been practically applied remains open. To answer this and investigate the outcomes of impactful research, we performed a quantitative and qualitative analysis of 4,354 SE patents citing 1,690 SE papers published in four leading SE venues between 1975 and 2017. Moreover, we conducted a survey of 475 authors of 593 top-cited and awarded publications, achieving a 26% response rate. Overall, researchers have equipped practitioners with various tools, processes, and methods, and improved many existing products. SE practice values knowledge-seeking research and is impacted by diverse cross-disciplinary SE areas. Practitioner-oriented publication venues appear more impactful than researcher-oriented ones, while industry-related tracks in conferences could enhance their impact. Some research works did not reach a wide footprint due to limited funding resources or an unfavorable cost-benefit trade-off of the proposed solutions. The need for higher SE research funding could be corroborated through a dedicated empirical study. In general, the assessment of impact is subject to its definition. Therefore, academia and industry could jointly agree on a formal description to set a common ground for subsequent research on the topic. ...
Conference paper (2022) - Stefanos Chaliasos, Thodoris Sotiropoulos, Diomidis Spinellis, Arthur Gervais, Benjamin Livshits, Dimitris Mitropoulos
We propose a testing framework for validating static typing procedures in compilers. Our core component is a program generator suitably crafted for producing programs that are likely to trigger typing compiler bugs. One of our main contributions is that our program generator gives rise to transformation-based compiler testing for finding typing bugs. We present two novel approaches (type erasure mutation and type overwriting mutation) that apply targeted transformations to an input program to reveal type inference and soundness compiler bugs respectively. Both approaches are guided by an intra-procedural type inference analysis used to capture type information flow. We implement our techniques as a tool, which we call Hephaestus. The extensibility of Hephaestus enables us to test the compilers of three popular JVM languages: Java, Kotlin, and Groovy. Within nine months of testing, we have found 156 bugs (137 confirmed and 85 fixed) with diverse manifestations and root causes in all the examined compilers. Most of the discovered bugs lie in the heart of many critical components related to static typing, such as type inference. ...