Engineering Data Processing Workflows

Journal Article (2024)
Author(s)

Diomidis Spinellis (TU Delft - Software Engineering, Athens University of Economics and Business)

Research Group
Software Engineering
DOI
https://doi.org/10.1109/MS.2024.3385665
Publication Year
2024
Language
English
Issue number
4
Volume number
41
Pages (from-to)
25-29
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Effective data processing workflows are crucial in data science, business analytics, and machine learning. Domain-specific tools can be invaluable, but custom workflows are often needed. Key to their success is splitting data and tasks into manageable chunks to improve reliability and parallelization and to simplify troubleshooting. Avoid monolithic programs; instead, favor modular designs that simplify data management and processing. Tools such as xargs and GNU parallel can exploit multiple cores or hosts efficiently. Logging and documenting your workflow are essential for monitoring progress and understanding the process. Working on data subsets allows quicker feedback and testing. Prepare for invalid data and system failures by designing processes that handle exceptions gracefully and produce reproducible, incremental results, while avoiding over-engineering. Simplify where possible, leveraging powerful, mature Unix tools and focusing optimization efforts on the parts of the code responsible for the bulk of runtime costs. Adhere to software engineering practices to maintain the quality and integrity of your workflow, ensuring it remains a reliable asset to your organization.
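To illustrate the chunking-and-parallelization pattern the abstract describes, here is a minimal sketch using xargs; the file names and directory layout are hypothetical, not taken from the article.

```shell
#!/bin/sh
# Hypothetical example: compress each .csv file in data/ as an independent
# chunk, running one gzip invocation per file across all available cores.
set -eu

mkdir -p data
printf 'a,b\n1,2\n' > data/sample.csv   # stand-in input file

# -n 1 passes one file per invocation; -P runs that many jobs concurrently.
# gzip -k keeps the original, -f overwrites a stale .gz from a prior run.
ls data/*.csv | xargs -n 1 -P "$(nproc)" gzip -kf

# GNU parallel expresses the same pattern, and can also distribute the
# jobs across remote hosts:
#   ls data/*.csv | parallel gzip -kf {}
```

Because each chunk is a separate process on a separate file, a failure affects only that chunk, and a rerun can skip files whose output already exists.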

Files

Engineering_Data_Processing_Wo... (pdf | 1.24 Mb)
- Embargo expired in 12-12-2024