This review surveys the current state of data used in the development of machine learning models for disease outbreak forecasting, with a focus on identifying systemic shortcomings and areas for improvement. A set of 26 development papers was selected and analyzed according to attributes of their datasets, such as scope, type, accessibility, and quality. Through thematic analysis, five dominant categories of data failure were identified: structural, procedural, accessibility, logistical, and temporary. Hospital-collected data remains the dominant source but is hindered by under-sampling and latency, while non-traditional data sources offer improved responsiveness at the cost of increased pre-processing complexity. Supplementary datasets, such as climate or mobility data, were found to be underutilized despite their potential to improve forecasting accuracy. Key areas for improvement include the standardization and public availability of datasets, the integration of complementary data sources, and the use of language models to manage linguistically ambiguous data. The findings suggest that current data limitations are structural and widespread, requiring procedural and institutional reforms to improve model generalizability and reliability in disease outbreak forecasting.