D. Hogendoorn

Predicting data quality of event-based container trackers

Master thesis (2025) - D. Hogendoorn (author), F. Schulte (graduation committee member), Neil Yorke-Smith (graduation committee member), Yusong Pang (graduation committee member), Bart Van Riessen (mentor)

Reliable container tracking depends on the quality of estimated time-of-arrival (ETA) data, yet existing logistics platforms offer little guidance on how trustworthy those timestamps really are. This thesis proposes a fit-for-use data-quality (DQ) framework for Digital Container Shipping Association (DCSA)-compliant event logs that flags ETA records likely to deviate from the actual time of arrival (ATA) by more than one calendar day.

Event logs from $\sim$90\,k transport legs were preprocessed into records capturing origin-destination pair, carrier, publisher type, and timing information. Four supervised models, namely Linear Regression (LR), Random Forest, XGBoost, and a Neural Network, were trained to predict leg duration. A record was labeled \textit{low-quality} when the predicted ATA fell more than one day from the published ETA. Model outputs were evaluated with a precision-oriented $\mathrm{F}_{\beta}$-score, weighted so that a false alarm is 50 times more costly than a missed detection ($\beta = \sqrt{1/50} \approx 0.141$).
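The labeling rule and the evaluation metric described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the reading of "one day" as a 24-hour difference, and the confusion-matrix counts in the usage section are all assumptions made for illustration. The score uses the standard count form $\mathrm{F}_{\beta} = (1+\beta^2)\,\mathrm{TP} / \big((1+\beta^2)\,\mathrm{TP} + \beta^2\,\mathrm{FN} + \mathrm{FP}\big)$, in which a missed detection is weighted by $\beta^2$ relative to a false alarm, so a 50:1 cost ratio gives $\beta^2 = 1/50$.

```python
import math
from datetime import datetime, timedelta

# Assumption: "more than one day" is approximated as a difference above 24 hours.
ONE_DAY = timedelta(days=1)


def is_low_quality(published_eta, predicted_ata, threshold=ONE_DAY):
    """Flag an ETA record whose predicted arrival deviates beyond the threshold."""
    return abs(predicted_ata - published_eta) > threshold


def beta_from_cost_ratio(false_alarm_cost_ratio):
    """Return beta such that a false alarm costs `ratio` times a missed detection.

    In the count form of F-beta, false negatives carry weight beta**2 and
    false positives carry weight 1, hence beta**2 = 1 / ratio.
    """
    return math.sqrt(1.0 / false_alarm_cost_ratio)


def f_beta(tp, fp, fn, beta):
    """F-beta score from true-positive, false-positive and false-negative counts."""
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


if __name__ == "__main__":
    beta = beta_from_cost_ratio(50)
    print(f"beta = {beta:.3f}")

    # Hypothetical counts, not results from the thesis.
    print(f"F_beta = {f_beta(tp=80, fp=10, fn=40, beta=beta):.3f}")

    # Hypothetical record: published ETA versus model-predicted arrival.
    eta = datetime(2025, 1, 1, 8, 0)
    predicted = datetime(2025, 1, 3, 6, 0)
    print("low-quality:", is_low_quality(eta, predicted))
```

With such a small $\beta$ the score stays close to precision, so a model that raises many false alarms is penalised heavily even when its recall is high.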

The simplest model prevailed: standard LR achieved the highest overall $\mathrm{F}_{0.141}$-score (68.5 \%), balancing few false positives with robust recall, while the more complex tree-based and neural models produced excessive false alarms. When the analysis was narrowed to early-stage ETAs published by carriers (arguably the least reliable yet most operationally valuable subset), LR’s score rose to 72.0 \%. These findings highlight that careful feature engineering and data curation outweigh algorithmic complexity for this task.

The study delivers the first systematic, event-data-only method to quantify DQ in container tracking, enabling near-real-time plausibility checks without AIS feeds. Limitations include a three-month observation window and the absence of exogenous factors such as weather or port congestion. Future work should extend the temporal scope, integrate AIS-derived and environmental features, and explore meta-learning techniques to adapt to disruptions. As an alternative approach to data-quality assessment in container event logs, it could also apply process mining to uncover anomalous event sequences.

By demonstrating that a transparent LR baseline can reliably surface dubious ETAs, the thesis provides a practical blueprint for logistics platforms seeking to bolster trust in their tracking data and to prioritise corrective action where it matters most.