Reliable container tracking depends on the quality of estimated time-of-arrival (ETA) data, yet existing logistics platforms offer little guidance on how trustworthy those timestamps really are. This thesis proposes a fit-for-use data-quality (DQ) framework for Digital Container Shipping Association (DCSA)-compliant event logs that flags ETA records likely to deviate from the actual time of arrival (ATA) by more than one calendar day.
Event logs from $\sim$90\,k transport legs were preprocessed into records capturing origin-destination pair, carrier, publisher type, and timing information. Four supervised models, namely Linear Regression (LR), Random Forest, XGBoost, and a Neural Network, were trained to predict leg duration. A record was labeled \textit{low-quality} when the ATA implied by the predicted leg duration differed from the published ETA by more than one day. Model outputs were evaluated with a precision-oriented \(\mathrm{F}_{\beta}\)-score in which a false alarm is weighted 50 times more heavily than a missed detection (\(\beta \approx 0.141\)).
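Assuming the standard definition of the \(\mathrm{F}_{\beta}\)-score (an assumption, since the formula is not restated here), the stated cost ratio maps to \(\beta\) as follows, with \(\mathit{TP}\), \(\mathit{FP}\), and \(\mathit{FN}\) denoting correctly flagged records, false alarms, and missed detections:
\[
\mathrm{F}_{\beta} = \frac{(1+\beta^{2})\,\mathit{TP}}{(1+\beta^{2})\,\mathit{TP} + \beta^{2}\,\mathit{FN} + \mathit{FP}},
\qquad
\beta^{2} = \frac{1}{50} \;\Rightarrow\; \beta = \sqrt{0.02} \approx 0.141 .
\]
Under this definition each false alarm carries 50 times the weight of a missed detection in the denominator, which is what makes the score precision-oriented.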
The simplest model prevailed: standard LR achieved the highest overall \(\mathrm{F}_{0.141}\)-score (68.5\,\%), combining few false positives with robust recall, while the more complex tree-based and neural models produced excessive false alarms. When the analysis was narrowed to early-stage ETAs published by carriers (arguably the least reliable yet most operationally valuable subset), LR's score rose to 72.0\,\%. These findings suggest that careful feature engineering and data curation outweigh algorithmic complexity for this task.
The study delivers the first systematic, event-data-only method to quantify DQ in container tracking, enabling near-real-time plausibility checks without Automatic Identification System (AIS) feeds. Limitations include a three-month observation window and the absence of exogenous factors such as weather or port congestion. Future work should extend the temporal scope, integrate AIS-derived and environmental features, and explore meta-learning techniques to adapt to disruptions. As an alternative approach to data-quality assessment in container event logs, process mining could be used to uncover anomalous event sequences.
By demonstrating that a transparent LR baseline can reliably surface dubious ETAs, the thesis provides a practical blueprint for logistics platforms seeking to bolster trust in their tracking data and to prioritize corrective action where it matters most.