Exploring the enhancement of predictive accuracy for minority classes in travel mode choice models

Abstract

Transportation systems are pivotal in shaping the economic and social dynamics of contemporary societies, fostering connectivity and opportunities while reducing geographical distances. Despite these benefits, they also contribute to adverse effects such as emissions, congestion, and traffic fatalities. Effectively developing and maintaining transportation infrastructure and services that cater to evolving population needs and align with environmental goals requires accurate forecasting of travel demand. However, due to inherent uncertainty in individuals' behavior and data limitations, forecasting this demand is a complex task.
A common limitation of transport datasets is class imbalance in the use of the different modes, that is, an uneven distribution of samples among the modes. Modes with a higher number of samples are termed majority modes, while those with fewer samples are termed minority modes. Class imbalance can compromise the performance of classifiers, especially for the minority modes, leading to inaccurate forecasts. This, in turn, may result in insufficient investment in and provision for these modes, with adverse consequences for the population segments that rely on them. Existing studies have either entirely overlooked the impact of class imbalance or addressed it only partially. Given the importance of precise demand predictions and these gaps in the literature, the primary research question of this study is how to systematically identify and address the impact of class imbalance in mode choice forecasting.
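As a rough illustration of what an uneven distribution of samples among modes looks like in practice, the sketch below computes class shares and a simple imbalance ratio (majority count divided by minority count). This ratio is a commonly used summary and is an assumption here; the abstract does not state which imbalance measure the study uses. The function names and the toy trip list are hypothetical and are not drawn from the ODiN data.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples belonging to each mode."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {mode: n / total for mode, n in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest (assumed measure)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical toy sample: car is the majority mode, transit the minority.
trips = ["car"] * 6 + ["bike"] * 3 + ["transit"] * 1

shares = class_shares(trips)
ratio = imbalance_ratio(trips)
print(shares)
print(ratio)
```

A ratio of 1 would indicate perfectly balanced modes; larger values indicate a stronger dominance of the majority mode.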
To address the main question, a framework was proposed. This framework encompassed a) the measurement of class imbalance within a dataset and the assessment of its impact on classification performance, b) the investigation of other challenging factors that coexist in imbalanced datasets, with a specific focus on class overlap, and c) the proper evaluation of classification performance across classes. As an integral part of this framework, the 'Performance Gap Metric' was introduced, a metric that evaluates the difference in classification performance between the majority and minority classes. A threshold of 20% was established: classifier performance was considered favorable when the metric fell below this threshold, signifying that the classifier treated the minority and majority classes equitably. Subsequently, the framework was applied to the ODiN data as a case study to predict mode choices in the Netherlands. The mode choices were car, bike, and transit, with car representing the majority class and transit the minority class. Two modeling techniques, Random Forest and an MNL model, were employed in conjunction with various sampling techniques, including SMOTENC, Neighborhood-based Undersampling, and the Separation scheme...
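The idea of a performance gap between majority and minority classes can be sketched as follows. The abstract does not specify which per-class performance measure the metric is built on, so this sketch assumes per-class recall and defines the gap as majority recall minus minority recall; both choices, along with the function names and the toy labels, are hypothetical. Only the 20% threshold comes from the text above.

```python
def per_class_recall(y_true, y_pred):
    """Recall for each class: correctly predicted samples / actual samples."""
    recall = {}
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recall[c] = hits / total
    return recall


def performance_gap(y_true, y_pred, majority, minority):
    """Assumed definition: majority-class recall minus minority-class recall."""
    recall = per_class_recall(y_true, y_pred)
    return recall[majority] - recall[minority]


THRESHOLD = 0.20  # the 20% threshold stated in the abstract

# Hypothetical toy predictions: 10 car trips (9 correct), 5 transit trips (3 correct).
y_true = ["car"] * 10 + ["transit"] * 5
y_pred = ["car"] * 9 + ["transit"] * 1 + ["transit"] * 3 + ["car"] * 2

gap = performance_gap(y_true, y_pred, majority="car", minority="transit")
favorable = gap < THRESHOLD
print(gap, favorable)
```

In this toy case the classifier recovers car trips far better than transit trips, so the gap exceeds the threshold and the result would not count as favorable under the assumed definition.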