P. Koev
Please Note
This page displays the records of the person named above and is not linked to a unique person identifier. This record may need to be merged to a profile.
2 records found
Social Impact Regularization in IQ-Learn
Steering Social Intent in Heterogeneous Driving Demonstrations
Autonomous driving relies heavily on Reinforcement Learning (RL) to train agents in sequential decision-making settings. However, RL's success is deeply bottlenecked by the need to manually specify a reward function, a notoriously difficult task when attempting to balance safety, efficiency, and nuanced social etiquette in highly interactive domains. Inverse Reinforcement Learning (IRL) circumvents this challenge by extracting latent objectives directly from expert data. Yet standard IRL operates under a critical assumption: that all demonstrations stem from a single, homogeneous behavioural profile. In reality, traffic is fundamentally heterogeneous, composed of a mixture of distinct driving styles ranging from calm and cooperative to aggressive and assertive. When standard IRL is applied to such mixed datasets, it struggles to fit a single reward function to the conflicting behaviours. Consequently, the recovered reward typically collapses into an arbitrary average, misrepresenting the varied driving profiles and failing to account for the essential social context of driving. To resolve this ambiguity, this thesis introduces the Social Impact Regularized IQ-Learn framework. This approach decomposes the driving reward into two distinct components: an individual reward capturing the ego vehicle's own progress, and an ego-centric social impact signal measuring how the vehicle's actions directly affect its neighbours. By combining these into a social scoring function, the framework integrates a normative prior as an additive regularizer within the IQ-Learn objective. This formulation exploits a vital separation: the core IQ-Learn objective absorbs universal physical driving dynamics from the entire mixed dataset, while the regularizer selectively steers the social interpretation of those dynamics towards a specific, designer-chosen behavioural target.
Evaluations spanning a tabular gridworld proof-of-concept, a multi-agent stochastic environment, and a continuous-observation intersection simulator confirm that the regularizer effectively resolves behavioural ambiguity. The framework can successfully steer the recovered policy towards a targeted social alignment. Ultimately, by making the social orientation of the learned policy an explicit and inspectable parameter, this methodology provides a concrete, auditable mechanism for designers and regulators to verify that an autonomous vehicle's social behaviour matches its intended design.
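To make the idea of an additive regularizer on top of an IQ-Learn-style objective more concrete, here is a minimal tabular sketch. It is not the thesis's formulation: the angle-based `social_score` blend, the squared-error penalty, the weight `lam`, and all function names are illustrative assumptions, and the IQ-Learn term is simplified (deterministic transitions, no concave reward transform).

```python
import math

def soft_value(q_row):
    """Soft value V(s) = log sum_a exp Q(s, a), computed stably."""
    m = max(q_row)
    return m + math.log(sum(math.exp(x - m) for x in q_row))

def implied_reward(q, next_state, gamma, s, a):
    """Reward implied by Q: r(s, a) = Q(s, a) - gamma * V(s').

    Assumes deterministic transitions given by next_state[s][a].
    """
    return q[s][a] - gamma * soft_value(q[next_state[s][a]])

def social_score(r_ego, r_social, angle):
    """Hypothetical blend of own progress and impact on neighbours.

    angle = 0 scores only the ego reward; larger angles weight the
    social impact signal more heavily. This form is an assumption.
    """
    return math.cos(angle) * r_ego + math.sin(angle) * r_social

def regularized_loss(q, demos, next_state, start_states, gamma, target, lam):
    """Simplified IQ-Learn-style loss plus an additive social penalty.

    demos is a list of (state, action) pairs; target[s][a] holds the
    designer-chosen social score. The squared-error penalty is an
    illustrative choice, not the thesis's regularizer.
    """
    rewards = [implied_reward(q, next_state, gamma, s, a) for s, a in demos]
    mean_reward = sum(rewards) / len(rewards)
    mean_start_value = sum(soft_value(q[s]) for s in start_states) / len(start_states)
    iq_term = -mean_reward + (1.0 - gamma) * mean_start_value
    penalty = sum((r - target[s][a]) ** 2 for r, (s, a) in zip(rewards, demos)) / len(demos)
    return iq_term + lam * penalty

# Tiny two-state, two-action example with made-up numbers.
q = [[0.0, 0.0], [0.0, 0.0]]
next_state = [[0, 1], [1, 0]]
demos = [(0, 1), (1, 0)]
r_ego = [[1.0, 0.5], [0.2, 0.8]]
r_social = [[-0.5, 0.3], [0.4, -0.1]]
angle = math.pi / 4  # hypothetical target orientation
target = [[social_score(r_ego[s][a], r_social[s][a], angle) for a in range(2)]
          for s in range(2)]

base_loss = regularized_loss(q, demos, next_state, [0], 0.9, target, lam=0.0)
steered_loss = regularized_loss(q, demos, next_state, [0], 0.9, target, lam=1.0)
```

In this sketch, setting `lam` to zero leaves only the data-fitting term, while a positive `lam` adds a cost for implied rewards that disagree with the chosen social target; that mirrors the separation described in the abstract only at a schematic level.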
Conflict in the World of Inverse Reinforcement Learning
Investigating Inverse Reinforcement Learning with Conflicting Demonstrations
Inverse Reinforcement Learning (IRL) algorithms are closely related to Reinforcement Learning (RL) but instead try to model the reward function from a given set of expert demonstrations. Many IRL algorithms have been proposed, but most assume consistent demonstrations. Consistency is the assumption that all demonstrations follow the same underlying reward function and near-optimal policy, without any contradictions. This, however, is not always the case. This study investigates the effect of conflicting demonstrations on IRL algorithms. For our experiments, the Lunar Lander environment and a grid-world environment are used in combination with a state-of-the-art IRL algorithm. To obtain the expert demonstrations, agents were trained to optimal policies using RL algorithms with explicitly different reward functions. These demonstrations were then used to train the IRL algorithm under a variety of hyperparameter configurations. Our results show that IRL algorithms can be trained using demonstrations with varying levels of conflict. In conclusion, we demonstrate that IRL can learn even when provided with a set of conflicting demonstrations.
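One way to picture "varying levels of conflict" is as the fraction of demonstrations drawn from a second expert trained on a different reward. The sketch below is only an illustration of that reading; the function name `mix_demonstrations`, the `conflict_level` parameter, and the sampling scheme are assumptions and are not taken from the study.

```python
import random

def mix_demonstrations(demos_a, demos_b, conflict_level, total, seed=0):
    """Build a demonstration set with a chosen share of conflicting data.

    demos_a and demos_b are non-empty lists of trajectories from two
    experts trained on different reward functions. conflict_level is the
    hypothetical fraction (0.0 to 1.0) taken from expert B. Each item is
    returned with a label so the mix can be inspected afterwards.
    """
    if not 0.0 <= conflict_level <= 1.0:
        raise ValueError("conflict_level must lie in [0, 1]")
    rng = random.Random(seed)
    n_b = round(conflict_level * total)
    n_a = total - n_b
    # Sample with replacement so small expert pools still work.
    mixed = [("A", d) for d in rng.choices(demos_a, k=n_a)]
    mixed += [("B", d) for d in rng.choices(demos_b, k=n_b)]
    rng.shuffle(mixed)
    return mixed

# Placeholder trajectories; real ones would be state-action sequences.
expert_a = ["a0", "a1", "a2"]
expert_b = ["b0", "b1", "b2"]
mixed = mix_demonstrations(expert_a, expert_b, conflict_level=0.25, total=8)
```

Sweeping `conflict_level` from 0 to 1 under this assumption would give a series of datasets ranging from fully consistent to fully dominated by the second expert.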