Dimitri Coelho Mollo

1 record found

Helpful, harmless, honest?

Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback

Journal article (2025) - Adam Dahlgren Lindström (author), Leila Methnani (author), Lea Krause (author), Petter Ericson (author), I. Martinez de Rituerto de Troya (author), Dimitri Coelho Mollo (author), R.I.J. Dobbe (author)

This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback methods, involving either human feedback (RLHF) or AI feedback (RLAIF ...