Learning to recognise words using visually grounded speech