Word recognition in a model of visually grounded speech