Modelling Human Word Learning and Recognition Using Visually Grounded Speech