Visually grounded fine-grained speech representation learning