Matching images and text with multi-modal tensor fusion and re-ranking