Danny Merkx, Stefan L. Frank, and Mirjam Ernestus
Humans learn language through interaction with their environment and by listening to other humans. It should therefore also be possible for computational models to learn language directly from speech, but so far most approaches require text. We use a neural network approach to create visually grounded embeddings for spoken utterances. The model is trained to embed speech and images in a common embedding space such that, for a spoken caption, the correct image is retrieved and vice versa. In doing so, the model has to learn which constituents of the spoken input refer to which visual features, and may thus learn to recognise words. Our results show a remarkable increase in image-caption retrieval performance over previous work. Unlike text-based sentence representations, our model receives no explicit information about the linguistic units present in the utterances (e.g. word boundaries), and we investigate whether it learns to recognise words in the input. We find that deeper network layers are better at encoding word presence, although the final layer performs slightly worse. This shows that our visually grounded sentence encoder learns to recognise words in the input even though it is not explicitly trained for word recognition.
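The retrieval objective described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: it assumes the speech and image encoders each produce a fixed-size embedding per item, scores caption-image pairs by cosine similarity, and applies a margin-based hinge loss that pushes matched pairs (the diagonal of the similarity matrix) above mismatched ones. All function names and the margin value are illustrative choices.

```python
import numpy as np


def l2_normalise(x):
    # Unit-normalise rows so the dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def similarity_matrix(speech_emb, image_emb):
    # S[i, j] = cosine similarity between spoken caption i and image j.
    # Matched caption-image pairs share the same row/column index.
    return l2_normalise(speech_emb) @ l2_normalise(image_emb).T


def hinge_retrieval_loss(S, margin=0.2):
    # Matched pairs lie on the diagonal of S; penalise any mismatched
    # pair whose similarity comes within `margin` of the matched one,
    # in both retrieval directions (image-to-caption and caption-to-image).
    pos = np.diag(S)
    cost_wrong_image = np.maximum(0.0, margin + S - pos[:, None])
    cost_wrong_caption = np.maximum(0.0, margin + S - pos[None, :])
    n = S.shape[0]
    mask = 1.0 - np.eye(n)  # exclude the matched pairs themselves
    return float(((cost_wrong_image + cost_wrong_caption) * mask).sum() / n)


def retrieve_images(S):
    # For each spoken caption, return the index of the highest-scoring image.
    return S.argmax(axis=1)
```

With perfectly separated embeddings (e.g. identity matrices as toy inputs), the loss is zero and every caption retrieves its own image; during training, gradients of this loss would pull matched speech and image embeddings together in the shared space.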