Multimodal Neural MT

Natural language processing, and especially Machine Translation (MT), faces a number of well-known challenges, such as lexical, syntactic, and semantic ambiguities. Recent approaches based on deep learning model the dependencies involved in these challenges implicitly, leading to better performance. However, semantics can hardly be modelled from text alone. I will present our work on using visual inputs (images) to help a Neural MT system. By translating image captions, we expect to visually ground the language, leading to better semantic disambiguation and, in turn, better translations.
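To make the idea concrete, one common way to inject visual information into an encoder-decoder NMT model is to project a pooled CNN image feature into the text encoder's hidden space and fuse it with the encoder state that initializes the decoder. The sketch below illustrates only that fusion step; all dimensions, names, and the additive-fusion choice are illustrative assumptions, not necessarily the system described in this abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a pooled CNN feature (e.g. 2048-d) and
# a text-encoder hidden state (e.g. 512-d). Both are assumptions.
d_img, d_txt = 2048, 512

def init_decoder_state(text_state, image_feat, W_img, b_img):
    """Visually ground the decoder: project the image feature into the
    text-encoder space and add it to the final encoder state."""
    visual = np.tanh(image_feat @ W_img + b_img)  # shape (d_txt,)
    return text_state + visual                    # fused initial decoder state

# Toy stand-ins for real encoder outputs and CNN features.
text_state = rng.standard_normal(d_txt)
image_feat = rng.standard_normal(d_img)
W_img = rng.standard_normal((d_img, d_txt)) * 0.01  # learned projection
b_img = np.zeros(d_txt)

s0 = init_decoder_state(text_state, image_feat, W_img, b_img)
print(s0.shape)
```

In a trained system, `W_img` and `b_img` would be learned jointly with the translation model, and richer fusion schemes (e.g. attention over spatial image features) are also possible.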