Humans perceive the world and interact with it in multimodal ways, and language understanding and generation are no exception. However, current natural language processing methods often rely solely on text to produce their hypotheses. In this talk, I will present recent work that aims to bring visual context to machine translation, along with a qualitative assessment of the model's capability to leverage this information. We show that while visual context helps, the model can be lazy.