Visualize: Figures 1, 4, and 5. Description: "Examples of our generated question-relevant captions. During the training phase, our model selects the most relevant human captions for each question (marked by the same color)."