Visualize: Figures 1, 4, and 5.
Description: "Examples of our generated question-relevant
captions. During the training phase, our model selects
the most relevant human captions for each question
(marked by the same color)."