Visualize: Figure 2. Description: "The statement and the most relevant caption are both parsed into constituency trees. The two trees are then aligned at their common node. The statement subtree that contains the common node is merged into the caption tree to obtain the explanation."
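
The alignment-and-merge step described in the caption can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the source: the `Node` class, the helper names `path_to_leaf`, `common_words`, and `merge_trees`, the hand-built example trees (standing in for real parser output), the choice of a shared word as the "common node", and the rule that the smallest phrase around that word in the caption is replaced by the corresponding phrase from the statement. The actual merge rule used in the figure may differ.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical constituency-tree node; a leaf holds a word in `label`."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        if not self.children:
            return [self.label]
        words: List[str] = []
        for child in self.children:
            words.extend(child.leaves())
        return words


def path_to_leaf(node: Node, word: str, path: Optional[List[Node]] = None) -> Optional[List[Node]]:
    """Return the nodes from the root down to the first leaf equal to `word`."""
    path = (path or []) + [node]
    if not node.children:
        return path if node.label == word else None
    for child in node.children:
        found = path_to_leaf(child, word, path)
        if found is not None:
            return found
    return None


def common_words(statement: Node, caption: Node) -> List[str]:
    """Words shared by both trees, in statement order (assumed alignment cue)."""
    caption_words = set(caption.leaves())
    return [w for w in statement.leaves() if w in caption_words]


def merge_trees(caption: Node, statement: Node, word: str) -> Node:
    """Copy the caption tree and replace the phrase around `word` with the
    statement's phrase around the same word (one assumed merge rule)."""
    merged = copy.deepcopy(caption)
    cap_path = path_to_leaf(merged, word)
    st_path = path_to_leaf(statement, word)
    if cap_path is None or st_path is None or len(cap_path) < 3 or len(st_path) < 3:
        raise ValueError(f"no alignable phrase around {word!r}")
    # path[-1] is the word, path[-2] its POS tag, path[-3] the enclosing phrase.
    target, source = cap_path[-3], st_path[-3]
    target.label = source.label
    target.children = copy.deepcopy(source.children)
    return merged


# Hand-built example trees (illustrative only, not parser output).
caption = Node("S", [
    Node("NP", [Node("DT", [Node("a")]), Node("NN", [Node("man")])]),
    Node("VP", [
        Node("VBZ", [Node("rides")]),
        Node("NP", [Node("DT", [Node("a")]), Node("NN", [Node("horse")])]),
    ]),
])
statement = Node("S", [
    Node("NP", [Node("DT", [Node("the")]), Node("JJ", [Node("brown")]), Node("NN", [Node("horse")])]),
    Node("VP", [Node("VBZ", [Node("runs")])]),
])

shared = common_words(statement, caption)        # ["horse"]
merged = merge_trees(caption, statement, shared[0])
print(" ".join(merged.leaves()))                 # prints: a man rides the brown horse
```

The sketch copies the caption tree before editing it, so the original caption parse stays available for comparison with the merged explanation.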