Visualize: Figure 2.
Description: "The statement and the most relevant caption are both parsed
into constituency trees. These two trees are then aligned by the common node. The
subtree including the common node in the statement is merged into the caption tree
to obtain the explanation."
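
The align-and-merge step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the method behind the figure: the `Node` class, the helper names, the stopword list, and the example sentences are all hypothetical, the trees are written by hand rather than produced by a parser, and "merge" is interpreted here as replacing the caption's smallest phrase around the common word with the statement's smallest phrase around it. The description does not specify the actual merge rule.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Constituency-tree node; a leaf stores the word itself in `label`."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> List[str]:
        if self.is_leaf():
            return [self.label]
        return [w for c in self.children for w in c.leaves()]


def leaf(tag: str, word: str) -> Node:
    """Build a preterminal (POS tag) node dominating a single word."""
    return Node(tag, [Node(word)])


# Hypothetical stopword list so that function words are not used for alignment.
STOPWORDS = {"a", "an", "the", "is", "are"}


def common_word(statement: Node, caption: Node) -> Optional[str]:
    """First non-stopword of the caption that also occurs in the statement."""
    statement_words = set(statement.leaves())
    for w in caption.leaves():
        if w in statement_words and w not in STOPWORDS:
            return w
    return None


def find_phrase(node: Node, word: str,
                parent: Optional[Node] = None) -> Optional[Tuple[Optional[Node], Node]]:
    """Depth-first search for the smallest phrase containing `word`.

    Returns (parent_of_phrase, phrase). Assumes every word sits under a
    preterminal with exactly one child, and that the root is a phrase.
    """
    for child in node.children:
        if (len(child.children) == 1 and child.children[0].is_leaf()
                and child.children[0].label == word):
            return parent, node
        found = find_phrase(child, word, node)
        if found is not None:
            return found
    return None


def merge_trees(statement: Node, caption: Node) -> Node:
    """Return a copy of the caption tree with the statement's phrase merged in."""
    word = common_word(statement, caption)
    if word is None:
        return copy.deepcopy(caption)  # nothing to align on
    merged = copy.deepcopy(caption)    # leave the input trees untouched
    _, s_phrase = find_phrase(statement, word)
    c_parent, c_phrase = find_phrase(merged, word)
    s_phrase = copy.deepcopy(s_phrase)
    if c_parent is None:               # the common phrase is the caption root
        return s_phrase
    idx = next(i for i, c in enumerate(c_parent.children) if c is c_phrase)
    c_parent.children[idx] = s_phrase
    return merged


# Hand-built example trees (hypothetical sentences).
statement = Node("S", [
    Node("NP", [leaf("DT", "the"), leaf("JJ", "brown"), leaf("NN", "dog")]),
    Node("VP", [leaf("VBZ", "is"), Node("VP", [leaf("VBG", "jumping")])]),
])
caption = Node("S", [
    Node("NP", [leaf("DT", "a"), leaf("NN", "dog")]),
    Node("VP", [leaf("VBZ", "catches"),
                Node("NP", [leaf("DT", "a"), leaf("NN", "frisbee")])]),
])

explanation = merge_trees(statement, caption)
print(" ".join(explanation.leaves()))  # the brown dog catches a frisbee
```

In this sketch the two trees are aligned on the shared word "dog", and the statement's noun phrase "the brown dog" takes the place of the caption's "a dog". A real pipeline would obtain the trees from a constituency parser and would likely need a more careful rule for choosing the common node when several words overlap.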