Visualize: Figures 1, 4, and 5. Description: "Visualization of the sampled video frames, the neuron activations associated with the POS tags, the weights of the sentinel gate, the generated captions, and the real POS tags of the captions."