Description: "We next probe each attention head for linguistic
phenomena. In particular, we treat each head as a
simple no-training-required classifier that, given a
word as input, outputs the most-attended-to other
word. We then evaluate the ability of the heads
to classify various syntactic relations."
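
The probing setup described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes one head's attention for a sentence is available as a square matrix in which row i holds the weights that word i places on every word, and that each word has a gold target index for the relation being tested (or None when it has no target). The function names `most_attended_other_word` and `relation_accuracy` and the toy weights are hypothetical.

```python
from typing import List, Optional


def most_attended_other_word(attention: List[List[float]]) -> List[int]:
    """For each word i, return the index j != i with the highest attention weight.

    Assumes a square matrix for a sentence of at least two words.
    """
    predictions = []
    for i, row in enumerate(attention):
        # Exclude the word itself, since the head must pick an *other* word.
        candidates = [j for j in range(len(row)) if j != i]
        predictions.append(max(candidates, key=lambda j: row[j]))
    return predictions


def relation_accuracy(
    attention: List[List[float]], gold_targets: List[Optional[int]]
) -> float:
    """Fraction of words whose most-attended word matches the gold target.

    Words whose gold target is None (no target for this relation) are skipped.
    """
    predictions = most_attended_other_word(attention)
    scored = [(p, g) for p, g in zip(predictions, gold_targets) if g is not None]
    if not scored:
        return 0.0
    return sum(p == g for p, g in scored) / len(scored)


# Toy example (made-up weights) for the sentence "the dog barks".
words = ["the", "dog", "barks"]
toy_attention = [
    [0.6, 0.3, 0.1],  # "the" attends mostly to itself, then to "dog"
    [0.2, 0.1, 0.7],  # "dog" attends mostly to "barks"
    [0.5, 0.4, 0.1],  # "barks" attends mostly to "the"
]
# Hypothetical gold targets: "the" -> "dog", "dog" -> "barks", "barks" has none.
gold_targets = [1, 2, None]

predicted = most_attended_other_word(toy_attention)
accuracy = relation_accuracy(toy_attention, gold_targets)
```

No parameters are trained here: the prediction is just an argmax over one row of the attention matrix with the word's own position excluded, which is what makes each head usable as a ready-made classifier.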