This post is the second part of a two-part series on transfer learning. In Part 1, we discussed what contextual word representations are and how to learn representations like ELMo and BERT. In this part, we discuss work that addresses why contextual representations are useful. Does architecture matter? What about the pre-training task? Can we learn good contextual representations from tasks like machine translation? How do contextual word representations capture sentence structure? Do they capture the hierarchical structure of text, which is crucial in NLP? Do ELMo and BERT learn more abstract knowledge in the higher layers of the network, as we see in ImageNet pre-trained CNNs? And more.

Does the type of neural architecture have an impact on the quality of learnt representations?

In the original work on ELMo, Peters et al. used a bi-directional LSTM to train a language model. But can one use different architectures to learn better representations? In a paper titled Dissecting Contextual Word Embeddings, Peters et al. explore the effect of different architectures on the quality of the representations. They experiment with four architectures: a two-layer LSTM, a four-layer LSTM, a Transformer trained with a language model objective (BERT uses a different objective), and a gated CNN language model (Dauphin et al.). All four architectures are trained to learn a language model.

The two-layer and four-layer LSTM models perform best on many downstream tasks like natural language inference and NER, but the others are not far behind. So the type of neural architecture does not have a profound impact on the quality of the representations learnt.

However, does the architecture itself, independent of any learning, have an impact on the quality of representations? Tenney et al. 2019 compare the performance of an ELMo-like architecture with random weights against one with learnt weights. They conclude that the architecture alone can improve results over baselines.

So it is safe to say that architecture plays a role in learning good quality word representations. However, training on the language model objective provides a significant boost to performance, diminishing to a large extent the effect of the architecture choice.

How about using Machine Translation to learn contextual representations?

The paper titled Learned in Translation: Contextualized Word Vectors (CoVe) uses hidden representations from a machine translation model as contextual representations. ELMo shows that representations from a language model perform better than contextualised representations from machine translation. Further, BERT confirmed that self-supervised training of language models is beneficial.

In a recent paper, Linguistic Knowledge and Transferability of Contextual Representations, Liu et al. pursue a related question: the effect of the pre-training task on capturing the linguistic information required to perform well on a suite of different tasks. If the pre-training task is closely related to the downstream task, then performance improves. However, if large models are pre-trained in a self-supervised manner on vast amounts of data, then they perform similarly regardless of the pre-training task.

Self-supervised training on vast amounts of data performs well compared to pre-training on other tasks

What do ELMo and BERT capture - Hierarchy in Representations and Structure

Local Information Captured in Lower Layers and Long Range Information Captured in Higher Layers.

Semantic Similarity

The left panel of the image above shows the semantic similarity between words, calculated using hidden representations from the bottom layer of a four-layer bi-LSTM. The right panel captures the same information using hidden representations from the top layer of the same bi-LSTM.

In the left panel of the image above, the words in the noun phrase “The Russian Government” cluster together. Now look at the right panel: the word “say” is most similar to words like “afford, maintain, meet”, which are other verbs.

This suggests that lower layers capture information useful for tasks like POS tagging that require localised knowledge, while higher layers capture information helpful for tasks like coreference resolution that require long-range knowledge spanning different parts of the sentence. The authors empirically confirm these observations through experiments such as unsupervised coreference resolution using contextual representations from different layers.
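To make this kind of probe concrete, here is a minimal pure-Python sketch of the nearest-neighbour comparison described above. The vectors are toy, hand-picked stand-ins for per-layer hidden states (real ones would come from a pre-trained bi-LSTM); only the ranking logic is real:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(word, reps):
    """Rank all other words by cosine similarity to `word`."""
    q = reps[word]
    others = [(w, cosine(q, v)) for w, v in reps.items() if w != word]
    return sorted(others, key=lambda t: -t[1])

# Hypothetical representations: layer0 mimics a lower layer where "say"
# sits near its sentence neighbours; layer4 mimics a top layer where it
# sits near other verbs. These numbers are invented for illustration.
layer0 = {"say": [0.9, 0.1], "afford": [0.2, 0.8], "government": [0.8, 0.3]}
layer4 = {"say": [0.1, 0.9], "afford": [0.2, 0.8], "government": [0.9, 0.1]}

print(nearest("say", layer0)[0][0])  # → government (local context neighbour)
print(nearest("say", layer4)[0][0])  # → afford (another verb)
```

With real per-layer hidden states substituted for the toy vectors, the same ranking function reproduces the lower-layer versus top-layer contrast discussed above.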

In a separate work, BERT Rediscovers the Classical NLP Pipeline, the authors further confirm that contextual representations (specifically BERT's) encode abstractions in a manner that mirrors a classical NLP pipeline: the information required for POS tagging, parsing, NER, semantic role labelling and coreference resolution is captured in that order, from lower layers to higher layers. (For a deep dive into this paper, refer to the last section of this blog post.)


Syntactic trees are a cornerstone of NLP in English. A parser represents a sentence as a syntax tree whose nodes represent constituents of the sentence, like a noun phrase, verb phrase or adjective, amongst other things. The paper A Structural Probe for Finding Syntax in Word Representations probes contextual word representations to understand whether they encode such syntax trees.

When one thinks about the question, one wonders what it even means to encode syntax trees in the representations. How would you know that the representations capture the syntax tree? Given the representations of all the words in a sentence, can you reconstruct the syntax tree in some sense? If so, then we can say that the representations capture syntax trees. How do we reconstruct the tree? If we know the distance of each node in the tree to every other node, and the depth of each node, then we can reconstruct the tree. The authors learn a linear transformation such that, in the transformed space, the distance between two word representations is close to the distance between the corresponding words in the syntax tree.

In the learnt transformed space, the word representations encode syntax trees: the squared L2 distance between two word representations corresponds to the number of edges between the two words in the gold-standard syntax tree, and the squared L2 norm of a representation corresponds to the depth of the word in the parse tree.
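The distance half of this probe can be sketched in a few lines. Below is a toy pure-Python illustration, not the authors' implementation: three words with hypothetical 2-dimensional "contextual" vectors, gold tree distances between each pair, and a linear map B fitted by naive finite-difference gradient descent so that squared L2 distances in the transformed space match tree distances:

```python
import random

def probe_distance(B, hi, hj):
    """Squared L2 distance between B·hi and B·hj, the probe's
    prediction of the number of tree edges between two words."""
    diff = [a - b for a, b in zip(hi, hj)]
    proj = [sum(row[c] * diff[c] for c in range(len(diff))) for row in B]
    return sum(p * p for p in proj)

# Toy setup: hypothetical 2-dim vectors (real ones would be ELMo/BERT
# hidden states) and gold tree distances for each word pair.
H = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
tree_dist = {(0, 1): 2.0, (0, 2): 1.0, (1, 2): 1.0}

def loss(B):
    """Squared error between probe distances and gold tree distances."""
    return sum((probe_distance(B, H[i], H[j]) - d) ** 2
               for (i, j), d in tree_dist.items())

# Fit B with naive finite-difference gradient descent (illustration
# only; the paper optimises an L1 objective with autograd).
random.seed(0)
B = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
eps, lr = 1e-4, 0.01
for _ in range(5000):
    for r in range(2):
        for c in range(2):
            B[r][c] += eps
            up = loss(B)
            B[r][c] -= 2 * eps
            down = loss(B)
            B[r][c] += eps
            B[r][c] -= lr * (up - down) / (2 * eps)

# After fitting, probe distances approximate the gold tree distances.
print(round(probe_distance(B, H[0], H[1]), 2))
```

The point of the exercise is the objective, not the optimiser: if such a B exists for real representations, the tree metric is linearly recoverable from them, which is exactly what the paper tests.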

Settling the debate between ELMo and BERT

As I am typing this, more models are brewing somewhere (XLNet and RoBERTa are already out). However, various works have compared ELMo and BERT on different kinds of tasks. The original BERT paper reports that BERT outperforms ELMo on the GLUE (General Language Understanding Evaluation) suite of tasks.

In another line of work, Tenney et al. 2019 compare BERT and ELMo on different syntactic and semantic tasks like POS tagging, named entity recognition and relation extraction, against a set of lexical baselines. BERT performs better on most of the syntactic and semantic tasks.

Between BERT and ELMo, BERT seems to perform better in general.

Contextual Word Representations in Practice

ELMo provides three representations, and a linear combination of them is used for a downstream task. BERT provides representations from the different layers of the Transformer, which may likewise be linearly combined and used in downstream tasks. The linear combination can be learnt, or a simple average can be taken. The parameters of the ELMo or BERT model can be fine-tuned for a specific task, or they can be frozen. There are a lot of decisions to make, and there are efforts to answer such questions (see To Tune or Not to Tune for more information). Here are some general results:

  • Just adding a linear layer performs well for most tasks - Adding a linear layer on top of the contextualised word representations performs well on many tasks. However, it performs poorly on some tasks like named entity recognition. For such tasks, adding more parameters by training a bi-LSTM or a multi-layer perceptron on top of the contextualised word representations gives better results.
  • Task-trained contextualisation is necessary in some cases - Contextual representations outshine their non-contextualised counterparts on syntactic tasks, but not on tasks demanding semantic understanding. Some representational capacity might be spent on doing well on the pre-training task itself, suggesting that task-specific parameters are needed to do well on semantic tasks.
  • How transferable are the different layers - What does it mean for a layer to be transferable? How well does a particular layer perform across a range of tasks? In ELMo, lower layers perform consistently well across a suite of tasks. A similar trend is not observed for BERT, where some of the middle layers perform best.
  • The lower layers are more robust to the pre-training task - The top layers of the ELMo model perform best on language modelling (the task used to pre-train the ELMo architecture), while transfer performance decreases monotonically as we move from layer 0 upwards.


Contextual word representations are taking the world by storm. They are a drop-in addition to the traditional word representations that are so prevalent in modern NLP. We have seen that learnt representations, especially those from a language model, perform close to the state of the art on many language understanding tasks, with little modification to existing architectures. In a deep neural network, more abstract information is captured at higher layers, akin to ImageNet features in the computer vision community. It is quite fascinating to see that contextual word representations also capture syntactic tree structure in their latent space. Releasing newer, better and bigger models is the norm of the day, and the NLP community is in a state of flux, albeit an interesting one. The fc7 of NLP has probably arrived.


BERT changes its decision based on evidence from lower layers

One of the interesting sections in the BERT Rediscovers the Classical NLP Pipeline paper is the visualisation of how BERT changes its decision at different layers. Let us look at the example below.


Consider the sentence “he smoked Toronto in the playoffs with six hits”. As a human reading from left to right, if you are asked to do NER right after reading “Toronto”, you would guess it is a place. But if you are a sports enthusiast, then after reading the whole sentence you can infer that “Toronto” is an organisation (a sports club from Toronto).

The vertical bars in the image above show the normalised probabilities of the different named entity labels. The X-axis represents the layer at which the representation is taken; for example, at layer 2, representations from layer two and below are used to make the decision. The blue colour shows the confidence in the correct label, and the yellow colour indicates the highest probability assigned amongst the wrong labels. As you can see, at layer 0 the confidence in the wrong label is high. Now consider the second panel, which shows the confidence for semantic role labelling: the model becomes more and more confident that “Toronto” is the thing getting “smoked”. The model's confidence that “Toronto” is an organisation rather than a city then increases in the layers after it decides on the semantic role. In some way, the model revises its decision about the named entity once the semantic role is confirmed.

BERT shows a trend of revising decisions made at lower layers based on information from higher layers

However, the paper does not confirm this quantitatively; it lists many more such examples. A more quantitative investigation into this phenomenon of decision reversals is needed to make it concrete. Newer papers that analyse how attention influences the final classification results deal with such decision reversals (Is Attention Interpretable?).