It was once true that classifying images was a difficult task for machines. The introduction of Imagenet and the rapid advances in compute power have made image classification a reality. Features from neural networks trained on Imagenet have been reused for tasks like object detection, segmentation and image/video captioning. These features from deep neural networks are known to encode abstract information. The top layer from which features were extracted was named fc7, hence the term fc7 features.

While computer vision advanced by using fc7 features, Natural Language Processing was just starting to represent words in a way that would enable efficient text classification. Words, which were traditionally represented using integers, were now represented as continuous vectors. The meaning of a word is distributed across the vector, hence they are also known as distributed representations. Such continuous representations are used as the first layer in complex neural network architectures. They became responsible for the improvement in many NLP tasks, including text classification, sentiment analysis and named entity recognition. For a layman's introduction to distributed representations, you can refer to Noah Smith’s article Contextual Word Representations: A Contextual Introduction.

The computer vision community found fc7 features to be useful for many diverse tasks. What is their equivalent in NLP?

Yet there was still no radical development that enabled capturing different levels of abstract information from a piece of text using deep neural networks. The fc7 of NLP had still not arrived. There was no way to take pre-trained features and fine-tune them on downstream tasks. What is the equivalent of the Imagenet dataset in NLP that can enable learning of abstract information? What characteristics of text should be captured to enable good performance on down-stream NLP tasks? These are some open questions whose answers are yet to be found.

In recent times, there has been a spurt of development in this area. Answers to some of these questions have been pursued and significant performance improvements have been found. The resulting techniques are called contextual word representations, and they enable Imagenet-like transfer learning for NLP. Now that we have spoken about the context (no pun intended), I would like to highlight recent developments in this area in what may become a multipart blog post.

  • Contextual Word Representations - Part-1
  • Probing Contextual Word Representations Part-2

Part 1 talks about general word representations including Elmo, BERT, Sci-BERT and ULMFit. They differ in the neural architectures and training tasks used to learn representations. We will see how they have improved the performance on many natural language understanding tasks.

Characteristics of text captured in these word representations are still not understood very well. Do they learn syntactic information well in the first few layers and semantic information in the deeper layers? How can we devise tests to confirm this, and what are the main take-aways? These questions are the focus of Part 2.

Before we start, I would want to highlight other blog posts that talk about the same topic.

We will talk about Elmo, Bert, Sci-Bert and ULMFit.

Elmo - Embeddings from language models

The meaning of a word depends on its context. For example:

  • The banks of the river provide a viable opportunity for all small businesses selling handmade crafts
  • The Ministry of Finance announced that all the banks would have to adhere to the new rules

Here, bank refers to the banks of a river in the first sentence and to a financial institution in the second. One of the first works to come up with the idea that the representation of a word should depend on its context is titled “Deep Contextualized Word Representations” by Peters et al. (2018).

Language models are probabilistic models that predict the plausibility of a sequence of words in a language. LSTMs and other RNN variations are often used to learn a language model. Elmo considers the hidden state of a bi-directional language model as contextual representations. A bi-directional LSTM with one layer is shown in the image below.


The hidden states of the two layers are combined empirically (sometimes concatenated, sometimes averaged) before being passed to a softmax layer to predict the next word. Training a language model is a self-supervised task, so any free-form text can be used.
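
To make "plausibility of a sequence of words" concrete, here is a minimal sketch. It is not Elmo's model: a count-based bigram model with add-one smoothing stands in for the LSTM, and the toy corpus and function names are my own. What it shares with the real thing is that it is trained on raw text with no labels.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count left contexts and bigrams from raw, unlabelled sentences."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["[BOS]"] + sentence.lower().split() + ["[EOS]"]
        vocab.update(tokens)
        for prev, curr in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, curr)] += 1
    return context_counts, bigram_counts, len(vocab)

def log_prob(sentence, context_counts, bigram_counts, vocab_size):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    tokens = ["[BOS]"] + sentence.lower().split() + ["[EOS]"]
    total = 0.0
    for prev, curr in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, curr)] + 1
        denominator = context_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total

# Toy corpus (hypothetical); any free-form text would do.
corpus = [
    "the banks of the river",
    "the banks follow the new rules",
    "the river is wide",
]
model = train_bigram_lm(corpus)
plausible = log_prob("the banks of the river", *model)
implausible = log_prob("river the of banks the", *model)
print(plausible > implausible)
```

A word order that the model has seen scores higher than a scrambled one. An LSTM language model does the same job with learned hidden states instead of counts, and those hidden states are what Elmo reuses.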

Elmo trains the Bi-LSTM language model on huge datasets that have millions of sentences. It argues that using the hidden states from different layers of a Bi-LSTM language model results in good representations that depend on the surrounding context. Going back to the example of the word “bank”, a different representation is learnt for the word based on its context, which helps in capturing polysemy (one word having multiple meanings depending on the context). Hence the term contextual representations.

Elmo in essence argues that using the hidden states from different layers of a Bi-LSTM language model is a good generic representation of words.

So how can Elmo be used in down-stream tasks? A linear combination of the hidden representations from the two layers can be used for down-stream tasks. The linear combination can be learnt jointly with the down-stream task, or a simple average can be used. The paper suggests that learning the linear combination for every task produces the best performance.

For example, if you have a sentence “Quick brown fox”, the representation for quick is formed by linearly combining its representation in layer-1 and layer-2.

$$w_{quick} = \underbrace{s_1}_{\text{weight for layer 1}} \cdot \underbrace{h_{quick, 1}}_{\text{repr. for quick in layer 1}} + s_2 \cdot h_{quick, 2}$$

The layer-specific weights $s_{1}, s_{2}$ can be learnt as part of the down-stream task, or they can be fixed to equal values (1/2 each) to average the representations of a word from the two layers. Learning the linear combination along with the final task gives the maximum gains.
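
The combination is only a weighted sum, which a few lines of plain Python can show. This is a sketch under my own assumptions: the vectors are made up, and the function names are mine. As far as I recall, the paper passes the raw weights through a softmax and multiplies the result by a task-specific scale, so I have included both (`gamma` is that scale).

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_vectors, raw_weights, gamma=1.0):
    """Weighted sum of the per-layer vectors of one word."""
    weights = softmax(raw_weights)
    dim = len(layer_vectors[0])
    combined = [0.0] * dim
    for weight, vector in zip(weights, layer_vectors):
        for i in range(dim):
            combined[i] += gamma * weight * vector[i]
    return combined

# Made-up 3-dimensional hidden states for "quick" in layer 1 and layer 2.
h_quick_1 = [1.0, 0.0, 2.0]
h_quick_2 = [3.0, 4.0, 0.0]

# Equal raw weights give a plain average of the two layers.
w_quick = combine_layers([h_quick_1, h_quick_2], raw_weights=[0.0, 0.0])
print(w_quick)
```

When the weights are learnt with the down-stream task, `raw_weights` and `gamma` are ordinary trainable parameters and everything else stays the same.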

Ideally, the parameters of the pre-trained Bi-LSTM model should be frozen before using it for downstream tasks. However, the authors backprop through the parameters of the Bi-LSTM model while learning down-stream tasks. Although this adapts the representation to the target domain, it leaves unclear how much is gained by using the pre-trained representation directly, without any further learning.

Elmo achieves state of the art on 6 different NLP tasks, which demand a syntactic and semantic understanding of the English language. Representations from lower layers of the Bi-LSTM language model help with syntactic tasks like POS tagging, whereas the top layers help with tasks that demand a semantic understanding of the text. Elmo representations also help in capturing polysemy. Elmo can also help with tasks that have very little training data.

Bert - Bi-directional encoder representations from transformers.

While Elmo used language modelling as the task for learning generalised representations, Bert uses a different task and a different model to learn word representations. The authors argue that the advantage lies in the formulation of a different task to learn representations. The two tasks that provide this advantage are:

Masked Language Model

In a masked language model, some of the words are masked and the network predicts only the masked words. In the previous example, the word “brown” is replaced with MASK and the network learns to predict the masked word based on the contextual representation of its surrounding words.
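
Here is a small sketch of how such training examples could be built. The helper names are mine, and the 15% masking rate is the figure I recall from the paper, so treat it as an assumption. The paper also sometimes keeps or randomly swaps a selected word instead of masking it, which I leave out here.

```python
import random

MASK = "[MASK]"

def mask_positions(tokens, positions):
    """Mask the given positions; return the masked tokens and the words to predict."""
    masked = list(tokens)
    targets = {}
    for position in positions:
        targets[position] = tokens[position]
        masked[position] = MASK
    return masked, targets

def choose_positions(num_tokens, mask_prob=0.15, seed=0):
    """Pick each position independently with probability mask_prob."""
    rng = random.Random(seed)
    return [i for i in range(num_tokens) if rng.random() < mask_prob]

tokens = ["quick", "brown", "fox"]
masked, targets = mask_positions(tokens, [1])
print(masked)   # ['quick', '[MASK]', 'fox']
print(targets)  # {1: 'brown'}
```

The network sees `masked` as input and is only scored on the entries of `targets`, which is what "predicts only the masked words" means in practice.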

Next-Sentence Prediction

This is a simple binary prediction task where, given two sentences, the network tries to predict whether the second sentence follows the first sentence or not.

An example from the paper reads

Input(Sentence 1) - The man went to the store

Input(Sentence 2) - he bought a gallon of milk

The second sentence follows the first sentence. Even here, some of the words might be masked and the network is forced to learn representations to predict whether the second sentence follows the first.
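
The training pairs for this task can be generated from plain text without any labelling effort. The sketch below is my own simplification: half of the time the true next sentence is kept, otherwise another sentence from the same list is substituted. As I understand it, the paper draws the replacement from the rest of its corpus. The function names and the third and fourth sentences are hypothetical.

```python
import random

def make_nsp_examples(sentences, seed=0):
    """Build (first, second, label) triples from consecutive sentences."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to sample a negative")
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        first, true_next = sentences[i], sentences[i + 1]
        if rng.random() < 0.5:
            second, label = true_next, "IsNext"
        else:
            # Any sentence other than the pair itself serves as a negative.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            second, label = rng.choice(candidates), "NotNext"
        examples.append((first, second, label))
    return examples

def format_pair(first, second):
    """Join two sentences with the special tokens used to mark the pair."""
    return f"[CLS] {first} [SEP] {second} [SEP]"

sentences = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "they live in the southern hemisphere",
]
for first, second, label in make_nsp_examples(sentences, seed=1):
    print(format_pair(first, second), "->", label)
```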

These two tasks are performed jointly and the internals of the model are exposed as representations. The representations are generic enough to perform very well on many tasks that demand syntactic and semantic understanding of text. (It’s magic, isn’t it?)

How can Bert representations be used in down-stream tasks? The BERT model is followed by a linear layer, and these are the only new parameters introduced for the new task. However, the parameters of the language model itself are fine-tuned while training for downstream tasks.

How well does Bert perform when there is no fine-tuning of the transformer model? The paper answers this clearly in a separate section. Even by just fine-tuning the linear layer parameters, sufficient gains can be obtained.
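
To give a feel for how little is added on top, here is a sketch of such a linear layer followed by a softmax. The feature vector stands in for whatever the pre-trained model outputs for a sentence; the numbers and names are made up, and the hidden size of 768 is the base model size as I remember it.

```python
import math

def linear_head(features, weights, bias):
    """Map one feature vector to class probabilities (linear layer + softmax)."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def count_new_parameters(hidden_size, num_labels):
    """Parameters of the linear layer: one weight row and one bias per label."""
    return hidden_size * num_labels + num_labels

# Made-up 3-dimensional sentence vector and a 2-class head.
features = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, 0.3], [-0.3, 0.0, 0.1]]
bias = [0.0, 0.0]
probs = linear_head(features, weights, bias)
print(probs)
print(count_new_parameters(hidden_size=768, num_labels=2))  # 1538
```

Under these assumptions a binary classifier adds only 1538 new parameters, which is tiny compared to the pre-trained model underneath it.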

Sci-Bert: Pre-trained Contextualised Embeddings for Scientific Text

Millions of scientific articles are published every year all around the world. Automatic understanding of scientific articles helps us organise scientific knowledge, but the growing amount of scientific literature poses various challenges, for example, identifying the logical structure of a document. Classifying sentences in a scientific document into categories like equation and section header can have numerous downstream applications. Elmo and BERT train their language models on generic text like Wikipedia. Sci-Bert considers the full texts of scientific articles to learn word representations. Papers from the biomedical and computer science domains are used to pre-train a transformer model with the same objective as BERT.

The authors use Sci-BERT in various downstream tasks like Named Entity Recognition in the biomedical and computer science domains. Out of 14 tasks, they achieve state of the art on 8 (>50% hit-rate). The training data for Sci-BERT contains only computer science and biomedical papers, and the downstream tasks on which it performs well are also from the same domains. Whether it performs well on papers from other domains is a question yet to be answered.

ULMFit - Universal Language Model Fine-tuning for Text Classification

fc7 features have propelled image classification, object detection and image captioning, among other applications. But the same is not true for text classification. ULMFit is a method that suggests that fine-tuning language models is a good approach to achieve transfer learning in NLP. Language models overfit on small training datasets and suffer from forgetting once fine-tuned on new datasets. To avoid this, ULMFit suggests new techniques, some of which are inspired by techniques from the computer vision community.

The paper suggests three steps to obtain state-of-the-art results in text classification with minimal effort.

  • Language Model Training - Training a language model on generic text like Wikipedia. The paper uses a model called the AWD-LSTM language model
  • Fine Tuning Language Model - The language model is fine-tuned using data from the target task
  • Target Task Classification - Finally, a target task-specific model is fine-tuned.

Here are some more details about each of these three steps.

Language Model Training

The paper uses a model called AWD-LSTM for language model training. Wikipedia text is used to train the language model. One may notice that they use a very specific language model for training. Does the quality of the language model affect the final classification results? As the paper's tests show, using a vanilla language model produces bad performance, especially when the text classification task has very little training data.


Fine Tuning Language Model

Fine-tuning the language model on the target task data is the next step suggested by the paper. To do this effectively, the paper suggests two techniques, each a small but effective modification to standard fine-tuning:

  • Discriminative fine-tuning - The last layers of the model require higher learning rates because they are more task-specific. For a new task with a small amount of training data, a larger learning rate in these layers helps. The authors borrow the idea from the computer vision community and show that discriminative fine-tuning helps
  • Triangular Learning Rate - During the fine-tuning of the language model, the learning rate is increased for some number of iterations and then reduced. The paper finds that a small number of iterations with increasing learning rate, followed by a large number of iterations with decreasing learning rate, is key
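
Both tricks are easy to write down. The sketch below follows the formulas as I remember them from the paper, where the schedule is called slanted triangular learning rates. Each lower layer gets the learning rate of the layer above it divided by 2.6, and the schedule rises linearly for the first 10% of the steps before decaying linearly. Treat the constants (2.6, 0.1, 32, 0.01) and the function names as assumptions.

```python
import math

def discriminative_learning_rates(top_layer_lr, num_layers, decay=2.6):
    """Learning rate per layer, ordered from the lowest layer to the top layer."""
    return [
        top_layer_lr / (decay ** (num_layers - 1 - layer))
        for layer in range(num_layers)
    ]

def slanted_triangular_lr(step, total_steps, max_lr=0.01, cut_frac=0.1, ratio=32):
    """Short linear warm-up to max_lr, then a long linear decay."""
    cut = max(1, math.floor(total_steps * cut_frac))  # step at which the peak is reached
    if step < cut:
        progress = step / cut
    else:
        progress = 1 - (step - cut) / (cut * (1 / cut_frac - 1))
    return max_lr * (1 + progress * (ratio - 1)) / ratio

layer_rates = discriminative_learning_rates(top_layer_lr=0.01, num_layers=3)
print(layer_rates)  # lowest layer first, top layer last

for step in (0, 50, 100, 500, 1000):
    print(step, slanted_triangular_lr(step, total_steps=1000))
```

With 1000 steps the peak is at step 100, so the model spends 10% of the time warming up and 90% slowly cooling down, which matches the short increase and long decrease described above.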

Target Task Classification

The target task-specific architecture is very minimal: two linear blocks with non-linearities and batch normalisation. Again, the authors suggest layer-wise unfreezing, where the last layer is unfrozen and trained for one epoch, followed by the layer before it, and so on up to some number of layers. This kind of unfreezing is a novel suggestion from the paper and is found to help.

The paper thoroughly tests its techniques on 6 different text classification tasks and achieves state-of-the-art results on all 6 of them. The paper suggests that the different tips and techniques help in effective transfer learning.

The nitty-gritty of the paper is quite exhaustive. For example, the choice of language model seems to have a lot of impact on the results. Using a sub-par language model leads to bad results, especially when the training dataset is small. The authors also suggest that fine-tuning the language model itself is useful when the training dataset for classification is large; for smaller datasets it seems to have no impact. This paper makes a bold effort to fine-tune the language model to obtain a parallel to the techniques that have served the computer vision community well.
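
The unfreezing order can be written as a simple schedule. This is only a sketch of the idea as described above. The layer names are hypothetical, and in a real training loop each entry would decide which parameters receive gradient updates in that epoch.

```python
def gradual_unfreezing_schedule(layer_names, num_epochs):
    """For each epoch, list the layers that are trainable, starting from the last layer."""
    schedule = []
    for epoch in range(num_epochs):
        num_unfrozen = min(epoch + 1, len(layer_names))
        schedule.append(layer_names[len(layer_names) - num_unfrozen:])
    return schedule

# Hypothetical layer names, ordered from input to output.
layers = ["embedding", "lstm_1", "lstm_2", "lstm_3", "classifier"]
schedule = gradual_unfreezing_schedule(layers, num_epochs=3)
for epoch, trainable in enumerate(schedule, start=1):
    print(epoch, trainable)
# 1 ['classifier']
# 2 ['lstm_3', 'classifier']
# 3 ['lstm_2', 'lstm_3', 'classifier']
```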


A common theme in most of these papers is pre-training a language model. Why should language models work well as pre-training for downstream tasks? A lot of work has been done to understand what language models learn. These studies indicate that language models automatically learn transferable long-term syntactic dependencies and the hierarchical structure of text, among other things. Language modelling therefore seems like a good choice for pre-training in NLP.

Throughout this blog post, we have mentioned how pre-training helps in improving many down-stream tasks. If you want to know the characteristics of language captured by the different techniques, jump to Part 2.