I work in a group called WING (Web Information Retrieval and Natural Language Processing Group) at the National University of Singapore (NUS). Our group has been relentless in using Natural Language Processing to understand scientific articles and derive insights from them. What is so special about scientific articles, you ask? That’s exactly the kind of question my friends, examiners and paper reviewers ask me every time I tell them what we work on. This post provides a brief overview of the field of scientific document processing and why it matters. What are the typical characteristics of this field compared to processing documents like news articles? I will also provide some insights into a few specific computational tasks addressed in this field and some examples of how this quiet field reinforces, and gets reinforced by, developments in Natural Language Processing.

What is Scientific Document Processing?

Scientific endeavours entail disseminating findings in an easy-to-understand manner, and scientific articles are one of the most influential means of spreading scientific knowledge. Articles are reviewed by other experts and, upon acceptance, published in conferences and journals, where they are available online as PDF documents. Over the past few years, the number of scientific articles published online has increased rapidly. The biomedical sciences alone have more than 26 million papers in total, with over 500 thousand published every year (https://arxiv.org/abs/1905.07870). Such a large number of articles makes it challenging for scientists to assimilate even a small portion of the available documents. How can computers help scientists organise such enormous amounts of knowledge? How can they help scientists understand a field quickly and make meaningful contributions without repeating efforts? The grand vision of scientific document processing is to develop effective methods to answer such questions.

Looking at this a little closer, we can think about the tools that scientists need to help them organise this information. One common scenario is getting familiar with a new topic in a short amount of time. A handy tool would retrieve relevant papers in the field and provide a summary of each article in the area, among other things. Each of these requires analysing scientific papers and their structure, extracting relevant information from them, and presenting it in a manner suitable for consumption.

Many automated systems have already incorporated such features in recent times. Google Scholar shows information like the total number of citations for a paper and the total number of citations a scientist has accumulated over their entire career. However, this is not genuinely effective, as researchers seek more than a few metrics. Semantic Scholar is a relatively new tool that goes beyond gathering numbers and tries to gain a semantic understanding of scientific articles using Natural Language Processing methods. It shows the relevant figures and tables in a scientific article. It also highlights the text excerpts relevant to a citation and lists papers that are highly influenced by a particular paper, among many other things. Can machines ingest all the knowledge pertinent to a specific topic and generate an initial draft of a hypothesis that can be further enhanced and tested by human beings? Tools like iris.ai (https://iris.ai) have the grand vision of solving such problems. Analysing scientific documents to help scientists is a topic that is becoming popular, with more and more companies trying to help scientists do their best work.

The Computational Tasks in Scientific Document Processing

Let us look at a couple of rudimentary tasks in scientific document processing. One of the characteristic features of scientific articles is that they refer to other papers. Scientists cite other documents within their field for various purposes: to acknowledge a related article, to compare the methods and materials used in one paper with those used in another, or to indicate background knowledge in the referred article. Such different functions of citations are called **citation intent**. How would understanding citation intent help scientists? It can help them retrieve all the competing methods in a particular field, or qualitatively understand the nature of citations to their own articles and get an overview of the community’s opinion.
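As a toy illustration of the task, consider a cue-phrase baseline for citation intent. The phrase lists and label names below are my own assumptions for the sketch, not taken from any particular dataset; real systems learn these signals from annotated corpora.

```python
# Hypothetical cue phrases per intent label; real systems learn such
# signals from annotated data rather than hand-written lists.
CUE_PHRASES = {
    "comparison": ["compared to", "outperforms", "in contrast to"],
    "method": ["we adopt", "we use", "following"],
    "background": ["has been studied", "prior work"],
}

def classify_citation_intent(context: str) -> str:
    """Assign a citation context sentence to an intent label by cue-phrase
    matching, falling back to 'background' (typically the majority class)."""
    text = context.lower()
    for intent, cues in CUE_PHRASES.items():
        if any(cue in text for cue in cues):
            return intent
    return "background"
```

Even this crude baseline shows why the task is useful: running it over every citation context in a corpus lets you filter, say, all citations made for comparison.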

Scientists require information about the different materials (chemicals, datasets, etc.) used to achieve particular tasks (gene editing, sentiment classification, etc.) and the various processes (CRISPR, neural network classification) used to tackle them. Meeting such requirements means analysing the text, extracting the different tasks, processes and materials used, and linking them. Computationally, this task is called key-phrase extraction and linking. It enables scientists to query for papers that follow a particular process or use a specific material.
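A classic unsupervised starting point for key-phrase extraction is to take runs of content words between stopwords and punctuation as candidate phrases, in the spirit of RAKE. A minimal sketch (the stopword list here is a tiny illustrative stand-in):

```python
import re
from collections import Counter

# Tiny stopword list for illustration only; real systems use fuller lists.
STOPWORDS = {"the", "a", "an", "of", "for", "to", "and", "in", "on",
             "with", "we", "is", "are", "that", "this", "use", "using"}

def extract_keyphrases(text: str, top_k: int = 3) -> list[str]:
    """RAKE-style candidates: split the text at punctuation and stopwords,
    keep the remaining runs of content words, and rank them by frequency."""
    candidates = []
    for fragment in re.split(r"[^\w\s-]", text.lower()):
        run = []
        for word in fragment.split():
            if word in STOPWORDS:
                if run:
                    candidates.append(" ".join(run))
                run = []
            else:
                run.append(word)
        if run:
            candidates.append(" ".join(run))
    return [phrase for phrase, _ in Counter(candidates).most_common(top_k)]
```

The linking step, deciding that "neural networks" in one paper and "deep neural nets" in another denote the same process, is the harder half and usually needs supervision or external knowledge bases.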

What is so different?

What makes processing scientific documents different from processing news articles and social media posts? Several characteristics are typical of scientific papers. They pose challenges to modern methods in Natural Language Processing and, at the same time, provide opportunities to concoct novel models. One such problem is that scientific articles are much longer than an average news article. Modern deep learning methods in Natural Language Processing, which process text sequentially, are effective for short to medium-length text. While newer methods to process longer documents are emerging, scientific document processing can serve as a testbed for experimentation.
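A common workaround for the length problem is to slide an overlapping window over the document so that a fixed-length encoder can see every token. A minimal sketch, assuming a hypothetical encoder with a 512-token limit:

```python
def chunk_tokens(tokens, window=512, stride=384):
    """Split a long token sequence into overlapping windows so that a
    fixed-length encoder (here assumed to accept at most `window` tokens)
    can process the whole document. Consecutive windows overlap by
    `window - stride` tokens so no context is cut off at a hard boundary."""
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return chunks
```

Each chunk is encoded independently and the per-chunk outputs are then pooled or aggregated, a design choice that itself becomes a research question for long scientific documents.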

Today, machine learning in Natural Language Processing is dominated by huge models that ingest humongous datasets to deliver improved empirical performance. Creating such datasets is time-consuming and requires large-scale human effort, and in real-world scenarios such large amounts of data are not always available. Moreover, scientific text consists of terms that can be understood only by people with background knowledge of the subject, so forming large-scale datasets for scientific document processing is harder than for text in the general domain. Consider writing summaries for research papers: a human annotator has to first read the paper, analyse it and come up with a summary. Given that research papers deal with specialised topics, the annotator has to be well versed in them. Annotating datasets with key-phrases is harder still, since the annotator has to understand the different processes, materials and tasks that are common in a particular scientific discipline. Scholarly document processing therefore has to deal with scarce resources, which forces researchers to consider alternatives to mainstream methods.

Validation of an idea in science is as vital as, if not more vital than, the idea itself. Scientists seek validation from a related community of researchers, and scientific articles cite other papers in the field to position their work with respect to other ideas. Other documents on the Web are also assembled from various sources of information, but those sources are not explicitly mentioned (https://link.springer.com/article/10.1007/s00799-018-0261-y). Citations, by contrast, make these links explicit: citation analysis enables studying a collection of papers as a whole and provides a new perspective on a topic. It opens the doors to examining the authority of authors and analysing sub-groups of scientists who work on related fields, among many other applications.
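The simplest citation analysis treats papers as nodes and citations as directed edges; a paper's in-degree (its citation count) is then a crude proxy for authority, the quantity that algorithms like PageRank refine. A minimal sketch over hypothetical paper identifiers:

```python
from collections import defaultdict

def citation_counts(citations):
    """Given an iterable of (citing_paper, cited_paper) pairs, return how
    often each paper is cited -- the in-degree of the citation graph and a
    crude proxy for a paper's authority in the collection."""
    in_degree = defaultdict(int)
    for _citing, cited in citations:
        in_degree[cited] += 1
    return dict(in_degree)
```

The same edge list supports richer analyses: co-citation (papers cited together) and bibliographic coupling (papers citing the same sources) are standard ways to find sub-communities.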

Scientists also follow a concrete structure in their discourse. A typical scientific article contains an abstract, an introduction that motivates the problem and lists its challenges, followed by the methodology, experiments, results and discussion. Incorporating such discourse structure is necessary for improving some of the tasks in scientific document processing. Consider citation intent classification, where the purpose of a citation is determined: the section in which a citation occurs can help determine its purpose. If you have read conference papers, you will have noticed that the related work section is used to distinguish the current work from previous works. How to model such information about the structure of the document becomes a key consideration in processing scientific articles.
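To make the idea concrete, here is a toy illustration of letting the enclosing section inform citation intent. The section names, the section-to-intent prior and the cue words are all assumptions made for this sketch; a learned model would combine such features statistically rather than by rule.

```python
# Hypothetical prior: which intent a section tends to host.
SECTION_PRIOR = {
    "related work": "background",
    "method": "method",
    "experiments": "comparison",
}

def intent_with_section(context: str, section: str) -> str:
    """Toy combination of a textual cue with the discourse-structure
    feature: a strong cue in the context wins, otherwise fall back on
    the prior for the section the citation appears in."""
    text = context.lower()
    if "outperform" in text or "compared" in text:
        return "comparison"  # strong textual cue overrides the prior
    return SECTION_PRIOR.get(section.lower(), "background")
```

The point of the sketch is the feature, not the rules: even this crude section prior changes predictions, which is why discourse structure is worth modelling.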

Given such challenges, it is interesting to see how modern deep learning methods are adapted to tackle them. Without delving into details, researchers often resort to various semi-supervised learning techniques in Natural Language Processing; multitask learning, transfer learning and domain adaptation are emerging approaches. For more information about the different transfer learning techniques, you can refer to Sebastian Ruder’s blog post on this topic (https://ruder.io/state-of-transfer-learning-in-nlp/).


This blog post serves as an introduction to the nature of the problems tackled in scientific document processing. It also describes some primitive tasks, like citation intent classification and key-phrase extraction, which can eventually help scientists be more productive. How is it different from general Natural Language Processing? Scientific articles have characteristics unique to them: they allow easy analysis of a collection of papers through citation analysis, the structure of the paper plays a role in performing well on many rudimentary tasks, and the field enforces constraints such as minimal training data that challenge modern data-hungry methods.

Scientific document processing borrows techniques and concepts from Natural Language Processing and reinforces the broader field as well. For example, using language model representations to improve other NLP tasks (along with the likes of ELMo, BERT and their cousins) was tried on key-phrase extraction for scientific text before such representations proved useful for the general field of Natural Language Processing. The fruits of scientific article processing are not directly tangible to the general public: unlike Google, search engines for scientific articles cater only to a particular breed of users, namely scientists. The goal is to enable them to be non-repetitive and cognisant of current work in their field, and to provide tools that help them work better. Scientific document processing strengthens, and is strengthened by, the broader field through techniques invented due to the constraints that inherently exist.