The Intuition for NLP. An Explanation of ERNIE 2.0

Even though ERNIE 2.0 came out what feels like a long time ago by ML standards, it’s still an excellent model to study for understanding the larger context of NLP.

Natural language processing (NLP) has been a leading sub-field of machine learning, with pre-trained networks showing dominant results in various language understanding tasks. My goal is to explain ERNIE 2.0 in a way that provides a solid framework for understanding the newest developments in natural language processing.

The Big Dilemma of NLP

In regular supervised neural networks, data is labeled and used to mathematically adjust the parameters of the neurons to maximize performance. NLP presents a dilemma: near-endless amounts of unlabeled text exist on the internet in the form of articles, encyclopedias, news, dialog and other language data, but none of it can be used directly with traditional supervised techniques. This has led to a surge in research into extracting as much useful information from unlabeled data as possible, in particular by modeling co-occurrence probability (the chance that words appear beside one another), an approach known as unsupervised learning. After the network has reviewed massive amounts of unlabeled data in a phase called pretraining, a more targeted supervised phase called ‘fine-tuning’ takes place, which uses smaller labeled datasets to finalize training and specialize the network in tasks like:

  • Sentiment analysis
  • Named-entity recognition
  • Question answering
  • Translation

This pretraining phase is largely responsible for many of the advancements in NLP.
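To make co-occurrence probability concrete, here is a minimal Python sketch that estimates how often one word appears directly beside another in a handful of unlabeled sentences. The function name, the toy corpus and the whitespace tokenization are all assumptions made for illustration; real pretraining learns these statistics implicitly inside a neural network rather than by counting.

```python
from collections import Counter

def cooccurrence_probability(sentences, word, neighbor):
    """Estimate how often `neighbor` is the word directly beside `word`."""
    pair_counts = Counter()  # times each (word, adjacent word) pair is seen
    slot_counts = Counter()  # number of adjacent slots observed for each word
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenization
        for left, right in zip(tokens, tokens[1:]):
            # "Beside" is symmetric, so count the pair in both directions.
            pair_counts[(left, right)] += 1
            pair_counts[(right, left)] += 1
            slot_counts[left] += 1
            slot_counts[right] += 1
    if slot_counts[word] == 0:
        return 0.0  # the word never appeared next to anything
    return pair_counts[(word, neighbor)] / slot_counts[word]

# Hypothetical toy corpus: no labels are needed, only raw text.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "a dog sat on the rug",
]
print(cooccurrence_probability(corpus, "cat", "the"))  # 0.5
```

Notice that nobody had to annotate anything: the "labels" come from the text itself, which is what makes this kind of learning possible at internet scale.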

Core Ideas of ERNIE 2.0

The researchers propose a new pre-training framework called ERNIE 2.0, which allowed them to achieve state-of-the-art results.

Idea 1: Increase information extraction from the unlabeled data during the unsupervised phase of learning.

Problem: Current pre-training procedures usually train the model on a few simple tasks that capture the co-occurrence of words and sentences. However, besides co-occurrence, the data contains other valuable information that is being left behind.

Solution: The researchers propose expanding the pre-training tasks beyond the co-occurrence of words and sentences to also include tasks like:

  • A letter capitalization prediction task, in which the network is shown words like ‘paris’ and ‘cat’ and asked which one should be capitalized, in this case ‘Paris’.
  • A document-word relationship task, where a sentence like ‘Cats are predominantly active in the night’ is shown and the network is asked which word is most likely to be repeated in other parts of the same article, the answer in this example being ‘cat’.
  • A sentence reordering task to learn the relationship among sentences. The sentences of a paragraph are split and shuffled randomly. The network is then asked to piece them back together as a paragraph.
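What these tasks have in common is that the correct answers can be generated from raw text without any human labeling. The sketch below shows one way training examples for the capitalization and sentence reordering tasks could be built in Python. The function names and the data format are my own assumptions for illustration, not the formulation used by the ERNIE 2.0 authors.

```python
import random

def capitalization_examples(sentence):
    """Return (lowercased token, was_capitalized) pairs for one sentence."""
    # The model would see the lowercased token and predict the boolean.
    return [(token.lower(), token[0].isupper()) for token in sentence.split()]

def reordering_example(sentences, seed=0):
    """Shuffle sentences and return them with their original positions."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    order = list(range(len(sentences)))
    rng.shuffle(order)
    shuffled = [sentences[i] for i in order]
    # order[k] is the original index of shuffled[k]; this is the label.
    return shuffled, order

def restore(shuffled, order):
    """Rebuild the paragraph from a shuffled copy and its label."""
    restored = [None] * len(shuffled)
    for position, original_index in enumerate(order):
        restored[original_index] = shuffled[position]
    return restored

# Hypothetical inputs, taken straight from unlabeled text.
caps = capitalization_examples("I visited Paris with my cat")
paragraph = [
    "Cats sleep during the day.",
    "They wake up at dusk.",
    "Then they hunt at night.",
]
shuffled, order = reordering_example(paragraph)
print(caps)
print(shuffled, order)
```

The original capitalization and the original sentence order act as free labels, so every article on the internet becomes a source of training examples for these tasks.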

Idea 2: Reduce the loss of learned progress when changing tasks.

Problem: Current pre-training approaches tend to forget information when moving from one task to another. Picture a ship leaking from multiple holes. A network adjusts its parameters to satisfy the first task (plugging the first leak), and when it encounters the second task it adjusts its parameters again, performing well on the new task at the risk of unlearning what it gained previously. It is like moving the same hand from the first leak to the second, causing the first to leak once more.

Solution: The researchers propose a technique called continual multi-task learning, in which, during the unsupervised phase of training, tasks are introduced incrementally and earlier tasks keep being trained alongside each new one, so that ‘unlearning’ information is penalized. Specifically, the network is asked to:

  1. Perform and adjust parameters to maximize results on Task 1.
  2. Perform and adjust parameters to maximize results on Task 1 and Task 2 at the same time.
  3. Perform and adjust parameters to maximize results on Task 1, Task 2 and Task 3 at the same time.
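The three steps above can be sketched as a simple training schedule in Python. This is a rough illustration under my own assumptions: the function names are hypothetical, and combining tasks by summing their losses is a common simplification, not necessarily how the ERNIE 2.0 authors weight or interleave tasks.

```python
def continual_multitask_schedule(tasks):
    """Return one task list per stage; stage k trains the first k tasks together."""
    return [tasks[: stage + 1] for stage in range(len(tasks))]

def combined_loss(losses_by_task, active_tasks):
    """Sum the losses of all tasks active in the current stage."""
    # Old tasks stay in the sum, so forgetting them raises the loss.
    return sum(losses_by_task[task] for task in active_tasks)

stages = continual_multitask_schedule(["task_1", "task_2", "task_3"])
for number, active in enumerate(stages, start=1):
    print(f"Stage {number}: train on {active}")

# Hypothetical per-task losses, only to show how a stage's loss is formed.
example_losses = {"task_1": 0.5, "task_2": 0.25, "task_3": 1.0}
print(combined_loss(example_losses, stages[1]))  # 0.75
```

Because Task 1 never leaves the loss, the network cannot improve on Task 2 by sacrificing Task 1 without paying for it. In the ship analogy, it has to keep a hand on every leak it has already plugged.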

The story of NLP is in large part the story of effective pretraining on the massive amounts of language data available online. ERNIE 2.0 was more effective because it expanded the definition of pre-training tasks, and reduced the loss of learned progress by using continual multi-task learning.

Thanks for reading! I’m Davide, a 19-year-old self-learner who runs The Feynman Mafia, exploring how learning by explaining can be used to teach yourself any topic.
