Understanding NLP Algorithms: The Masked Language Model

Written by Coursera Staff • Updated on

Expand your understanding of natural language processing with exciting algorithms such as masked language models. Plus, discover how different models compare and where you might have seen masked language modeling in action.


Masked language models help computers understand nuances in text by considering the context of surrounding words, making them valuable within the broader field of natural language processing. Researchers use masked language modeling to pre-train popular models such as BERT so that they interpret human language more effectively. Explore masked language modeling in the context of natural language processing, including popular uses of masked language modeling and how you might see it applied in professional industries.

Overview of natural language processing (NLP)

Natural language processing (NLP) is a growing field that falls under the umbrella of artificial intelligence (AI). It enables machines to understand, interpret, and interact with humans through human language. Society generates vast quantities of text and speech data daily, and NLP algorithms bridge the gap between how humans communicate and what computers can understand. The applications of NLP range broadly, and you’ve likely used a few in your daily life without noticing. Typical applications you might see day-to-day include:

  • Virtual assistants: Virtual assistants like Amazon Alexa can listen to human commands and perform tasks, such as scheduling meetings, playing music, or reciting facts.

  • Spam detection: If you have a “spam” folder in your email, you are already using NLP. NLP algorithms read your emails to look for ones that seem suspicious or follow patterns of known scams.

  • Document processing: NLP algorithms can take large volumes of text and provide summaries to lower the workload on humans.

What is a masked language model?

At its core, a masked language model (MLM) is a type of neural network-based language model that has been trained to predict missing or “masked” words within a piece of text. Masking involves temporarily replacing some words in a sentence with a special placeholder token, often written as [MASK]. The model then predicts what each masked word is, using the surrounding words as context.
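The masking step itself can be illustrated with a minimal sketch. The function name `mask_words` and the `[MASK]` string are illustrative assumptions here; real tokenizers define their own mask token and work on subword pieces rather than whole words.

```python
MASK_TOKEN = "[MASK]"  # assumed placeholder; real tokenizers define their own

def mask_words(words, positions):
    """Return a masked copy of `words` plus the hidden words the model must predict."""
    masked = list(words)
    targets = {}
    for i in positions:
        targets[i] = masked[i]   # remember the original word as the training target
        masked[i] = MASK_TOKEN   # hide it from the model
    return masked, targets

sentence = "I am taking my dog to the vet".split()
masked, targets = mask_words(sentence, positions=[7])
print(" ".join(masked))  # I am taking my dog to the [MASK]
print(targets)           # {7: 'vet'}
```

During training, the model would see only the masked sentence and be scored on how well it recovers the entries in `targets`.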

Understanding BERT and masked language modeling

Masked language modeling is used in the pre-training phase of BERT (Bidirectional Encoder Representations from Transformers), a language model developed by Google AI that significantly changed how machines understand and interpret human language. BERT improves on earlier approaches, such as recurrent neural networks (RNNs), which process text one token at a time and have a more limited capacity to capture long-range context.

One breakthrough with BERT is its ability to infer the context of a word from the whole sentence. Previous models typically analyzed words in sequence, usually from left to right. This was limiting because those models could not use words later in the sequence to understand earlier words. BERT, however, examines each word in the context of the entire sentence, which leads to a more accurate understanding in many cases. When using BERT, you will encounter two main phases: pre-training and fine-tuning.
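The difference between left-to-right and whole-sentence context can be made concrete with a small sketch. The two helper functions below are hypothetical illustrations of which words are available when interpreting a given position; they are not part of BERT itself.

```python
def left_to_right_context(tokens, i):
    """Words a strictly left-to-right model has seen before position i."""
    return tokens[:i]

def bidirectional_context(tokens, i):
    """Words on both sides of position i, as a bidirectional model can use."""
    return tokens[:i] + tokens[i + 1:]

tokens = "The bank of the river was muddy".split()
left = left_to_right_context(tokens, 1)
both = bidirectional_context(tokens, 1)
print(left)  # ['The']
print(both)  # ['The', 'of', 'the', 'river', 'was', 'muddy']
```

In this example, the word that disambiguates “bank” (“river”) appears only to its right, so it is available in the bidirectional context but not in the left-to-right one.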

Unsupervised pre-training

During pre-training, the model learns how language is represented from a large bank of unlabeled text. BERT selects about 15 percent of the tokens in each input sequence for masking, and the model must predict the original tokens from context clues in the surrounding words.
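A rough sketch of this selection step follows. In the original BERT recipe, a selected token is replaced by [MASK] about 80 percent of the time, replaced by a random token about 10 percent of the time, and left unchanged otherwise. The function name, the toy vocabulary, and the per-token coin flip are simplifying assumptions for illustration, not BERT's actual preprocessing code.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder string

def bert_style_mask(tokens, vocab, mask_prob=0.15, seed=None):
    """Select roughly `mask_prob` of the tokens and corrupt them 80/10/10."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)   # None means the position is not scored
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                # token not selected for prediction
        labels[i] = tok             # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_TOKEN          # most selected tokens become [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)   # some become a random token
        # otherwise the token is left unchanged but still predicted
    return inputs, labels

tokens = "i am taking my dog to the vet to get shots".split()
vocab = ["cat", "park", "store", "walk"]  # toy vocabulary, assumed for the example
inputs, labels = bert_style_mask(tokens, vocab, seed=42)
print(inputs)
print(labels)
```

Keeping some selected tokens unchanged or randomized means the model cannot assume that only [MASK] positions need attention, which matters because [MASK] never appears during fine-tuning.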

You can think of the masked-word task like this. Imagine you’re on the phone with a friend and they have a spotty connection. You hear them say, “I’m taking my dog to the [BLANK] to get their annual shots.” While you may not have heard one of the words in the sentence, you would likely fill in the blank with “vet” based on the surrounding words. This is what BERT is trying to master. BERT is also pre-trained with next sentence prediction, where it receives two sentences and predicts whether the second one actually follows the first in the original text.
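Training examples for the sentence-pair task can be sketched as below: about half the pairs use the true next sentence and half use a randomly chosen one. The function name is hypothetical, and drawing the random sentence from the same short document is a simplification that assumes at least three sentences; the original BERT setup samples it from elsewhere in the corpus.

```python
import random

def make_sentence_pairs(sentences, seed=None):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered document."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # positive example: the sentence that really comes next
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # negative example: any sentence except the current and the true next one
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs

doc = [
    "I took my dog to the vet.",
    "She got her annual shots.",
    "Then we walked home.",
    "It started to rain.",
]
for a, b, is_next in make_sentence_pairs(doc, seed=1):
    print(is_next, "|", a, "->", b)
```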

Supervised fine-tuning

After pre-training, additional training on smaller, labeled data sets refines the model to be more specific to the task at hand. This adaptability is what makes BERT a versatile tool for tasks like question answering, text classification, and sentiment analysis.

Examples of masked language models

BERT is one of the pioneering masked language models and is well-respected for its bidirectional context understanding. It considers words on the right and left of the masked word, allowing it to capture contextual information more effectively. While BERT is a popular choice, several types of masked language models are available, many of which are derivatives of BERT, each with its own unique characteristics and applications.


RoBERTa is an extension of BERT with improved training techniques. It uses dynamic masking, which generates a new masking pattern each time a sequence is fed to the model instead of fixing the masked positions once during data preprocessing.
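The contrast between a fixed masking pattern and dynamic masking can be sketched as follows. This is a simplified illustration with hypothetical function names, using a plain per-token coin flip rather than any library's actual data pipeline.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder string

def random_mask(tokens, rng, mask_prob=0.15):
    """Mask each token independently with probability `mask_prob`."""
    return [MASK_TOKEN if rng.random() < mask_prob else t for t in tokens]

def static_masking(tokens, epochs, seed=0):
    """Mask once during preprocessing; every epoch sees the same pattern."""
    rng = random.Random(seed)
    fixed = random_mask(tokens, rng)
    return [fixed for _ in range(epochs)]

def dynamic_masking(tokens, epochs, seed=0):
    """Draw a fresh masking pattern every time the sequence is used."""
    rng = random.Random(seed)
    return [random_mask(tokens, rng) for _ in range(epochs)]

tokens = [f"w{i}" for i in range(200)]
static_views = static_masking(tokens, epochs=5)
dynamic_views = dynamic_masking(tokens, epochs=5)
print(len({tuple(v) for v in static_views}))   # number of distinct static patterns
print(len({tuple(v) for v in dynamic_views}))  # number of distinct dynamic patterns
```

With dynamic masking, the model is asked to predict different words from the same sentence across passes over the data.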


SpanBERT masks contiguous spans of tokens rather than individual words. It also trains the model to predict the content of each masked span from the tokens at the span’s boundaries.
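Masking contiguous runs of tokens, rather than isolated positions, might look like the sketch below. It is loosely modeled on the span-masking idea: span lengths are drawn from a capped geometric-style distribution until a masking budget is reached. The function names and parameter values are assumptions for illustration, and the input is assumed to be non-empty.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder string

def sample_span_length(rng, p=0.2, max_len=10):
    """Keep extending the span with probability 1 - p, capped at max_len."""
    length = 1
    while length < max_len and rng.random() > p:
        length += 1
    return length

def span_mask(tokens, mask_budget=0.15, seed=None):
    """Mask contiguous spans until about `mask_budget` of the tokens are covered."""
    rng = random.Random(seed)
    masked = list(tokens)
    target = max(1, round(len(tokens) * mask_budget))
    covered = set()
    while len(covered) < target:
        # never sample a span longer than the remaining budget
        length = min(sample_span_length(rng), target - len(covered))
        start = rng.randrange(0, len(tokens) - length + 1)
        covered.update(range(start, start + length))
    for i in covered:
        masked[i] = MASK_TOKEN
    return masked, sorted(covered)

tokens = [f"w{i}" for i in range(100)]
masked, covered = span_mask(tokens, seed=3)
print(covered)
```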


DistilBERT is a smaller, cheaper version of BERT. It uses knowledge distillation, a compression technique in which a compact “student” model is trained to reproduce the behavior of the full “teacher” model. This model is faster and requires less storage than the complete model.
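One common way for a small model to mimic a larger one is to train it against the larger model's softened output probabilities. The sketch below shows only that soft-target term, with made-up logits and an assumed temperature of 2.0; it is not DistilBERT's full training objective, which combines several loss terms.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

teacher = [2.0, 1.0, 0.1]          # hypothetical teacher logits for three candidate words
close_student = [1.9, 1.1, 0.2]    # student that roughly agrees with the teacher
far_student = [0.1, 1.0, 2.0]      # student that disagrees with the teacher
print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The loss is smaller for the student whose predictions resemble the teacher's, which is the signal that pushes the compact model toward the full model's behavior.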


ELECTRA replaces some tokens in the input with plausible alternatives produced by a small generator model. Instead of predicting only the masked words, the main model then decides, for every token in the sequence, whether it is the original or a replacement.
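In the ELECTRA setup, the training targets for the main model are per-token labels that mark which positions were swapped. A minimal sketch of how those labels could be derived is shown below; the function name is hypothetical, and the corrupted sentence is written by hand rather than produced by a generator model.

```python
def replaced_token_labels(original, corrupted):
    """Label each position 1 if the token was replaced, 0 if it is the original."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must be the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]

original = "the chef cooked the meal".split()
corrupted = "the chef ate the meal".split()   # hand-written stand-in for generator output
labels = replaced_token_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```

Because every position receives a label, the model gets a training signal from the whole sequence rather than only from the masked positions.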


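XLNet is often grouped with masked language models and aims to address limitations in BERT, although it does not rely on [MASK] tokens. It uses a permutation-based training approach, predicting tokens under many different orderings of the sequence so that each word can learn from context on both sides.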
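The permutation idea can be sketched by sampling one prediction order and listing which positions are visible when each token is predicted. This is a conceptual illustration with a hypothetical function name; it leaves out the attention machinery that makes the approach work in practice.

```python
import random

def permutation_contexts(tokens, seed=None):
    """Sample a prediction order and list the positions visible at each step."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)                 # one possible ordering of the positions
    steps = []
    for k, pos in enumerate(order):
        visible = sorted(order[:k])    # positions predicted earlier in this ordering
        steps.append((pos, visible))
    return order, steps

tokens = "my dog went to the vet".split()
order, steps = permutation_contexts(tokens, seed=7)
for pos, visible in steps:
    context = [tokens[j] for j in visible]
    print(f"predict {tokens[pos]!r} given {context}")
```

Across many sampled orderings, each word ends up being predicted from words to its left, to its right, or both.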

Limitations of masked language models

Knowing the limitations of masked language models can help you determine whether this is the right type of algorithm for your needs. Despite their advantages, masked language models such as BERT can require large amounts of training data and storage space. In addition, a pre-trained model usually needs task- or domain-specific training (fine-tuning) before it is useful for a particular job, and a fine-tuned model may not perform well beyond the scope of its training data.

Depending on your needs, a masked language model designed to address a specific limitation may suit your goals better. For example, DistilBERT is faster and requires less storage than BERT, while RoBERTa is trained on a larger data set and may make more accurate predictions in certain contexts.

Industries using masked language models

Many different professional industries use NLP algorithms trained with masked language modeling. If you work as a professional in these industries, you may benefit from becoming familiar with this type of algorithm and technique. Many professionals already use BERT as their preferred NLP model trained with masked language modeling. Applications continue to grow, but BERT’s use cases already include:

  • Predicting loss amounts based on warranty claims.

  • Identifying articles that contain previously overlooked information.

  • Analyzing clinical notes and medical information more effectively.

  • Predicting medical diagnoses based on combined text and images.

  • Performing sentiment analysis on text to uncover the tone and intent of messages.

  • Understanding user intent in search queries to produce more accurate results.

Learn more on Coursera.

Learn more about NLP algorithms with highly rated courses and Specializations from industry leaders and universities. Consider the Natural Language Processing Specialization offered by DeepLearning.AI for a broad foundation. In this Specialization, you will complete four courses on topics such as classification and vector spaces, probabilistic models, sequence models, and attention models.

