Understanding NLP Algorithms: The Masked Language Model

Written by Coursera Staff • Updated on Jan 24, 2026

Expand your understanding of natural language processing (NLP) with exciting algorithms such as masked language models. Plus, discover how different models compare and where you might have seen masked language modeling in action.


Key takeaways

A masked language model is a type of neural network-based language model that has been trained to predict missing (masked) words within a piece of text.

During pre-training, masked language models are exposed to a large amount of unlabeled text and learn language patterns by masking a portion of words (traditionally, about 15 percent), then predicting them from the surrounding context [1].

Researchers use masked language modeling to pre-train popular models such as BERT so that they can interpret human language more effectively.

You can use masked language models to predict loss amounts, identify articles to source information, analyze clinical notes, predict medical diagnoses, and more.

Explore masked language modeling in the context of natural language processing, including popular uses of masked language modeling and how you might see it applied in professional industries. Or, you can start learning now with the Natural Language Processing Specialization. In this Specialization, you'll explore how to use logistic regression, naïve Bayes, and word vectors to implement sentiment analysis, complete analogies, and translate words. Plus, upon completion, you’ll have a shareable certificate to add to your professional profile.

Overview of natural language processing (NLP)

Natural language processing (NLP) is a growing field that falls under the umbrella of artificial intelligence (AI). It enables machines to understand, interpret, and interact with humans through human language. Society generates vast quantities of text and speech data daily, and NLP algorithms bridge the gap between how humans communicate and what computers can understand. The applications of NLP range broadly, and you’ve likely used a few in your daily life without noticing. Typical applications you might see day-to-day include:

Virtual assistants: Virtual assistants like Amazon Alexa can listen to human commands and perform tasks, such as scheduling meetings, playing music, or reciting facts.

Spam detection: If you have a “spam” folder in your email, you are already using NLP. NLP algorithms scan your emails to identify those that seem suspicious or follow patterns of known scams.

Document processing: NLP algorithms can take large volumes of text and provide summaries to lower the workload on humans.

What is a masked language model?

At its core, a masked language model (MLM) is a type of neural network-based language model that has been trained to predict missing or “masked” words within a piece of text. The concept of masking words involves temporarily replacing some words in a sentence with a token. The model then predicts what the masked word is using the surrounding words to provide context.
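
The masking step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular library does it: the `mask_word` helper and the `[MASK]` placeholder string are assumptions chosen for the example, and words are split on spaces rather than with a real tokenizer.

```python
MASK_TOKEN = "[MASK]"  # placeholder string assumed for this example


def mask_word(sentence, position):
    """Replace the word at `position` with MASK_TOKEN.

    Returns the masked sentence and the hidden word the model would
    be asked to predict.
    """
    words = sentence.split()
    hidden = words[position]
    words[position] = MASK_TOKEN
    return " ".join(words), hidden


masked, target = mask_word("The cat sat on the mat", 2)
print(masked)  # The cat [MASK] on the mat
print(target)  # sat
```

A trained model would receive the masked sentence and be scored on how well it recovers the hidden word.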

Understanding the BERT masked language model

Masked language modeling is used in the pre-training phase of Bidirectional Encoder Representations from Transformers (BERT), a language model developed by Google AI that significantly changed how machines understand and interpret human language. BERT improves on earlier approaches, such as recurrent neural networks (RNNs), which have a more limited capacity to capture context.

One breakthrough with BERT is its ability to infer the context of a word in a sentence. Previous models typically analyzed words in sequence, usually from left to right. This was limiting because the algorithms did not use words later in the sequence to understand the context of earlier words. BERT, however, examines words in the context of the entire sentence, which leads to a more accurate understanding in many cases. When using BERT, you will encounter two main phases: pre-training and fine-tuning.

Unsupervised pre-training in masked language modeling

During pre-training, the model learns how language is represented based on a large bank of unlabeled text. In this phase, a percentage of the words in the input sentences are “masked” (traditionally 15 percent), and the model learns to predict the masked words based on context clues from the surrounding words [1].
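
As a rough sketch of this masking step, the snippet below hides about 15 percent of the tokens in a sentence and records the original words as prediction targets. The `mask_tokens` helper, the fixed random seed, and the whitespace tokenization are all assumptions made for illustration. It is also a simplification: BERT's actual procedure sometimes swaps a selected token for a random word or leaves it unchanged instead of always inserting the mask token.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder string assumed for this example


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide roughly `mask_rate` of the tokens.

    Returns the masked token list and a dict mapping each masked
    position to the original token (the prediction targets).
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels


sentence = ("I am taking my dog to the vet to get their annual shots "
            "today and then going home after lunch")
tokens = sentence.split()  # 20 tokens, so 15 percent is 3 tokens
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
print(labels)  # maps each masked position to its original word
```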

One way to think about the unsupervised pre-training phase is to imagine you’re on the phone with a friend who has a spotty connection. You hear them say, “I’m taking my dog to the [BLANK] to get their annual shots.” While you may not have heard one of the words in the sentence, you would likely fill in the blank with “vet” based on the surrounding words. This is what BERT is trying to master. BERT is also pre-trained through next sentence prediction, where it is given a pair of sentences and predicts whether the second sentence actually follows the first in the original text.

Supervised fine-tuning in masked language modeling

After pre-training, additional training on smaller, labeled data sets refines the model to be more specific to the task at hand. This adaptability is what makes BERT a versatile tool for tasks like question answering, text classification, and sentiment analysis.

What is masked language modeling vs. causal language modeling?

The primary difference between causal language models and masked language models is what each can see when making a prediction. A causal language model generates tokens from left to right and can only see the tokens that come before the one it is predicting. In contrast, a masked language model can see tokens bidirectionally, on both the left and the right. This allows a masked language model to answer a fill-in-the-blank question by reading context from both sides of the missing token. A causal language model would only have access to the tokens leading up to the missing token to predict the response.
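
To make the difference concrete, the sketch below lists which words each kind of model could read when filling in a blank. The `visible_positions` helper is hypothetical and only models visibility; it does not predict anything.

```python
def visible_positions(seq_len, target, causal):
    """Return the positions a model may read when predicting `target`."""
    if causal:
        # Causal: only the tokens to the left of the target.
        return list(range(target))
    # Masked: tokens on both sides; the target itself stays hidden.
    return [i for i in range(seq_len) if i != target]


tokens = ["I", "took", "my", "dog", "to", "the", "[BLANK]", "for", "shots"]
target = tokens.index("[BLANK]")

causal_context = [tokens[i] for i in visible_positions(len(tokens), target, causal=True)]
masked_context = [tokens[i] for i in visible_positions(len(tokens), target, causal=False)]

print(causal_context)  # ['I', 'took', 'my', 'dog', 'to', 'the']
print(masked_context)  # ['I', 'took', 'my', 'dog', 'to', 'the', 'for', 'shots']
```

Only the masked setting sees “for shots,” which is the strongest hint that the missing word is “vet.”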

Masked language model examples

BERT is one of the pioneering masked language models and is well-respected for its bidirectional context understanding. It considers words on the right and left of the masked word, allowing it to capture contextual information more effectively. While BERT is a popular choice, several types of masked language models are available, many of which are derivatives of BERT, each with its own unique characteristics and applications.

RoBERTa

RoBERTa is an extension of BERT with improved training techniques. It uses dynamic masking, which generates a new masking pattern each time a sentence is fed to the model instead of fixing the masked positions once during data preparation.
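
Dynamic masking is usually described as re-sampling the masked positions every time a sentence is shown to the model, in contrast to static masking, where the positions are chosen once and reused. The sketch below contrasts the two under that reading. The `sample_mask` helper, the three-epoch loop, and the whitespace tokenization are assumptions for illustration only.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder string assumed for this example


def sample_mask(tokens, rng, mask_rate=0.15):
    """Return a copy of `tokens` with a fresh random subset masked."""
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK_TOKEN if i in positions else tok
            for i, tok in enumerate(tokens)]


tokens = ("the quick brown fox jumps over the lazy dog "
          "near the quiet river bank").split()
rng = random.Random(0)

# Static masking: choose the positions once and reuse them every epoch.
static_view = sample_mask(tokens, rng)
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: choose new positions each time the sentence is seen.
dynamic_epochs = [sample_mask(tokens, rng) for _ in range(3)]

for epoch, view in enumerate(dynamic_epochs):
    print(epoch, " ".join(view))
```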

SpanBERT

SpanBERT masks contiguous spans of words rather than individual, randomly chosen words. It is also trained to predict the content of each masked span from the tokens at the span’s boundaries.
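
A minimal sketch of span-style masking is shown below: instead of hiding scattered words, it hides one run of adjacent words. The `mask_span` helper is hypothetical, and the fixed span length is a simplification chosen for the example rather than SpanBERT's actual sampling scheme.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder string assumed for this example


def mask_span(tokens, span_len, rng):
    """Mask `span_len` adjacent tokens starting at a random position.

    Returns the masked token list and the start index of the span.
    """
    start = rng.randrange(0, len(tokens) - span_len + 1)
    masked = list(tokens)
    for i in range(start, start + span_len):
        masked[i] = MASK_TOKEN
    return masked, start


tokens = "masked language models learn from large amounts of unlabeled text".split()
masked, start = mask_span(tokens, span_len=3, rng=random.Random(0))
print(" ".join(masked))
print("hidden span:", tokens[start:start + 3])
```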

DistilBERT

DistilBERT is a smaller, cheaper version of BERT trained with knowledge distillation, a compression technique in which a small “student” model learns to mimic the outputs of the full model. It is faster and requires less storage than the complete model.
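
The idea of a small model mimicking a larger one is commonly implemented with a knowledge distillation loss, which rewards the small “student” model for matching the output probabilities of the large “teacher” model. The sketch below shows one such loss term in plain Python. The function names, the temperature value, and the example scores are assumptions for illustration, and a real training recipe would typically combine a term like this with others.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities, softened by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's probabilities against the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))


teacher = [2.0, 1.0, 0.1]        # hypothetical teacher scores for 3 candidate words
good_student = [1.9, 1.1, 0.2]   # close to the teacher
poor_student = [0.1, 1.0, 2.0]   # ranks the candidates in reverse

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, poor_student))  # larger than the line above
```

The loss is lowest when the student's probabilities match the teacher's, so minimizing it pushes the student toward the teacher's behavior.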

ELECTRA

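ELECTRA replaces some of the words in the input with plausible alternatives produced by a small generator model. The main model then predicts, for every word in the input, whether it is the original or a replacement, so it learns from all of the words rather than only the masked ones.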
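
ELECTRA's training signal, often called replaced token detection, can be sketched as a yes-or-no label for every word in the input. In the example below, the corrupted sentence is written by hand to stand in for the output of the small generator, and the `replaced_token_labels` helper is hypothetical.

```python
def replaced_token_labels(original, corrupted):
    """Label each position 1 if the token was replaced, else 0."""
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["the", "chef", "cooked", "the", "meal"]
# Hand-written stand-in for a generator that swapped "cooked" for "ate".
corrupted = ["the", "chef", "ate", "the", "meal"]

labels = replaced_token_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```

The main model sees only the corrupted sentence and is trained to output these labels, which gives it a learning signal at every position.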

XLNet

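XLNet is a related model that aims to address limitations in BERT. Strictly speaking, it is an autoregressive model rather than a masked one: instead of masking words, it uses a permutation-based training approach that predicts each word from the words that come before it in a randomly chosen ordering of the sentence. Across many orderings, the model learns to use context from both sides.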
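
One way to picture permutation-based training is to shuffle the order in which positions are predicted and let each prediction read only the positions that came earlier in that shuffled order. The sketch below shows this bookkeeping for a single random ordering. The `permutation_contexts` helper is hypothetical and leaves out everything else about how such a model is built.

```python
import random


def permutation_contexts(num_tokens, rng):
    """Pick a random prediction order and list each position's context.

    Returns the order and a dict mapping each position to the sorted
    positions that were predicted before it in that order.
    """
    order = list(range(num_tokens))
    rng.shuffle(order)
    contexts = {}
    for step, pos in enumerate(order):
        contexts[pos] = sorted(order[:step])
    return order, contexts


tokens = ["New", "York", "is", "a", "city"]
order, contexts = permutation_contexts(len(tokens), random.Random(0))

for pos in order:
    visible = [tokens[i] for i in contexts[pos]]
    print(f"predict {tokens[pos]!r} from {visible}")
```

Because the ordering is random, a word can end up being predicted from words on its right as well as its left.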

Masked language model limitations

Knowing the limitations of masked language models can help you determine whether this is the right type of algorithm for your needs. Despite their advantages, masked language models like BERT can require substantial amounts of training data and storage space. Additionally, a pre-trained model usually needs further task- or domain-specific training, known as fine-tuning, before it performs well on a particular task. Because of this, a fine-tuned model may not perform well beyond the scope of its training data.

Depending on your needs, looking for masked language models designed to address different limitations may better suit your goals. For example, DistilBERT is faster and requires less storage than BERT, whereas RoBERTa is trained on a larger data set and may yield more accurate predictions in certain contexts.

Industries using masked language models

Many different professional industries use NLP algorithms trained with masked language modeling. If you work as a professional in these industries, you may benefit from becoming familiar with this type of algorithm and technique. As mentioned previously, many professionals already use BERT as their preferred NLP model trained with masked language modeling. Applications continue to grow, but BERT’s use cases already include:

Predicting the loss amount based on warranty claims
Identifying articles to source information previously overlooked
Analyzing clinical notes and medical information more effectively
Predicting medical diagnoses based on combined text and images
Performing sentiment analysis on text to uncover the tone and intent of messages
Understanding user intent in search queries to produce more accurate results

Article sources

1. ResearchGate. “Should You Mask 15 Percent in Masked Language Modeling?” https://www.researchgate.net/publication/358658540_Should_You_Mask_15_in_Masked_Language_Modeling. Accessed January 15, 2026.