Synthetic Data Sets: Data Generation for Machine Learning

Written by Coursera Staff

Synthetic data makes data more accessible and provides the training materials you need to create machine learning algorithms. Explore why synthetic data sets are important and synthetic data use cases like medical research and autonomous vehicles.

[Feature Image] A learner works with synthetic data sets on their laptop as part of their coursework.

Key takeaways

A synthetic data set created by artificial intelligence (AI) or machine learning (ML) retains the original data's properties, but it isn't real. 

  • Synthetic data is a helpful resource when it would be difficult to create more data or when privacy and ethical concerns limit how many people can access a data set.

  • Synthetic data is typically generated by either statistical distribution or model-based methods.

  • With synthetic data, you can create training material that has the same identifying properties as real data. For example, you can train a fraud detection model without waiting for fraudulent activities to occur.

Explore why synthetic data sets are important and how you can use them in various applications and industries. You can also start learning with the IBM Machine Learning Professional Certificate. In as little as three months, you can master the most up-to-date practical skills and knowledge machine learning experts use in their daily roles. By the end, you’ll have a shareable certificate to add to your professional profile.

What are synthetic data sets, and why are they important? 

A synthetic data set is artificially created data that you can use in place of real data to train machine learning models, conduct scientific research, develop software, and more. Synthetic data can help you gain insight into the properties and underlying mechanisms of data in situations where creating an authentic data set would be challenging. For example, medical research trials rely heavily on sensitive patient data, which presents a possible privacy risk. 

Researchers could create a synthetic data set using the original, sensitive data and end up with one that many people can access and work with without putting personal information at risk. 

Synthetic data also creates equity around data by giving more people access to data sets. 

Companies and organizations restrict access to their data for many reasons, with privacy and the sheer value of the data being two of the biggest. Researchers can share synthetic data more easily, which gives many more people and organizations access to it. Kalyan Veeramachaneni, principal research scientist at MIT, compared the opportunities that synthetic data can create for students and individuals early in their careers to the advances in access to computing power and resources in the last 20 years. Veeramachaneni recalled the difficulty he had in graduate school accessing the computing power that he needed for his work, which today’s graduate students can easily access through cloud computing services. “If I hadn’t had access to data sets the way I had in the last 10 years, I wouldn’t have a career,” Veeramachaneni said [1]. Synthetic data can open these opportunities to more upcoming research scientists.

What are the types of synthetic data?

Synthetic data typically falls into one of three categories: fully synthetic, partially synthetic, or hybrid synthetic. Fully synthetic data doesn’t include any real-world information, while partially synthetic data uses real-world information as the foundation but replaces portions of it. Hybrid synthetic data, on the other hand, combines real data sets with fully synthetic ones.
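The difference between the three categories can be easier to see in a small example. The following is a minimal Python sketch, not a production approach: the records, field names, value ranges, and function names are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical "real" records; the values are illustrative, not from any actual data set.
REAL_RECORDS = [
    {"name": "Alice Example", "age": 34, "systolic_bp": 118},
    {"name": "Bob Example", "age": 52, "systolic_bp": 131},
    {"name": "Cara Example", "age": 61, "systolic_bp": 140},
]


def fully_synthetic(n, rng):
    """Fully synthetic: no real values are copied. The ranges are assumptions."""
    return [
        {
            "name": f"synthetic-{i}",
            "age": rng.randint(18, 90),
            "systolic_bp": rng.randint(95, 160),
        }
        for i in range(n)
    ]


def partially_synthetic(records):
    """Partially synthetic: keep real records as the foundation, replace the sensitive field."""
    return [{**record, "name": f"masked-{i}"} for i, record in enumerate(records)]


def hybrid_synthetic(records, n_synthetic, rng):
    """Hybrid: combine a real data set with fully synthetic records."""
    return list(records) + fully_synthetic(n_synthetic, rng)


rng = random.Random(42)
full = fully_synthetic(3, rng)
partial = partially_synthetic(REAL_RECORDS)
hybrid = hybrid_synthetic(REAL_RECORDS, 2, rng)
print(len(full), len(partial), len(hybrid))
```

In this sketch, the partially synthetic records keep the real ages and blood pressure readings but drop the real names, while the hybrid set simply appends generated records to the real ones.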

How to create synthetic data sets

You can generate synthetic data with traditional statistical analysis, or you can apply machine learning and deep learning to a real data set to create a valuable synthetic data set. 

  • Statistical distribution: Using this method, data scientists create statistical models using actual data, which they can then use as the basis for creating synthetic data without losing the important properties of the data. 

  • Model-based: Instead of analyzing the data manually with statistical methods, data scientists can deploy machine learning algorithms to complete this analysis. With deep learning, you can use a variety of models, such as generative adversarial networks (GANs), variational autoencoders (VAEs), and large language models, first to learn what characteristics define the data and then to generate synthetic data that has fidelity to the original data. 
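As a rough illustration of the statistical distribution method, the Python sketch below fits a simple two-variable normal distribution to a handful of records and then samples new records from it. This is one possible statistical model under stated assumptions, not the only or recommended one: the records are made-up values, the normal distribution is assumed to be a reasonable fit, and the function names are hypothetical.

```python
import math
import random
import statistics

# Hypothetical (age, systolic blood pressure) records; illustrative values only.
real_rows = [
    (34, 118), (45, 125), (52, 131), (29, 115),
    (61, 140), (38, 121), (47, 128), (55, 135),
]


def fit_bivariate_normal(rows):
    """Estimate means, standard deviations, and correlation from (x, y) rows."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, computed by hand to avoid relying on newer stdlib helpers.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    rho = max(-1.0, min(1.0, cov / (std_x * std_y)))
    return mean_x, mean_y, std_x, std_y, rho


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic (x, y) rows that preserve the fitted means, spreads, and correlation."""
    mean_x, mean_y, std_x, std_y, rho = params
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        # Mixing z1 into y reproduces the correlation between the two columns.
        y = mean_y + std_y * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        rows.append((x, y))
    return rows


params = fit_bivariate_normal(real_rows)
synthetic = sample_synthetic(params, n=5000)
print(f"fitted correlation: {params[4]:.2f}, synthetic rows: {len(synthetic)}")
```

None of the sampled rows is copied from the original records, yet the synthetic set keeps the averages and the relationship between the two columns. Model-based methods such as GANs and VAEs pursue the same goal but learn the structure of the data with a neural network instead of a hand-picked distribution.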

Synthetic data set use cases

You can use synthetic data for two primary purposes: to supplement real data when obtaining more is difficult or impossible, and to protect privacy in data sets with sensitive information. Explore different scenarios where you might use synthetic data instead of real data. 

Difficult or impossible to obtain real data

You can encounter many situations where collecting the amount of real data you need to accomplish your task would be difficult, impossible, or unethical. One example is crash data for autonomous vehicles. To train a model capable of controlling a vehicle, you need to give it data so it can learn the complex relationships between the objects it sees and how it should react as a result. You can improve these models by giving them data about crashes and accidents so they can learn why those accidents occur and correct their behavior to avoid them in the future. 

However, scientists are limited by how much data they can collect through accidents in the real world. Using synthetic data, researchers can give the model training materials that have the underlying patterns and principles of real crash data without requiring actual humans to crash their cars. 

Similarly, you can apply these concepts to software testing, where you might want more data about security breaches or potentially fraudulent transactions so you can train a model to mitigate these events. Synthetic data allows you to create the needed data without risking your development project. 
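To make the fraud example concrete, here is a minimal Python sketch that generates labeled synthetic transactions with a chosen share of fraudulent ones. Everything in it is an assumption made for illustration: the function name, the fields, and especially the amount distributions, which in practice you would estimate from real transaction data.

```python
import random


def make_transactions(n, fraud_rate, seed=0):
    """Generate labeled synthetic transactions with a chosen share of fraud.

    The amount distributions are assumptions for illustration only:
    legitimate amounts cluster near 50, fraudulent ones near 900.
    """
    rng = random.Random(seed)
    n_fraud = round(n * fraud_rate)
    labels = [True] * n_fraud + [False] * (n - n_fraud)
    rng.shuffle(labels)
    transactions = []
    for i, is_fraud in enumerate(labels):
        mean, spread = (900.0, 250.0) if is_fraud else (50.0, 20.0)
        # Clamp at one cent so no transaction has a zero or negative amount.
        amount = max(0.01, rng.gauss(mean, spread))
        transactions.append({"id": i, "amount": round(amount, 2), "is_fraud": is_fraud})
    return transactions


transactions = make_transactions(n=1000, fraud_rate=0.2, seed=7)
fraud_count = sum(t["is_fraud"] for t in transactions)
print(f"{fraud_count} of {len(transactions)} synthetic transactions are labeled fraud")
```

Because you choose the fraud rate, you can produce as many fraudulent examples as a model needs for training instead of waiting for them to occur.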

You can use synthetic data to train machine learning and AI models in many different situations beyond computer vision and software testing. In addition to giving you access to data you couldn’t obtain before, synthetic data is something you can control, so you can generate specific types of additional data. Returning to the example of autonomous vehicles, you could use synthetic data to create more images in low lighting or darkness to help train the model for these scenarios. 
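The idea of controlling what you generate can be sketched in a few lines of Python. The example below derives a darker, noisier variant of a small grayscale image patch. It is a deliberately simple stand-in, since real autonomous vehicle pipelines may rely on simulators or generative models instead; the `darken` function, its parameters, and the pixel values are all hypothetical.

```python
import random


def darken(image, brightness=0.3, noise_std=5.0, seed=0):
    """Return a darker, noisier copy of a grayscale image (rows of 0-255 ints)."""
    rng = random.Random(seed)
    result = []
    for row in image:
        new_row = []
        for pixel in row:
            # Scale the brightness down, then add sensor-like Gaussian noise.
            value = pixel * brightness + rng.gauss(0, noise_std)
            new_row.append(min(255, max(0, round(value))))
        result.append(new_row)
    return result


# Hypothetical 3x4 daytime image patch; values are illustrative.
day_patch = [
    [200, 210, 220, 230],
    [180, 190, 200, 210],
    [160, 170, 180, 190],
]
night_patch = darken(day_patch, brightness=0.25)
for row in night_patch:
    print(row)
```

Changing `brightness` and `noise_std` lets you produce as many low-light variants as you want from a single daytime image.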


Sensitive data with privacy or security concerns

The second main reason you may use synthetic data is to address privacy or security concerns inherent in a data set. For example, scientists and researchers often need sensitive health care or medical research data. Researchers can gain much insight by analyzing patient records, studying how patients react to medications during clinical trials, or examining medical imagery. A synthetic version of that data lets them do this work without putting personal information at risk. 

Another example of using synthetic data in place of sensitive data is the Global Synthetic Dataset, a collaboration between the Counter-Trafficking Data Collaborative and Microsoft Research. Researchers and organizations can use this synthetic data set to study global trafficking patterns and develop evidence-based practices to fight human trafficking. By understanding the patterns within this data set, community-based organizations can gain insight into how they might best approach this problem and work to prevent it in their community without risking the private and sensitive information of victims of human trafficking. 

Both difficult and sensitive

You can also use synthetic data for both reasons, such as by using it to train a machine learning algorithm to identify medical images that contain potentially cancerous tumors. In this case, you would need a lot of potentially sensitive data to train the algorithm. Synthetic data solves the problem of creating enough data to effectively train your model without risking real patient information.


Article sources

  1. MIT Sloan. “What Is Synthetic Data and How Can it Help You Competitively?” https://mitsloan.mit.edu/ideas-made-to-matter/what-synthetic-data-and-how-can-it-help-you-competitively. Accessed April 28, 2026.

This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.