Learn what exploratory data analysis is, how you might see it applied across industries, and which tools to consider when conducting your own analysis.
![[Featured Image] Three data analysts look at a graph on a computer screen created using exploratory analysis.](https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/https://images.ctfassets.net/wp1lcwdav1p1/2V7nx9orwON9EniKjhB3bI/b7ad6d9bd023e480b7fdd088c2095c22/GettyImages-1213042830.jpg?w=1500&h=680&q=60&fit=fill&f=faces&fm=jpg&fl=progressive&auto=format%2Ccompress&dpr=1&w=1000)
Exploratory data analysis (EDA) is an essential first step in the data analysis process used to understand the structure of a dataset, identify patterns, and detect anomalies.
EDA is an opportunity to review your data without the pressure of confirming specific hypotheses, helping you find errors and refine your analytical questions.
There are three primary types of EDA: univariate (analyzing one variable), bivariate (comparing two variables), and multivariate (investigating relationships between three or more variables).
Professionals across industries—including finance, healthcare, and public health—use EDA to inform decision-making, identify demand drivers, and plan interventions.
Read on to learn more about the specific techniques and graphical approaches used to conduct effective exploratory analysis.
Exploratory data analysis (EDA) is typically the first step you would take to “get a feel” for your data before fully diving in. EDA methods allow you to view your data with an open mind and investigate the information present without focusing on confirming specific statistical hypotheses. By removing prior assumptions, you can find errors more clearly, confirm you’re asking relevant questions, and align your analysis methods to produce the most accurate results.
Typically, EDA involves using a mix of visualizations and summary statistics to represent the data as a whole. This allows you to clean your data more effectively, split it into natural groups for subgroup analysis, and begin to understand its story. Sometimes, you may naturally find underlying answers to problems you are trying to solve.
When you conduct EDA, your goal is to learn as much as you can about your data. This includes the variables in your data, how their values are distributed, and the relationship between your variables. By assessing the shape of your data and the information included, you can spot outliers or unusual values, detect patterns or trends, and gain insight into what type of further analysis may be appropriate.
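One common way to spot unusual values is the interquartile range (IQR) rule of thumb. The sketch below is only an illustration: the `values` list is made-up data, and the 1.5 multiplier is a convention rather than a fixed requirement. It uses Python's standard `statistics` module; in practice you might get similar summaries from a data frame library.

```python
import statistics

# Hypothetical sample of daily order totals; 480 is a deliberately unusual value.
values = [52, 48, 55, 61, 47, 50, 58, 53, 49, 480, 56, 51]

# With n=4, statistics.quantiles returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Rule of thumb: flag anything more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"Values outside the fences: {outliers}")
```

A flagged value is a prompt to investigate, not proof of an error. It may be a data entry mistake or a genuine but rare observation.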
Going further, depending on your intended outcome, you may have more nuanced goals for EDA. Use EDA to identify the most critical variables in your data set, check assumptions for hypothesis testing, or identify the minimum number of variables to explain your data. If you’re using machine learning methods, EDA is an important step toward providing the context you need to create an accurate model.
The three main types of EDA you can use are univariate, bivariate, and multivariate. A good rule of thumb is to start with univariate analysis and work your way to more complex or layered methods. In each case, you can choose graphical or nongraphical approaches, depending on whether you prefer visual or numerical descriptors for your data.
As you might guess by the name, univariate involves looking at a single variable. In this case, you would look at the distribution of one variable, identify outliers, and generally understand the patterns in the values of that one variable without looking at how it relates to any other variables.
Imagine you’re analyzing an “age” variable in your data set. You might examine mean, median, mode, standard deviation, and other metrics to gain an initial understanding. These insights help you decide which descriptors to use and which types of further analysis to conduct. For instance, if 95 percent of your participants are between 60 and 80 years old, your research questions and conclusions will likely differ from a scenario where 95 percent are between 10 and 40 years old. Going further, if 95 percent of your participants are between 60 and 80 years old, and the other 5 percent are 20 years of age, you might need to reconsider how you represent the data. In this case, a single average won’t reflect the wide age range, prompting you to find alternative descriptive statistics to capture the story of your data more clearly.
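To make the age scenario concrete, here is a minimal sketch using Python's standard `statistics` module. The data is invented to match the description (95 participants aged 60 to 80 and 5 participants aged 20), and the variable names are illustrative only.

```python
import statistics

# Hypothetical sample: 95 participants aged 60 to 80 and 5 participants aged 20.
older = [60 + (i % 21) for i in range(95)]
younger = [20] * 5
ages = older + younger

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)

# Compare the spread with and without the small younger group.
spread_all = statistics.stdev(ages)
spread_older = statistics.stdev(older)

# Share of participants under 40, which a single average hides.
share_under_40 = sum(a < 40 for a in ages) / len(ages)

print(f"mean={mean_age:.2f}, median={median_age}")
print(f"stdev (all)={spread_all:.2f}, stdev (60 to 80 only)={spread_older:.2f}")
print(f"min={min(ages)}, max={max(ages)}, share under 40={share_under_40:.0%}")
```

The small younger group pulls the mean below the median and inflates the standard deviation. That is one signal that reporting the two groups separately, or using the median and range, may describe the data better than a single average.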
In bivariate EDA, you go one step further and examine how two variables relate, so you can see how changes in one variable track changes in another. For numeric variables, you might calculate a correlation coefficient, such as Pearson's r, to gauge how strongly the two are related.
Continuing the previous example, imagine you're now looking at the relationship between age and income. You might use a scatter plot to visualize it. If income tends to rise as age increases, that points to a positive correlation between the variables. You might also spot outliers or find age ranges where the trend is more or less consistent.
Multivariate EDA extends bivariate EDA to three or more variables. This helps you examine complex interrelationships and identify interactions between variables, which can give you a clearer picture of what is happening in a multifaceted data set. Here, you might use more advanced techniques like clustering, factor analysis, and regression to create and adjust your models.
With this analysis, you can now look at age, income, and education. You might notice that income increases with age more consistently for participants with a higher education level, while it levels off earlier for those with fewer qualifications. This would suggest an interaction between age and education. By looking at multiple variables and their relationships, you can reveal deeper insights and create more informed research hypotheses.
In any professional sphere, having the right tools for EDA can help you uncover trends, spot outliers, and gain insights into your data. Python and R are widely used for EDA thanks to their approachable syntax, extensive package ecosystems, and flexibility in data manipulation and visualization. Consider the following packages to help you automate your EDA process. The first four are Python packages, and the last four are R packages.
DataPrep: Helps you clean, validate, and explore your data in one go
pandas_profiling (now published as ydata-profiling): Generates a report with descriptive statistics, text analysis, and more
SweetViz: Reports on data features, target analysis, and even compares data sets
AutoViML / AutoViz: Provides an in-depth set of visualizations appropriate for any size data set
skimr: Provides a concise text summary of your data
corrplot: Creates visualizations showing how your variables relate
SmartEDA: Generates an EDA report with built-in analytics
janitor: Cleans data and creates frequency tables to show distributions
Any profession that works with data can benefit from exploratory data analysis, which helps practitioners understand their data and plan later steps more accurately. Some ways you might see EDA in action include:
In business, you might use EDA to understand customer demographics, identify demand drivers, develop accurate sales forecast models, and gauge customer satisfaction with a particular product or service.
In banking and finance, EDA can inform pricing strategies, portfolio management, product development, resource allocation, and investment decisions. It can even identify patterns in customer behavior for marketing strategies and campaign development.
In health care, EDA is particularly effective for assessing electronic health records, understanding patient demographics, planning interventions, and uncovering seasonal patterns in disease incidence.
In public health, EDA can help you understand the spread of infectious diseases and the prevalence and presentation of different conditions globally. It can also help you identify important variables for public health intervention design.
Editorial Team
Coursera’s editorial team is comprised of highly experienced professional editors, writers, and fact...