This course introduces the necessary concepts and common techniques for analyzing data. The primary emphasis is on the process of data analysis, including data preparation, descriptive analytics, model training, and result interpretation. The process starts with removing distractions and anomalies, followed by discovering insights, formulating propositions, validating evidence, and finally building professional-grade solutions. Following the process properly, regularly, and transparently brings credibility and increases the impact of the results.

Data Preparation and Analysis

This course is part of multiple programs.


Instructors: Ming-Long Lam
2,650 already enrolled
Recommended experience
Intermediate level
Working knowledge of the Python programming language
What you'll learn
1. Apply appropriate techniques for generating insights from data.
2. Present actionable solutions with confidence to the business stakeholders.
Details to know

32 assignments

There are 9 modules in this course
Welcome to Data Preparation and Analysis! Module 1 guides students through crafting informative and visually appealing histograms, a fundamental aspect of data visualization. Students will learn techniques for measuring the location and scale of data and will examine the origins and impacts of noise and missing values in datasets. This module also introduces the CRISP-DM Process, a structured approach to data mining, along with Gartner's Analytics Ascendancy Model for advanced data analysis. Additionally, students will explore the distinction between raw data and processed information, a key concept for effective data interpretation and decision-making.
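As a rough illustration of the location, scale, and histogram ideas named above, the sketch below uses only the Python standard library. The sample values, the `describe` and `histogram_counts` helpers, and the bin width are assumptions made for this example and are not taken from the course materials.

```python
import statistics

def describe(values):
    """Return common location and scale measures for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),      # location, sensitive to outliers
        "median": statistics.median(values),  # location, robust to outliers
        "stdev": statistics.stdev(values),    # scale (sample standard deviation)
        "iqr": q3 - q1,                       # scale (interquartile range)
    }

def histogram_counts(values, bin_width, start):
    """Count values in equal-width bins [start + k*w, start + (k+1)*w)."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)   # index of the bin holding v
        left_edge = start + k * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical sample with one unusually large value (40).
sample = [2, 3, 3, 5, 7, 8, 9, 12, 13, 40]
summary = describe(sample)
bins = histogram_counts(sample, bin_width=10, start=0)

print(summary)
print(bins)
```

In this sample the single large value pulls the mean above the median, which is one reason to report more than one measure of location.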
What's included
10 videos, 7 readings, 4 assignments, 1 discussion prompt, 1 ungraded lab
10 videos•Total 54 minutes
- Course Overview•1 minute
- Instructor Introduction•1 minute
- Module 1 Introduction•1 minute
- Why Do We Analyze Data•6 minutes
- The Process of Data Analysis - Part 1•7 minutes
- The Process of Data Analysis - Part 2•6 minutes
- The First Step of Knowing Your Data - Part 1•8 minutes
- The First Step of Knowing Your Data - Part 2•5 minutes
- The First Step of Knowing Your Data - Part 3•9 minutes
- The First Step of Knowing Your Data - Part 4•10 minutes
7 readings•Total 290 minutes
- Syllabus•10 minutes
- Data Files•60 minutes
- Module 1 Introduction•30 minutes
- Big Data and IEEE 754•60 minutes
- CRISP-DM2•60 minutes
- Selecting the Bin Size of a Time Histogram•60 minutes
- Module 1 Summary•10 minutes
4 assignments•Total 225 minutes
- Why Do We Analyze Data Quiz•15 minutes
- The Process of Data Analysis Quiz•15 minutes
- Knowing Your Data Quiz•15 minutes
- Module 1 Summative Assessment•180 minutes
1 discussion prompt•Total 60 minutes
- Meet and Greet Discussion•60 minutes
1 ungraded lab•Total 60 minutes
- Module 1 Python Lab - VS Code•60 minutes
Module 2 delves into the intricacies of statistical analysis, beginning with a thorough understanding of the p-value concept and its significance as a Type I Error indicator. Students will learn to apply statistical tests in Python to identify significantly correlated features, exploring various correlation metrics tailored for categorical, mixed-type, and continuous features. This module emphasizes practical application, equipping students with the skills to calculate and interpret these metrics using Python, thereby enhancing their ability to conduct sophisticated data analysis and draw meaningful conclusions from complex datasets.
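To make the link between a correlation metric and a p-value concrete, here is a minimal sketch for two continuous features. It computes the Pearson correlation by hand and estimates a two-sided p-value with a permutation test. The permutation test is a stand-in chosen for this example because it needs no external libraries; the course may use different statistical tests. The data and the function names are hypothetical.

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Estimate how often shuffled data shows a correlation at least as strong.

    Shuffling y breaks any real association, so the returned fraction
    approximates the chance of a false alarm (Type I error) under the
    null hypothesis of no association.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Hypothetical, nearly linear data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

r = pearson_r(x, y)
p = permutation_p_value(x, y)
print(f"r = {r:.4f}, permutation p-value = {p:.4f}")
```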
What's included
7 videos, 5 readings, 4 assignments, 1 ungraded lab
7 videos•Total 54 minutes
- Module 2 Introduction•2 minutes
- Discover and Measure Associations - Part 1•10 minutes
- Discover and Measure Associations - Part 2•10 minutes
- Measure Associations - Part 1•8 minutes
- Measure Associations - Part 1 (Continued)•7 minutes
- Measure Associations - Part 2•9 minutes
- Measure Associations - Part 2 (Continued)•9 minutes
5 readings•Total 250 minutes
- Module 2 Introduction•60 minutes
- Chicago Taxi Trip Data•60 minutes
- Correlation with Python•60 minutes
- Eta-squared•60 minutes
- Module 2 Summary•10 minutes
4 assignments•Total 225 minutes
- Correlation of Continuous Features Quiz•15 minutes
- Correlation of Mixed Types Features•15 minutes
- Means to an End for Feature Screening Quiz•15 minutes
- Module 2 Summative Assessment•180 minutes
1 ungraded lab•Total 60 minutes
- Module 2 Python Lab - VS Code•60 minutes
Module 3 offers a deep dive into Association Rules, teaching students how to use these rules to identify valuable feature combinations that generate specific label values. Learners will practice setting appropriate thresholds for Support and Confidence and gain a thorough understanding of the Apriori Algorithm and the role of Frequent Itemsets within it. This module covers the calculation of common metrics for Association Rules and familiarizes students with the relevant terminology. Additionally, learners will explore the practical application of Association Rules in Market Basket Analysis, including strategies for cross-selling, up-selling, and product bundling, which support data-driven decision making in business contexts.
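The Support and Confidence metrics mentioned above, together with Lift, can be sketched in a few lines of plain Python. The baskets below are made up for illustration, and `support` and `rule_metrics` are hypothetical helper names. This is not an implementation of the Apriori Algorithm, only the metric calculations for a single rule.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, and lift of the rule antecedent -> consequent.

    Assumes the antecedent and consequent each occur at least once;
    otherwise the divisions below would fail.
    """
    s_both = support(set(antecedent) | set(consequent), transactions)
    s_ante = support(antecedent, transactions)
    s_cons = support(consequent, transactions)
    confidence = s_both / s_ante
    return {
        "support": s_both,
        "confidence": confidence,
        "lift": confidence / s_cons,  # > 1 suggests a positive association
    }

# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

metrics = rule_metrics({"diapers"}, {"beer"}, transactions)
print(metrics)
```

A rule would typically be kept only if its support and confidence both exceed thresholds chosen by the analyst.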
What's included
7 videos, 5 readings, 3 assignments, 1 ungraded lab
7 videos•Total 46 minutes
- Module 3 Introduction•1 minute
- What is in Your Basket - Part 1•7 minutes
- What is in Your Basket - Part 2•6 minutes
- How Are Association Rules Discovered - Part 1•9 minutes
- How Are Association Rules Discovered - Part 2•8 minutes
- What Can Association Rules Tell Me - Part 1•8 minutes
- What Can Association Rules Tell Me - Part 2•6 minutes
5 readings•Total 200 minutes
- PGML Chapter 3•60 minutes
- Cross-Selling•60 minutes
- Apriori Algorithm and Association Rules•60 minutes
- Module 3 Summary•10 minutes
- Insights from an Industry Leader: Learn More About Our Program•10 minutes
3 assignments•Total 210 minutes
- Market Basket Analysis Quiz•15 minutes
- Association Rules Discovery Quiz•15 minutes
- Module 3 Summative Assessment•180 minutes
1 ungraded lab•Total 60 minutes
- Module 3 Python Lab - VS Code•60 minutes
In Module 4, students will learn how to describe and interpret profiles of clusters, gaining proficiency in deploying the K-Means and K-Modes clustering algorithms. They will explore the application of Recency, Frequency, and Monetary (RFM) Analysis to identify the most valuable customers in retail business settings. The module also covers the technique of Simple Random Sampling with the option of incorporating stratification variables, enhancing the precision of data analysis. Furthermore, it emphasizes the importance of objectively validating models using a testing partition, ensuring the reliability and effectiveness of the analytical models in real-world scenarios.
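One way to picture Simple Random Sampling with a stratification variable is the partition sketch below. It shuffles observations within each stratum and sends a fixed fraction of each stratum to the testing partition. The function name, the 30% test fraction, the seed, and the toy rows are assumptions for this example.

```python
import random
from collections import defaultdict

def stratified_split(rows, strata_key, test_fraction=0.3, seed=42):
    """Split rows into (train, test), sampling separately within each stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[strata_key(row)].append(row)

    train, test = [], []
    for members in groups.values():
        members = members[:]              # copy so the input is not reordered
        rng.shuffle(members)              # simple random sampling within stratum
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Hypothetical observations: (label, id), with an imbalanced label.
rows = [("A", i) for i in range(10)] + [("B", i) for i in range(20)]
train, test = stratified_split(rows, strata_key=lambda row: row[0])

print(len(train), len(test))
print(sum(1 for row in test if row[0] == "A"), "label-A rows in the test partition")
```

Because each stratum contributes the same fraction, the label mix in the testing partition mirrors the mix in the full dataset.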
What's included
8 videos, 5 readings, 4 assignments, 1 ungraded lab
8 videos•Total 70 minutes
- Module 4 Introduction•1 minute
- Partition Observations for Training Models - Part 1•10 minutes
- Partition Observations for Training Models - Part 2•12 minutes
- Create Segments of Observations for Business Reasons - Part 1•10 minutes
- Create Segments of Observations for Business Reasons - Part 2•10 minutes
- Put Observations with Similar Feature Values in Clusters - Part 1•10 minutes
- Put Observations with Similar Feature Values in Clusters - Part 2•11 minutes
- Put Observations with Similar Feature Values in Clusters - Part 3•8 minutes
5 readings•Total 220 minutes
- PGML Chapter 4 •30 minutes
- Sampling Techniques•60 minutes
- RFM•60 minutes
- Clustering•60 minutes
- Module 4 Summary•10 minutes
4 assignments•Total 225 minutes
- Partition Observations for Training Models Quiz•15 minutes
- Segments of Observations Quiz•15 minutes
- Clustering Quiz•15 minutes
- Module 4 Summative Assessment•180 minutes
1 ungraded lab•Total 60 minutes
- Module 4 Python Lab - VS Code•60 minutes
Module 5 covers feature importance analysis in machine learning, including Shapley Values, feature selection methods, statistical evaluation, feature interaction, aliasing, and the Least Squares Algorithm. Students will learn these concepts in order to build robust and interpretable models.
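For the least squares idea named above, a minimal sketch of the one-feature case follows. It uses the closed-form solution for a single predictor. The data and the function name are hypothetical, and models with several features would normally rely on a linear algebra library instead.

```python
def fit_simple_least_squares(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data that roughly follows y = 2x + 1.
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_least_squares(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print("sum of squared residuals:", sum(e * e for e in residuals))
```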
What's included
8 videos, 5 readings, 4 assignments, 1 ungraded lab
8 videos•Total 53 minutes
- Module 5 Introduction•1 minute
- Linear Regression Model - Part 1•10 minutes
- Linear Regression Model - Part 2•5 minutes
- Forward Selection - Part 1•8 minutes
- Forward Selection - Part 2•4 minutes
- Feature Importance - Part 1•9 minutes
- Feature Importance - Part 2•8 minutes
- Feature Importance - Part 3•7 minutes
5 readings•Total 250 minutes
- Linear Regression Analysis •60 minutes
- Least Squares Regression •60 minutes
- Forward and Backward Stepwise Regression•60 minutes
- Shapley Values•60 minutes
- Module 5 Summary•10 minutes
4 assignments•Total 225 minutes
- Linear Regression Model Quiz•15 minutes
- Feature Selection Quiz•15 minutes
- Feature Importance Quiz•15 minutes
- Module 5 Summative Assessment•180 minutes
1 ungraded lab•Total 60 minutes
- Module 5 Python Lab - VS Code•60 minutes
In Module 6, students will master the art of feature selection in machine learning by exploring the Forward and Backward Selection Method, the All-Possible Subsets Method, and the concept of complete and quasi-complete separation. Students will also discover association rules for identifying separations, interpret model parameters and predicted probabilities, and delve into the concepts of maximum likelihood estimation, odds, and odds ratios.
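The relationship between predicted probabilities, odds, and odds ratios can be checked numerically. The sketch below assumes a logistic regression with one feature and made-up parameter values (`intercept = -1.0`, `coef = 0.7`); nothing here is fitted from data.

```python
import math

def predicted_probability(intercept, coef, x):
    """Logistic model: P(event) = 1 / (1 + exp(-(intercept + coef * x)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * x)))

def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

# Hypothetical parameter values, chosen only for illustration.
intercept, coef = -1.0, 0.7

p_at_2 = predicted_probability(intercept, coef, 2)
p_at_3 = predicted_probability(intercept, coef, 3)

# Raising x by one unit multiplies the odds by exp(coef).
odds_ratio = odds(p_at_3) / odds(p_at_2)

print(f"P(x=2) = {p_at_2:.4f}, P(x=3) = {p_at_3:.4f}")
print(f"odds ratio = {odds_ratio:.4f}, exp(coef) = {math.exp(coef):.4f}")
```

This is why a logistic regression coefficient is often reported as an odds ratio: the multiplicative effect on the odds is the same at every value of x, even though the change in probability is not.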
What's included
6 videos, 5 readings, 4 assignments, 1 ungraded lab
6 videos•Total 34 minutes
- Module 6 Introduction•1 minute
- Logistic Regression - Part 1•6 minutes
- Logistic Regression - Part 2•7 minutes
- Forward Selection•9 minutes
- Interpret Model and Assess Performance - Part 1•8 minutes
- Interpret Model and Assess Performance - Part 2•4 minutes
5 readings•Total 220 minutes
- PGML Chapter 6•30 minutes
- Predictive Analytics•60 minutes
- Forward Selection•60 minutes
- Best R-squared for Logistic Regression•60 minutes
- Module 6 Summary•10 minutes
4 assignments•Total 225 minutes
- Logistic Regression Quiz•15 minutes
- Forward Selection Quiz•15 minutes
- Blessing and the Curse of Too Many Predictors Quiz•15 minutes
- Module 6 Summative Assessment•180 minutes
1 ungraded lab•Total 60 minutes
- Module 6 Python Lab - VS Code•60 minutes
Module 7 will equip students with the ability to use tree-based models to uncover hidden patterns in their data. Students will be able to describe clusters effectively, set algorithm parameters sensibly, construct business rules from tree results, and use variance metrics, entropy values, and Gini indices for optimal tree construction.
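The entropy and Gini measures mentioned above can be computed directly from the class labels in a node. The sketch below also shows the weighted impurity of a candidate split, which is the quantity a tree-growing algorithm tries to reduce. The labels and helper names are hypothetical.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion of each distinct label in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini_index(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: -sum of p * log2(p)."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def weighted_impurity(left, right, impurity):
    """Impurity after a split, weighting each child by its share of rows."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

# Hypothetical parent node and one candidate split.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print("parent Gini:", gini_index(parent))
print("parent entropy:", entropy(parent))
print("Gini after split:", weighted_impurity(left, right, gini_index))
```

A split is worth making when the weighted impurity of the children is lower than the impurity of the parent.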
What's included
7 videos, 5 readings, 4 assignments, 1 ungraded lab
7 videos•Total 37 minutes
- Module 7 Introduction•1 minute
- Motivation of Decision Trees - Part 1•6 minutes
- Motivation of Decision Trees - Part 2•5 minutes
- The CART Algorithm - Part 1•3 minutes
- The CART Algorithm - Part 2•9 minutes
- Cluster Profiling - Part 1•4 minutes
- Cluster Profiling - Part 2•7 minutes
5 readings•Total 220 minutes
- PGML Chapter 5•30 minutes
- CART•60 minutes
- CART as an Equation•60 minutes
- Decision Trees for Clustering•60 minutes
- Module 7 Summary•10 minutes
4 assignments•Total 225 minutes
- Motivation of Decision Trees Quiz•15 minutes
- The CART Algorithm Quiz•15 minutes
- Cluster Profiling Quiz•15 minutes
- Module 7 Summative Assessment•180 minutes
1 ungraded lab•Total 60 minutes
- Module 7 Python Lab - VS Code•60 minutes
Module 8 covers evaluation metrics for machine learning models. Students will study precision-recall curves, lift curves, and receiver operating characteristic (ROC) curves. Students will also learn methods for choosing probability thresholds using Kolmogorov-Smirnov statistics and F1 scores. They will explore metrics such as misclassification rate, area under the curve (AUC), and root mean squared error (RMSE), along with techniques for detecting severely misfitted observations using model-specific residuals.
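Several of the metrics listed above reduce to simple counts once a probability threshold is fixed. The sketch below computes precision, recall, the F1 score, and the misclassification rate for a binary classifier, plus RMSE for a prediction model. The labels, predicted probabilities, and the 0.5 threshold are made up for illustration.

```python
import math

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Precision, recall, F1, and misclassification rate at a threshold.

    Assumes at least one predicted positive and one actual positive,
    so that precision and recall are defined.
    """
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "misclassification_rate": (fp + fn) / len(y_true),
    }

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - b) ** 2 for a, b in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical labels and predicted event probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]

scores = classification_metrics(y_true, y_prob)
error = rmse([3, 5, 7], [2, 5, 9])

print(scores)
print("RMSE:", error)
```

Repeating the classification calculation over a grid of thresholds and keeping the one with the highest F1 score is one simple way to choose a threshold.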
What's included
8 videos, 5 readings, 4 assignments, 1 ungraded lab
8 videos•Total 43 minutes
- Module 8 Introduction•1 minute
- Prediction Models•8 minutes
- Nominal Classification Models•6 minutes
- Binary Classification Models - Part 1•4 minutes
- Binary Classification Models - Part 2•6 minutes
- Binary Classification Models - Part 3•5 minutes
- Binary Classification Models - Part 4•6 minutes
- Binary Classification Models - Part 5•7 minutes
5 readings•Total 235 minutes
- PGML Chapter 7, 8 •45 minutes
- Outliers•60 minutes
- ROC Curve•60 minutes
- Using Life Analysis•60 minutes
- Module 8 Summary•10 minutes
4 assignments•Total 225 minutes
- Metrics for Prediction Models Quiz•15 minutes
- Metrics for Classification Models Quiz•15 minutes
- Charts for Classification Models Quiz•15 minutes
- Module 8 Summative Assessment•180 minutes
1 ungraded lab•Total 60 minutes
- Module 8 Python Lab - VS Code•60 minutes
This module contains the summative course assessment that has been designed to evaluate your understanding of the course material and assess your ability to apply the knowledge you have acquired throughout the course. Be sure to review the course material thoroughly before taking the assessment.
What's included
1 assignment
1 assignment•Total 180 minutes
- Summative Course Assessment•180 minutes
Earn a career certificate
Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.
Build toward a degree
This course is part of the following degree program(s) offered by Illinois Tech. If you are admitted and enroll, your completed coursework may count toward your degree learning and your progress can transfer with you.¹
Illinois Tech
Master of Data Science
Degree · 12-15 months
¹Successful application and enrollment are required. Eligibility requirements apply. Each institution determines the number of credits recognized by completing this content that may count towards degree requirements, considering any existing credits you may have. Click on a specific course for more information.
Offered by

Illinois Tech is a top-tier, nationally ranked, private research university with programs in engineering, computer science, architecture, design, science, business, human sciences, and law. The university offers bachelor of science, master of science, professional master’s, and Ph.D. degrees—as well as certificates for in-demand STEM fields and other areas of innovation. Talented students from around the world choose to study at Illinois Tech because of the access to real-world opportunities, renowned academic programs, high value, and career prospects of graduates.