Spark, Hadoop, and Snowflake for Data Engineering

This course is part of Applied Python Data Engineering Specialization

Instructors: Noah Gift

14,732 already enrolled

4 modules

Gain insight into a topic and learn the fundamentals.

71 reviews

Advanced level

Recommended experience

3 weeks to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Create scalable data pipelines (Hadoop, Spark, Snowflake, Databricks) for efficient data handling.
Optimize data engineering with clustering and scaling to boost performance and resource use.
Build ML solutions (PySpark, MLflow) on Databricks for seamless model development and deployment.
Implement DataOps and DevOps practices for continuous integration and deployment (CI/CD) of data-driven applications, including automating processes.

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

21 assignments

Taught in English

Build your subject-matter expertise

This course is part of the Applied Python Data Engineering Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 4 modules in this course

Gain the skills for building efficient and scalable data pipelines. Explore essential data engineering platforms (Hadoop, Spark, and Snowflake) and learn how to optimize and manage them. Delve into Databricks, a powerful platform for executing data analytics and machine learning tasks, while honing your Python data science skills with PySpark. Finally, discover the key concepts of MLflow, an open-source platform for managing the end-to-end machine learning lifecycle, and learn how to integrate it with Databricks.

This course is designed for learners who want to pursue or advance their career in data science or data engineering, or for software developers or engineers who want to grow their data management skill set. In addition to the technologies you will learn, you will also gain methodologies to help you hone your project management and workflow skills for data engineering, including applying Kaizen, DevOps, and DataOps methodologies and best practices. With quizzes to test your knowledge throughout, this comprehensive course will help guide your learning journey to become a proficient data engineer, ready to tackle the challenges of today's data-driven world.

In this module, you will learn how to work with different data engineering platforms, such as Hadoop and Spark, and apply their concepts to real-world scenarios. First, you will explore the fundamentals of Hadoop to store and process big data. Next, you will delve into Spark concepts, distributed computing, deferred execution, and Spark SQL. By the end of the week, you will gain hands-on experience with PySpark DataFrames, DataFrame methods, and deferred execution strategies.

What's included

10 videos, 10 readings, 7 assignments, 1 discussion prompt, 2 ungraded labs

10 videos (total 25 minutes)

Meet your Co-Instructor: Kennedy Behrman (1 minute)
Meet your Co-Instructor: Noah Gift (1 minute)
Overview of Big Data Platforms (2 minutes)
Getting Started with Hadoop (1 minute)
Getting Started with Spark (2 minutes)
Introduction to Resilient Distributed Datasets (RDD) (2 minutes)
Resilient Distributed Datasets (RDD) Demo (4 minutes)
Introduction to Spark SQL (2 minutes)
PySpark Dataframe Demo: Part 1 (3 minutes)
PySpark Dataframe Demo: Part 2 (7 minutes)

10 readings (total 100 minutes)

Welcome to Data Engineering Platforms with Python! (10 minutes)
Report a problem with the course (10 minutes)
What is Apache Hadoop? (10 minutes)
What is Apache Spark? (10 minutes)
Use Apache Spark in Azure Databricks (optional) (10 minutes)
Choosing between Hadoop and Spark (10 minutes)
What are RDDs? (10 minutes)
Getting Started: Creating RDD's with PySpark (10 minutes)
Spark SQL, Dataframes and Datasets (10 minutes)
PySpark and Spark SQL (10 minutes)

7 assignments (total 210 minutes)

PySpark (30 minutes)
Big Data Platforms (30 minutes)
Apache Hadoop Concepts (30 minutes)
Apache Spark Concepts (30 minutes)
RDD Concepts (30 minutes)
Spark SQL Concepts (30 minutes)
PySpark Dataframe Concepts (30 minutes)

1 discussion prompt (total 10 minutes)

Meet and Greet (optional) (10 minutes)

2 ungraded labs (total 120 minutes)

Practice: Creating RDD's with PySpark (60 minutes)
Practice: Reading Data into Dataframes (60 minutes)

In this module, you will explore the Snowflake platform, gaining insights into its architecture and key concepts. Through hands-on practice in the Snowflake Web UI, you'll learn to create tables, manage warehouses, and use the Snowflake Python Connector to interact with tables. By the end of this week, you'll solidify your understanding of Snowflake's architecture and practical applications, emerging with the ability to effectively navigate and leverage the platform for data management and analysis.

What's included

8 videos, 5 readings, 6 assignments

8 videos (total 27 minutes)

What is Snowflake? (2 minutes)
Snowflake Layers (2 minutes)
Snowflake Web UI (4 minutes)
Navigating Snowflake (4 minutes)
Creating a Table in Snowflake (5 minutes)
Snowflake Warehouses (4 minutes)
Writing to Snowflake (3 minutes)
Reading from Snowflake (3 minutes)

5 readings (total 50 minutes)

Accessing Snowflake (10 minutes)
Detailed View Inside Snowflake (10 minutes)
Snowsight: The Snowflake Web Interface (10 minutes)
Working with Warehouses (10 minutes)
Python Connector Documentation (10 minutes)

6 assignments (total 180 minutes)

Snowflake (30 minutes)
Snowflake Architecture (30 minutes)
Snowflake Layers (30 minutes)
Navigating Snowflake (30 minutes)
Creating a Table (30 minutes)
Writing to Snowflake (30 minutes)

In this module, you will practice the essential skills for managing machine learning workflows using Databricks and MLflow. First, you will create a Databricks workspace and configure a cluster, setting the stage for efficient data analysis. Next, you will load a sample dataset into the Databricks workspace using PySpark, enabling data manipulation and exploration. Finally, you will install MLflow either locally or within the Databricks environment, gaining the ability to orchestrate the entire machine learning lifecycle. By the end of this week, you will be able to craft, track, and manage machine learning experiments within Databricks, ensuring precision, reproducibility, and optimal decision-making throughout your data-driven journey.

What's included

16 videos, 7 readings, 4 assignments, 1 ungraded lab

16 videos (total 72 minutes)

Accessing Databricks (1 minute)
Spark Notebooks with Databricks (5 minutes)
Using Data with Databricks (5 minutes)
Working with Workspaces in Databricks (3 minutes)
Advanced Capabilities of Databricks (2 minutes)
PySpark Introduction on Databricks (7 minutes)
Exploring Databricks Azure Features (4 minutes)
Using the DBFS to AutoML Workflow (4 minutes)
Load, Register and Deploy ML Models (3 minutes)
Databricks Model Registry (3 minutes)
Model Serving on Databricks (2 minutes)
What is MLOps? (13 minutes)
Exploring Open-Source MLFlow Frameworks (6 minutes)
Running MLFlow with Databricks (6 minutes)
End to End Databricks MLFlow (4 minutes)
Databricks Autologging with MLFlow (4 minutes)

7 readings (total 70 minutes)

What is Azure Databricks? (10 minutes)
Introduction to Databricks Machine Learning (10 minutes)
What is the Databricks File System (DBFS)? (10 minutes)
Serverless Compute with Databricks (10 minutes)
MLOps Workflow on Azure Databricks (10 minutes)
Run MLFlow Projects on Azure Databricks (10 minutes)
Databricks Autologging (10 minutes)

4 assignments (total 120 minutes)

DataBricks (30 minutes)
PySpark SQL (30 minutes)
PySpark DataFrames (30 minutes)
MLFlow with Databricks (30 minutes)

1 ungraded lab (total 60 minutes)

ETL-Part-1: Keyword Extractor Tool to HashTag Tool (60 minutes)

In this module, you will explore the concepts of Kaizen, DevOps, and DataOps and how these methodologies synergistically contribute to efficient and seamless data engineering workflows. Through practical examples, you will learn how Kaizen's continuous improvement philosophy, DevOps' collaborative practices, and DataOps' focus on data quality and integration converge to enhance the development, deployment, and management of data engineering platforms. By the end of this week, you will have the knowledge and perspective needed to optimize data engineering processes and deliver scalable, reliable, and high-quality solutions.

What's included

21 videos, 7 readings, 4 assignments, 1 ungraded lab

21 videos (total 502 minutes)

Kaizen Methodology for Data (4 minutes)
Introducing GitHub CodeSpaces (9 minutes)
Compiling Python in GitHub Codespaces (18 minutes)
Walking through Sagemaker Studio Lab (29 minutes)
Pytest Master Class (Optional) (166 minutes)
What is DevOps? (2 minutes)
DevOps Key Concepts (36 minutes)
Continuous Integration Overview (32 minutes)
Build an NLP in Cloud9 with Python (43 minutes)
Build a Continuously Deployed Containerized FastAPI Microservice (44 minutes)
Hugo Continuous Deploy on AWS (19 minutes)
Container Based Continuous Delivery (9 minutes)
What is DataOps? (1 minute)
DataOps and MLOps with Snowflake (62 minutes)
Building Cloud Pipelines with Step Functions and Lambda (17 minutes)
What is a Data Lake? (2 minutes)
Data Warehouse vs. Feature Store (2 minutes)
Big Data Challenges (1 minute)
Types of Big Data Processing (1 minute)
Real-World Data Engineering Pipeline (2 minutes)
Data Feedback Loop (1 minute)

7 readings (total 70 minutes)

GitHub Codespaces Overview (10 minutes)
Getting Started with Amazon SageMaker Studio Lab (10 minutes)
Teaching MLOps at Scale with GitHub (Optional) (10 minutes)
Getting Started with DevOps and Cloud Computing (10 minutes)
Benefits of Serverless ETL Technologies (10 minutes)
Next Steps (10 minutes)
Share your learning experience (10 minutes)

4 assignments (total 120 minutes)

DataOps and Operations Methodologies (30 minutes)
Kaizen Methodology (30 minutes)
DevOps (30 minutes)
DataOps (30 minutes)

1 ungraded lab (total 60 minutes)

ETL-Part2: SQLite ETL Destination (60 minutes)

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructors

Instructor ratings

(20 ratings)

Noah Gift

Duke University

40 courses, 284,968 learners

Offered by

Duke University

Learner reviews

5 stars: 52.11%
4 stars: 19.71%
3 stars: 8.45%
2 stars: 8.45%
1 star: 11.26%

Reviewed on Aug 6, 2024

Great course, detailed steps by step walkthrough that really simplifies understanding

Reviewed on Jan 15, 2024

A course that cover all aspects basic of data engineer, i love it

Frequently asked questions

To access course materials, assignments, and earn a Certificate, you'll need to purchase the Certificate experience when you enroll in a course. Eligible learners may also have the option to start with a Free Trial. Some courses may also offer a Full Course, No Certificate option. This lets you access course materials, submit required assessments, and receive a final grade, but you won't be able to earn or purchase a Certificate.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.