Big Data Processing with Hadoop and Spark

Master the tools and techniques that power large-scale data processing and analytics. This course introduces the principles and frameworks of big data processing with Hadoop and Spark, enabling learners to manage, process, and analyze massive datasets efficiently.

This course is part of the Cloud Computing for Data Science Specialization.

Instructor: Dmitriy Babichenko
What you'll learn
- Explain how Hadoop and Spark enable large-scale data processing.
- Build and manage distributed data pipelines using Hadoop frameworks.
- Implement in-memory analytics and real-time processing with Spark.
- Apply big data tools to design scalable, data-driven applications.
Skills you'll gain
- Predictive Modeling
- Data Pipelines
- Data Science
- Data Transformation
- Scalability
- Distributed Computing
- PySpark
- Apache Spark
- Data Processing
- Data Storage Technologies
- Scikit Learn (Machine Learning Library)
- Apache Hive
- Data Analysis
- Apache Hadoop
- Data Management
- Data Storage
- Python Programming
- Information Technology
- Big Data
Details to know

- Shareable certificate to add to your LinkedIn profile
- 8 assignments
- Last updated February 2026

Build your subject-matter expertise
- Learn new concepts from industry experts
- Gain a foundational understanding of a subject or tool
- Develop job-relevant skills with hands-on projects
- Earn a shareable career certificate

There are 3 modules in this course
Module 1
This module guides you through the core components of the Hadoop ecosystem, starting with its architecture and distributed file system. You'll explore how Hadoop processes data, gain insight into its broader ecosystem, and apply your knowledge in hands-on activities using both Docker and a Linux virtual machine; a short sketch of working with HDFS from Python follows below.
What's included
6 videos, 1 reading, 3 assignments
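To give a feel for this kind of hands-on work, here is a minimal sketch of reading and writing files in HDFS from Python using PyArrow's HDFS client. It assumes a running cluster whose NameNode is reachable at "namenode:8020" and a local libhdfs installation; the host name and file paths are illustrative, not part of the course materials.

```python
# A minimal sketch of basic HDFS operations from Python via PyArrow.
# Assumes libhdfs is configured and a NameNode is reachable at the
# (illustrative) address namenode:8020.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# Write a small file into HDFS...
with hdfs.open_output_stream("/user/student/hello.txt") as f:
    f.write(b"Hello, HDFS!\n")

# ...list the directory to confirm it landed...
for info in hdfs.get_file_info(fs.FileSelector("/user/student")):
    print(info.path, info.size)

# ...and read it back.
with hdfs.open_input_stream("/user/student/hello.txt") as f:
    print(f.read().decode())
```

The same operations are available from the `hdfs dfs` command line inside the Docker or VM environments, which is how the module's activities exercise them.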
Module 2
This module introduces you to key programming models for distributed data processing, with a focus on MapReduce and its practical applications. You'll explore core concepts and terminology, work through guided Python code walkthroughs implementing word count and server log analysis tasks (a minimal word-count sketch follows below), and gain hands-on experience writing data transformation scripts in Apache Pig, culminating in an assignment that applies these skills to web log analysis.
What's included
6 videos, 6 readings, 3 assignments
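As a taste of the word-count walkthrough, here is a minimal sketch of the classic MapReduce pattern written for Hadoop Streaming, which pipes text through each script's stdin and stdout. The file names mapper.py and reducer.py are illustrative, not the course's own files.

```python
# mapper.py -- emit a (word, 1) pair for every token read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sum the counts for each word. Hadoop Streaming sorts
# mapper output by key, so identical words arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The pair can be tested locally without a cluster, e.g. `cat input.txt | python3 mapper.py | sort | python3 reducer.py`, which mirrors the map, shuffle-sort, and reduce phases that Hadoop runs at scale.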
Module 3
This module introduces you to Apache Spark, covering its core concepts, architecture, and machine learning capabilities through MLlib. You'll learn how to set up Spark using Docker and a Linux VM, explore how PySpark operates within the Spark framework, and compare Spark MLlib with scikit-learn through hands-on code walkthroughs. By the end of the module, you'll apply what you've learned in graded activities and an assignment focused on building a predictive model with PySpark and MLlib; a minimal sketch of such a pipeline follows below.
What's included
5 videos, 3 readings, 2 assignments
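For a sense of what the final assignment builds toward, here is a minimal sketch of a PySpark MLlib predictive-modeling pipeline: load a dataset, assemble feature columns into a vector, and fit a classifier. The file path ("data.csv") and column names ("f1", "f2", "label") are illustrative assumptions, not the assignment's actual dataset.

```python
# A minimal sketch of a predictive model with PySpark MLlib.
# Assumes a CSV with a numeric "label" column and two feature
# columns "f1" and "f2" (all names illustrative).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

df = spark.read.csv("data.csv", header=True, inferSchema=True)

# MLlib estimators expect all features packed into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

# Fit a logistic regression classifier and score the held-out split.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(test).select("label", "prediction").show(5)

spark.stop()
```

Where scikit-learn fits models on in-memory arrays on a single machine, MLlib runs the same kind of pipeline over DataFrames distributed across a cluster, which is the comparison the module's walkthroughs draw out.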
Earn a career certificate
Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.
Build toward a degree
This course is part of the following degree program(s) offered by the University of Pittsburgh. If you are admitted and enroll, your completed coursework may count toward your degree, and your progress can transfer with you.¹
Offered by University of Pittsburgh