What types of data processing tasks will I be able to perform after completing the course?

You will be able to perform a variety of tasks, including data cleaning, transformation, aggregation, and analysis of large datasets using PySpark’s RDDs and DataFrames.

What technologies and frameworks are covered in the course?

You’ll learn PySpark in detail, along with its integration with Hadoop, RDDs, DataFrames, and SQL-based data processing.

Is prior knowledge in data engineering required?

No, prior experience is not required; the course introduces PySpark basics before moving to advanced use cases.

Does the course cover workflow automation and ETL?

Yes, you’ll learn how to design ETL workflows and automate big data processing with PySpark.

Can I preview a course before enrolling?

Yes, you can preview the first video and view the syllabus before you enroll. You must purchase the course to access content not included in the preview.

When will I have access to the lectures and assignments?

If you decide to enroll in the course before the session start date, you will have access to all of the lecture videos and readings for the course. You’ll be able to submit assignments once the session starts.

What will I get when I enroll?

Once you enroll and your session begins, you will have access to all videos and other resources, including reading items and the course discussion forum. You’ll be able to view and submit practice assessments, and complete required graded assignments to earn a grade and a Course Certificate.

When will I receive my Course Certificate?

If you complete the course successfully, your electronic Course Certificate will be added to your Accomplishments page. From there, you can print your Course Certificate or add it to your LinkedIn profile.

Why can’t I audit this course?

This course is currently available only to learners who have paid or who have received financial aid, where financial aid is available.

Is financial aid available?

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.

PySpark in Action: Hands-On Data Processing

This course is part of the PySpark for Data Science Specialization

Instructor: Edureka

5 modules

Gain insight into a topic and learn the fundamentals.

Intermediate level

2 weeks to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Explore the fundamental concepts of Big Data and the components of the Hadoop ecosystem.
Explain the architecture and key principles of Apache Spark and its role in big data processing.
Utilize RDD transformations and actions to effectively process large-scale datasets with PySpark.
Execute advanced DataFrame operations, including data manipulation and aggregation techniques.

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

17 assignments

Taught in English

Build your subject-matter expertise

This course is part of the PySpark for Data Science Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 5 modules in this course

PySpark in Action: Hands-On Data Processing is a practical course that equips you to work confidently with large-scale data using PySpark and distributed data processing frameworks. You’ll discover the fundamentals of Big Data, Apache Hadoop, and Apache Spark, then build on this knowledge through real-world exercises where you’ll process and analyze massive datasets.

During the course, you’ll gain hands-on experience with:

- Foundational concepts of Big Data and components of the Hadoop ecosystem such as HDFS, enabling you to understand modern data storage and processing.
- Spark architecture and critical design principles for scalable, fault-tolerant data workflows.
- RDD transformations and actions, helping you handle large-scale datasets using PySpark’s distributed processing engine.
- Advanced DataFrame techniques: manage complex data types, perform aggregations, and solve business data challenges efficiently.
- PySpark SQL for applying advanced queries, optimizing processing workflows, and enabling rapid, reliable analysis at scale.

This course is ideal for those new to data engineering or distributed computing who want a hands-on introduction to PySpark for large-scale data tasks. If you have basic Python skills but no prior experience in data engineering, you’ll find accessible explanations and step-by-step projects throughout.

By course completion, you’ll be prepared to use PySpark in real-world projects, build and monitor data pipelines, automate processing, clean and integrate diverse datasets, and confidently tackle core challenges in distributed data analytics.

This module introduces you to the fundamental concepts of Big Data and Hadoop. You will explore the Hadoop ecosystem, its components, and the Hadoop Distributed File System (HDFS), setting the foundation for understanding big data processing and storage solutions.

What's included

15 videos, 5 readings, 4 assignments, 1 discussion prompt

15 videos Total 74 minutes

Course Introduction 4 minutes
What is Big Data? 4 minutes
Applications of Big Data 5 minutes
What is Hadoop? 5 minutes
Hadoop Ecosystem 2 minutes
Working of HDFS 5 minutes
Introduction to Apache Spark 7 minutes
Master-slave Architecture 7 minutes
Spark Architecture 2 minutes
Data Processing with Apache Spark 6 minutes
Directed Acyclic Graph (DAG) 5 minutes
Introduction to Spark Ecosystem 5 minutes
What is PySpark? 5 minutes
Key Features of PySpark 7 minutes
Basics of Python 6 minutes

5 readings Total 50 minutes

Welcome to PySpark in Action: Hands-On Data Processing 10 minutes
What is Big Data? – A Beginner’s Guide to the World of Big Data 10 minutes
Spark SQL 10 minutes
Features of PySpark 10 minutes
Module Summary: Big Data Processing with PySpark 10 minutes

4 assignments Total 38 minutes

Knowledge Check: Big Data Processing with PySpark 20 minutes
Practice Quiz: Big Data Essentials 6 minutes
Practice Quiz: Apache Spark Fundamentals 6 minutes
Practice Quiz: PySpark 6 minutes

1 discussion prompt Total 10 minutes

Introduce Yourself 10 minutes

Dive into the core of PySpark by learning about Resilient Distributed Datasets (RDDs). This module covers the fundamentals of RDDs, how they work, and their key transformations and actions, enabling efficient distributed data processing in PySpark.
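
The difference between transformations and actions, and the lazy evaluation that connects them, can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use PySpark, and the `LazyDataset` class and its methods are hypothetical stand-ins that mimic the idea that transformations such as `map` and `filter` only record a plan, while actions such as `collect` and `count` run it. Real RDDs also split data into partitions across a cluster, which this single-machine sketch ignores.

```python
from typing import Any, Callable, Iterable, Iterator, List


class LazyDataset:
    """Hypothetical single-machine stand-in for an RDD (not a PySpark class)."""

    def __init__(self, source: Callable[[], Iterator[Any]]) -> None:
        # Store a recipe for producing the data, not the computed data.
        self._source = source

    @classmethod
    def from_iterable(cls, items: Iterable[Any]) -> "LazyDataset":
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the recorded chain and return a concrete result.
    def collect(self) -> List[Any]:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


calls: List[int] = []  # records each time the mapped function really runs


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = LazyDataset.from_iterable(range(1, 6))
even_squares = numbers.map(square).filter(lambda v: v % 2 == 0)

calls_before_action = len(calls)  # nothing has executed so far
result = even_squares.collect()   # the action triggers the whole chain
```

In PySpark itself, the analogous chain would typically start from `sc.parallelize(range(1, 6))` on a `SparkContext` and end with `collect()`; the sketch above only models the evaluation order, not distribution or fault tolerance.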

What's included

25 videos, 4 readings, 4 assignments, 3 discussion prompts

25 videos Total 121 minutes

Introduction to RDDs 6 minutes
Working of RDDs 5 minutes
Creating RDDs 7 minutes
Essentials of RDD 6 minutes
Key Concepts of RDD 6 minutes
Understanding Lazy Evaluations 5 minutes
Advantages of Lazy Evaluation 3 minutes
Introduction to Transformations 5 minutes
Narrow and Wide Transformations 6 minutes
Transformations: Map 6 minutes
Transformations: Filter, Reduce and groupByKey 4 minutes
Transformations: Distinct, Sample and Join 5 minutes
Transformations: Union and Subtract 3 minutes
Introduction to Repartition 6 minutes
Significance of Repartition 1 minute
Introduction to Actions 5 minutes
Actions: collect, reduce and reduceByKey 5 minutes
Implementing Actions: collect, reduce and reduceByKey 3 minutes
Actions: count, foreach and aggregate 6 minutes
Implementing Actions: count, foreach and aggregate 3 minutes
Actions: Coalesce, histogram and sortBy 4 minutes
Implementing Actions: Coalesce, histogram and sortBy 3 minutes
Working with RDD Transformations 6 minutes
Applying Distinct, sample and join Transformations 3 minutes
Grocery Store Data Analysis with PySpark RDDs 7 minutes

4 readings Total 40 minutes

PySpark RDDs in Organization 10 minutes
Managing RDD Transformations in PySpark 10 minutes
Optimizing RDD operations in PySpark 10 minutes
Module Summary: Working with RDD 10 minutes

4 assignments Total 38 minutes

Knowledge Check: Working with RDD 20 minutes
Introduction to RDD 6 minutes
RDD Transformations 6 minutes
RDD Actions 6 minutes

3 discussion prompts Total 30 minutes

Introduction to RDDs 10 minutes
Transformations: Map 10 minutes
Actions: Coalesce, histogram, and sortBy 10 minutes

This module covers the creation and manipulation of DataFrames in PySpark. You will learn how to perform basic and advanced operations, including aggregation, grouping, and handling missing data, with a focus on optimizing large-scale data processing tasks.
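
As a rough illustration of two operations named above, handling missing data and grouped aggregation, here is a minimal plain-Python sketch. It deliberately avoids PySpark so it runs anywhere; the sample rows and the helper names `fill_missing` and `group_sum` are hypothetical and are not part of the course material or of any library.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical sample rows; None marks a missing value.
rows: List[Dict[str, Any]] = [
    {"store": "north", "amount": 10.0},
    {"store": "north", "amount": None},
    {"store": "south", "amount": 7.5},
    {"store": "south", "amount": 2.5},
]


def fill_missing(records: List[Dict[str, Any]], column: str, default: Any) -> List[Dict[str, Any]]:
    """Return copies of the records with None in `column` replaced by `default`."""
    return [
        {**r, column: default if r[column] is None else r[column]}
        for r in records
    ]


def group_sum(records: List[Dict[str, Any]], key: str, value: str) -> Dict[Any, float]:
    """Sum `value` per distinct `key`, like a group-by followed by a sum."""
    totals: Dict[Any, float] = defaultdict(float)
    for r in records:
        totals[r[key]] += r[value]
    return dict(totals)


cleaned = fill_missing(rows, "amount", 0.0)
totals = group_sum(cleaned, "store", "amount")
```

With a PySpark DataFrame `df` holding the same columns, the corresponding steps would usually be written as `df.fillna({"amount": 0.0})` followed by `groupBy("store").sum("amount")`, with Spark distributing the work instead of looping in one process.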

What's included

22 videos, 4 readings, 4 assignments, 1 discussion prompt

22 videos Total 116 minutes

Overview of DataFrames 7 minutes
Introduction to DataFrames API 4 minutes
Creating DataFrames from Different Sources 7 minutes
DataFrames from RDD 6 minutes
Basic DataFrame Operations 6 minutes
Implementation of DataFrame Operations 4 minutes
Performing Aggregations and Groupings - GroupBy and Window 6 minutes
Performing Aggregations and Groupings - Cube and Rollup 4 minutes
Handling Missing Data - Managing Null Values 7 minutes
Demonstration for Handling Missing Data 4 minutes
Working with Complex Data Types - Arrays and Structs 7 minutes
Demonstration for Working with Complex Data Types 3 minutes
Advanced DataFrame Transformations and Actions 7 minutes
Demonstration: Working with DataFrames 7 minutes
Introduction to Data Visualization and Key Aspects 5 minutes
Introduction to Data Visualization - General Visuals 4 minutes
Libraries for Data Visualization - Matplotlib and Seaborn 4 minutes
Libraries for Data Visualization - Plotly 4 minutes
Implementing Data Visualization 6 minutes
Implementing Data Visualization - Plotting Charts 6 minutes
Customizing the Visualizations 4 minutes
Customizing Charts and Visuals 6 minutes

4 readings Total 40 minutes

Importance of PySpark DataFrames 10 minutes
Window Functions in PySpark 10 minutes
Data Visualization Libraries in PySpark 10 minutes
Module Summary: PySpark DataFrames 10 minutes

4 assignments Total 38 minutes

Knowledge Check: PySpark DataFrames 20 minutes
Introduction to PySpark DataFrames 6 minutes
Advanced DataFrame Operations 6 minutes
Data Visualizations with PySpark DataFrames 6 minutes

1 discussion prompt Total 5 minutes

PySpark DataFrames and Traditional Pandas DataFrames 5 minutes

In this module, you will explore the SQL capabilities of PySpark. Learn how to perform CRUD operations, execute SQL commands, and merge and aggregate data using PySpark SQL. You'll also discover best practices for using SQL with PySpark to enhance data workflows.
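
The kind of query this module describes, creating tables, inserting rows, then merging and aggregating them, can be sketched with Python's built-in `sqlite3` module as a stand-in for a Spark session. The table names, columns, and rows below are hypothetical and chosen only for illustration; SQLite's dialect also differs from Spark SQL in places, so treat this as a sketch of the SQL pattern rather than of PySpark's behavior.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Create tables and insert rows (the "create" part of CRUD).
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
    """
)

# Merge the two tables with a join, then aggregate per customer.
query = """
    SELECT c.name, COUNT(o.id) AS order_count, SUM(o.total) AS revenue
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
"""
summary = conn.execute(query).fetchall()
conn.close()
```

In PySpark, a similar `SELECT ... JOIN ... GROUP BY` statement is typically run by registering each DataFrame with `createOrReplaceTempView` and passing the query string to `spark.sql`, which returns a new DataFrame instead of a list of tuples.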

What's included

28 videos, 4 readings, 4 assignments, 2 discussion prompts

28 videos Total 135 minutes

Structured Data vs. Unstructured Data 5 minutes
Characteristic of Structured Data 5 minutes
Relational Database and its Components 7 minutes
SQL in Relation with Relational Database 6 minutes
Normalization and its Types 6 minutes
Exploring Different Types of Normalization 4 minutes
Data Querying and Filtering Logic 6 minutes
DDL Commands - Creating Tables 5 minutes
DDL Commands - Altering and Truncating Tables 4 minutes
DQL Commands - Select Statement and Where Clause 4 minutes
DQL Commands - Practical Implementation 4 minutes
DML Commands - Insert, Update, and Delete 4 minutes
DML Commands - Lock 4 minutes
DCL Commands 7 minutes
TCL Commands 6 minutes
Alter - Altering a Table and Constraints 5 minutes
Alter - Altering Indexes and Views 3 minutes
Performing CRUD Operations 6 minutes
Operations on PySpark SQL DataFrames 4 minutes
Performing Operations on PySpark SQL DataFrames 7 minutes
Data Merging and Aggregation using PySpark SQL 5 minutes
Implementing Data Merging and Aggregation using PySpark SQL 4 minutes
SQL Best Practices 6 minutes
Data Integrity and Error Handling with PySpark 3 minutes
Problem Statement: Ecommerce Organization 4 minutes
Data Analysis of an E-commerce Organization 4 minutes
Demonstration: Spark SQL - Retail Organization 4 minutes
Demonstration: Analyzing the Data 4 minutes

4 readings Total 34 minutes

Best Practices for Data Querying: Optimizing SQL Performance 8 minutes
User-Defined Functions (UDFs) in PySpark 8 minutes
Best Practices for Using SQL with PySpark 8 minutes
Module Summary: PySpark SQL 10 minutes

4 assignments Total 38 minutes

Knowledge Check: PySpark SQL 20 minutes
Introduction to SQL 6 minutes
SQL Commands 6 minutes
Working with PySpark SQL 6 minutes

2 discussion prompts Total 10 minutes

Why Normalization is Crucial for Database Design? 5 minutes
Importance of Aggregate Functions 5 minutes

This module is meant to test how well you understand the different ideas and lessons you've learned in this course. You will undertake a project based on these PySpark concepts and complete a comprehensive quiz that will assess your confidence and proficiency in Data Processing with PySpark.

What's included

1 video, 1 reading, 1 assignment, 1 discussion prompt

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Instructor ratings

(5 ratings)

Edureka

138 Courses 131,180 learners

Offered by

Edureka

Frequently asked questions

You will need access to a computer with Python and Apache Spark installed. Detailed setup instructions will be provided at the beginning of the course.

This course is designed for individuals new to big data and PySpark, providing a solid foundation to start working with distributed data processing.

While prior SQL knowledge is beneficial, it is not mandatory. The course will introduce SQL concepts as they relate to PySpark and provide practice with SQL queries.