IBM
Introduction to Big Data with Spark and Hadoop
IBM

Introduction to Big Data with Spark and Hadoop

Aije Egwaikhide
Romeo Kienzler
Rav Ahuja

Instructors: Aije Egwaikhide

Access provided by Colegio de Estudios Superiores de Administracion - CESA

69,571 already enrolled

Gain insight into a topic and learn the fundamentals.
4.4

(461 reviews)

Intermediate level

Recommended experience

Flexible schedule
2 weeks at 10 hours a week
Learn at your own pace
92%
Most learners liked this course
Gain insight into a topic and learn the fundamentals.
4.4

(461 reviews)

Intermediate level

Recommended experience

Flexible schedule
2 weeks at 10 hours a week
Learn at your own pace
92%
Most learners liked this course

What you'll learn

  • Explain the impact of big data, including use cases, tools, and processing methods.

  • Describe Apache Hadoop architecture, ecosystem, practices, and user-related applications, including Hive, HDFS, HBase, Spark, and MapReduce.

  • Apply Spark programming basics, including parallel programming basics for DataFrames, data sets, and Spark SQL.

  • Use Spark’s RDDs and data sets, optimize Spark SQL using Catalyst and Tungsten, and use Spark’s development and runtime environment options.

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

14 assignments

Taught in English

See how employees at top companies are mastering in-demand skills

 logos of Petrobras, TATA, Danone, Capgemini, P&G and L'Oreal

Build your subject-matter expertise

This course is available as part of
When you enroll in this course, you'll also be asked to select a specific program.
  • Learn new concepts from industry experts
  • Gain a foundational understanding of a subject or tool
  • Develop job-relevant skills with hands-on projects
  • Earn a shareable career certificate

There are 7 modules in this course

In this module, you’ll begin your acquisition of Big Data knowledge with the most up-to-date definition of Big Data. You’ll explore the impact of Big Data on everyday personal tasks and business transactions with Big Data Use Cases. You’ll also learn how Big Data uses parallel processing, scaling, and data parallelism. Going further, you’ll explore commonly used Big Data tools and explain the role of open-source in Big Data. Finally, you’ll go beyond the hype and explore additional Big Data viewpoints.

What's included

8 videos1 reading2 assignments2 plugins

In this module, you'll gain a fundamental understanding of the Apache Hadoop architecture, ecosystem, practices, and commonly used applications, including Distributed File System (HDFS), MapReduce, Hive, and HBase. You’ll also gain practical skills in hands-on labs when you query the data added using Hive, launch a single-node Hadoop cluster using Docker, and run MapReduce jobs.

What's included

6 videos1 reading2 assignments3 app items2 plugins

In this module, you’ll turn your attention to the popular Apache Spark platform, where you will explore the attributes and benefits of Apache Spark and distributed computing. You'll gain key insights about functional programming and Lambda functions. You’ll also explore Resilient Distributed Datasets (RDDs), parallel programming, resilience in Apache Spark, and relate RDDs and parallel programming with Apache Spark. Then, you’ll dive into additional Apache Spark components and learn how Apache Spark scales with Big Data. Working with Big Data signals the need for working with queries, including structured queries using SQL. You’ll also learn about the functions, parts, and benefits of Spark SQL and DataFrame queries, and discover how DataFrames work with Spark SQL.

What's included

5 videos1 reading2 assignments2 app items2 plugins

In this module, you’ll learn about Resilient Distributed Datasets (RDDs), their uses in Apache Spark, and RDD transformations and actions. You'll compare the use of datasets with Spark's latest data abstraction, DataFrames. You'll learn to identify and apply basic DataFrame operations. You’ll explore Apache Spark SQL optimization and learn how Spark SQL and memory optimization benefit from using Catalyst and Tungsten. Finally, you’ll fortify your skills with guided hands-on lab to create a table view and apply data aggregation techniques.

What's included

5 videos1 reading2 assignments2 app items4 plugins

In this module, you’ll explore how Spark processes the requests that your application submits and learn how you can track work using the Spark Application UI. Because Spark application work happens on the cluster, you need to be able to identify Apache Cluster Managers, their components, and benefits. You’ll also know how to connect with each cluster manager and how and when you might want to set up a local, standalone Spark instance. Next, you’ll learn about Apache Spark application submission, including the use of Spark’s unified interface, “spark-submit,” and learn about options and dependencies. You’ll also describe and apply options for submitting applications, identify external application dependency management techniques, and list Spark Shell benefits. You’ll also look at recommended practices for Spark's static and dynamic configuration options and perform hands-on labs to use Apache Spark on IBM Cloud and run Spark on Kubernetes.

What's included

6 videos2 readings3 assignments2 app items4 plugins

Platforms and applications require monitoring and tuning to manage issues that inevitably happen. In this module, you'll learn about connecting the Apache Spark user interface web server and using the same UI web server to manage application processes. You’ll also identify common Apache Spark application issues and learn about debugging issues using the application UI and locating related log files. Further, you’ll discover and gain real-world knowledge about how Spark manages memory and processor resources using the hands-on lab.

What's included

5 videos1 reading2 assignments1 app item3 plugins

In this module, you’ll perform a practice lab where you’ll explore two critical aspects of data processing using Spark: working with Resilient Distributed Datasets (RDDs) and constructing DataFrames from JSON data. You will also apply various transformations and actions on both RDDs and DataFrames to gain insights and manipulate the data effectively. Further, you’ll apply your knowledge in a final project where you will create a DataFrame by loading data from a CSV file and applying transformations and actions using Spark SQL. Finally, you’ll be assessed based on your learning from the course.

What's included

3 readings1 assignment2 app items2 plugins

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructors

Instructor ratings
4.3 (116 ratings)
Aije Egwaikhide
IBM
6 Courses754,407 learners
Romeo Kienzler
IBM
10 Courses793,879 learners
Rav Ahuja
IBM
56 Courses4,371,838 learners

Offered by

IBM

Why people choose Coursera for their career

Felipe M.
Learner since 2018
"To be able to take courses at my own pace and rhythm has been an amazing experience. I can learn whenever it fits my schedule and mood."
Jennifer J.
Learner since 2020
"I directly applied the concepts and skills I learned from my courses to an exciting new project at work."
Larry W.
Learner since 2021
"When I need courses on topics that my university doesn't offer, Coursera is one of the best places to go."
Chaitanya A.
"Learning isn't just about being better at your job: it's so much more than that. Coursera allows me to learn without limits."

Learner reviews

4.4

461 reviews

  • 5 stars

    66.16%

  • 4 stars

    19.30%

  • 3 stars

    8.02%

  • 2 stars

    3.03%

  • 1 star

    3.47%

Showing 3 of 461

AA
5

Reviewed on Jan 15, 2024

JS
4

Reviewed on May 1, 2022

TK
5

Reviewed on Jan 17, 2025

Explore more from Information Technology