Updated in May 2025.
This course now features Coursera Coach, an interactive learning companion that helps you test your knowledge, challenge assumptions, and deepen your understanding as you progress.

Build a strong, hands-on foundation in Hadoop and big data processing with this comprehensive course designed for data engineers, developers, and IT professionals. From installation to advanced analytics, you’ll learn how to work confidently with Hadoop’s ecosystem and design scalable solutions for real-world data challenges.

You’ll begin by installing the Hortonworks Data Platform (HDP) Sandbox on your local machine, giving you an isolated environment to explore Hadoop’s core components. Through guided exercises, you’ll work with the Hadoop Distributed File System (HDFS) and build your understanding of MapReduce, learning how large-scale distributed processing works behind the scenes.

As you progress, you’ll move into advanced Hadoop programming with Pig, Hive, and Spark. You’ll write complex queries, analyze large datasets, and work with real-world data to build scalable data workflows. You’ll also explore machine learning with Spark MLlib, giving you a practical introduction to distributed ML techniques.

In the final modules, you’ll learn how to manage and optimize Hadoop clusters using YARN, ZooKeeper, Oozie, and Kafka. You’ll practice feeding data into your cluster, orchestrating workflows, managing resources, and analyzing streaming data in real time, all essential skills for production-grade environments.

By the end of this course, you will have:

- Installed and configured the Hortonworks Sandbox for Hadoop development.
- Worked with HDFS, MapReduce, and Hadoop’s core data processing concepts.
- Written queries and pipelines using Pig, Hive, and Spark.
- Performed distributed machine learning with Spark MLlib.
- Integrated relational and non-relational data sources with Hadoop.
- Managed clusters and streaming workflows with YARN, ZooKeeper, Oozie, and Kafka.
- Gained the confidence to design and implement Hadoop-based data solutions.

This course is ideal for data engineers, developers, and IT professionals with basic programming or data management experience. Familiarity with Java, SQL, or the Linux command line is helpful but not required.