Spark, Hadoop, and Snowflake for Data Engineering

This course is part of the "Applied Python Data Engineering" Specialization

Instructor: Noah Gift

14,632 already enrolled

4 modules

Gain insight into a topic and learn the fundamentals.

71 reviews

Advanced level

Recommended experience

3 weeks to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Create scalable data pipelines (Hadoop, Spark, Snowflake, Databricks) for efficient data handling.
Optimize data engineering with clustering and scaling to boost performance and resource use.
Build ML solutions (PySpark, MLflow) on Databricks for seamless model development and deployment.
Implement DataOps and DevOps practices for continuous integration and deployment (CI/CD) of data-driven applications, including automating processes.

Skills you'll gain

Category: SQL
Category: Data Integration
Category: Data Warehousing
Category: Distributed Computing
Category: MLOps (Machine Learning Operations)
Category: Big Data
Category: Data Pipelines
Category: Snowflake Schema
Category: Data Quality
Category: Data Transformation
Category: Data Architecture
Category: Model Training
Category: Data Processing
Category: DevOps

Tools you'll learn

Category: Databricks
Category: Model Deployment
Category: Python Programming
Category: Apache Spark
Category: Apache Hadoop
Category: PySpark

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

21 assignments

Taught in English

Build your subject-matter expertise

This course is part of the "Applied Python Data Engineering" Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 4 modules in this course

Gain the skills to build efficient and scalable data pipelines. Explore essential data engineering platforms (Hadoop, Spark, and Snowflake) and learn how to optimize and manage them. Delve into Databricks, a platform for running data analytics and machine learning tasks, while honing your Python data science skills with PySpark. Finally, discover the key concepts of MLflow, an open-source platform for managing the end-to-end machine learning lifecycle, and learn how to integrate it with Databricks.

This course is designed for learners who want to pursue or advance a career in data science or data engineering, and for software developers or engineers who want to grow their data management skill set. Beyond the technologies themselves, you will learn methodologies that sharpen your project management and workflow skills for data engineering, including Kaizen, DevOps, and DataOps best practices. With quizzes to test your knowledge throughout, this course will guide you toward becoming a proficient data engineer, ready to tackle the challenges of today's data-driven world.

In this module, you will learn how to work with different data engineering platforms, such as Hadoop and Spark, and apply their concepts to real-world scenarios. First, you will explore the fundamentals of Hadoop to store and process big data. Next, you will delve into Spark concepts, distributed computing, deferred execution, and Spark SQL. By the end of the week, you will gain hands-on experience with PySpark DataFrames, DataFrame methods, and deferred execution strategies.
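The deferred execution mentioned above can be illustrated without Spark at all. The following is a minimal sketch in plain Python, not PySpark code: the `LazyPipeline` class and its `map`, `filter`, and `collect` methods are hypothetical names chosen for this illustration, loosely mirroring the idea that transformations are only recorded and nothing runs until an action asks for results.

```python
from typing import Callable, Iterable, Iterator, List


class LazyPipeline:
    """Hypothetical stand-in for a deferred-execution dataset (not a Spark API)."""

    def __init__(self, source: Iterable[int]) -> None:
        self._source = source
        # Recorded transformations; none of them has run yet.
        self._steps: List[Callable[[Iterator[int]], Iterator[int]]] = []

    def map(self, fn: Callable[[int], int]) -> "LazyPipeline":
        # Transformation: only remember what to do.
        self._steps.append(lambda rows: (fn(r) for r in rows))
        return self

    def filter(self, pred: Callable[[int], bool]) -> "LazyPipeline":
        # Transformation: only remember what to do.
        self._steps.append(lambda rows: (r for r in rows if pred(r)))
        return self

    def collect(self) -> List[int]:
        # Action: chain the recorded steps and actually pull the data through.
        rows: Iterator[int] = iter(self._source)
        for step in self._steps:
            rows = step(rows)
        return list(rows)


calls: List[int] = []


def double(x: int) -> int:
    calls.append(x)  # record every real invocation
    return x * 2


pipeline = LazyPipeline(range(5)).map(double).filter(lambda x: x > 2)
calls_before = list(calls)   # still empty: nothing has executed
result = pipeline.collect()  # the work happens here
```

Building the pipeline leaves `calls_before` empty, and `double` is only invoked once `collect()` runs. Spark applies the same principle at cluster scale, where deferring work lets the engine plan the whole job before executing it.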

Included

10 videos, 10 readings, 7 assignments, 1 discussion prompt, 2 ungraded labs

10 videos (Total 25 minutes)

Meet your Co-Instructor: Kennedy Behrman (1 minute)
Meet your Co-Instructor: Noah Gift (1 minute)
Overview of Big Data Platforms (2 minutes)
Getting Started with Hadoop (1 minute)
Getting Started with Spark (2 minutes)
Introduction to Resilient Distributed Datasets (RDD) (2 minutes)
Resilient Distributed Datasets (RDD) Demo (4 minutes)
Introduction to Spark SQL (2 minutes)
PySpark Dataframe Demo: Part 1 (3 minutes)
PySpark Dataframe Demo: Part 2 (7 minutes)

10 readings (Total 100 minutes)

Welcome to Data Engineering Platforms with Python! (10 minutes)
Report a problem with the course (10 minutes)
What is Apache Hadoop? (10 minutes)
What is Apache Spark? (10 minutes)
Use Apache Spark in Azure Databricks (optional) (10 minutes)
Choosing between Hadoop and Spark (10 minutes)
What are RDDs? (10 minutes)
Getting Started: Creating RDD's with PySpark (10 minutes)
Spark SQL, Dataframes and Datasets (10 minutes)
PySpark and Spark SQL (10 minutes)

7 assignments (Total 210 minutes)

PySpark (30 minutes)
Big Data Platforms (30 minutes)
Apache Hadoop Concepts (30 minutes)
Apache Spark Concepts (30 minutes)
RDD Concepts (30 minutes)
Spark SQL Concepts (30 minutes)
PySpark Dataframe Concepts (30 minutes)

1 discussion prompt (Total 10 minutes)

Meet and Greet (optional) (10 minutes)

2 ungraded labs (Total 120 minutes)

Practice: Creating RDD's with PySpark (60 minutes)
Practice: Reading Data into Dataframes (60 minutes)

In this module, you will explore the Snowflake platform, gaining insights into its architecture and key concepts. Through hands-on practice in the Snowflake Web UI, you'll learn to create tables, manage warehouses, and use the Snowflake Python Connector to interact with tables. By the end of this week, you'll solidify your understanding of Snowflake's architecture and practical applications, emerging with the ability to effectively navigate and leverage the platform for data management and analysis.

Included

8 videos, 5 readings, 6 assignments

8 videos (Total 27 minutes)

What is Snowflake? (2 minutes)
Snowflake Layers (2 minutes)
Snowflake Web UI (4 minutes)
Navigating Snowflake (4 minutes)
Creating a Table in Snowflake (5 minutes)
Snowflake Warehouses (4 minutes)
Writing to Snowflake (3 minutes)
Reading from Snowflake (3 minutes)

5 readings (Total 50 minutes)

Accessing Snowflake (10 minutes)
Detailed View Inside Snowflake (10 minutes)
Snowsight: The Snowflake Web Interface (10 minutes)
Working with Warehouses (10 minutes)
Python Connector Documentation (10 minutes)

6 assignments (Total 180 minutes)

Snowflake (30 minutes)
Snowflake Architecture (30 minutes)
Snowflake Layers (30 minutes)
Navigating Snowflake (30 minutes)
Creating a Table (30 minutes)
Writing to Snowflake (30 minutes)

In this module, you will practice the essential skills for managing machine learning workflows with Databricks and MLflow. First, you will create a Databricks workspace and configure a cluster. Next, you will load a sample dataset into the Databricks workspace using PySpark so you can manipulate and explore the data. Finally, you will install MLflow either locally or within the Databricks environment so you can manage the full machine learning lifecycle. By the end of this week, you will be able to create, track, and manage reproducible machine learning experiments within Databricks.

Included

16 videos, 7 readings, 4 assignments, 1 ungraded lab

16 videos (Total 72 minutes)

Accessing Databricks (1 minute)
Spark Notebooks with Databricks (5 minutes)
Using Data with Databricks (5 minutes)
Working with Workspaces in Databricks (3 minutes)
Advanced Capabilities of Databricks (2 minutes)
PySpark Introduction on Databricks (7 minutes)
Exploring Databricks Azure Features (4 minutes)
Using the DBFS to AutoML Workflow (4 minutes)
Load, Register and Deploy ML Models (3 minutes)
Databricks Model Registry (3 minutes)
Model Serving on Databricks (2 minutes)
What is MLOps? (13 minutes)
Exploring Open-Source MLFlow Frameworks (6 minutes)
Running MLFlow with Databricks (6 minutes)
End to End Databricks MLFlow (4 minutes)
Databricks Autologging with MLFlow (4 minutes)

7 readings (Total 70 minutes)

What is Azure Databricks? (10 minutes)
Introduction to Databricks Machine Learning (10 minutes)
What is the Databricks File System (DBFS)? (10 minutes)
Serverless Compute with Databricks (10 minutes)
MLOps Workflow on Azure Databricks (10 minutes)
Run MLFlow Projects on Azure Databricks (10 minutes)
Databricks Autologging (10 minutes)

4 assignments (Total 120 minutes)

DataBricks (30 minutes)
PySpark SQL (30 minutes)
PySpark DataFrames (30 minutes)
MLFlow with Databricks (30 minutes)

1 ungraded lab (Total 60 minutes)

ETL-Part-1: Keyword Extractor Tool to HashTag Tool (60 minutes)

In this module, you will explore the concepts of Kaizen, DevOps, and DataOps and how these methodologies synergistically contribute to efficient and seamless data engineering workflows. Through practical examples, you will learn how Kaizen's continuous improvement philosophy, DevOps' collaborative practices, and DataOps' focus on data quality and integration converge to enhance the development, deployment, and management of data engineering platforms. By the end of this week, you will have the knowledge and perspective needed to optimize data engineering processes and deliver scalable, reliable, and high-quality solutions.

Included

21 videos, 7 readings, 4 assignments, 1 ungraded lab

21 videos (Total 502 minutes)

Kaizen Methodology for Data (4 minutes)
Introducing GitHub CodeSpaces (9 minutes)
Compiling Python in GitHub Codespaces (18 minutes)
Walking through Sagemaker Studio Lab (29 minutes)
Pytest Master Class (Optional) (166 minutes)
What is DevOps? (2 minutes)
DevOps Key Concepts (36 minutes)
Continuous Integration Overview (32 minutes)
Build an NLP in Cloud9 with Python (43 minutes)
Build a Continuously Deployed Containerized FastAPI Microservice (44 minutes)
Hugo Continuous Deploy on AWS (19 minutes)
Container Based Continuous Delivery (9 minutes)
What is DataOps? (1 minute)
DataOps and MLOps with Snowflake (62 minutes)
Building Cloud Pipelines with Step Functions and Lambda (17 minutes)
What is a Data Lake? (2 minutes)
Data Warehouse vs. Feature Store (2 minutes)
Big Data Challenges (1 minute)
Types of Big Data Processing (1 minute)
Real-World Data Engineering Pipeline (2 minutes)
Data Feedback Loop (1 minute)

7 readings (Total 70 minutes)

GitHub Codespaces Overview (10 minutes)
Getting Started with Amazon SageMaker Studio Lab (10 minutes)
Teaching MLOps at Scale with GitHub (Optional) (10 minutes)
Getting Started with DevOps and Cloud Computing (10 minutes)
Benefits of Serverless ETL Technologies (10 minutes)
Next Steps (10 minutes)
Share your learning experience (10 minutes)

4 assignments (Total 120 minutes)

DataOps and Operations Methodologies (30 minutes)
Kaizen Methodology (30 minutes)
DevOps (30 minutes)
DataOps (30 minutes)

1 ungraded lab (Total 60 minutes)

ETL-Part2: SQLite ETL Destination (60 minutes)

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Instructor ratings

(20 ratings)

Noah Gift

Duke University

40 courses, 282,310 learners

Offered by

Duke University

Learner reviews

5 stars: 52.11%
4 stars: 19.71%
3 stars: 8.45%
2 stars: 8.45%
1 star: 11.26%

Showing 3 of 71

Reviewed on Jan 15, 2024

A course that cover all aspects basic of data engineer, i love it

Reviewed on Aug 6, 2024

Great course, detailed steps by step walkthrough that really simplifies understanding

Frequently Asked Questions

To access the course materials, assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.