Fundamentals of Scalable Data Science

This course is part of Advanced Data Science with IBM Specialization

Taught in English

Some content may not be translated

Instructor: Romeo Kienzler

78,210 already enrolled

Included with Coursera Plus

Course

Gain insight into a topic and learn the fundamentals

4.3 (2,051 reviews)

86%

Beginner level

No prior experience required

27 hours (approximately)

Flexible schedule

Learn at your own pace

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

8 quizzes

Build your subject-matter expertise

This course is part of the Advanced Data Science with IBM Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.

There are 4 modules in this course

Apache Spark is the de facto standard for large-scale data processing. This is the first course in the series that makes up the IBM Advanced Data Science Specialization. We strongly believe that it is crucial for success to start by learning a scalable data science platform, since memory and CPU constraints are the most limiting factors when it comes to building advanced machine learning models.

In this course we teach you the fundamentals of Apache Spark using Python and PySpark. We introduce Apache Spark in the first two weeks, and in the last two weeks you learn how to apply it to basic exploratory and data pre-processing tasks. Through this exercise you'll also be introduced to the most fundamental statistical measures and data visualization technologies. This gives you enough knowledge to take on the role of a data engineer in any modern environment. It also gives you the basis for advancing your career towards data science.

Please have a look at the full specialization curriculum: https://www.coursera.org/specializations/advanced-data-science-ibm

If you choose to take this course and earn the Coursera course certificate, you will also earn an IBM digital badge. To find out more about IBM digital badges, follow the link ibm.biz/badging.

After completing this course, you will be able to:
• Describe how basic statistical measures are used to reveal patterns within the data
• Recognize data characteristics, patterns, trends, deviations or inconsistencies, and potential outliers
• Identify useful techniques for working with big data, such as dimension reduction and feature selection methods
• Use advanced tools and charting libraries to:
  o Improve the efficiency of big data analysis with partitioning and parallel analysis
  o Visualize the data in a number of 2D and 3D formats (Box Plot, Run Chart, Scatter Plot, Pareto Chart, and Multidimensional Scaling)

For successful completion of the course, the following prerequisites are recommended:
• Basic programming skills in Python
• Basic math
• Basic SQL (you can get it easily from https://www.coursera.org/learn/sql-data-science if needed)

The following technologies are used in this course. They are introduced as necessary, so no previous knowledge is required:
• Jupyter notebooks (brought to you by IBM Watson Studio for free)
• Apache Spark (brought to you by IBM Watson Studio for free)
• Python

Some learners have reported that some of the material in this course is too advanced. If you feel the same, please have a look at the following free materials before starting this course; learners have told us that this really helps. Of course, you can also give this course a try first and take the following courses / materials only if you need them.
https://cognitiveclass.ai/learn/spark
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/f8982db1-5e55-46d6-a272-fd11b670be38/view?access_token=533a1925cd1c4c362aabe7b3336b3eae2a99e0dc923ec0775d891c31c5bbbc68

This course takes four weeks, at 4-6 hours per week.

What's included

2 videos, 2 readings, 1 quiz, 2 programming assignments, 2 app items

2 videos (total 2 minutes)

Course Overview and a warm welcome (1 minute)
Overview of technology used within the course (1 minute)

2 readings (total 20 minutes)

Intro to Apache Spark (10 minutes)
IMPORTANT: How to submit your programming assignments (10 minutes)

1 quiz (total 30 minutes)

Challenges, terminology, methods and technology (30 minutes)

2 programming assignments (total 240 minutes)

Week 1 Programming Assignment 1 (180 minutes)
Week 1 Programming Assignment 2 (60 minutes)

2 app items (total 120 minutes)

Hands-on Lab: Graded Week 1 Programming Assignment 1 (60 minutes)
Hands-on Lab: Graded Week 1 Programming Assignment 2 (60 minutes)

What's included

7 videos, 2 readings, 3 quizzes, 1 programming assignment, 1 app item

7 videos (total 46 minutes)

Data storage solutions (5 minutes)
Parallel data processing strategies of Apache Spark (7 minutes)
Programming language options on ApacheSpark (9 minutes)
Functional programming basics (6 minutes)
Introduction of Cloudant (2 minutes)
Resilient Distributed Dataset and DataFrames - ApacheSparkSQL (6 minutes)
OPTIONAL: Test Data Generator (data is provided for you already) (8 minutes)

2 readings (total 52 minutes)

Apache Parquet (optional) (42 minutes)
Create the data on your own (optional) (10 minutes)

3 quizzes (total 72 minutes)

Data storage solutions, and ApacheSpark (30 minutes)
Programming language options and functional programming (30 minutes)
ApacheSparkSQL and Cloudant (12 minutes)

1 programming assignment (total 180 minutes)

Week 2 Programming Assignment (180 minutes)

1 app item (total 60 minutes)

Hands-on Lab: Graded Week 2 Programming Assignment (60 minutes)

What's included

7 videos, 1 reading, 3 quizzes, 1 programming assignment, 1 app item

7 videos (total 34 minutes)

Overview of the week... (1 minute)
Averages (5 minutes)
Standard deviation (3 minutes)
Skewness (3 minutes)
Kurtosis (2 minutes)
Covariance, Covariance matrices, correlation (13 minutes)
Multidimensional vector spaces (5 minutes)

1 reading (total 10 minutes)

Exercise 2 (10 minutes)

3 quizzes (total 90 minutes)

Averages and standard deviation (30 minutes)
Skewness and kurtosis (30 minutes)
Covariance, correlation and multidimensional Vector Spaces (30 minutes)

1 programming assignment (total 180 minutes)

Programming Assignment 3 (180 minutes)

1 app item (total 60 minutes)

Hands-on Lab: Graded Week 3 Programming Assignment (60 minutes)
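
The measures listed above (averages, standard deviation, skewness, kurtosis) can be illustrated with a minimal sketch. It is not taken from the course material: it uses plain Python with functools.reduce instead of Apache Spark so that it runs anywhere, and the function name moments and the sample data are hypothetical. It only mirrors the map-and-reduce style that the course description associates with parallel analysis.

```python
from functools import reduce
from math import sqrt


def moments(values):
    """Return (mean, standard deviation, skewness, kurtosis) of a list.

    Hypothetical helper for illustration. Every statistic is built from
    map and reduce steps only, a pattern that a distributed engine could
    evaluate partition by partition.
    """
    n = len(values)
    mean = reduce(lambda a, b: a + b, values) / n

    def central_moment(k):
        # k-th central moment: the average of (x - mean) ** k
        powered = map(lambda x: (x - mean) ** k, values)
        return reduce(lambda a, b: a + b, powered) / n

    std = sqrt(central_moment(2))            # population standard deviation
    skewness = central_moment(3) / std ** 3  # asymmetry of the distribution
    kurtosis = central_moment(4) / std ** 4  # plain kurtosis, not "excess"
    return mean, std, skewness, kurtosis


# Made-up sample data, chosen only so the numbers are easy to check by hand.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std, skewness, kurtosis = moments(sample)
print(mean, std, skewness, kurtosis)
```

The positive skewness reflects the longer tail towards the larger values in the sample. On a real cluster the same sums would presumably be computed with Spark's own aggregation functions rather than with Python's reduce.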

What's included

4 videos, 8 readings, 1 quiz, 1 programming assignment, 2 app items, 1 plugin

4 videos (total 23 minutes)

Overview of the week (0 minutes)
Plotting with ApacheSpark and python's matplotlib (12 minutes)
Dimensionality reduction (4 minutes)
PCA (5 minutes)

8 readings (total 80 minutes)

Exercise on Plotting (10 minutes)
Exercise on PCA (10 minutes)
Assignment and Exercise Environment Setup (10 minutes)
(Optional) Week 1: Setup the ApacheSpark and Jupyter notebook in Watson Studio Assignment 1_1 (10 minutes)
(Optional) Week 1: Setup Programming Assignment 1_2 in Watson Studio (10 minutes)
(Optional) Week 2: Setup Programming Assignment in Watson Studio (10 minutes)
(Optional) Week 3: Setup Programming Assignment in Watson Studio (10 minutes)
(Optional) Week 4: Setup Programming Assignment in Watson Studio (10 minutes)

1 quiz (total 30 minutes)

Visualization and dimension reduction (30 minutes)

1 programming assignment (total 180 minutes)

Programming Assignment Week 4 (180 minutes)

2 app items (total 120 minutes)

Hands-on Lab: Graded Week 4 Programming Assignment (60 minutes)
[OPTIONAL] Obtain an IBM Cloud Feature Code (60 minutes)

1 plugin (total 15 minutes)

[OPTIONAL] Lab: Create an IBM Cloud Account (15 minutes)

Instructor

Instructor ratings

4.3 (321 ratings)

Romeo Kienzler

IBM

10 Courses, 640,180 learners

Offered by

IBM

Learner reviews

4.3 (2,051 reviews)

5 stars: 57.59%
4 stars: 25.51%
3 stars: 8.95%
2 stars: 3.99%
1 star: 3.94%

Frequently asked questions

If you have started a course that depends on IBM Bluemix and your trial has expired, you can continue taking the course in the same environment by providing your credit card information. To avoid being charged, close any application instances you are not using and keep an eye on the usage details of your environment.

Alternatively, you can export any projects you are working on. Then you can register for a new trial using a different email account that has not been used on IBM Bluemix before. Finally, import the projects into the new account.

When exporting your projects, for Node-RED use the process used when submitting assignments (export the flow from the old project, then import it to the new project via the clipboard). For Node.js you can redeploy the code to Bluemix using your new account credentials.

If you have customized your Git repository or registered devices, migrating to a new environment will require you to redo those steps so that they are reflected in the new environment.

If you already have an IBM Bluemix account, but your trial period has expired, you can always create a new account with a different email address.

Access to lectures and assignments depends on your type of enrollment. If you take a course in audit mode, you will be able to see most course materials for free. To access graded assignments and to earn a Certificate, you will need to purchase the Certificate experience, during or after your audit. If you don't see the audit option:

The course may not offer an audit option. You can try a Free Trial instead, or apply for Financial Aid.
The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile. If you only want to read and view the course content, you can audit the course for free.

There are 4 modules in this course

Module 1: Introduction to the course and grading environment

Module 2: Tools that support BigData solutions

Module 3: Scaling Math for Statistics on Apache Spark

Module 4: Data Visualization of Big Data

Frequently asked questions

I am in the middle of taking the course, and my IBM Bluemix trial has expired. What do I do now?

I am about to start the course, but my IBM Bluemix trial has expired. How do I proceed with this course?

When will I have access to the lectures and assignments?

More questions