Machine Learning: Clustering & Retrieval

This course is part of Machine Learning Specialization

Instructors: Emily Fox

101,367 already enrolled

6 modules

Gain insight into a topic and learn the fundamentals.

2,369 reviews

2 weeks to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

91%

Most learners liked this course

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

15 assignments

Taught in English

Build your subject-matter expertise

This course is part of the Machine Learning Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 6 modules in this course

Case Studies: Finding Similar Documents

A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? Moreover, what if there are millions of other documents? Each time you want to retrieve a new document, do you need to search through all other documents? How do you group similar documents together? How do you discover new, emerging topics that the documents cover? In this third case study, finding similar documents, you will examine similarity-based algorithms for retrieval. In this course, you will also examine structured representations for describing the documents in the corpus, including clustering and mixed membership models, such as latent Dirichlet allocation (LDA). You will implement expectation maximization (EM) to learn the document clusterings, and see how to scale the methods using MapReduce.

Learning Outcomes: By the end of this course, you will be able to:
- Create a document retrieval system using k-nearest neighbors.
- Identify various similarity metrics for text data.
- Reduce computations in k-nearest neighbor search by using KD-trees.
- Produce approximate nearest neighbors using locality sensitive hashing.
- Compare and contrast supervised and unsupervised learning tasks.
- Cluster documents by topic using k-means.
- Describe how to parallelize k-means using MapReduce.
- Examine probabilistic clustering approaches using mixture models.
- Fit a mixture of Gaussians model using expectation maximization (EM).
- Perform mixed membership modeling using latent Dirichlet allocation (LDA).
- Describe the steps of a Gibbs sampler and how to use its output to draw inferences.
- Compare and contrast initialization techniques for non-convex optimization objectives.
- Implement these techniques in Python.

Clustering and retrieval are some of the most high-impact machine learning tools out there. Retrieval is used in almost every application and device we interact with, such as providing a set of products related to the one a shopper is currently considering, or a list of people you might want to connect with on a social media platform. Clustering can be used to aid retrieval, but is a more broadly useful tool for automatically discovering structure in data, like uncovering groups of similar patients.

This introduction to the course provides you with an overview of the topics we will cover and the background knowledge and resources we assume you have.

What's included

4 videos, 5 readings

4 videos (total 25 minutes)

Welcome and introduction to clustering and retrieval tasks (6 minutes)
Course overview (3 minutes)
Module-by-module topics covered (9 minutes)
Assumed background (6 minutes)

5 readings (total 45 minutes)

Important Update regarding the Machine Learning Specialization (10 minutes)
Slides presented in this module (10 minutes)
Software tools you'll need for this course (10 minutes)
A big week ahead! (10 minutes)
Get help and meet other learners. Join your Community! (5 minutes)

We start the course by considering a retrieval task of fetching a document similar to one someone is currently reading. We cast this problem as one of nearest neighbor search, which is a concept we have seen in the Foundations and Regression courses. However, here, you will take a deep dive into two critical components of the algorithms: the data representation and metric for measuring similarity between pairs of datapoints. You will examine the computational burden of the naive nearest neighbor search algorithm, and instead implement scalable alternatives using KD-trees for handling large datasets and locality sensitive hashing (LSH) for providing approximate nearest neighbors, even in high-dimensional spaces. You will explore all of these ideas on a Wikipedia dataset, comparing and contrasting the impact of the various choices you can make on the nearest neighbor results produced.
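As a rough illustration of the naive search described above, the following sketch runs brute-force k-nearest neighbor retrieval with cosine distance over raw word counts. The function names and the toy corpus are assumptions made for this example; the course itself works with a Wikipedia dataset and other representations such as tf-idf.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a document as a sparse word-count vector (word -> count)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def k_nearest(query, corpus, k):
    """Brute-force k-NN: every document is scored, so cost grows with corpus size."""
    scored = sorted((cosine_distance(query, doc), i) for i, doc in enumerate(corpus))
    return [i for _, i in scored[:k]]

# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "a cat and a dog played",
    "stock markets fell sharply today",
    "the dog sat on the mat",
]
vectors = [bag_of_words(d) for d in docs]
neighbors = k_nearest(bag_of_words("the cat sat on a mat"), vectors, k=2)
print(neighbors)
```

Because every query touches every document, this approach is what KD-trees and locality sensitive hashing aim to avoid on large corpora.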

What's included

22 videos, 4 readings, 5 assignments

22 videos (total 137 minutes)

Retrieval as k-nearest neighbor search (3 minutes)
1-NN algorithm (3 minutes)
k-NN algorithm (7 minutes)
Document representation (6 minutes)
Distance metrics: Euclidean and scaled Euclidean (7 minutes)
Writing (scaled) Euclidean distance using (weighted) inner products (4 minutes)
Distance metrics: Cosine similarity (9 minutes)
To normalize or not and other distance considerations (7 minutes)
Complexity of brute force search (2 minutes)
KD-tree representation (10 minutes)
NN search with KD-trees (7 minutes)
Complexity of NN search with KD-trees (6 minutes)
Visualizing scaling behavior of KD-trees (4 minutes)
Approximate k-NN search using KD-trees (8 minutes)
Limitations of KD-trees (4 minutes)
LSH as an alternative to KD-trees (4 minutes)
Using random lines to partition points (6 minutes)
Defining more bins (3 minutes)
Searching neighboring bins (9 minutes)
LSH in higher dimensions (4 minutes)
(OPTIONAL) Improving efficiency through multiple tables (23 minutes)
A brief recap (2 minutes)

4 readings (total 40 minutes)

Slides presented in this module (10 minutes)
Choosing features and metrics for nearest neighbor search (10 minutes)
(OPTIONAL) A worked-out example for KD-trees (10 minutes)
Implementing Locality Sensitive Hashing from scratch (10 minutes)

5 assignments (total 150 minutes)

Representations and metrics (30 minutes)
Choosing features and metrics for nearest neighbor search (30 minutes)
KD-trees (30 minutes)
Locality Sensitive Hashing (30 minutes)
Implementing Locality Sensitive Hashing from scratch (30 minutes)
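The locality sensitive hashing ideas listed in this module (random lines to partition points, more bins, searching neighboring bins) can be sketched roughly as follows. This is a minimal sketch that assumes random hyperplanes through the origin, a single hash table, and non-zero vectors; all function names and the toy data are hypothetical and are not taken from the course assignments.

```python
import math
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes through the origin (Gaussian normals)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def bin_index(vector, hyperplanes):
    """Pack one bit per hyperplane: 1 if the vector is on its non-negative side."""
    index = 0
    for plane in hyperplanes:
        score = sum(p * x for p, x in zip(plane, vector))
        index = (index << 1) | (1 if score >= 0 else 0)
    return index

def build_table(data, hyperplanes):
    """Map each bin index to the list of data point indices that fall in it."""
    table = {}
    for i, vec in enumerate(data):
        table.setdefault(bin_index(vec, hyperplanes), []).append(i)
    return table

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def lsh_query(query, data, table, hyperplanes, search_neighbors=True):
    """Approximate 1-NN: scan the query's bin and, optionally, bins one bit away."""
    home = bin_index(query, hyperplanes)
    bins = [home]
    if search_neighbors:
        bins += [home ^ (1 << b) for b in range(len(hyperplanes))]
    candidates = [i for b in bins for i in table.get(b, [])]
    if not candidates:
        return None  # nothing hashed nearby; a caller could widen the search
    return min(candidates, key=lambda i: cosine_distance(query, data[i]))

# Hypothetical toy data, for illustration only.
data = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.9], [0.1, 0.8, 1.0]]
planes = make_hyperplanes(num_bits=4, dim=3)
table = build_table(data, planes)
result = lsh_query(data[2], data, table, planes)
```

Only the points in the visited bins are compared against the query, which is where the speedup over brute-force search comes from, at the cost of possibly missing the true nearest neighbor.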

In clustering, our goal is to group the datapoints in our dataset into disjoint sets. Motivated by our document analysis case study, you will use clustering to discover thematic groups of articles by "topic". These topics are not provided in this unsupervised learning task; rather, the idea is to output cluster labels that can be associated after the fact with known topics like "Science", "World News", etc. Even without such labels, you will examine how the clustering output can provide insights into the relationships between datapoints in the dataset. The first clustering algorithm you will implement is k-means, one of the most widely used clustering algorithms. To scale up k-means, you will learn about the general MapReduce framework for parallelizing and distributing computations, and then how the iterations of k-means can use this framework. You will show that k-means can provide an interpretable grouping of Wikipedia articles when appropriately tuned.
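The k-means algorithm mentioned above alternates between assigning points to their closest center and recomputing each center as the mean of its points. The sketch below is one plain-Python way to write that loop; it assumes the initial centers are passed in explicitly (no k-means++ initialization), and the function names and toy points are made up for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_clusters(points, centers):
    """Assignment step: label each point with the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]

def update_centers(points, labels, centers):
    """Update step: move each center to the mean of the points assigned to it."""
    new_centers = []
    for j, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centers.append(old_center)  # keep the center of an empty cluster
        else:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
    return new_centers

def kmeans(points, centers, max_iter=100):
    """Alternate the two steps until the assignments stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign_clusters(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
        centers = update_centers(points, labels, centers)
    return centers, labels

# Hypothetical toy data: two well-separated groups in 2-D.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = kmeans(points, centers=[points[0], points[3]])
```

The result depends on the initial centers, which is why the module also covers smarter initialization.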

What's included

13 videos, 2 readings, 3 assignments

13 videos (total 79 minutes)

The goal of clustering (3 minutes)
An unsupervised task (7 minutes)
Hope for unsupervised learning, and some challenge cases (4 minutes)
The k-means algorithm (8 minutes)
k-means as coordinate descent (6 minutes)
Smart initialization via k-means++ (5 minutes)
Assessing the quality and choosing the number of clusters (9 minutes)
Motivating MapReduce (9 minutes)
The general MapReduce abstraction (5 minutes)
MapReduce execution overview and combiners (6 minutes)
MapReduce for k-means (7 minutes)
Other applications of clustering (7 minutes)
A brief recap (1 minute)

2 readings (total 20 minutes)

Slides presented in this module (10 minutes)
Clustering text data with k-means (10 minutes)

3 assignments (total 76 minutes)

k-means (30 minutes)
Clustering text data with K-means (16 minutes)
MapReduce for k-means (30 minutes)
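One way to picture how a single k-means iteration could fit the MapReduce pattern covered in this module: the map step emits a (cluster index, point) pair for each point, and the reduce step averages the points that share a key. The sketch below simulates both phases in a single Python process; it is an assumption-laden illustration with hypothetical function names, not a distributed implementation or the course's own code.

```python
from collections import defaultdict

def map_phase(points, centers):
    """Map: emit (closest center index, (point, 1)) for every data point."""
    for p in points:
        closest = min(
            range(len(centers)),
            key=lambda j: sum((x - y) ** 2 for x, y in zip(p, centers[j])),
        )
        yield closest, (p, 1)

def reduce_phase(pairs, dim):
    """Reduce: sum vectors and counts per key, then divide to get new centers."""
    sums = defaultdict(lambda: [0.0] * dim)
    counts = defaultdict(int)
    for key, (vec, n) in pairs:
        sums[key] = [s + v for s, v in zip(sums[key], vec)]
        counts[key] += n
    return {key: [s / counts[key] for s in sums[key]] for key in sums}

# Hypothetical toy data: one iteration starting from two given centers.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
new_centers = reduce_phase(map_phase(points, [[0, 0], [10, 10]]), dim=2)
```

Because each (sum, count) pair can be combined in any order, partial sums could in principle be computed on separate machines before the final division.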

In k-means, each observation is hard-assigned to a single cluster, and these assignments are based only on the cluster centers, without incorporating shape information. In our second module on clustering, you will perform probabilistic model-based clustering that (1) provides a more descriptive notion of a "cluster" and (2) accounts for uncertainty in the assignment of datapoints to clusters via "soft assignments". You will explore and implement a broadly useful algorithm called expectation maximization (EM) for inferring these soft assignments, as well as the model parameters. To gain intuition, you will first consider a visually appealing image clustering task. You will then cluster Wikipedia articles, handling the high dimensionality of the tf-idf document representation.
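To make the idea of soft assignments concrete, here is a minimal sketch of EM for a mixture of one-dimensional Gaussians. The E-step computes each cluster's responsibility for each point and the M-step re-estimates weights, means, and variances from those responsibilities. Restricting to one dimension, the variance floor `min_var`, and the toy data are simplifying assumptions for this example; the module itself covers the multivariate case.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def e_step(data, weights, means, variances):
    """E-step: soft assignments (responsibilities) of each point to each cluster."""
    resp = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp

def m_step(data, resp, min_var=1e-6):
    """M-step: re-estimate parameters using responsibilities as soft counts."""
    weights, means, variances = [], [], []
    for j in range(len(resp[0])):
        soft_count = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / soft_count
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / soft_count
        weights.append(soft_count / len(data))
        means.append(mean)
        variances.append(max(var, min_var))  # floor guards against collapsing variance
    return weights, means, variances

def em(data, weights, means, variances, num_iter=50):
    """Run a fixed number of EM iterations from the given starting parameters."""
    for _ in range(num_iter):
        resp = e_step(data, weights, means, variances)
        weights, means, variances = m_step(data, resp)
    return weights, means, variances, resp

# Hypothetical toy data: two groups on the real line.
data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
weights, means, variances, resp = em(data, [0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
```

Each row of `resp` sums to one, so a point lying between two clusters would receive fractional membership in both rather than a single hard label.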

What's included

15 videos, 4 readings, 3 assignments

15 videos (total 91 minutes)

Motivating probabilistic clustering models (8 minutes)
Aggregating over unknown classes in an image dataset (7 minutes)
Univariate Gaussian distributions (3 minutes)
Bivariate and multivariate Gaussians (7 minutes)
Mixture of Gaussians (7 minutes)
Interpreting the mixture of Gaussian terms (6 minutes)
Scaling mixtures of Gaussians for document clustering (5 minutes)
Computing soft assignments from known cluster parameters (7 minutes)
(OPTIONAL) Responsibilities as Bayes' rule (5 minutes)
Estimating cluster parameters from known cluster assignments (7 minutes)
Estimating cluster parameters from soft assignments (8 minutes)
EM iterates in equations and pictures (7 minutes)
Convergence, initialization, and overfitting of EM (9 minutes)
Relationship to k-means (3 minutes)
A brief recap (2 minutes)

4 readings (total 40 minutes)

Slides presented in this module (10 minutes)
(OPTIONAL) A worked-out example for EM (10 minutes)
Implementing EM for Gaussian mixtures (10 minutes)
Clustering text data with Gaussian mixtures (10 minutes)

3 assignments (total 90 minutes)

EM for Gaussian mixtures (30 minutes)
Implementing EM for Gaussian mixtures (30 minutes)
Clustering text data with Gaussian mixtures (30 minutes)

The clustering model inherently assumes that data divide into disjoint sets, e.g., documents by topic. But often our data objects are better described via memberships in a collection of sets, e.g., multiple topics. In our fourth module, you will explore latent Dirichlet allocation (LDA) as an example of such a mixed membership model that is particularly useful in document analysis. You will interpret the output of LDA and explore various ways the output can be used, such as a set of learned document features. The mixed membership modeling ideas you learn about through LDA for document analysis carry over to many other interesting models and applications, like social network models where people have multiple affiliations.

Throughout this module, we introduce aspects of Bayesian modeling and a Bayesian inference algorithm called Gibbs sampling. You will be able to implement a Gibbs sampler for LDA by the end of the module.
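For readers who want a feel for what a Gibbs sampler for LDA involves, the following is a minimal sketch of the commonly used collapsed variant, in which each word's topic is resampled from a distribution built from the current count tables. The hyperparameter values `alpha` and `beta`, the integer word ids, the function name, and the toy corpus are all assumptions for illustration; this is not the course's reference implementation.

```python
import random

def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, num_iter=50, seed=0):
    """Resample the topic of every word in turn, conditioned on all other words."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]              # words per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # words per (topic, word)
    topic_total = [0] * num_topics                            # words per topic
    assignments = []

    # Random initial topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(num_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional: (doc likes topic) * (topic likes word).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the new assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Hypothetical toy corpus: each document is a list of integer word ids.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4], [0, 1, 3, 4]]
doc_topic, topic_word = lda_collapsed_gibbs(docs, num_topics=2, vocab_size=5)
```

The returned count tables are one sample from the chain; normalizing the rows of `doc_topic` (after adding `alpha`) gives a rough estimate of each document's topic proportions.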

What's included

12 videos, 2 readings, 3 assignments

12 videos (total 58 minutes)

Mixed membership models for documents (4 minutes)
An alternative document clustering model (5 minutes)
Components of latent Dirichlet allocation model (3 minutes)
Goal of LDA inference (5 minutes)
The need for Bayesian inference (5 minutes)
Gibbs sampling from 10,000 feet (5 minutes)
A standard Gibbs sampler for LDA (10 minutes)
What is collapsed Gibbs sampling? (3 minutes)
A worked example for LDA: Initial setup (4 minutes)
A worked example for LDA: Deriving the resampling distribution (8 minutes)
Using the output of collapsed Gibbs sampling (4 minutes)
A brief recap (2 minutes)

2 readings (total 20 minutes)

Slides presented in this module (10 minutes)
Modeling text topics with Latent Dirichlet Allocation (10 minutes)

3 assignments (total 84 minutes)

Latent Dirichlet Allocation (30 minutes)
Learning LDA model via Gibbs sampling (30 minutes)
Modeling text topics with Latent Dirichlet Allocation (24 minutes)

In the conclusion of the course, we will recap what we have covered. This includes both techniques specific to clustering and retrieval and foundational machine learning concepts that are more broadly useful.

We provide a quick tour of an alternative clustering approach called hierarchical clustering, which you will experiment with on the Wikipedia dataset. Following this exploration, we discuss how clustering-type ideas can be applied in other areas like segmenting time series. We then briefly outline some important clustering and retrieval ideas that we did not cover in this course.

We conclude with an overview of what's in store for you in the rest of the specialization.
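As a small illustration of hierarchical clustering, the sketch below performs agglomerative clustering: every point starts as its own cluster and the two closest clusters are merged repeatedly. Using single linkage (the distance between the closest pair of members) is an assumption made for this example, as are the function names and toy data; the module also discusses divisive clustering, which is not shown here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def single_linkage(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    Returns the final clusters (lists of point indices) and the merge history,
    which holds the information a dendrogram would display.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair across two clusters.
        return min(squared_distance(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], dist ** 0.5))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

# Hypothetical toy data: five points on a line.
points = [[0], [1], [5], [6], [20]]
clusters, merges = single_linkage(points, num_clusters=2)
```

Stopping at a different `num_clusters` corresponds to cutting the merge history at a different height.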

What's included

12 videos, 2 readings, 1 assignment

12 videos (total 62 minutes)

Module 1 recap (10 minutes)
Module 2 recap (3 minutes)
Module 3 recap (6 minutes)
Module 4 recap (7 minutes)
Why hierarchical clustering? (2 minutes)
Divisive clustering (4 minutes)
Agglomerative clustering (3 minutes)
The dendrogram (5 minutes)
Agglomerative clustering details (7 minutes)
Hidden Markov models (9 minutes)
What we didn't cover (3 minutes)
Thank you! (2 minutes)

2 readings (total 20 minutes)

Slides presented in this module (10 minutes)
Modeling text data with a hierarchy of clusters (10 minutes)

1 assignment (total 6 minutes)

Modeling text data with a hierarchy of clusters (6 minutes)

Instructors

Instructor ratings

(97 ratings)

Emily Fox

University of Washington

6 Courses, 499,801 learners

Carlos Guestrin

University of Washington

8 Courses, 500,600 learners

Offered by

University of Washington

Learner reviews

5 stars
74.37%
4 stars
19.12%
3 stars
4.68%
2 stars
0.75%
1 star
1.05%

Reviewed on Dec 15, 2019

Excellent course. I liked the material and the assignments are great to consolidate the learning. I really liked the recap videos to solidify even more what I learned.

Reviewed on Aug 3, 2020

A challenging course!!! It's necessary to fix some compatibility problems with Tury and Windows, because Python 2.7 it's obsolete. I really enjoy it!!!

Reviewed on Jan 6, 2019

This was a really good course, It made me familiar with many tools and techniques used in ML. With this in hand I will be able to go out there and explore and understand things much better.

Frequently asked questions

To access the course materials, assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.