Now, we will introduce you to Spark MLlib. After this video, you will be able to describe what MLlib is, list the main categories of techniques available in MLlib, and explain code segments containing MLlib algorithms.

MLlib is a scalable machine learning library that runs on top of Spark Core. It provides distributed implementations of commonly used machine learning algorithms and utilities. As with Spark Core, MLlib has APIs for Scala, Java, Python, and R. MLlib offers many algorithms and techniques commonly used in a machine learning process. The main categories are machine learning algorithms, statistics, and utilities for common steps in the machine learning process.

As the name suggests, many machine learning algorithms are available in MLlib. These are algorithms to build models for classification, regression, and clustering. There are also techniques for evaluating the resulting models. For example, you can compute the values for a receiver operating characteristic, or ROC curve, a common statistical technique for plotting the performance of a binary classifier.

Statistical functions are also provided in MLlib. Examples are summary statistics, such as the mean and standard deviation, correlations, and methods to sample a dataset. MLlib also has techniques commonly used in the machine learning process, such as dimensionality reduction and feature transformation methods for preprocessing the data. In short, Spark MLlib offers many techniques often used in a machine learning pipeline.

Let's take a look at an example to compute summary statistics using MLlib. Note that we will use the Spark Python API, similar to the one used for our other examples in this course. Here is the code segment to compute summary statistics for a data set consisting of columns of numbers. Lines of code are in white, and the comments are in orange. The first line imports the Statistics functions from the stat module. The second line creates an RDD of Vectors with the data.
You can think of each Vector as a row in a data matrix. The next line invokes the colStats function to compute summary statistics for each column. The last three lines print out the mean, variance, and number of non-zero entries for each column. As you can see from this example, computing the summary statistics for a data set is very straightforward using MLlib.

Here is another example. Although we will go through the machine learning details in our next course, here we give you a hint of how to use two machine learning techniques: one for classification, and one for clustering.

This code segment shows the six steps to build a decision tree for classification. The first line imports the DecisionTree module, and the second line imports the MLUtils module. The next line fits the DecisionTree to classify the data into two classes. Then the model is printed out, and finally the model is saved to a file.

Here is another MLlib example, this time for clustering. This code segment shows the five-step code to build a k-means clustering model. The first line imports the KMeans module, and the second line imports the array module from NumPy. The next two lines read in the data and parse it using space as the delimiter. Then the k-means model is built by dividing the parsedData into three clusters. Finally, the cluster centers are printed out for each cluster.

In summary, MLlib is Spark's machine learning library. It provides algorithms and techniques that are implemented using distributed processing. The main categories of algorithms and techniques available in MLlib are machine learning, statistics, and utility functions for the machine learning process.