Spark machine learning algorithms. There are basic statistics types and classification and regression types. Under basic statistics we have correlation and hypothesis testing. Under classification and regression there are linear models, which include the support vector machine, linear regression, and logistic regression, plus Naive Bayes, decision trees, and others; further algorithms include collaborative filtering, clustering (one of the most famous being k-means), dimensionality reduction, and more. Let's look into that list.

First, we'll start with correlation. Variables can be independent, weakly correlated, or strongly correlated. Correlation indicates the extent to which variables influence each other, increasing or decreasing together. For example, in the case of the blue dots over there, for the x- and y-axes that you see, the blue dots are strongly correlated. How do we know? Well, they lie along a diagonal, which means that when x increases, the y value will most likely be large as well. Compared to these, the red dots show a weaker correlation between x and y. You can see this because a large x does not mean that the y value will be large as well; that's why the data values are spread out much more. If the dots were spread out even more, the correlation between x and y would be even weaker, all the way until they are perhaps uncorrelated, in which case you would see them spread out even further.

The hypothesis testing method is based on the following idea: we define a statistic that obeys a certain distribution if the hypothesis is correct, then we collect samples and calculate the statistic's probability. If the sample statistic shows a probability lower than the threshold of being drawn from this distribution, then the hypothesis is rejected.
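The correlation idea above can be sketched in plain Python with the Pearson correlation coefficient. The data points below are made up to mimic the "blue dots" (diagonal, strongly correlated) and "red dots" (scattered, weakly correlated) from the figure.

```python
# Minimal Pearson correlation sketch; the data values are illustrative only.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# "Blue dots": y grows with x along a diagonal -> strong correlation.
blue_x = [1, 2, 3, 4, 5]
blue_y = [1.1, 2.0, 2.9, 4.2, 5.1]

# "Red dots": y is scattered regardless of x -> weak correlation.
red_x = [1, 2, 3, 4, 5]
red_y = [3.0, 1.0, 4.0, 1.5, 3.5]

print(pearson(blue_x, blue_y))  # close to 1.0
print(pearson(red_x, red_y))    # much closer to 0
```

A coefficient near 1 (or -1) means the diagonal pattern is tight; a coefficient near 0 means knowing x tells you little about y.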
For example, if the probability is lower than the threshold π shown right there — which, for the statistic t, means that t is larger than the value ω shown right down there — then the hypothesis is rejected, like you see down there.

Linear models are frequently used in regression. They include the support vector machine, logistic regression, linear regression, and others. What is regression? Regression is a method to predict the value of one or more continuous target variables y given a D-dimensional vector input x. So it's used in prediction, and you know there are so many places where you can use prediction, so it is very powerful and frequently used.

Let's first look into linear regression, which is the simplest regression model. The output of linear regression is continuous. The output y is obtained from a linear combination of the input variables x and weights w, as you can see right down there. Here is an example of linear regression: the red dots over here are the input values and the blue curve is the linear regression output model.

Logistic regression is different. It is used in classification problems that need to make a decision, where we decide among two options or multiple options. For example, deciding between zero and one using a sigmoid curve is a typical logistic regression example; deciding among multiple options is of course also possible. A sigmoid S-curve, which you see right here, with a decision boundary is frequently used to make a decision. We have a decision boundary right here in the middle, this yellow line, and based on it, if a point is on the left it's classified as zero, and if it's on the right it's classified as one. Small changes on this curve near the threshold push the output either up or down, so the curve is sensitive in that area, which is what makes it useful for making a decision.

Then we look at the support vector machine, SVM.
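The two ideas above — a continuous linear-combination output for linear regression, and a sigmoid with a 0.5 decision boundary for logistic regression — can be sketched as follows. The weight and bias values are made up for illustration; in practice they are learned from data.

```python
import math

def linear_output(x, w, b):
    """Linear regression: continuous output from a weighted sum w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sigmoid(z):
    """The S-curve that squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, w=1.0, b=-5.0):
    """Logistic decision: boundary at sigmoid(w*x + b) = 0.5, i.e. at x = 5 here."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

print(linear_output([1.0, 2.0], [0.5, 0.25], 1.0))  # 2.0 (continuous)
print(classify(2))   # left of the boundary -> 0
print(classify(8))   # right of the boundary -> 1
```

Note how the same linear combination sits inside both models; logistic regression just passes it through the sigmoid and thresholds the result.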
This technique is frequently used in solving problems in classification, regression, and novelty detection. The SVM algorithm creates a decision boundary that maximizes the margin between the groups being classified. So, for example, H1 represents another algorithm, say maximum likelihood, which divides and classifies the red dots and the white dots by its own criteria. H2 represents the SVM mechanism, where you can see that it attempts to separate the data sets with the largest margin. The margin would be these areas right here, and therefore the classification result of SVM will differ from that of other algorithms.

We will now look into the Naive Bayes classifier. Conditional independence is assumed in order to simplify the classification decision in Naive Bayes algorithms. What does that mean? Well, Bayes' theorem is based on conditional probability. Let's look at the conditional probability of x given y and z, written over there. This is the probability that x occurs given that y and z occurred earlier. If x is independent of z, then z doesn't need to be in the conditioning part; it can be removed, like you see right here. So removing a variable can be justified by its independence.

In the real world, when you analyze data — marketing, sales, and various parameters — everything is connected to everything. Everything has some level of correlation and dependence; nothing is truly independent. But the relations are too complex to analyze, and putting them into a formula may be even more difficult. So, to simplify computation, the Naive Bayes model "naively" assumes that some events are independent. The pros: it provides fast and easy-to-compute results. The cons: accuracy and reliability are sacrificed. You use it when the resulting accuracy is sufficient for its purpose.
Here is an example of a Naive Bayes classifier conditioned on class c. If it is assumed that the probability distributions of the input variables x1 through xD are independent — which they actually are not — then the class-conditional probability density can be written as a simplified product of one-dimensional probability density functions. Let's find the probability of occurrence of data vector x given class c. Looking at that conditional probability and "naively" assuming conditional independence between the features, we get a form like this, where D is the number of features. Looking further into the formula, a multi-dimensional probability is easily obtained as the product of many one-dimensional densities under the assumption, where originally x is a multi-dimensional variable and the product of one-dimensional densities is used to get the overall multi-dimensional value. This model is called "naive" since in reality these features are not independent, although they were assumed to be. The Naive Bayes classifier actually works very well in many, many cases, and because it simplifies computation it is very popular.

Decision tree technology. A logical region is identified and classified through a sequence of recursive splitting decisions in an effective way. Fewer processing steps are used and each step has lower computation, which is great when you're analyzing big data. The result is a hierarchical tree-shaped structure, and a hierarchical algorithm is used in the supervised learning training of the decision tree. What does this mean — supervised learning? Supervised learning is training used in machine learning systems where we have labeled data, that is, data whose desired output is already known.
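The Naive Bayes factorization described above — a multi-dimensional class-conditional density taken "naively" as the product of one-dimensional densities — can be sketched with Gaussian per-feature densities. The class means and variances below are assumed values for illustration, not learned from data.

```python
import math

def gaussian_pdf(x, mean, var):
    """One-dimensional Gaussian probability density."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def class_likelihood(x, means, variances):
    """p(x | c) = product over d of p(x_d | c), under the independence assumption."""
    p = 1.0
    for xd, m, v in zip(x, means, variances):
        p *= gaussian_pdf(xd, m, v)
    return p

# Two illustrative classes with per-feature means and variances.
classes = {
    "red":   {"means": [1.0, 1.0], "vars": [1.0, 1.0]},
    "white": {"means": [4.0, 4.0], "vars": [1.0, 1.0]},
}

def predict(x):
    """Pick the class whose (naive) likelihood of x is highest."""
    return max(classes,
               key=lambda c: class_likelihood(x, classes[c]["means"], classes[c]["vars"]))

print(predict([1.2, 0.8]))  # near the "red" means -> red
print(predict([3.9, 4.3]))  # near the "white" means -> white
```

The entire multi-dimensional density computation collapses to D cheap one-dimensional evaluations per class, which is exactly where the speed advantage comes from.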
Since the desired output is known, when we use the data as input and compute an output, we know what the desired output should be, so we can compute the error, and we can use the error to train the machine learning system to make it more accurate. That is what we do in backpropagation, where the error is propagated back into the system and we re-tune it; we do the learning process to make the system more fine-tuned and accurate.

A decision tree example is provided here, where for classification, decision boundaries on the dataset of white dots and red dots can be determined using a decision tree that looks like this. We have x larger than H1 as one decision threshold, and then y larger or smaller than H2 can be another decision mechanism. If a point passes the first decision, it is basically in the red domain, and if it passes the second decision boundary, then we can divide it into red or white based on whether it passed or not. Now, placing the decision boundaries on the graph over there, you can see H1: to its right side is where we classify the red dots. Then we use H2: below that blue line is where we classify the red dots. Using H1 and H2 in a decision tree mechanism, we can classify where the red dots are.

Collaborative filtering. This is a machine learning algorithm that collects preference or taste information from many users and uses it to make automated predictions about the interests of other users. Looking into the terminology: because we collect and combine information, that's where we get "collaborative," and then we filter out the less probable options until we find the most probable prediction result — that's why we call it "filtering." Combining these two words is how we get the name collaborative filtering. As an example of collaborative filtering in use, a music vendor can recommend music to a new user based on information about User A, whose characteristics seem similar.
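The two-threshold decision tree described above can be sketched as nested comparisons. The threshold values H1 and H2, and the reading that a point is red exactly when it is right of H1 and below H2, are assumptions chosen to match the figure's description.

```python
# Minimal two-split decision tree sketch; H1, H2 and the red/white rule are assumed.

H1 = 5.0  # first split: is x to the right of H1?
H2 = 3.0  # second split: is y below H2?

def classify(x, y):
    """Recursive splits: x > H1 and then y < H2 classify a point as red."""
    if x > H1:       # first decision boundary (vertical line at H1)
        if y < H2:   # second decision boundary (horizontal line at H2)
            return "red"
    return "white"

print(classify(7.0, 1.0))  # passes both splits -> red
print(classify(7.0, 4.0))  # fails the second split -> white
print(classify(2.0, 1.0))  # fails the first split -> white
```

Each data point is routed down the tree with at most two cheap comparisons, which illustrates why trees are attractive for big data: few processing steps, each with low computation.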
For example, say a new user has entered our domain. We want to recommend some music, but we don't know what the new user would like — there is no reference value. So we find a similar user with similar characteristics, find what that user likes, say Music B, and then recommend that to our new user. This type of result can be obtained through collaborative filtering.

Clustering technology. This is the process of finding similar characteristics in a dataset to form groups of data. The training data consists of a set of input vectors without any corresponding target values; the dataset contains no information, no labels, about which data belongs to which cluster. That is why unsupervised learning is needed. The k-means algorithm is one of the most popular, most famous clustering algorithms that exist, so we'll take a look into it. Unlabeled data is classified into k classes, where the number of classes k is something you specify in your algorithm. "Unlabeled" means that we do not have a desired output label for the input data, so we have to use this unlabeled data to train the system and decide how we're going to do the clustering. The mean, the average of each class, is updated when a new data vector is received, and the mean value is used to update the division of the classes, the clusters.

All data is originally colorless, but I'm going to use yellow as the original color before a point is classified as red or white. The first step is to decide the two centers of the classes randomly. In the figure over there, you see two Xs; they're just placed anywhere. Based on that, in step two we allocate the data into the nearest cluster. So, based on these two Xs that we have here, we go ahead and classify the data, and then we draw a line right here, the blue line, which will be the division between the two clusters, the two classes.
Then in step three, we calculate the mean of each class based on the average distance — that is what is being done over there — and then allocate the data into the nearest center, which you see the blue line right here dividing up. Then we calculate the mean of each class again based on the average distance, which you see being processed over there, and in the final step we allocate the data vectors into the nearest centers again over here. So, as you see, we've been updating this back and forth a couple of times: recalculating the averages, allocating points to the nearest center, doing the clustering again, then going back and recalculating the averages, over and over, until eventually we have a good division of clusters. If new data — a new vector — joins in, then we repeat this process again so that it is properly classified. It took a couple of steps, but it was so simple to do; that is why k-means is such a popular and effective clustering algorithm. In addition, it was unsupervised, meaning that we did not have any reference information at the beginning, yet eventually we found a good way to classify and cluster the data.

Dimensionality reduction. This is used to reduce dimensionality by projecting the dataset onto a lower-dimensional subspace. It captures the essence of the data and reduces the complexity of the classifier or regressor. Complexity depends on the number of inputs, and both time and space complexity need to be considered. As an example, drone management map generation is used here, where the data we have for a drone is three-dimensional: longitude, latitude, and altitude. We map longitude to the X axis, latitude to the Y axis, and altitude to the Z axis individually, and therefore it is three-dimensional. However, if we want to draw a simplified two-dimensional longitude-latitude drone map, we can go ahead and do this process.
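The k-means loop described above — random initial centers, allocate each point to the nearest center, recompute each center as its cluster's mean, repeat — can be sketched directly. The six sample points (two well-separated blobs) and k=2 are illustrative.

```python
import random

def kmeans(points, k=2, iters=10, seed=0):
    """Minimal k-means on 2-D points, following the lecture's steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: pick k centers at random
    for _ in range(iters):
        # step 2: allocate each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centers[i][0]) ** 2
                                      + (p[1] - centers[i][1]) ** 2)
            clusters[nearest].append(p)
        # step 3: recompute each center as the mean of its cluster
        for i, c in enumerate(clusters):
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters

# Two well-separated blobs of three points each.
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
       (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
centers, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # each blob in its own cluster: [3, 3]
```

Even if both random starting centers happen to land in the same blob, the mean updates pull one center toward the other blob within a few iterations, which is the back-and-forth refinement described in the lecture.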
Now, I must say in advance that dimensionality reduction is normally applied to much more complicated problems; I chose this one to make it as simple as possible for you to understand. However, I hope the concept will transfer well. So, from the three-dimensional locations of a lot of drones there, which have longitude and latitude as well as altitude, we can map them down to longitude and latitude by removing the altitude information. This information mapping is done over here, and when we take the mapped results from 3D to 2D, we see that we can remove the Z axis, since we already removed the altitude information, and then we can flip this up to get the result over there in a 2D structure of x and y. Assuming that the altitude information Z is not needed, we can do this, and by eliminating Z from the data we capture the essence of the data in a two-dimensional structure. This is what dimensionality reduction is about.

The machine learning algorithms we looked into included the basic statistics ones, the classification and regression ones, and the others, and all of them are very powerful. Spark uses them in its machine learning library to analyze data, filter data, and make predictions; that is why Spark is so powerful. In addition, as I mentioned in my former lectures, the Mahout engine does this for Hadoop as well, so Mahout is the machine learning engine that can be used with Hadoop, and these technologies will help Spark technology advance much more in the future. These are the references that I used and I recommend them to you. Thank you.