Hi. Last week, we learned how to program descriptive analytical SQL queries against our data warehouse. This week, we will learn the process of data mining, its typical architecture, and the impact of data mining on business intelligence, along with some algorithms. Nowadays, large amounts of data flow from everywhere at speeds and volumes never seen before, and companies that have successfully used analytics are obtaining extraordinary business results. However, how should we start analyzing such amounts of data to get the best business advantage? Well, this is where predictive analytics, data mining, machine learning, and decision management come into play. Data mining is the process of discovering patterns in large data sets, involving methods at the intersection of machine learning, statistics, and database systems. As we have learned, data mining is used to improve OLAP analysis. Data mining is also used within the process of knowledge discovery in databases. Remember that there are several approaches to data analysis: basic descriptive analysis, mainly obtained by SQL programming; descriptive statistical analysis, such as mean, mode, standard deviation, et cetera; inferential statistical analysis with models, inferences, and predictions, using correlation, regression, variance, et cetera; and analysis with data mining, which involves artificial intelligence and machine learning. Predictive analytics and data mining help to evaluate what will happen in the future. Data mining searches for hidden patterns in the data that can be used to predict future behavior through machine learning. Businesses, scientists, and governments have used this approach for years to transform data into proactive knowledge. Decision management converts that knowledge into actions that are used in their operational processes.
Although the same approaches can be applied today, they need to occur more quickly and on a larger scale, using the most modern techniques currently available. But what are the benefits of discovering patterns or models through data mining? Innovative organizations use data mining and predictive analytics to, among other things, detect fraud and cybersecurity problems, manage risk, spot sales trends, develop smarter marketing campaigns, predict customer loyalty, and support medical treatments. In the case of automated analysis, a quick implementation of the knowledge obtained from predictive analysis ensures that the advantage of the analytical models is not lost due to slow processes such as rewriting the code for each environment, revalidating the rewritten models, or any other manual process. One very important step in a system implementation is to ask, even before you start the project, what benefits a decision support system will bring. So, in order to calculate the return on investment, we need to ask: what will the possible investment be? What are the expected returns? Then we establish the corresponding resources and implement the project accordingly. Plan the project according to milestones, keep track of them in order to control the project, and verify that we have achieved the expected return. For example, have we obtained the expected results? Measure the results and implement an action plan to correct course if necessary. What can we do to correct the situation if we have not achieved the results? Now, I will present a typical architecture for data mining. As I said before, there are several ways to implement retrospective analysis: basic SQL, OLAP queries, visualization, and basic statistics. All of these let us know what happened.
In the case of data mining, there are plenty of tasks and techniques that help establish descriptive and prospective analysis through data patterns or models that let us know why things happen and what comes next. A common use of data mining and machine learning techniques is the automatic segmentation of customers by behavior, demographics, or attitudes, to better understand the needs of specific groups and address them in a more targeted manner. This analytical segmentation, or unsupervised modelling, helps to identify groups of clients that are similar and that could react to certain offers or activities in a similar way. Another important use of data mining and machine learning is fraud detection, which matters more and more as fraudsters develop more sophisticated tactics. Data mining provides tangible benefits such as cost reduction, income generation, and reduced time for different business activities. It provides intangible benefits such as better decision-making and an improved competitive position, and it also provides strategic benefits, for example, those that facilitate the formulation of strategy, that is, which clients, markets, or products to go after. There are two main objectives in data mining. First comes prediction, which often refers to supervised data mining. Second, we have description, which includes the unsupervised and visualization aspects of data mining. We speak of supervised methods when we start from prior knowledge of the data. If we don't have prior knowledge of the data, we use unsupervised methods, where groups of values are searched for automatically. Users then try to find correspondences between these automatically selected groups and the categories that may be of interest. Now we will see that predictive data mining tasks come up with a model from an available data set to predict unknown or future values of another data set.
An example is a medical practitioner trying to diagnose a disease based on a patient's medical test results. Descriptive data mining tasks, in turn, find patterns that describe the data and come up with new, significant information from the available data sets. An example is a retailer trying to identify products that are purchased together. There are a number of data mining tasks, such as classification, prediction, time series analysis, association, clustering, summarization, et cetera. All these tasks are either predictive data mining tasks or descriptive data mining tasks. A data mining system can execute one or more of these specific tasks as part of data mining. I will explain some of these data mining tasks as follows. In the case of predictive analysis, a classification task derives a model to determine the class of an object based on its attributes. A collection of records needs to be available, and one of the attributes of each record will be a class attribute. The goal of the classification task is to assign a class attribute to a new set of records as accurately as possible. For instance, classification can be used in direct marketing to know which customers purchased similar products, and then promotional mail can be sent to them directly. We can see in the figure that classification starts from a set of data, which is prepared for the data mining task and divided into a training set and a testing set. The training set will be the input to the classification algorithm in order to create a predictive model. When the model is ready, it will be evaluated by using the testing set as input and verifying whether the outcomes are as expected. Remember that in the input file we have the data and the corresponding classification already identified, because we start from previous knowledge. The prediction task predicts the possible values of missing or future data.
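Before going further into prediction, the classification workflow just described (split the labeled data, build a model from the training set, check it against the testing set) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the records, the 75/25 split, and the function names are hypothetical, and a 1-nearest-neighbor rule stands in for whatever classification algorithm the figure uses.

```python
import math

# Hypothetical labeled records: ((age, income in thousands), class attribute).
records = [
    ((25, 30), "no"), ((47, 85), "yes"), ((52, 90), "yes"), ((23, 28), "no"),
    ((44, 80), "yes"), ((30, 35), "no"), ((50, 95), "yes"), ((27, 32), "no"),
]

def euclidean(p, q):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def split(data, train_fraction=0.75):
    """Divide the prepared data into a training set and a testing set."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

def predict(train, features):
    """Assign the class of the closest training record (1-nearest neighbor)."""
    nearest = min(train, key=lambda rec: euclidean(rec[0], features))
    return nearest[1]

def accuracy(train, test):
    """Fraction of testing records whose predicted class matches the known class."""
    hits = sum(1 for features, label in test if predict(train, features) == label)
    return hits / len(test)

train, test = split(records)
print(len(train), len(test), accuracy(train, test))
```

The testing records are never used to build the model, which is what makes the accuracy figure a fair check of how the model handles new records.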
Prediction involves developing a model based on the available data, and this model is used to predict future values of a new data set of interest. For example, a model can predict the income of an employee based on education, experience, and other demographic factors such as place of residence, gender, et cetera. Prediction analysis is also used in different areas, including medical diagnosis, fraud detection, et cetera. Next is the predictive task of time series analysis. A time series is a sequence of events where the next event is determined by one or more of the preceding events. A time series reflects the process being measured, and there are certain components that affect the behavior of a process. Time series analysis includes methods to analyze time series data in order to extract useful patterns, trends, rules, and statistics. Stock market prediction is an important application of time series analysis. The descriptive task of association discovers the association or connection among a set of items. Association identifies the relationships between objects. Association analysis is used for commodity management, advertising, catalog design, direct marketing, et cetera. A retailer can identify the products that customers normally purchase together, or even find the customers who respond to promotions of the same kind of products. The descriptive task of clustering is similar to classification, except that the groups are not predefined. Clustering is used to identify data objects that are similar to one another. The similarity can be decided based on a number of factors, such as purchase behavior, responsiveness to certain actions, geographical location, and so on. For example, an insurance company can cluster its customers based on age, residence, income, et cetera. This group information will be helpful to understand the customers better and hence provide better customized services. The summarization task is descriptive and is the generalization of data.
A set of relevant data is summarized, and the result is a smaller set that gives aggregated information about the data. For example, the shopping done by a customer can be summarized into total products, total spending, offers used, et cetera. Such high-level summarized information can be useful to a sales or customer relationship team for retail customer and purchase behavior analysis. Data can be summarized at different abstraction levels and from different angles. Here we can see a number of tasks and techniques. We will learn some of the most representative and widely used tasks and techniques. In the case of the classification task, we will learn the ID3 decision tree, Naive Bayes, and K-nearest neighbor, which can also be used for regression. In the case of the clustering task, we will learn the K-means algorithm. Now we will see how the K-means algorithm works. K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data, that is, data without defined categories or groups. The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are: first, the centroids of the K clusters, which can be used to label new data; second, labels for the training data, where each data point is assigned to a single cluster. Each cluster centroid is a collection of feature values that define the resulting group. Examining the centroid feature values can be used to qualitatively interpret what kind of group each cluster represents. The K-means pseudocode is as follows: first, select K points as the initial centers. Second, repeat. Third, form K clusters by assigning each point to its nearest center. Fourth, recalculate the center of each cluster. Fifth, until the centers do not change.
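The pseudocode above can be turned into a short Python sketch. This is a minimal illustration, not a production implementation: the data points are made up, the function names are hypothetical, and the initial centers are simply taken to be the first K points, which is one assumption among several possible initialization choices.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, max_iter=100):
    # First: select K points as the initial centers
    # (assumption: the first K points of the data set).
    centers = [tuple(p) for p in points[:k]]
    labels = []
    # Second: repeat (max_iter is only a safety limit).
    for _ in range(max_iter):
        # Third: form K clusters by assigning each point to its nearest center.
        labels = [min(range(k), key=lambda i: euclidean(p, centers[i]))
                  for p in points]
        # Fourth: recalculate the center of each cluster as the mean of its members.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        # Fifth: stop when the centers do not change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Made-up, unlabeled two-dimensional data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

The function returns the two results mentioned earlier: the centroids, which can be used to label new data, and one cluster label per training point.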
To assign the points to the nearest centers, a measure of proximity is used to determine how "close" the data is to the centers. The most used measure of proximity is the Euclidean distance, but other measures of proximity can be used, such as the Manhattan distance or the cosine distance. The latter is usually used to measure similarity between documents. To ensure that each point is assigned to its cluster center and that the quality of the clustering is good, an objective function is used that tries to minimize the distance between points and their centers. This objective function is the sum of squared errors, E, which is defined as the sum, over every cluster i and every point p in that cluster, of the squared distance dist(p, Ci) between the point and the cluster center. Here E represents the sum of the squared error for all the objects in the data set; p represents a point in space, which represents a given object; Ci is the center of cluster i; and dist is a standard distance measure, usually the Euclidean distance between two objects in a Euclidean space. To calculate the center Ci, the formula of the mean value is used: the sum of the objects in cluster i divided by Mi, where Mi represents the number of objects in cluster i. Now we will see an example of the K-means algorithm. There are four people with their corresponding exam scores and ages. Each person can be represented as a point in a coordinate space. During the first iteration, the distance of each object to each centroid is calculated using the Euclidean distance. Here we can see the seven distances calculated. For instance, the Euclidean distance from point W to itself is zero. The distance between point W and point X is 3.16. After that, we can calculate the following distance matrix. Then, each object is assigned to a group, taking into account the minimum partition error of each element in each group. Therefore, persons, or points, W and Y are assigned to group W, whose center is at ..., and persons X and Z to group X, whose center is at .... During the second iteration, we need to determine the new centers of the clusters.
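The center update that the second iteration starts with, together with the proximity measures and the sum of squared errors described earlier, can be sketched as follows. This is a hedged illustration only: the coordinates are made up and are not the values from the example, and all function names are hypothetical.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_distance(p, q):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

def centroid(members):
    """Mean value of a cluster: coordinate sums divided by the member count Mi."""
    return tuple(sum(c) / len(members) for c in zip(*members))

def sse(clusters, centers):
    """Sum over all clusters and points of the squared Euclidean distance to the center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, center))
        for members, center in zip(clusters, centers)
        for p in members
    )

# Made-up points, used only to exercise the functions.
a, b = (1.0, 2.0), (4.0, 6.0)
print(euclidean(a, b), manhattan(a, b), cosine_distance((1.0, 0.0), (0.0, 1.0)))

clusters = [[(1.0, 2.0), (3.0, 4.0)], [(10.0, 10.0)]]
centers = [centroid(members) for members in clusters]
print(centers, sse(clusters, centers))
```

A lower sum of squared errors means the points sit closer to their assigned centers, which is why K-means recomputes each center as the mean of its members after every assignment step.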
Knowing the members of the groups, or clusters, the center of each group is calculated. So, the new center for group one is.. And the new center for group two is.. We have, the new center is.. The distances between each point and the new centers are computed, and they are as follows. The distance matrix is recalculated, and it is as follows. According to the distance matrix, Y will be assigned to cluster one, and W, X, and Z will be assigned to cluster two. Third iteration. Here we have C2, distance two, and group two. As we can see, the grouping is the same as in iteration two, so we conclude that the calculation of the centers has stabilized and it is not necessary to continue iterating. As a result, we can say that persons W, X, and Z share the same characteristics and belong to cluster C2, and Y can be grouped apart. Today we identified the differences between predictive and descriptive tasks, and we also learned how the K-means algorithm works. Next session, we will continue with the ID3 and K-nearest neighbor algorithms. See you soon.