Analytical operations in big data pipelines.

After this video, you will be able to list common analytical operations within big data pipelines and describe sample applications for these analytical operations.

In this lesson, we will be looking at analytical operations. These are operations used in analytics, which is the process of transforming data into insights for making more informed decisions. The purpose of analytical operations is to analyze the data to discover meaningful trends and patterns, in order to gain insights into the problem being studied. The knowledge gained from these insights ultimately leads to more informed decisions driven by data. Here are some common analytical operations that we will discuss in this lecture: classification, clustering, path analysis, and connectivity analysis.

Let's start with classification. In classification, the goal is to predict a categorical target from the input data. A categorical target is one with discrete values or categories, instead of continuous values. For example, this diagram shows a classification task to determine the risk associated with a loan application. The input consists of the loan amount, applicant information such as income, age, and debts, and a down payment. From this input data, the task is to determine whether the loan application is low risk or high risk. There are many classification techniques or algorithms that can be used for this problem. We will discuss a specific one, namely the decision tree, in the next slide.

The decision tree algorithm is one technique for classification. With this technique, decisions to perform the classification task are modeled as a tree structure. For the loan risk assessment problem, a simple decision tree is shown here, where the loan application is classified as being either low risk or high risk, based on the loan amount, the applicant's income, and the applicant's age. The decision tree algorithm is implemented in many machine learning tools.
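To make the idea of decisions modeled as a tree concrete, here is a minimal hand-written sketch in Python. It is not the tree from the slide: the `LoanApplication` type, the order of the tests, and every threshold are hypothetical, and a real decision tree algorithm would learn the splits from training data rather than have them typed in.

```python
from dataclasses import dataclass


@dataclass
class LoanApplication:
    amount: float  # requested loan amount
    income: float  # applicant's annual income
    age: int       # applicant's age in years


def classify_loan(app: LoanApplication) -> str:
    """Walk a small, hand-written decision tree and return a risk label.

    Every threshold below is a made-up value for illustration only.
    """
    if app.amount < 50_000:      # root node: test the loan amount
        return "low risk"
    if app.income >= 100_000:    # second level: test the income
        return "low risk"
    if app.age >= 40:            # third level: test the age
        return "low risk"
    return "high risk"           # leaf reached when every test fails


small_loan = LoanApplication(amount=20_000, income=30_000, age=25)
large_loan = LoanApplication(amount=200_000, income=40_000, age=25)
print(classify_loan(small_loan), "/", classify_loan(large_loan))
```

Each `if` corresponds to an internal node of the tree and each `return` to a leaf, so classifying one application means following a single path from the root to a leaf.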
This diagram shows how to specify a decision tree from input data in KNIME, a graphical user-interface-based machine learning platform. Some examples of classification are the prediction of whether cells from a tumor are benign or malignant, the categorization of handwritten digits as being zero, one, two, and so on up to nine, determining whether a credit card transaction is legitimate or fraudulent, and the classification of a loan application as being low-risk, medium-risk, or high-risk, as you've seen.

Another common analytical operation is cluster analysis. In cluster analysis, or clustering, the goal is to organize similar items into groups. This diagram shows an example of cluster analysis in which customers are clustered into groups according to their preferences of movie genre. So, customers who like sci-fi movies are grouped together, those who like drama movies are grouped together, and customers who like horror movies are grouped together. With this grouping, new movies, as well as other products such as books, can be offered to the right type of customers in order to generate interest and increase revenue.

A simple and commonly used algorithm for cluster analysis is k-means. With k-means, samples are divided into k clusters. This clustering is done in order to minimize the variance, or dissimilarity, between samples within the same cluster, using a similarity measure such as distance. In this example, k is equal to three, and k-means divides the original data shown on the left into three clusters, shown as blue, green, and red on the chart on the right. The k-means clustering algorithm is implemented on many machine learning platforms. The code here shows how to read in and parse input data, and perform k-means clustering on the data.
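The code from the slide is not reproduced in this transcript. As a stand-in, here is a minimal plain-Python sketch of the same two steps, parsing input data and running k-means, under a few assumptions: the sample data is made up, the function names `parse_points`, `init_centroids`, and `kmeans` are hypothetical, and the centroids are seeded with a simple deterministic farthest-point rule instead of the random initialization that machine learning platforms typically use.

```python
import math

# Hypothetical input: one 2-D sample per line, values separated by spaces.
SAMPLE = """
1.0 1.0
1.2 0.8
0.9 1.1
5.0 5.0
5.2 4.9
4.8 5.1
9.0 1.0
9.1 0.8
8.9 1.2
"""


def parse_points(text):
    """Turn each non-empty line into a tuple of floats."""
    return [tuple(float(v) for v in line.split())
            for line in text.splitlines() if line.strip()]


def init_centroids(points, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from its nearest already-chosen centroid (an assumption made
    here so that the result is deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iterations=100):
    """Alternate assignment and update steps until assignments stop changing."""
    centroids = init_centroids(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each sample joins the cluster of its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


points = parse_points(SAMPLE)
centroids, assignment = kmeans(points, k=3)
print(assignment)
```

Moving each centroid to the mean of its members is what reduces the within-cluster variance, which is the quantity k-means tries to minimize.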
Other examples of cluster analysis are grouping a company’s customer base into distinct segments for more effective targeted marketing, finding articles or webpages with similar topics for retrieving relevant information, identifying areas in the city with high rates of particular types of crimes for effective management of law enforcement resources, and determining different groups of weather patterns such as rainy, cold, or snowy.

Classification and cluster analysis are considered machine learning analytical operations. There are also analytical operations from graph analytics, which is the field of analytics where the underlying data is structured as, or can be modeled as, a set of graphs.

One analytical operation using graphs is path analysis, which analyzes sequences of nodes and edges in a graph. A common application of path analysis is to find routes from one location to another location. For example, you might want to find the shortest path from your home to your work. This path may be different depending on conditions such as the day of the week, time of day, traffic congestion, and weather. This code shows some operations for path analysis on Neo4j, which is a graph database system using a query language called Cypher. The first operation finds the shortest path between specific nodes in a graph. The second operation finds all the shortest paths in a graph.

Connectivity analysis of graphs has to do with finding and tracking groups to determine interactions between entities. Entities in highly interacting groups are more connected to each other than to entities of other groups in a graph. These groups are called communities, and are interesting to analyze as they give insights into the degree and patterns of the interaction between entities, and also between communities. Applications of connectivity analysis include extracting conversation threads, for example, by looking at tweets and retweets.
Other applications are finding interacting groups, for example, to determine which users are interacting with which other users, and finding influencers, for example, to understand who the main users leading the conversation about a particular topic are, or who people pay attention to. This information can be used to identify the fewest number of people with the greatest influence, for example, for political campaigns or marketing on social media.

This code shows some operations for connectivity analysis on Neo4j using the query language Cypher again. The first operation finds the degree of all the nodes in a graph, and the second creates a histogram of degrees for all nodes in a graph. To determine how connected a node in a graph is, we need to look at its degree. The degree of a node is the number of edges connected to the node. A degree histogram shows the distribution of node degrees in the graph and is useful in comparing graphs and identifying types of users, for example, those who follow versus those who are followed in social networks.

To summarize and add to these techniques, the decision tree algorithm for classification and the k-means algorithm for cluster analysis that we covered in this lecture are techniques from machine learning. Machine learning is a field of analytics focused on the study and construction of computer systems that can learn from data without being explicitly programmed. Our course on machine learning in this specialization will cover these algorithms in more detail, along with other algorithms used for classification and cluster analysis, as well as algorithms for other machine learning tasks, such as regression and association analysis, and tools for implementing and executing machine learning algorithms.

To summarize graph analytics, the path analysis technique for finding the shortest path and the connectivity analysis technique for analyzing communities that we discussed earlier are techniques used in graph analytics.
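The Cypher queries from the slides are not reproduced in this transcript. As a rough stand-in, the two graph operations just summarized, finding a shortest path and computing node degrees with their histogram, can be sketched in plain Python. The sketch assumes a small undirected, unweighted graph stored as an adjacency list; the node names and the helper names `shortest_path`, `degrees`, and `degree_histogram` are hypothetical, and a real routing problem would usually need edge weights such as travel time.

```python
from collections import Counter, deque

# Hypothetical undirected graph: each node maps to the list of its neighbors.
GRAPH = {
    "home": ["cafe", "park"],
    "cafe": ["home", "work"],
    "park": ["home", "mall"],
    "mall": ["park", "work"],
    "work": ["cafe", "mall", "gym"],
    "gym": ["work"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: returns a path with the fewest edges, or None."""
    previous = {start: None}  # maps each visited node to the node before it
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def degrees(graph):
    """Degree of a node = number of edges connected to it."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def degree_histogram(graph):
    """Count how many nodes have each degree."""
    return Counter(degrees(graph).values())


print(shortest_path(GRAPH, "home", "work"))
print(degrees(GRAPH))
print(degree_histogram(GRAPH))
```

In a social network, the same histogram computed separately for incoming and outgoing edges would help separate users who mostly follow from users who are mostly followed.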
As explained earlier, graph analytics is the field of analytics where the underlying data is structured as, or can be modeled as, a set of graphs. Our graph analytics course in the specialization will cover these and other graph techniques, and will also cover tools and platforms for graph analytics.

In summary, analytical operations are used to discover meaningful patterns in the data in order to provide insights into the problem being studied. We looked at some examples of analytical operations for classification, cluster analysis, path analysis, and connectivity analysis in this lecture.