Hello, everyone. Today, we're going to talk about the data warehousing, which is another important component in the data mining pipeline. So for this one, our goal is to be able to identify one of key characteristics, right of the data warehousing, and also be able to apply various data warehousing techniques for your specific data mining tasks. So a quick review of our data mining pipeline, right? So we start with the raw data, okay? And then we try to understand our data by looking at the general characteristics of our data set. And then data pre processing really tries to fix potential issues in your data set and transform, integrate, reduce your data set so that they are ready for analysis, right? So, usually we don't like to feed the data directly into a particular data mining or data modeling step, instead we were using the data warehousing at the middle step, right? Which is kind of manages all the different types of data and which then can be used to a selected right, depending on the specific data modeling approach you would like to do, okay? So, in terms of data warehousing, right, the first question of course is that why do we even need the data warehousing? Or like specifically, talking about how data warehouse is different from operational data. So that means you already have maybe your production system, right? And now you're trying to say okay, I need a separate data warehouse for my data mining tasks. We're also specifically to talk about Data cube and also on OLAP, which is online analytical processing. So this specifically focusing on this multi dimensional angle, right? How you model your data as a multidimensional data, how you manage it and be able to use that to support various kinds of data manipulation. Also, we will talk about data warehouse architecture, right? What are the key components that should be included in your data warehousing design and like how they linked to each other? So starting point. Why do we need the data warehousing, right? Why is it not sufficient that we just grab the data from our everyday sources and just go from there, right? So here is like key definition by William Inmon. So, he basically defined data warehouse as a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of a management's decision-making process, okay? So there are a few key words right, in this definition, okay? So first one to note is that this is a separate from the operation data, okay? We will put something aside, and this is dedicated for our data-driven decision support, okay? So let's look at some of the key characteristics, right, as noted on the previous slide. So, first one say that a data warehouse is subject oriented, okay? So this is a particularly, like important. Remember, we talk about data quality. We'll talk about the relevance, right, of your data set, right? Because the idea is that you want to be focused on things that are of interest for your particular decision-making scenario rather than trying to diss managing your data that may be needed for some kind of operational purposes, okay? So, think about the example you may be looking at, say, stores, customers, products, students or courses, right? As you can see, these are all like subject specific, right? If you are interested in looking at your students, you look for this student specific data set, right? If it's about courses, then you may have courses specific dataset, okay? So the idea is that you really want to be kind of design your data warehouse to support, right, efficient management of manipulation, right of specific subject related data rather than trying to be supporting all potential datasets. The next characteristics is about this notion about being integrated, right? So we talk about, for about data pre processing. There's one component where you're trying to integrate the data sources, right? So that is also kind of one key driving force of a lot of kind of data mining projects is that you are taking data from multiple sources, okay? Because by integrating data from multiple sources then you're able to find the patterns that may govern across the various sources. So, one example here, if you're trying to consider a new location, right, to open our store, then you will probably kind of bringing information that's related to customers, right? Potential customers in that area. What kind of business already exist in that area? And also it maybe traffic condition because that may also impact, right? How convenient for customers to come to your store, right? So as you can see, right, these are naturally relevant information but they may be available through different sources, right? So for your bill to pull this together is a key part in terms of data warehousing and also, of course, supporting your specific decision-making process. And because of it's coming from different data sources, then you are natural dealing with heterogenerous sources. And also there's the notion of the data cleaning and the data integration as we discussed that before. Always apply in the setting so that you can pull the data in, integrate them, but also prepare them, right? So that your data warehouse will be hosting this nicely integrated data set for your particular task. Now, the next one is about being time-variant. So this is also kind of a key kind of difference between operational data and data warehousing, right? Because operational data, of course, you as you say, it's operational, is about the daily production, right? Our daily operation of your system. So, then general, you're talking about current data, right? While whit decision support, right? Typically, you're trying to look for historical information, right, and also maybe current information. So you're basically trying to combine information that covers a much longer time period, right? [COUGH] And also naturally, then your data will be time stamped, right? As you're kind of pulling your data in, in your data warehouse, right? It is an important notion for you to keep all the data type, like time stamps. So you know when what data points has been generated, okay? So for example, I'm trying to look for kind of urban planning scenario, right? May be very useful for you to look at not just the current traffic condition, but you want to look back, say, over the multiple, like past the 5, 10 years, so that you have a better understanding of the overall pattern and how things have evolved over the past few years. The last characteristic is about being nonvolatile, okay? The notion, of course, just say within your data warehouse, you typically don't update your data that often or in particularly you don't update in place, okay? So the key idea is that you're trying to say, grab data. So you usually have this kind of initial loading face, right? And say, okay, I have that identify new data set, I like to incorporate into my data warehouse, right. So you have the initial loading, and then afterwards, you may have kind of this can append operation, that means you're trying to add in more information, right? So you say, okay, using my traffic, for example. So you may have loaded in information, say, for the past 10 years, but then every day goes by, or every month you may have an updated information. Right, traffic information that you're appending to your data warehouse, that's already in there. And also, then with that, the most of the analysis right, really involves reading your data rather than writing or updating your data, okay? So this is usually, again, a key difference between data warehouse and operational kind of like data management system. Because that one is usually talking about kind of regular changes or fine grained updates to your existing status or data, all right? So this also kind of relates to the two key term, which is one of the OLTP and another one is OLAP, okay? So the first one OLTP refers to online transactional processing, okay? So the name suggests, right, about kind of supporting kind of transactions on demand, right? This is really like what a live kind of operation like data system will be supporting. Think about customers making purchases or orders. They're like say bank transactions or students signing up for courses or submitting homework. So all those are transactions, right? So you're supporting this regular and also detailed updates and that then that is a lot of in terms kind of inserting information, updating information, deleting information. So those more like the daily operation scenarios, okay? And you go, of course, to support that kind of operation, right, in a timely fashion. Well, with OLAP, which is referred to as online analytical processing, right? So here, the idea is all about being able to kind of work with say the data warehouse. And then you're really trying to do a lot of kind of analysis, right on top of your data, right? Remember, here your data set, right, in the data warehouse is usually the aggregate across many sources. And they are organized certain ways, and they have a lot of the historical information. So the analysis, right, is more for identify new patterns and to support your decision-making, right? So if you look at the temporal scale, then it's different. In this case, it's more about in a way you want to take your time, right? Analyze, do various kinds of exploration and to in order to find some patterns and then to kind of validate potential kind of decision or policy that you consider. So, as you can see, right, is to actually quite different usage scenarios, right? Both are important, but just that how they're being used, right, is very different. And because of that, right, so there's some kind of significant differences, okay, between these two, okay? So think about this operational system versus the data warehouse, right? You can look at a remarkable dimensions, right? We can already mentioned a little bit about how the types of data or where you're getting the data may be different, right? How one may be more specific to supporting daily operations versus the analytical kind of processing, right? Or the kind of decision support, right? And also, the scale of the data is usually different, right? And you may on one side with the transaction support you have, the current data set, right? And you're supporting fairly fine-grained kind of operation that may touch a particular point of example. You're a particular customers, the bank account, right? Where with the decision support, or like the data warehouse, the scenario. Usually you're trying to apply your analysis across a much larger set, right, of data. So because of that, right, see how they're being carried, who used those kind of systems? And also what kind of a super model agency requirements may be very different, right? They're also this kind of the systems asked in terms of how you design your database, and also how you do this kind of backup recovery, this or quite different, okay? So the key idea is that when you know there's an OLTP are being used for different scenarios, right? Then you can reason about how you design them differently, how they're being used differently. And what then, of course, should be considered when you're trying to consider trade-offs between different matrices. So next, let's look at data warehouse model. So here the idea is that well, of course, data warehouses for managing data, right, for decisions over but still those data. So what kind of model you're using to represent your data? So we already said earlier, your data are represented as objects and the dimensions, right? But here there's two important notion when you talk about the data warehouse, right? So the first one is about the fact, okay? So fact is rejected at the point, the thing of interest, right? So I say if I'm really like the decision support process that you to kind of boost the sales of my products, right, okay? And your fact is about the sales, right, how the amount of sales you get, right, in certain kind of scenarios. And then you talk about the dimensions. The dimension those, of course, are specific attributes, right? So when we talk about the sales, of course, you talk about items. When you talk about particular stores, when you look at any particular time period. So those are kind of descriptive, right, dimensions that will be used to them kind of like, quantify the particular fact like that you're interested. Now, depending on how you kind of design it and depending how this will be used, right? There are a few different approaches in terms of modeling, okay,? So here we use the term like schema, right? So that's like typically, in the databases setting, right, using a schema to kind of capture the different types of kind of information that's been linked together. So the first one that's a simple one is called a star schema, okay? As the name suggests, it's a star, right? So you have something in the middle, and then you have things that are kind of linking out, okay? So this is a typical scenario. You have one fact table, and you have multiple dimension tables that are linked to or from the fact table, okay? And then you can add another layer instead of star, where you have this kind of one layer of linkage, you could have actually links and links. So you basically have a multiple layers linking. And that kind of gets into the snowflake kind of like scenarios. That means you started with one fact table in the center, right? And then this fact table will be connected to multiple dimensions. But those dimensions by themselves may also be linked out to other dimensions. So you have this kind of expanded view, right, of the various aspects. Another one that's even kind of a little bit more complex is called the fact constellation schema, okay? So here the idea is you can use each fact as a star shape. But now you're kind of linking multiple ones, so you have not just one fact table. You can have multiple fact tables. So those fact tables, of course, each fact table be linked to its dimensions. But that dimension may also be linked to other fact tables, okay? So let's look at some kind of concrete examples, okay? So the first one star schema rather let's start scheme, or you have one fact table in the middle, okay? And then it will be linked to marginal dimensions or dimension tables, okay, that's simple. So if you think about concrete example, and my fact table talking about the sales, right, I care about sales in my stores. Now and then you could okay, the cells may be associated with the customer, item, time, so those are individual dimensions, right? So if you go to say, your customer table, okay? So this is a dimension talking about the customers, right? So what would you define, describe the customers? The customers have the name information. You can define the first name, last name, and you have the address information, data, birth or some other kind of like price, purchases or whatever. So those are things that will be related to those customers, right? And then you can have another dimension, right? So the sales also could be associated with the timetable. So the time, of course, it defines what time, right? Can say which a year, which month, which date or specific hour and minute. So you can have actually very detailed information about the time, okay? But by organizing this, so as you can see, it's natural for you to one, focus on your point of interest, right? So that's your kind of like the fact table. But then you have all the related information describing that particular thing that you're interesting. So the Snowflake Schema, as we said, it's still a single fact table right in the middle, okay? But then instead of just one level of information that's connected to the fact table, you could have multiple levels, okay? So again, using our sales example, right? So the factor table is about sales, okay? And then it has multiple dimensions. So it's linking to customer, item, time, right? In this case, let's go to the customer dimension table, so customer will be defined by name, address or whatever. But here you can also have the card. This could be the credit card information or, club card or membership card or whatever, right? So once you get to the card level, okay? So then basically you're linking from the sales fact table to the customer dimension table. And then from that customer dimension table, you would also have a link to the card dimension table, okay? And the card, of course, will be associated with the name of the card owner, expiration date, and maybe CVC code you need. So these as you can see, allows you to kind of expand right from a single level to multiple levels. But the key here is that you still have a single fact table in the center, right? And then the dimensions are then linked out of it, okay? Then the third one is the Fact Constellation Schema. Right, so this is a case where you have multiple fact tables, okay? And also allows you to look at say not just the sales, but maybe you look at other things whatever this could be shipping, right? So here you could have, say, a sales fact table, which is linked to say, time, branch, item, location, this all dimensions, related to sales fact table. But then you can also see that my item dimension right, could also be related to shipping. Because apparently, if you have a shipping fact tables look at not only the sales, but also look at the shipping scenario. Then say each shipping information, of course, would be related to the item dimension table, right? So like for this particular shipment which items are being content, right? So you could have done link between. So from the sales fact table, you may be linked to the item dimension table, and then the item dimension table may also be linked to from the shipping fact table, okay? So that allows you again to kind of link the multiple types information in your data warehouse and then be able to easily navigate those linkage so that you get the specific information in need for your particular analysis. Now, with this data model, as I said, we use the schema to talk about the individual ones. But naturally, this is in the data warehouse scenario you are dealing with a multi dimensional data model, right? Typically your data will be represented by multiple dimensions or multiple attributes, okay? So, naturally we use this term of kind of data cube. Of course, cube is preside in the 3D spaces, it's a cube but we think about high dimensional space. Then you're really talking about how this covers multiple dimensions that are related to your particular facts. So here I have a simple example illustrating in a 3D space, right? I have again, the three dimensions, right? Those will be the cube attributes, right? I'll define, so which year, what kind of product? And also here I'm looking at what kind of color it's related, right? So you naturally have this three dimensions. And what do you care about, of course, is some kind of like numerical measure. In this case, could be the sales, which is by volume or by the actual value, okay? All right, so once we have defined a data cube at this kind of multi- dimensional data model, to represent what kind of data we have in the data warehouse, then we can think about how we can manipulate. Or what kind of operations, we can use to the select the kind of data you need for your particular analysis, okay? So here is a quick illustration of the different things you could do, right? So starting with your kind of initial kind of like data cube where you have the raw data right at the final granularity and across many dimensions, right? So what you can do so, for example, you can do this roll-up operation. So roll-up in a way just more like aggregating, right? Aggregating along certain dimensions, okay? So that of course, you have aggregated view across say which store, right? So you say I don't care about which store, so that's where you can add up along the multiple stores. Or you don't say I can add up along certain categories of items, right? So that I usually allow you to kind of have a coarser granularity to look at. And in reverse to that to roll-up then you will have the drill down. So that means I can take a particular information and then kind of dive, go deeper, right? So, for example, I have my annual information, but then I can actually go deeper and go to see the monthly information or even daily information, right? So that allows you to not only kind of go up to the coarser granularity, but also, you can come back to this final grand already, depending on your specific analysis, okay? There are other things that's like slicing and the dicing. So that just generally the idea is about why maybe along a particular dimension I don't need to consider everything, right? So, for example, if you talk about your annual sales that maybe I don't care about anything happened like five years ago. So instead, I only look at the recent five years, right? And then the dicing part, of course, at this kind of multi-dimensional cutting, right? So not only you're cutting along one dimension, saying that for this dimension only care about this piece or this range, and they can go to another dimension say, okay, I'm going to cut out certain ranges. So as a result, then you're looking at more like a sub cube, right over your overall cube. So that's the kind of the dicing scenario. And another approach that's very useful is about this pivot operations. So that basic kind of like transforms, right? It doesn't change your data per se, but it just tries the viewpoint. So instead of comparing, say, year versus the store location, now you may be looking at store location and also the categories of items, okay? Now, let's stop being and look at some concrete examples, okay? So the first one, roll up, as I said, this is really the aggregation kind of like operation, right? So you're picking out a particular data set, and then you say, okay, now I want to aggregate along certain dimensions, right? So here you can say, I can take my daily sales value and aggregated into monthly. The other example. Right, so that's the drill down. So that means now is kind of reversible up, right? So you may have aggregated to get the coaser granularity, but then you can also kind of drill down to get to the final granularity. Okay, so earlier you may have example where say I have aggregated sales data for all the stores within North America, right? And now you can say, maybe I want to go, I should go a little bit deeper out of the country level. So you can look out and say United States and Mexico, Canada or other things. As you can see, you can get to the final granularity, as you kind of figure out yeah, well, I mean, I want to actually go deeper in terms of a particular dimension. And then so, pivot is particularly useful in terms of visualization, right? So you want to kind of rotate your dimensions in a way that you can focus on certain dimensions, that are of particular interest. So in this case, you don't change your data points or the values, right? It's really more about how you're visualizing this specific dimensions. Slicing. So my example here is that I can pick out a certain values along particular dimension, right? So that, instead of considering all the possible values I only consider, one or multiple values along that dimension, okay? So if it's a nominal value, you may just pick up a certain kind of nominal values, right? So if you say, if I am considering the student majors, right? Maybe I don't consider all the students majors. I only consider students who are, say, in computer science, or some other related fields. So that's how you can easily select certain subset of the values along each dimension. And then if you do this on a remarkable dimension, then that's the dicing case, right? So I could say I could be selecting by country, and also I can select by time periods and then some other aspects. This basically allows you to easily manipulate your data cube, right? Because once you have the raw data kind of incorporated in your data warehouse, right? And then on top of that, you have this kind of very convenient operations that you can easily apply to your data cube so that you can identify certain types of information that you want to leverage