Welcome. In this short module we'll talk about information integration, which refers to the problem of using many different information sources to accomplish a task. We'll look at the problems and solutions through a few use cases. After this video, you'll be able to explain the data integration problem, define integrated views and schema mapping, describe the impact of increasing the number of data sources, appreciate the need to use data compression, and describe record linkage, data exchange, and data fusion tasks. Our first use case starts with an example given on an IBM website for their information integration products. It represents a very common scenario in today's business world. Due to changing market dynamics, companies are always selling off parts of their business or acquiring other companies. As these mergers and acquisitions happen, databases that were developed and stored separately in different companies now need to be brought together. Take a minute to read this case. It concerns an expanding financial services group that's growing its customer base in different countries, and all they want is a single view of their entire customer base. In other words, it does not matter which company originally had the customers; Suncorp-Metway wants to consolidate all customer information as if it were in one single database. In reality, of course, they may not want to buy a huge machine and migrate every subsidiary's data into it. What they're looking to create is a software solution that makes all customer-related data appear as though it were in a single database. This software solution is called an information integration system. It will help them ensure that they have a uniform set of marketing campaigns for all their customers.
Let's try to see, hypothetically of course, what might be involved in creating this combined data and what kind of use the integrated data might serve. We first create a hypothetical scenario. Although Suncorp has a large number of data sources, we will take a much simpler situation with only two data sources from two different financial service companies. The first data source is an insurance company that manages its data with a relational DBMS. This database has nine tables, where the primary object of information is a policy. The company offers many different types of policies, sold to individual people by its agents. As is true for all insurance companies, policyholders pay their monthly dues, and sometimes people make claims against their insurance policies. When they do, the details of the claims are maintained in the database. These claims can belong to different categories, and when claims are paid to the claimants, the transaction is recorded in the transactions table. As we have done several times now, the primary keys of the tables are the underlined attributes here. The second company in our example is a bank, which also uses a relational database. In this bank, both individuals and businesses, called corporations here, can have accounts. Accounts can be of different types; for example, a money market account is different from a savings account. The bank also maintains its transactions in a table, which can be really large. A dispute over a bank record happens when the bank has charged a customer and the customer has declined responsibility for the charge. This can happen, for example, if a customer's Internet account was hacked or a debit card got stolen. The bank keeps a record of these anomalies and fraudulent events in a disputes table. Let's see what happens after the data from these two subsidiary companies are integrated. After the merger, the company wants to run a promotional goodwill activity.
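The two source schemas just described can be sketched in code. Here is a minimal, hypothetical version using Python's built-in sqlite3 module; the table and column names are illustrative guesses, not the actual nine-table insurance schema, and the underlined primary keys from the slides become PRIMARY KEY declarations.

```python
import sqlite3

# Hypothetical, heavily abbreviated versions of the two source databases.
# Table and column names are illustrative, not the actual schemas.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Insurance company (relational DBMS): policies, claims, payout transactions
CREATE TABLE policies     (policyKey TEXT PRIMARY KEY, policyType TEXT, agentId INTEGER);
CREATE TABLE claims       (claimId INTEGER PRIMARY KEY, policyKey TEXT, category TEXT);
CREATE TABLE transactions (txnId INTEGER PRIMARY KEY, claimId INTEGER, amountPaid REAL);

-- Bank: individual and corporate accounts, plus a disputes table for
-- charges the customer has declined responsibility for
CREATE TABLE accounts (acctNumber INTEGER PRIMARY KEY, acctType TEXT, ownerId INTEGER);
CREATE TABLE disputes (disputeId INTEGER PRIMARY KEY, acctNumber INTEGER, reason TEXT);
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```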
They would like to offer a small discount to their insurance policyholders if they're also customers of the newly acquired bank. How do you identify these customers? Let's see. In other words, we need to use the tables shown on the left to create the table shown on the right, called discountCandidates. One way is to take the yellow tables from the insurance company database and the blue table from the bank database, and then join them to construct a table with a common customer ID, the policyKey, and the bank account number. Now, this relation, which is derived, that is, computed by querying two different data sources and combining their results, is called an integrated view. It is integrated because the data is retrieved from different data sources, and it's called a view because in database terminology it is a relation computed from other relations. To populate the integrated view discountCandidates, we need to go through a step called schema mapping. The term mapping means establishing correspondences between the attributes of the view, which is also called the target relation, and those of the source relations. For example, we can map the full address from individuals to the address attribute in discountCandidates, but this would only be true for customers whose names and addresses match in the two databases. As you can see, policyholders uses the full name of a customer, whereas individuals has it broken down into first name, middle initial, and last name. On the other hand, full address is a single field in individuals, but is represented in four attributes in the policyholders relation. The mappings of account number and policyKey are more straightforward. What about customer ID, which doesn't correspond to anything in the four input relations? We'll come back to this later on. Okay, now we have defined an integrated relation. How do we query it? For example, how do you find the bank account number of a person whose policyKey is known?
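As a concrete sketch of such an integrated view, here is a toy version in Python's sqlite3, assuming simplified one-table stand-ins for each source; all table names, columns, and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical, simplified stand-ins for the two source databases.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE policyholders (policyKey TEXT, name TEXT, address TEXT);
CREATE TABLE individuals   (acctNumber INTEGER, name TEXT, address TEXT);

INSERT INTO policyholders VALUES ('4-937528734', 'Ann Lee', '12 Oak St');
INSERT INTO policyholders VALUES ('4-000000001', 'Bo Chen', '9 Elm Ave');
INSERT INTO individuals   VALUES (101, 'Ann Lee', '12 Oak St');
""")

# The integrated view: keep customers whose name and address match in
# both sources, pairing their policyKey with their bank account number.
conn.execute("""
CREATE VIEW discountCandidates AS
SELECT p.policyKey, i.acctNumber
FROM policyholders p JOIN individuals i
  ON p.name = i.name AND p.address = i.address
""")

rows = conn.execute("SELECT * FROM discountCandidates").fetchall()
print(rows)
```

Only the customer present in both sources with a matching name and address survives the join.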
You might think, what's the problem here? We have a table. Just say select accountNumber from discountCandidates where policyKey is equal to 4-937528734, and we're done. Well, yes, you can write this query, but how will the query be evaluated? That depends on what's called the query architecture of the data integration system. The figure on the left shows the elements of this architecture. We'll cover it in more detail later, but on this slide, we'll just describe the three axes of this cube. The vertical z axis specifies whether we have one data source or multiple data sources. Our interest is in the case where there are multiple data sources. The x axis asks whether the integrated data is actually stored physically in some place or whether it is computed on the fly each time a query is asked. If it is all precomputed and stored, we say that the data is materialized; if it is computed on the fly, we say it's virtual. The y axis asks whether there is a single, global schema defined over all the data integrated for an application, or whether the data stay on different computers and are accessed in a peer-to-peer manner at runtime. Thus, the seemingly simple select-project query will be evaluated differently depending on which part of the cube our architecture implements. But for now, let's return to our example use case. An obvious goal of an information integration system is to be complete and accurate. Complete means no eligible record from the sources should be absent from the target relation. Accurate means all the entries in the integrated relation should be correct. Now, we said on the previous slide that a matching customer is a person who is in both databases and has the same name and address in the two databases. Let's look at some example records, specifically the records marked by the three arrows. The two bank accounts and the policy record do not match on name or on address, so our previous method would discard them.
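The materialized-versus-virtual distinction on the x axis can be sketched with sqlite3: a view stores only its defining query and recomputes the join at query time (virtual), while CREATE TABLE AS stores the precomputed result, a rough single-machine stand-in for materialization. The names and data here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE policyholders (policyKey TEXT, name TEXT);
CREATE TABLE individuals   (acctNumber INTEGER, name TEXT);
INSERT INTO policyholders VALUES ('4-937528734', 'Ann Lee');
INSERT INTO individuals   VALUES (101, 'Ann Lee');
""")

join_sql = """SELECT p.policyKey, i.acctNumber
              FROM policyholders p JOIN individuals i ON p.name = i.name"""

# Virtual: the view stores only the query; the join runs at query time.
conn.execute("CREATE VIEW v_candidates AS " + join_sql)

# Materialized (emulated): the join result is computed once and stored.
conn.execute("CREATE TABLE m_candidates AS " + join_sql)

q = "SELECT acctNumber FROM {} WHERE policyKey = '4-937528734'"
v_rows = conn.execute(q.format("v_candidates")).fetchall()
m_rows = conn.execute(q.format("m_candidates")).fetchall()
print(v_rows, m_rows)
```

Both return the same answer; the difference is when the join work happens, and whether the stored copy can go stale as the sources change.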
But look at the records closely. Do you think they might all belong to the same customer? Maybe this lady has a maiden name and a married name, and has moved from one address to another. Maybe she changed her Social Security number somewhere along the way. This is called a record linkage problem. That means we would like to ensure that the set of data records that belong to a single entity is recognized, perhaps by clustering the values of different attributes or by using a set of matching rules, so that we know how to deal with it during the integration process. For example, we need to determine which of the addresses should be used in the integrated relation. Which of the two bank accounts? If the answer is both accounts 102 and 103, we will need to change the schema of the target relation to hold a list instead of an atomic number, to avoid creating multiple tuples for the same entity. As we saw, the schema mapping process is the task of figuring out how elements of the schemas from two sources relate to each other and determining how they map to the target schema. You also saw that this is not really a simple process, even when we are trying to produce one integrated relation using only a couple of relations from each source. In a Big Data situation, there are dozens of data sources or more, because the company is growing, and each source may have a few hundred tables. So it becomes very hard to solve this correspondence-making problem completely and accurately, simply because the number of combinations one has to go through is really, really high. One practical way to tackle this problem is not to do a full-scale, detailed integration at the beginning, but to adopt what's called a pay-as-you-go model. The pay-as-you-go data management principle is simple: the system should provide some basic integration services at the outset and then evolve the schema mappings between the different sources on an as-needed basis.
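A toy illustration of such matching rules, using Python's difflib for crude string similarity; the rule, the threshold, and the records are all hypothetical, and real record linkage systems use far more careful, often trained, matchers.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """Crude string similarity; a stand-in for a real attribute matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def same_entity(rec1, rec2):
    # Hypothetical matching rule: link two records if either the names or
    # the addresses are similar -- tolerating a name change (marriage) or
    # an address change (a move), but not both at once.
    return (similar(rec1["name"], rec2["name"])
            or similar(rec1["address"], rec2["address"]))

policy  = {"name": "Maria J. Alvarez", "address": "12 Oak St, Springfield"}
account = {"name": "Maria J. Ortiz",   "address": "12 Oak Street, Springfield"}
print(same_entity(policy, account))
```

Here the names differ too much to match, but the near-identical addresses let the rule link the two records anyway.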
So given a query, the system should generate best-effort or approximate answers from the data sources for which perfect schema mappings do not exist. When it discovers a large number of sophisticated queries or data mining tasks over certain sources, it will guide the users to make additional efforts to integrate these sources more precisely. Okay, so how is the first approximate schema mapping performed? One approach is called probabilistic schema mapping. We'll describe it in more detail next. In the previous step, we just decided to create the discountCandidates relation in an ad hoc way. In a Big Data situation, we need to carefully determine what the integrated schemas, also called mediated schemas, should be, and we should evaluate them properly. Since our toy company is trying to create a single customer view, it's natural to create an integrated table called customers. But how can we design this table? Here are some options. We can create the customers table to include individuals and corporations and then use a flag called customer type to distinguish between them. In this mediated schema, the individual's first name, middle initial, and last name, the policyholder's name, and the corporation's name would all map to Customer_Name. Similarly, the individual's full address, the corporation's registered address, and the policyholder's address, plus city, plus state, plus zip would all map to customer address. We can enumerate all such choices of which attributes to group together and map to each single attribute in the target schema. But no matter how you do it, it will never be a perfect fit, because not all these combinations go well together. For example, should the date of birth be included in this table? Would it make sense for corporations? In probabilistic mediated schema design, we answer this question by associating probability values with each of these options.
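The attribute groupings just described can be written down as a simple mapping function. This is a hypothetical sketch of the option in which the name variants map to Customer_Name, the address variants map to Customer_Address, and a customer-type flag distinguishes people from corporations; all field names are illustrative.

```python
def to_customer(source, record):
    """Map a source record into a hypothetical mediated 'customers' schema."""
    if source == "individuals":
        return {"Customer_Type": "person",
                "Customer_Name": f"{record['first']} {record['mi']} {record['last']}",
                "Customer_Address": record["full_address"]}
    if source == "policyholders":
        return {"Customer_Type": "person",
                "Customer_Name": record["name"],
                "Customer_Address": (f"{record['address']}, {record['city']}, "
                                     f"{record['state']} {record['zip']}")}
    if source == "corporations":
        return {"Customer_Type": "corporation",
                "Customer_Name": record["corp_name"],
                "Customer_Address": record["registered_address"]}
    raise ValueError(f"unknown source: {source}")

row = to_customer("policyholders",
                  {"name": "Ann Lee", "address": "12 Oak St",
                   "city": "Springfield", "state": "IL", "zip": "62704"})
print(row)
```

Note what the sketch cannot express: there is no slot for a date of birth, precisely because it would be meaningless for corporation records.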
To compute these values, we need to quantify the relationships between attributes by figuring out which attributes should be grouped or clustered together. Two pieces of information available in the source schemas can serve as evidence for attribute clustering: one, the pairwise similarity of source attributes, and two, statistical co-occurrence properties of source attributes. The first piece of information indicates when two attributes are likely to be similar and is used for creating multiple mediated schemas. One can apply a collection of attribute matching modules to compute pairwise similarity. For example, individual names and policyholder names are possibly quite similar. Are individual names and corporation names similar? The similarity between two source attributes measures how closely the two attributes represent the same real-world concept. The second piece of information indicates when two attributes are likely to be different, and is used for assigning probabilities to each of the mediated schemas. For example, date of birth and corporation name will possibly never co-occur. For large schemas with large data volumes, one can estimate these measures by taking samples from the actual database to come up with reasonable similarity and co-occurrence scores. To illustrate attribute grouping, we take a significantly simplified example. Here we want to create customer transactions as a mediated relation based upon the bank transactions and insurance transactions. Each attribute is given an abbreviation for simplicity. Below, you can see three possible mediated schemas. In the first one, the transaction begin and end times from bank transactions are grouped into the same cluster as the transaction date-time from the insurance transactions, because all of them are of the same type. Similarly, transaction party, that is, who is giving or receiving money, and transaction description are grouped together with transaction details.
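As one crude, hypothetical stand-in for an attribute matching module, here is a Jaccard similarity computed on the word sets of attribute names; real matchers would also use data values, types, and the statistical evidence described above.

```python
def name_similarity(a, b):
    """Jaccard similarity on the word sets of two attribute names."""
    ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
    return len(ta & tb) / len(ta | tb)

# Likely similar: both are transaction timestamps.
s1 = name_similarity("transaction_begin_time", "transaction_date_time")

# Likely different: attributes that should never end up in one cluster.
s2 = name_similarity("date_of_birth", "corporation_name")

print(s1, s2)
```

High scores argue for putting two attributes in the same cluster when generating candidate mediated schemas; a score of zero is weak evidence, on its own, that they belong apart.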
The second schema keeps all of them separate, and the third candidate schema groups some of them and not others. Now that we have multiple mediated schemas, which one should we choose? I'm presenting here a qualitative account of the method. The primary goal is to look for what we can call consistency. A source schema is consistent with a mediated schema if two different attributes of the source schema do not occur in one cluster. In the example, Med3, that is, mediated schema three, is more consistent with bank transactions because, unlike Med1, it keeps TBT and TET in two different clusters. Once this is done, we can count the number of consistent sources for each candidate mediated schema and then use this count to come up with a probability estimate. This estimate can then be used to choose the k best schemas. Should one ever choose more than just one best schema? Well, that's a hard question to answer in general. It is done when the top probability estimates are very close to each other.
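The consistency count and the resulting probability estimates can be sketched as follows. The attribute abbreviations follow the example (TBT/TET for the bank's transaction begin/end times, TDT for the insurance transaction date-time, and so on), but the exact assignment of attributes to sources and clusters is an illustrative guess.

```python
# Source schemas as sets of attribute abbreviations (illustrative).
bank      = {"TBT", "TET", "TP", "TD"}   # begin/end time, party, description
insurance = {"TDT", "TDS"}               # date-time, details
sources = [bank, insurance]

# Three candidate mediated schemas, each a list of attribute clusters.
med1 = [{"TBT", "TET", "TDT"}, {"TP", "TD", "TDS"}]      # group aggressively
med2 = [{a} for a in sorted(bank | insurance)]           # keep all separate
med3 = [{"TBT", "TDT"}, {"TET"}, {"TP", "TDS"}, {"TD"}]  # partial grouping

def consistent(source, mediated):
    """A source is consistent if no cluster holds two of its attributes."""
    return all(len(cluster & source) <= 1 for cluster in mediated)

def count_consistent(mediated):
    return sum(consistent(s, mediated) for s in sources)

counts = [count_consistent(m) for m in (med1, med2, med3)]
probs = [c / sum(counts) for c in counts]
print(counts, probs)  # Med1 is penalized: it merges the bank's TBT and TET
```

Normalizing the consistency counts gives each candidate a probability, and the k schemas with the highest probabilities are kept; when the top estimates are nearly tied, more than one schema is retained.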