Our third and final case applies to most companies that create customer-focused products. They want to understand how their customers are responding to the products, how the product marketing efforts are performing, what kinds of problems customers are encountering, what new features or feature improvements the customers are seeking, and so forth. But how does the company get this information? What kinds of data sources would carry it? The figure shows some of these sources: focused user surveys, emails sent by customers, blogs and product review forums, and specialized groups on social media and user forums. In short, they are on the Internet or in material received through the Internet. Now, how many sources are there? It is hard to say; the number varies all the time, as new sites, new postings, and new discussion threads come up constantly. In all of these, the goal is to identify information that truly relates to the company's product, its features, and its utility.

To cast this as a type of big data problem, we look at a task that computer scientists call Data Fusion. Consider a set of data sources S, as we mentioned on the last slide, and a set of data items D. A data item represents a particular aspect of a real-world entity, which in our case is a product of the company. For each data item, a source can, but will not necessarily, provide a value. For example, the usability of an ergonomically split keyboard can have the value "good". The value can be atomic, like "good", or a set, or a list, or sometimes embedded in a string. For example, "the cursor sometimes freezes when using the touchpad" is a string that carries a value about the touchpad. The goal of Data Fusion is to find the values of data items from the sources. In many cases, the system should find a unique true value for an item. For example, the launch date of a product in Europe should be the same true value regardless of the data source one looks at. In other cases, we may find a value distribution for an item. For example, the usability of our keyboard may have a value distribution. Thus, with Data Fusion, we should be able to collect the values of real-world items from a subset of the data sources. It is a subset because not all data sources will have relevant information about a given data item. There are other versions of what Data Fusion is, but for our purposes we will stick with this general description.

Now, one obvious problem with the Internet is that there are too many data sources at any time, and this leads to many difficulties. First, with too many data sources there will be many values for the same item; often these will differ, and sometimes they will conflict. A standard technique in this case is to use a voting mechanism. However, even a voting mechanism can be complex due to problems with the data sources. One of the problems is to estimate the trustworthiness of a source. For each data source, we need to evaluate whether it is reporting some basic or known facts correctly. If a source mentions details about a rainbow-colored iPhone, which does not exist, its trustworthiness is reduced because of the falsity of the value it provides for this data item. Accordingly, a higher vote count can be assigned to a more trustworthy source, and this can then be used in voting.

The second aspect is Copy Detection. Detecting whether one source has copied information from another can be very important for the data fusion task in customer analytics. If a source has copied information, a discounted vote count can be assigned to the copied value in voting, which means the copying source will have less weight. This is especially relevant when we compute value distributions, because if we treat copies as genuine, independent information, we will statistically bias the distribution. There is active research on how to detect copies, how to determine the bias, and how to arrive at a statistically sound estimate of the value distribution, but to our knowledge these methods are yet to be applied to existing software for big data integration.
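To make the trust-weighted voting idea concrete, here is a minimal sketch in Python. The source names, the claims, and the single known-fact check are all hypothetical; a real system would estimate trust from many known facts and typically iterate between trust estimates and truth estimates.

```python
# A minimal sketch of trustworthiness-weighted voting (hypothetical data).
from collections import defaultdict

# Each source claims a value for the data item "launch_date_europe".
claims = {
    "review-blog.example": "2024-03-01",
    "forum.example":       "2024-03-01",
    "random-site.example": "2024-05-15",
}

# Known facts let us estimate trustworthiness: a source that reports a
# nonexistent rainbow-colored iPhone gets its trust reduced.
known_facts = {"rainbow_iphone_exists": False}
source_reports = {
    "review-blog.example": {"rainbow_iphone_exists": False},
    "forum.example":       {"rainbow_iphone_exists": False},
    "random-site.example": {"rainbow_iphone_exists": True},  # false claim
}

def trust(source):
    """Fraction of known facts the source reports correctly, smoothed."""
    reports = source_reports[source]
    correct = sum(reports.get(f) == v for f, v in known_facts.items())
    return (correct + 1) / (len(known_facts) + 2)  # Laplace smoothing

# Weighted vote: each source contributes its trust score, not one full vote.
votes = defaultdict(float)
for source, value in claims.items():
    votes[value] += trust(source)

print(max(votes, key=votes.get))  # the high-trust value "2024-03-01" wins
```

Here the trusted majority would win even under plain counting, but weighted voting also resolves ties and even-split conflicts in favor of the more reliable sources.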
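A companion sketch shows how discounting suspected copies keeps a value distribution from being biased. The copy probability is assumed to be given here, whereas in practice detecting the copy is itself the hard, statistically involved step described above.

```python
# A minimal sketch of copy-discounted estimation of a value distribution
# (hypothetical sources and an assumed, pre-computed copy probability).
from collections import defaultdict

# Claims about the data item "keyboard_usability".
claims = {
    "survey.example": "good",
    "blog-a.example": "good",
    "blog-b.example": "poor",
    "blog-c.example": "poor",  # suspected near-verbatim copy of blog-b
}

# Probability that a source copied its value instead of observing it.
copy_prob = {"blog-c.example": 0.9}

weights = defaultdict(float)
for source, value in claims.items():
    # A likely copy contributes only the independent fraction of a vote,
    # so duplicated information does not inflate its value's share.
    weights[value] += 1.0 - copy_prob.get(source, 0.0)

total = sum(weights.values())
print({v: round(w / total, 3) for v, w in weights.items()})
# {'good': 0.645, 'poor': 0.355} -- versus {'good': 0.5, 'poor': 0.5}
# if the copy were counted as genuine, independent information.
```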
It should be clear by now that there are two kinds of big data situations when it comes to information integration. The first two use cases that we saw require an integration system to consider all sources, because the application demands it. In contrast, for problems where data comes from too many redundant, potentially unreliable sources, like the Internet, the best results can be obtained if we have a way of evaluating the worthiness of sources before information integration. This problem is called Source Selection.

The picture on the right shows the result of a cost-benefit analysis for data fusion. The x-axis indicates the number of sources used, and the y-axis measures the proportion of true results that were returned. We can clearly see that the plot peaks at around six to eight sources, and that efficiency falls as more sources are added. In a cost-benefit analysis, the cost must include both the human and the computational costs, while the benefit is a function of the accuracy of the fusion result. The technique for solving this problem comes from economics: assuming that costs and benefits are measured in the same unit, for example dollars, researchers have proposed to continue selecting sources until the marginal benefit of the next source is less than its marginal cost. Recent techniques can perform this computation quite scalably; in one setting, selecting the most beneficial sources from a total of one million sources took less than one hour.
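To close, here is a minimal sketch of that marginal-analysis loop in Python. The benefit numbers and the flat per-source cost are invented for illustration; in a real system, the benefit would be a measure of fusion accuracy and the cost would combine human and computational effort, both expressed in the same unit.

```python
# A minimal sketch of source selection by marginal cost-benefit analysis.

def select_sources(sources, benefit, cost):
    """Greedy marginal analysis: keep adding the most beneficial remaining
    source until its marginal benefit drops below its marginal cost."""
    selected, remaining = [], list(sources)
    while remaining:
        base = benefit(selected)
        best = max(remaining, key=lambda s: benefit(selected + [s]) - base)
        if benefit(selected + [best]) - base < cost(best):
            break  # the next source costs more than it contributes
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy setting: each source adds a fixed benefit; every source costs one unit.
gain = {"s1": 5.0, "s2": 3.0, "s3": 1.5, "s4": 0.6, "s5": 0.2}
print(select_sources(gain, lambda sel: sum(gain[s] for s in sel),
                     lambda s: 1.0))  # ['s1', 's2', 's3']
```

The loop stops at the first source whose marginal benefit falls below its cost, mirroring the peak-and-decline shape of the cost-benefit plot. This completes our coverage of the big data integration problems.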