In the previous lesson, we started mapping the existing processes that define how a particular problem is solved in a company or organization. In this lesson, we will pay attention to data.

"Data is the new oil," said the weekly magazine The Economist on its cover in May 2017. Data, in this view, should be treated as a commodity that an organization seeks out, mines, transforms, and uses to generate value for itself and for its customers. While the phrase is catchy and reflects the perception that data is the world's most valuable resource, it has its pitfalls. For one, data, unlike oil, is an endless resource that can be easily shared and accessed by all in a democratic way. This raises the question of data ownership and privacy. Is the data about myself mine, or does it belong to the company that tracks my every move to collect it? In contrast to data surveillance and the datafication of private life, there is a counter movement, the drive towards open data, which is strong in the world of governments and researchers, but less so in the corporate universe.

Another issue with the phrase "data is the new oil" is that it promotes a data-centric worldview. However, not all problems are best solved by data analysis alone. The best solution is not always to run 10,000 variations of an A/B test to find the best design of a web button; it is probably cheaper and faster to hire the services of a graphic UX designer. As we saw when we discussed model-driven AI systems, human expertise and creativity are essential to our understanding of the world and to our problem-solving skills.

Data is valuable, but not intrinsically. Its value comes from the way we use it to help attain the goals of the organization. In other words, data needs to have a purpose. Understanding the possibilities and the limits of data, we can now move to the second question that helps us evaluate the AI needs of a company or organization. This question is: what data is available about the problem?
What historical data do we have access to, and in what condition is it, to help us better understand the problem? If we go back to the digitalization process, the first step on that path was digitization: the move from analog to digital information. For any AI system to work, the availability of data is crucial. This is why digitalization is a precondition for the implementation of any digital technology. While digitization has become a rather straightforward process, one question remains: which data must be tracked, and how much data must be retained, for the AI system to work properly? The answer depends on the problem that needs to be solved and on the characteristics of your organization.

If, in the previous lesson, you identified processes in your organization that are complex and governed by fuzzy rules, then the system of choice is a data-driven AI. In this case, data is essential, as it is the primary ingredient for success, no matter whether problem solving involves classifying documents, detecting patterns of behavior among employees or customers, or finding abnormal activities that signal security breaches. In general, the principle to follow is to keep only the data that is directly relevant to the process you are trying to transform. Collecting everything just in case, or for the long term, is more wasteful than productive.

Data collection within the organization should also be matched by proper mechanisms of data management and data governance. Secure data storage may be expensive, especially as data accumulates. Data may also be the target of external attacks, becoming a weak point in the security of the company or organization. Keeping this data safe from both attacks and deterioration adds another cost to data management. Moreover, issues of data privacy and integrity must also be considered. In the case of model-driven AI, data quantity is less important, but its quality all the more so.
For the model to work, it needs clean and structured data. The rule-based model produces outcomes that modify the original data, so the organization must be prepared to put in place a dynamic data storage where the information is continuously updated. This also carries a certain economic cost and requires data governance just as much as the data-driven approach does. We will talk more in the third week of the course about the risks associated with data quality and data access.

For now, let us go back to the question of making an inventory of the data already available in your organization. In order to locate this data and to evaluate its usefulness for AI implementation, it is helpful to categorize the data according to its format. In this, I follow the star grading proposed by Tim Berners-Lee for the Open Data movement. Even though our focus here is on internal company data and not on public data, the sources and types of data are quite the same.

One-star data is available in unstructured form. Typically, this data is stored in PDF document files such as reports, handbooks, guidelines, and so on. While this data is very rich, covering the history of the organization's activity, it is also hard to access. The data is locked up in a document, and you need special software to extract it.

Two-star data is available in a structured format, but still in a document. This data tends to be stored in tables, such as Excel files, and thus can easily be machine read. However, the data is still locked up in a document and still needs to be transformed into another format to be useful as AI input. To make sense of the data, one needs to have access to the organizing principles of the document.

Three-star data is available in a well-understood structured format. This type of data comes in a format that computers can easily understand, such as comma-separated values (CSV) or a database queried with SQL.
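To make the three-star level concrete, here is a minimal sketch, using an invented sales table (the column names and figures are illustrative, not from the lesson), of how CSV data can be consumed directly with standard tooling, with no special extraction software:

```python
import csv
import io

# Hypothetical three-star data: plain CSV, machine-readable as-is.
# The same records locked in a one-star PDF would first need extraction.
csv_data = """product,units_sold,region
widget,120,north
gadget,75,south
widget,60,south
"""

# DictReader parses each line into a dictionary keyed by the header row.
reader = csv.DictReader(io.StringIO(csv_data))
rows = list(reader)

# The rows are immediately usable as structured input for further processing.
total_widgets = sum(int(r["units_sold"]) for r in rows if r["product"] == "widget")
print(total_widgets)  # 180
```

Note that nothing in the file itself explains what the columns mean; that documentation has to come from somewhere else.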
The organizing principles, however, are still external to the format and need to be accessed and understood separately.

Four-star data is available via a documented API. This is structured data available through a well-documented and well-maintained application programming interface (API). This means that a team of information curators is available to provide information about the principles structuring the data, and that the information is kept updated and secure.

The best quality data is five-star data: data linked across sets to provide context. This data is available via an API, but it is also interlinked with other datasets within the organization. These linkages allow for a better contextualization of the data.

Knowing where to look for data and being able to identify the data quality of each format is helpful in charting the data availability within the company. When deciding where to start with AI implementation, consider beginning with areas covered by the best data. There, the algorithms can really show off their power and performance, and serve as encouragement and proof of the benefits of implementing AI. Now take a moment and consider your specific context. Which types of data can you easily identify in the processes you run? In which format do they come? Do you already have five-star quality data in a specific area of your work?

Measuring your data resources and categorizing their quality and suitability for machine processing serves two primary purposes. It helps you decide where to start your AI implementation, and it tells you in which areas of activity you need to get supplementary data, whether by further digitization, better recordkeeping, or even by purchasing external data. As we have seen, five-star data is linked. For AI systems to work at their best performance, they should be integrated into the larger context in which they operate.
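As an illustration of what five-star linkage buys you, here is a minimal sketch, with invented customer and support-ticket records (the `customer_id` key and all field names are assumptions made up for this example), of joining two internal datasets so that each record carries its larger context:

```python
# Hypothetical five-star scenario: two internal datasets linked by a shared
# key (customer_id), so each record can be read in its larger context.
customers = {
    "c-001": {"name": "Acme Corp", "segment": "enterprise"},
    "c-002": {"name": "Beta LLC", "segment": "smb"},
}

tickets = [
    {"ticket_id": "t-1", "customer_id": "c-001", "issue": "login failure"},
    {"ticket_id": "t-2", "customer_id": "c-002", "issue": "billing question"},
]

# Linking across sets: each ticket is enriched with the customer's profile,
# giving downstream analysis the context that isolated datasets lack.
enriched = [{**t, **customers[t["customer_id"]]} for t in tickets]
print(enriched[0]["segment"])  # enterprise
```

Without the shared key, each dataset would answer questions only about itself; with it, a question like "which customer segment generates the most login failures?" becomes a simple join.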
In the next lesson, we will examine the processes that take place in an organization in parallel with AI implementation, and decide whether they are helping or hindering AI deployment.