Hi, my name is Sam Meiselman, and I'd like to welcome you to this section of the course. In it, we'll talk about databases and data types and how they contribute to your understanding of health science informatics and the uses of data. My personal background is that of a data engineer and database administrator here at Johns Hopkins, and I've also been teaching database techniques to medical students at the School of Medicine for the past eight years. I hope you enjoy this section of the course. So, in this introduction to databases, we'll explore a context you might not have thought of. Many of us have encountered databases in one way or another, but we're going to look at them both from a historical perspective and from an informatics perspective: the importance of databases in the informatics and software stack. So, let's start off with the informatics definition of a database. The first point, obvious but still worth discussing, is that a database is a regular collection of data. Data often doesn't abide by the rules we'd like to set out for it; for example, many people get married and change their names, and we'll explore some of that in a minute. But the data has to be collected in a regular, consistent way, because if data is not consistent, the conclusions you draw from it are fraught with error and mistakes. A database is a computerized record-keeping system, as opposed to a manual one. Manual systems, some of which got extremely complex, are sometimes even more efficient at the intake of information. However, they fail miserably when it comes to any type of analytics or drawing conclusions. A database is a collection of stored operational data and relationships used by the application systems of an enterprise. What does that mean? All this data that is collected has to be in a place, a platform if you will, where people can get at it.
So, an application system of an enterprise, of a hospital, of an insurance company, of anything that might use clinical data, needs to be able to access that central store of data even if it was not the system that collected the data. The relationships are also stored in a consistent manner. For example, in our case, patients and visits: a patient identifier must be consistent across all the visits in order to properly link those visits to that patient. So, that collection of data and relationships must be consistent and stored in a central location. Next, the collection of data we're talking about should be persistent: it should be there day to day, and locating and interpreting your data shouldn't be hit or miss. It should be logically coherent, in that there shouldn't be hidden or implicit rules that you can only understand through tradition; rather, it should be very clear that a patient is a patient, a visit is a visit, a lab result is a lab result, and so on. It should be inherently meaningful as well: it should not be difficult to interpret what the entities in the database are. Sometimes there are situations, and I can think of one example of a database tracking student activity, where the entity wasn't a student but a "student instance" of that particular student, which is a very abstract concept; it's very hard to work with those kinds of entities. So, ultimately, in database design, which is beyond the scope of this discussion, inherently meaningful data and relationships are the goal of a database. Last, it should be relevant to some aspect of the real world; tracking and storing trivial information, or information that may or may not be relevant, has decreasing value.
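The patient-and-visit relationship just described can be sketched with a small example. Here is a minimal sketch using Python's built-in sqlite3 module as a stand-in SQL engine; the table and column names are invented for illustration, not taken from any real clinical system:

```python
import sqlite3

# In-memory database; the schema below is a simplified illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE visit (
        visit_id   INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL,   -- must match patient.patient_id
        visit_date TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO patient VALUES (1, 'Jane Doe')")
conn.executemany("INSERT INTO visit VALUES (?, ?, ?)",
                 [(10, 1, '2024-01-05'), (11, 1, '2024-03-12')])

# Because patient_id is stored consistently in both tables, the visits
# can be linked back to their patient with a join:
rows = conn.execute("""
    SELECT p.name, COUNT(*) AS visit_count
    FROM patient p JOIN visit v ON v.patient_id = p.patient_id
    GROUP BY p.name
""").fetchall()
print(rows)  # [('Jane Doe', 2)]
```

The point is not the syntax but the consistency: if one system recorded the visits with a different patient identifier, the join above could never reunite them with the patient.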
Of course, as storage and databases get bigger and bigger, much more trivial information is tracked, because someday it might be useful; but the people consuming the data should understand what is relevant and how to draw conclusions from it. I'm about to show you some slides on the history of database technology, and just one caveat: as I prepared these slides and lectured about them before, I realized I'm not doing all that history a thorough service. I'm only clarifying it for practical context, to give some sense of how and why this technology evolved; it's not meant to be exhaustive. There are many good sources for a more exhaustive history of computer science that I would direct you to. Why databases? What was the historical, essentially business and enterprise, need that led to the creation of databases? Well, as we mentioned in the informatics discussion, the regular collection of data in computerized record-keeping evolved because paper just can't do it; paper systems ultimately fail. In this picture from the 1940 US Census, you see a number of clerks hard at work filling out file cards. These file cards were tabulated by various machines, which could do some very, very simple counting and aggregation analytics. But essentially that was it, and you had to hire armies of people to tabulate this data; it was actually quite a difficult task. The demand to improve this analytics capability was so great that people paid what in today's dollars are astronomical sums for even simple computations, driving the evolution of the first computing machines. So, at the beginning, these computing machines, evolving really with IBM, International Business Machines, and its counting machines through the 1920s, '30s, and into the '40s, were far more computing devices than storage devices. The technology had not yet evolved to where data could be stored on the device.
So, these types of index cards evolved, where a keypunch machine would encode various types of data on a punch card like this. When I was a kid in the 1970s, I was able to get a tour of some mainframe computers where they were still using large numbers of these cards. I remember stacks and stacks and stacks of these cards going into card readers. The purpose was that, since the computing device had very little storage, the entire computer program, very basic instructions, would have to be loaded first from the cards, and then the data set would be loaded next. So, in today's terminology, think of a laptop computer where the entire operating system had to be installed, the entire software package had to be installed, the entire data set to be computed had to be installed, and then the whole thing was wiped out every single time you used the computer. This is highly inefficient by today's standards, obviously, but it was the best the technology could bear at the time. So, data from the 1940s through the mid '60s was really subordinate to the computation and the application. For example, there was an application during World War II that calculated complex artillery trajectories for the war effort. That had nothing to do with another computer program that might calculate economic data. They shared none of the same data and stored no common context. The entire software had to be loaded every single time, and there was no way to relate data from one application to another; everything was completely siloed. So, in the '50s and early '60s, the need quickly evolved to separate storage from the user application, the actual computation. Now we take this for granted, but allowing common storage for multiple applications was a tremendous leap forward in technology.
For example, the ENIAC machine of the 1940s, before this leap of separating data from computation was made, had no internal system for storing data in memory. Punch cards could be used for external storage; if you asked, "Where is your data?", you would hold up a box of these cards. Input was only possible through a card reader, and an IBM card punch was used for output. So, data in and out was very, very primitive. ENIAC cost what in today's dollars would be hundreds of millions, but the computations it rendered were still helpful at the time. By the mid '60s and '70s, the technology had evolved, as illustrated by the IBM System/360 mainframe, to separate data storage from the application, but in a very, very limited way. The data storage on that massive machine was a whopping seven megabytes, which we in our comfortable modern world might giggle at, because seven megabytes isn't even a song download today, but at that time it was a tremendous advance. So, as leading computer science minds started thinking about the difference between data and applications, a groundbreaking paper was written by Edgar F. Codd. Published in 1970 as "A Relational Model of Data for Large Shared Data Banks," it introduced a model based on mathematical set theory, and that abstraction really set the architectural framework for modern database development. He matured this model, and by 1985 had published his 12 rules for relational databases. His set-theoretic model allowed data to be stored, much as we discussed, in a coherent, persistent manner, but also allowed different data sets to be joined together and accessed as related entities, and that gave birth to many modern technologies that we use frequently, whether we're aware of it or not. The first main implementation of E.F. Codd's concepts was done in 1974 by D. Chamberlin and R.
Boyce: they created the Structured English Query Language. At first it was called S-E-Q-U-E-L, SEQUEL, but the name was later shortened to SQL, Structured Query Language. SQL is still the predominant way databases function, and we'll cover a bit of it in the following slides. By 1986 and 1987, SQL had become so central that both ANSI and ISO published standards for it. So, what were some of the different implementations of SQL, this Structured Query Language? Well, originally in the mainframe world, the IBM platform led to System R and DB2, a database that could be shared across several different applications. As PC computers took off in the early 1980s, a file-based relational model emerged in what was called xBase: dBase, FoxPro, Clipper. These programs were very popular in the 1980s and allowed rapid deployment of database techniques and technology to much simpler end users, at a tremendous cost savings over centralized mainframes. Toward the middle and end of the 1980s, client-server technology evolved, and it is still, for the most part, the predominant technology today. This led to the initial SQL champion, Oracle; Larry Ellison founded Oracle in the late seventies. Then there was a company called Sybase, whose code Microsoft licensed, along with a variety of other events that I won't cover right now. Microsoft launched SQL Server in the late '80s and has become a solid competitor to Oracle. More recently, in the '90s and early 2000s, the open-source MySQL became an implementation of SQL that many websites use today. The latest, or subsequent, technology platform is cloud-based. Many people say we live in a post-relational world, and there are many databases like MongoDB, Amazon's services, and Microsoft Azure; there are many cloud-based services.
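To make concrete what a SQL request actually looks like, here is a minimal sketch, again using Python's built-in sqlite3 module as a stand-in SQL engine; the lab-result table and values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lab_result (patient_id INTEGER, test TEXT, value REAL)")
conn.executemany("INSERT INTO lab_result VALUES (?, ?, ?)",
                 [(1, 'glucose', 95.0), (1, 'glucose', 140.0), (2, 'glucose', 88.0)])

# A SQL query is declarative: you state WHAT data you want,
# not HOW the engine should go about retrieving it.
high = conn.execute(
    "SELECT patient_id, value FROM lab_result "
    "WHERE test = 'glucose' AND value > 120 "
    "ORDER BY value DESC"
).fetchall()
print(high)  # [(1, 140.0)]
```

That declarative, English-like style is the direct legacy of Chamberlin and Boyce's SEQUEL, and the same statement would run largely unchanged on DB2, Oracle, SQL Server, or MySQL.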
There are many more options now for data storage than a purely relational model, but for clinical informatics and clinical data, the relational model is still quite predominant. There are research clouds and research-based implementations with a broader mix of cloud and relational data, but most hospital operations are still based on relational models, at least as of this video. Perhaps if you see this in a few years it will be different. So, let's take a more detailed look at some of the components of a typical relational server setup, for example a SQL Server. These technology components will have a big part to play in any informatics planning project or implementation. In server anatomy you would see these basic components; server anatomy really means the layers of software that are installed on a SQL server. First and foremost, there's the operating system. Speaking about Microsoft, for example, that would be a Windows Server operating system, but there are many other operating systems that support databases: Linux, Unix, and other platforms. SQL Server itself is software installed on top of the operating system; it essentially receives requests for data, structures the data, and returns data back to whoever requested it. Within that SQL Server software, data is organized into what it calls a database. Each database is a collection of objects, primarily tables, along with other code and relationships; as we said in the informatics discussion, the database stores data and its relationships. These relationships are enforced through referential-integrity code and also through implicit rules built into the database. Next, there's a layer of security. As data grows and gets more complex, it acquires more complex security rules. The security rules are not just enforced at the operating system; they're also enforced within the database.
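Referential integrity, mentioned above, means the database itself refuses to store a relationship that points at nothing. A minimal sketch of this, using Python's sqlite3 module with invented patient and visit tables (note that SQLite only enforces foreign keys when the PRAGMA is switched on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE patient (patient_id INTEGER PRIMARY KEY);
    CREATE TABLE visit (
        visit_id   INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL REFERENCES patient(patient_id)
    );
""")
conn.execute("INSERT INTO patient VALUES (1)")
conn.execute("INSERT INTO visit VALUES (100, 1)")   # OK: patient 1 exists

# A visit pointing at a nonexistent patient is rejected by the engine,
# not by application code: that is referential integrity at work.
try:
    conn.execute("INSERT INTO visit VALUES (101, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

The orphan visit never reaches storage, so every application sharing this database can trust that each visit links to a real patient.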
There are logins and different database users and roles, and very complex security can be implemented in a database. The client is the person or process that makes a request to the server; again, it's client-server in this platform. So, a client might be a web page, or Excel, or Google Sheets, something like that, that connects to a database and makes a request of it. Most people have encountered this frequently. If you go to an airline website and type in, "I'd like to see flights to Florida in this particular timeframe" (this is just a theoretical discussion; I'm sure no one has ever done that), the site actually makes a request to a centralized database, gets data back from that database, and then formats it in a web page. That would be an example of a client. Clients can use a variety of common, system-wide connection protocols, such as OLE DB or JDBC, and there are other types as well; they are used by these applications. SQL Server has a client called SQL Server Management Studio in which you can write SQL code, and there are a number of other coding platforms in which you can write your own SQL requests; there are websites and services, as we've just discussed; and then there are tools like Excel or Google Sheets. These can all make requests to the database. So, in summary, we've touched on the context of database evolution from a historical perspective. Databases grew out of the need to centralize and share data among applications and to draw analytics out of that data, to answer questions, to use the data. These points will help you moving forward as we go into more detail about data types and how data is stored. Databases are persistent, they remain there; they are logically coherent and understandable, with inherently meaningful data relevant to some aspect of the real world.
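Before we close, the airline-website client described a moment ago can be sketched in miniature. Here the flights table, flight numbers, and fares are all invented, and Python's sqlite3 module again stands in for the airline's central database; the shape of the interaction, a client binding the user's criteria into a request and getting rows back, is the point:

```python
import sqlite3

# Stand-in for the airline's centralized database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flight (flight_no TEXT, dest TEXT, depart_date TEXT, fare REAL)")
conn.executemany("INSERT INTO flight VALUES (?, ?, ?, ?)", [
    ('JH101', 'MIA', '2024-12-20', 189.0),
    ('JH102', 'MIA', '2024-12-21', 159.0),
    ('JH201', 'BOS', '2024-12-20', 129.0),
])

def search_flights(conn, dest, date):
    """What a web client does behind the scenes: send a request with
    the user's criteria as bound parameters, receive rows back."""
    return conn.execute(
        "SELECT flight_no, fare FROM flight "
        "WHERE dest = ? AND depart_date = ? ORDER BY fare",
        (dest, date),
    ).fetchall()

print(search_flights(conn, 'MIA', '2024-12-20'))  # [('JH101', 189.0)]
```

A real web page would render those rows as HTML, and a production client would connect over a protocol like JDBC or OLE DB rather than to a local file, but the request-and-response pattern is the same.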
Relational databases were designed in terms of relating sets of data together: patients, visits, lab results, et cetera. We spent a little time specifically focusing on client-server topology, which will be, for most of our discussions, the most common platform for relational databases that you'll encounter in clinical informatics. Again, SQL, the Structured Query Language: while this course will not go into actual detail and teach you how to write SQL, we will borrow discussions and concepts from it. Thank you.