[MUSIC] Welcome back. I wanna talk a little bit about data models as a run-up to talking about databases over the next couple of segments. So, this is a data science course, so we know we have data. So, the first question to ask is just where is this stored, how do we store data? And so one way of interpreting this question is just to talk about technology. So, what technology do we use? Well, we use magnetic media, and we might use solid state drives more recently. So, both of these have the property that they persist even when the power goes off; the data is safe even when the power is not on. So, nonvolatile storage. But another way of interpreting the question is a little different, a little more about the logical organization of the data we store. And so you might ask what data model we're using. So, it's not just bits on a disk or bits in a file. What do we do? Well, in your personal computer or maybe even your work computer, you might store data sort of hierarchically, arranged in these kind of nested folders. That's one organization of data. So, the data model here is kind of tree-like. Another way is rows and columns. And this is what we'll talk a lot about in this course. So, in this case, it's an ASCII file, and these are hits from a biological database, matches in a biological database for a particular sequence. And of course, you might have spreadsheets that are a little funny. Maybe they look a little bit like rows and columns, or maybe they don't. So, you have here sort of an embedded table within a spreadsheet, and so on. So, the idea is to think about what data model is being applied whenever you think about data. So, it could be a tree, it could be a table, it could be something like this grid, unstructured, or it could be a graph, and we'll talk a little about that in the future. So in general, what is a data model?
There's gonna be three components that you should remember. One is that there's gonna be some notion of a structure. So, in the case of tables, it's rows and columns. There's gonna be some notion of constraints. What are the legal structures you're allowed to create? So for example, typically, if you think about a tabular data model, all the rows will have the exact same number of columns. That's sort of a constraint. You might also have more constraints on the values themselves, such as every value in this column must be an integer. And you can even have other kinds of more semantic constraints, such as every value in this column must be within a certain range of numbers because it represents, say, days of the year. And then the third one is the operations. And so this is sometimes thought of as independent of the other two, but I really like to call the data model all three of these things: the structures, the constraints that define valid instantiations of these structures, and then the operations that these structures support. So, let's see some examples. So, your structures might be, as we mentioned, rows and columns; nodes and edges if it's a graph model; key-value pairs, which have been popular with the NoSQL movement; or just a sequence of bytes, which might be the structure you have if you're just working with bare files. And the constraints, you might imagine, are: all rows have the same number of columns, as I mentioned; all values in one column must have the same type. For a hierarchical view, you might have: a child cannot have two parents. So this would say, for example, that one file cannot be in two folders at the same time in the data model of your file system. And then, the operations that are supported. Well, maybe you can look up the value given key x. In these key-value pair data models, that's one of the primary operations: if I give you a key, you give me back the value.
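To make the three components a little more concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the `Table` class, its `ranges` field, and the column names `last_name` and `day_of_year` are all hypothetical, made up for this example and not taken from any real database system. The idea is just to show structure, constraints, and operations living side by side, plus a plain dictionary standing in for a key-value store.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """A toy tabular data model (hypothetical, for illustration only)."""

    # Structure: named, typed columns plus a list of rows.
    columns: dict[str, type]
    # Optional semantic constraint: allowed (low, high) range per column.
    ranges: dict[str, tuple[int, int]] = field(default_factory=dict)
    rows: list[tuple] = field(default_factory=list)

    def insert(self, row: tuple) -> None:
        # Constraint: every row has the same number of columns.
        if len(row) != len(self.columns):
            raise ValueError("row has the wrong number of columns")
        for value, (name, col_type) in zip(row, self.columns.items()):
            # Constraint: every value in a column has that column's type.
            if not isinstance(value, col_type):
                raise TypeError(f"column {name!r} expects {col_type.__name__}")
            # Semantic constraint: value must fall inside the declared range.
            if name in self.ranges:
                low, high = self.ranges[name]
                if not low <= value <= high:
                    raise ValueError(f"column {name!r} out of range")
        self.rows.append(row)

    def where(self, column: str, value) -> list[tuple]:
        # Operation: find all rows where a column has a particular value.
        index = list(self.columns).index(column)
        return [row for row in self.rows if row[index] == value]


people = Table(
    columns={"last_name": str, "day_of_year": int},
    ranges={"day_of_year": (1, 366)},
)
people.insert(("Jordan", 48))
people.insert(("Gray", 12))
matches = people.where("last_name", "Jordan")

# Key-value model: the primary operation is "give me the value for key x".
store = {"x": "some value"}
value = store["x"]
```

Notice that an invalid row, say one with too few columns or a day of 400, is rejected at insert time, so the constraints really are part of the model and not an afterthought.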
For a tabular data model, you might say, well, find me all the rows where a particular column has a particular value. In this case, the column last name is equal to the value Jordan. And with a file, there are not too many operations you can think about. It's essentially: get the next N bytes, move to another position within the file. And then you can open and close the file. And that's about all the operations that are supported. So, I think in any case where you see data, especially on nonvolatile storage, you can think about what operations are supported, what constraints there are over the structure, and so on. And this gives you an idea of the data model. So, what is a database? Well, one definition that I think is pretty adequate is this one, which is a collection of information organized to afford efficient retrieval. So, this is a pretty general definition. It doesn't say anything about tables, doesn't say anything about relations, so when you think about a database, don't necessarily assume relational databases. It's perfectly adequate to talk about a database that has nothing to do with relations. But it is not just a pile of data either; it's organized to afford efficient retrieval. So, another view of a database is this idea of a schema. And so, Jim Gray, in this Fourth Paradigm book that we talked about in the eScience segment a little bit ago, has this quote: when people use the word database, fundamentally what they are saying is that the data should be self-describing and it should have a schema. That's really all the word database means. And so, this goes back to this notion of a data model. There's a structure there, and some constraints, and some operations. [LAUGH] And all three of these are things you can intuit by looking at the data itself; it needs to be self-describing. So, if I have a file of data that's organized into rows and columns, somewhere I'm able to inspect which columns it has, and how many rows there are, and so on.
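Here is a rough sketch, in Python, of those two ways of looking at the same file. It is just an illustration: the file name `people.csv` and the sample rows are made up for the example. First the file is treated as a bare sequence of bytes, where about all you can do is open, read N bytes, seek, and close. Then the same file is read as rows and columns, where the header row lets you inspect which columns it has and how many rows there are.

```python
import csv
import os
import tempfile

# Hypothetical sample data, written to a temporary file for the demo.
text = "last_name,first_name\nJordan,Alex\nGray,Sam\n"
path = os.path.join(tempfile.mkdtemp(), "people.csv")
with open(path, "w", newline="") as f:
    f.write(text)

# View 1: a bare file is just a sequence of bytes.
f = open(path, "rb")      # open the file
first = f.read(9)         # get the next N bytes (here N = 9)
f.seek(10)                # move to another position within the file
after_seek = f.read(10)   # get the next 10 bytes from that position
f.close()                 # close the file

# View 2: the same file as rows and columns; the header describes it.
with open(path, newline="") as f:
    reader = csv.DictReader(f)
    columns = reader.fieldnames   # which columns does it have?
    rows = list(reader)           # how many rows are there?

print(first, after_seek)
print(columns, len(rows))
```

The byte-level view knows nothing about columns; it is only the header convention that makes the second view possible.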
I need to be able to understand how to read this data just by looking at the data itself. And so in a database, for example, there will be a catalog, there will be an explicit schema. [MUSIC]
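As one possible illustration of a catalog and an explicit schema, here is a small sketch using SQLite through Python's standard `sqlite3` module. SQLite is my choice for the example, not something named in the lecture, and the `people` table with its `last_name` and `day_of_year` columns is hypothetical. The sketch declares a schema with constraints, runs the "last name equals Jordan" query, and then asks the database to describe itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Explicit schema: structure (columns) plus constraints (types, ranges).
conn.execute(
    """
    CREATE TABLE people (
        last_name   TEXT NOT NULL,
        day_of_year INTEGER CHECK (day_of_year BETWEEN 1 AND 366)
    )
    """
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Jordan", 48), ("Gray", 12)],
)

# Operation: find all rows where last_name equals 'Jordan'.
matches = conn.execute(
    "SELECT last_name, day_of_year FROM people WHERE last_name = ?",
    ("Jordan",),
).fetchall()

# Catalog: the database records which tables exist...
tables = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
]

# ...and which columns each table has, with their declared types.
columns = [
    (name, col_type)
    for _cid, name, col_type, *_rest in conn.execute("PRAGMA table_info(people)")
]

print(matches)
print(tables)
print(columns)
conn.close()
```

Nothing outside the database is needed to learn that `people` has two columns and what their declared types are, which is the sense in which the data is self-describing.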