So the NoSQL movement comes out of these second-generation, highly complex, pure cloud systems like Facebook, and then a whole bunch of startups that wanted to capitalize on this notion to operate at cloud scale. Everyone wanted to have a single application, which is not a bad thing. But the ACID vendors, who had been doing this for 15, 20 years, some from the 60s and on, they were like, wait a second, you know you upstarts think you know something about database. We actually did learn something in 20 years of database systems. And so they saw in 2013 and 2014 that if they sat down and just said, we're better, that they would lose market share, and there's no question about it. I mean, everybody thought NoSQL was the answer to everything in 2013 and 2014. But the problem is also in 2013 and 2014, those who were using BASE-style databases started to complain because every once in a while, they wanted to do a JOIN. And every once in a while, they would want to do something like a transaction, to do a new account or some billing or whatever. And so what happened was is the NoSQL vendors were hearing complaints. And a lot of vendors tried NoSQL and then ended up with sort of negative results or costly results. And so what happened was is you saw like a rush to ACID plus BASE, right? And if you go back to the database scaling, ACID vendors have had kind of a BASE-like variation, with master and read replica. And the technology didn't stand still either in that decade between 2009 and 2019. Amazon now can, I don't know how they do it, but they can sell you a 32-CPU system with a whole bunch of RAM and talk to slow disks really, really, really fast, and you don't have to own it. So if you want to vertically scale an application, you can. It's not just a carpet cluster anymore. Amazon has done amazing things to give you a wide range, a buffet, of wonderful hardware that you can buy. And the change from spinning disk drives to solid state disk drives is changing everything. I've already mentioned the scatter/gather as a notion, right? Like the Google scatter/gather. Disk drives can now do scatter/gather. And so you can, who knows, by the time you're watching it may be more than 32. But some disk drives, you can say, I want this block and this block and this block and this block and this block, bundle that up into effectively in one message, fire it at the disk drive, and the disk drive is like boop boop boop boop boop, as fast as it can give you those things back, you get all those things back. Now think about a database that has blocks and all the stuff we've talked about how databases work. You can ask for 32 blocks. If you have a little bit of in-memory information and you say give me these 32 blocks, it's as if you asked for one block. It's just like holy mackerel. So the ability to scale relational ACID-based databases has greatly improved in that decade. And so it seemed like every time like a Cassandra or a Mongo would be better, you'd just like, well, I just bought new hardware, and my old-fashioned Postgres is pretty fast. And so they also had these ACID-style database vendors reacting. They're like, this is kind of easy. We could add this to our stuff pretty easily. And so Oracle added JSON columns, MySQL added JSON columns. Oracle has some kind of a NoSQL database, I don't know as much about Oracle that I don't want to know. But MySQL 8 really is the addition of really cool and flexible JSON columns. I told you why I'm not using MySQL, because I fear that Oracle will bend away from open source. Postgres is pure open source. Postgres has slowly over the years, 2008 through 2014, have been adding. And now we're lucky here that their JSONB, which is by far their most sophisticated sort of aggregate column, came out in 2014. So it's at least five years old. And if we were teaching this course in 2014, we might like say, well, you might not want to use JSONB. But, I mean, today let's use JSONB. And just as an example of this is Amazon has built a data backend, it's just a data lake, basically, called Redshift. And it's like based on Postgres. It's not based on Cassandra. Cassandra, remember, came from Facebook. But Amazon's like, I'm going to build a data lake, and I'm going to grab Postgres. And another thing that's a little less well known is that a little while back, Amazon got rid of all of its Oracle, presumably replacing it with Postgres. Which means that Amazon's own infrastructure, its own billing and its own tracking, uses in effect 100% open source technology. And I think that greatly benefits we in Postgres. Now, Amazon Redshift is based on 8.0, which is before HSTORE, JSON, and JSONB. But I'm going to guess that what they really did was they advanced some of those features separately from the open source community, which more slowly put them in in a way that we can use. To some degree that, you know, and I'm sure Amazon Redshift is a very efficient master-slave replica kind of a thing, because it's really, the Redshift is aimed at kind of a pulling in lots of data simultaneously and a lot of simultaneous readers. And so they probably just tweaked everything. But you still kind of pretend it's SQL. So it turns out that it has been easier for the old-line ACID vendors to add BASE features than it was for the new BASE folks that started distributed to come up with kind of this non-distributed way. And part of it had to do with the master-slave replica model. So the other thing that's happening is it used be that the reason NoSQL doesn't have so much meaning anymore is SQL does not imply ACID. The concept of begin transaction, select for update, etc., yes, that is ACID. To implement that correctly, you have to have ACID. But just the select and the insert and the update, that's neither ACID nor BASE. That's just a syntax. And so the folks got, there was a feeling like SQL was going to die, and that's why they called it NoSQL. And then after they realized SQL was not going to die and that SQL is probably better for 90 to 95% of the applications, they kind of realized, well, let's just kind of join the SQL movement and have a different set of semantics that underlie it in the BASE. So you can either have SQL with ACID semantics or SQL with BASE semantics. And so the runtime is what changes, right? And so you're seeing a trend towards, yeah, here's this completely distributed database, but we got SQL. But don't do a begin transaction, because that won't work very well. because it's really distributed underneath, right? It's shared and distributed underneath. But it makes it so developers can move back and forth. And so I think most applications really these days should start out as ACID-based applications, using something like Postgres, and then they could move to something. But why not, so everyone's going to learn SQL, which was not a foregone conclusion in 2012, that we might, SQL might become like a dinosaur and go away. But it isn't going to be a dinosaur. So now you might as well take these BASE-style database vendors. And for them to support basic SQL is not all that hard, right? They're supporting a subset, and they're clear on what that subset is. But it makes it easier for developers to go back and forth between these things. And so you can kind of imagine that, you know, you've got some hypothetical thing that it might be like Amazon Redshift. I literally don't know anything about this. But you've got some really big vertically scaled ACID master with a transaction log. And you route the SQL transactions that are going to make changes, SQL statements that are going to make changes to that master. And you just use old-school techniques. If you want to do a multi-master, you want to hide that. And then have a bunch of read replicas. And then you could even have a BASE database living right next door that if you're doing just, if you're saying this is a table that doesn't need any ACIDness, and we'll talk about how you might indicate that a table doesn't need ACID capabilities, you might have a completely separate implementation under the covers for inserts and updates into certain tables where you're saying, I'm willing to tolerate eventual consistency, right? And so you can do this all at the SQL level, where you really have, in effect, three databases in the back, and they're talking to one another. And you just create a table, and sometimes it's stored in a ACID database, and with ACID semantics, and sometimes tables are in a store that has BASE semantics. So here's some examples from a really cool blog post that I recommend that you take a look at. And that's how to think in a BASE-like way when using an ACID database. And this is where I'm going to tell you all the rules that I told you before, like not true anymore. So don't normalize. Replicate, right? Don't use SERIAL. It breaks my heart to even tell you these things. Don't use SERIAL. Don't use auto-increment key, right? Because that's a place of commonality that across 10,000 systems, there's no way you can do that. But that's okay, use GUIDs. You randomly pick them in each server, and they're guaranteed to be different, right? Because they include time stamp and a bunch of other random things. So it's like the likelihood of GUID collision is so low that you just basically create the primary key for your new documents that you're putting in. Fewer columns, not more columns. You have one column called JSONB, which is, you know, a bunch of key-value pairs, which is great. The only, you just like have like a key, an ID, which is a GUID, and then a body, which is JSON. And you'll see some of the stuff I have you do is just like id, JSON. Two things. Now, that doesn't mean you can't have more columns, right? Some columns that are extracts of the JSON. The only time you make a column is for indexing, right? And so now you it's okay to go into Postgres and have exactly two columns, the GUID and the document, or the JSONB document, right? Don't use foreign keys. Or if you do, don't mark them as such, right? Don't, and so you can say this is an integer or this is a GUID that points to. So I'm in like a post, and I want to point to its owner. I could have a column called owner, which is a GUID, not an ID, but don't mark it as a foreign key in the CREATE TABLE. Which means then it doesn't feel it has any need to kind of synchronize across that. It's just a string within the context of this table. And so you have to make sure there's no ON DELETE CASCADE, and there's no ON UPDATE CASCADE, none of that. You don't do that. That's beautifully magic, but it requires ACID for it to work. And so you just don't tell it to do that. You just have some other like process that once you've deleted this thing over here, that like at midnight, it goes and checks to see, it points to a thing that doesn't exist. So can we delete that? And then you delete it, right? So you have maintenance tasks rather than ACID semantics, right? And design your indexes and queries so that you're getting one and exactly one row, right? So it's like document store. There's a lot in that row. You get it all, and then you work with it. And you might say, you might get the row, and then if there's kind of this kind of a pseudo foreign key, you might have to go get that too. And you might do two transactions. But you don't necessarily use a JOIN to do that. You just do two SELECT statements. And that starts to be kind of the semantics of BASE at that point. And you're relaxing what you're demanding of the database. So when you make a SERIAL column, you're demanding something of the database. You're making a contract with the database that the database has got to comply with. But if you say, no, it's just a GUID, and I'll tell you what it is in my application, I'll make it up in my application and tell you, you take that responsibility away. So the database doesn't have to sort of put that lock up, to put that like barrier that says, Oh, sorry. I've got to wait for everybody else. Hang on, everybody. You've got to wait until this one gets through, right? It's kind of like going down from many lanes to one lane. Well, the SERIAL is going from many lanes to one lane, but the GUID is just many lanes going straight through, and so you just don't have to. So you see? And so the the blog post that's so cool is, we used NoSQL, and then we stopped. We used NoSQL, and then we used BASE-style thinking inside of Postgres, and it turned out to be faster and better. And the fact that Postgres handles really large data sets and has for so long is way better than some of these upstart things that just haven't seen scale. And the problem is now, more and more folks are just using something like Postgres as their NoSQL database. They're careful as they design things, but then they release the constraints. And now what's happening is fewer and fewer folks. And like Amazon Redshift goes to, is Postgres, which just blows my mind in a really good way. So things like software migration, don't use ALTER TABLE. Because ALTER TABLE is itself trying to be transactional. So it's like, lock, lock, lock, lock, lock, lock, lock. No, no, no. Just write a thing that loops through and reads all these things and checks to see if the other thing is there. It might take way longer, and it might be loops instead of SQL statements, but then you're not triggering kind of the ACID nature when you don't need it. Query for record by primary key or index column. Don't read the whole thing, because it's a document store. It's gigantic. Don't use, and again, because you're using the index and the indexes are small, like GUID or something, you're going [SOUND] in like three disk hits or something, and you've got it. Don't use JOINs, even if you have to manually retrieve the other documents. And this is what NoSQL does, right? I mean, you're matching the good and the bad of NoSQL. You've got parallel updates, parallel reads. You've got to read more stuff sometimes. You can't just send a beautiful SQL query. And don't use aggregations. Although I am not sure, I think that, I think I would say this, but I don't have any experience to back me up. I would say don't use aggregations other than COUNT, because, don't use aggregations that. COUNT can be done. If I was building it, I could build COUNT using an index-only scan, versus like an average, which actually has to retrieve the data. So somehow, my instinct, with no backing, my instinct says COUNT aggregations are different than a max, min, or average aggregation. But someone smarter than me will have to tell you that. So NoSQL is fine, right? NoSQL is doing great. They're realizing that they're more specialized. And there are times when that fire hose of updates is essential. There is less conversation about the end of the SQL language, and more conversation about BASE-style databases adopting it. There is far less, I would almost say no breathless, like, oh, NoSQL is great, great, great, at least people over 30 don't care about it. There is a learning curve. Any time you pick a NoSQL technology, understand that you're going to have to learn. And if you don't know what you're doing and you're starting a company, you're so much better off starting out with a relational database and then adding NoSQL bits to it, rather than just saying I'm NoSQL because I don't feel like learning how to model data. That's the thing that ticks me off the most, is like, if you don't know how to model data and you're going to choose NoSQL because you don't want to learn how to model data. What I want you to do is I want you to learn how to model data, and then say, this is great, except. And then know that not modeling data when it's the right thing to do. Not just because it's lazy and you don't want to learn it. Because modeling data is so beautiful, elegant, and simple. Sorry. SAAS vendors like Amazon and Google and Azure are really changing the game, because these are multi-tenant natural. The cost can be really much lower than doing your own stuff. And so the idea of hosting your own NoSQL database is, hosting your own database, I'm sorry, your own ACID database, is a normal thing. You know how to manage it, we've done that for decades now. But like figuring out how to make it work, ah, heck, just let Amazon or Google or Microsoft figure that out. Go ahead and Google like move from Mongo to Postgres or move from Mongo to MySQL and you will find that lots of mid-tier pure cloud applications are leaving. Are leaving. They tried it, they built it, they scaled it, and they said, I'm sorry, this just doesn't scale as well as something like Postgres when you don't demand ACID semantics. When you create tables and you use those tables in a BASE-style way, and then you do things like read replicas. So I call your attention to why it is that I picked Postgres for this class. And that is, Postgres is one of the better NoSQL databases, right? And so it, I couldn't have said that in 2012. But now in 2020 and beyond, Postgres is a good choice for NoSQL applications. That doesn't mean it's the choice for all NoSQL applications, or document or key-value store. So Postgres is great. You've learned a lot already about JSON and Postgres and I hope that helps as you go forward making good choices. Cheers. PRODTE9
310
00:17:40,998 --> 00:17:48,498
[MUSIC]