Hello, learners, and welcome back to another installment in our Databases for Data Scientists specialization here on Coursera. This is lesson one of module four in our third course.

So far in our database courses, we've learned a lot about relational databases. We learned how to design them, and we learned how to query them using the Structured Query Language, SQL. Even after more than 40 years, relational database systems still retain a very large portion of the market share for database software; maybe 60-70% of all the databases out there are still relational. So it's not something we can ignore. Relational database software has been around for a long time because it's really good: it works well, and in some cases it's very affordable.

However, there's a problem. You've recently learned about the explosion of big data, and with that explosion we can see that relational database systems really struggle to handle the volume, velocity, and variety of big data. That pretty much states the problem right there: relational databases struggle to handle big data.

When I try to use my relational database system to handle huge amounts of unstructured data, I'm going to run into a lot of questions. How am I going to collect and process all this data? How can I get that processing done quickly, when there's so much data? How can I analyze data that's coming in so fast? Perhaps I need better tools, better software.

But if my organization has already invested thousands, even millions of dollars in building out relational systems, I want to be able to leverage the investment I've already made. I would like to handle big data using the systems I've already invested in, and I'd like to handle it using the existing staff that I've got.
I have perhaps spent years training my people to be very effective with relational database systems. How can I justify the cost of retraining or replacing my staff? Not only that, but if I switch away from relational database systems and start using something new, like NoSQL systems, I'm going to have to rewrite a lot of software, because my application software was written to use relational database systems. If I replace those with a NoSQL database system, I've got to rewrite that application software, which could be very expensive.

The problem can be understood this way: relational database software demands structure. I have to put my data into tables with rows and columns. I've got to define primary keys. I've got to normalize my data and put it in third normal form. I've got to create indexes. And in relational database software, a lot of attention is paid to providing ACID transaction compliance to keep my data consistent across transactions. There's a bit of overhead introduced into my database processing when I'm maintaining ACID transaction compliance: it takes processing to keep my data consistent, and that extra processing slows down my throughput.

So these are some of the challenges I face if I'm going to keep using relational database software with big data. What if my big data is really unstructured? How can I possibly create a table design that's going to handle unstructured data? So much of big data is unstructured, and that's a big challenge for us. And what if I'm involved in a business activity that requires very rapid response time, and I need to maximize the throughput of my database processing because my customers demand speed over data consistency? These are some of the things [LAUGH] we have to think about as we look at this relational problem and how to solve it. So it comes down to this question.
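Before we weigh that question, let me make the ACID idea concrete. Here's a minimal sketch using Python's built-in sqlite3 module; the accounts table, balances, and transfer rule are all invented for illustration, not part of any particular course system. The point is that a transaction either commits completely or rolls back completely, so the data stays consistent.

```python
import sqlite3

# In-memory database; the accounts table and balances are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts atomically: both updates happen, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            # A consistency rule for this example: no account may go negative.
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the transaction was rolled back; balances are unchanged

transfer(conn, 1, 2, 30)   # succeeds: balances become 70 and 80
transfer(conn, 1, 2, 500)  # violates the rule, rolls back: still 70 and 80
balances = dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
print(balances)  # {1: 70, 2: 80}
```

That bookkeeping, starting transactions, checking constraints, undoing failed work, is exactly the extra processing that slows throughput under heavy load.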
Do I keep using my relational database systems, or do I transition over to a newer NoSQL database system that is specifically designed to handle big data? It's a tough call, and it's worth taking a little time to dig in and understand the kinds of decisions database professionals are faced with.

One approach to solving this relational problem is to keep using my relational database systems, so I don't have to hire new people and I don't have to rewrite all my application software, but to enhance those systems so they can better handle the demands of big data. One way to do this is by getting more processing power in my database servers. We call that scaling: I want to be able to expand my database server capacity. I can scale my database servers in two different directions: up or out.

Scaling a server up means I add more CPUs to the server, I may add more memory, and I can add disk storage. In that way I empower the server to handle more data and handle it faster. However, scaling a database server up is costly, because I've got to buy a pretty expensive server to begin with so that it has expansion slots in the motherboard for those extra CPUs and memory. Those servers cost more to start with, and then when I do buy more CPUs or more memory, they're expensive too. So scaling a server up to make it handle more data more quickly can be very costly.

That's why a lot of folks have opted for the horizontal scaling approach, which is scaling out, not up. How do I do that? I add server nodes to a cluster. My database server becomes not a single server but a series of servers that are connected together and work together as one. That's what clustering is all about. If I'm doing clustering, I can add nodes to my database cluster using pretty cheap commodity hardware.
Now, commodity just means something I buy without really caring about the brand; I just want the lowest possible cost. If I choose the horizontal scaling approach and scale out instead of up, the nodes in the cluster all have to be able to talk to each other, and that introduces a little bit of inter-node communication overhead. That's part of the cost of scaling out; I just need to understand that multiple server nodes in the cluster now have to talk to each other. It's something to be considered.

Here, on slide number seven, I've got a picture of what a database server cluster might look like. Let's take a look at it. I've got server node one, node two, node three, node four. Each server has a copy of an operating system, the database software, and a buffer cache, where data from disk is moved into the server's memory and written from there back out to disk. In a scenario like this, where I've created a horizontally scaled four-node server cluster, the four nodes work together as one, and I can quadruple my database processing capacity by spreading the work out over four nodes instead of having it all take place on one database server.

There is a term in this business called linear scalability. Linear scalability asks: as I add more nodes to my cluster, does the time it takes to do my processing shrink proportionally? For example, if I double the number of nodes in my cluster, will the work get done in half the time? Something to think about as we talk about scaling out horizontally and building clusters to handle big data.

We look to Google because they, more than any other organization, have been the pioneers in understanding the ins and outs of clustering in order to handle huge amounts of data. I have a couple of slides here where I'm recommending you go out and watch some YouTube videos.
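The linear-scalability question can be put in numbers. Here's a minimal sketch; the workload size and the per-node overhead figure are invented for illustration, and the model is deliberately simplified. With perfect linear scaling, doubling the nodes halves the time; real clusters fall short of that because of the inter-node communication overhead mentioned above.

```python
def run_time(total_work, nodes, overhead_per_node=0.0):
    """Time to finish a workload spread evenly over a cluster.

    total_work: time the job would take on a single node.
    overhead_per_node: extra coordination cost each added node introduces
    (an invented, simplified stand-in for inter-node communication).
    """
    return total_work / nodes + overhead_per_node * (nodes - 1)

ideal_1 = run_time(100, 1)                          # 100.0 units on one node
ideal_4 = run_time(100, 4)                          # 25.0: perfectly linear, 4x nodes, 1/4 time
real_4 = run_time(100, 4, overhead_per_node=2.0)    # 31.0: overhead eats into the gain

print(ideal_1, ideal_4, real_4)
```

So a four-node cluster that shows true linear scalability would finish in a quarter of the time; the communication overhead is why measured speedups are usually somewhat less than that ideal.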
The links are built into these slides, and there's a copy of the slides as a PDF file out there for your course, so you can open up that PDF file, find these links, and go watch these YouTube videos about Google. It's very, very enlightening to understand the challenges Google faced and how they solved them, and from what Google has done, we've all learned a lot about successful server clustering.

This first video, from 2009, talks about how Google initially started building out server clusters with thousands of nodes in shipping containers. It's very fascinating: each shipping container was one data center filled with thousands of server nodes working together as a cluster. And they faced a lot of problems trying to figure out how to make this work. Like managing heat: as electricity travels through a CPU, it generates a lot of heat. That's why inside a computer there's a little exhaust fan pulling air across the CPU to keep it cool. Multiply that by thousands of nodes in a cluster and you've got a lot of heat that needs to be collected and removed.

Google also had to face questions about server maintenance. If I've got a cluster of many, many nodes and one of those nodes goes down, it crashes for some reason or another, what do we do? Do we send a repair person out to pull that server out, fix it, and put it back in? Actually, no, we don't. The software that does the clustering is smart enough to know that node is down and quit using it. How can we do that? Because we've built in redundancy, so no node is a single point of failure. And they had to figure out power, because the server nodes in a cluster demand a lot of electricity; Google had to figure that out too.

Now, on slide number nine, there are three more links to three more Google videos that are seven years newer than the previous one, and I urge you to go watch these videos.
They're an important part of understanding the challenges one faces setting up database servers as horizontally scalable clusters filled with thousands of nodes that all work together as one. It will be enlightening for you, so go watch those Google videos. From here on, I want to talk a little bit about the techniques we can use to help relational databases handle big data. There are three topics we're going to dive into: replication, parallelization, and sharding.
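As a first taste of one of those techniques before the upcoming lessons, here's a minimal hash-sharding sketch. The node count, key format, and routing function are all invented for illustration; real cluster software handles this routing for you. The idea is simply that each row is sent to one node based on a hash of its key, so every node holds only a slice of the data.

```python
import hashlib

NUM_NODES = 4  # invented cluster size, purely for illustration

def shard_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Route a row to a cluster node by hashing its key (simple modulo sharding)."""
    digest = hashlib.md5(key.encode()).digest()
    # Turn the first 4 bytes of the digest into an integer, then take it mod the node count
    return int.from_bytes(digest[:4], "big") % num_nodes

# Spread some invented customer IDs across the nodes
shards = {}
for customer_id in ["cust-001", "cust-002", "cust-003", "cust-004"]:
    shards.setdefault(shard_for(customer_id), []).append(customer_id)
print(shards)
```

Because the hash is deterministic, any node can compute where a given key lives without asking a central coordinator, which is part of what makes scaling out attractive. We'll dig into sharding properly, along with replication and parallelization, in the lessons ahead.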