Hello and welcome to this course on Big Data Analysis with Scala and Spark. My name is Heather Miller. I'm a research scientist at EPFL, the executive director of the Scala Center at EPFL, and an assistant clinical professor at Northeastern University in Boston. This course is all about taking some of the concepts you've picked up in earlier courses in the Scala specialization, in particular the Functional Programming Principles in Scala course, and applying some of the skills you've learned to massive data sets using a popular framework written in Scala called Spark. So far throughout this series of Scala courses, we've focused on the basics of functional programming. We focused in particular on learning the fundamentals, slowly but gradually building up more and more interesting programs from those fundamentals. This was in courses number one and two, Functional Programming Principles in Scala and Functional Program Design in Scala. Then we moved on to the Parallel Programming course, where we started to focus on the underlying execution of our computations in a parallel setting. Now, in this course, we're going to continue that trend. We're going to begin to think about applying some of these fundamental functional concepts across many machines rather than many processors. We'll also begin to shift our thinking towards a class of applications unlike any we've looked at before. That is, we're really going to start analyzing large amounts of data, which is typically the focus of data scientists. That said, it's important to note that this isn't a machine learning or data science course. Rather than focusing on machine learning algorithms or on designing and tuning models, we will instead focus on how to map some of the functional abstractions that you've learned in previous Scala courses to computations on multiple machines over massive data sets.
That is, we will see first-hand how the functional abstractions we've covered in the previous Scala courses make it easier and more user-friendly to scale computations over large clusters than it is with imperative frameworks and systems for distributed computation. As alluded to earlier, in our programming exercises we're always going to focus on analyzing large data sets. That is, you'll be challenged to think about common data science tasks like k-means functionally, such that they can be adapted to and implemented in the context of Spark, a functionally oriented framework for large-scale data processing that's implemented in Scala. Before we go any further, you might be asking: well, if we're going to be focusing on a lightweight, data-science-y flavor of processing tasks, then why are we bothering with Scala, and why are we bothering with Spark? After all, you might learn data science in the classroom using a statistics professor's favorite languages or frameworks, like R, Python, Octave, or MATLAB. So why should one bother with Scala or Spark, which are both arguably very unlike R, Python, Octave, and MATLAB? The answer is that those languages and frameworks are good for data science in the small: algorithms on data sets that are perhaps just a few hundred megabytes or even a few gigabytes in size. However, once the data set becomes too large to fit into main memory on one computer, it suddenly becomes much more difficult to use one of those languages or frameworks alone. In short, if your small data set grows into a much larger data set, then languages and frameworks like R, Python, MATLAB, etc. won't allow you to scale; you'll need to start completely from scratch, reimplementing all of your algorithms using a system like Hadoop or Spark anyway. Without the help of such a framework, you'd need to manually figure out how to distribute your problem over many machines.
Which is kind of a bad idea if you're not already an expert in building distributed systems. And then there's also this huge, massive industry shift towards data-oriented decision making. Nowadays, many companies across many different industries have realized that by looking more closely at the data they're collecting, from device logs to health or genetic data, they can innovate in ways that were impossible before. For example, now we have all of these devices surrounding us, collecting information and attempting to provide all kinds of insights to enrich our day-to-day lives. Or instead, imagine hundreds of thousands of users of some device, say a smartphone or some wearable. And imagine that, as part of your job, you're responsible for providing some analysis or insight behind all of the data that's collected. Say, providing insights to the smartphone's manufacturer about how the smartphone is operating. It would be nice, for example, if your smartphone manufacturer were able to catch an update glitch before it became a big problem for you. Or an analysis of your level of physical activity relative to the average activity of other users of the same wearable, for example. In both of these cases, the language or framework that you might've learned in a statistics class couldn't be used as you learned it. These are the data-science-in-the-large problems I'm talking about, and they can't be solved on a single compute node alone. And these kinds of problems reach far beyond the tech industry alone; you'll find similar problems in medical research. Lately, there are a number of initiatives focused on developing personalized treatments for disease, for example. You'll also find similar problems in finance, manufacturing, and many other areas of industry. In short, almost every industry has moved towards improving their business or products using some kind of data science, very often data science in the large.
So then let me ask you again: why Scala, and why Spark? Okay, we established that R and MATLAB, as you learn them in school, aren't going to work for these data-science-in-the-large situations that have increasing importance across industries. But by using a language like Scala, it's easier to scale your problem to the large with Spark, whose API is almost one-to-one with Scala's collections. That is, by working with Scala in a functional style, you can quickly scale your problem out from one to tens, hundreds, or even thousands of nodes by leveraging Spark, a successful and performant large-scale data processing framework that looks a lot like Scala collections. So let's start by touching on a few reasons to learn Spark. When it comes to dealing with large data sets, Hadoop is also a popular choice, so why would anybody bother with Spark? Well, there are a few very strong reasons. One, Spark is more expressive. Spark's APIs are modeled after Scala's collections, which means distributed computations in Spark feel like immutable lists in Scala. You can use higher-order functions like map, flatMap, filter, and reduce to build up rich pipelines of distributed computation in a very concise way. Hadoop, on the other hand, is much more rigid. It forces map-then-reduce computations without all of these handy combinators, like flatMap and filter, and it requires a lot more boilerplate to build up interesting computation pipelines. The second reason is performance. By now, I'm sure you've heard of Spark as being super fast. After all, Spark's tagline is "Lightning-Fast Cluster Computing." Performance brings something very important to the table that we didn't have until Spark came along, which is interactivity. Now it's possible to query your very large distributed data set interactively. That's a really big deal. And Spark is so much faster than Hadoop that, in some cases, jobs that would take tens of minutes to run now take only a few seconds.
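To make that concrete, here's a tiny sketch, using invented example data, of the kind of collections-style pipeline just described; the commented-out line is my assumption of how the same pipeline would look on a Spark RDD.

```scala
// A pipeline over a plain Scala List, with made-up example words.
val words = List("spark", "scala", "hadoop")

val lengths = words
  .map(_.length)   // transform each element into its length
  .filter(_ > 4)   // keep only lengths greater than 4

val total = lengths.reduce(_ + _)  // combine the remaining elements
// total == 16 here (5 + 5 + 6)

// On a Spark RDD, the same pipeline would read almost identically, e.g.:
//   sc.textFile("words.txt").map(_.length).filter(_ > 4).reduce(_ + _)
```

The point is that moving from a local collection to a distributed one barely changes the shape of the code.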
This allows data scientists to interactively explore and experiment with their data, which in turn allows them to discover richer insights from their data faster. So something huge, which I feel isn't often mentioned when people talk about the pluses of Spark, is that it really, really improves developer productivity. This is a really big point: Spark improves the productivity of data analysts. And finally, Spark is good for data science. In fact, it's much better for data science than Hadoop, and not just for performance reasons. Iteration is required by most algorithms in the data scientist's toolbox. That is, most analysis tasks require multiple passes over the same set of data. And while iteration is indeed possible in Hadoop, it takes quite a lot of effort: there's a bunch of boilerplate, and you need external libraries and frameworks just to generate a bunch of extra MapReduce phases in order to simulate iteration. Iteration, on the other hand, is downright simple in Spark. There's no boilerplate required whatsoever; you just write what feels like a normal program, which includes a few passes over the same data set. You basically have something that looks like a loop that says, hey, iterate until this condition is met. Which is night and day compared with Hadoop, where it's almost not possible. Another, anecdotal, reason that Spark and Scala are both super interesting to learn at the moment is that Spark and Scala skills are in extremely high demand. If you don't know about it, there's this really cool developer survey that Stack Overflow does every year. The 2016 survey results came out a few months ago, and something really cool happened in these results. Let's go to the technology section of the survey and look at the top paying tech. As we can see, the top paying tech in the US is actually Spark and Scala together.
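As a rough sketch of what "iterate until this condition is met" looks like in practice, here's a small local example; the data and the convergence threshold are invented for illustration.

```scala
// Made-up data; in Spark this would be an RDD you'd persist() in memory
// so that each pass over it stays fast.
val data = Seq(1.0, 2.0, 3.0, 4.0)

// Iteratively move an estimate toward the mean of the data,
// looping until successive estimates stop changing much.
var estimate = 0.0
var delta = Double.MaxValue
while (delta > 0.001) {
  val next = data.map(x => (x + estimate) / 2).sum / data.size
  delta = math.abs(next - estimate)
  estimate = next
}
// estimate converges to the mean of data, 2.5
```

In Hadoop, each pass through a loop like this would be a separate MapReduce job; in Spark, it's just an ordinary loop over an in-memory data set.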
So indeed, Spark and Scala skills are both in high demand. And it's a good thing you've already taken the Functional Programming Principles in Scala course, and now you're taking this Spark course. So what are we going to learn in this course? Well, we'll see how the data-parallel paradigm that we learned in the Parallel Programming course can be extended to the distributed case using Spark. That's where we're going to start in the next few lectures. Then we're going to cover Spark's programming model in depth. We're then going to go into distributing computation: how to do it, and how a cluster is actually laid out in Spark. We'll then move on and spend quite a bit of time learning about how to improve performance in Spark, looking at things like data locality and how to avoid recomputation and, especially, data shuffles. We're going to spend quite a bit of time trying to understand when data shuffles will occur. And finally, we're going to spend quite a bit of time diving into the relational operations available in Spark's SQL module, learning how to use the relational operations on DataFrames and Datasets, all of the benefits they bring you, as well as a handful of limitations. So Spark is popular because it allows programmers to take advantage of massive resources, from a handful to hundreds or even thousands of compute nodes, and it allows programmers to do so in a way where writing distributed programs feels like writing regular programs. As we'll see throughout this course, this detail, writing programs that are distributed but feel sequential, is very powerful. But it can also be the source of many headaches, as we'll discover. Assumptions that you've learned to make as a programmer while working on single machines no longer hold on a cluster. And Spark does its best to make all kinds of smart decisions for us in order to ensure that our jobs run correctly all the way through.
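To give a flavor of the relational style we'll get to later in the course, here's a local-collections analogue; the Purchase records are invented for illustration, and the commented line is my assumption of what the equivalent DataFrame-style query would roughly look like.

```scala
// Invented example records; in Spark these might live in a Dataset[Purchase].
case class Purchase(customer: String, amount: Double)

val purchases = List(
  Purchase("alice", 10.0),
  Purchase("bob", 5.0),
  Purchase("alice", 7.5)
)

// A relational-style aggregation on plain collections:
// total amount spent per customer.
val totals: Map[String, Double] =
  purchases
    .groupBy(_.customer)
    .map { case (customer, ps) => customer -> ps.map(_.amount).sum }

// A Spark SQL DataFrame version would be roughly:
//   purchasesDF.groupBy($"customer").agg(sum($"amount"))
```

The groupBy-then-aggregate shape is exactly the kind of relational operation that Spark SQL optimizes for us, as we'll see.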
But as we'll see, we'll often need to understand a bit more about what's actually happening under the hood in Spark to get good performance out of it. That's going to be a major theme of this entire course. So what are the prerequisites for this course? Well, it builds on the material taught in the previous Scala courses, so it would really be best if you've already taken the previous three Scala courses in this series. Those three courses are Functional Programming Principles in Scala, Functional Program Design in Scala, and Parallel Programming. At minimum, some familiarity with Scala is required, and this course is actually designed to connect to the Parallel Programming course that Viktor Kuncak and Aleksandar Prokopec taught. Many of the concepts that were introduced there, I pick up on and extend to the distributed case. So it's also helpful to at least have some familiarity with the concepts that were taught in the Parallel Programming course. What about books and other resources that could help you out in this course? Well, in the past year or so, several books covering Spark have emerged, and most of them teach Spark in Scala. A good book covering the basics is O'Reilly's Learning Spark, written by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. Matei, who's now a professor, created Spark when he was a graduate student at Berkeley. Another book that does a good job of covering the basics of Spark is Spark in Action, which was published in 2017. It's full of examples, and it's another good resource for getting into Spark for the first time if you're looking for a good book. For a book that goes into more detail on how to achieve good performance, the O'Reilly book called High Performance Spark, currently in development by Holden Karau and Rachel Warren, is an excellent resource.
This book goes into more depth about precisely how Spark executes jobs, and how one can use that knowledge to squeeze better performance out of your Spark jobs, than we'll cover in this course. So if you're looking for more performance, that's a great book to pick up. And finally, for a more data-science-focused look at Spark, O'Reilly's Advanced Analytics with Spark, written by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills, is a good book to pick up. This book is full of example data science applications implemented in a functional style with Spark and many of its major modules, modules like MLlib and GraphX, which we're not going to cover in this course. Further, the authors don't just stop at implementing the core algorithms. They go into detail about how to prepare and clean data, and how to tune models in order to get good results. So it's really a good resource for general data science as well. I highly recommend this book if you're interested in mapping some of the things you learn in this class to concepts and algorithms from machine learning. And finally, if really, really zooming in on Spark is what you're after, there's this book published on GitBook called Mastering Apache Spark 2 by Jacek Laskowski. It covers Spark internals in great detail, and it's constantly updated. It's a work in progress, but it's full of very cool information and all kinds of examples. So if you're really into the nitty-gritty, have a look at this book. As far as tooling goes, there are a couple of tutorials that you're going to find in the getting started section of the first week. If you've taken the other courses, there's nothing new here. The only required tools for this course are some IDE or text editor of your choice, and sbt.
There's another optional tool which you may be interested in using called Databricks Community Edition, which is a hosted version of Spark with an in-browser notebook that lets you interact with it. That means you wouldn't have to set up Spark anywhere; you could just go to a website and play with it. Here's just a quick glimpse of this Databricks Community Edition that I was telling you about. You can get access to this platform for free by going to the Databricks website. Once you log in, this is what you have in front of you: a dashboard where you can open notebooks and create your own. There are all kinds of cool notebooks with lots of example programs, and of course, they let you run your own programs. So this is a really cool platform that you may find useful while learning. Like all of the other Scala courses, this course comes with autograders. There are tutorials in the getting started section of the first week that teach you how to submit your solutions to our automatic graders and how to check your results. This course features three auto-graded assignments that require you to do analyses on real-life data sets. So we didn't give you fake data sets, we gave you real data sets. You can actually play around with them beyond what we ask you to do as part of your assignments in the course, and there are all kinds of interesting insights to find. And that's all I have to say about logistics. So let's just dive in and get our hands dirty with Apache Spark and Scala.