Welcome back. There are a number of ways to connect to remote data. In this lesson, we'll start with connecting to two common data sources: BLOB stores and JDBC connections, which is the common protocol for connecting to databases. BLOB stores are a common way of storing large amounts of data. JDBC is an application programming interface, or API, for Java environments that allows us to query most databases. Spark works well with JDBC, including a functionality called predicate pushdown, where aspects of SQL queries, such as filter operations, can be done by the database itself before transferring the data into the Spark cluster. This saves us a lot of network transfer. By the end of this lesson, you will have read data from a BLOB store and read data in both serial and parallel from a JDBC connection.

Let's take a look at accessing data in Spark. In this walkthrough, we're going to read data from a BLOB store, read data in serial from a JDBC connection, and read data in parallel from a JDBC connection as well. The first thing I'm going to do is attach to my cluster and run this classroom setup script. That just makes sure we have all the connections we need in order to run this notebook.

First, let's talk a little bit about BLOB stores. DBFS, or the Databricks File System, is backed by S3. S3 is Amazon Simple Storage Service. It offers us a scalable storage solution that's the backbone of a lot of different data lakes. In this case, what we want to do is mount an S3 bucket, which allows us to read and write from this highly scalable data source. The way we access this within Spark is by mounting that data.

The first thing I'm going to do is run this line here. This code we can really only execute in Python, but it allows us to define our access keys to AWS and then mount the bucket. Here, this is the actual name of the AWS bucket, and this is the way it's going to appear to us within the Databricks environment. To mount this bucket, once again I have a little bit of Python code. This is just boilerplate code, so you can copy and paste it and fill in your own credentials as needed. Here I'm using dbutils, a Databricks utility library that gives me a number of different functions for accessing my data. Now that this has run, my data should be accessible to me. You can see that I passed in my AWS access key, secret key, bucket name, and mount name. You can refer to the Amazon Web Services documentation to generate these yourself so that you can link to your own AWS S3 bucket.

Next, I can explore this mount. Here I'm going to use the %fs magic command plus ls and then the name of the directory. If I run this, I see that I have three different directories: one called DataFrames, one called fire calls, and one called fire incidents. Fire calls is what we're going to be using for the majority of this course.

In practice, be sure to keep your AWS credentials secret. The credentials that you see on the screen are just generic read-only credentials, so there's really not much harm in sharing them. If you have keys with more permissions associated with them, be sure to keep those secret to make sure you don't leak any of your data.
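To make that boilerplate concrete, here is a minimal Python sketch of the mount step. The access key, secret key, bucket name, and mount name are placeholders rather than the course's actual values; substitute credentials generated from your own AWS account.

```python
# Minimal sketch of mounting an S3 bucket from a Databricks notebook.
# The credentials, bucket name, and mount name below are placeholders.
access_key = "<AWS_ACCESS_KEY_ID>"
secret_key = "<AWS_SECRET_ACCESS_KEY>".replace("/", "%2F")  # URL-encode any slashes
bucket_name = "my-example-bucket"   # hypothetical bucket name
mount_name = "temp-training"        # how the bucket will appear under /mnt

dbutils.fs.mount(
    source=f"s3a://{access_key}:{secret_key}@{bucket_name}",
    mount_point=f"/mnt/{mount_name}"
)

# List the contents of the newly mounted directory.
display(dbutils.fs.ls(f"/mnt/{mount_name}"))
```

Once mounted, the bucket behaves like any other DBFS path, so %fs ls and the standard Spark readers work against it directly.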
You can go ahead and run %fs mounts, and you can see that we now have that dataset mounted. We can also see that we have this /mnt/davis mount here as well; that was handled by the classroom setup script we ran in the beginning. So now we have access to a theoretically unlimited store of data. I'm going to go ahead and unmount this temporary mount so that it isn't hanging around.

The next thing we're going to look at is JDBC reads. JDBC is an application programming interface, or API, that defines database connections in Java environments. Spark is written in Scala, which runs on the JVM, or Java Virtual Machine. Since these are Java environments, JDBC is the preferred method for connecting to data whenever possible, and a lot of different data technologies, including databases, run on the JVM, making access between the two ecosystems quite easy.

Databases have a number of different functionalities. One that's particularly helpful here is called predicate pushdown. Predicate pushdown allows us to push the predicate part of a SQL query, for instance a filter operation, down to the database itself. In this case, we might want to filter out a bunch of data before we bring it into Spark. With predicate pushdown, the only data we transfer across the wire is data that has already been filtered by the database. This allows for significantly more performant operations.

The first thing I'm going to do is run this Scala code. This makes sure that we're using the best Postgres driver, because we are connecting to a Postgres database. The next thing we need to do is create the JDBC string. Here, we're going to create this table called Twitter JDBC, and we're going to pass in the driver. We're also going to pass in a number of different options, and what these options look like really depends on how your database is set up. Here we specify the driver and the JDBC URL; the endpoint of this URL is server1.databricks.training, and this is on port 4532. This is our username and password, and we also have a table associated with this as well. Now we can go ahead and run this code. Here we can take a look at what's actually in this table, and we can see that this is a bunch of Twitter data. We have user IDs, screen names, locations, and a number of other data points.

One other aspect of Spark is that we can also add specific subqueries to this. In this case, if we run this query here, we have an additional subquery where we will only be looking at data from the Philippines. Now, let's take a look at our new Twitter Philippines JDBC table. We can see that the only location we have is the Philippines. This is an example of predicate pushdown, where the subquery was actually passed on to the database, so the only data transferred from the database to the Spark cluster was already prefiltered for data from the Philippines.

So far, what we've seen are serial reads. This creates one connection back to the database and reads our data in serial. Now, let's take a look at how we can parallelize that operation. The first thing I want to do is figure out the general distribution of my data. Here I have IDs that I can work with. This query shows me the minimum ID and the maximum ID.
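As a rough Python sketch of these steps, assuming hypothetical connection details (the host, port, database, schema, table, credentials, and the userID and location column names are placeholders, not the course's actual notebook code), the serial read, the pushed-down subquery, and the bounds query might look like this:

```python
# Rough sketch of a serial JDBC read against Postgres, plus a subquery that
# the database evaluates before any rows reach Spark (predicate pushdown).
# All connection details and column names below are placeholders.
jdbc_url = "jdbc:postgresql://<host>:<port>/<database>"
connection_props = {
    "driver": "org.postgresql.Driver",
    "user": "<username>",
    "password": "<password>",
}

# Serial read of the whole table: one connection, one partition.
twitter_df = spark.read.jdbc(
    url=jdbc_url,
    table="<schema>.<twitter_table>",
    properties=connection_props,
)

# Pushed-down filter: the parenthesized subquery runs on the database,
# so only rows from the Philippines are transferred over the network.
philippines_df = spark.read.jdbc(
    url=jdbc_url,
    table="(SELECT * FROM <schema>.<twitter_table> WHERE location = 'Philippines') AS t",
    properties=connection_props,
)

# Bounds for the parallel read in the next step: the minimum and maximum
# of the column we plan to partition on (assumed here to be userID).
twitter_df.selectExpr("MIN(userID) AS minID", "MAX(userID) AS maxID").show()
```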
The reason I want this is that Spark needs some way to determine what each parallel read is going to be responsible for. The first thing I'm going to do is make sure I drop Twitter parallel JDBC if it exists, and then I'm going to run this command here. As you can see, this looks very similar to the query we ran a moment ago. However, I'm passing a few additional parameters into this query: a lower bound, an upper bound, and a number of partitions. The lower bound is the lowest expected value of the column user ID, the upper bound is the highest value there, and the number of partitions designates the number of parallel reads. You can see that I denoted the partition column here. The reason I'm using quotes like this is to ensure that the capitalization I pass to the database is preserved. Otherwise, Spark might drop the capitalization and assume that I really just want a lowercase user ID, in which case I wouldn't find the column I'm looking for.

Now, let's go ahead and do a select star. This looks very similar to the serial read that we just had, but let's compare how the two actually perform. This is a little bit of Python code that allows me to compare the performance of these two different read operations. I'm going to run both of these cells and then we can compare the results. The first cell reads from my Twitter JDBC table; this is my serial read. The second cell is my parallel read. I ran each of these three times. In the serial read, the best run took about 4.78 seconds per loop. In the parallel read, it took about 4.22 seconds per loop. Now, this is an improvement, but it's not a drastic improvement. As the size of my data continues to grow, the benefits of using parallel reads rather than serial reads are going to become increasingly pronounced.
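Continuing the placeholder sketch from the serial example above, a parallel JDBC read and a simple timing comparison might look like this. The column name, bounds, and partition count are illustrative values only; in practice you would take the bounds from the MIN/MAX query shown earlier.

```python
# Rough sketch of a parallel JDBC read, reusing jdbc_url, connection_props,
# and twitter_df from the serial sketch above. Bounds and partition count
# are illustrative placeholders.
parallel_df = spark.read.jdbc(
    url=jdbc_url,
    table="<schema>.<twitter_table>",
    column='"userID"',        # quoted to preserve capitalization in Postgres
    lowerBound=1,             # lowest expected value of the partition column
    upperBound=1000000,       # highest expected value of the partition column
    numPartitions=8,          # eight concurrent connections to the database
    properties=connection_props,
)

print(parallel_df.rdd.getNumPartitions())  # should report 8

# A simple way to compare the two reads, similar in spirit to the timing
# cells in the notebook.
import time

for name, df in [("serial", twitter_df), ("parallel", parallel_df)]:
    start = time.time()
    df.count()  # forces a full read from the database
    print(name, round(time.time() - start, 2), "seconds")
```

The partition column, lower bound, upper bound, and number of partitions must all be supplied together; Spark uses them to split the ID range into one query per partition, which is where the parallelism comes from.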