Welcome back. On the job, you'll see a number of different file formats and compression options. You're likely most used to seeing CSV files, which are row-based data files. The ideal file format in distributed systems is Parquet. Instead of being row-based, Parquet is column-based, and there are a number of performance advantages to that. In this lesson, you will examine both of these as well as different compression options. By the end of this lesson, you'll be able to determine the best options to use under a variety of conditions.

In this lesson, we're going to compare different file formats and compression types. We're also going to take a closer look at Parquet. Parquet is the de facto standard file format for big data environments because it allows for good compression and parallel reads and writes.

The first thing I'm going to do is attach to my cluster and run this classroom setup script. Now, let's take a look at a colon-delimited file sitting on S3. This is our file path here, so we can use %fs ls to confirm that it's there and take a look at its size. This size is in bytes, so our file is about 1.8 gigabytes. Now, let's take a look at the first few lines of this file. Here, we can see that we have our header at the beginning and a colon in between each of our values.

Now, let's go ahead and create a temporary view. Let's call it fireCallsCSV. We're going to pass it the path, set header to true, and give it this separator to tell it that even though this is a CSV, it's actually a colon rather than a comma delimiting our values. Now, let's take a look at the data types to see how we imported this. You can see the column names, and the data types are all strings. This can be problematic, so let's take a look at how to resolve it. Since we have all string types here, we can pass in inferSchema and set it to true instead.

Now, we sped the video ahead quite a bit because this took a lot longer to run than our last command. If we scroll up to where we didn't infer the schema, that command took about half a second to run, and notice that only a single Spark job executed. If we scroll down, this command took about 51 seconds to run when we were inferring the schema. We also notice an extra Spark job here; that was the job actually doing the inference over our schema. Now, if we take a look, we see a number of different data types, including ints and strings. So it took a little while to figure out what the schema was for this file.

Let's try the same thing with a couple of different compression formats, Gzip and Bzip in particular. We expect Bzip to be the most compact, but let's make sure. Let me run this first, and then let's take a look at each of these files and see how large they are. As we noted before, our colon-delimited text file is about 1.8 gigabytes. The Gzip file is quite a bit smaller. Now, let's take a look at Bzip. The Bzip file is even smaller than the Gzip file.

Now, let's take a look at how they perform when we read these files. Here, when I read the Gzip file, I can just pass in the path; Spark knows how to deal with that Gzip compression format. That took about 47 seconds to run. Depending on your Spark cluster, it might take a little bit more or a little bit less time than the uncompressed file.
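If you'd like to follow along outside the course notebook, here is a minimal PySpark sketch of the two CSV reads described above. The file path is hypothetical, and spark is assumed to be an existing SparkSession (as it is in a Databricks notebook); the lesson's own notebook cells may look different.

```python
# Hypothetical path to the colon-delimited file on S3.
csv_path = "s3a://my-bucket/fire-calls.csv"

# Without inferSchema: fast, but every column comes back as a string.
df_strings = (spark.read
              .option("header", "true")
              .option("sep", ":")
              .csv(csv_path))
df_strings.printSchema()

# With inferSchema: Spark runs an extra job that scans the data to guess
# column types, which is why this read takes far longer.
df_typed = (spark.read
            .option("header", "true")
            .option("sep", ":")
            .option("inferSchema", "true")
            .csv(csv_path))
df_typed.printSchema()
```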
But regardless, it took quite a bit of time to infer that schema. Even though the file takes up less storage space, there's still a lot of computation we have to run across it. Now, if we want to take a look at how we actually imported this, let's look at the number of partitions we have; this is the number of different segments our data is broken into. With Gzip, we just have a single partition.

Now, let's compare that with the Bzip file instead. In this case, we can see that Bzip is taking a little bit longer than Gzip. Bzip is a splittable format; however, here its read time is about on par with what we're seeing with Gzip. When we take a look at the number of underlying partitions, we see that we actually have eight partitions. So generally speaking, Bzip is going to be the preferred choice over Gzip, even if we see some variance in these read times.

Now, let's go ahead and compare this back to Parquet. Here, I'm going to read from a single-partition Parquet file. When I call describe on it, you can see that I already have the data types I'm looking for. This is because Parquet captures some of the metadata associated with my data. Parquet was also a lot faster, for a number of reasons. One is that it's able to compress our data, so we have less data actually moving across the network. It also stores enough metadata that we don't have to worry about schema inference, which was a big part of the cost of reading this data.

Now, let's compare the performance of these four file types. Here, we're going to use this timeit function to compare the four. Now that these queries have finished running, we can compare the different options. You can see that Parquet is the fastest at about 7.6 seconds. Using a flat CSV file instead, our query took about 16 seconds. Gzip took about 25 seconds, and Bzip was just a little bit faster at about 24 seconds.

Now, let's talk a little bit more about what Parquet is. Parquet is a columnar storage format, which means it's column-based rather than row-based. You can see here that a standard CSV just uses a row format; what Parquet allows us to do is use a column format instead. This is particularly helpful for a number of reasons. One of them is compression: say, for instance, that in this ID column we have the same IDs repeated over and over again. Instead of storing the actual ID each time, we can just store a pointer denoting that that ID exists multiple times across our data. This is one of the ways Parquet compresses our data. Another reason we want to use Parquet is that it's highly splittable: Spark can read and write Parquet in parallel. So when we're actually saving to Parquet, what that looks like on S3 is usually a single directory consisting of a number of smaller files. If you take a look at the underlying data under this fireCallsParquet directory, you'll see that it's actually split.

But let's compare the performance between these different options. Here, if we compare CSV to Parquet, the CSV read took about 15 seconds and the Parquet read took about one second. Let's see if we can get this any faster. Here, we're going to use a Parquet file that was partitioned into eight different segments. You can see a number of different pieces of metadata about how that file was actually written, and you can see the different parts of the file here.
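As a rough illustration of the splittability point above, the sketch below checks how many partitions each compressed read produces and then reads the Parquet copy for comparison. The paths are hypothetical, and spark is again assumed to be the notebook's SparkSession.

```python
import time

# Hypothetical paths to compressed copies of the same data.
gzip_path    = "s3a://my-bucket/fire-calls.csv.gz"
bzip_path    = "s3a://my-bucket/fire-calls.csv.bz2"
parquet_path = "s3a://my-bucket/fire-calls.parquet"

# Gzip is not splittable, so the whole file lands in a single partition;
# Bzip2 is splittable, so Spark can break the read into several partitions.
for label, path in [("gzip", gzip_path), ("bzip2", bzip_path)]:
    df = (spark.read
          .option("header", "true")
          .option("sep", ":")
          .option("inferSchema", "true")
          .csv(path))
    print(label, "->", df.rdd.getNumPartitions(), "partition(s)")

# Parquet stores its schema in the file footer, so there is no inference pass.
start = time.time()
parquet_df = spark.read.parquet(parquet_path)
parquet_df.count()
print("parquet read + count took about", round(time.time() - start, 1), "seconds")
print(parquet_df.schema.simpleString())
```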
This allows Spark to make multiple connections to S3 in parallel for both reads and writes. Now, if we create a temporary view from this partitioned Parquet file and run that same query, let's see how fast it is. It cut down from a little over a second to 0.9 seconds, so that's a pretty decent improvement, relatively speaking.

The rest of this lesson covers a number of different file types that you'll see on the job. I'll leave that for you to read over on your own.
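For reference, here is a minimal sketch of how a partitioned Parquet copy like the one in this demo might be produced and queried. The paths and the view name fireCallsParquet8p are hypothetical, and spark is assumed to be the notebook's SparkSession.

```python
# Hypothetical paths: a single-partition Parquet source and a destination
# for the eight-way partitioned copy.
parquet_path     = "s3a://my-bucket/fire-calls.parquet"
partitioned_path = "s3a://my-bucket/fire-calls-parquet-8p"

# Repartition before writing so Spark produces eight part-files,
# which it can later read back over eight parallel connections to S3.
df = spark.read.parquet(parquet_path)
df.repartition(8).write.mode("overwrite").parquet(partitioned_path)

# Register a temporary view over the partitioned copy and query it.
spark.read.parquet(partitioned_path).createOrReplaceTempView("fireCallsParquet8p")
spark.sql("SELECT COUNT(*) FROM fireCallsParquet8p").show()
```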