Hi! In this lesson, I want to discuss why we might want to use the JSON file format, and how you use it in Spark. In the process, we'll cover schemas and types, two important concepts in distributed computing. By the end of this lesson, you'll be able to choose the best use cases for JSON, in addition to applying schemas to JSON data. In this walkthrough, we're going to take a look at schemas and types, and we're going to focus on JSON data, which is pretty common in big data applications.

First, let's start with why schemas matter. Schemas are really at the heart of data structures within Spark. So what is a schema? A schema describes the structure of your data: it names the columns and declares the types of the data in those columns. Spark is really fast for a couple of different reasons. First, it does in-memory computation, which is a lot faster than reading and writing to disk. But Spark also knows what types of data you're working with, and that allows for a number of different optimizations under the hood. So schemas are really at the heart of optimization within Spark.

Now, let's take a look at JSON data. A lot of the data you've worked with in the past is tabular data, that is, data arranged into columns and rows. That's really common in CSV files and in relational databases as well. Semi-structured data is more common in big data environments. One of the benefits of semi-structured data is that it allows your schema to evolve over time: we don't need to do all of the upfront work of creating a relational model, and then have to update our tables later on if our schema does change.

So first, let's take a look at what JSON data looks like. I have this file here called fireCalls truncated, which is just the JSON version of the data we've already been working with in the rest of this course. Let's take a look at the first few lines. Here you can see that this looks a lot different from a normal CSV file. Instead of a number of different columns and rows, we have key-value pairs. For instance, we have the call number and the corresponding number. Each record doesn't need to include every key-value pair, and that's what allows the schema to evolve over time. One of the other benefits of JSON data is that it allows for more complex types. We'll talk about this in a moment, but beyond just primitive types, we can have arrays and maps nested within our JSON data as well.

So first, let's read the JSON file. I can go ahead and create the table using the JSON command, and I can just pass in the path of the file. Now let's describe it to see what it looks like. Here you can see that Spark imported the data types as well, and you can see that this command took about three and a half seconds to run. That's in large part because Spark is inferring our schema: it's scanning the data to figure out which types are actually in it, so that it can do some optimizations later on. Finally, let's just take a quick look at our table; you can see that this looks very similar to the other ways we've imported data in the past.

Now let's talk about user-defined schemas. In this case we provide the schema upfront, so that Spark doesn't have to do the work of figuring out which data types we're working with. Schema inference is a costly operation, especially as our data gets larger and larger. This is the syntax that we're going to be using, as shown in the sketch below.
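Here's a minimal sketch of what that could look like in Spark SQL; the table names, the file path, and the column names other than Call Number are placeholders rather than the course's actual values.

```sql
-- 1. Let Spark infer the schema: it scans the JSON file, which is the extra
--    work that made the command take a few seconds in the walkthrough.
CREATE TABLE IF NOT EXISTS fireCallsJSON
USING json
OPTIONS (path "/path/to/fireCallsTruncated.json");

-- Inspect the column names and the types Spark inferred
DESCRIBE fireCallsJSON;

-- The resulting table reads like any other
SELECT * FROM fireCallsJSON LIMIT 10;

-- 2. Provide the schema upfront so Spark can skip inference entirely
CREATE TABLE IF NOT EXISTS fireCallsUserSchema (
  `Call Number` INT,
  `Unit ID`     STRING,
  `Call Type`   STRING
)
USING json
OPTIONS (path "/path/to/fireCallsTruncated.json");
```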
I define the column names and the data types within those columns. Now you can see that this took only 0.12 seconds to run, because Spark doesn't have to do all of the computation to figure out what types are in my data. When we define these schemas ourselves, we can also use non-primitive types. The most common non-primitive types are arrays and maps. You can think of an array as a list of values in a single column, rather than just one individual int or string. Maps allow for key-value pairs rather than a list; there's a sketch of both at the end of this lesson. The important takeaways from this lesson are that, when we're working with different data types, providing the schema upfront avoids extra jobs, and providing an accurate schema allows for a number of different optimizations under the hood.
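To make arrays and maps concrete, here's a hedged sketch of a schema that declares non-primitive column types; the table name, path, and columns below are hypothetical and not part of the course dataset.

```sql
-- Hypothetical nested JSON source with one array column and one map column
CREATE TABLE IF NOT EXISTS fireCallsNested (
  `Call Number` INT,
  `Unit IDs`    ARRAY<STRING>,         -- a list of values in a single column
  `Attributes`  MAP<STRING, STRING>    -- key-value pairs in a single column
)
USING json
OPTIONS (path "/path/to/fireCallsNested.json");

-- Array elements are accessed by (0-based) index, map values by key
SELECT `Call Number`,
       `Unit IDs`[0]             AS firstUnit,
       `Attributes`['battalion'] AS battalion
FROM   fireCallsNested;
```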