We're going to cover some advanced features of Delta Lake, including how to partition our table by given columns for faster query performance, as well as how to evolve the schema of our table. We'll also look at my absolute favorite feature of Delta Lake, the ability to travel back in time. And lastly, we'll delete records that we no longer need.

As always, let's get started with our classroom setup. Then we'll go ahead and create a temporary view called fireCallsParquet, as we did in the previous lesson. This time, when we create our Delta table, we're going to make one tweak: we're going to add in the line PARTITIONED BY (City). You'll notice that City is in parentheses because you can actually specify multiple columns to partition by, for example, year and month. What this partitioning allows us to do, if it's a column we typically filter on, is skip scanning every single record to find those values. We can instead go directly to the directory that contains those values, significantly speeding up our query time.

Now, you're not actually required to partition the columns of your Delta table. But if you do decide to partition, there are two general rules of thumb for deciding which column to partition by. The first is the cardinality of the column, that is, the number of unique values in that column. For example, if you try to partition based off of incident ID, something that is unique for each record, you're going to end up with lots and lots of small files. That's not actually going to give you any speedup, because of the overhead of all those small files. The other thing to consider is the amount of data in each partition. You should only partition by a column if you expect the data in each partition to be at least one gigabyte. Now, our entire dataset isn't even one gigabyte, so we're not actually going to see any real speedup from partitioning by city. But we want to demonstrate this technique so you can apply it to much larger datasets.

Let's go ahead and create this Delta table, now partitioned by city. Once this command completes, we can take a look at how the underlying files are partitioned on disk. Here we can see that there is a different folder for each of the different cities. So if we want to filter for just one city, for example Daly City, Spark only needs to read the contents of that one directory. It doesn't need to go through all of the other directories to find the content for Daly City.

Delta Lake enforces a schema on our dataset and performs schema validation to ensure data quality. It uses a technique called schema validation on write, which means that all new writes to the table are checked for compatibility with the target table's schema at write time. If they're not compatible, it will throw an exception.

So let's go ahead and take a look at the schema of our table. We can see that we have these various columns and data types. One column that I've really wanted to change is this Neighborhoods - Analysis Boundaries column. I want to go ahead and create a new column called Neighborhoods to add to our data. So I'm going to use the INSERT OVERWRITE TABLE command, where we're going to keep all of our columns, but we're going to create a new column called Neighborhoods. And unfortunately, we get a schema mismatch error. This is actually a good thing.
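To make these steps concrete, here is a minimal Spark SQL sketch of the commands described so far. The table and view names follow the lesson, but the exact source column name for the neighborhood field is an assumption, so check the schema of your own dataset:

```sql
-- Create a Delta table from the Parquet-backed view, partitioned by City.
-- Multiple partition columns could be listed, e.g. PARTITIONED BY (year, month).
CREATE TABLE fireCallsDelta
USING DELTA
PARTITIONED BY (City)
AS SELECT * FROM fireCallsParquet;

-- A filter on the partition column only reads the matching directory on disk
SELECT * FROM fireCallsDelta WHERE City = 'Daly City';

-- This write fails with a schema mismatch, because Neighborhoods is a new
-- column that is not in the target table's schema.
-- (`Neighborhoods - Analysis Boundaries` is an assumed source column name.)
INSERT OVERWRITE TABLE fireCallsDelta
SELECT *, `Neighborhoods - Analysis Boundaries` AS Neighborhoods
FROM fireCallsParquet;
```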
It's trying to tell us, hey, you're writing new data to this Delta table, but the schema is not the same as what we expect. What we can do is tell it to automatically merge the schemas if they are compatible, by setting the schema autoMerge configuration. Now let's try the same command again. Once this finishes, we'll have a second version of our Delta table, version 1. And if we scroll to the right over here, we can see our new column, Neighborhoods.

However, we can always travel back in time to an earlier state of the Delta table by specifying either a version number or a given timestamp. In this case, let's go back to version zero by specifying VERSION AS OF 0. If we take a look at the data in this version, we won't see the Neighborhoods column all the way to the right, because it didn't exist at version zero. However, if we specify VERSION AS OF 1, aka the latest version, and scroll to the right, we should see Neighborhoods present in this table. And there it is, our new Neighborhoods column.

The last feature I want to highlight is the ability to delete individual records. This is becoming more and more important with laws such as GDPR in Europe or CCPA in California, which can require the ability to delete individuals' records. Just for the sake of example, let's say we want to delete the record corresponding to incident number 14055109. Let's take a look at what this record looks like. You can see we have a single record here. To get rid of this record, we can simply call the DELETE command. What this will do is go in and delete this record wherever it shows up. It doesn't actually need to read in the entire dataset and write it back out; it's able to modify the fewest files possible, rewriting just one file for this delete operation, because this record only showed up within one file. And now, when we try to select from our fireCallsDelta table where incident number equals 14055109, it no longer exists, because we have successfully deleted it from our Delta table.

So in this notebook we've covered how you can create Delta tables partitioned by columns, how you can time travel, how you can evolve the schema, and how you can delete records that you no longer want to keep.
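For reference, here is a sketch pulling together the remaining commands from this part of the lesson, under the same assumptions as before (table named fireCallsDelta; the incident-number column name is a guess, so match it to your schema). The SET key is Delta Lake's schema auto-merge configuration:

```sql
-- Allow new, compatible columns to be merged into the table schema on write
SET spark.databricks.delta.schema.autoMerge.enabled = true;

-- Re-run the same overwrite; it now succeeds and creates version 1.
-- (`Neighborhoods - Analysis Boundaries` is an assumed source column name.)
INSERT OVERWRITE TABLE fireCallsDelta
SELECT *, `Neighborhoods - Analysis Boundaries` AS Neighborhoods
FROM fireCallsParquet;

-- Inspect the table's version history
DESCRIBE HISTORY fireCallsDelta;

-- Time travel: version 0 predates the Neighborhoods column; version 1 has it
SELECT * FROM fireCallsDelta VERSION AS OF 0;
SELECT * FROM fireCallsDelta VERSION AS OF 1;

-- Delete a single record, then confirm it is gone (the second query returns no rows)
DELETE FROM fireCallsDelta WHERE `Incident Number` = 14055109;
SELECT * FROM fireCallsDelta WHERE `Incident Number` = 14055109;
```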