Welcome back. We'll continue lesson one and the introduction to HDFS. Next we look at the performance envelope of HDFS: how to determine the number of blocks for a given file size, the key HDFS and system components that are affected by the block size, and the impact of using a lot of small files on HDFS.

Recall from the first video of this lesson that HDFS works by distributing the data in a file over lots of datanodes. In HDFS the default block size is 64 megabytes, so a 10GB file gets broken up into 160 blocks. Right away you can see that performance depends on how big your files are. If you have a large file, you'll use all of the blocks optimally and be in good shape, but if you have lots of small files it can cause problems, as we'll see in the upcoming slides.

So why is the number of blocks important? The first consideration is NameNode memory usage. Every file can translate to a lot of blocks, as we saw in the previous case with 160 blocks, and if you have millions of files, that's millions of objects. Each object uses a bit of memory on the NameNode, so that is a direct effect of the number of blocks. On top of that, with replication you'll have 3X the number of blocks, and therefore more storage used. If you recall from the earlier design discussion, the NameNode keeps all of this metadata in memory, so the number of blocks has a direct impact on the amount of memory the NameNode uses.

The other thing that's impacted is the number of map tasks. Recall from the earlier modules on the MapReduce framework that the number of maps typically depends on the number of blocks being processed. That's the other reason the number of blocks in a file matters.

Now, what does this mean if you have a lot of small files? Let's take the memory usage example.
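The block-count arithmetic above can be sketched as a small helper. This is just an illustrative calculation, assuming the older 64 MB default block size mentioned in the lecture (newer Hadoop releases default to 128 MB):

```python
import math

def num_blocks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Number of HDFS blocks needed to hold a file of the given size.

    The last block may be partially filled, hence the ceiling division.
    """
    return math.ceil(file_size_bytes / block_size_bytes)

# A 10GB file with 64MB blocks breaks into 160 blocks, as in the lecture.
print(num_blocks(10 * 1024**3))  # -> 160

# A tiny 32KB file still consumes a whole block entry in the NameNode.
print(num_blocks(32 * 1024))     # -> 1
```

Note that a small file does not waste 64 MB of disk (blocks only occupy their actual length), but each file still costs at least one block object in the NameNode's memory, which is what the next part of the lecture quantifies.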
Typically, the NameNode uses around 150 bytes per object, and each file needs at least two objects: one for the file itself and one for its block. So with a billion small files, that's two billion objects, or roughly 300GB of memory; you can see this can become an issue. In addition, the NameNode is constantly checking with the datanodes on the status of these blocks and receiving block reports from them. So if you have a lot of blocks, you're going to have a huge network load as a result, with all the datanodes sending a lot of block information. In both cases, you're essentially stressing the system limits.

What does this do to the number of map tasks? Say you have 10GB of data to process, stored as lots of 32KB files. You will end up with around 327,000 map tasks. Now, a typical cluster might have 100 or 1,000 nodes, with maybe four or eight map slots per node, so you can see this will end up with a huge queue of waiting tasks. The other impact is that each map task incurs latency when it spins up and spins down, because you are starting and stopping Java processes. So it's inefficient to have lots of map tasks; it's better to have fewer map tasks processing the same amount of data in bigger chunks, so the startup latency is amortized. The other issue is that lots of small reads and writes are not efficient on spinning disk: the access pattern becomes almost random, which spinning disks are pretty poor at. So in all respects, having too many map tasks is not good.

The key takeaway is to avoid lots of small files when you are doing this kind of processing. There are several solutions. People merge or concatenate files before putting them into HDFS. There's also the concept of sequence files: if you already have small files in HDFS, you can pack them into a sequence file, which essentially stores each file name as a key and the contents of that file as the value.
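The memory and map-task figures from this part of the lecture can be reproduced with a rough back-of-the-envelope sketch. The assumptions here (about 150 bytes per NameNode object, two objects per small file, and one map task per small file) are simplifications for illustration, not exact Hadoop internals:

```python
import math

BYTES_PER_OBJECT = 150  # rough NameNode heap cost per file/block object

def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode heap used by file metadata.

    Assumes each file contributes one file object plus one object
    per block, at roughly 150 bytes each.
    """
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

def num_map_tasks(total_bytes, file_size_bytes):
    """Map tasks when every small file becomes its own input split."""
    return math.ceil(total_bytes / file_size_bytes)

# A billion small files -> 2 billion objects -> ~300GB of heap.
print(namenode_memory_bytes(10**9))              # -> 300000000000 (300GB)

# 10GB of data stored as 32KB files -> ~327,000 map tasks.
print(num_map_tasks(10 * 1024**3, 32 * 1024))    # -> 327680
```

Compare that to the same 10GB stored as 160 full 64MB blocks: 160 map tasks instead of over 300,000, which is why merging small files pays off so dramatically.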
That way you can bundle a lot of small files together. You could also use systems like HBase and Hive with appropriate configurations to handle your data more optimally. And there is the CombineFileInputFormat class, which optimizes the map phase by packing many small files into each map task. So this gives you an idea of how files are broken up into blocks and what impact that has on performance. Next, we'll go further into the details of how reads and writes actually work at a lower level in HDFS, so you get a clearer picture of what's going on. We'll cover that in the next video. Thanks.