Google Cloud Storage is the essential storage service for working with data, especially unstructured data, as you'll see later on in the machine learning world. Let's do a deep dive into why Google Cloud Storage is a popular choice to serve as a data lake. Data in Cloud Storage persists beyond the lifetime of virtual machines or clusters. It's persistent, and it's also relatively inexpensive compared to the cost of compute. So, for example, you might find it advantageous to cache the results of previous computations in Cloud Storage, or to archive them there. Or, if you don't need an application running all the time, you might find it helpful to save the state of your application in Cloud Storage and shut down the machine it's running on when you don't need it.

Google Cloud Storage is an object store, so it just stores and retrieves binary objects without regard to what data is contained in those objects. However, to some extent it also provides file system compatibility and can make objects look and work as if they were files, so you can copy files in and out of it. Data stored in Cloud Storage will basically stay there forever, meaning that it's durable, and it's available instantly, meaning it's strongly consistent. You can share data globally, but it's encrypted and completely controlled and private if you want it to be. It's a global service and you can reach the data from anywhere, which means it offers global availability, but the data can also be kept in a single geographic location if you need that. Data is served up with moderate latency and high throughput. As a data engineer, you need to understand how Cloud Storage accomplishes these apparently contradictory qualities, and when to employ them in your solutions.

A lot of Cloud Storage's amazing properties have to do with the fact that ultimately it's an object store, and all the other features are built on top of that base. The two main entities in Cloud Storage are buckets and objects. Buckets are containers which hold objects, and objects exist inside of those buckets and not apart from them. So buckets are containers for data, for our purposes, and buckets are identified in a single, globally unique namespace. That means once a name is given to a bucket, it can't be used by anyone else until that bucket is deleted and the name is released. Having a global namespace for buckets greatly simplifies locating any particular bucket. When a bucket is created, it's associated with a particular region or with multiple regions. Choosing a region close to where the data will be processed will reduce latency, and if you're processing the data using cloud services within that region, it will save you on network egress charges.

When an object is stored, Cloud Storage replicates that object. It then monitors the replicas, and if one of them is lost or corrupted, it replaces it automatically with a fresh copy. This is how Cloud Storage gets many of its nines of durability. For a multi-region bucket, the objects are replicated across regions, and for a single-region bucket, as you might expect, the objects are replicated across zones within that one region. In any case, when an object is retrieved, it's served up from the replica closest to the requester, and that's how the low latency happens. Multiple requesters can retrieve the objects at the same time from different replicas, and that's how high throughput is achieved. Finally, the objects are stored with metadata; metadata is information about the object.
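To make the bucket and object ideas concrete, here is a minimal sketch using the Python google-cloud-storage client library; the bucket name, object name, and local file below are hypothetical examples, not names from this course.

    # Minimal sketch: create a bucket in a region, upload an object, read its metadata.
    # The bucket name, object name, and local file are hypothetical examples.
    from google.cloud import storage

    client = storage.Client()

    # Bucket names live in a single global namespace, so this name must be unique.
    bucket = client.create_bucket("my-unique-data-lake-bucket", location="asia-south1")

    # Objects are just binary blobs stored inside the bucket.
    blob = bucket.blob("raw/sales/2024/export.csv")
    blob.upload_from_filename("export.csv")

    # Cloud Storage keeps metadata about every object.
    blob.reload()
    print(blob.size, blob.storage_class, blob.time_created)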
Additional Cloud Storage features use the metadata for purposes such as access control, compression, encryption, and lifecycle management of those objects and buckets. For example, Cloud Storage knows when an object was stored, and it can be automatically set up to delete the object after a period of time. This feature uses the object metadata to determine when to delete that object.

When you create a bucket, you need to make several decisions. The first is the location of that bucket. The location is set when a bucket is created, and it can never be changed. If you need to move a bucket later, you'll have to copy all of the contents to the new location and pay the network egress charges, so choose your location carefully. The location could be a single region, such as europe-north1 or asia-south1. It might be a multi-region, like EU or ASIA, which means that objects are replicated in several regions within the European Union or Asia respectively. The third option is for the location to be a dual-region bucket. For example, NAM4 means that objects are replicated in us-central1 and us-east1.

So how do you choose a region? Well, let's say all of your computation and all of your users are in Asia; you could choose an Asian region to reduce network latency. But beyond that, how do you choose between asia-south1 and the Asia multi-region? You can select one region, and the data will be replicated to multiple zones within that region. This way, a single zone might go down but you'll still have access to the data; different zones within the same region provide isolation from most types of physical infrastructure and infrastructure software service failures. But if the entire region goes down, say there's a flood in the region, for example, you won't be able to access that regional data. If you want to make sure that the data is available during a natural disaster, you could select multi-region or dual-region, in which case the replicas will be stored in physically separate data centers. You can read more about this in the current SLAs online; I'll provide a link to them.

Next, you need to determine how often you need to access or change your data. You can get deep discounts on storing the data if you're willing to pay more when you need to access it. The discount starts to make sense if you'll access the data no more than once a month or once a quarter. What are some scenarios? Well, good examples are archival storage, backups, or disaster recovery, and the discount really works in your favor if you access the data only once a quarter or once a year. These options are called storage classes. Look at the link to see the SLAs and the costs associated with each of these storage classes.

Cloud Storage uses the bucket name and the object name to simulate a file system. This is how it works: the bucket name is the first term in the URI, a forward slash is appended to it, and then it's concatenated with the object name. The object name allows the forward slash as a valid character in the name, so a very long object name with forward slash characters in it looks like a file system path, even though it's just a single name. In the example shown, the bucket name is declass and the object name is de/modules/02/script.sh; the forward slashes are just characters in the name. If this path were in a file system, it would appear as shown on the left, with a set of nested directories beginning with declass.
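To illustrate how the bucket name and object name simulate a file path, here is a minimal sketch with the Python google-cloud-storage client; it reuses the declass bucket and script.sh object from the example above, and assumes that bucket already exists and that you have access to it.

    # Minimal sketch: the object name contains forward slashes, but it is one flat key.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("declass")

    # The full URI is gs://declass/de/modules/02/script.sh; "de/modules/02/" is
    # not a real directory, just part of the single object name.
    blob = bucket.blob("de/modules/02/script.sh")
    blob.upload_from_filename("script.sh")

    # "Directories" are simulated by listing objects whose names share a prefix.
    for obj in client.list_blobs("declass", prefix="de/modules/02/"):
        print(obj.name)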
Now, for all practical purposes, Google Cloud Storage works like a file system, but there are some key differences. For example, imagine that you wanted to move all the files in the 02 directory to the 03 directory inside of the modules directory. In a file system, you'd have an actual directory structure, and you would simply modify the file system metadata, so the entire move is atomic. But in an object store simulating a file system, you'd have to search through all of the objects contained in the bucket for names that had 02 in the right position in the name. Then you'd have to edit each object name and rename it using 03. This would produce apparently the same result as moving the files between directories. However, instead of working with a dozen files in a directory, the system had to search over possibly thousands of objects in the bucket to locate the ones with the right names and then change each of them. So the performance characteristics are a little different: it might take longer to move a dozen objects from the directory 02 to the directory 03, depending on how many other objects are stored in that bucket. During the move, there will be a list inconsistency, with some files in the old directory and some in the new directory.

A best practice is to avoid the use of sensitive information as part of bucket names, because bucket names are in a global namespace. That applies to bucket names, not to the data in the buckets; the data in the buckets can be kept private if you need it to be.

Google Cloud Storage can be accessed using a file access method, which allows you, for example, to use a copy command from a local file directory to Google Cloud Storage. You'll use the tool gsutil, the Google Storage utility, to do this. Cloud Storage can also be accessed over the web; the site for it is storage.cloud.google.com, and it uses TLS, or HTTPS, to transport your data, which protects the credentials as well as the data in transit.

Cloud Storage has many object management features. For example, you can set a retention policy on all of the objects in a bucket, such as that each object should expire automatically after 30 days. You can also use versioning, so you can have multiple versions of an object, each tracked and available if necessary. You might even set up lifecycle management to automatically move objects that haven't been accessed in 30 days to the Nearline storage class, or after 90 days to Coldline storage, to help optimize your costs. Let's take a look at how you can manage these object lifecycles a bit more programmatically, to help optimize the use of your Cloud Storage buckets and save on storage costs.
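As a preview of that programmatic lifecycle management, here is a minimal sketch using the Python google-cloud-storage client; the bucket name is hypothetical, the 30-day and 90-day thresholds mirror the ones mentioned above, and the 365-day delete rule is just an illustrative value.

    # Minimal sketch: versioning plus lifecycle rules on a hypothetical bucket.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("my-unique-data-lake-bucket")  # hypothetical bucket

    # Keep older versions of each object around.
    bucket.versioning_enabled = True

    # Move objects to cheaper storage classes as they age.
    bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)

    # Delete objects outright after a year (illustrative value).
    bucket.add_lifecycle_delete_rule(age=365)

    # Persist the new versioning and lifecycle configuration on the bucket.
    bucket.patch()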