Cloud Storage is the essential storage service for working with data, especially unstructured data, in the cloud. Let's do a deep dive into why Google Cloud Storage is a popular choice to serve as a data lake. Data in Cloud Storage persists beyond the lifetime of VMs or clusters; in other words, it is persistent. It is also relatively inexpensive compared to the cost of compute. For example, you might find it advantageous to cache the results of previous computations in Cloud Storage. If you don't need an application running all the time, you might find it helpful to save the state of your application to Cloud Storage and shut down the machine it's running on when you don't need it.

Cloud Storage is an object store, so it simply stores and retrieves binary objects without regard to what data they contain. However, to some extent it also provides file system compatibility and can make objects look and work as if they were files, so you can copy files in and out of it. Data stored in Cloud Storage will essentially stay there forever; in other words, it is durable. At the same time, it is available instantly and strongly consistent. You can share data globally, but it is encrypted, access to it is completely controlled, and it is private if you want it to be. It is a global service, and you can reach the data from anywhere; in other words, it offers global availability. But the data can also be kept in a single geographic location if you need that. Data is served up with moderate latency and high throughput. As a data engineer, you need to understand how Cloud Storage accomplishes these apparently contradictory qualities, and when and how to employ them in your solutions. Many of Cloud Storage's remarkable properties have to do with the fact that it is an object store; other features are built on top of that base. The two main entities in Cloud Storage are buckets and objects.
Buckets are containers for objects, and objects exist inside of buckets and not apart from them, so buckets are containers for data. Buckets are identified within a single, globally unique namespace. That means once a name is given to a bucket, it cannot be used by anyone else unless and until that bucket is deleted and the name is released. Having a global namespace for buckets simplifies locating any particular bucket. When a bucket is created, it is associated with a particular region or with multiple regions. Choosing a region close to where the data will be processed will reduce latency, and if you are processing the data using Cloud services within the region, it will save you on network egress charges.

When an object is stored, Cloud Storage replicates the object. It monitors the replicas, and if one of them is lost or corrupted, it replaces it with a fresh copy. This is how Cloud Storage achieves many nines of durability. For a multi-region bucket, the objects are replicated across regions; for a single-region bucket, the objects are replicated across zones. In either case, when an object is retrieved, it is served up from the replica closest to the requester, and that is how low latency is achieved. Multiple requesters can retrieve the object at the same time from different replicas, and that is how high throughput is achieved. Finally, objects are stored with metadata: information about the object. Additional Cloud Storage features use the metadata for purposes such as access control, compression, encryption, and lifecycle management. For example, Cloud Storage knows when an object was stored, and it can be set to automatically delete the object after a period of time; this feature uses the object metadata to determine when to delete it. You may have a variety of storage requirements for a multitude of use cases. Cloud Storage offers different classes to cater for these requirements, based on how often data is accessed.
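To make the metadata point concrete, here is a toy Python sketch of how an age-based deletion policy can be driven purely by per-object metadata. The function names and the dictionary layout are illustrative assumptions for this sketch, not a real Cloud Storage API:

```python
import time

def store(bucket, name, data, now=None):
    """Store an object together with metadata recording its creation time."""
    bucket[name] = {
        "data": data,
        "metadata": {"timeCreated": now if now is not None else time.time()},
    }

def expire_older_than(bucket, max_age_seconds, now=None):
    """Delete every object whose age, read from its metadata, exceeds the limit."""
    now = now if now is not None else time.time()
    for name in list(bucket):
        if now - bucket[name]["metadata"]["timeCreated"] > max_age_seconds:
            del bucket[name]
```

The key idea is that the deletion sweep never looks at the object's data, only at its metadata, which is exactly how lifecycle features can work uniformly across any kind of object.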
Standard storage is best for data that is frequently accessed, also referred to as hot data, and/or stored for only brief periods of time. When used in a region, co-locating your resources maximizes the performance for data-intensive computations and can reduce network charges. When used in a dual-region, you still get optimized performance when accessing Google Cloud products located in one of the associated regions, but you also get the improved availability that comes from storing data in geographically separate locations. When used in a multi-region, standard storage is appropriate for storing data that is accessed around the world, such as serving website content, streaming videos, executing interactive workloads, or serving data supporting mobile and gaming applications.

Nearline storage is a low-cost, highly durable storage service for storing infrequently accessed data. Nearline storage is a better choice than standard storage in scenarios where slightly lower availability, a 30-day minimum storage duration, and costs for data access are acceptable trade-offs for lower at-rest storage costs. Nearline storage is ideal for data you plan to read or modify on average once per month or less. It is appropriate for data backup, long-tail multimedia content, and data archiving.

Coldline storage is a very low-cost, highly durable storage service for storing infrequently accessed data. Coldline storage is a better choice than standard or nearline storage in scenarios where slightly lower availability, a 90-day minimum storage duration, and higher costs for data access are acceptable trade-offs for lower at-rest storage costs. Coldline storage is ideal for data you plan to read or modify at most once a quarter.

Archive storage is the lowest-cost, highly durable storage service for data archiving, online backup, and disaster recovery.
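The four classes can be summarized by their minimum storage durations and intended access frequency. The sketch below encodes the figures from the descriptions above; the helper function and its thresholds are a hypothetical illustration of the decision, not an official sizing tool:

```python
# Minimum storage durations and access patterns, as described above.
STORAGE_CLASSES = {
    "STANDARD": {"min_duration_days": 0,   "access": "frequent (hot data)"},
    "NEARLINE": {"min_duration_days": 30,  "access": "about once a month or less"},
    "COLDLINE": {"min_duration_days": 90,  "access": "at most once a quarter"},
    "ARCHIVE":  {"min_duration_days": 365, "access": "less than once a year"},
}

def suggest_class(expected_reads_per_year):
    """Illustrative rule of thumb mapping read frequency to a storage class."""
    if expected_reads_per_year > 12:   # more than monthly: hot data
        return "STANDARD"
    if expected_reads_per_year > 4:    # up to about once a month
        return "NEARLINE"
    if expected_reads_per_year >= 1:   # up to about once a quarter
        return "COLDLINE"
    return "ARCHIVE"                   # less than once a year
```

Note that the minimum storage duration matters as much as access frequency: deleting or rewriting a Nearline object before 30 days still incurs the 30-day charge, so very short-lived data belongs in standard storage even if it is read rarely.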
Archive storage has higher costs for data access and operations, as well as a 365-day minimum storage duration. Archive storage is the best choice for data that you plan to access less than once a year: for example, cold data such as data stored for legal or regulatory reasons, and data kept for disaster recovery.

Cloud Storage is unique in a number of ways. It has a single API, millisecond data-access latency, and eleven nines of durability across all storage classes. Cloud Storage also offers Object Lifecycle Management, which uses policies to automatically move data to lower-cost storage classes as it is accessed less frequently throughout its life.

Cloud Storage uses the bucket name and object name to simulate a file system. This is how it works: the bucket name is the first term in the URI, a forward slash is appended to it, and then it is concatenated with the object name. The object name allows the forward slash character as a valid character in the name. A very long object name with forward slash characters in it looks like a file system path, even though it is just a single name. In the example shown, the bucket name is declass and the object name is de/modules/02/scripts.sh. The forward slashes are just characters in the name. If this path were in a file system, it would appear as a set of nested directories beginning with declass. For all practical purposes, it works like a file system, but there are some differences. For example, imagine that you wanted to move all the files in the 02 directory to the 03 directory inside the modules directory. In a file system, you would have actual directory structures, and you would simply modify the file system metadata, so the entire move is atomic. But in an object store simulating a file system, you would have to search through all the objects contained in the bucket for names that had 02 in the right position in the name. Then you would have to edit each object name and rename it using 03.
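The directory-move discussion can be sketched in a few lines of Python. This toy model treats a bucket as a flat mapping from object names to data, which is essentially what an object store is; "moving a directory" then means scanning every name in the bucket and rewriting the ones that match the prefix, one at a time:

```python
def move_prefix(bucket, old_prefix, new_prefix):
    """Rename every object whose name starts with old_prefix.

    There is no directory metadata to update, so the whole bucket
    is scanned and each matching object is rewritten individually;
    nothing about this operation is atomic.
    """
    renamed = {}
    for name, data in bucket.items():
        if name.startswith(old_prefix):
            renamed[new_prefix + name[len(old_prefix):]] = data
        else:
            renamed[name] = data
    return renamed

# A bucket is just a flat mapping of object names to bytes.
bucket = {
    "de/modules/02/scripts.sh": b"#!/bin/bash",
    "de/modules/02/notes.txt": b"notes",
    "de/modules/01/intro.sh": b"#!/bin/bash",
}
bucket = move_prefix(bucket, "de/modules/02/", "de/modules/03/")
```

After the call, the 02 objects carry 03 names and the 01 object is untouched, but the cost of the scan grows with the total number of objects in the bucket, not with the size of the "directory" being moved.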
This would apparently produce the same result: the files move between directories. However, instead of working with a dozen files in a directory, the system has to search over possibly thousands of objects in the bucket to locate the ones with the right names, and then change each of them. So the performance characteristics are different: it might take longer to move a dozen objects from directory 02 to directory 03, depending on how many other objects are stored in the bucket. During the move, there will be list inconsistency, with some files in the old directory and some in the new directory.

A best practice is to avoid using sensitive information as part of bucket names, because bucket names are in a global namespace. The data in the buckets can be kept private if you need it to be. Cloud Storage can be accessed using a file access method that allows you, for example, to use a copy command from a local file directly to Cloud Storage; the tool gsutil can do this. Cloud Storage can also be accessed over the web. The site storage.cloud.google.com uses TLS (HTTPS) to transport your data, which protects credentials as well as data in transit.

Cloud Storage has many object management features. For example, you can set a retention policy on all objects in the bucket, such as having the objects expire after 30 days. You can also use versioning, so that multiple versions of an object are tracked and available if necessary. You might even set up lifecycle management to automatically move objects that haven't been accessed in 30 days to Nearline storage, and after 90 days to Coldline storage.
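A lifecycle policy like the one just described, moving objects to Nearline after 30 days and to Coldline after 90, is expressed as a small JSON document. The sketch below builds and prints that document in Python; the rule shape matches what gsutil's lifecycle command accepts, and the bucket it applies to would be named on the command line (for example, gsutil lifecycle set policy.json gs://your-bucket, where your-bucket is a placeholder):

```python
import json

# Age-based rules: after 30 days move objects to Nearline,
# and after 90 days move them to Coldline.
lifecycle_policy = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
    ]
}

# Write this out as policy.json and apply it with gsutil.
print(json.dumps(lifecycle_policy, indent=2))
```

Each rule pairs an action with a condition; Cloud Storage evaluates the conditions against object metadata, which ties back to the earlier point that lifecycle management is driven by metadata rather than by the object contents.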