Welcome. This unit covers data-center-level considerations when deploying AI clusters. Topics covered in this unit include: infrastructure provisioning and management, orchestration and job scheduling, cluster management and monitoring, power and cooling considerations, and lastly, the DGX-Ready Data Center colocation program.

Starting with infrastructure provisioning and management. First, we'll discuss infrastructure provisioning and management for compute, network, and storage. Next, we'll focus on how to boot and install compute nodes.

Before discussing how to provision infrastructure for AI, it is important to understand the benefits of a cluster. Utilization can be increased by setting up a multi-node, multi-user environment that enables scale-out to support ongoing research within a company. With the same cluster, resources can be reprovisioned to scale up when users have a large job that requires multiple nodes to run in parallel. Understanding the profiles of the users also helps manage expectations for cluster sizing and access to the environment. For example, a typical user may use one or two GPUs, while a power user may use four to eight or more.

Understanding the hardware is key for performance, ease of startup, and logistics. Understanding how much power can be allocated per rack is also key to deciding how to distribute the systems within the data center, or even whether colocation services should be considered. The best way to ensure a successful deployment is to review the configuration, making sure that all the hardware has enough memory, GPUs, and storage. Compute with GPUs requires a balance of processor cores and speed, with speed being the more important factor. For system memory, the recommendation is twice the total GPU memory per node. Using the latest GPUs offered and meeting the minimum requirement for network interface cards ensures optimal performance.

Management and admin nodes serve as installation servers for the compute nodes in the cluster. Some of this functionality may be combined in a single node, depending on the size of the cluster. Multi-user clusters will typically have a separate head node for interactive logins. With regard to networking, it is best to have a separate network for out-of-band management so that it does not interfere with the throughput required by the cluster. It is also important that the cluster network has enough throughput for cluster communication: 200 Gb/s Ethernet or InfiniBand has become the de facto recommendation, since large amounts of data must be moved quickly, requiring high I/O profiles. As for storage, typical block-storage appliances are not suited for this environment; scale-out NFS or a parallel file system should be considered instead.

Here are the three basic steps for getting a cluster up and running. First, create the admin node and configure it to act as an installation server for the compute nodes in the cluster. This includes configuring the system to accept Preboot eXecution Environment (PXE) client connections, as well as setting it up to support automated kickstart installations. Next, boot the compute nodes one by one, connecting to the admin server and launching the installation. Lastly, when all the nodes are up and running, install the job queue system on them so they can work together as a high-performance cluster.
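As a rough illustration of the sizing guidance above (system memory of twice the total GPU memory per node, and one to two GPUs for a typical user versus four to eight or more for a power user), here is a minimal sketch. The specific inputs, such as the per-GPU memory size, GPUs per node, and user counts, are hypothetical values chosen only for illustration, not figures from this unit.

```python
# Minimal cluster-sizing sketch based on the rules of thumb in this unit:
#   - system memory per node should be twice the total GPU memory in that node
#   - typical users need 1-2 GPUs, power users need 4-8 or more
# All inputs below (GPU memory size, user counts) are illustrative assumptions.

def recommended_system_memory_gb(gpus_per_node: int, gpu_memory_gb: int) -> int:
    """Return the recommended system memory for one node (2x total GPU memory)."""
    return 2 * gpus_per_node * gpu_memory_gb

def estimated_gpus_needed(typical_users: int, power_users: int,
                          typical_gpus: int = 2, power_gpus: int = 8) -> int:
    """Estimate peak concurrent GPU demand from the user profiles."""
    return typical_users * typical_gpus + power_users * power_gpus

if __name__ == "__main__":
    gpus_per_node = 8      # e.g. an eight-GPU node
    gpu_memory_gb = 80     # per-GPU memory, illustrative value

    print("Recommended system memory per node:",
          recommended_system_memory_gb(gpus_per_node, gpu_memory_gb), "GB")

    demand = estimated_gpus_needed(typical_users=10, power_users=2)
    nodes = -(-demand // gpus_per_node)   # ceiling division
    print(f"Peak GPU demand: {demand} GPUs -> about {nodes} nodes")
```

A back-of-the-envelope check like this is only a starting point for sizing; the actual node count and memory configuration should still be validated against the workloads and user profiles described above.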
Deploying a deep learning platform is an involved process with many steps. To get a basic, unoptimized, untuned system up and running, a manual install of drivers, libraries, primitives, and packages is necessary. This involves an inventory of documentation that is over 380 pages long. Some of the steps are listed here. With DGX A100 systems, this can be greatly simplified.

Here we see the Linux pre-installation requirements for the CUDA toolkit and driver. First, verify that the system has a CUDA-capable GPU. Verify that the system is running a supported version of Linux. Verify the system has GCC installed. Verify the system has the correct kernel headers and development packages installed. Download the NVIDIA CUDA toolkit. Lastly, handle any conflicting installation methods.

To use CUDA on your system, you will need a CUDA-capable GPU and a supported version of Linux with the GCC compiler and toolchain. If using an OEM server, ensure that the server supports the version of Linux that will be running in the environment. The CUDA toolkit can be downloaded from the link shown here. The CUDA development environment relies on tight integration with the host development environment, including the host compiler and C runtime libraries, and is only supported on distributions that have been qualified for that CUDA toolkit release.

The runfile installation installs the NVIDIA driver, the CUDA toolkit, and the CUDA samples via an interactive, standards-based command-line interface. Distribution-specific instructions for disabling the Nouveau driver, a free and open-source graphics device driver for NVIDIA video cards, are provided, along with the steps for verifying device node creation. Finally, the advanced options for the installer and the uninstallation steps are also available. The runfile installation does not include support for cross-platform development; for cross-platform development, please refer to the CUDA cross-platform environment section.

Shown here are the key components for success with clustered compute, storage, and networking environments, from an open-source community support model to an enterprise-level support model.
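The pre-installation checks described earlier in this unit can also be scripted. The sketch below is an illustrative, hedged Python version of those checks for a generic Linux host; it is not NVIDIA tooling, and the exact commands and paths it relies on (lspci, /etc/os-release, gcc on the PATH, /lib/modules/<kernel>/build) can vary by distribution.

```python
# Illustrative sketch of the CUDA pre-installation checks described in this unit.
# Not NVIDIA tooling; commands and paths may differ by Linux distribution.
import os
import platform
import shutil
import subprocess

def has_nvidia_gpu() -> bool:
    """Check for a CUDA-capable (NVIDIA) GPU by scanning lspci output."""
    try:
        out = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return False
    return "nvidia" in out.stdout.lower()

def linux_distribution() -> str:
    """Report the distribution so it can be compared against the CUDA support matrix."""
    try:
        with open("/etc/os-release") as f:
            info = dict(line.rstrip().split("=", 1) for line in f if "=" in line)
        return info.get("PRETTY_NAME", "unknown").strip('"')
    except OSError:
        return "unknown"

def has_gcc() -> bool:
    """Verify that the GCC compiler toolchain is available on the PATH."""
    return shutil.which("gcc") is not None

def has_kernel_headers() -> bool:
    """Verify kernel headers/development packages for the running kernel."""
    return os.path.isdir(f"/lib/modules/{platform.uname().release}/build")

if __name__ == "__main__":
    print("CUDA-capable GPU found: ", has_nvidia_gpu())
    print("Linux distribution:     ", linux_distribution())
    print("GCC installed:          ", has_gcc())
    print("Kernel headers present: ", has_kernel_headers())
```

If these checks pass, the NVIDIA CUDA toolkit and driver can then be downloaded and installed, for example via the runfile installer described above, before handling any conflicting installation methods.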