Optimizing Infrastructure Efficiency for Scalable Deep Learning and LLM Training: Insights from Storage

James Coomer, Senior VP of Product, DDN

In this talk, James Coomer, Senior VP of Product at DDN, discusses optimizing infrastructure efficiency for scalable deep learning and LLM training from the perspective of the storage layer.

In the realm of AI, the focus often lies on GPUs and CPUs, but today, let's consider the viewpoint of a storage system. As a company specializing in building large, fast storage systems for supercomputers and SuperPODs worldwide, with clients including Nvidia and Scaleway, we're invested in making the process of developing large language models more efficient.

When you examine the typical spending breakdown for a large AI infrastructure, the majority is allocated to compute resources, with only a small fraction dedicated to storage. Our challenge, as roughly 5% of that budget, is to improve efficiency across the entire system. Looking at the AI training process, we recognize key data patterns that demand optimization. During the training of large language models, data ingestion, preparation, training, and distribution involve intricate read and write patterns. As a storage system, our mission is to find ways to make these processes more efficient.
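To illustrate why these patterns matter to a storage system, here is a minimal, hypothetical sketch (the paths and sizes are assumptions for illustration, not DDN's implementation): a typical training loop re-reads the entire dataset in a new randomized order every epoch, so the same bytes traverse the storage network again and again.

```python
import os
import random

# Hypothetical example: each epoch re-reads every sample in a new random
# order, so the shared storage system sees the full dataset streamed repeatedly.
DATASET_DIR = "/mnt/shared/train_data"   # assumed shared-filesystem path
EPOCHS = 3

samples = sorted(os.listdir(DATASET_DIR))

for epoch in range(EPOCHS):
    random.shuffle(samples)              # randomized access pattern per epoch
    bytes_read = 0
    for name in samples:
        with open(os.path.join(DATASET_DIR, name), "rb") as f:
            bytes_read += len(f.read())  # every epoch pulls the same data again
    print(f"epoch {epoch}: read {bytes_read / 1e9:.1f} GB from shared storage")
```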

Let me share two stories that highlight our efforts to optimize data patterns. In collaboration with Nvidia, we addressed the repetitive movement of data during training epochs (in deep learning, an epoch is one complete pass through the entire training dataset; during each epoch the model learns from the whole dataset, updating its parameters to improve performance). By implementing automatic caching on DGX systems (Nvidia DGX systems are specialized hardware platforms for AI and deep learning workloads, often used for training large neural networks) with local SSDs, we reduced network traffic significantly, saving 3% of runtime and contributing to a more productive infrastructure.
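A minimal sketch of the caching idea (the paths and helper below are hypothetical, not the DDN or Nvidia implementation): on the first epoch each sample is copied once to the DGX node's local NVMe, and later epochs read the cached copy instead of going back over the network.

```python
import os
import shutil

SHARED_DIR = "/mnt/shared/train_data"   # assumed parallel-filesystem mount
CACHE_DIR = "/raid/cache/train_data"    # assumed local NVMe on the DGX node
os.makedirs(CACHE_DIR, exist_ok=True)

def read_sample(name: str) -> bytes:
    """Read a sample, populating the local SSD cache on first access."""
    cached = os.path.join(CACHE_DIR, name)
    if not os.path.exists(cached):
        # First epoch: pull once from shared storage and keep a local copy.
        shutil.copyfile(os.path.join(SHARED_DIR, name), cached)
    # Subsequent epochs: served from local NVMe, with no network traffic.
    with open(cached, "rb") as f:
        return f.read()
```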

Furthermore, contrary to common perception, AI is not solely a read-intensive problem; it involves substantial write activity, particularly in the form of checkpoints (a checkpoint is a snapshot of the model's parameters and state at a specific point during training, saved to disk periodically so that training can resume from a known state after an interruption or failure). We help make these writes more efficient by serving them rapidly. With a well-optimized parallel file system, we can reduce the time spent on checkpoint activity from 43% to 5-10%, translating to 5-12% more useful training time compared with non-specialized systems like NFS (Network File System, a standard protocol for sharing files and directories over a network, which lets a system access remote files as if they were on its local disks).
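To make the checkpoint overhead concrete, here is a hypothetical back-of-the-envelope sketch. The checkpoint size, interval, and bandwidths below are assumptions chosen for illustration, not figures from the talk; the point is simply that if training blocks while a checkpoint is written, the fraction of wall-clock time lost scales directly with write bandwidth.

```python
# Hypothetical back-of-the-envelope: fraction of runtime spent on checkpoints
# for an assumed 1 TB checkpoint written every 30 minutes, at two bandwidths.
CHECKPOINT_BYTES = 1e12          # assumed checkpoint size: 1 TB
INTERVAL_S = 30 * 60             # assumed checkpoint interval: 30 minutes

for label, bandwidth_gbs in [("slow shared filesystem", 2), ("parallel file system", 50)]:
    write_s = CHECKPOINT_BYTES / (bandwidth_gbs * 1e9)
    # If training pauses while the checkpoint is written, this fraction of
    # wall-clock time goes to I/O rather than useful training.
    overhead = write_s / (INTERVAL_S + write_s)
    print(f"{label}: {write_s:.0f} s per checkpoint, "
          f"{overhead * 100:.1f}% of runtime spent checkpointing")
```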

In essence, investing in the right storage, such as well-architected parallel file systems, can yield substantial gains in infrastructure efficiency, paying dividends that go beyond the initial 5% allocation in the AI budget. Our ongoing dialogue with customers and competitors fuels our commitment to advancing the efficiency of AI infrastructure.