Overview

Distributed inference and training across multiple nodes are essential due to the exponential growth in the size and complexity of deep learning models and the scarcity of computational resources capable of processing them. Both tasks hinge on handling vast computational loads more efficiently and on reducing the latency of generating predictions and updating model parameters. By pooling the computational resources and memory available across several processing nodes, it becomes possible to serve larger models or increase batch sizes without a proportional increase in inference or training time.

A critical aspect of efficient distributed inference and training is partitioning the computational graph of the neural network. The computational graph represents all the operations and data flows within the model from input to output. Partitioning this graph effectively means dividing the model's computations so that they can be processed in parallel or in sequence across different nodes.
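As a rough illustration of what such a partition looks like, the following sketch splits a model's layer sequence into contiguous shards, one per node. The function name `partition_layers` and the even-split heuristic are hypothetical and simplified; they are not Nesa's actual partitioning algorithm or API.

```python
# Hypothetical sketch: partition a model's consecutive layers into
# contiguous shards, one shard per node. Illustrative only.
from typing import List


def partition_layers(num_layers: int, num_nodes: int) -> List[range]:
    """Split `num_layers` consecutive layers into `num_nodes` contiguous shards."""
    base, extra = divmod(num_layers, num_nodes)
    shards, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)  # spread the remainder over the first nodes
        shards.append(range(start, start + size))
        start += size
    return shards


# e.g. an 80-layer model split across 6 nodes:
# [range(0, 14), range(14, 28), range(28, 41), range(41, 54), range(54, 67), range(67, 80)]
print(partition_layers(80, 6))
```

In practice the split would also account for per-layer compute and memory cost and for each node's capacity, rather than assigning an equal number of layers to every node.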

During training, this distributed approach also allows gradient computation and parameter updates to be parallelized, significantly accelerating the training process. Efficient communication and synchronization mechanisms are equally critical, since model parameters must be updated across nodes without incurring significant delays, and we employ several strategies to minimize the communication overhead between computational units. After processing its assigned portion of the graph, each node must send its outputs to the next node in the sequence; in standard distributed approaches, this inter-node communication often happens over slower channels and can become a significant bottleneck.
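The sketch below shows where that inter-node hand-off sits in a pipelined forward pass. The `Node`, `run_shard`, and `distributed_forward` names are stand-ins for illustration, not Nesa's actual interfaces.

```python
# Hypothetical sketch of pipelined inference across shard-holding nodes.
import numpy as np


class Node:
    def __init__(self, shard_id: int):
        self.shard_id = shard_id

    def run_shard(self, activations: np.ndarray) -> np.ndarray:
        # Placeholder for executing this node's slice of the computational graph.
        return activations  # identity, for illustration only


def distributed_forward(nodes: list, inputs: np.ndarray) -> np.ndarray:
    activations = inputs
    for node in nodes:
        activations = node.run_shard(activations)
        # In a real deployment, handing `activations` to the next node is a
        # network transfer; its latency and bandwidth are the communication
        # overhead that partitioning and topology design try to minimize.
    return activations


output = distributed_forward([Node(i) for i in range(4)], np.zeros((1, 4096)))
```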

In the following sections, we introduce Nesa's innovative approaches to decentralizing AI model inference:

  • Model Partitioning and Deep Network Sharding and Dynamic Sharding of Arbitrary Neural Networks describe our approach to splitting large AI model computational graphs into small chunks to be distributed across node runners.

  • Additionally, we provide designs to further improve the efficiency of the Nesa system via Cache Optimization to Enhance Efficiency, BSNS with Parameter-efficient Fine-tuning via Adapters, Enhanced MTPP Slicing of Topological Order, and Swarm Topology.