Dynamic Sharding of Arbitrary Neural Networks

While the sequential method provides a structured and heuristic-driven approach to partitioning, it operates under the constraints of a linear, predetermined exploration path. This may not fully capture the dynamism and complexity of modern neural network architectures, where computational and memory demands can vary significantly across different layers and segments of the network. Given an arbitrary neural network, our objective is to partition the network's computation graph for optimal execution across multiple nodes. This computation graph, $G=(V, E)$, consists of the computational operations $V$ and the data-flow edges $E$: each operation $u \in V$ outputs a tensor consumed by downstream operations $v$, forming edges $(u,v) \in E$. The graph represents the entirety of the model's computational workload, ranging from simple arithmetic operations to layer-specific matrix multiplications, each associated with specific computational and memory requirements: the running time $\text{work}(v)$, the memory footprint of the model parameters $\text{sizeparam}(v)$, and the size of the operation's output $\text{sizeout}(v)$.
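
As an illustration, such an annotated graph might be represented as in the sketch below. The class, operation names, and cost values are hypothetical placeholders; the fields simply mirror $\text{work}(v)$, $\text{sizeparam}(v)$, and $\text{sizeout}(v)$ and are not the output of any actual profiler.

```python
# A minimal sketch of the annotated computation graph G = (V, E).
# All names and numbers are illustrative, not measured values.
from dataclasses import dataclass

@dataclass(frozen=True)
class OpCosts:
    work: float        # estimated running time of the operation (seconds)
    size_param: float  # bytes of model parameters the operation owns
    size_out: float    # bytes of the output tensor it produces

# V: operations keyed by name; E: directed data-flow edges (u, v),
# meaning u's output tensor is consumed by v.
V = {
    "embed":  OpCosts(work=1e-4, size_param=4e8,   size_out=2e6),
    "attn_0": OpCosts(work=3e-4, size_param=6e8,   size_out=2e6),
    "mlp_0":  OpCosts(work=5e-4, size_param=1.2e9, size_out=2e6),
    "head":   OpCosts(work=2e-4, size_param=4e8,   size_out=1e6),
}
E = [("embed", "attn_0"), ("attn_0", "mlp_0"), ("mlp_0", "head")]
```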

Partitioning this graph involves dividing $V$ into $k$ distinct blocks such that each block can be processed on a different node in a swarm, under the constraint that the induced quotient graph of $G$ remains acyclic. This division aims to maximize throughput while minimizing inter-node communication, subject to the bandwidth $B$ between nodes, with the I/O cost from block $S$ to block $T$ given by:

$$
\text{io}(S,T) = \frac{1}{B} \sum_{v \in N^-(T) \cap S} \text{sizeout}(v),
$$

where $N^-(T)$ represents the set of operations whose outputs are consumed by block $T$.
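
A minimal sketch of this I/O cost follows, assuming a simple edge-list representation of $G$; the function name, operation names, sizes, and bandwidth are all illustrative.

```python
# Sketch of io(S, T): bytes produced in S and consumed in T, divided by B.
def io_cost(S, T, edges, size_out, B):
    """Sum sizeout(v) over producers v in N^-(T) that belong to S."""
    producers = {u for (u, v) in edges if v in T}   # N^-(T): in-neighbors of T
    return sum(size_out[v] for v in producers & set(S)) / B

# Illustrative values only.
size_out = {"attn_0": 2e6, "mlp_0": 2e6}            # bytes per output tensor
edges = [("attn_0", "mlp_0"), ("mlp_0", "head")]
B = 1e9                                              # bytes/sec between nodes
print(io_cost({"attn_0"}, {"mlp_0", "head"}, edges, size_out, B))  # 0.002 s
```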

The core challenge lies in efficiently distributing the model's parameters and activations across the available fast memory (e.g., SRAM) of each node. Parameters that do not fit in fast memory must be streamed from slower storage, which introduces additional latency. The overflow cost, which represents the time to stream parameters exceeding the fast memory limit $M$, is calculated as:

$$
\text{overflow}(S) = \left(\text{sizeparam}(S) + \text{peak}(S) - M\right) + \frac{\text{peak}(S)}{B},
$$

where $\text{peak}(S)$ denotes the peak memory requirement for activations within block $S$.
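
Transcribed directly into code, the overflow cost might look like the sketch below. The values are illustrative, and the note about treating the excess term as zero when a block fits entirely in fast memory is our assumption rather than part of the formula above.

```python
def overflow_cost(size_param_S, peak_S, M, B):
    # Literal transcription of the overflow formula above; in practice one
    # would presumably treat the first term as zero when the block's
    # parameters and activations fit within M (an assumption, not stated).
    return (size_param_S + peak_S - M) + peak_S / B

# Illustrative values: 1.8 GB of parameters, 0.2 GB peak activations,
# 1.6 GB of fast memory, 1 GB/s bandwidth.
print(overflow_cost(size_param_S=1.8e9, peak_S=2e8, M=1.6e9, B=1e9))
```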

The overall block cost, $f(S)$, combines the costs of receiving input tensors, executing the block's operations (including any overflow cost from streaming parameters), and sending output tensors downstream:

$$
f(S) = \text{io}(V\setminus S,\, S) + \sum_{v \in S} \text{work}(v) + \text{overflow}(S) + \text{io}(S,\, V\setminus S).
$$
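
Putting the pieces together, a self-contained sketch of $f(S)$ could look like the following. The helpers repeat the earlier formulas so the snippet runs on its own; every structure, name, and value is illustrative.

```python
def io_cost(src, dst, edges, size_out, B):
    producers = {u for (u, v) in edges if v in dst}          # N^-(dst)
    return sum(size_out[v] for v in producers & set(src)) / B

def overflow_cost(size_param_S, peak_S, M, B):
    return (size_param_S + peak_S - M) + peak_S / B          # formula as written above

def block_cost(S, V, edges, work, size_param, size_out, peak_S, M, B):
    """f(S): receive inputs, run the block (with overflow), send outputs."""
    S, rest = set(S), set(V) - set(S)
    return (io_cost(rest, S, edges, size_out, B)             # io(V \ S, S): receive inputs
            + sum(work[v] for v in S)                        # execute the block's operations
            + overflow_cost(sum(size_param[v] for v in S), peak_S, M, B)
            + io_cost(S, rest, edges, size_out, B))          # io(S, V \ S): send outputs
```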

The goal of partitioning, defined by the Max-Throughput Partitioning Problem (MTPP), is to minimize the maximum cost across all blocks, thereby optimizing the throughput of the entire pipeline. Formally, MTPP seeks a partition $P^*$ that minimizes the bottleneck cost:

$$
P^* = \text{argmin}_{P \in P_k(G)} \left\{ \max_{i\in[k]} f(P_i) \right\},
$$

where $P_k(G)$ denotes the set of all possible partitions of $G$ into $k$ blocks, and $\text{cost}^* = \max_{i\in[k]} f(P^*_i)$ is the minimum achievable bottleneck cost across these partitions.
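
For a tiny graph, the bottleneck objective can be checked by brute force. The sketch below enumerates candidate partitions and keeps the one whose most expensive block is cheapest. As a simplifying assumption beyond the text, candidates are restricted to contiguous blocks of one fixed topological order (which always yields an acyclic quotient graph), and the toy cost function merely stands in for $f$.

```python
# Brute-force check of the MTPP objective on a toy chain of operations.
from itertools import combinations

def mtpp_bruteforce(topo_order, k, f):
    """Return (best_partition, bottleneck_cost) minimizing max_i f(P_i)."""
    n = len(topo_order)
    best, best_cost = None, float("inf")
    for cuts in combinations(range(1, n), k - 1):            # k-1 cut points
        bounds = (0, *cuts, n)
        blocks = [topo_order[bounds[i]:bounds[i + 1]] for i in range(k)]
        cost = max(f(block) for block in blocks)             # bottleneck cost
        if cost < best_cost:
            best, best_cost = blocks, cost
    return best, best_cost

# Toy usage with a stand-in cost: f(S) = total "work" of the block.
work = {"embed": 1, "attn_0": 3, "mlp_0": 5, "head": 2}
f = lambda S: sum(work[v] for v in S)
print(mtpp_bruteforce(["embed", "attn_0", "mlp_0", "head"], k=2, f=f))
```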
