Enhanced MTPP Slicing of Topological Order

The Max-Throughput Partitioning Problem (MTPP) is fundamental to our goal of optimizing the partitioning of computation graphs for distributed deep neural network inference. Acknowledging MTPP's NP-hardness which initially is established through a reduction from the minimum makespan scheduling problem highlights the impracticality of seeking a fully polynomial time approximation scheme. This emphasizes the need for a heuristic approach to achieve high throughput pipelined inference. We integrate a heuristic algorithm that leverages the strengths of Kahn's algorithm for topological sorting which is enhanced by dynamic programming and segment trees. This approach aims to balance the computational load across the nodes and minimize the communication overhead, which is critical for maintaining high throughput in distributed systems. Kahn's algorithm is chosen for its effectiveness in ensuring a valid topological order, which is essential for acyclic partitioning. Moreover, dynamic programming offers an optimal partition strategy across these ordered segments, and segment trees are utilized for their efficiency in querying segment costs which enables rapid adjustments to partitioning as we explore the solution space for an optimal configuration.

In our approach, the introduction of the segment cost data structure is crucial, as it enables the efficient computation of costs for any segment within the topological order. Initialized with a computation graph GG and a topological order π\pi, it calculates costs that include both computational workload and communication overhead. The SliceGraph algorithm plays a central role in our approach, leveraging the segment cost data structure to partition the computation graph into segments that optimize throughput while minimizing latency. It dynamically partitions the graph into a specified number of blocks, ensuring equitable workload distribution and minimizing inter-node communication.

The initialization time complexity of the segment cost data structure is O(n2)O(n^2) for a graph with nn nodes, enabling the pre-computation of costs associated with all possible segments. This setup is crucial for the efficient execution of the SliceGraph algorithm. The overall time complexity of the SliceGraph algorithm, before applying the convex hull trick, is O(n2k+mlog2(n))O(n^2k + m\log^2(n)) where kk is the number of blocks into which the graph is partitioned, and mm is the number of edges in the graph. This complexity takes into account the dynamic programming process and the cost queries facilitated by the segment cost data structure. By applying the convex hull trick, we optimize the dynamic programming aspect of the SliceGraph algorithm, reducing its time complexity to O(nklogn)O(nk\log n). This optimization significantly enhances the efficiency of partitioning, allowing for quicker determination of the optimal graph partitioning strategy.

Biased Random-Key Genetic Algorithm

The integration of the Biased Random-Key Genetic Algorithm (BRKGA) into our approach addresses the inherent limitations of the above heuristic-based methods by introducing a stochastic and evolutionary mechanism. This allows for an expansive exploration of partition configurations, crucial for solving the NP-hard Max-Throughput Partitioning Problem (MTPP) efficiently. BRKGA's capability to evolve partitioning strategies over generations enables the discovery of high-quality solutions that balance the computational load and minimize communication overhead, which deterministic methods alone may not identify, ensuring optimal throughput in distributed deep neural network inference.

BRKGA improves this process by evolving vectors of node priorities (x[0,1]n\mathbf{x} \in [0,1]^n), with each vector's partitioning quality undergoing evaluation. This evaluation creates a direct link between node priorities and partitioning quality, allowing for precise adjustments to optimize partition outcomes. It initiates with generating a starting population of random priority vectors, which are then utilized to establish topological orders, feeding into the SliceGraph algorithm. The partitioning quality of each vector, judged on throughput optimization and latency reduction, determines its fitness. Through iterative processes of selection, crossover (new=αparent1+(1α)parent2\mathit{new} = \alpha \mathit{parent}_1 + (1 - \alpha) \mathit{parent}_2), and mutation, BRKGA refines the population towards more efficient partitioning strategies. This evolution continues until convergence is achieved, which indicates the discovery of an optimal or near-optimal set of node priorities for our partitioning requirements.

The BRKGA approach to partitioning computation graphs for neural network inference offers several compelling advantages over the sequential sharding methods discussed earlier. Firstly, the sequential approach, while systematic and based on clear heuristic rules, may not adequately explore the vast space of possible shard configurations. It inherently follows a predetermined path through this space, which might overlook more efficient configurations that could significantly enhance computational throughput and minimize inter-node communication latency. In contrast, BRKGA introduces a level of stochastic exploration combined with evolutionary strategies, which allows for a broader and more dynamic search. This adaptability is particularly crucial in distributed environments where network topology and node capabilities vary widely.

Last updated