Dynamic Sharding of Arbitrary Neural Networks



While the sequential method provides a structured, heuristic-driven approach to partitioning, it operates under the constraints of a linear, predetermined exploration path. This may not fully capture the dynamism and complexity of modern neural network architectures, where computational and memory demands can vary significantly across different layers and segments of the network. Given an arbitrary neural network, our objective is to partition the network's computation graph for optimal execution across multiple nodes. This computation graph, $G = (V, E)$, consists of computational unit operations $V$ and data flow edges $E$, with each operation $u \in V$ outputting a tensor consumed by downstream operations $v$, forming edges $(u, v) \in E$. The graph represents the entirety of the model's computational workload, ranging from simple arithmetic operations to layer-specific matrix multiplications, each associated with specific computational and memory requirements: the running time $\text{work}(v)$, the memory footprint of the model parameters $\text{sizeparam}(v)$, and the size of the operation's output $\text{sizeout}(v)$.
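As a concrete illustration of this bookkeeping, the minimal sketch below models a tiny computation graph with per-operation $\text{work}$, $\text{sizeparam}$, and $\text{sizeout}$ annotations. The class, field, and operation names are illustrative assumptions, not part of Nesa's codebase.

```python
# Minimal sketch (illustrative, not Nesa's actual data structures): each
# operation v in V carries work(v), sizeparam(v), and sizeout(v); E is the
# set of producer -> consumer edges.
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str
    work: float       # estimated running time of the operation (seconds)
    sizeparam: float  # bytes of parameters the operation keeps resident
    sizeout: float    # bytes of the output tensor it produces

# A toy four-operation transformer-like pipeline.
V = {
    "embed":  Op("embed",  work=1.0, sizeparam=4e8, sizeout=2e6),
    "block0": Op("block0", work=3.0, sizeparam=9e8, sizeout=2e6),
    "block1": Op("block1", work=3.0, sizeparam=9e8, sizeout=2e6),
    "head":   Op("head",   work=1.5, sizeparam=4e8, sizeout=1e5),
}
E = [("embed", "block0"), ("block0", "block1"), ("block1", "head")]
```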

Partitioning this graph involves dividing $V$ into $k$ distinct blocks such that each block can be processed on a different node in a swarm, under the constraint that the induced quotient graph of $G$ remains acyclic. This division aims to maximize throughput while minimizing inter-node communication, subject to the bandwidth $B$ between nodes, with the I/O cost of sending data from block $S$ to block $T$ given by:

$$\text{io}(S,T) = \frac{1}{B} \sum_{v \in N^-(T) \cap S} \text{sizeout}(v),$$

where $N^-(T)$ represents the set of operations whose outputs are consumed by block $T$.
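Under the toy representation sketched above, the $\text{io}(S,T)$ term can be computed directly from the edge list. The helper below is a minimal sketch; all names are assumptions rather than parts of Nesa's SDK.

```python
def io_cost(S, T, sizeout, edges, B):
    """Time to ship tensors that cross from block S into block T.

    S, T: sets of operation names; sizeout: name -> output bytes;
    edges: (producer, consumer) pairs; B: inter-node bandwidth in bytes/s.
    """
    # N^-(T) ∩ S: producers inside S whose output is consumed by some op in T.
    boundary = {u for (u, v) in edges if u in S and v in T}
    return sum(sizeout[u] for u in boundary) / B
```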

The core challenge lies in efficiently distributing the model's parameters and activations across the available fast memory (e.g., SRAM) of each node. Parameters that do not fit in fast memory must be streamed from slower storage, which introduces additional latency. The overflow cost, which represents the time to stream the parameters exceeding the fast-memory limit $M$, is calculated as:

$$\text{overflow}(S) = \left(\text{sizeparam}(S) + \text{peak}(S) - M\right) + \frac{\text{peak}(S)}{B},$$

where $\text{peak}(S)$ denotes the peak memory requirement for activations within block $S$.
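A direct transcription of this term is sketched below. The clamp at zero is an assumption drawn from the surrounding prose (no overflow is paid when parameters and peak activations fit within $M$), and the function name is purely illustrative.

```python
def overflow_cost(sizeparam_S, peak_S, M, B):
    """Streaming penalty for block S when its memory demand exceeds M."""
    # Assumption: no excess term is paid when everything fits in fast memory.
    excess = max(sizeparam_S + peak_S - M, 0.0)
    return excess + peak_S / B
```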

The overall block cost, $f(S)$, combines the costs of receiving input tensors, executing the block's operations (including any overflow cost from streaming parameters), and sending output tensors downstream:

$$f(S) = \text{io}(V\setminus S, S) + \sum_{v \in S} \text{work}(v) + \text{overflow}(S) + \text{io}(S, V\setminus S).$$
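Combining the helpers above gives a sketch of $f(S)$. Note that $\text{peak}(S)$ depends on the execution schedule inside the block; the maximum output size is used here only as a crude stand-in, and every name is an assumption for illustration.

```python
def block_cost(S, V, edges, M, B):
    """f(S): receive inputs, run the block (plus overflow), send outputs."""
    S = set(S)
    rest = set(V) - S                        # V \ S
    sizeout = {name: op.sizeout for name, op in V.items()}
    compute = sum(V[v].work for v in S)
    sizeparam_S = sum(V[v].sizeparam for v in S)
    peak_S = max(sizeout[v] for v in S)      # crude stand-in for peak(S)
    return (io_cost(rest, S, sizeout, edges, B)      # receive input tensors
            + compute                                # execute the operations
            + overflow_cost(sizeparam_S, peak_S, M, B)
            + io_cost(S, rest, sizeout, edges, B))   # send outputs downstream
```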

The goal of partitioning, defined by the Max-Throughput Partitioning Problem (MTPP), is to minimize the maximum cost across all blocks, optimizing the throughput of the entire pipeline. Formally, MTPP seeks a partition $P^*$ that minimizes the bottleneck cost:

$$P^* = \text{argmin}_{P \in P_k(G)} \left\{ \max_{i\in[k]} f(P_i) \right\},$$

where $P_k(G)$ denotes the set of all possible partitions of $G$ into $k$ blocks, and $\text{cost}^*$ denotes the minimum achievable bottleneck cost across these partitions.
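To make the objective concrete, the sketch below brute-forces the MTPP objective over contiguous slices of a single topological order. This keeps the induced quotient graph acyclic but does not explore every partition in $P_k(G)$; it is a toy search for tiny graphs, not Nesa's production solver.

```python
from itertools import combinations

def mtpp_bruteforce(topo_order, k, cost_fn):
    """Minimize the bottleneck block cost over contiguous k-slicings.

    topo_order: operations listed in a topological order of G;
    cost_fn: callable mapping a set of operations to its block cost f(S).
    """
    n = len(topo_order)
    best_blocks, best_cost = None, float("inf")
    # Choosing k-1 cut points in the topological order keeps the quotient
    # graph acyclic, since every edge points forward in the order.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        blocks = [topo_order[bounds[i]:bounds[i + 1]] for i in range(k)]
        bottleneck = max(cost_fn(set(block)) for block in blocks)
        if bottleneck < best_cost:
            best_blocks, best_cost = blocks, bottleneck
    return best_blocks, best_cost

# Example usage with the toy graph and cost helpers sketched above
# (hypothetical fast-memory budget M and bandwidth B):
# partition, cost = mtpp_bruteforce(
#     ["embed", "block0", "block1", "head"], k=2,
#     cost_fn=lambda S: block_cost(S, V, E, M=1e9, B=1e8))
```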