> For the complete documentation index, see [llms.txt](https://docs.nesa.ai/nesa/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.nesa.ai/nesa/major-innovations/decentralized-execution-for-ai/background-and-exploratory-notes/blockchain-based-sequential-neural-sharding-bsns.md).

# Blockchain-based Sequential Neural Sharding (BSNS)

> ⚠️ **Note on Implementation Status**
>
> This page includes descriptions of mechanisms and workflows that are **under active development or iterative refinement**. Some components may not yet be fully deployed in the current production system, and are presented to illustrate design intent and potential system directions.

This page introduces the theoretical and practical foundations of Nesa's model sharding protocol (BSNS), including its recursive swarm assignment formulation, network-aware partitioning strategy, and dynamic rebalancing algorithm. It explains how large models are decomposed and efficiently distributed across heterogeneous infrastructure.

***

## Model Partitioning and Dynamic Sharding

Modern foundation models are massive, often exceeding the memory and compute limits of any single device. To enable distributed inference over commodity nodes, Nesa introduces a scalable, network-aware method called **Blockchain-based Sequential Neural Sharding (BSNS)**.

Overview of BSNS model sharding and inference coordination.
**(A)** Available nodes are selected to form an orchestrated swarm. An orchestrator node is elected, followed by a committee of participants to execute shards.
**(B)** The orchestrator partitions the computation graph across the selected nodes, assigning model blocks or operator groups to each.
**(C)** During inference, each node executes its assigned shard and forwards intermediate activations, while the orchestrator ensures data flow and maintains consistency across the swarm.

Rather than statically slicing layers or replicating the full model, BSNS partitions the model graph into sharded blocks and routes them across a dynamic swarm of nodes. Each node executes only its assigned segment, and intermediate activations are passed securely downstream.

***

### Sequential Sharding over Transformer Blocks

BSNS begins with a sequential decomposition of the model, usually aligned with transformer blocks. For example, a LLaMA-style model with 80 layers is broken into contiguous block segments. These blocks are assigned to nodes in a specific execution path, ensuring correct ordering and throughput.

Given a list of blocks `{b1, b2, ..., bL}` and a swarm of nodes `{n1, n2, ..., nk}`, the BSNS orchestrator builds a mapping:

> (b1 → n\_i1), (b2 → n\_i2), ..., (bL → n\_iL)

Each assignment considers node memory, available compute, and connection latency. Critically, the selection of the current node depends not only on the block’s cost, but also on where the previous shards were placed. This creates a recursive assignment path that reflects network topology.

***

### Dynamic Rebalancing

Network conditions can change quickly. To avoid bottlenecks, BSNS includes a **dynamic rebalancing function**. This periodically evaluates the current assignment of blocks to nodes and re-routes shards to better candidates based on latency, hardware capacity, and recent throughput.

Let `S` be the current sequence of shard-to-node assignments and `Θ` represent live network parameters. BSNS computes a new optimized path:

> S' = Rebalance(S, Θ)

This allows Nesa to maintain stable performance even if nodes drop, delay, or degrade during inference.

***

### Partitioning Arbitrary Neural Graphs

While BSNS works naturally with transformer stacks, it also supports arbitrary computation graphs (e.g., multimodal, convolutional, or hybrid models).
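Before moving to general graphs, the transformer-stack case from the two preceding sections can be illustrated with a small sketch. It is not Nesa's implementation: the `Node` fields, the greedy rule, and the cost formula (block memory divided by a compute score, plus hop latency from the previous block's node) are all assumptions chosen to show how each placement depends on the previous one, and how `Rebalance(S, Θ)` could recompute a path when live conditions change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    free_memory_gb: float  # memory available for shards (hypothetical field)
    compute_score: float   # higher means faster, arbitrary units (hypothetical)


def symmetric(pairs):
    """Expand {(u, v): ms} so that both (u, v) and (v, u) can be looked up."""
    out = {}
    for (u, v), ms in pairs.items():
        out[(u, v)] = ms
        out[(v, u)] = ms
    return out


def place_blocks(block_mem_gb, nodes, latency_ms):
    """Greedy placement: the node for block i depends on where block i-1 went."""
    remaining = {n.name: n.free_memory_gb for n in nodes}
    path = []
    for mem in block_mem_gb:
        prev = path[-1] if path else None
        best, best_cost = None, float("inf")
        for n in nodes:
            if remaining[n.name] < mem:
                continue  # block does not fit on this node
            hop = 0.0 if prev in (None, n.name) else latency_ms[(prev, n.name)]
            cost = mem / n.compute_score + hop  # illustrative cost, not BSNS's
            if cost < best_cost:
                best, best_cost = n.name, cost
        if best is None:
            raise RuntimeError("no node can host this block")
        remaining[best] -= mem
        path.append(best)
    return path


def path_cost(path, block_mem_gb, nodes, latency_ms):
    """Total cost of a path under live conditions; infinite if a node is gone."""
    by_name = {n.name: n for n in nodes}
    total, prev = 0.0, None
    for name, mem in zip(path, block_mem_gb):
        if name not in by_name:
            return float("inf")
        hop = 0.0 if prev in (None, name) else latency_ms[(prev, name)]
        total += mem / by_name[name].compute_score + hop
        prev = name
    return total


def rebalance(path, block_mem_gb, live_nodes, latency_ms):
    """S' = Rebalance(S, Θ): keep S unless a strictly cheaper path exists."""
    candidate = place_blocks(block_mem_gb, live_nodes, latency_ms)
    old = path_cost(path, block_mem_gb, live_nodes, latency_ms)
    new = path_cost(candidate, block_mem_gb, live_nodes, latency_ms)
    return candidate if new < old else path


nodes = [Node("a", 4, 2.0), Node("b", 4, 1.0), Node("c", 4, 2.0)]
latency = symmetric({("a", "b"): 5.0, ("a", "c"): 1.0, ("b", "c"): 5.0})
blocks = [2, 2, 2, 2]  # memory per block in GB

path = place_blocks(blocks, nodes, latency)
print(path)                                         # ['a', 'a', 'c', 'c']
print(rebalance(path, blocks, nodes[:2], latency))  # node c dropped: ['a', 'a', 'b', 'b']
```

In this toy run the third block moves to `c` rather than `b` because `c` is both faster and closer to `a`. When `c` disappears from the live node list, the old path becomes infeasible and the rebalanced path falls back to `b`.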
A model is represented as a directed acyclic graph `G = (V, E)`:

* **V**: the operations (e.g., attention layers, matrix multiplications)
* **E**: the data flow edges between them

Each operation `v` has attributes:

* `C(v)`: compute time
* `M(v)`: memory footprint
* `O(v)`: output size

The goal is to divide the graph into `k` blocks `{G1, G2, ..., Gk}` such that:

* Each block can execute independently on one node
* The resulting inter-block graph is still acyclic
* Inter-node communication is minimized

The cost of sending output from block `Gi` to `Gj` is the sum of the output sizes of all operations in `Gi` that are consumed by operations in `Gj`.

***

### Memory Overflow and Execution Cost

A block's parameters may not fit in a node's fast memory (e.g., SRAM). If the block’s parameter size `P` exceeds the node’s fast memory limit `F`, an overflow cost is incurred by streaming the excess from slower memory. The **overflow penalty** is proportional to:

> max(0, P - F) × τ

where `τ` is the streaming latency multiplier.

The **total cost of a block** includes:

* Receiving input tensors from upstream
* Local execution cost (including overflow)
* Sending outputs downstream

***

### Max-Throughput Partitioning Problem (MTPP)

The partitioning problem becomes:

> Find a partition of the graph into k blocks that **minimizes the maximum block cost**

This is an NP-hard problem. BSNS approximates the optimal solution using a combination of greedy heuristics, persistent homology metrics, and node scoring.
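To make the block cost and the min-max objective concrete, here is a minimal sketch for a simplified special case: operations are already in topological order as a chain, and every block is a contiguous range of that chain. In that restricted setting the objective can be solved exactly with dynamic programming, which is not the heuristic approach described above for general graphs. The `Op` class, the `bandwidth` parameter, and the choice to treat a block's parameter size `P` as the sum of `M(v)` over its operations are assumptions made for illustration.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Op:
    C: float  # compute time
    M: float  # memory footprint (summed per block and used as P, an assumption)
    O: float  # output size


def block_cost(ops, i, j, fast_mem, tau, bandwidth):
    """Cost of running ops[i:j] as one block: receive + execute + overflow + send."""
    compute = sum(op.C for op in ops[i:j])
    params = sum(op.M for op in ops[i:j])
    overflow = max(0.0, params - fast_mem) * tau           # max(0, P - F) × τ
    recv = ops[i - 1].O / bandwidth if i > 0 else 0.0      # input from upstream
    send = ops[j - 1].O / bandwidth if j < len(ops) else 0.0  # output downstream
    return recv + compute + overflow + send


def min_max_partition(ops, k, fast_mem, tau, bandwidth):
    """Split a chain of ops into k contiguous blocks minimizing the max block cost."""
    n = len(ops)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of operations")

    @lru_cache(maxsize=None)
    def solve(j, blocks):
        # Best (max cost, block start indices) for ops[:j] split into `blocks` parts.
        if blocks == 1:
            return block_cost(ops, 0, j, fast_mem, tau, bandwidth), (0,)
        best = (float("inf"), ())
        for i in range(blocks - 1, j):  # i is where the last block starts
            prev_cost, prev_starts = solve(i, blocks - 1)
            last = block_cost(ops, i, j, fast_mem, tau, bandwidth)
            cost = max(prev_cost, last)
            if cost < best[0]:
                best = (cost, prev_starts + (i,))
        return best

    cost, starts = solve(n, k)
    ends = starts[1:] + (n,)
    return cost, list(zip(starts, ends))


ops = [Op(C=2, M=1, O=1) for _ in range(4)]
print(min_max_partition(ops, k=2, fast_mem=2, tau=3, bandwidth=1))
# (5.0, [(0, 2), (2, 4)])
```

With these numbers an uneven split pushes three operations into one block, which exceeds the fast memory limit and pays the overflow penalty, so the balanced split wins.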

Genetic algorithm-based node selection in BSNS. The orchestrator uses multi-objective optimization to evaluate candidates on distance, bandwidth, compute availability, and reliability. Offspring swarm candidates are generated and scored via crossover and mutation. The objective is to select the optimal swarm sequence for sharding a large model while minimizing cost and latency.
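The selection loop described in the caption can be sketched as a small genetic algorithm. This is an illustrative toy, not the orchestrator's actual procedure: the `Candidate` fields, the weights in `node_score`, the weighted-sum scalarization of the four objectives, and the elitist selection scheme are all assumptions. Because the toy fitness is a plain sum of per-node scores, its optimum is simply the `k` best-scoring nodes, which makes the sketch easy to sanity-check. A realistic objective would also include pairwise terms such as hop latency between consecutive nodes.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    distance: float     # lower is better (hypothetical units)
    bandwidth: float    # higher is better
    compute: float      # higher is better
    reliability: float  # in [0, 1], higher is better


def node_score(c):
    """Weighted-sum scalarization of the four objectives (weights are made up)."""
    return 0.3 * c.bandwidth + 0.3 * c.compute + 0.3 * c.reliability - 0.01 * c.distance


def fitness(swarm):
    return sum(node_score(c) for c in swarm)


def crossover(a, b, rng):
    """Keep a prefix of parent a, then fill from parent b without duplicates."""
    cut = rng.randrange(1, len(a)) if len(a) > 1 else 0
    head = a[:cut]
    tail = [c for c in b if c not in head][: len(a) - cut]
    return head + tuple(tail)


def mutate(swarm, pool, rng, rate=0.2):
    """With probability `rate`, swap one member for a node outside the swarm."""
    if rng.random() >= rate:
        return swarm
    outside = [c for c in pool if c not in swarm]
    if not outside:
        return swarm
    i = rng.randrange(len(swarm))
    return swarm[:i] + (rng.choice(outside),) + swarm[i + 1:]


def select_swarm(pool, k, generations=30, population_size=20, seed=0):
    """Evolve ordered swarms of k distinct nodes and return the fittest one."""
    rng = random.Random(seed)
    population = [tuple(rng.sample(pool, k)) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: population_size // 2]  # survivors carried over as-is
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), rng), pool, rng)
            for _ in range(population_size - len(elite))
        ]
        population = elite + children
    return max(population, key=fitness)


pool = [
    Candidate(f"n{i}", distance=10.0 * i, bandwidth=1.0 - 0.1 * i,
              compute=0.5 + 0.05 * i, reliability=0.9)
    for i in range(8)
]
best = select_swarm(pool, k=3)
print([c.name for c in best], round(fitness(best), 3))
```

Keeping the top half of each generation unchanged means the best fitness never decreases between generations, at the cost of slower exploration than a scheme with more aggressive replacement.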

***

### Swarm Reconfiguration in Practice

At runtime, the BSNS controller continuously monitors:

* Inference time per block
* Queue depth
* Node health and hardware usage
* Communication latency

If imbalances appear, the system:

* Replaces slow nodes with idle ones
* Reassigns heavy blocks to more capable hardware
* Selects a new sharding plan from a library of precomputed templates

This ensures graceful adaptation without reinitializing the entire inference session.

***

### Empirical Performance

While BSNS is designed for scalable model execution, its practicality also depends on the performance of distributed, compressed, and multi-modal models. Below are empirical results from several experimental settings validating Nesa's infrastructure.

***

#### 1. Compression Robustness on Language Tasks

| Model        | Bits | HellaSwag   | Lambada OpenAI | Causal Judgment | Disambiguation QA | Logical Deduction |
| ------------ | ---- | ----------- | -------------- | --------------- | ----------------- | ----------------- |
| LLaMA 8B     | 16   | 0.76 ± 0.01 | 0.75 ± 0.03    | 0.63 ± 0.02     | 0.64 ± 0.04       | 0.40 ± 0.03       |
|              | 8    | 0.76 ± 0.01 | 0.74 ± 0.03    | 0.62 ± 0.03     | 0.63 ± 0.04       | 0.39 ± 0.02       |
| Mixtral 7x8B | 16   | 0.78 ± 0.01 | 0.76 ± 0.03    | 0.65 ± 0.02     | 0.66 ± 0.03       | 0.42 ± 0.02       |
|              | 8    | 0.77 ± 0.01 | 0.75 ± 0.02    | 0.64 ± 0.03     | 0.65 ± 0.03       | 0.41 ± 0.03       |
| Lexi 7B      | 16   | 0.75 ± 0.02 | 0.74 ± 0.02    | 0.62 ± 0.03     | 0.63 ± 0.04       | 0.39 ± 0.02       |
|              | 8    | 0.74 ± 0.02 | 0.73 ± 0.03    | 0.61 ± 0.04     | 0.62 ± 0.03       | 0.38 ± 0.02       |

> **Table:** Accuracy impact from 8-bit quantization is minimal across tasks and models, validating BSNS for efficient distributed inference.

***

#### 2. Latency, Bandwidth, and Token Throughput

| Model        | RTT    | Bandwidth  | Batch Size | Gen. Steps/s (64) | Gen. Steps/s (1024) | Tokens/s (64) | Tokens/s (1024) |
| ------------ | ------ | ---------- | ---------- | ----------------- | ------------------- | ------------- | --------------- |
| LLaMA-3 8B   | <5 ms  | 1 Gbit/s   | 1          | 1.20              | 1.10                | 8             | 6.5             |
|              |        |            | 32         | 1.15              | 1.08                | 28            | 26.4            |
|              |        |            | 64         | 1.10              | 1.05                | 56            | 52.5            |
|              | <10 ms | 100 Mbit/s | 1          | 0.85              | 0.80                | 6             | 5               |
|              |        |            | 32         | 0.80              | 0.75                | 22            | 20              |
|              |        |            | 64         | 0.75              | 0.70                | 44            | 40              |
| Mixtral 7x8B | <5 ms  | 1 Gbit/s   | 1          | 1.30              | 1.25                | 6             | 5.4             |
|              |        |            | 32         | 1.25              | 1.20                | 24            | 22.8            |
|              |        |            | 64         | 1.20              | 1.15                | 49            | 45.5            |
|              | <10 ms | 100 Mbit/s | 1          | 0.90              | 0.85                | 5             | 4.5             |
|              |        |            | 32         | 0.85              | 0.80                | 20            | 18.5            |
|              |        |            | 64         | 0.80              | 0.75                | 40            | 37              |

> **Table:** BSNS supports high-throughput generation across network settings by balancing shard placement and caching.

***

#### 3. Multi-Modal Model Performance (Text-to-Image)

| Category         | Model           | Fairness | Quality | Creativity | Knowledge | Performance |
| ---------------- | --------------- | -------- | ------- | ---------- | --------- | ----------- |
| Stable Diffusion | v1.4            | 0.68     | 0.86    | 0.68       | 0.68      | 0.85        |
|                  | v1.5            | 0.54     | 0.73    | 0.21       | 0.50      | 0.81        |
|                  | v2 base         | 0.51     | 0.85    | 0.20       | 0.39      | 0.88        |
| Anime-Style      | kivotos-xl-2.0  | 0.77     | 0.87    | 0.91       | 0.72      | 0.81        |
|                  | holodayo-xl-2.1 | 0.79     | 0.89    | 0.94       | 0.74      | 0.83        |
| Debiasing        | mobius          | 0.82     | 0.71    | 0.86       | 0.77      | 0.87        |

> **Table:** BSNS accommodates diverse model types. Each model is containerized and tunable for privacy, quality, or application needs.

***

### Summary

BSNS provides a principled method to decompose and distribute large models across dynamic, heterogeneous environments.
Its key innovations include:

* Recursively informed shard placement based on topology and cost
* Overflow-aware memory planning for fast-access SRAM
* Graph partitioning for arbitrary architectures
* Real-time rebalancing of shard placement under changing network conditions

This framework enables Nesa to run large-scale inference on decentralized compute networks securely, efficiently, and scalably.