Blockchain-based Sequential Neural Sharding (BSNS)

This page introduces the theoretical and practical foundations of Nesa's model sharding protocol (BSNS), including its recursive swarm assignment formulation, network-aware partitioning strategy, and dynamic rebalancing algorithm. It explains how large models are decomposed and efficiently distributed across heterogeneous infrastructure.


Model Partitioning and Dynamic Sharding

Modern foundation models are massive—often exceeding the memory and compute limits of any single device. To enable distributed inference over commodity nodes, Nesa introduces a scalable, network-aware method called Blockchain-based Sequential Neural Sharding (BSNS).

Figure: Overview of BSNS model sharding and inference coordination. (A) Available nodes are selected to form an orchestrated swarm. An orchestrator node is elected, followed by a committee of participants to execute shards. (B) The orchestrator partitions the computation graph across the selected nodes, assigning model blocks or operator groups to each. (C) During inference, each node executes its assigned shard and forwards intermediate activations, while the orchestrator ensures data flow and maintains consistency across the swarm.

Rather than statically slicing layers or replicating the full model, BSNS partitions the model graph into sharded blocks and routes them across a dynamic swarm of nodes. Each node executes only its assigned segment, and intermediate activations are passed securely downstream.


Sequential Sharding over Transformer Blocks

BSNS begins with a sequential decomposition of the model—usually aligned with transformer blocks. For example, a LLaMA-style model with 80 layers is broken into contiguous block segments. These blocks are assigned to nodes in a specific execution path, ensuring correct ordering and throughput.

Given a list of blocks {b1, b2, ..., bL} and a swarm of nodes {n1, n2, ..., nk}, the BSNS orchestrator builds a mapping:

(b1 → n_i1), (b2 → n_i2), ..., (bL → n_iL)

Each assignment considers node memory, available compute, and connection latency. Critically, the selection of the current node depends not only on the block’s cost, but also on where the previous shards were placed—this creates a recursive assignment path that reflects network topology.
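To make the recursive placement concrete, here is a minimal Python sketch of how an orchestrator might walk the block list and pick each node relative to where the previous shard landed. The data structures, the scoring weights (alpha, beta), and the assign_shards helper are illustrative assumptions, not Nesa's actual implementation.

```python
# Illustrative sketch of recursive shard placement; node records, cost
# weights, and the scoring rule are hypothetical, not Nesa's implementation.

def assign_shards(blocks, nodes, latency, alpha=1.0, beta=0.1):
    """Map each block to a node, preferring nodes with free memory and fast
    compute that sit close (low latency) to the node holding the previous shard."""
    assignment = []                                   # list of (block_index, node_id)
    free_mem = {n["id"]: n["memory"] for n in nodes}
    prev = None                                       # node holding the previous shard

    for i, block in enumerate(blocks):
        candidates = [n for n in nodes if free_mem[n["id"]] >= block["memory"]]
        if not candidates:
            raise RuntimeError(f"no node can host block {i}")

        def score(n):
            compute_cost = block["flops"] / n["flops_per_s"]
            link_cost = latency[prev][n["id"]] if prev is not None else 0.0
            return alpha * compute_cost + beta * link_cost   # depends on previous placement

        best = min(candidates, key=score)
        free_mem[best["id"]] -= block["memory"]
        assignment.append((i, best["id"]))
        prev = best["id"]
    return assignment


# Tiny example: four blocks over two nodes with a symmetric latency table (ms).
blocks = [{"memory": 4, "flops": 8e9} for _ in range(4)]
nodes = [
    {"id": "n1", "memory": 8,  "flops_per_s": 2e12},
    {"id": "n2", "memory": 16, "flops_per_s": 1e12},
]
latency = {"n1": {"n1": 0.0, "n2": 5.0}, "n2": {"n1": 5.0, "n2": 0.0}}
print(assign_shards(blocks, nodes, latency))   # [(0, 'n1'), (1, 'n1'), (2, 'n2'), (3, 'n2')]
```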


Dynamic Rebalancing

Network conditions can change quickly. To avoid bottlenecks, BSNS includes a dynamic rebalancing function. This periodically evaluates the current assignment of blocks to nodes and re-routes shards to better candidates based on latency, hardware capacity, and recent throughput.

Let S be the current sequence of shard-to-node assignments and Θ represent live network parameters. BSNS computes a new optimized path:

S' = Rebalance(S, Θ)

This allows Nesa to maintain stable performance even if nodes drop, delay, or degrade during inference.
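A minimal sketch of how the rebalancing step could look, reusing the hypothetical assign_shards helper from the previous sketch: re-plan the path under live measurements Θ and adopt the new plan only if it improves the estimated cost by a margin. The 20% improvement threshold and the cost model are assumptions for illustration.

```python
# Minimal sketch of S' = Rebalance(S, Θ), reusing the hypothetical
# assign_shards helper above; the 20% improvement margin is an assumption.

def path_cost(assignment, blocks, nodes_by_id, latency):
    """Estimated end-to-end cost of a shard path: compute time plus hop latency."""
    cost, prev = 0.0, None
    for i, node_id in assignment:
        cost += blocks[i]["flops"] / nodes_by_id[node_id]["flops_per_s"]
        if prev is not None:
            cost += latency[prev][node_id]
        prev = node_id
    return cost

def rebalance(current, blocks, nodes, latency, improvement=0.2):
    """Re-plan with live network parameters and migrate only if clearly better."""
    nodes_by_id = {n["id"]: n for n in nodes}
    proposed = assign_shards(blocks, nodes, latency)        # new plan under live Θ
    old = path_cost(current, blocks, nodes_by_id, latency)
    new = path_cost(proposed, blocks, nodes_by_id, latency)
    return proposed if new < (1 - improvement) * old else current
```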


Partitioning Arbitrary Neural Graphs

While BSNS works naturally with transformer stacks, it also supports arbitrary computation graphs (e.g., multimodal, convolutional, or hybrid models). A model is represented as a directed acyclic graph G = (V, E), where:

  • V: the operations (e.g., attention layers, matrix multiplications)

  • E: the data flow edges between them

Each operation v has attributes:

  • C(v): compute time

  • M(v): memory footprint

  • O(v): output size

The goal is to divide the graph into k blocks {G1, G2, ..., Gk} such that:

  • Each block can execute independently on one node

  • The resulting inter-block graph is still acyclic

  • Inter-node communication is minimized

The cost of sending output from block Gi to Gj is the sum of the output sizes of all operations in Gi that are consumed by operations in Gj.
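The sketch below encodes a toy DAG with the C(v), M(v), O(v) attributes and computes the inter-block communication cost exactly as defined above. The operation names, attribute values, and the two-block partition are hypothetical.

```python
# Toy DAG with per-operation attributes C(v), M(v), O(v); operation names
# and values are hypothetical. comm_cost implements the definition above.

ops = {
    "embed": {"C": 1.0, "M": 2.0, "O": 0.5, "out": ["attn"]},   # C: compute, M: memory, O: output size
    "attn":  {"C": 3.0, "M": 4.0, "O": 0.5, "out": ["mlp"]},
    "mlp":   {"C": 2.5, "M": 6.0, "O": 0.5, "out": ["head"]},
    "head":  {"C": 0.5, "M": 1.0, "O": 0.1, "out": []},
}

# A candidate partition into two blocks; the induced block graph must stay acyclic.
partition = {"G1": ["embed", "attn"], "G2": ["mlp", "head"]}

def comm_cost(src_block, dst_block, ops, partition):
    """Sum of output sizes of ops in src_block consumed by ops in dst_block."""
    dst = set(partition[dst_block])
    return sum(
        ops[v]["O"]
        for v in partition[src_block]
        if any(consumer in dst for consumer in ops[v]["out"])
    )

print(comm_cost("G1", "G2", ops, partition))   # 0.5: only attn -> mlp crosses the cut
```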


Memory Overflow and Execution Cost

Each block may exceed the fast memory available (e.g., SRAM). If the block’s parameter size P exceeds a node’s fast memory limit F, overflow cost is incurred by streaming from slower memory.

The overflow penalty is proportional to:

max(0, P - F) × τ

where τ is the streaming latency multiplier.

The total cost of a block includes:

  • Receiving input tensors from upstream

  • Local execution cost (including overflow)

  • Sending outputs downstream
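A small sketch of this per-block cost model, assuming the linear overflow penalty max(0, P - F) × τ and a single link bandwidth for both receiving and sending; all constants are illustrative.

```python
# Minimal per-block cost sketch assuming a linear overflow penalty and a
# single link bandwidth for input and output; all constants are illustrative.

def overflow_penalty(param_size, fast_mem, tau):
    """Cost of streaming parameters that do not fit in fast memory: max(0, P - F) * tau."""
    return max(0.0, param_size - fast_mem) * tau

def block_cost(input_size, output_size, compute_time,
               param_size, fast_mem, bandwidth, tau):
    """Receive inputs + local execution (with overflow) + send outputs."""
    recv = input_size / bandwidth
    execute = compute_time + overflow_penalty(param_size, fast_mem, tau)
    send = output_size / bandwidth
    return recv + execute + send

# Example: a 6 GB block on a node with 4 GB of fast memory (units arbitrary).
print(block_cost(input_size=0.5, output_size=0.5, compute_time=2.0,
                 param_size=6.0, fast_mem=4.0, bandwidth=1.0, tau=0.8))   # 4.6
```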


Max-Throughput Partitioning Problem (MTPP)

The partitioning problem becomes:

Find a partition of the graph into k blocks that minimizes the maximum block cost

This is an NP-hard problem. BSNS approximates the optimal solution using a combination of greedy heuristics, persistent homology metrics, and node scoring.
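For intuition, here is a textbook greedy/binary-search baseline for the min-max objective on a purely sequential chain of blocks. It is not Nesa's production heuristic, which also incorporates node scoring and topology-aware metrics.

```python
# Greedy/binary-search baseline for the min-max objective on a purely
# sequential chain; a textbook approximation, not Nesa's production heuristic.

def can_partition(costs, k, limit):
    """Can the chain be cut into at most k contiguous blocks, each <= limit?"""
    blocks, current = 1, 0.0
    for c in costs:
        if c > limit:
            return False
        if current + c > limit:
            blocks, current = blocks + 1, c   # start a new block
        else:
            current += c
    return blocks <= k

def min_max_block_cost(costs, k, tol=1e-6):
    """Smallest achievable maximum block cost over k contiguous blocks."""
    lo, hi = max(costs), sum(costs)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if can_partition(costs, k, mid):
            hi = mid
        else:
            lo = mid
    return hi

layer_costs = [1.0, 2.0, 1.5, 3.0, 0.5, 2.5]   # hypothetical per-layer costs
print(min_max_block_cost(layer_costs, k=3))    # ~4.5
```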

Figure: Genetic algorithm-based node selection in BSNS. The orchestrator uses multi-objective optimization to evaluate candidates on distance, bandwidth, compute availability, and reliability. Offspring swarm candidates are generated and scored via crossover and mutation. The objective is to select the optimal swarm sequence for sharding a large model while minimizing cost and latency.
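The toy sketch below mirrors that loop: candidate swarms are node orderings, fitness is the total hop latency along the path, and new candidates come from crossover and mutation. The fitness function, population size, and mutation rate are illustrative assumptions, not protocol parameters.

```python
# Toy genetic search over node orderings; fitness, population size, and
# mutation rate are illustrative assumptions, not protocol parameters.
import random

def fitness(order, latency):
    """Lower is better: total hop latency along the candidate shard path."""
    return sum(latency[a][b] for a, b in zip(order, order[1:]))

def crossover(a, b):
    """Order-preserving crossover: keep a prefix of a, fill the rest from b."""
    cut = random.randint(1, len(a) - 1)
    head = a[:cut]
    return head + [n for n in b if n not in head]

def mutate(order, rate=0.2):
    """Occasionally swap two positions in the path."""
    order = order[:]
    if random.random() < rate:
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
    return order

def evolve(node_ids, latency, population=20, generations=30):
    pop = [random.sample(node_ids, len(node_ids)) for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=lambda o: fitness(o, latency))
        parents = pop[: population // 2]                 # keep the fittest half
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(population - len(parents))]
        pop = parents + children
    return min(pop, key=lambda o: fitness(o, latency))

latency = {   # pairwise latencies (ms) between candidate nodes, hypothetical
    "n1": {"n2": 5, "n3": 1, "n4": 9},
    "n2": {"n1": 5, "n3": 2, "n4": 3},
    "n3": {"n1": 1, "n2": 2, "n4": 7},
    "n4": {"n1": 9, "n2": 3, "n3": 7},
}
print(evolve(list(latency), latency))   # e.g. ['n4', 'n2', 'n3', 'n1'] (total latency 6)
```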

Swarm Reconfiguration in Practice

At runtime, the BSNS controller continuously monitors:

  • Inference time per block

  • Queue depth

  • Node health and hardware usage

  • Communication latency

If imbalances appear, the system:

  • Replaces slow nodes with idle ones

  • Reassigns heavy blocks to more capable hardware

  • Selects a new sharding plan from a library of precomputed templates

This ensures graceful adaptation without reinitializing the entire inference session.
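As a concrete example of the trigger logic, a hypothetical straggler check might compare each block's recent inference time against the swarm median; the 1.5x slack factor below is an assumed threshold, not a protocol constant.

```python
# Hypothetical straggler check used to trigger reassignment; the 1.5x slack
# factor is an assumed threshold, not a protocol constant.
from statistics import median

def find_straggler_blocks(block_times_ms, slack=1.5):
    """Return block indices whose recent latency exceeds slack x the swarm median."""
    baseline = median(block_times_ms.values())
    return [b for b, t in block_times_ms.items() if t > slack * baseline]

# Per-block inference times reported by the swarm (ms, hypothetical).
times = {0: 41.0, 1: 39.5, 2: 118.0, 3: 43.2}
print(find_straggler_blocks(times))   # [2] -> candidate for reassignment to a faster node
```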


Empirical Performance

While BSNS is designed for scalable model execution, its practicality also depends on how well models perform once distributed, compressed, or extended to multiple modalities. Below are empirical results from several experimental settings validating Nesa's infrastructure.


1. Compression Robustness on Language Tasks

| Model | Bits | HellaSwag | Lambada OpenAI | Causal Judgment | Disambiguation QA | Logical Deduction |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 8B | 16 | 0.76 ± 0.01 | 0.75 ± 0.03 | 0.63 ± 0.02 | 0.64 ± 0.04 | 0.40 ± 0.03 |
| LLaMA 8B | 8 | 0.76 ± 0.01 | 0.74 ± 0.03 | 0.62 ± 0.03 | 0.63 ± 0.04 | 0.39 ± 0.02 |
| Mixtral 7x8B | 16 | 0.78 ± 0.01 | 0.76 ± 0.03 | 0.65 ± 0.02 | 0.66 ± 0.03 | 0.42 ± 0.02 |
| Mixtral 7x8B | 8 | 0.77 ± 0.01 | 0.75 ± 0.02 | 0.64 ± 0.03 | 0.65 ± 0.03 | 0.41 ± 0.03 |
| Lexi 7B | 16 | 0.75 ± 0.02 | 0.74 ± 0.02 | 0.62 ± 0.03 | 0.63 ± 0.04 | 0.39 ± 0.02 |
| Lexi 7B | 8 | 0.74 ± 0.02 | 0.73 ± 0.03 | 0.61 ± 0.04 | 0.62 ± 0.03 | 0.38 ± 0.02 |

Table: Accuracy impact from 8-bit quantization is minimal across tasks and models, validating BSNS for efficient distributed inference.


2. Latency, Bandwidth, and Token Throughput

| Model | RTT | Bandwidth | Batch Size | Gen. Steps/s (64) | Gen. Steps/s (1024) | Tokens/s (64) | Tokens/s (1024) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-3 8B | <5 ms | 1 Gbit/s | 1 | 1.20 | 1.10 | 8 | 6.5 |
| LLaMA-3 8B | <5 ms | 1 Gbit/s | 32 | 1.15 | 1.08 | 28 | 26.4 |
| LLaMA-3 8B | <5 ms | 1 Gbit/s | 64 | 1.10 | 1.05 | 56 | 52.5 |
| LLaMA-3 8B | <10 ms | 100 Mbit/s | 1 | 0.85 | 0.80 | 6 | 5 |
| LLaMA-3 8B | <10 ms | 100 Mbit/s | 32 | 0.80 | 0.75 | 22 | 20 |
| LLaMA-3 8B | <10 ms | 100 Mbit/s | 64 | 0.75 | 0.70 | 44 | 40 |
| Mixtral 7x8B | <5 ms | 1 Gbit/s | 1 | 1.30 | 1.25 | 6 | 5.4 |
| Mixtral 7x8B | <5 ms | 1 Gbit/s | 32 | 1.25 | 1.20 | 24 | 22.8 |
| Mixtral 7x8B | <5 ms | 1 Gbit/s | 64 | 1.20 | 1.15 | 49 | 45.5 |
| Mixtral 7x8B | <10 ms | 100 Mbit/s | 1 | 0.90 | 0.85 | 5 | 4.5 |
| Mixtral 7x8B | <10 ms | 100 Mbit/s | 32 | 0.85 | 0.80 | 20 | 18.5 |
| Mixtral 7x8B | <10 ms | 100 Mbit/s | 64 | 0.80 | 0.75 | 40 | 37 |

Table: BSNS supports high-throughput generation across network settings by balancing shard placement and caching.


3. Multi-Modal Model Performance (Text-to-Image)

| Category | Model | Fairness | Quality | Creativity | Knowledge | Performance |
| --- | --- | --- | --- | --- | --- | --- |
| Stable Diffusion | v1.4 | 0.68 | 0.86 | 0.68 | 0.68 | 0.85 |
| Stable Diffusion | v1.5 | 0.54 | 0.73 | 0.21 | 0.50 | 0.81 |
| Stable Diffusion | v2 base | 0.51 | 0.85 | 0.20 | 0.39 | 0.88 |
| Anime-Style | kivotos-xl-2.0 | 0.77 | 0.87 | 0.91 | 0.72 | 0.81 |
| Anime-Style | holodayo-xl-2.1 | 0.79 | 0.89 | 0.94 | 0.74 | 0.83 |
| Debiasing | mobius | 0.82 | 0.71 | 0.86 | 0.77 | 0.87 |

Table: BSNS accommodates diverse model types. Each model is containerized and tunable for privacy, quality, or application needs.


Summary

BSNS provides a principled method to decompose and distribute large models across dynamic, heterogeneous environments. Its key innovations include:

  • Recursively informed shard placement based on topology and cost

  • Overflow-aware memory planning for fast-access SRAM

  • Graph partitioning for arbitrary architectures

  • Real-time rebalancing of shard placement under changing network conditions

This framework enables Nesa to run large-scale inference on decentralized compute networks securely, efficiently, and scalably.
