Blockchain-based Sequential Neural Sharding (BSNS)
This page introduces the theoretical and practical foundations of Nesa's model sharding protocol (BSNS), including its recursive swarm assignment formulation, network-aware partitioning strategy, and dynamic rebalancing algorithm. It explains how large models are decomposed and efficiently distributed across heterogeneous infrastructure.
Model Partitioning and Dynamic Sharding
Modern foundation models are massive—often exceeding the memory and compute limits of any single device. To enable distributed inference over commodity nodes, Nesa introduces a scalable, network-aware method called Blockchain-based Sequential Neural Sharding (BSNS).

Rather than statically slicing layers or replicating the full model, BSNS partitions the model graph into sharded blocks and routes them across a dynamic swarm of nodes. Each node executes only its assigned segment, and intermediate activations are passed securely downstream.
Sequential Sharding over Transformer Blocks
BSNS begins with a sequential decomposition of the model—usually aligned with transformer blocks. For example, a LLaMA-style model with 80 layers is broken into contiguous block segments. These blocks are assigned to nodes in a specific execution path, ensuring correct ordering and throughput.
Given a list of blocks {b1, b2, ..., bL} and a swarm of nodes {n1, n2, ..., nk}, the BSNS orchestrator builds a mapping:
(b1 → n_i1), (b2 → n_i2), ..., (bL → n_iL)
Each assignment considers node memory, available compute, and connection latency. Critically, the selection of the current node depends not only on the block’s cost, but also on where the previous shards were placed—this creates a recursive assignment path that reflects network topology.
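To make the recursive placement concrete, here is a minimal sketch in Python. The Node and Block fields, the scoring rule, and the assign_blocks helper are illustrative assumptions rather than Nesa's actual orchestrator API:

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    memory_gb: float        # available memory
    compute_tflops: float   # available compute
    latency_ms: dict        # measured latency to other nodes, keyed by node_id

@dataclass
class Block:
    block_id: int
    memory_gb: float        # parameter + activation footprint
    compute_cost: float     # relative compute cost of the block

def assign_blocks(blocks: list[Block], nodes: list[Node]) -> list[tuple[int, str]]:
    """Greedily map each block to a node, conditioning on where the previous
    block was placed so the path reflects network topology (hypothetical scoring)."""
    assignment = []
    prev_node = None
    for block in blocks:
        candidates = [n for n in nodes if n.memory_gb >= block.memory_gb]
        if not candidates:
            raise RuntimeError(f"no node can host block {block.block_id}")

        def score(n: Node) -> float:
            compute_time = block.compute_cost / n.compute_tflops
            # Hop latency from the node holding the previous shard (0 for the first block).
            hop = prev_node.latency_ms.get(n.node_id, 0.0) if prev_node else 0.0
            return compute_time + hop

        best = min(candidates, key=score)
        assignment.append((block.block_id, best.node_id))
        prev_node = best
    return assignment
```

Because the score of each candidate includes the latency from the previously chosen node, the resulting path is built recursively rather than block by block in isolation.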
Dynamic Rebalancing
Network conditions can change quickly. To avoid bottlenecks, BSNS includes a dynamic rebalancing function. This periodically evaluates the current assignment of blocks to nodes and re-routes shards to better candidates based on latency, hardware capacity, and recent throughput.
Let S be the current sequence of shard-to-node assignments and Θ represent live network parameters. BSNS computes a new optimized path:
S' = Rebalance(S, Θ)
This allows Nesa to maintain stable performance even if nodes drop, delay, or degrade during inference.
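A simplified sketch of what a Rebalance(S, Θ) step could look like; the node_stats fields and thresholds below are assumptions made for illustration, not part of the protocol itself:

```python
def rebalance(assignment: dict[int, str],
              node_stats: dict[str, dict],
              latency_threshold_ms: float = 50.0,
              min_throughput: float = 1.0) -> dict[int, str]:
    """Re-route shards away from nodes whose live metrics (Θ) have degraded.

    assignment: block_id -> node_id (the current sequence S)
    node_stats: node_id -> {"latency_ms": ..., "throughput": ..., "healthy": ...}
    """
    # Nodes that still meet the latency, throughput, and health requirements.
    healthy = {
        nid for nid, s in node_stats.items()
        if s["healthy"]
        and s["latency_ms"] <= latency_threshold_ms
        and s["throughput"] >= min_throughput
    }
    # Idle healthy nodes, best (lowest-latency) candidates first.
    spare = sorted(healthy - set(assignment.values()),
                   key=lambda nid: node_stats[nid]["latency_ms"])

    new_assignment = dict(assignment)
    for block_id, node_id in assignment.items():
        if node_id not in healthy and spare:
            # Move the shard to the best idle candidate.
            new_assignment[block_id] = spare.pop(0)
    return new_assignment
```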
Partitioning Arbitrary Neural Graphs
While BSNS works naturally with transformer stacks, it also supports arbitrary computation graphs (e.g., multimodal, convolutional, or hybrid models). A model is represented as a directed acyclic graph:
V: the operations (e.g., attention layers, matrix multiplications)
E: the data flow edges between them
Each operation v has attributes:
C(v): compute time
M(v): memory footprint
O(v): output size
The goal is to divide the graph into k blocks {G1, G2, ..., Gk} such that:
Each block can execute independently on one node
The resulting inter-block graph is still acyclic
Inter-node communication is minimized
The cost of sending output from block Gi to Gj is the sum of the output sizes of all operations in Gi that are consumed by operations in Gj.
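That communication cost can be computed directly from the graph description. The sketch below assumes the graph is given as producer-consumer edge pairs with per-operation output sizes and a block assignment map; the names are hypothetical:

```python
from collections import defaultdict

def comm_cost(edges: list[tuple[str, str]],
              output_size: dict[str, float],
              block_of: dict[str, int]) -> dict[tuple[int, int], float]:
    """Cost of sending outputs between blocks.

    edges:       (producer_op, consumer_op) data-flow edges E
    output_size: op -> O(v), size of the op's output tensor
    block_of:    op -> index of the block G_i the op was assigned to
    """
    cost = defaultdict(float)
    counted = set()  # count each producer's output once per destination block
    for src, dst in edges:
        gi, gj = block_of[src], block_of[dst]
        if gi != gj and (src, gj) not in counted:
            cost[(gi, gj)] += output_size[src]
            counted.add((src, gj))
    return dict(cost)
```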
Memory Overflow and Execution Cost
Each block may exceed the fast memory available (e.g., SRAM). If the block's parameter size P exceeds a node's fast memory limit F, overflow cost is incurred by streaming from slower memory.
The overflow penalty is proportional to:
max(0, P - F) × τ
where τ is the streaming latency multiplier.
The total cost of a block (sketched in code after this list) includes:
Receiving input tensors from upstream
Local execution cost (including overflow)
Sending outputs downstream
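A hedged sketch of this per-block cost model, with the overflow penalty max(0, P - F) × τ included; the bandwidth handling and parameter names are illustrative assumptions:

```python
def overflow_penalty(param_size_gb: float, fast_memory_gb: float, tau: float) -> float:
    """max(0, P - F) × τ: cost of streaming the part of the block that does not
    fit in fast memory (e.g., SRAM) from slower memory."""
    return max(0.0, param_size_gb - fast_memory_gb) * tau

def block_cost(input_size_gb: float,
               output_size_gb: float,
               compute_time_s: float,
               param_size_gb: float,
               fast_memory_gb: float,
               bandwidth_gb_per_s: float,
               tau: float) -> float:
    """Total cost = receive inputs + local execution (with overflow) + send outputs."""
    receive = input_size_gb / bandwidth_gb_per_s
    execute = compute_time_s + overflow_penalty(param_size_gb, fast_memory_gb, tau)
    send = output_size_gb / bandwidth_gb_per_s
    return receive + execute + send
```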
Max-Throughput Partitioning Problem (MTPP)
The partitioning problem becomes:
Find a partition of the graph into k blocks that minimizes the maximum block cost
This is an NP-hard problem. BSNS approximates the optimal solution using a combination of greedy heuristics, persistent homology metrics, and node scoring.
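As an illustration of the greedy side of that approximation, the sketch below splits a linear chain of blocks into k contiguous groups while keeping the heaviest group as light as possible. It is a classic min-max splitter, not Nesa's full solver (which also incorporates persistent homology metrics and node scoring):

```python
def min_max_partition(block_costs: list[float], k: int) -> list[list[int]]:
    """Split a chain of blocks into at most k contiguous groups, minimizing the
    maximum group cost via binary search on the bottleneck value."""
    def fits(limit: float) -> bool:
        groups, current = 1, 0.0
        for c in block_costs:
            if c > limit:
                return False
            if current + c > limit:
                groups += 1
                current = 0.0
            current += c
        return groups <= k

    lo, hi = max(block_costs), sum(block_costs)
    while hi - lo > 1e-6:
        mid = (lo + hi) / 2
        if fits(mid):
            hi = mid
        else:
            lo = mid

    # Reconstruct the grouping with the found bottleneck value.
    partition, current, total = [], [], 0.0
    for i, c in enumerate(block_costs):
        if current and total + c > hi:
            partition.append(current)
            current, total = [], 0.0
        current.append(i)
        total += c
    partition.append(current)
    return partition

# e.g., min_max_partition([4, 2, 7, 3, 5], k=3) -> [[0, 1], [2], [3, 4]]
```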

Swarm Reconfiguration in Practice
At runtime, the BSNS controller continuously monitors:
Inference time per block
Queue depth
Node health and hardware usage
Communication latency
If imbalances appear, the system:
Replaces slow nodes with idle ones
Reassigns heavy blocks to more capable hardware
Selects a new sharding plan from a library of precomputed templates
This ensures graceful adaptation without reinitializing the entire inference session; a simplified sketch of the decision step follows.
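The sketch below shows one way such a decision step could be expressed; the telemetry field names and thresholds are hypothetical, and a real controller would feed the returned moves back into the orchestrator:

```python
def plan_adjustments(block_metrics: list[dict],
                     idle_nodes: list[str],
                     max_queue_depth: int = 8,
                     max_latency_ms: float = 100.0) -> list[tuple[int, str]]:
    """Decide which blocks should move and where.

    block_metrics: per-block dicts with keys "block_id", "node_id",
                   "queue_depth", "latency_ms", "healthy" (assumed telemetry).
    Returns a list of (block_id, new_node_id) reassignments.
    """
    moves = []
    spare = list(idle_nodes)
    # Rank struggling blocks worst first: unhealthy nodes, then deepest queues.
    overloaded = sorted(
        (m for m in block_metrics
         if not m["healthy"]
         or m["queue_depth"] > max_queue_depth
         or m["latency_ms"] > max_latency_ms),
        key=lambda m: (m["healthy"], -m["queue_depth"]),
    )
    for m in overloaded:
        if not spare:
            break  # no idle capacity left; fall back to a precomputed sharding template
        moves.append((m["block_id"], spare.pop(0)))
    return moves
```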
Empirical Performance
While BSNS is designed for scalable model execution, its practicality also depends on the performance of distributed, compressed, and multi-modal models. Below are empirical results from several experimental settings validating Nesa's infrastructure.
1. Compression Robustness on Language Tasks
Model | Precision (bits) | Task 1 | Task 2 | Task 3 | Task 4 | Task 5
LLaMA 8B | 16 | 0.76 ± 0.01 | 0.75 ± 0.03 | 0.63 ± 0.02 | 0.64 ± 0.04 | 0.40 ± 0.03
LLaMA 8B | 8 | 0.76 ± 0.01 | 0.74 ± 0.03 | 0.62 ± 0.03 | 0.63 ± 0.04 | 0.39 ± 0.02
Mixtral 7x8B | 16 | 0.78 ± 0.01 | 0.76 ± 0.03 | 0.65 ± 0.02 | 0.66 ± 0.03 | 0.42 ± 0.02
Mixtral 7x8B | 8 | 0.77 ± 0.01 | 0.75 ± 0.02 | 0.64 ± 0.03 | 0.65 ± 0.03 | 0.41 ± 0.03
Lexi 7B | 16 | 0.75 ± 0.02 | 0.74 ± 0.02 | 0.62 ± 0.03 | 0.63 ± 0.04 | 0.39 ± 0.02
Lexi 7B | 8 | 0.74 ± 0.02 | 0.73 ± 0.03 | 0.61 ± 0.04 | 0.62 ± 0.03 | 0.38 ± 0.02
Table: Accuracy impact from 8-bit quantization is minimal across tasks and models, validating BSNS for efficient distributed inference.
2. Latency, Bandwidth, and Token Throughput
Model | Latency | Bandwidth | Batch size | Speed (steps/s) | Throughput (tokens/s)
LLaMA-3 8B | <5 ms | 1 Gbit/s | 1 | 1.20 / 1.10 | 8 / 6.5
LLaMA-3 8B | <5 ms | 1 Gbit/s | 32 | 1.15 / 1.08 | 28 / 26.4
LLaMA-3 8B | <5 ms | 1 Gbit/s | 64 | 1.10 / 1.05 | 56 / 52.5
LLaMA-3 8B | <10 ms | 100 Mbit/s | 1 | 0.85 / 0.80 | 6 / 5
LLaMA-3 8B | <10 ms | 100 Mbit/s | 32 | 0.80 / 0.75 | 22 / 20
LLaMA-3 8B | <10 ms | 100 Mbit/s | 64 | 0.75 / 0.70 | 44 / 40
Mixtral 7x8B | <5 ms | 1 Gbit/s | 1 | 1.30 / 1.25 | 6 / 5.4
Mixtral 7x8B | <5 ms | 1 Gbit/s | 32 | 1.25 / 1.20 | 24 / 22.8
Mixtral 7x8B | <5 ms | 1 Gbit/s | 64 | 1.20 / 1.15 | 49 / 45.5
Mixtral 7x8B | <10 ms | 100 Mbit/s | 1 | 0.90 / 0.85 | 5 / 4.5
Mixtral 7x8B | <10 ms | 100 Mbit/s | 32 | 0.85 / 0.80 | 20 / 18.5
Mixtral 7x8B | <10 ms | 100 Mbit/s | 64 | 0.80 / 0.75 | 40 / 37
Table: BSNS supports high-throughput generation across network settings by balancing shard placement and caching.
3. Multi-Modal Model Performance (Text-to-Image)
Category | Model | Metric 1 | Metric 2 | Metric 3 | Metric 4 | Metric 5
Stable Diffusion | v1.4 | 0.68 | 0.86 | 0.68 | 0.68 | 0.85
Stable Diffusion | v1.5 | 0.54 | 0.73 | 0.21 | 0.50 | 0.81
Stable Diffusion | v2 base | 0.51 | 0.85 | 0.20 | 0.39 | 0.88
Anime-Style | kivotos-xl-2.0 | 0.77 | 0.87 | 0.91 | 0.72 | 0.81
Anime-Style | holodayo-xl-2.1 | 0.79 | 0.89 | 0.94 | 0.74 | 0.83
Debiasing | mobius | 0.82 | 0.71 | 0.86 | 0.77 | 0.87
Table: BSNS accommodates diverse model types. Each model is containerized and tunable for privacy, quality, or application needs.
Summary
BSNS provides a principled method to decompose and distribute large models across dynamic, heterogeneous environments. Its key innovations include:
Recursively informed shard placement based on topology and cost
Overflow-aware memory planning for fast-access SRAM
Graph partitioning for arbitrary architectures
Real-time rebalancing of shard placement under changing network conditions
This framework enables Nesa to run large-scale inference on decentralized compute networks securely, efficiently, and scalably.