Blockchain-based Sequential Neural Sharding (BSNS)

This page introduces the theoretical and practical foundations of Nesa's model sharding protocol (BSNS), including its recursive swarm assignment formulation, network-aware partitioning strategy, and dynamic rebalancing algorithm. It explains how large models are decomposed and efficiently distributed across heterogeneous infrastructure.


Model Partitioning and Dynamic Sharding

Modern foundation models are massive—often exceeding the memory and compute limits of any single device. To enable distributed inference over commodity nodes, Nesa introduces a scalable, network-aware method called Blockchain-based Sequential Neural Sharding (BSNS).

Figure: Overview of BSNS model sharding and inference coordination. (A) Available nodes are selected to form an orchestrated swarm. An orchestrator node is elected, followed by a committee of participants to execute shards. (B) The orchestrator partitions the computation graph across the selected nodes, assigning model blocks or operator groups to each. (C) During inference, each node executes its assigned shard and forwards intermediate activations, while the orchestrator ensures data flow and maintains consistency across the swarm.

Rather than statically slicing layers or replicating the full model, BSNS partitions the model graph into sharded blocks and routes them across a dynamic swarm of nodes. Each node executes only its assigned segment, and intermediate activations are passed securely downstream.


Sequential Sharding over Transformer Blocks

BSNS begins with a sequential decomposition of the model—usually aligned with transformer blocks. For example, a LLaMA-style model with 80 layers is broken into contiguous block segments. These blocks are assigned to nodes in a specific execution path, ensuring correct ordering and throughput.

Given a list of blocks {b1, b2, ..., bL} and a swarm of nodes {n1, n2, ..., nk}, the BSNS orchestrator builds a mapping:

(b1 → n_i1), (b2 → n_i2), ..., (bL → n_iL)

Each assignment considers node memory, available compute, and connection latency. Critically, the selection of the current node depends not only on the block’s cost, but also on where the previous shards were placed—this creates a recursive assignment path that reflects network topology.
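To make the recursive placement concrete, here is a minimal Python sketch of how an orchestrator might walk the block list and pick each node relative to where the previous shard landed. The data structures, the scoring weights (alpha, beta), and the assign_shards helper are illustrative assumptions, not Nesa's actual implementation.

```python
# Illustrative sketch of recursive shard placement; node records, cost
# weights, and the scoring rule are hypothetical, not Nesa's implementation.

def assign_shards(blocks, nodes, latency, alpha=1.0, beta=0.1):
    """Map each block to a node, preferring nodes with free memory and fast
    compute that sit close (low latency) to the node holding the previous shard."""
    assignment = []                                   # list of (block_index, node_id)
    free_mem = {n["id"]: n["memory"] for n in nodes}
    prev = None                                       # node holding the previous shard

    for i, block in enumerate(blocks):
        candidates = [n for n in nodes if free_mem[n["id"]] >= block["memory"]]
        if not candidates:
            raise RuntimeError(f"no node can host block {i}")

        def score(n):
            compute_cost = block["flops"] / n["flops_per_s"]
            link_cost = latency[prev][n["id"]] if prev is not None else 0.0
            return alpha * compute_cost + beta * link_cost   # depends on previous placement

        best = min(candidates, key=score)
        free_mem[best["id"]] -= block["memory"]
        assignment.append((i, best["id"]))
        prev = best["id"]
    return assignment


# Tiny example: four blocks over two nodes with a symmetric latency table (ms).
blocks = [{"memory": 4, "flops": 8e9} for _ in range(4)]
nodes = [
    {"id": "n1", "memory": 8,  "flops_per_s": 2e12},
    {"id": "n2", "memory": 16, "flops_per_s": 1e12},
]
latency = {"n1": {"n1": 0.0, "n2": 5.0}, "n2": {"n1": 5.0, "n2": 0.0}}
print(assign_shards(blocks, nodes, latency))   # [(0, 'n1'), (1, 'n1'), (2, 'n2'), (3, 'n2')]
```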


Dynamic Rebalancing

Network conditions can change quickly. To avoid bottlenecks, BSNS includes a dynamic rebalancing function. This periodically evaluates the current assignment of blocks to nodes and re-routes shards to better candidates based on latency, hardware capacity, and recent throughput.

Let S be the current sequence of shard-to-node assignments and Θ represent live network parameters. BSNS computes a new optimized path:

S' = Rebalance(S, Θ)

This allows Nesa to maintain stable performance even if nodes drop, delay, or degrade during inference.
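A minimal sketch of how the rebalancing step could look, reusing the hypothetical assign_shards helper from the previous sketch: re-plan the path under live measurements Θ and adopt the new plan only if it improves the estimated cost by a margin. The 20% improvement threshold and the cost model are assumptions for illustration.

```python
# Minimal sketch of S' = Rebalance(S, Θ), reusing the hypothetical
# assign_shards helper above; the 20% improvement margin is an assumption.

def path_cost(assignment, blocks, nodes_by_id, latency):
    """Estimated end-to-end cost of a shard path: compute time plus hop latency."""
    cost, prev = 0.0, None
    for i, node_id in assignment:
        cost += blocks[i]["flops"] / nodes_by_id[node_id]["flops_per_s"]
        if prev is not None:
            cost += latency[prev][node_id]
        prev = node_id
    return cost

def rebalance(current, blocks, nodes, latency, improvement=0.2):
    """Re-plan with live network parameters and migrate only if clearly better."""
    nodes_by_id = {n["id"]: n for n in nodes}
    proposed = assign_shards(blocks, nodes, latency)        # new plan under live Θ
    old = path_cost(current, blocks, nodes_by_id, latency)
    new = path_cost(proposed, blocks, nodes_by_id, latency)
    return proposed if new < (1 - improvement) * old else current
```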


Partitioning Arbitrary Neural Graphs

While BSNS works naturally with transformer stacks, it also supports arbitrary computation graphs (e.g., multimodal, convolutional, or hybrid models). A model is represented as a directed acyclic graph G = (V, E), where:

  • V: the operations (e.g., attention layers, matrix multiplications)

  • E: the data flow edges between them

Each operation v has attributes:

  • C(v): compute time

  • M(v): memory footprint

  • O(v): output size

The goal is to divide the graph into k blocks {G1, G2, ..., Gk} such that:

  • Each block can execute independently on one node

  • The resulting inter-block graph is still acyclic

  • Inter-node communication is minimized

The cost of sending output from block Gi to Gj is the sum of the output sizes of all operations in Gi that are consumed by operations in Gj.
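The sketch below encodes a toy DAG with the C(v), M(v), O(v) attributes and computes the inter-block communication cost exactly as defined above. The operation names, attribute values, and the two-block partition are hypothetical.

```python
# Toy DAG with per-operation attributes C(v), M(v), O(v); operation names
# and values are hypothetical. comm_cost implements the definition above.

ops = {
    "embed": {"C": 1.0, "M": 2.0, "O": 0.5, "out": ["attn"]},   # C: compute, M: memory, O: output size
    "attn":  {"C": 3.0, "M": 4.0, "O": 0.5, "out": ["mlp"]},
    "mlp":   {"C": 2.5, "M": 6.0, "O": 0.5, "out": ["head"]},
    "head":  {"C": 0.5, "M": 1.0, "O": 0.1, "out": []},
}

# A candidate partition into two blocks; the induced block graph must stay acyclic.
partition = {"G1": ["embed", "attn"], "G2": ["mlp", "head"]}

def comm_cost(src_block, dst_block, ops, partition):
    """Sum of output sizes of ops in src_block consumed by ops in dst_block."""
    dst = set(partition[dst_block])
    return sum(
        ops[v]["O"]
        for v in partition[src_block]
        if any(consumer in dst for consumer in ops[v]["out"])
    )

print(comm_cost("G1", "G2", ops, partition))   # 0.5: only attn -> mlp crosses the cut
```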


Memory Overflow and Execution Cost

Each block may exceed the fast memory available (e.g., SRAM). If the block’s parameter size P exceeds a node’s fast memory limit F, overflow cost is incurred by streaming from slower memory.

The overflow penalty is proportional to:

max(0, P - F) × τ

where τ is the streaming latency multiplier.

The total cost of a block includes:

  • Receiving input tensors from upstream

  • Local execution cost (including overflow)

  • Sending outputs downstream
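A small sketch of this per-block cost model, assuming the linear overflow penalty max(0, P - F) × τ and a single link bandwidth for both receiving and sending; all constants are illustrative.

```python
# Minimal per-block cost sketch assuming a linear overflow penalty and a
# single link bandwidth for input and output; all constants are illustrative.

def overflow_penalty(param_size, fast_mem, tau):
    """Cost of streaming parameters that do not fit in fast memory: max(0, P - F) * tau."""
    return max(0.0, param_size - fast_mem) * tau

def block_cost(input_size, output_size, compute_time,
               param_size, fast_mem, bandwidth, tau):
    """Receive inputs + local execution (with overflow) + send outputs."""
    recv = input_size / bandwidth
    execute = compute_time + overflow_penalty(param_size, fast_mem, tau)
    send = output_size / bandwidth
    return recv + execute + send

# Example: a 6 GB block on a node with 4 GB of fast memory (units arbitrary).
print(block_cost(input_size=0.5, output_size=0.5, compute_time=2.0,
                 param_size=6.0, fast_mem=4.0, bandwidth=1.0, tau=0.8))   # 4.6
```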


Max-Throughput Partitioning Problem (MTPP)

The partitioning problem becomes:

Find a partition of the graph into k blocks that minimizes the maximum block cost

This is an NP-hard problem. BSNS approximates the optimal solution using a combination of greedy heuristics, persistent homology metrics, and node scoring.
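For intuition, here is a textbook greedy/binary-search baseline for the min-max objective on a purely sequential chain of blocks. It is not Nesa's production heuristic, which also incorporates node scoring and topology-aware metrics.

```python
# Greedy/binary-search baseline for the min-max objective on a purely
# sequential chain; a textbook approximation, not Nesa's production heuristic.

def can_partition(costs, k, limit):
    """Can the chain be cut into at most k contiguous blocks, each <= limit?"""
    blocks, current = 1, 0.0
    for c in costs:
        if c > limit:
            return False
        if current + c > limit:
            blocks, current = blocks + 1, c   # start a new block
        else:
            current += c
    return blocks <= k

def min_max_block_cost(costs, k, tol=1e-6):
    """Smallest achievable maximum block cost over k contiguous blocks."""
    lo, hi = max(costs), sum(costs)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if can_partition(costs, k, mid):
            hi = mid
        else:
            lo = mid
    return hi

layer_costs = [1.0, 2.0, 1.5, 3.0, 0.5, 2.5]   # hypothetical per-layer costs
print(min_max_block_cost(layer_costs, k=3))    # ~4.5
```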

Figure: Genetic algorithm-based node selection in BSNS. The orchestrator uses multi-objective optimization to evaluate candidates on distance, bandwidth, compute availability, and reliability. Offspring swarm candidates are generated and scored via crossover and mutation. The objective is to select the optimal swarm sequence for sharding a large model while minimizing cost and latency.
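The toy sketch below mirrors that loop: candidate swarms are node orderings, fitness is the total hop latency along the path, and new candidates come from crossover and mutation. The fitness function, population size, and mutation rate are illustrative assumptions, not protocol parameters.

```python
# Toy genetic search over node orderings; fitness, population size, and
# mutation rate are illustrative assumptions, not protocol parameters.
import random

def fitness(order, latency):
    """Lower is better: total hop latency along the candidate shard path."""
    return sum(latency[a][b] for a, b in zip(order, order[1:]))

def crossover(a, b):
    """Order-preserving crossover: keep a prefix of a, fill the rest from b."""
    cut = random.randint(1, len(a) - 1)
    head = a[:cut]
    return head + [n for n in b if n not in head]

def mutate(order, rate=0.2):
    """Occasionally swap two positions in the path."""
    order = order[:]
    if random.random() < rate:
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
    return order

def evolve(node_ids, latency, population=20, generations=30):
    pop = [random.sample(node_ids, len(node_ids)) for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=lambda o: fitness(o, latency))
        parents = pop[: population // 2]                 # keep the fittest half
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(population - len(parents))]
        pop = parents + children
    return min(pop, key=lambda o: fitness(o, latency))

latency = {   # pairwise latencies (ms) between candidate nodes, hypothetical
    "n1": {"n2": 5, "n3": 1, "n4": 9},
    "n2": {"n1": 5, "n3": 2, "n4": 3},
    "n3": {"n1": 1, "n2": 2, "n4": 7},
    "n4": {"n1": 9, "n2": 3, "n3": 7},
}
print(evolve(list(latency), latency))   # e.g. ['n4', 'n2', 'n3', 'n1'] (total latency 6)
```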

Swarm Reconfiguration in Practice

At runtime, the BSNS controller continuously monitors:

  • Inference time per block

  • Queue depth

  • Node health and hardware usage

  • Communication latency

If imbalances appear, the system:

  • Replaces slow nodes with idle ones

  • Reassigns heavy blocks to more capable hardware

  • Selects a new sharding plan from a library of precomputed templates

This ensures graceful adaptation without reinitializing the entire inference session.
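As a concrete example of the trigger logic, a hypothetical straggler check might compare each block's recent inference time against the swarm median; the 1.5x slack factor below is an assumed threshold, not a protocol constant.

```python
# Hypothetical straggler check used to trigger reassignment; the 1.5x slack
# factor is an assumed threshold, not a protocol constant.
from statistics import median

def find_straggler_blocks(block_times_ms, slack=1.5):
    """Return block indices whose recent latency exceeds slack x the swarm median."""
    baseline = median(block_times_ms.values())
    return [b for b, t in block_times_ms.items() if t > slack * baseline]

# Per-block inference times reported by the swarm (ms, hypothetical).
times = {0: 41.0, 1: 39.5, 2: 118.0, 3: 43.2}
print(find_straggler_blocks(times))   # [2] -> candidate for reassignment to a faster node
```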


Empirical Performance

While BSNS is designed for scalable model execution, its practicality also depends on how well models perform once distributed, compressed, or extended to multiple modalities. Below are empirical results from several experimental settings validating Nesa's infrastructure.


1. Compression Robustness on Language Tasks

| Model | Bits | HellaSwag | Lambada OpenAI | Causal Judgment | Disambiguation QA | Logical Deduction |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 8B | 16 | 0.76 ± 0.01 | 0.75 ± 0.03 | 0.63 ± 0.02 | 0.64 ± 0.04 | 0.40 ± 0.03 |
| LLaMA 8B | 8 | 0.76 ± 0.01 | 0.74 ± 0.03 | 0.62 ± 0.03 | 0.63 ± 0.04 | 0.39 ± 0.02 |
| Mixtral 7x8B | 16 | 0.78 ± 0.01 | 0.76 ± 0.03 | 0.65 ± 0.02 | 0.66 ± 0.03 | 0.42 ± 0.02 |
| Mixtral 7x8B | 8 | 0.77 ± 0.01 | 0.75 ± 0.02 | 0.64 ± 0.03 | 0.65 ± 0.03 | 0.41 ± 0.03 |
| Lexi 7B | 16 | 0.75 ± 0.02 | 0.74 ± 0.02 | 0.62 ± 0.03 | 0.63 ± 0.04 | 0.39 ± 0.02 |
| Lexi 7B | 8 | 0.74 ± 0.02 | 0.73 ± 0.03 | 0.61 ± 0.04 | 0.62 ± 0.03 | 0.38 ± 0.02 |

Table: Accuracy impact from 8-bit quantization is minimal across tasks and models, validating BSNS for efficient distributed inference.


2. Latency, Bandwidth, and Token Throughput

| Model | RTT | Bandwidth | Batch Size | Gen. Steps/s (64) | Gen. Steps/s (1024) | Tokens/s (64) | Tokens/s (1024) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-3 8B | <5 ms | 1 Gbit/s | 1 | 1.20 | 1.10 | 8 | 6.5 |
| LLaMA-3 8B | <5 ms | 1 Gbit/s | 32 | 1.15 | 1.08 | 28 | 26.4 |
| LLaMA-3 8B | <5 ms | 1 Gbit/s | 64 | 1.10 | 1.05 | 56 | 52.5 |
| LLaMA-3 8B | <10 ms | 100 Mbit/s | 1 | 0.85 | 0.80 | 6 | 5 |
| LLaMA-3 8B | <10 ms | 100 Mbit/s | 32 | 0.80 | 0.75 | 22 | 20 |
| LLaMA-3 8B | <10 ms | 100 Mbit/s | 64 | 0.75 | 0.70 | 44 | 40 |
| Mixtral 7x8B | <5 ms | 1 Gbit/s | 1 | 1.30 | 1.25 | 6 | 5.4 |
| Mixtral 7x8B | <5 ms | 1 Gbit/s | 32 | 1.25 | 1.20 | 24 | 22.8 |
| Mixtral 7x8B | <5 ms | 1 Gbit/s | 64 | 1.20 | 1.15 | 49 | 45.5 |
| Mixtral 7x8B | <10 ms | 100 Mbit/s | 1 | 0.90 | 0.85 | 5 | 4.5 |
| Mixtral 7x8B | <10 ms | 100 Mbit/s | 32 | 0.85 | 0.80 | 20 | 18.5 |
| Mixtral 7x8B | <10 ms | 100 Mbit/s | 64 | 0.80 | 0.75 | 40 | 37 |

Table: BSNS supports high-throughput generation across network settings by balancing shard placement and caching.


3. Multi-Modal Model Performance (Text-to-Image)

| Category | Model | Fairness | Quality | Creativity | Knowledge | Performance |
| --- | --- | --- | --- | --- | --- | --- |
| Stable Diffusion | v1.4 | 0.68 | 0.86 | 0.68 | 0.68 | 0.85 |
| Stable Diffusion | v1.5 | 0.54 | 0.73 | 0.21 | 0.50 | 0.81 |
| Stable Diffusion | v2 base | 0.51 | 0.85 | 0.20 | 0.39 | 0.88 |
| Anime-Style | kivotos-xl-2.0 | 0.77 | 0.87 | 0.91 | 0.72 | 0.81 |
| Anime-Style | holodayo-xl-2.1 | 0.79 | 0.89 | 0.94 | 0.74 | 0.83 |
| Debiasing | mobius | 0.82 | 0.71 | 0.86 | 0.77 | 0.87 |

Table: BSNS accommodates diverse model types. Each model is containerized and tunable for privacy, quality, or application needs.


Summary

BSNS provides a principled method to decompose and distribute large models across dynamic, heterogeneous environments. Its key innovations include:

  • Recursively informed shard placement based on topology and cost

  • Overflow-aware memory planning for fast-access SRAM

  • Graph partitioning for arbitrary architectures

  • Real-time rebalancing of shard placement under changing network conditions

This framework enables Nesa to run large-scale inference on decentralized compute networks securely, efficiently, and scalably.
