# Consensus-based Distribution Verification (CDV)

In addition to hardware-level, TEE-based solutions for model verification, we discuss our algorithmic approach to ensuring model integrity. In decentralized inference systems, verifying that each node executes the intended model is crucial to the integrity and reliability of the system. This verification keeps inference results consistent across the network, safeguards against malicious modification of the model, and supports adherence to privacy protocols and regulatory requirements. It also prevents computational resources from being wasted on incorrect or unauthorized computations, which helps control operational costs.

Given the computational and scalability challenges of using zero-knowledge proofs (ZKPs) to verify the integrity of LLMs in decentralized systems, Nesa proposes a consensus-based distribution verification (CDV) strategy. This strategy leverages the collective agreement of multiple nodes to ensure the correctness and integrity of model execution without revealing sensitive data. We develop the idea step by step.

**Consensus-based Verification**: Consider a decentralized network with $N$ nodes, where each node $i$ executes the same inference model $\mathcal{M}$, with parameters $\theta$, on a given input $x$. The output of the model on node $i$ is denoted by $y_i = \mathcal{M}(x; \theta_i)$, where $\theta_i$ denotes the parameters actually used by node $i$; an honest node uses $\theta_i = \theta$. The goal is to ensure that all nodes accurately execute the model $\mathcal{M}$, yielding consistent outputs.

The process can be formalized in the following steps:

**Redundant Execution**: A subset of the network nodes, $\{1, 2, \ldots, k\} \subseteq \{1, 2, \ldots, N\}$, independently computes the output $y_i$ for the same input $x$.

$y_i = \mathcal{M}(x; \theta), \quad \forall i \in \{1, 2, \ldots, k\}$

**Output Collection**: The outputs $\{y_1, y_2, \ldots, y_k\}$ are collected for consensus evaluation. This collection phase requires secure and efficient communication protocols to protect the integrity of the transmitted data.

**Consensus Determination**: Utilizing a consensus algorithm $\mathcal{C}$, the system evaluates the collected outputs to determine the agreed-upon result $y_{\text{con}}$. The consensus result is considered valid if it satisfies a predefined criterion, such as majority agreement or a more sophisticated decision rule based on the specific properties of the outputs.

$y_{\text{con}} = \mathcal{C}(\{y_1, y_2, \ldots, y_k\})$

**Verification and Finalization**: If the consensus result $y_{\text{con}}$ aligns with the outputs from a sufficiently large subset of nodes, the model's execution is verified. Otherwise, discrepancies indicate potential integrity issues, triggering further investigation or corrective measures.
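The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Nesa's implementation: the names `majority_consensus` and `verify_inference` are hypothetical, nodes are modeled as plain callables, outputs are assumed to be exactly comparable (for example, token IDs from deterministic decoding), and the consensus algorithm $\mathcal{C}$ is taken to be simple majority voting.

```python
from collections import Counter
from typing import Callable, Hashable, List, Optional, Sequence, Tuple


def majority_consensus(outputs: Sequence[Hashable], quorum: float = 0.5) -> Optional[Hashable]:
    """Assumed form of C: return the output shared by more than `quorum`
    of the nodes, or None if no output reaches that share."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > quorum * len(outputs) else None


def verify_inference(
    node_fns: Sequence[Callable[[object], Hashable]],
    x: object,
    quorum: float = 0.5,
) -> Tuple[Optional[Hashable], List[int]]:
    """Run the same input on every selected node and compare the results.

    Returns (y_con, dissenting), where `dissenting` lists the positions of
    nodes whose output differs from the consensus result.
    """
    # Redundant execution and output collection.
    outputs = [fn(x) for fn in node_fns]
    # Consensus determination.
    y_con = majority_consensus(outputs, quorum)
    # Verification: nodes that disagree with y_con are flagged for follow-up.
    dissenting = [i for i, y in enumerate(outputs) if y != y_con]
    return y_con, dissenting
```

When no output reaches the quorum, `y_con` is `None` and every node is reported as dissenting, which corresponds to the "further investigation" branch of the final step.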

This consensus-based approach facilitates the verification of model integrity across decentralized nodes and provides a mechanism to detect and mitigate the impact of faulty or malicious nodes. Consensus-based Verification thus complements hardware-based protections and covers the cases where ZKPs are impractical for LLMs in Nesa's ecosystem.

**Verification in the Context of Sharding**:

Verification under sharding introduces computational redundancy: multiple nodes independently compute the same shard, and their results are cross-verified against one another to strengthen the verification process.

**Selective Random Verification**: To optimize the CDV process with model sharding, Nesa employs a probabilistic method for selecting nodes for verification, termed selective random verification (SRV). Instead of exhaustively verifying the outputs from all sharded model parts across the network, SRV verifies only a randomly chosen subset of nodes. This reduces the computational overhead and network traffic of verification, making it more scalable for large-scale deployments. The SRV process is formalized below in terms of a randomly selected verification subset $V$.

**Consensus-based Distribution Verification**: Building upon traditional consensus mechanisms, the CDV strategy introduces an advanced layer of verification by assessing the statistical distribution of model outputs across a decentralized network. This approach is ideally suited for scenarios where the model is not monolithic but is instead distributed as shards across multiple nodes.

CDV rests on the observation that while individual outputs from model shards might vary slightly due to the stochastic nature of ML models and the complexity of input data, the collective output distribution should remain consistent as long as the model and its inputs are unchanged. By evaluating the aggregated statistical characteristics of these outputs, CDV provides a framework for confirming the uniformity and integrity of the model's behavior, thereby enhancing SP without direct comparison of individual inference results.

**Detailed Implementation of CDV**: Implementing CDV within Nesa's ecosystem involves a multi-faceted approach:

**Rigorous Distribution Comparison**: Utilizing sophisticated statistical methodologies, the derived metrics are juxtaposed with predefined benchmarks or dynamically established norms. Techniques such as hypothesis testing, divergence measures, or similarity indices evaluate the congruence between the observed and expected output distributions, facilitating an objective assessment of model integrity.

**Enhanced Consensus Mechanism with Adaptive Thresholding**: The core of CDV lies in its consensus mechanism, where nodes collectively determine the acceptability of the observed distribution's alignment with benchmarks. Adaptive thresholding plays a crucial role here, dynamically adjusting sensitivity based on historical data and operational context to pinpoint deviations that truly signify integrity breaches.
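As a rough illustration of distribution comparison with adaptive thresholding, the sketch below uses the two-sample Kolmogorov-Smirnov statistic as the divergence measure and sets the threshold from the history of past statistics. Both choices are assumptions made for this example: the source does not specify which test or thresholding rule is used, and the function names, the `n_sigmas` rule, and the `floor` value are hypothetical.

```python
import bisect
import statistics
from typing import Sequence


def ks_statistic(observed: Sequence[float], reference: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples."""
    obs, ref = sorted(observed), sorted(reference)
    gap = 0.0
    for v in obs + ref:
        cdf_obs = bisect.bisect_right(obs, v) / len(obs)
        cdf_ref = bisect.bisect_right(ref, v) / len(ref)
        gap = max(gap, abs(cdf_obs - cdf_ref))
    return gap


def adaptive_threshold(history: Sequence[float], n_sigmas: float = 3.0,
                       floor: float = 0.05) -> float:
    """Assumed rule: mean of past statistics plus n_sigmas standard
    deviations, never below `floor`."""
    if len(history) < 2:
        return floor
    return max(floor, statistics.mean(history) + n_sigmas * statistics.stdev(history))


def node_accepts(observed: Sequence[float], reference: Sequence[float],
                 history: Sequence[float]) -> bool:
    """One node's vote: accept if the observed distribution is close enough
    to the reference under the current adaptive threshold."""
    return ks_statistic(observed, reference) <= adaptive_threshold(history)


def distribution_accepted(votes: Sequence[bool]) -> bool:
    """Assumed consensus rule: a strict majority of nodes must accept."""
    return sum(votes) > len(votes) / 2
```

A threshold derived from historical statistics tightens when past runs were stable and loosens when they were noisy, which is one simple way to realize the "dynamically adjusting sensitivity" idea.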

Through its implementation, CDV offers a powerful solution to the challenges of verifying the integrity of distributed LLMs in Nesa's decentralized framework. By focusing on distributional characteristics rather than discrete output values, CDV not only elevates the verification process but also aligns with the goals of enhancing model security and maintaining stringent privacy standards.

**Taking Model Sharding into Account**: In Nesa's decentralized system, where LLMs may be sharded across multiple nodes for scalability, each node $i$ possesses a unique shard $\mathcal{M}_i$ of the complete model $\mathcal{M}$. This partitioning requires a specialized approach to Consensus-based Verification to accommodate the fragmented nature of model execution.

Consider the complete model $\mathcal{M}$ being divided into $k$ shards, such that $\mathcal{M} = \bigoplus_{i=1}^{k} \mathcal{M}_i$, where $\bigoplus$ denotes the operation of combining the model shards to represent the full model functionality. Given an input $x$, the execution of these shards across $k$ nodes produces a set of partial outputs $\{y_1, y_2, \ldots, y_k\}$, where $y_i = \mathcal{M}_i(x; \theta_i)$.

**Shard Redundant Execution**: For each shard $\mathcal{M}_i$ of the complete model $\mathcal{M}$, redundant execution is performed by a designated subset of nodes. Each of these nodes, within the subset responsible for shard $\mathcal{M}_i$, computes the output $y_{i,j}$ for the given input $x$, where $j$ represents the node within the subset.

$y_{i,j} = \mathcal{M}_i(x; \theta_{i,j}), \quad \forall j \in \text{Subset of nodes for } \mathcal{M}_i$

**Redundant Output Collection and Verification**: The outputs $\{y_{i,1}, y_{i,2}, \ldots, y_{i,m}\}$ for each shard $i$ are collected from the nodes in its subset. A consensus mechanism $\mathcal{C}_i$ specific to shard $i$ then evaluates these collected outputs to determine a shard-specific agreed-upon result $y_{\text{con},i}$.

$y_{\text{con},i} = \mathcal{C}_i(\{y_{i,1}, y_{i,2}, \ldots, y_{i,m}\})$

Here, $m$ denotes the number of nodes executing the shard $\mathcal{M}_i$. The redundancy in computation across these nodes allows for a robust verification mechanism, enhancing the detection of discrepancies or faults.

**Shard Verification Completion**: Upon achieving consensus for a shard $i$, signified by the result $y_{\text{con},i}$, the process ensures the integrity of the shard's computation before proceeding. This step-by-step verification across shards, with redundancy in each shard's computation, significantly reduces the risk of erroneous or malicious model execution.

**Model Reconstruction**: After each shard has been independently verified, the shard-specific consensus results $\{y_{\text{con},1}, y_{\text{con},2}, \ldots, y_{\text{con},k}\}$ are combined to reconstruct the final model output $Y_{\text{final}}$. Because every component is verified before it is combined, the reconstructed output reflects a verified execution of the complete model.

$Y_{\text{final}} = \bigoplus_{i=1}^{k} y_{\text{con},i}$
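The per-shard consensus and reconstruction steps can be sketched as follows. This is a minimal illustration with several assumptions that are not stated in the source: each shard output is a tuple of floats that replicas reproduce exactly, $\mathcal{C}_i$ is strict majority voting, and the combination operator $\bigoplus$ is modeled as concatenation in shard order. The names `shard_consensus` and `reconstruct` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Output = Tuple[float, ...]


def shard_consensus(replica_outputs: List[Output]) -> Optional[Output]:
    """Assumed C_i: the output produced by a strict majority of the m
    replicas of one shard, or None if there is no majority."""
    if not replica_outputs:
        return None
    value, count = Counter(replica_outputs).most_common(1)[0]
    return value if count > len(replica_outputs) / 2 else None


def reconstruct(shard_outputs: Dict[int, List[Output]]) -> Output:
    """Verify every shard, then combine the consensus results.

    `shard_outputs[i]` holds the outputs y_{i,1..m} of the replicas of
    shard i. Concatenation stands in for the abstract combine operator.
    """
    final: List[float] = []
    for i in sorted(shard_outputs):
        y_con_i = shard_consensus(shard_outputs[i])
        if y_con_i is None:
            # No shard-level consensus: stop before using unverified output.
            raise ValueError(f"no consensus for shard {i}")
        final.extend(y_con_i)
    return tuple(final)
```

Raising on a failed shard mirrors the requirement that a shard's integrity be established before the process proceeds; a real system would more likely trigger re-execution or an investigation than abort.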

For each inference task, a verification subset $V \subset \{1, 2, \ldots, k\}$ is randomly selected, where $k$ is the total number of nodes (or model shards) and $V$ contains the indices of the nodes chosen for verification. This selection can be performed by our Verifiable Random Function (VRF) module.

Only the outputs $y_i$ from nodes $i \in V$ undergo the verification process:

$y_i = \mathcal{M}_i(x; \theta_i), \quad \forall i \in V$

A consensus mechanism $\mathcal{C}$ evaluates the partial outputs from the selected subset to ascertain the model's integrity:

$y_{\text{con}} = \mathcal{C}(\{y_i \mid i \in V\})$

If the consensus outcome $y_{\text{con}}$ aligns with expected results, the integrity of the model execution within the sampled subset is confirmed. Inconsistent results trigger a more extensive investigation, potentially leading to a wider verification scope.
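The subset selection and check can be sketched as below. Note the assumptions: a seeded `random.Random` stands in for the VRF module purely for illustration (it is not a verifiable random function), outputs are scalar floats, and "expected results" are modeled as a dictionary of reference values, for instance obtained from redundant execution. The names `select_verification_subset` and `srv_check` are hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence


def select_verification_subset(k: int, sample_size: int,
                               seed: Optional[int] = None) -> List[int]:
    """Pick the verification subset V from the indices 1..k.

    A seeded PRNG is used here only as a placeholder for a VRF.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, k + 1), sample_size))


def srv_check(outputs: Dict[int, float], expected: Dict[int, float],
              subset: Sequence[int], tol: float = 1e-6) -> List[int]:
    """Return the indices in V whose output deviates from the expected
    value by more than `tol`. An empty list means the sample passed."""
    return [i for i in subset if abs(outputs[i] - expected[i]) > tol]
```

A non-empty result from `srv_check` corresponds to the inconsistent case, where the verification scope would be widened beyond the sampled subset.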

**Sharded Execution and Output Synthesis**: In the initial phase, each node, housing a shard $\mathcal{M}_i$ of the overarching model $\mathcal{M}$, executes its segment on a shared input $x$, generating partial outputs $\{y_1, y_2, \ldots, y_k\}$. These outputs are synthesized to construct a comprehensive output profile that reflects the combined inference result of the entire model.

**Advanced Statistical Aggregation**: Following output synthesis, the system computes summary statistics of the outputs, such as the mean $\mu$, standard deviation $\sigma$, and potentially higher-order moments. This stage may also incorporate non-parametric statistics to characterize the output distribution more fully.
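A minimal sketch of this aggregation step, using only Python's standard library, is shown below. It assumes the synthesized outputs can be flattened into a sequence of floats; the function name `summarize` and the particular set of statistics (skewness and excess kurtosis as higher-order moments, the median as a non-parametric statistic) are illustrative choices, not a specification from the source.

```python
import statistics
from typing import Dict, Sequence


def summarize(outputs: Sequence[float]) -> Dict[str, float]:
    """Compute summary statistics of a non-empty sample of outputs."""
    mu = statistics.fmean(outputs)
    sigma = statistics.pstdev(outputs, mu=mu)  # population standard deviation

    def standardized_moment(order: int) -> float:
        # Undefined for a constant sample; report 0.0 in that case.
        if sigma == 0:
            return 0.0
        return statistics.fmean([((v - mu) / sigma) ** order for v in outputs])

    return {
        "mean": mu,
        "std": sigma,
        "skewness": standardized_moment(3),
        "excess_kurtosis": standardized_moment(4) - 3.0 if sigma else 0.0,
        "median": statistics.median(outputs),
    }
```

These summaries are the kind of derived metrics that a later comparison step could check against benchmarks.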