Inference Acceleration with MetaInf
Efficient inference in decentralized AI systems is a fundamental challenge. While many acceleration strategies have been proposed—such as prefix caching, chunked prefill, and continuous batching—no single method consistently performs best across all workloads, models, and hardware profiles. This is especially true in heterogeneous environments where nodes may differ in compute power, memory bandwidth, or latency constraints.
MetaInf is our proposed solution: a meta-learning-based scheduler that automatically selects the best inference optimization method based on real-time system and workload signals.
Why Traditional Acceleration Falls Short
Experimental benchmarks show that the performance of each acceleration method depends strongly on batch size and hardware. For example:
Prefix caching excels in large-batch regimes.
Continuous batching works better in latency-sensitive settings.
Combining all techniques does not always yield the best outcome due to overhead.

Introducing MetaInf
MetaInf is a two-stage meta-learning framework:
Offline meta-training: learns a predictor that maps embeddings of the model architecture, hardware configuration, and input data to expected inference performance.
Online scheduling: given a new task, hardware profile, and cost budget, MetaInf selects the method with the highest predicted performance while staying within the budget. (Illustrative sketches of both stages appear below.)
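A minimal sketch of the offline stage follows, under stated assumptions: the hash-seeded `make_embedding` function, the feature layout, and the gradient-boosted regressor used as the performance predictor are illustrative stand-ins, not MetaInf's actual components.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def make_embedding(model_desc: str, hw_desc: str, workload_desc: str) -> np.ndarray:
    """Placeholder for the LLM-derived semantic embeddings described in the text
    (deterministic hash-seeded features, purely illustrative)."""
    seed = abs(hash((model_desc, hw_desc, workload_desc))) % 2**32
    return np.random.default_rng(seed).standard_normal(32)

# Hypothetical profiling history: (model, hardware, workload, method, measured perf)
history = [
    ("LLaMA-7B", "A100-80GB", "batch=64 summarization", "prefix_caching",      1.42),
    ("LLaMA-7B", "A100-80GB", "batch=1 chat",           "continuous_batching", 1.18),
    ("LLaMA-7B", "RTX-4090",  "batch=8 code",           "chunked_prefill",     1.05),
    # ... many more profiled configurations in practice
]

methods = sorted({m for _, _, _, m, _ in history})
X, y = [], []
for model, hw, workload, method, perf in history:
    method_onehot = np.eye(len(methods))[methods.index(method)]
    X.append(np.concatenate([make_embedding(model, hw, workload), method_onehot]))
    y.append(perf)

# The meta-predictor maps (embedding, candidate method) -> expected performance.
predictor = GradientBoostingRegressor().fit(np.array(X), np.array(y))
```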

Key Capabilities
Zero-shot generalization: No need to re-run all inference methods—MetaInf predicts performance based on embeddings.
Budget-aware selection: accounts for estimated runtime × hardware cost and skips candidates that would exceed the budget (see the sketch after this list).
Strong empirical performance: in evaluations across four LLMs and multiple GPU setups, MetaInf outperforms both heuristic and traditional ML baselines, achieving the highest selection accuracy and lowest inference cost.
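The sketch below illustrates the online, budget-aware selection step. It reuses `make_embedding`, `methods`, and `predictor` from the offline sketch above; the `est_runtime_h` runtime estimates and `cost_per_hour` pricing are hypothetical inputs used only to show how the budget check might work.

```python
import numpy as np  # make_embedding, methods, predictor come from the offline sketch above

def select_method(model, hw, workload, est_runtime_h, cost_per_hour, budget):
    """Pick the candidate with the highest predicted performance whose
    estimated cost (runtime x hardware price) stays within the budget."""
    emb = make_embedding(model, hw, workload)
    best_method, best_score = None, float("-inf")
    for i, method in enumerate(methods):
        if est_runtime_h.get(method, float("inf")) * cost_per_hour > budget:
            continue  # over budget: skip this candidate
        features = np.concatenate([emb, np.eye(len(methods))[i]]).reshape(1, -1)
        score = predictor.predict(features)[0]
        if score > best_score:
            best_method, best_score = method, score
    return best_method

choice = select_method(
    "Mistral-7B", "A100-80GB", "batch=32 summarization",
    est_runtime_h={"prefix_caching": 0.30, "continuous_batching": 0.25, "chunked_prefill": 0.40},
    cost_per_hour=2.5, budget=1.0,
)
print(choice)
```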
Results Summary
| Method | Selection accuracy |  | Speedup |
| --- | --- | --- | --- |
| ISAC | 0.578 | 0.60 | 1.10 |
| ALORS | 0.725 | 0.71 | 1.20 |
| Random Forest | 0.742 | 0.69 | 1.25 |
| Gradient Boosting | 0.815 | 0.78 | 1.30 |
| MetaInf (Ours) | 0.898 | 0.85 | 1.55 |
MetaInf outperforms all baselines in accuracy and speedup.
Prompt Engineering & Embedding Study
MetaInf leverages LLM-based semantic embeddings to encode dataset and system features. An ablation study compares:
One-hot encoding
Basic descriptors (e.g., “Model: LLaMA-7B”)
Rich structured templates
Chain-of-thought (CoT) prompts
Results confirm the value of rich prompts and SVD-based embedding reduction for effective meta-scheduling.
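As a rough illustration of this pipeline, the sketch below builds rich structured descriptions for each configuration, embeds them, and applies truncated SVD to obtain compact meta-features. The template wording, the hash-seeded `embed` stand-in, and the choice of `TruncatedSVD` are assumptions for illustration; the actual embedding model is not specified here.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def rich_template(model: str, gpu: str, batch: int) -> str:
    """Rich structured description, in contrast to a bare one-hot identifier."""
    return (f"Model: {model}. Hardware: {gpu}. Workload: batch size {batch}, "
            "decoder-only transformer inference.")

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in for an LLM embedding endpoint (deterministic hash-seeded features)."""
    return np.random.default_rng(abs(hash(text)) % 2**32).standard_normal(dim)

configs = [("LLaMA-7B", "A100-80GB", 64), ("LLaMA-7B", "RTX-4090", 8),
           ("Mistral-7B", "A100-80GB", 32), ("Falcon-7B", "L40S", 16)]
E = np.stack([embed(rich_template(*c)) for c in configs])

# SVD-based reduction keeps the leading directions of variation, giving
# compact meta-features that the scheduler consumes instead of raw embeddings.
svd = TruncatedSVD(n_components=3, random_state=0)
meta_features = svd.fit_transform(E)   # shape: (num_configs, 3)
```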

Conclusion
MetaInf shows that smart scheduling—not brute-force trial and error—can unlock fast inference in complex decentralized settings. By integrating real-time inference metadata and learned representations, MetaInf provides a scalable, adaptive foundation for deploying large models across dynamic infrastructure.