Inference Acceleration with MetaInf

Efficient inference in decentralized AI systems is a fundamental challenge. While many acceleration strategies have been proposed—such as prefix caching, chunked prefill, and continuous batching—no single method consistently performs best across all workloads, models, and hardware profiles. This is especially true in heterogeneous environments where nodes may differ in compute power, memory bandwidth, or latency constraints.

MetaInf is our proposed solution: a meta-learning-based scheduler that automatically selects the best inference optimization method based on real-time system and workload signals.


Why Traditional Acceleration Falls Short

Experimental benchmarks show that inference acceleration methods perform differently depending on batch size and hardware. For example:

  • Prefix caching excels in large-batch regimes.

  • Continuous batching works better in latency-sensitive settings.

  • Combining all techniques does not always yield the best outcome, because the added overhead can outweigh the gains (see the benchmarking sketch after this list).
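One way to observe this variability is a simple sweep over strategies and batch sizes. The sketch below is purely illustrative: run_inference is a hypothetical stand-in for however a given serving stack toggles these optimizations, not a real API, and the strategy names are taken from the list above.

```python
# Hypothetical benchmark loop: sweep strategies across batch sizes and record
# wall-clock latency. `run_inference` is a placeholder, not a real API.
import time

STRATEGIES = ["prefix_caching", "chunked_prefill", "continuous_batching", "combined"]

def benchmark(run_inference, prompts, batch_sizes=(1, 8, 32, 128)):
    results = {}
    for bs in batch_sizes:
        for strategy in STRATEGIES:
            start = time.perf_counter()
            run_inference(strategy, prompts, batch_size=bs)
            results[(strategy, bs)] = time.perf_counter() - start
    # The fastest strategy typically changes as the batch size grows, which is
    # what motivates an adaptive scheduler rather than a fixed choice.
    return results
```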

Performance comparison of acceleration strategies across batch sizes on LLaMA 3.1 8B. Combining strategies excels at small batch sizes; prefix caching performs best at scale.

Introducing MetaInf

MetaInf is a two-stage meta-learning framework:

  1. Offline Meta-Training: Learns a predictor that maps embeddings of the model architecture, hardware configuration, and input data to expected inference performance.

  2. Online Scheduling: Given a new task, hardware configuration, and cost budget, MetaInf selects the method with the highest predicted performance while staying within that budget.
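A minimal sketch of this two-stage loop, assuming one learned regressor per candidate method and concatenated embeddings as input (the class name, feature layout, and the GradientBoostingRegressor choice are illustrative assumptions, not the actual MetaInf implementation):

```python
# Sketch of the two-stage MetaInf idea; names and features are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

METHODS = ["prefix_caching", "chunked_prefill", "continuous_batching", "combined"]

class MetaScheduler:
    def __init__(self):
        # One performance predictor per candidate acceleration method.
        self.predictors = {m: GradientBoostingRegressor() for m in METHODS}

    def fit(self, records):
        # records: list of (method, model_emb, hw_emb, data_emb, measured_perf);
        # assumes every method appears at least once in the training records.
        for method in METHODS:
            X = [np.concatenate([me, he, de])
                 for m, me, he, de, _ in records if m == method]
            y = [perf for m, *_, perf in records if m == method]
            self.predictors[method].fit(np.asarray(X), np.asarray(y))

    def select(self, model_emb, hw_emb, data_emb, cost_fn, budget):
        # Pick the method with the highest predicted performance whose
        # estimated cost (runtime x hardware price) stays within the budget.
        x = np.concatenate([model_emb, hw_emb, data_emb]).reshape(1, -1)
        feasible = {m: p.predict(x)[0] for m, p in self.predictors.items()
                    if cost_fn(m, x) <= budget}
        return max(feasible, key=feasible.get) if feasible else None
```

Any regressor over tabular meta-features would fit this skeleton; the essential point is that selection happens from predictions alone, without re-running every acceleration method on the new workload.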

MetaInf architecture. Top: training a predictor using model, hardware, and dataset embeddings. Bottom: selecting the optimal method at inference time.

Key Capabilities

  • Zero-shot generalization: No need to re-run all inference methods—MetaInf predicts performance based on embeddings.

  • Budget-aware selection: Accounts for runtime × hardware cost.

  • Strong empirical performance: In evaluations across four LLMs and multiple GPU setups, MetaInf outperforms both heuristic and traditional ML baselines.
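As a usage sketch of the budget-aware step, the cost of a candidate method can be modeled as predicted runtime multiplied by a per-second hardware price. The prices, runtime estimate, and function signature below are placeholders for illustration, not measured values or the actual MetaInf cost model.

```python
# Illustrative cost model: predicted runtime (seconds) x per-second GPU price.
# Prices and the runtime estimate are placeholders for this sketch.
GPU_PRICE_PER_SEC = {"A100-80GB": 0.0011, "L40S": 0.0005}

def cost_fn(method, features, gpu="A100-80GB", predicted_runtime_s=12.0):
    # In the full system the runtime estimate would itself come from the
    # learned predictor; it is passed in directly to keep the example minimal.
    return predicted_runtime_s * GPU_PRICE_PER_SEC[gpu]

# best = scheduler.select(model_emb, hw_emb, data_emb, cost_fn, budget=0.05)
```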

MetaInf achieves the highest selection accuracy and lowest inference cost across all tested methods.


Results Summary

Method               Accuracy   F1 Score   Acceleration Ratio
ISAC                 0.578      0.60       1.10
ALORS                0.725      0.71       1.20
Random Forest        0.742      0.69       1.25
Gradient Boosting    0.815      0.78       1.30
MetaInf (Ours)       0.898      0.85       1.55

MetaInf outperforms all baselines in accuracy and speedup.


Prompt Engineering & Embedding Study

MetaInf leverages LLM-based semantic embeddings to encode dataset and system features. An ablation study compares:

  • One-hot encoding

  • Basic descriptors (e.g., “Model: LLaMA-7B”)

  • Rich structured templates

  • Chain-of-thought (CoT) prompts

Results confirm the value of rich prompts and SVD-based embedding reduction for effective meta-scheduling.
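For intuition, the embedding pipeline can be sketched as: embed textual descriptors with a sentence-embedding model, then compress the raw vectors with SVD into compact meta-features. In the sketch below, embed is a random stand-in for a real embedding model, and the prompt wording is an assumption rather than the exact templates used in the study.

```python
# Sketch: rich textual descriptors -> embeddings -> SVD-reduced meta-features.
# `embed` is a stand-in for a real sentence-embedding model.
import numpy as np

def embed(texts, dim=1536):
    rng = np.random.default_rng(0)          # placeholder embeddings
    return rng.normal(size=(len(texts), dim))

descriptors = [
    # Rich structured template (wording is illustrative)
    "Model: LLaMA-3.1-8B, decoder-only, 8B parameters. "
    "Hardware: 1x A100 80GB. Workload: 2k-token prompts, batch size 32.",
    # Basic descriptor, as in the ablation
    "Model: LLaMA-7B",
]

E = embed(descriptors)
U, S, _ = np.linalg.svd(E, full_matrices=False)  # SVD-based reduction
k = 2                                            # e.g. 32-64 in practice
meta_features = U[:, :k] * S[:k]
print(meta_features.shape)                       # (2, 2)
```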

MetaInf performs consistently across different embedding schemes; rich prompts and CoT yield the best results.

Conclusion

MetaInf shows that smart scheduling—not brute-force trial and error—can unlock fast inference in complex decentralized settings. By integrating real-time inference metadata and learned representations, MetaInf provides a scalable, adaptive foundation for deploying large models across dynamic infrastructure.
