Inside the vLLM Semantic Router
In the world of production-grade LLM applications, simply forwarding a prompt to a single, powerful model like GPT-4 is often inefficient and expensive. Different tasks require different capabilities. A complex math problem needs a model trained for logical reasoning, while a request for a poem is better suited for a creative model. This is where a Mixture-of-Models (MoM) architecture, powered by a semantic router, becomes essential.
This post is a deep dive into the internal workings of our vLLM Semantic Router, a high-performance system designed to intelligently analyze and route LLM requests to the most appropriate backend model. We’ll explore its architecture, the hybrid Go/Rust design, the machine learning core, and how it all comes together to create a robust, scalable, and cost-effective solution.
High-Level System Architecture
At its core, the Semantic Router is designed as an Envoy External Processor (ExtProc). This allows it to operate as a separate, highly specialized service that intercepts, analyzes, and modifies requests and responses flowing through the Envoy proxy, which acts as the main entry point for all traffic.
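To make the ExtProc pattern concrete, here is a minimal sketch of such a service in Go, using the `ext_proc` v3 bindings from `envoyproxy/go-control-plane`. The stub logic is illustrative only; the real router performs its analysis where the comments indicate and replies to Envoy with header mutations rather than empty responses.

```go
package main

import (
	"io"
	"log"
	"net"

	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
	"google.golang.org/grpc"
)

// router implements Envoy's ExternalProcessor gRPC service.
type router struct {
	extprocv3.UnimplementedExternalProcessorServer
}

// Process handles the bidirectional stream Envoy opens for each HTTP request.
// Envoy sends one message per phase (request headers, request body, ...);
// the service replies with instructions for that phase.
func (r *router) Process(stream extprocv3.ExternalProcessor_ProcessServer) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}

		var resp *extprocv3.ProcessingResponse
		switch req.Request.(type) {
		case *extprocv3.ProcessingRequest_RequestBody:
			// This is where the real router inspects the prompt: PII check,
			// semantic cache lookup, classification, and a header mutation
			// naming the backend model Envoy should route to.
			resp = &extprocv3.ProcessingResponse{
				Response: &extprocv3.ProcessingResponse_RequestBody{
					RequestBody: &extprocv3.BodyResponse{},
				},
			}
		case *extprocv3.ProcessingRequest_RequestHeaders:
			resp = &extprocv3.ProcessingResponse{
				Response: &extprocv3.ProcessingResponse_RequestHeaders{
					RequestHeaders: &extprocv3.HeadersResponse{},
				},
			}
		default:
			// Response-side phases are acknowledged in the same way.
			resp = &extprocv3.ProcessingResponse{}
		}
		if err := stream.Send(resp); err != nil {
			return err
		}
	}
}

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer()
	extprocv3.RegisterExternalProcessorServer(s, &router{})
	log.Fatal(s.Serve(lis))
}
```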
The architecture is split into four distinct layers:
1. Client Layer: Any application that needs to interact with an LLM.
2. Proxy Layer: An Envoy proxy that manages traffic, load balancing, and health checks.
3. Processing Layer: The heart of our system, the Semantic Router service, which makes intelligent decisions.
4. Model Layer: A collection of specialized LLM backends (e.g., for math, coding, creative writing).
A Hybrid Go & Rust Architecture
The main ExtProc service is written in Go, whose lightweight goroutines and mature gRPC tooling make it well suited to a server that must handle thousands of concurrent request streams.
However, the core machine learning computations are handled by a library written in Rust, using the Candle ML framework. Rust offers near-native performance and memory safety, which is critical for running neural network inference efficiently without the overhead of a garbage collector.
This hybrid approach gives us the best of both worlds. The bridge between them is a component called `candle-binding`.
The candle-binding Bridge
We use Go’s cgo tool to call Rust functions directly from our Go application.
The Rust library exposes a C-style Foreign Function Interface (FFI), which cgo can interact with.
When the Go router needs to calculate a text embedding, it calls a function in the candle-binding package. This function handles the data type conversion, calls the high-performance Rust code, and returns the result to Go.
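The exact FFI surface lives in the project’s `candle-binding` package; the snippet below is only a schematic sketch of the pattern. The `embed_text` and `free_embedding` signatures and the library name are hypothetical, but they show how cgo marshals a Go string to C, calls into Rust, copies the result into a Go-owned slice, and frees the Rust-allocated buffer.

```go
package candlebinding

/*
// Hypothetical link flags and C-style signatures exported by the Rust library.
#cgo LDFLAGS: -L${SRCDIR}/target/release -lcandle_semantic_router
#include <stdlib.h>

float* embed_text(const char* text, int* out_len);
void   free_embedding(float* ptr, int len);
*/
import "C"

import "unsafe"

// EmbedText converts a Go string to a C string, calls the Rust FFI function,
// copies the resulting embedding into a Go slice, and releases the buffer
// allocated on the Rust side.
func EmbedText(text string) []float32 {
	ctext := C.CString(text)
	defer C.free(unsafe.Pointer(ctext))

	var n C.int
	ptr := C.embed_text(ctext, &n)
	if ptr == nil {
		return nil
	}
	defer C.free_embedding(ptr, n)

	// View the C array as a Go slice, then copy so the data outlives the C buffer.
	src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n))
	out := make([]float32, len(src))
	copy(out, src)
	return out
}
```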
The Request Processing Pipeline
Here’s a breakdown of the key steps:
1. PII Check: The prompt is first scanned for Personally Identifiable Information (PII). Based on the configuration, the router can redact the information, mask it, or block the request entirely.
2. Semantic Cache Check: The router calculates the prompt’s embedding and compares it to the embeddings of previously seen prompts stored in the cache. If the cosine similarity between the prompt vector and a cached vector is above a configured threshold (e.g., `similarity > 0.9`), we consider it a cache hit and return the stored response immediately, saving significant time and cost (a minimal sketch of this check follows the list below).
3. Classification: If it’s a cache miss, the prompt is classified into a predefined category like “math,” “creative writing,” or “code generation.”
4. Routing: Based on the category, PII status, and other factors, the router selects the most appropriate backend model and instructs Envoy (via a response header) to forward the request.
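Step 2 boils down to a nearest-neighbour check over stored embeddings. A minimal sketch of that check in Go might look like the following; the types and the linear scan are illustrative, and the real cache implementation and threshold are configurable.

```go
package cache

import "math"

// cachedEntry pairs a previously seen prompt's embedding with its stored response.
type cachedEntry struct {
	Embedding []float32
	Response  string
}

// cosineSimilarity returns dot(a, b) / (|a| * |b|).
func cosineSimilarity(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		normA += float64(a[i]) * float64(a[i])
		normB += float64(b[i]) * float64(b[i])
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// Lookup returns a cached response when the prompt's embedding is similar
// enough to a previously seen one (e.g. threshold 0.9).
func Lookup(promptVec []float32, entries []cachedEntry, threshold float64) (string, bool) {
	for _, e := range entries {
		if cosineSimilarity(promptVec, e.Embedding) > threshold {
			return e.Response, true
		}
	}
	return "", false
}
```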
The Decision Engine: Multi-Task ML Models
The intelligence of the router comes from a suite of fine-tuned machine learning models. A key innovation is our dual-classifier, which uses a multi-head architecture on a shared DistilBERT backbone. This allows us to perform multiple tasks, like category classification and PII detection, with a single model inference, reducing memory footprint and computational cost.
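From the router’s point of view, the payoff is that one call returns every head’s output. The sketch below is a hypothetical illustration of that call pattern in Go, not the actual `candle-binding` API.

```go
package classification

// DualClassifierResult is a hypothetical container for the outputs of one
// multi-head inference: a single forward pass through the shared DistilBERT
// backbone feeds both the category head and the PII head.
type DualClassifierResult struct {
	Category           string   // e.g. "math", "code generation", "creative writing"
	CategoryConfidence float32  // probability from the category head's softmax
	HasPII             bool     // decision from the PII head
	PIITypes           []string // entity types flagged by the PII head, if any
}

// Classify is a stub illustrating the call pattern: one inference, two tasks.
// In the real system this would go through the candle-binding FFI.
func Classify(prompt string) (DualClassifierResult, error) {
	_ = prompt
	return DualClassifierResult{}, nil
}
```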
The training pipeline, located in the `src/training` directory, includes scripts for:
- Fine-tuning classifiers on custom datasets.
- Fine-tuning PII detection models.
- Fine-tuning “Prompt Guard” models to detect jailbreak and prompt injection attempts.
This multi-task approach ensures that our router is not only efficient but also secure.
Why synthetic data?
The `dual_classifier` POC uses synthetic data. This is a common strategy for bootstrapping a model when a large, labeled dataset is not yet available: it lets us validate the architecture and training pipeline before investing in expensive data collection and labeling.
Training Scripts
The training scripts are organized by task, with separate scripts for:
- Category classification (classifier_model_fine_tuning)
- PII detection (pii_model_fine_tuning)
- Jailbreak detection (prompt_guard_fine_tuning)
- Multitask learning (multitask_bert_fine_tuning)
The scripts generally follow a standard ML training pipeline:
- Load Data: Load datasets from Hugging Face Hub or local files.
- Preprocess: Tokenize and prepare the data.
- Train: Fine-tune a pre-trained BERT-based model using the Hugging Face Trainer.
- Evaluate: Evaluate the model on a test set.
- Save: Save the trained model to be used by the candle-binding component.
Other Components
- deploy/: Contains Kubernetes manifests for deploying the router, showing that the project is designed for cloud-native environments.
- docker/: Provides a docker-compose.yml for easy local setup and development.
- e2e-tests/: A suite of end-to-end tests that validate the entire system, demonstrating a commitment to quality and reliability.
- bench/: Benchmarking scripts to measure the performance of the router, indicating a focus on performance optimization.
- config/: Centralized YAML configuration files, which make it easy to manage the behavior of the router without changing the code.
Performance and Resilience
A router is only useful if it doesn’t add significant latency. Thanks to the Go/Rust architecture, the router’s per-request overhead comes down to one embedding computation and a few small classifier inferences, which is negligible compared with the time an LLM spends generating a response. That modest overhead is a small price to pay for the cost savings and performance gains that come from routing requests to smaller, specialized models.
Conclusion
The vLLM Semantic Router is more than just a simple proxy; it’s an intelligent control plane for your LLM infrastructure. By combining the strengths of Go for networking and Rust for high-performance machine learning, we’ve built a system that is efficient, scalable, and production-ready. It enables a sophisticated Mixture-of-Models architecture that significantly reduces operational costs while improving the quality and speed of responses.
