Guardrails
Elad Levi
March 11, 2026

Serving hundreds of guardrails in real-time on a single GPU

Every production AI system faces the same challenge: it's not enough to generate good responses; you also need to ensure they're safe and aligned with your organization's and your system's policies. Every customer interaction with an LLM-powered application may need to be evaluated against dozens or even hundreds of policy rules in real time: Is the agent giving medical advice? Is the response grounded in the provided context, or is it hallucinating? Did the user attempt a jailbreak? Is the response leaking PII? Does it comply with industry-specific regulations?

At Plurai, this is our core workload. Our guardrail system evaluates each conversation against a large battery of specialized classifiers, each one a fine-tuned small language model (SLM) trained to detect a specific policy violation. The challenge isn’t building one good classifier. It’s running a hundred of them simultaneously, on the same conversation, at latencies that don’t break the user experience, and doing it on infrastructure that doesn’t break the budget.

This post describes how we achieved 10–32x efficiency improvements, turning what was a scaling bottleneck into a tractable engineering problem.

The KV cache bottleneck


The scaling wall: Why parallel guardrails are expensive

Running guardrails at scale demands models that are both fast and accurate. The common approach is to use small language models (SLMs) fine-tuned for specific tasks: one model to detect PII, another for medical advice, another for jailbreak attempts, and so on. LoRA adapters make this practical, letting you train dozens of lightweight, task-specific classifiers on top of a single base model without duplicating its full weights.

But even with SLMs, the math gets difficult quickly. A customer-agent conversation comes in, say, 2,000 tokens long. Your safety system needs to check it against 50 or 100 policy rules. That’s 50–100 inference calls on the same conversation, each with a different adapter, and this needs to happen at every single turn of the interaction.

Modern serving engines like vLLM offer a powerful optimization for exactly this pattern: prefix caching. When multiple requests share the same input tokens, the key-value (KV) cache for that shared prefix is computed once and reused across requests. Since all requests contain the same conversation, you’d expect to pay the expensive prefill cost only once.

In theory, this should make multi-adapter evaluation nearly free after the first request. In practice, standard LoRA completely defeats prefix caching.

How prefix caching works, and why it matters

In a decoder-only transformer (the standard architecture behind modern LLMs), each token produces key and value vectors that subsequent tokens attend to. The prefill phase, processing all input tokens before generation begins, is the most compute-intensive part of inference. Prefix caching works by hashing the input token blocks and storing their computed KV vectors. When a new request arrives with the same prefix, the engine skips the prefill and reuses the cached KV values.
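The block-hashing idea can be sketched in a few lines of Python. This is a toy model, not vLLM's actual implementation: the chained SHA-256 over 16-token blocks is our illustrative choice, and the "KV" placeholder stands in for real key/value tensors.

```python
# Toy sketch of block-level prefix caching. Token IDs are grouped into
# fixed-size blocks; each block's key chains over everything before it,
# so a cache hit means the entire prefix up to that block is identical.
import hashlib

BLOCK_SIZE = 16  # illustrative block size

def block_hashes(token_ids):
    """Yield one chained content hash per full block of the prefix."""
    prev = b""
    n_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, n_full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = hashlib.sha256(prev + str(block).encode()).digest()
        yield prev

kv_cache = {}  # hash -> computed KV vectors for that block

def prefill(token_ids):
    """Return (blocks reused from cache, blocks recomputed)."""
    hits = misses = 0
    for h in block_hashes(token_ids):
        if h in kv_cache:
            hits += 1           # skip attention prefill for this block
        else:
            kv_cache[h] = "KV"  # placeholder for the real KV tensors
            misses += 1
    return hits, misses

conversation = list(range(2000))  # a 2,000-token shared conversation
first = prefill(conversation)     # cold: every block computed
second = prefill(conversation)    # warm: every block reused
```

The second request skips the prefill entirely, which is exactly the behavior the guardrail workload wants to exploit.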

Figure 1: Prefix caching computes the shared input once and reuses the KV cache across requests. Without it, the same expensive prefill is repeated for every request.

This is extremely effective when many requests share a common input prefix, which is exactly our workload pattern: the same conversation, evaluated by many different guardrails.

Breaking the KV cache in guardrails

The LoRA issue

Here’s the problem: LoRA modifies the model’s projection weights from the very first token. The adapter’s low-rank matrices (A and B) are added to the base model’s query, key, and value projections throughout the entire sequence. This means that even if two requests contain the identical conversation, the KV values produced by adapter A and adapter B are different, because the projection weights that generated them are different.

vLLM’s prefix cache is keyed by content hash, and with LoRA, that hash includes the adapter identity. So adapter-1’s cached blocks and adapter-2’s cached blocks are completely disjoint. With 50 adapters evaluating the same conversation, you’re computing 50 independent prefills. The cache is useless.
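A toy illustration of the failure mode (the hashing scheme here is ours, not vLLM's internals): once the adapter identity is part of the cache key, as it must be when each adapter yields different KV values, the entries for the same conversation become fully disjoint across adapters.

```python
# Hashing the adapter identity together with the token block means
# 50 adapters evaluating the same conversation produce 50 distinct
# cache entries: zero sharing.
import hashlib

def cache_key(adapter_id, block_tokens):
    payload = f"{adapter_id}:{block_tokens}".encode()
    return hashlib.sha256(payload).hexdigest()

conversation_block = tuple(range(16))  # same tokens for every request
keys = {cache_key(f"adapter-{i}", conversation_block) for i in range(50)}
# len(keys) == 50: one cache entry per adapter for identical content
```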

Figure 2: Each LoRA adapter produces different KV values for the same input tokens because it applies different weight modifications (ΔW). vLLM stores separate cache entries per adapter, turning one prefill into 100 redundant prefills.

At 2,000 tokens and 100 concurrent evaluations, this turns a sub-100ms operation into a 4-second bottleneck.

The prompt structure issue

There’s actually a second, more subtle obstacle. Even if we could somehow share the KV cache across adapters, the standard prompt structure for classification tasks prevents it.

The conventional approach puts the task description in the system prompt:

System: Determine whether the conversation contains
       medical advice.

User:   [conversation]
       ...2000 tokens...


Each guardrail has a different system prompt (different rule description), which means the prefix diverges from the very first token. Even two requests with the same conversation have completely different input sequences because the task instruction comes first.

So we face a double bind:

  1. Different task descriptions at the start of the prompt prevent prefix sharing
  2. Different LoRA adapter weights prevent KV cache reuse even where the tokens match

Our solution: aLoRA + prompt inversion


Activated LoRA: Deferring the adapter

Activated LoRA (aLoRA), proposed by Greenewald et al., offers an elegant solution to the second problem. The core idea: train the adapter to only activate its weights after a designated point in the sequence. All tokens before that activation point are processed using the base model weights exclusively, with no adapter modifications.

During training, the LoRA weight matrices are zeroed out for all positions before the activation token. The model learns to perform its classification task using adapter weights only on the tail end of the input. The mathematical consequence: pre-activation tokens produce KV values that are byte-for-byte identical to the base model's, not merely approximately close.
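The identity property can be seen in a minimal pure-Python sketch (the matrices and the `alora_project` helper are toy constructions of ours): the low-rank update is added to the projection only for positions at or after the activation index, so every earlier position's output matches the base model exactly.

```python
# Toy aLoRA projection: apply the adapter delta dW only from the
# activation index onward; earlier positions use base weights only.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def alora_project(tokens, W, dW, activation_idx):
    """Project each token vector; dW (the LoRA update B@A) is applied
    only at positions >= activation_idx."""
    out = []
    for pos, v in enumerate(tokens):
        y = matvec(W, v)
        if pos >= activation_idx:  # adapter active from here on
            y = [a + b for a, b in zip(y, matvec(dW, v))]
        out.append(y)
    return out

W = [[1.0, 0.0], [0.0, 1.0]]   # base weights (identity, for clarity)
dW = [[0.5, 0.0], [0.0, 0.5]]  # adapter delta (toy values)
tokens = [[1.0, 2.0]] * 6      # six identical token vectors

out = alora_project(tokens, W, dW, activation_idx=4)
base = [matvec(W, v) for v in tokens]
# Positions 0..3 equal the base model exactly -> their KV is shareable
```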

This identity property is what unlocks cache sharing. When vLLM hashes the KV blocks for pre-activation tokens, they’re adapter-independent. Every adapter hits the same cache entry for the shared conversation prefix.

Flipping the prompt

Knowing that aLoRA makes prefix tokens adapter-independent solves the KV-cache-per-adapter problem. But we still needed to solve the prompt structure problem: different task descriptions mixed into the shared prefix.

Our solution was to separate the conversation from the task instruction into distinct user turns. The system prompt is generic and identical across all adapters. The first user turn contains only the conversation. The second user turn contains the task-specific instruction, and this is where the aLoRA adapter activates:

System: Analyze the conversation against the given rule.

User:   [conversation]               ← base model only (shared prefix)
       ...2000 tokens...

User:   [task-specific rule]          ← adapter activates HERE


The conversation, which is the long, shared portion, comes first and is processed entirely by the base model. The task-specific instruction, which is short and where each adapter diverges, comes at the end, after the aLoRA activation point.
This means:

  • The conversation prefix (potentially thousands of tokens) is computed once by the base model and cached
  • Each adapter only needs to process the short task instruction suffix (typically dozens of tokens)
  • The expensive work is done once; the per-adapter cost is marginal
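The inverted layout is easy to express in code. A minimal sketch, with an illustrative `build_prompt` helper of our own (the message roles mirror the diagram above):

```python
# Build one prompt per guardrail: the generic system message and the
# conversation form a prefix shared by every adapter; only the final
# user turn (the rule) differs, and that is where aLoRA activates.

def build_prompt(conversation: str, rule: str) -> list[dict]:
    return [
        {"role": "system",
         "content": "Analyze the conversation against the given rule."},
        {"role": "user", "content": conversation},  # shared prefix
        {"role": "user", "content": rule},          # adapter-specific suffix
    ]

conversation = "User: I have a headache...\nAgent: ..."
rules = ["Does the response contain medical advice?",
         "Does the response leak PII?"]

prompts = [build_prompt(conversation, r) for r in rules]
# The first two messages are identical across all guardrails, so their
# KV blocks can be computed once and shared.
shared = all(p[:2] == prompts[0][:2] for p in prompts)
```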

We implemented this in vLLM by patching the prefix caching hash function so that pre-activation blocks are shared across all adapters, while post-activation blocks remain adapter-specific as usual.
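Conceptually, the patch amounts to dropping the adapter identity from the cache key for pre-activation blocks. The sketch below is our simplification, not vLLM's actual code; names like `activation_block` are ours:

```python
# Include the adapter identity in a block's cache key only for blocks
# at or after the aLoRA activation point.
import hashlib

def patched_block_key(block_tokens, block_idx, adapter_id, activation_block):
    if block_idx < activation_block:
        ident = "base"      # pre-activation: KV is adapter-independent
    else:
        ident = adapter_id  # post-activation: KV is adapter-specific
    return hashlib.sha256(f"{ident}:{block_tokens}".encode()).hexdigest()

tokens = tuple(range(16))
# Pre-activation blocks collide across adapters -> one shared cache entry
k1 = patched_block_key(tokens, 3, "adapter-1", activation_block=100)
k2 = patched_block_key(tokens, 3, "adapter-2", activation_block=100)
# Post-activation blocks remain adapter-specific, as before
k3 = patched_block_key(tokens, 120, "adapter-1", activation_block=100)
k4 = patched_block_key(tokens, 120, "adapter-2", activation_block=100)
```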

Benchmark results


To validate the approach under realistic conditions, we benchmarked aLoRA against standard LoRA across two dimensions: conversation length (100–5,000 tokens) and number of concurrent adapter evaluations (1–100). We used Qwen3-4B as the base model with 100 adapter copies on a single GPU.

Since each guardrail only needs to produce a single classification token, the response time is effectively just the prefill. There’s no meaningful decode phase. All numbers below report this end-to-end request time (equivalent to TTFT with max_tokens=1).
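Because the workload is prefill-dominated, a simple token-count model predicts the scaling behavior. This is our simplification, not measured data: standard LoRA prefills the full conversation once per adapter, while aLoRA prefills it once and then only each adapter's short rule suffix.

```python
# Toy cost model: total prefill tokens for m adapters on one conversation.
def prefill_tokens(conv_len, n_adapters, suffix_len, shared_prefix):
    if shared_prefix:  # aLoRA: conversation cached after the first request
        return conv_len + n_adapters * suffix_len
    return n_adapters * (conv_len + suffix_len)  # standard LoRA

# 2,000-token conversation, 100 adapters, ~30-token rule suffix (assumed)
lora = prefill_tokens(2000, 100, 30, shared_prefix=False)
alora = prefill_tokens(2000, 100, 30, shared_prefix=True)
```

Under these assumed numbers the model computes 203,000 prefill tokens for standard LoRA versus 5,000 for aLoRA, which is why the aLoRA curves below stay nearly flat.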

Scaling with conversation length

At 5 concurrent evaluations, the gap is already dramatic. With 5,000-token conversations, standard LoRA averages 506ms per request while aLoRA takes just 51ms, a 10x speedup. The aLoRA line stays nearly flat across conversation lengths because the prefix is cached after the first request; subsequent requests only compute the short adapter-specific suffix.

Figure 5: Mean response time as conversation length increases (m=5 concurrent evaluations). Standard LoRA grows linearly with input length. aLoRA stays nearly flat: the prefix is cached, and each adapter only processes the short suffix.

Scaling with concurrency

Fixing conversation length at 1,000 tokens and scaling concurrency reveals the core advantage. Standard LoRA degrades from 52ms at m=1 to 2,072ms at m=100 as the GPU is saturated computing 100 independent prefills. aLoRA scales from 56ms to just 144ms at m=100, a 14x speedup.

Figure 6: Mean response time as concurrent evaluations increase (1,000-token conversations). Standard LoRA degrades rapidly as the GPU computes independent prefills for each adapter. aLoRA stays nearly flat: all adapters share the same cached prefix.
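The headline multipliers follow directly from the reported latencies:

```python
# Sanity-check the speedups quoted in the two experiments above.
length_speedup = 506 / 51       # 5,000-token conversations, m=5  -> ~10x
concurrency_speedup = 2072 / 144  # 1,000-token conversations, m=100 -> ~14x
```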

At the extreme (5,000 tokens, 100 concurrent evaluations), the gap is at its widest. At m=1, both approaches are equivalent (no prefix to share); from there, the speedup scales multiplicatively with both conversation length and adapter count, exactly the regime production guardrail systems operate in.

Quality tradeoff

The latency gains come at a small accuracy cost. Because adapter weights don’t influence the conversation prefix during inference, the model has slightly less context awareness over the shared portion.

We evaluated both training approaches across two tasks using identical training data and hyperparameters, with GPT-4.1-mini (via LLM-as-a-judge prompting) as a reference baseline. It’s important to note that these evaluation sets are deliberately constructed with challenging boundary cases, the kind of ambiguous, edge-case conversations where even strong models struggle; on typical production traffic, the gap would likely be narrower.

Both LoRA variants significantly outperform the GPT-4.1-mini baseline on these adversarial test sets, despite being small fine-tuned models. The ~2 percentage point gap between standard LoRA and aLoRA is modest in context, and a small price for an order-of-magnitude latency improvement.

What this means for production AI safety


The practical implication is straightforward: comprehensive guardrail coverage is no longer gated by latency or GPU cost.

With aLoRA-based serving, a single GPU can evaluate 100 specialized adapters against a 5,000-token conversation in under 500ms wall time. This changes the design space:

  • Granular policy enforcement: Instead of one monolithic safety classifier, deploy a dedicated adapter per policy rule. Each one can be trained, evaluated, versioned, and updated independently.
  • Real-time inline evaluation: Guardrails can run synchronously before the response reaches the user, not as an afterthought in an async pipeline.
  • Cost-efficient scaling: The compute cost per additional guardrail is marginal once the conversation prefix is cached. Adding the 101st adapter barely moves the needle.

Plurai’s approach


This work reflects how we operate at Plurai: applied research that combines cutting-edge academic ideas with real-world production impact. We develop and adopt novel techniques and ship them as infrastructure our customers rely on.

aLoRA-based serving is the deployment layer that makes large-scale guardrail evaluation viable in production. It complements our BARRED framework, which enables efficiently training high-quality custom guardrails tailored to any policy or domain.

If you’re building LLM-powered applications and looking for an efficient, scalable, and high-quality way to apply guardrails in production, we’d like to talk. This is the exact problem we solve.

