We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework that eliminates the labeled data bottleneck entirely. From a policy description and a small set of unlabeled examples, BARRED generates synthetic training data of sufficient quality that fine-tuned small models consistently outperform frontier LLMs and dedicated guardrail systems.
BARRED is available as part of the Plurai platform; full details are in the paper.
The safety requirements of real-world LLM deployments rarely align with the predefined harm categories that existing guardrail models support. A financial services chatbot needs to detect unauthorized investment advice, not generic toxicity. A customer support agent needs to enforce escalation rules specific to their company's policy, not broad content moderation.
Existing approaches present a fundamental trade-off:

Static guardrail models achieve strong accuracy on predefined harm categories but cannot generalize to novel policies without retraining.

Dynamic guardrails offer flexibility by conditioning on arbitrary policies at runtime, but require larger models and sacrifice accuracy.

Prompting frontier LLMs is flexible but expensive, slow, and inconsistent, particularly on the boundary cases that matter most in production.
BARRED takes a different path: given a policy description and ~10–30 unlabeled seed examples, it generates a synthetic training dataset of 1,000 verified samples. Fine-tuning any small language model on this data yields a compact, task-specific classifier that combines the accuracy of custom fine-tuning with the low annotation cost of prompting.

The framework addresses two fundamental challenges in purely synthetic guardrail training: diversity (covering the full variance of the task domain) and faithfulness (ensuring generated labels are correct). Synthetic datasets frequently suffer from mode collapse, and LLM-generated labels contain significant noise. BARRED tackles both through a four-stage pipeline.
Given the task description and seed examples, BARRED identifies task-relevant dimensions that collectively span the domain — for instance, violation type, communication style, severity level, and user intent.
For each dimension, we apply Verbalized Sampling to elicit a diverse set of possible instantiations. This technique prompts the model to generate distributions rather than single outputs, enabling systematic exploration beyond typical modes. Sampling from these instantiation sets avoids the mode collapse that plagues naive LLM generation.
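To make this concrete, here is a minimal sketch of the instantiation step, assuming a generic `llm(prompt) -> str` client; the prompt wording and JSON schema are illustrative, not the exact prompts from the paper:

```python
import json
from typing import Callable

# Generic LLM client: any function from prompt string to completion string.
# Any provider works for the sketch; our experiments used GPT-5-mini.
LLM = Callable[[str], str]

def sample_instantiations(llm: LLM, task: str, dimension: str, k: int = 10) -> list[dict]:
    """Verbalized Sampling: elicit a *distribution* of instantiations
    (value plus probability) rather than a single answer, pushing the
    model beyond its typical high-probability modes."""
    prompt = (
        f"Task: {task}\n"
        f"Dimension: {dimension}\n"
        f"Propose {k} distinct instantiations of this dimension, covering rare "
        "and atypical cases as well as common ones. Respond with JSON: "
        '[{"value": "...", "probability": 0.0}, ...], probabilities summing to 1.'
    )
    return json.loads(llm(prompt))
```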

For each sample, BARRED uniformly draws a dimension, an instantiation, and a target label, then prompts the generator to produce a boundary-challenging case: an example near the decision boundary where classification is most difficult. Each sample includes a reasoning trace justifying the label assignment.
The focus on boundary cases is deliberate: these are the examples where generic models fail and where fine-tuned classifiers gain their advantage. Trivially easy examples add little training signal.
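A sketch of the sampling-and-generation step, reusing the hypothetical `llm` client from above; the label names and prompt wording are again illustrative:

```python
import random

def generate_sample(llm, task, dimensions, labels=("compliant", "violation")):
    """Uniformly draw a (dimension, instantiation, label) triple, then prompt
    the generator for a boundary-challenging example plus a reasoning trace."""
    dimension = random.choice(list(dimensions))           # e.g. "violation type"
    instantiation = random.choice(dimensions[dimension])  # from Verbalized Sampling
    label = random.choice(labels)
    prompt = (
        f"Task: {task}\n"
        f"Write one example with label '{label}' where {dimension} is "
        f"'{instantiation}'. Make it a hard case near the decision boundary, "
        "and append a short reasoning trace justifying the label."
    )
    return {"dimension": dimension, "instantiation": instantiation,
            "label": label, "text": llm(prompt)}
```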
Raw LLM-generated labels are noisy. Our ablation shows a 27% accuracy drop when validation is skipped entirely.
BARRED validates each sample through an asymmetric multi-agent debate: an Advocate defends the generated label using the sample's reasoning trace, while multiple independent Judges assess the sample and deliberate over one or more rounds before delivering verdicts.
This asymmetric design stress-tests sample quality: if the Advocate cannot convince independent Judges given the reasoning, the sample likely contains inconsistencies.
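A simplified sketch of the debate, continuing with the same `llm` helper; the blind first vote, round count, and majority-vote decision rule are our simplifications of the protocol, not a verbatim implementation:

```python
def debate_validate(llm, sample, n_judges=3, rounds=2):
    """Asymmetric debate: Judges vote independently first, then review the
    Advocate's defense and deliberate before casting final verdicts."""
    advocate = llm(
        f"Defend label '{sample['label']}' for this example, using its "
        f"reasoning trace:\n{sample['text']}"
    )
    # Round 0: blind, independent verdicts.
    verdicts = [
        llm(f"Is label '{sample['label']}' correct for:\n{sample['text']}\n"
            "Answer 'yes' or 'no', then explain.")
        for _ in range(n_judges)
    ]
    # Deliberation: each Judge sees the Advocate's argument and peer verdicts.
    for _ in range(rounds):
        verdicts = [
            llm(f"Advocate's argument:\n{advocate}\n\nPeer verdicts:\n{verdicts}\n\n"
                f"Reconsider whether label '{sample['label']}' is correct for:\n"
                f"{sample['text']}\nAnswer 'yes' or 'no', then explain.")
            for _ in verdicts
        ]
    accepted = sum(v.strip().lower().startswith("yes") for v in verdicts) > n_judges // 2
    return accepted, verdicts
```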
Analysis of the debate dynamics reveals three distinct interaction patterns beyond simple agreement: (1) disagreement, where judges maintain opposing views across rounds; (2) persuasion, where initial conflict is resolved through deliberation; and (3) consensus breaking, where initial agreement is challenged after reviewing the Advocate’s reasoning.

Rejected samples are not discarded. Each dissenting Judge provides structured feedback explaining their objections. This feedback is aggregated and passed back to the generator, which produces a refined sample targeting the same dimension, instantiation, and label. The refined sample re-enters validation. This continues until the sample passes or a maximum iteration count is reached.
This closed-loop approach is notably effective at salvaging borderline cases; the generator receives specific, actionable feedback rather than starting from scratch.
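Putting the sketches above together, the refinement loop might look like this; the `max_iters` cutoff and feedback format are illustrative:

```python
def generate_verified(llm, task, dimensions, max_iters=3):
    """Closed loop: generate, validate via debate, and on rejection feed the
    dissenting Judges' objections back to the generator for a targeted retry."""
    sample = generate_sample(llm, task, dimensions)
    for _ in range(max_iters):
        accepted, verdicts = debate_validate(llm, sample)
        if accepted:
            return sample
        # Aggregate structured feedback from dissenting Judges only.
        objections = "\n".join(v for v in verdicts
                               if v.strip().lower().startswith("no"))
        sample["text"] = llm(
            f"Task: {task}\n"
            f"Your previous example was rejected:\n{sample['text']}\n"
            f"Judge objections:\n{objections}\n"
            f"Write a refined example for the same setting, with "
            f"{sample['dimension']} set to '{sample['instantiation']}' and label "
            f"'{sample['label']}', that addresses these objections."
        )
    return None  # still rejected after max_iters; caller decides how to handle
```

In this sketch, building the full training set is then a matter of calling `generate_verified` until 1,000 samples pass validation.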
We evaluate BARRED on four guardrail tasks spanning three domains.
As part of this work, we curate and release a benchmark dataset with human-verified annotations across all four tasks, available on HuggingFace.
For each task, we generate 1,000 synthetic training samples using GPT-5-mini with medium reasoning effort for all generative components. We fine-tune GPT-4.1-nano (via Azure) and Qwen2.5 models (1.5B–14B, via LoRA). Baselines include LLM-as-a-Judge (GPT-4.1 family, GPT-5-mini, Qwen2.5-14B) and generic guardrail models (OSS-Safeguard-20B, Glider).
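For the open-weight models, the fine-tuning setup is standard LoRA; here is a minimal sketch with Hugging Face `transformers` and `peft`, where the adapter hyperparameters shown are common defaults rather than our reported configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # 3B/7B/14B variants work the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Low-rank adapters on the attention projections; rank and alpha here are
# illustrative defaults, not necessarily the values used in our experiments.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# From here, any standard supervised fine-tuning loop over the 1,000
# synthetic (input, label, reasoning) samples applies.
```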

Fine-tuned models consistently outperform all baselines across tasks. A fine-tuned GPT-4.1-nano, the smallest model in the GPT-4.1 family, achieves 96% accuracy on the Repetition task’s human test set, compared to 90% for GPT-4.1, 94% for GPT-5-mini, and 89% for OSS-Safeguard-20B. Our fine-tuned Qwen2.5-14B surpasses all frontier LLMs despite having significantly fewer parameters.
Even our fine-tuned 3B model outperforms Glider (3.8B), OSS-Safeguard (20B), and most frontier LLMs across all benchmarks, highlighting the limitations of both general-purpose guardrails and large-scale prompting relative to task-specific synthetic training.
BARRED provides a path from a policy description to a production-ready guardrail classifier without requiring labeled data. The combination of dimension decomposition for diversity and asymmetric debate for label faithfulness produces synthetic training data of sufficient quality to outperform models with orders of magnitude more parameters.
The framework generalizes beyond safety applications to any classification task where labeled data is scarce but task specifications are available. While data generation requires multiple LLM calls, this one-time cost is amortized over the lifetime of the resulting compact model, which offers much lower inference latency and cost. Future directions include extending to multi-label and hierarchical classification, exploring transfer of synthetic data across related tasks, and integrating human feedback for iterative improvement.
To train a custom guardrail for your policy, try BARRED through the Plurai platform. Evaluation datasets are available on HuggingFace.
