Evals & guardrails

Improve agent quality and prevent real-time glitches at up to 15× lower cost

Powered by auto-trained SLMs with unmatched accuracy and disruptive costs.
Failure rate reduction
vs GPT 5.2
>43%
Read our paper
Cost reduction
vs GPT 5.2
>8x
See for yourself
Inference
latency
<100ms
Test our public
Choose product

Say goodbye to costly, inconsistent LLM-as-a-judge evaluations

Build high-accuracy eval SLMs in minutes — from data samples or a simple prompt
Get a dedicated eval endpoint and synthetic training set, calibrated to your use case
Scale production agent evaluation at up to 15× lower cost

Real-time protection for unbreakable agents with our ultra fast guardrails

Control and manage high-impact agentic failures effortlessly, with zero compromise on policy compliance, data security, or brand integrity.
With <100ms latency, our SLM guardrails enable real-time intervention without impacting agent response time
Our high-accuracy guardrails are automatically trained from a simple prompt and ready to use within minutes, after a short optimization process
Our disruptive inference cost makes scaling a no-brainer, delivering high error coverage and maximal quality

How it works

01

Describe your eval/guardrail task in free language

* You can also add data samples from your agent
02

Review the auto-generated test set, and iterate via AI chat if needed

03

Your eval or guardrail endpoint is ready to use!

* Full SLM background optimization completes in minutes
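Once the endpoint is live, calling it from an agent pipeline could look like the sketch below. The URL and every field name are illustrative assumptions for this page, not Plurai's documented API.

```python
import json

# Hypothetical guardrail check for an agent response. The endpoint URL and
# all field names below are illustrative assumptions, not a documented API.
GUARDRAIL_URL = "https://api.example.com/v1/guardrails/check"  # placeholder

def build_check_request(agent_response: str, policy: str) -> str:
    """Serialize a guardrail request; in production this body would be
    POSTed to the endpoint produced by the setup flow above."""
    return json.dumps({
        "agent_response": agent_response,
        "policy": policy,
    })

body = build_check_request(
    "Your refund of $40 has been approved.",
    "Never promise refunds above $25 without human approval.",
)
print(body)
```

The <100ms inference latency claimed above is what makes it plausible to run a check like this inline, before the agent's response reaches the user.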

Beyond evals & guardrails: powering all semantic tasks

Explore our use case catalog to unlock the full potential of our SLMs.
Select use case
Agent response
Conversation evaluation
Grounding
Reference evaluation
User response classification
Sentiment analysis
Semantic similarity
Policy compliance classification
Toxicity
User intent detection
Tool invocation validation
Agent action validation

Agent response

Evaluate whether the agent response complies with policy, avoids sensitive data exposure, follows brand tone, and meets required attributes.
Input
<Agent response>
Output
Classification of the agent response (for example: helpful, not helpful / PII detected, PII not detected) and reasoning

Conversation evaluation

Evaluate multi-turn conversations for compliance, consistency, sentiment, and successful task resolution.
Input
<conversation between agent and user>
Output
Conversation evaluation (for example: resolved, unresolved / compliant, violation / consistent, inconsistent / positive sentiment, frustrated sentiment) and reasoning.

Grounding

Evaluate whether the agent response is fully supported by the provided context and does not introduce unsupported claims or contradictions. A response is grounded if all factual statements are supported by the context.
Input
Context:
<retrieved context or source material>

Agent response:
<agent response>
Output
Grounding classification (for example: grounded / partially grounded / not grounded)
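As a rough illustration of the grounding contract above, here is a toy lexical check. It is a crude stand-in for the semantic judgment an SLM makes, shown only to make the input/output shape concrete; the threshold values are arbitrary.

```python
def grounding_label(context: str, response: str) -> str:
    """Toy heuristic: score the fraction of response words that appear in
    the context. A real grounding evaluator reasons semantically; this
    lexical version exists only to illustrate the three-way output."""
    ctx = set(context.lower().split())
    words = response.lower().split()
    if not words:
        return "not grounded"
    overlap = sum(1 for w in words if w in ctx) / len(words)
    if overlap >= 0.8:
        return "grounded"
    if overlap >= 0.4:
        return "partially grounded"
    return "not grounded"

print(grounding_label(
    "your order shipped on may 2 via ground freight",
    "your order shipped on may 2",
))  # grounded
```

A lexical overlap like this cannot catch paraphrased but unsupported claims, which is exactly the gap a trained semantic evaluator closes.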

Reference evaluation

Evaluate whether the agent response matches the expected reference answer in correctness and completeness. A response is correct if it aligns with the reference meaning and key facts. Partially correct if it captures some but not all required information or introduces minor inaccuracies. Incorrect if it contradicts or significantly deviates from the reference.
Input
Reference answer:
<golden answer>

Agent response:
<agent response>
Output
Reference evaluation classification (for example: correct / partially correct / incorrect)

User response classification

Classify the user message according to its primary intent to enable correct routing or handling. Select the label that best represents the user’s goal or request.
Input
User message:
<user response>
Output
User intent classification
(for example: billing / technical_support / sales / account_management / general_query / other)

Sentiment analysis

Determine the emotional tone expressed in the user message. Identify whether the sentiment is positive, neutral, or negative, and capture stronger signals such as frustration or satisfaction when present.
Input
User message:
<user response or multi-turn conversation>
Output
Sentiment classification
(for example: positive / neutral / negative / frustrated / satisfied)

Semantic similarity

Evaluate the semantic similarity between two text samples based on meaning rather than exact wording.
Determine whether the two inputs express the same intent, a related idea, or different meanings.
Input
Sample A:
<text A>

Sample B:
<text B>
Output
Semantic similarity classification
(for example: same_meaning / related / different)
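To make the same_meaning / related / different contract concrete, here is a bag-of-words cosine baseline. It compares surface wording, not meaning, so it is only a naive stand-in for the semantic comparison described above; the thresholds are arbitrary.

```python
from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts -- a lexical approximation,
    not the semantic similarity an SLM evaluator computes."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def classify(a: str, b: str) -> str:
    s = cosine_similarity(a, b)
    if s > 0.6:
        return "same_meaning"
    if s > 0.2:
        return "related"
    return "different"

print(classify("cancel my subscription", "please cancel my subscription"))
# same_meaning
```

The failure mode of this baseline ("terminate my plan" scores zero overlap with "cancel my subscription") is the motivating case for a meaning-based evaluator.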

Policy compliance classification

Evaluate whether the agent response complies with safety, legal, and company policies. A response is a violation if it enables harmful, unsafe, illegal, or policy-restricted behavior, exposes sensitive data, or fails to appropriately refuse disallowed requests.
Input
Agent response:
<agent response>
Output
Policy compliance classification
(for example: compliant / violation)

Toxicity

Evaluate whether the message contains toxic, abusive, or harmful language. Toxic content includes insults, harassment, hate speech, threats, or condescending and demeaning tone. Non-toxic content may express frustration or criticism without abusive or harmful language.
Input
Message:
<agent or user text>
Output
Toxicity classification (for example: toxic / safe)

User intent detection

Identify the primary intent expressed in the user message. Determine what the user is trying to accomplish, such as asking a question, requesting an action, reporting an issue, or providing feedback.
Input
User message:
<user message>
Output
User intent classification
(for example: question / request_action / complaint / feedback / information_request / other)

Tool invocation validation

Evaluate whether the tool selected or invoked by the agent is appropriate for the user’s request and context. A correct tool invocation matches the user’s intent and is necessary to fulfill the task. An incorrect invocation uses the wrong tool, or a tool that does not address the request. An unnecessary invocation uses a tool when no tool was required.
Input
User request:
<user message>

Invoked tool:
<tool name or action>
Output
Tool invocation classification
(for example: correct_tool / wrong_tool / unnecessary_tool)
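The three-way verdict above can be sketched with a toy rule-based validator. The tool names and keyword table are invented for illustration; a production guardrail would judge intent semantically rather than by keyword match.

```python
# Hypothetical tool registry: which user phrasings each tool serves.
TOOL_KEYWORDS = {
    "refund_tool": ["refund", "money back"],
    "order_status_tool": ["where is my order", "tracking"],
}

def validate_invocation(user_request: str, invoked_tool: str) -> str:
    """Toy keyword validator mirroring the correct_tool / wrong_tool /
    unnecessary_tool taxonomy described above."""
    text = user_request.lower()
    needed = [t for t, kws in TOOL_KEYWORDS.items()
              if any(k in text for k in kws)]
    if not needed:
        # No tool was required for this request.
        return "unnecessary_tool" if invoked_tool else "correct_tool"
    return "correct_tool" if invoked_tool in needed else "wrong_tool"

print(validate_invocation("I want a refund", "refund_tool"))        # correct_tool
print(validate_invocation("I want a refund", "order_status_tool"))  # wrong_tool
```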

Agent action validation

Evaluate whether the agent’s action correctly follows the user’s request and system constraints. A correct action fulfills the user’s intent and is appropriate for the context. An incorrect action does not match the request or performs the wrong operation. An unnecessary action performs a step or operation that was not required to fulfill the user’s request.
Input
User request:
<user message>

Agent action:
<action taken by agent>
Output
Agent action classification
(for example: correct_action / incorrect_action / unnecessary_action)

FAQ

You can use our models across a wide range of semantic tasks, including conversation evaluation, semantic similarity, grounding validation, policy compliance, and more. Explore our use case catalog to see what's possible.
Plurai uses a proprietary intent calibration process to deeply understand your task and generate a high-quality testing set and consistent evaluator. This enables production-grade evals and guardrails powered by optimized small language models (SLMs), which are far more cost-efficient and scalable than traditional LLM-as-judge approaches that are expensive and difficult to run at full production coverage.
Yes, Plurai can be deployed in your VPC for maximum security, data control, and even lower latency. Contact us to discuss your infrastructure and deployment requirements.
Plurai’s SLMs are purpose-built for your specific tasks through our intent calibration and synthetic data generation process. We don’t require prior labeled data. If you don’t have historical datasets, we generate high-fidelity synthetic data tailored to your use case.

By training and optimizing evaluators on highly targeted datasets instead of relying on general-purpose LLMs, we achieve high accuracy with far lower latency and cost. The result is production-grade coverage you can run continuously without the expense of traditional LLM-as-judge approaches.
In addition to purpose-built SLMs, we also offer optimized LLM-based evaluators for maximum accuracy at competitive cost. These are ideal for sampled data and offline evaluation workflows.

For large-scale testing or real-time guardrails, SLMs are typically the better choice due to their lower latency and cost efficiency.