Vorel’s per-tenant guardrails are the safety differentiator. Most AI receptionist platforms ship a single prompt for every customer. Vorel lets your operator tune the safety policy per tenant, strict for clinics and regulated verticals, looser for low-stakes tenants, without a code deploy. Two configurable guardrails ship today: the hallucination guardrail, with threshold and action knobs, and the forbidden-phrase guardrail, with an action knob.

What ships today

Hallucination grader

A deterministic grader checks every agent reply for unsupported factual claims (price, hours, availability, contact details) against the conversation’s tool results, your offerings, and your working hours. No LLM on the live path; an async LLM pass backstops what the deterministic rules miss. Configurable threshold and action.

Forbidden-phrase guardrail

Substring-match every agent reply against the merged forbidden-phrase list (vertical pack defaults + tenant-specific additions). Configurable action when a hit fires.

Say-guard

Strips reasoning leaks, tool-call narration, and deliberation preamble out of the spoken/sent reply before the customer ever sees it. The customer hears the answer, never the agent’s internal thinking.

Inline circuit-breaker

On high-stakes turns (booking confirmation, clinic red-flags) a sentence-level verifier can short-circuit a reply that claims an action without the tool actually having succeeded, for example claiming an appointment was booked when no booking tool call committed.

Hallucination guardrail

Every reply gets graded by the hallucination grader, which flags issues by severity (medium / high) and kind: unsupported_price, unsupported_hours, unsupported_availability, unsupported_contact, and llm_flagged (the async LLM backstop’s catch). The flags land in messages.hallucination_flags for analytics.

The grader is deterministic: it extracts factual claims from the reply with precision-tuned pattern matching, then verifies each against grounding evidence (recent tool results, your active offerings, your configured working hours, and prior caller turns). A price within a small tolerance of a grounded value is supported; a price the caller themselves introduced is treated as an acknowledgment, not a fabrication.

This runs synchronously on the live reply with no LLM call, so it adds no provider latency or dependency. An async LLM pass runs afterward to catch what the deterministic rules miss, appending llm_flagged entries. The guardrail’s job is to decide what to do when flags fire.

Threshold (which severities trip the guardrail)

Threshold         Trips on
low               Any flagged reply (medium or high)
medium            Medium or high flags
high (default)    Only high flags
never             Never trips. The grader still records flags for the dashboard, but no runtime action.
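
The threshold table can be read as a minimum-severity check. The following is a minimal sketch under that reading, not Vorel's implementation: the function name `guardrail_trips` and both lookup tables are hypothetical, and falling back to 'high' for an unrecognised threshold is an assumption based on the documented default.

```python
from typing import Iterable

# Severities the grader emits, per the text above (there is no "low" severity).
SEVERITY_RANK = {"medium": 1, "high": 2}

# Minimum severity rank that trips each threshold; None means "never trip".
# Hypothetical encoding of the table above.
THRESHOLD_MIN_RANK = {"low": 1, "medium": 1, "high": 2, "never": None}


def guardrail_trips(threshold: str, flag_severities: Iterable[str]) -> bool:
    """Return True if any flag is severe enough to trip the given threshold."""
    # Assumption: an unrecognised threshold falls back to the default, 'high'.
    min_rank = THRESHOLD_MIN_RANK.get(threshold, THRESHOLD_MIN_RANK["high"])
    if min_rank is None:
        return False
    # Unknown severities rank 0 and therefore never trip anything.
    return any(SEVERITY_RANK.get(s, 0) >= min_rank for s in flag_severities)
```

Because the grader emits only medium and high severities, 'low' and 'medium' behave identically in this sketch, which mirrors the first two table rows.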

Action (what to do when it trips)

Action            Behaviour
warn (default)    Log only. An internal alert fires for high-severity flags either way.
handoff           Drop the bot reply. Route the conversation to a human via the existing handoff queue.
Default for new tenants: threshold='high' + action='warn'. This matches today’s pre-Phase-O behaviour: no behaviour change for tenants who haven’t tuned the policy yet.
Suggested policies by tenant type:
  • Clinics, regulated verticals, financial services: threshold='medium' + action='handoff'. A medium-severity hallucination flag should never reach the customer. Cost: more handoffs; benefit: no AI-generated misinformation in a regulated context.
  • High-volume retail, hospitality: threshold='high' + action='warn'. The default; high flags get escalated via an internal alert, but medium flags ride through. Cost: an occasional imperfect reply reaches the customer; benefit: no extra handoff load.
  • Pilot / staging tenants: threshold='low' + action='warn'. Maximally noisy: surfaces every flag in the analytics dashboard so you can calibrate the threshold based on real data before promoting the tenant to production policy.

Forbidden-phrase guardrail

If the agent’s reply contains any phrase from the merged forbidden-phrase list, this guardrail’s action kicks in. The phrase list comes from two sources, concatenated + de-duplicated:
  1. Vertical pack defaults: every pack ships its own list (e.g. clinic ships “diagnose”, “you have”, “definitely”, “it's nothing serious”).
  2. Tenant-specific additions: your operator adds phrases via the dashboard pack-overrides UI.
Detection is substring match, case-insensitive, on the trimmed phrase. So diagnose matches “I diagnose…”, “let me diagnose…”, “I can’t diagnose…”. This is intentional: at runtime, near-misses are mostly the agent trying to talk around a forbidden term; exact-word matching would let too much through.
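
The merge-and-match behaviour described above can be sketched as follows. This is an illustrative sketch, not Vorel's code: the function names `merge_phrases` and `find_forbidden` are hypothetical, and de-duplicating case-insensitively is an assumption (the text only says the lists are concatenated and de-duplicated).

```python
from typing import Iterable, List


def merge_phrases(pack_defaults: Iterable[str],
                  tenant_additions: Iterable[str]) -> List[str]:
    """Concatenate both lists, trim each phrase, and de-duplicate, keeping order."""
    merged: List[str] = []
    seen = set()
    for phrase in list(pack_defaults) + list(tenant_additions):
        trimmed = phrase.strip()
        key = trimmed.casefold()  # assumption: duplicates are compared case-insensitively
        if trimmed and key not in seen:
            seen.add(key)
            merged.append(trimmed)
    return merged


def find_forbidden(reply: str, phrases: Iterable[str]) -> List[str]:
    """Return every phrase that occurs in the reply as a case-insensitive substring."""
    haystack = reply.casefold()
    hits = []
    for phrase in phrases:
        needle = phrase.strip().casefold()
        if needle and needle in haystack:
            hits.append(phrase.strip())
    return hits


if __name__ == "__main__":
    phrases = merge_phrases(["diagnose", "definitely "], ["Diagnose", "guaranteed cure"])
    print(find_forbidden("I can't DIAGNOSE that over the phone.", phrases))
```

Substring matching means "diagnose" also hits inside longer words such as "diagnosed", which is consistent with the intentionally broad matching described above.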

Action

Action            Behaviour
warn (default)    Log only. The prompt already tells the model not to use these phrases; the guardrail logs the slip without overriding.
block             Replace the reply with a generic fallback string ("Let me get a colleague to help with that. I'll connect you now." / "دعني أحول طلبك لزميل من الفريق ليساعدك."). The dispatch logs the override.
handoff           Drop the bot reply, route to a human.
Default for new tenants: action='warn'. The rationale is the same as for the hallucination guardrail: it matches today’s behaviour.

Why three actions instead of two

block and handoff differ in customer experience: block keeps the bot in the conversation (the customer reads the fallback string and can keep talking); handoff drops the bot and routes to a human. Use block when you want to give the bot a graceful exit; use handoff when a forbidden-phrase hit means a human absolutely must take the conversation from here.
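
To make the difference between the three actions concrete, here is a small sketch of how a dispatcher might apply them once a forbidden phrase has been detected. The `Outcome` type and `apply_forbidden_phrase_action` are hypothetical names for illustration only; the English fallback string is the one quoted in the action table, and treating an unknown action as 'warn' is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# English fallback string quoted in the action table above.
FALLBACK_REPLY = "Let me get a colleague to help with that. I'll connect you now."


@dataclass
class Outcome:
    reply: Optional[str]   # text sent to the customer; None means nothing is sent
    handoff: bool          # True if the conversation is routed to a human
    overridden: bool       # True if the agent's original reply was not sent


def apply_forbidden_phrase_action(action: str, reply: str, hits: List[str]) -> Outcome:
    """Decide what the customer receives after a forbidden-phrase check."""
    if not hits:
        return Outcome(reply=reply, handoff=False, overridden=False)
    if action == "block":
        # The bot stays in the conversation but says the fallback instead.
        return Outcome(reply=FALLBACK_REPLY, handoff=False, overridden=True)
    if action == "handoff":
        # The bot reply is dropped and a human takes over.
        return Outcome(reply=None, handoff=True, overridden=True)
    # 'warn' (and, by assumption, any unknown action): log only, reply goes out unchanged.
    return Outcome(reply=reply, handoff=False, overridden=False)
```

In this sketch only 'handoff' takes the bot out of the conversation; 'block' changes what the customer reads but leaves the bot in place.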

Where the guardrail policy lives

Per-tenant guardrails live in tenants.guardrails (JSONB column). The schema:
{
  "hallucination": {
    "threshold": "high",
    "action": "warn"
  },
  "forbidden_phrase": {
    "action": "warn"
  }
}
The parser is tolerant: bad / missing fields fall through to defaults. A stale operator save or a malformed value never breaks dispatch; the agent runs on defaults until the policy is fixed.
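
A tolerant parser of this kind could look roughly like the sketch below. It is an assumption-laden illustration rather than the actual parser: `parse_guardrails`, `DEFAULTS`, and `ALLOWED` are hypothetical names, and the allowed values are taken from the threshold and action tables on this page.

```python
import json
from typing import Any, Dict

# Defaults as documented: threshold='high', action='warn' for both guardrails.
DEFAULTS: Dict[str, Dict[str, str]] = {
    "hallucination": {"threshold": "high", "action": "warn"},
    "forbidden_phrase": {"action": "warn"},
}

# Allowed values per field, taken from the tables on this page.
ALLOWED = {
    ("hallucination", "threshold"): {"low", "medium", "high", "never"},
    ("hallucination", "action"): {"warn", "handoff"},
    ("forbidden_phrase", "action"): {"warn", "block", "handoff"},
}


def parse_guardrails(raw: Any) -> Dict[str, Dict[str, str]]:
    """Return a complete policy; any bad or missing field keeps its default."""
    if isinstance(raw, (str, bytes)):
        try:
            raw = json.loads(raw)
        except ValueError:  # malformed JSON (or undecodable bytes): use defaults
            raw = None
    policy = {section: dict(fields) for section, fields in DEFAULTS.items()}
    if not isinstance(raw, dict):
        return policy
    for (section, field), allowed in ALLOWED.items():
        block = raw.get(section)
        value = block.get(field) if isinstance(block, dict) else None
        if isinstance(value, str) and value in allowed:
            policy[section][field] = value
    return policy
```

Each field is validated on its own, so one malformed value (for example an unknown action) does not discard a valid threshold saved alongside it.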

Operator UI

Configure per-tenant from app.vorel.ai/admin/tenants/[id]/guardrails. The form writes the JSONB column directly; changes take effect on the next agent turn (no code deploy, no service restart). The audit log records every change with the actor, the previous value, and the new value, so “who turned off the hallucination guardrail and when?” is always answerable.

Pack-level forbidden phrases (read-only floor)

The vertical pack’s forbidden phrases are a floor, not an override target. You can add to the list per-tenant; you cannot remove pack-shipped phrases via the standard pack-override UI. This protects against a clinic operator accidentally turning off the diagnose block. Removing a pack-level phrase requires a code-side change to the vertical pack JSON (and an explicit comment justifying the removal). Don’t.

Hallucination scoring details

The primary grader is deterministic and runs synchronously on the live reply path, with the conversation’s tool results, your active offerings, and your working hours as grounding evidence. It extracts factual claims with precision-tuned pattern matching and verifies each one. A second, asynchronous LLM pass runs after the reply lands and catches claims the deterministic rules miss, merging its findings back in as llm_flagged entries. Flag kinds:
  • unsupported_price: the agent quoted a price not grounded in a tool result or your offerings (within a small tolerance). A price the caller introduced first is not flagged.
  • unsupported_hours: the agent stated opening/closing hours that fall outside your configured working hours.
  • unsupported_availability: the agent asserted a specific available time that no tool result grounds.
  • unsupported_contact: the agent produced an email, phone number, or reference code with no grounding (reference codes are high severity, since a caller acts on a fake confirmation).
  • llm_flagged: the async LLM backstop caught something the deterministic rules did not.
These flags also feed the Analytics weekly-rollup so you can track hallucination rate over time per tenant.
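
As an illustration of the unsupported_price rule, the sketch below checks one extracted price against grounded values and caller-introduced values. It is not the grader itself: `grade_price` is a hypothetical name, the claim extraction step is omitted, and the 1% relative tolerance is a placeholder, since the actual "small tolerance" is not specified here.

```python
import math
from typing import Iterable, Optional


def grade_price(claimed: float,
                grounded_prices: Iterable[float],
                caller_prices: Iterable[float],
                rel_tol: float = 0.01) -> Optional[str]:
    """Return 'unsupported_price' if the claimed price has no grounding, else None.

    rel_tol is an assumed placeholder for the unspecified 'small tolerance'.
    """
    def close(a: float, b: float) -> bool:
        return math.isclose(a, b, rel_tol=rel_tol)

    # Supported: the price matches a tool result or an active offering.
    if any(close(claimed, g) for g in grounded_prices):
        return None
    # Acknowledgment: the caller introduced this price first, so it is not a fabrication.
    if any(close(claimed, c) for c in caller_prices):
        return None
    return "unsupported_price"
```

For example, a claimed 49.99 against a grounded 50.0 passes under the assumed tolerance, while a claimed 80 with no grounding and no matching caller turn is flagged.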

Confidence calibration

Beyond flagging fabrications, Vorel tracks whether the agent’s self-reported confidence matches its actual truthfulness. Post-turn, the platform extracts the agent’s hedges and deferrals to infer how confident it sounded, then scores calibration (an expected-calibration-error metric) against grader truthfulness on the operator quality surfaces. A well-calibrated agent hedges when it should and asserts when it’s right; this is a distinct quality axis from raw hallucination rate.
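
Expected calibration error (ECE) is a standard metric: group replies into confidence bins, then take the size-weighted average gap between mean confidence and observed accuracy in each bin. The sketch below shows the standard binned form. How Vorel maps hedges and deferrals to a confidence number, and how many bins it uses, are not stated here, so the inputs and `n_bins=10` are assumptions.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               truthful: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Binned ECE: sum over bins of (bin size / N) * |mean confidence - accuracy|.

    confidences: inferred confidence per reply in [0, 1] (assumed to be given).
    truthful:    whether the grader judged each reply truthful.
    """
    if len(confidences) != len(truthful):
        raise ValueError("confidences and truthful must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, truthful):
        # Clamp the index so 1.0 lands in the last bin and stray values stay in range.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece
```

An agent that sounds 90% confident but is truthful in three of four replies scores an ECE of about 0.15 in this sketch; lower is better, and 0 means stated confidence matches observed truthfulness.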

What’s NOT a guardrail today

Things you might expect that aren’t on this surface:
  • Profanity filter. The forbidden-phrase guardrail handles tenant-specific terms; we don’t ship a generic profanity list. Add brand-restricted vocabulary via pack overrides.
  • PII redaction in agent replies. The agent doesn’t have access to other customers’ data via RLS, so there’s nothing to redact at the reply layer. PII redaction happens at the data-export + audit-log layer instead.
  • Topic restriction. “The agent must only talk about real estate, not weather” is enforced via the faq_redirect_message_* strings in vertical packs, not as a separate guardrail.

Related

  • Verticals: pack-level forbidden phrases (clinic is the load-bearing example)
  • Analytics: hallucination flag rates over time
  • Security overview: broader safety posture, RLS, PII handling
  • How it works: where guardrails sit in the dispatch pipeline