What ships today
Hallucination grader
A deterministic grader checks every agent reply for unsupported factual claims (price, hours,
availability, contact details) against the conversation’s tool results, your offerings, and your
working hours. No LLM on the live path; an async LLM pass backstops what the deterministic rules
miss. Configurable threshold and action.
Forbidden-phrase guardrail
Substring-match every agent reply against the merged forbidden-phrase list (vertical pack
defaults + tenant-specific additions). Configurable action when a hit fires.
Say-guard
Strips reasoning leaks, tool-call narration, and deliberation preamble out of the spoken/sent
reply before the customer ever sees it. The customer hears the answer, never the agent’s
internal thinking.
Inline circuit-breaker
On high-stakes turns (booking confirmation, clinic red-flags) a sentence-level verifier can
short-circuit a reply that claims an action without the tool actually having succeeded, for
example claiming an appointment was booked when no booking tool call committed.
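As a rough illustration of the circuit-breaker idea, the sketch below flags sentences that claim a completed booking when no booking tool call committed. The `ToolResult` shape, the `book_appointment` tool name, and the claim pattern are all assumptions made for the example, not the platform's actual verifier.

```python
import re
from dataclasses import dataclass


@dataclass
class ToolResult:
    """Hypothetical record of one tool call in the current turn."""
    tool: str        # assumed tool name, e.g. "book_appointment"
    committed: bool  # True only if the call actually succeeded


# Assumed pattern for "this action is done" phrasing.
BOOKED_CLAIM = re.compile(
    r"\b(is|are|has been|have been)\s+(booked|confirmed|scheduled)\b",
    re.IGNORECASE,
)


def unbacked_booking_claims(reply: str, results: list[ToolResult]) -> list[str]:
    """Return the sentences that claim a booking no committed tool call backs."""
    if any(r.tool == "book_appointment" and r.committed for r in results):
        return []
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", reply.strip())
    return [s for s in sentences if BOOKED_CLAIM.search(s)]
```

A non-empty return value is the signal a caller would use to short-circuit the reply.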
Hallucination guardrail
Every reply gets graded by the hallucination grader, which flags issues by severity (medium /
high) and kind: unsupported_price, unsupported_hours, unsupported_availability,
unsupported_contact, and llm_flagged (the async LLM backstop’s catch). The flags land in
messages.hallucination_flags for analytics.
The grader is deterministic: it extracts factual claims from the reply with precision-tuned
pattern matching, then verifies each against grounding evidence (recent tool results, your active
offerings, your configured working hours, and prior caller turns). A price within a small tolerance
of a grounded value is supported; a price the caller themselves introduced is treated as an
acknowledgment, not a fabrication. This runs synchronously on the live reply with no LLM call, so it
adds no provider latency or dependency. An async LLM pass runs afterward to catch what the
deterministic rules miss, appending llm_flagged entries.
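The price rule described above can be sketched as follows. This is a minimal illustration, not the grader itself: the dollar-only pattern, the function names, and the 1% relative tolerance are assumptions, since the actual tolerance value is not stated here.

```python
import math
import re

# Assumed claim pattern: dollar amounts such as "$50" or "$50.25".
PRICE = re.compile(r"\$\s?(\d+(?:\.\d{1,2})?)")


def extract_prices(text: str) -> list[float]:
    """Pull every dollar amount out of a piece of text."""
    return [float(m) for m in PRICE.findall(text)]


def unsupported_prices(
    reply: str,
    grounded: list[float],
    caller_turns: list[str],
    rel_tol: float = 0.01,  # assumed tolerance, for illustration only
) -> list[float]:
    """Return prices in the reply that no grounding evidence supports."""
    caller_prices = [p for turn in caller_turns for p in extract_prices(turn)]
    flagged = []
    for price in extract_prices(reply):
        # Supported: within tolerance of a tool result or offering price.
        if any(abs(price - g) <= rel_tol * g for g in grounded):
            continue
        # Acknowledgment: the caller introduced this price themselves.
        if any(math.isclose(price, c) for c in caller_prices):
            continue
        flagged.append(price)
    return flagged
```

With `grounded=[50.0]`, a reply quoting $50.25 passes, while a quote of $120 is flagged unless the caller mentioned $120 first.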
The guardrail’s job is to decide what to do when flags fire.
Threshold (which severities trip the guardrail)
| Threshold | Trips on |
|---|---|
| low | Any flagged reply (medium or high) |
| medium | Medium or high flags |
| high (default) | Only high flags |
| never | Never trips. The grader still records flags for the dashboard, but no runtime action is taken. |
Action (what to do when it trips)
| Action | Behaviour |
|---|---|
| warn (default) | Log only. An internal alert fires for high-severity flags either way. |
| handoff | Drop the bot reply. Route the conversation to a human via the existing handoff queue. |
Default: threshold='high' + action='warn'. This matches today’s pre-Phase-O behaviour, so nothing changes for tenants who haven’t tuned the policy yet.
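The threshold and action tables combine into a small decision, sketched below. The function name, the flag dictionary shape, and the `None` return for "did not trip" are assumptions for illustration; only the threshold and action values come from the tables.

```python
# Which flag severities trip the guardrail at each threshold setting.
TRIPS_ON = {
    "low": {"medium", "high"},
    "medium": {"medium", "high"},
    "high": {"high"},
    "never": set(),
}


def guardrail_action(flags, threshold="high", action="warn"):
    """Return the configured action if any flag trips the threshold, else None.

    `flags` is a list of dicts with a "severity" key (assumed shape).
    """
    if threshold not in TRIPS_ON:
        raise ValueError(f"unknown threshold: {threshold!r}")
    if action not in {"warn", "handoff"}:
        raise ValueError(f"unknown action: {action!r}")
    severities = {flag["severity"] for flag in flags}
    return action if severities & TRIPS_ON[threshold] else None
```

Under the defaults, a single medium flag returns `None` (the reply goes out and the flag is only recorded); with `threshold="medium"` and `action="handoff"` the same flag returns `"handoff"`.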
Recommended configurations
- Clinics, regulated verticals, financial services: threshold='medium' + action='handoff'. A reply carrying a medium-severity hallucination flag should never reach the customer. Cost: more handoffs; benefit: no AI-generated misinformation in a regulated context.
- High-volume retail, hospitality: threshold='high' + action='warn'. The default; high flags get escalated via an internal alert, but medium flags ride through. Cost: imperfect replies occasionally reach customers; benefit: no extra handoff load.
- Pilot / staging tenants: threshold='low' + action='warn'. Maximally noisy: surfaces every flag in the analytics dashboard so you can calibrate the threshold on real data before promoting the tenant to production policy.
Forbidden-phrase guardrail
If the agent’s reply contains any phrase from the merged forbidden-phrase list, this guardrail’s action kicks in. The phrase list comes from two sources, concatenated and de-duplicated:
- Vertical pack defaults: every pack ships its own list (e.g. clinic ships diagnose, you have, definitely, it's nothing serious).
- Tenant-specific additions: your operator adds phrases via the dashboard pack-overrides UI.
Matching is by substring, so diagnose matches “I diagnose…”, “let me diagnose…”, and “I can’t diagnose…”. This is intentional: at runtime, near-misses are mostly the agent trying to talk around a forbidden term; exact-word matching would let too much through.
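A minimal sketch of the merge-then-match behaviour described above. The function names are hypothetical, and case-insensitive comparison is an assumption; this page does not say whether the real matcher folds case.

```python
def merge_phrases(pack_defaults: list[str], tenant_additions: list[str]) -> list[str]:
    """Concatenate both lists and drop duplicates, keeping first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for phrase in [*pack_defaults, *tenant_additions]:
        key = phrase.casefold()  # assumed: duplicates compared case-insensitively
        if key not in seen:
            seen.add(key)
            merged.append(phrase)
    return merged


def forbidden_hits(reply: str, phrases: list[str]) -> list[str]:
    """Return every phrase that occurs anywhere in the reply as a substring."""
    haystack = reply.casefold()
    return [p for p in phrases if p.casefold() in haystack]
```

Because the check is a plain substring test, a phrase such as `diagnose` also hits inside longer words like "diagnosed", which is the over-matching trade-off the paragraph above accepts.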
Action
| Action | Behaviour |
|---|---|
| warn (default) | Log only. The prompt already tells the model not to use these phrases; the guardrail logs the slip without overriding. |
| block | Replace the reply with a generic fallback string ("Let me get a colleague to help with that. I'll connect you now." / "دعني أحول طلبك لزميل من الفريق ليساعدك."). The dispatch logs the override. |
| handoff | Drop the bot reply, route to a human. |
Default: action='warn'. The rationale is the same as for the hallucination guardrail: it matches today’s behaviour.
Why three actions instead of two
block and handoff differ in customer experience: block keeps the bot in the conversation
(the customer reads the fallback string and can keep talking); handoff drops the bot and routes
to a human. Use block when you want to give the bot a graceful exit; use handoff when a
forbidden-phrase hit means a human absolutely must take the conversation from here.
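The difference between the three actions can be made concrete with a small sketch. The function name and the `(text, route_to_human)` return shape are assumptions for illustration; the fallback string is the English one quoted in the table.

```python
FALLBACK = "Let me get a colleague to help with that. I'll connect you now."


def apply_forbidden_phrase_action(reply: str, hits: list[str], action: str = "warn"):
    """Return (text_to_send, route_to_human) for a reply and its phrase hits."""
    if action not in {"warn", "block", "handoff"}:
        raise ValueError(f"unknown action: {action!r}")
    if not hits or action == "warn":
        # warn: the slip is logged elsewhere; the reply goes out unchanged.
        return reply, False
    if action == "block":
        # block: the bot stays in the conversation with a generic fallback.
        return FALLBACK, False
    # handoff: the bot reply is dropped and a human takes over.
    return None, True
```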
Where the guardrail policy lives
Per-tenant guardrails live in tenants.guardrails (JSONB column). The schema:
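As an illustration only, the sketch below shows how a stored policy could be read with the documented defaults filled in. The key names (`hallucination`, `forbidden_phrase`, `threshold`, `action`) and the loader function are assumptions, not the actual column schema; only the allowed values and defaults are taken from this page.

```python
import json

# Defaults as documented: threshold='high' + action='warn', and action='warn'.
DEFAULTS = {
    "hallucination": {"threshold": "high", "action": "warn"},
    "forbidden_phrase": {"action": "warn"},
}

ALLOWED = {
    ("hallucination", "threshold"): {"low", "medium", "high", "never"},
    ("hallucination", "action"): {"warn", "handoff"},
    ("forbidden_phrase", "action"): {"warn", "block", "handoff"},
}


def load_policy(raw):
    """Parse a JSON policy string (or None) and fill in missing keys."""
    stored = json.loads(raw) if raw else {}
    policy = {
        section: {**defaults, **stored.get(section, {})}
        for section, defaults in DEFAULTS.items()
    }
    for (section, key), allowed in ALLOWED.items():
        value = policy[section][key]
        if value not in allowed:
            raise ValueError(f"{section}.{key}: invalid value {value!r}")
    return policy
```

A tenant that has never tuned anything resolves to the defaults, which is what keeps untuned tenants on today's behaviour.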
Operator UI
Configure per-tenant from app.vorel.ai/admin/tenants/[id]/guardrails. The form writes the JSONB
column directly; changes take effect on the next agent turn (no code deploy, no service restart).
The audit log records every change with the actor, the previous value, and the new value, so
“who turned off the hallucination guardrail and when?” is always answerable.
Pack-level forbidden phrases (read-only floor)
The vertical pack’s forbidden phrases are a floor, not an override target. You can add to the list per-tenant; you cannot remove pack-shipped phrases via the standard pack-override UI. This protects against a clinic operator accidentally turning off the diagnose block.
Removing a pack-level phrase requires a code-side change to the vertical pack JSON (and an
explicit comment justifying the removal). Don’t.
Hallucination scoring details
The primary grader is deterministic and runs synchronously on the live reply path, with the conversation’s tool results, your active offerings, and your working hours as grounding evidence. It extracts factual claims with precision-tuned pattern matching and verifies each one. A second, asynchronous LLM pass runs after the reply lands and catches claims the deterministic rules miss, merging its findings back in as llm_flagged entries.
Flag kinds:
- unsupported_price: the agent quoted a price not grounded in a tool result or your offerings (within a small tolerance). A price the caller introduced first is not flagged.
- unsupported_hours: the agent stated opening/closing hours that fall outside your configured working hours.
- unsupported_availability: the agent asserted a specific available time that no tool result grounds.
- unsupported_contact: the agent produced an email, phone number, or reference code with no grounding (reference codes are high severity, since a caller acts on a fake confirmation).
- llm_flagged: the async LLM backstop caught something the deterministic rules did not.
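To make one of these kinds concrete, the unsupported_hours check can be sketched as an interval containment test. This is a simplified illustration with hypothetical names; it assumes a single same-day opening window and ignores overnight ranges and per-weekday schedules.

```python
from datetime import time


def hours_supported(
    stated_open: time,
    stated_close: time,
    configured_open: time,
    configured_close: time,
) -> bool:
    """True if the stated hours fall inside the configured working hours."""
    return configured_open <= stated_open and stated_close <= configured_close
```

A reply stating hours of 08:00 to 17:00 against configured hours of 09:00 to 17:00 would fail this check and be a candidate for an unsupported_hours flag.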
Confidence calibration
Beyond flagging fabrications, Vorel tracks whether the agent’s self-reported confidence matches its actual truthfulness. Post-turn, the platform extracts the agent’s hedges and deferrals to infer how confident it sounded, then scores calibration (an expected-calibration-error metric) against grader truthfulness on the operator quality surfaces. A well-calibrated agent hedges when it should and asserts when it’s right; this is a distinct quality axis from raw hallucination rate.
What’s NOT a guardrail today
Things you might expect that aren’t on this surface:
- Profanity filter. The forbidden-phrase guardrail handles tenant-specific terms; we don’t ship a generic profanity list. Add brand-restricted vocabulary via pack overrides.
- PII redaction in agent replies. The agent doesn’t have access to other customers’ data via RLS, so there’s nothing to redact at the reply layer. PII redaction happens at the data-export + audit-log layer instead.
- Topic restriction. “The agent must only talk about real estate, not weather” is enforced via the faq_redirect_message_* strings in vertical packs, not as a separate guardrail.
Related docs
- Verticals: pack-level forbidden phrases (clinic is the load-bearing example)
- Analytics: hallucination flag rates over time
- Security overview: broader safety posture, RLS, PII handling
- How it works: where guardrails sit in the dispatch pipeline