Vorel — Documentation

How voice works

Vorel’s voice agent runs a chained, streaming pipeline: speech recognition transcribes the caller, our voice AI reasons over the turn (router, sub-agents, tool calls, guardrails), and lifelike text-to-speech speaks the reply. Audio streams both ways in real time, and the agent handles barge-in cleanly by stopping its reply when the caller interrupts. We target a response time under 2 seconds on each turn.

Customer phone
    │
    ▼
Telephony carrier  ──►  voice-ws (Vorel realtime service)
                            │
                            ├─►  Speech recognition (streaming, bilingual)
                            │
                            ├─►  /api/voice/dispatch-turn
                            │        │
                            │        ▼
                            │    Router → sub-agent → tool calls → per-tenant guardrails
                            │        │
                            │        ▼
                            │    Reply text (streamed token-by-token)
                            │
                            └─►  Lifelike text-to-speech (streaming)
                                     │
                                     ▼
                             Audio frames → carrier → caller

The pipeline is chained, not speech-to-speech: each stage is a discrete, swappable step, which is what lets the agent reliably call tools, ground its answers, and enforce per-tenant guardrails on every turn. The reply text streams token-by-token straight into text-to-speech, so the caller starts hearing the answer before the agent has finished forming it. When a lookup takes a moment, the agent covers the pause with a short, context-aware holding phrase rather than dead air.

For the full transport detail (per-tenant configuration, the eval gate that protects against quality regressions on a pipeline change, and the shadow-mode protocol before any cutover), see Voice pipelines.

The agent runs through the same router → sub-agent flow as chat (qualification / FAQ / booking / handoff), so a customer who calls and later messages on WhatsApp continues the same conversation thread. Customer identity is keyed on phone number, not channel.
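
The token-by-token handoff from reply text into text-to-speech can be illustrated with a small sketch. This is not Vorel’s implementation: the function name chunk_for_tts, the token source, and the sentence-boundary rule are all assumptions made for the example. It only shows why speech can begin before the full reply exists.

```python
from typing import Iterable, Iterator

# Hypothetical sentence terminators, including the Arabic question mark.
SENTENCE_END = (".", "!", "?", "؟")


def chunk_for_tts(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed reply tokens into sentence-sized chunks.

    A chunk is yielded as soon as a sentence terminator arrives, so a
    (hypothetical) TTS client can start speaking before the reply is complete.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush whatever is left when the token stream ends.
        yield buffer.strip()
```

Because chunk_for_tts is a generator, the first sentence is available to the consumer while later tokens are still arriving. A production chunker would need a more careful boundary rule, for example for abbreviations.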

What the voice agent can do

Bilingual Arabic + English

Speech recognition runs in multilingual mode, so the same call can carry both languages, including Levantine and Jordanian Arabic dialects. The system prompt instructs the agent to switch when the caller switches and to stay consistent within each turn, with no mid-turn code-mixing.

Qualify a new lead

Asks the questions you’d want a human receptionist to ask: name, intent, timeline, contact preference. Captures structured data + writes the lead into your CRM.

Answer questions from your knowledge base

Vector-indexed retrieval against the offerings + FAQ + knowledge entries you’ve populated for your tenant. Answers cite the source.

Book appointments

Finds available slots within your working hours + handoff rules. Books against the local DB today; Google Calendar real-sync ships in Phase 7.
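
As an illustration of slot finding within working hours, here is a minimal sketch. The function free_slots, its parameters, and the fixed slot length are assumptions for the example. Handoff rules and the real booking schema are not modelled.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]  # (start, end), treated as half-open


def free_slots(open_at: datetime, close_at: datetime,
               booked: List[Interval], slot: timedelta) -> List[datetime]:
    """Return start times of slots that fit in working hours and overlap no booking."""
    starts = []
    cursor = open_at
    while cursor + slot <= close_at:
        end = cursor + slot
        # Two half-open intervals overlap when each starts before the other ends.
        if not any(b_start < end and cursor < b_end for b_start, b_end in booked):
            starts.append(cursor)
        cursor += slot
    return starts
```

With working hours of 09:00 to 12:00, one existing booking from 10:00 to 11:00, and one-hour slots, the sketch offers 09:00 and 11:00.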

Escalate to your team

On configurable triggers: explicit request, complaint, negotiation, stuck conversation, compliance question. Routes via Slack webhook or email; the conversation lands in your /inbox.
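
One way to picture configurable triggers is as a per-tenant set of enabled trigger names plus a route. The sketch below is hypothetical: HandoffConfig, handoff_route, and the trigger identifiers are illustrative names derived from the list above, not Vorel’s configuration schema.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Trigger identifiers derived from the list above; the spelling is assumed.
TRIGGERS = frozenset({
    "explicit_request", "complaint", "negotiation",
    "stuck_conversation", "compliance_question",
})


@dataclass
class HandoffConfig:
    """Hypothetical per-tenant handoff settings."""
    enabled_triggers: Set[str] = field(default_factory=lambda: set(TRIGGERS))
    route: str = "slack"  # or "email"


def handoff_route(config: HandoffConfig, trigger: Optional[str]) -> Optional[str]:
    """Return the route to notify, or None when no enabled trigger fired."""
    if trigger in config.enabled_triggers:
        return config.route
    return None
```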

TTS-safe formatting

The agent’s voice replies are formatted for natural speech: no markdown, no URLs, no bullet lists, no parenthetical asides. A channel-rules block is appended to every sub-agent prompt to enforce 2-sentence-per-turn, no-markdown, spell-out-numbers behaviour.
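
The formatting rules above are enforced through the prompt, but a defensive post-processing pass shows what "TTS-safe" means in practice. The sketch below is an assumption-laden illustration: tts_safe and its regular expressions are not part of Vorel, and spelling out numbers is deliberately left out.

```python
import re


def tts_safe(text: str, max_sentences: int = 2) -> str:
    """Strip markup that reads badly aloud and cap the reply length."""
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)  # markdown link: keep label
    text = re.sub(r"https?://\S+", "", text)              # bare URLs
    text = re.sub(r"\s*\([^)]*\)", "", text)              # parenthetical asides
    text = re.sub(r"^\s*(?:[-*•]|\d+\.)\s+", "", text, flags=re.MULTILINE)  # list markers
    text = re.sub(r"[*_`#]", "", text)                    # emphasis, code, heading marks
    text = re.sub(r"\s+", " ", text).strip()              # collapse whitespace
    # Split after sentence terminators (including the Arabic question mark).
    sentences = re.split(r"(?<=[.!?؟])\s+", text)
    return " ".join(sentences[:max_sentences])
```

For example, the input "**Yes**, we open at nine (weekdays only). Would you like to book? I can also help with pricing." comes back as "Yes, we open at nine. Would you like to book?".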

Voice quality

Voice quality depends on:

Network conditions between the customer’s carrier and our telephony layer. UAE-domestic calls work cleanly; unusual international carrier pairs can show audio artefacts.
Speech-recognition confidence: recognition runs in multilingual mode (English + Arabic in the same call). Accuracy is strong on English and decent on Arabic, and varies on accented Arabic. Per-tenant recognition confidence thresholds are not yet exposed.
Text-to-speech: we use a low-latency, lifelike multilingual voice as the platform default. The voice is tuned to stay natural while keeping per-turn latency low, since slower, higher-fidelity models hurt live conversation. Per-tenant voice selection is configurable.

We control our own telephony codec negotiation, which materially improves audio on regional cellular calls.

Voice billing model

Voice cost per call decomposes into the underlying transport, speech, and reasoning components. We capture each component into billing_events (vendor cost-of-goods) and per-call cost tables, and your operator reviews the full per-tenant breakdown at /admin/cost-rollup. Vorel’s customer-facing billing is outcome-based (you pay per resolved outcome, not per minute or per token); see Pricing. Invoicing is manual today: your operator generates the monthly invoice from the resolution-event and ledger rows plus your agreed rate card. The internal cost rollup is operator-only and never surfaces the vendor stack on tenant-facing pages.
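
The per-call decomposition can be pictured as a simple aggregation over cost rows. In the sketch below, the row shape (call id, component, cost) and the function rollup are assumptions. The actual billing_events columns are not documented here.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable, Tuple

# Assumed row shape: (call_id, component, cost). Not the real billing_events schema.
CostRow = Tuple[str, str, Decimal]


def rollup(events: Iterable[CostRow]) -> Dict[str, Dict[str, Decimal]]:
    """Sum cost per call and per component, plus a per-call total."""
    totals: Dict[str, Dict[str, Decimal]] = defaultdict(lambda: defaultdict(Decimal))
    for call_id, component, cost in events:
        totals[call_id][component] += cost
        totals[call_id]["total"] += cost
    return {call_id: dict(parts) for call_id, parts in totals.items()}
```

Decimal is used rather than float so that small per-component amounts add up exactly.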

Voice quality assurance

Every call is scored after-the-fact by Vorel’s QA pipeline:

An LLM scores the call against an 11-criterion rubric (language matching, tone, brevity, factual grounding, tool-usage correctness, qualification completeness, booking-flow accuracy, handoff judgment, safety/compliance, conversion progress, customer-sentiment trajectory).
Output: a normalized score, per-criterion breakdown, and derived flags.
Operator-side: stored in qa_evaluations with the conversation transcript, and surfaced on the operator’s analytics + quality views, both per-tenant and cross-tenant.
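
To make the output shape concrete, here is a hedged sketch of turning per-criterion scores into a normalized score, a breakdown, and derived flags. The 0 to 5 scale, the flag threshold, the "low_" flag naming, and the function summarize are all assumptions. Only the eleven criterion names come from the rubric above.

```python
from typing import Dict, List, Union

CRITERIA: List[str] = [
    "language_matching", "tone", "brevity", "factual_grounding",
    "tool_usage_correctness", "qualification_completeness",
    "booking_flow_accuracy", "handoff_judgment", "safety_compliance",
    "conversion_progress", "customer_sentiment_trajectory",
]

Summary = Dict[str, Union[float, Dict[str, int], List[str]]]


def summarize(scores: Dict[str, int], max_score: int = 5, flag_below: int = 3) -> Summary:
    """Build a normalized score, breakdown, and flags from per-criterion scores."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    # Normalize the summed scores to the 0..1 range.
    normalized = sum(scores[c] for c in CRITERIA) / (max_score * len(CRITERIA))
    # Derive one flag per criterion that scored below the (assumed) threshold.
    flags = [f"low_{c}" for c in CRITERIA if scores[c] < flag_below]
    return {
        "score": round(normalized, 3),
        "breakdown": {c: scores[c] for c in CRITERIA},
        "flags": flags,
    }
```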

In addition, every turn is graded in real time by the hallucination grader and the per-tenant guardrails (see Guardrails); these run on the live reply path, not just after the fact.

What’s currently NOT supported on voice

Voicemail / call-back when busy: if the agent fails (LLM-provider outage, network issue), the call disconnects. A voicemail-style fallback is a future addition.
Outbound calls: operator-initiated outbound dialing is built but ships dark behind a flag; inbound is the live surface today.
Call recording archive: recordings are hosted by the telephony layer and we store the URL reference. Long-term archiving + PII-redacted recordings are deferred features.
Conference / multi-party: single-customer-to-agent only.

Per-vertical specifics

Each vertical pack tunes the voice agent’s qualification + handoff behaviour. The summary below mirrors the prompt_overrides.qualification_extra_rules field of each pack:

Real estate: captures intent (buy vs rent), property type, bedrooms, budget range with currency, preferred areas, timeline, and financing_needed. Books viewings against your offerings.
Salon: captures whether the caller is returning, occasion (regular vs wedding vs trial), and (for color services) whether they’re matching an existing tone or changing. Sensitive topics like allergies are captured once and never re-asked.
Clinic: confirms patient_status (new vs returning), insurance provider + member number, then captures symptoms in the patient’s own words and routes to the right specialty. Will not diagnose (forbidden_phrases includes "diagnose", "you have", and "it's nothing serious"). Red-flag symptoms (chest pain, severe bleeding, suicidal ideation, possible stroke) trigger an immediate human handoff with an explicit instruction to call emergency services.
Restaurant: captures party size, target service period (lunch / dinner / etc.), dietary restrictions and allergies, and seating preference. Large parties (8+) and private-room bookings: confirms minimum spend, set-menu requirement, and deposit policy at booking time.
Auto service: captures make / model / year first (everything else depends on it), captures symptoms in the caller’s own words without diagnosing, and never quotes a final repair price over the phone for symptom-driven work. Routine services (oil, brakes, tires) book straight; symptom-driven calls book a diagnostic first.
Generic SMB: captures name, contact, and what the caller is trying to accomplish. Operator tailors the qualification questions per-tenant in the dashboard.
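
Taking the clinic pack as an example, the two checks it describes (forbidden phrases in the agent’s reply, red-flag symptoms in the caller’s words) might look like the sketch below. The function names and the keyword list for red flags are hypothetical, and plain substring matching is a simplification that a real guardrail would need to refine.

```python
from typing import List

# Phrases quoted from the clinic pack's forbidden_phrases description.
FORBIDDEN_PHRASES = ("diagnose", "you have", "it's nothing serious")

# Hypothetical keywords standing in for the red-flag symptoms listed above.
RED_FLAG_KEYWORDS = ("chest pain", "severe bleeding", "suicidal", "stroke")


def forbidden_hits(reply: str) -> List[str]:
    """Return the forbidden phrases found in an agent reply (naive substring match)."""
    lowered = reply.lower()
    return [phrase for phrase in FORBIDDEN_PHRASES if phrase in lowered]


def needs_emergency_handoff(caller_text: str) -> bool:
    """True when the caller's words contain any red-flag keyword."""
    lowered = caller_text.lower()
    return any(keyword in lowered for keyword in RED_FLAG_KEYWORDS)
```

Substring matching over-triggers (for instance, "you have" also appears in harmless questions), which is one reason to treat this only as a picture of the rule rather than a working guardrail.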

For per-vertical detail: Real estate · Salon · Clinic · Restaurant · Auto service · Generic.
