# Voice

> Vorel handles inbound calls end-to-end: picks up, qualifies, books, escalates. Bilingual Arabic/English, MENA-region telephony.

## How voice works

Vorel's voice agent runs a **chained, streaming pipeline**: speech recognition transcribes the caller, our voice AI reasons over the turn (router, sub-agents, tool calls, guardrails), and lifelike text-to-speech speaks the reply. Audio streams both ways in real time, the agent yields cleanly when the caller barges in, and we target a sub-2-second response on each turn.

```
Customer phone
    │
    ▼
Telephony carrier  ──►  voice-ws (Vorel realtime service)
                            │
                            ├─►  Speech recognition (streaming, bilingual)
                            │
                            ├─►  /api/voice/dispatch-turn
                            │        │
                            │        ▼
                            │    Router → sub-agent → tool calls → per-tenant guardrails
                            │        │
                            │        ▼
                            │    Reply text (streamed token-by-token)
                            │
                            └─►  Lifelike text-to-speech (streaming)
                                     │
                                     ▼
                             Audio frames → carrier → caller
```

The pipeline is **chained**, not speech-to-speech: each stage is a discrete, swappable step, which is what lets the agent reliably call tools, ground its answers, and enforce per-tenant guardrails on every turn. The reply text streams token-by-token straight into text-to-speech, so the caller starts hearing the answer before the agent has finished forming it. When a lookup takes a moment, the agent covers the pause with a short, context-aware holding phrase rather than dead air.
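
As a rough illustration of how streamed reply text can reach text-to-speech before the full reply exists, the sketch below groups tokens into sentence-sized chunks and emits each chunk as soon as it is complete. It is a minimal sketch, not Vorel's implementation: `speakable_chunks`, `SENTENCE_ENDINGS`, and `fake_reply_tokens` are hypothetical names, and splitting on sentence punctuation is an assumed chunking rule.

```python
from typing import Iterable, Iterator

# Assumed chunk boundary: sentence-ending punctuation (Latin and Arabic).
SENTENCE_ENDINGS = (".", "!", "?", "؟")


def speakable_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed reply tokens into chunks that can be synthesized early."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            # A full sentence is ready: hand it off without waiting for the rest.
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush whatever is left when the reply stream ends.
        yield buffer.strip()


def fake_reply_tokens() -> Iterator[str]:
    """Stand-in for the token-by-token reply stream (illustrative only)."""
    yield from ["Sure", ",", " I", " can", " help", ".", " What", " day", " works", "?"]


chunks = list(speakable_chunks(fake_reply_tokens()))
```

Here the first chunk, `"Sure, I can help."`, is available for synthesis while the second sentence is still being generated. A production chunker would also need to handle cases this sketch ignores, such as decimal points inside numbers.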

For the full transport detail (per-tenant configuration, the eval-gate that protects against quality regressions on a pipeline change, and the shadow-mode protocol before any cutover) see [Voice pipelines](/product/voice-pipelines).

The agent runs through the same router → sub-agent flow as chat does (qualification / FAQ / booking / handoff), so a customer who calls and later WhatsApps continues the same conversation thread. Customer identity is keyed on phone number, not channel.
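
To make the "keyed on phone number, not channel" idea concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: `normalize_phone`, `Thread`, and `ThreadStore` are hypothetical names, the normalization is deliberately simplified (not full E.164 parsing), and the sample number is a placeholder.

```python
from dataclasses import dataclass, field


def normalize_phone(raw: str) -> str:
    """Reduce a phone number to a digits-only key (simplified normalization)."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    # Treat a leading "00" international prefix the same as "+".
    if digits.startswith("00"):
        digits = digits[2:]
    return digits


@dataclass
class Thread:
    customer_key: str
    messages: list = field(default_factory=list)


class ThreadStore:
    """In-memory stand-in for wherever conversation threads are persisted."""

    def __init__(self) -> None:
        self._threads = {}

    def append(self, phone: str, channel: str, text: str) -> Thread:
        # The lookup key is the phone number only; the channel is just metadata.
        key = normalize_phone(phone)
        thread = self._threads.setdefault(key, Thread(customer_key=key))
        thread.messages.append((channel, text))
        return thread


store = ThreadStore()
call = store.append("+971 50 123 4567", "voice", "I'd like to book a viewing.")
chat = store.append("00971501234567", "whatsapp", "Can we move it to Friday?")
```

Both messages land in the same `Thread` object because the two differently formatted numbers normalize to the same key, which is the behaviour the paragraph above describes.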

## What the voice agent can do

<CardGroup cols={2}>
  <Card title="Bilingual Arabic + English" icon="language">
    Speech recognition runs in multilingual mode, so the same call can carry both languages,
    including Levantine and Jordanian Arabic dialects. The system prompt instructs the agent to
    switch when the caller switches and stay consistent within each turn. No mid-turn code-mixing.
  </Card>

  <Card title="Qualify a new lead" icon="user-check">
    Asks the questions you'd want a human receptionist to ask: name, intent, timeline, contact
    preference. Captures structured data + writes the lead into your CRM.
  </Card>

  <Card title="Answer questions from your knowledge base" icon="book-open">
    Vector-indexed retrieval against the offerings + FAQ + knowledge entries you've populated for
    your tenant. Answers cite the source.
  </Card>

  <Card title="Book appointments" icon="calendar-check">
    Finds available slots within your working hours + handoff rules. Books against the local DB
    today; Google Calendar real-sync ships in Phase 7.
  </Card>

  <Card title="Escalate to your team" icon="user-headset">
    On configurable triggers: explicit request, complaint, negotiation, stuck conversation,
    compliance question. Routes via Slack webhook or email; the conversation lands in your `/inbox`.
  </Card>

  <Card title="TTS-safe formatting" icon="microphone">
    The agent's voice replies are formatted for natural speech: no markdown, no URLs, no bullet
    lists, no parenthetical asides. A channel-rules block is appended to every sub-agent prompt to
    enforce 2-sentence-per-turn, no-markdown, spell-out-numbers behaviour.
  </Card>
</CardGroup>
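
The TTS-safe formatting rules above are enforced through the prompt, but a post-processing pass along the same lines can be sketched as follows. This is an illustrative assumption, not part of the documented pipeline: `tts_safe` and `MAX_SENTENCES` are hypothetical names, the regular expressions cover only common markdown, and spelling out numbers is left out.

```python
import re

MAX_SENTENCES = 2  # mirrors the 2-sentence-per-turn rule described above


def tts_safe(text: str) -> str:
    """Strip markup that reads badly aloud and cap the reply at two sentences."""
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)  # [label](url) -> label
    text = re.sub(r"https?://\S+", "", text)  # bare URLs
    text = re.sub(r"\([^)]*\)", "", text)  # parenthetical asides
    # Leading bullet or numbered-list markers at the start of a line.
    text = re.sub(r"^\s*(?:[-*•]|\d+\.)\s+", "", text, flags=re.MULTILINE)
    text = re.sub(r"[*_`#]", "", text)  # emphasis, code, and heading marks
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace and newlines
    text = re.sub(r"\s+([.,!?])", r"\1", text)  # tidy spaces left before punctuation
    sentences = re.split(r"(?<=[.!?؟])\s+", text)
    return " ".join(sentences[:MAX_SENTENCES])


sample = "**Sure!** We open at nine (most days). Visit [our site](https://example.com/hours). Anything else?"
spoken = tts_safe(sample)
```

For the sample above, `spoken` is `"Sure! We open at nine."`: the emphasis marks, the parenthetical, and the link are removed, and everything after the second sentence is dropped.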

## Voice quality

Voice quality depends on:

* **Network conditions** between the customer's carrier and our telephony layer. UAE-domestic calls work cleanly; unusual international carrier pairs can show audio artefacts.
* **Speech-recognition confidence**: recognition runs in multilingual mode (English + Arabic in the same call). Accuracy is strong on English, decent on Arabic, and more variable on accented Arabic. Per-tenant recognition confidence thresholds are not yet exposed.
* **Text-to-speech**: we use a low-latency, lifelike multilingual voice as the platform default. The voice is tuned to stay natural while keeping per-turn latency low, since slower, higher-fidelity models hurt live conversation. Per-tenant voice selection is configurable.

We control our own telephony codec negotiation, which materially improves audio on regional cellular calls.

## Voice billing model

Voice cost per call decomposes into the underlying transport, speech, and reasoning components. We capture each component into `billing_events` (vendor cost-of-goods) and per-call cost tables so your operator can see the full breakdown.

Your operator reviews the per-tenant cost breakdown at `/admin/cost-rollup`. Vorel's customer-facing billing is **outcome-based** (you pay per resolved outcome, not per minute or per token); see [Pricing](/product/pricing). Invoicing is **manual today**: your operator generates the monthly invoice from the resolution-event and ledger rows plus your agreed rate card. The internal cost rollup is operator-only and never surfaces the vendor stack on tenant-facing pages.
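
The split between internal cost tracking and outcome-based invoicing can be sketched as below. Treat it as a sketch under stated assumptions: only the `billing_events` table name comes from this page, while the row fields (`call_id`, `component`, `cost`), the component labels, the amounts, and the single flat rate per outcome are all hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical rows shaped like entries in `billing_events`; the real schema may differ.
billing_events = [
    {"call_id": "call-1", "component": "transport", "cost": Decimal("0.012")},
    {"call_id": "call-1", "component": "speech", "cost": Decimal("0.020")},
    {"call_id": "call-1", "component": "reasoning", "cost": Decimal("0.008")},
    {"call_id": "call-2", "component": "transport", "cost": Decimal("0.006")},
    {"call_id": "call-2", "component": "speech", "cost": Decimal("0.011")},
]


def cost_per_call(events):
    """Sum vendor cost per call, keeping the per-component breakdown."""
    rollup = defaultdict(lambda: defaultdict(Decimal))
    for event in events:
        rollup[event["call_id"]][event["component"]] += event["cost"]
    return {
        call_id: {"components": dict(parts), "total": sum(parts.values(), Decimal("0"))}
        for call_id, parts in rollup.items()
    }


def invoice_total(resolved_outcomes: int, rate_per_outcome: Decimal) -> Decimal:
    """Outcome-based billing: charge per resolved outcome, not per minute or token."""
    return resolved_outcomes * rate_per_outcome


rollup = cost_per_call(billing_events)
invoice = invoice_total(3, Decimal("25"))
```

The two functions are intentionally independent: the rollup is what an operator would review internally, while the invoice depends only on the count of resolved outcomes and the agreed rate.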

## Voice quality assurance

Every call is scored after the fact by Vorel's QA pipeline:

* An LLM scores the call against an 11-criterion rubric (language matching, tone, brevity, factual grounding, tool-usage correctness, qualification completeness, booking-flow accuracy, handoff judgment, safety/compliance, conversion progress, customer-sentiment trajectory).
* Output: a normalized score, per-criterion breakdown, and derived flags.
* Storage and surfacing: results are stored in `qa_evaluations` alongside the conversation transcript and surfaced to operators on the analytics + quality surfaces, both per-tenant and cross-tenant.
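
One way the rubric output described above could be assembled is sketched here. The eleven criterion names follow the list above, but the 0 to 5 scale, the flag threshold, the `low_` flag naming, and the `summarize` function are assumptions made for illustration; the actual scoring scale and flag logic are not documented on this page.

```python
CRITERIA = [
    "language_matching", "tone", "brevity", "factual_grounding",
    "tool_usage_correctness", "qualification_completeness",
    "booking_flow_accuracy", "handoff_judgment", "safety_compliance",
    "conversion_progress", "sentiment_trajectory",
]
MAX_PER_CRITERION = 5  # assumed 0-5 scale per criterion
FLAG_THRESHOLD = 2  # assumed: scores at or below this raise a flag


def summarize(scores: dict) -> dict:
    """Turn raw per-criterion scores into a normalized score, breakdown, and flags."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    total = sum(scores[c] for c in CRITERIA)
    normalized = total / (MAX_PER_CRITERION * len(CRITERIA))
    flags = [f"low_{c}" for c in CRITERIA if scores[c] <= FLAG_THRESHOLD]
    return {
        "score": round(normalized, 3),
        "breakdown": {c: scores[c] for c in CRITERIA},
        "flags": flags,
    }


# Example: a call that scores 4 everywhere except factual grounding.
example_scores = {c: 4 for c in CRITERIA}
example_scores["factual_grounding"] = 1
result = summarize(example_scores)
```

With these inputs the raw total is 41 out of a possible 55, so `result["score"]` is `0.745` and the only derived flag is `low_factual_grounding`.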

In addition, every turn is graded **in real time** by the hallucination grader and the per-tenant guardrails (see [Guardrails](/product/guardrails)); these run on the live reply path, not just after the fact.

## What's currently NOT supported on voice

* **Voicemail / call-back when busy**: if the agent fails (LLM-provider outage, network issue), the call disconnects. A voicemail-style fallback is a future addition.
* **Outbound calls**: operator-initiated outbound dialing is built but ships dark behind a flag; inbound is the live surface today.
* **Call recording archive**: recordings are hosted by the telephony layer and we store the URL reference. Long-term archiving + PII-redacted recordings are deferred features.
* **Conference / multi-party**: single-customer-to-agent only.

## Per-vertical specifics

Each vertical pack tunes the voice agent's qualification + handoff behaviour. The summary below mirrors the `prompt_overrides.qualification_extra_rules` field of each pack:

* **Real estate**: captures intent (buy vs rent), property type, bedrooms, budget range with currency, preferred areas, timeline, and `financing_needed`. Books viewings against your offerings.
* **Salon**: captures whether the caller is returning, occasion (regular vs wedding vs trial), and (for color services) whether they're matching an existing tone or changing. Sensitive topics like allergies are captured once and never re-asked.
* **Clinic**: confirms `patient_status` (new vs returning), insurance provider + member number, then captures symptoms in the patient's own words and routes to the right specialty. **Will not diagnose** (`forbidden_phrases` includes `diagnose`, `you have`, `it's nothing serious`). Red-flag symptoms (chest pain, severe bleeding, suicidal ideation, possible stroke) trigger an immediate human handoff with an explicit instruction to call emergency services.
* **Restaurant**: captures party size, target service period (lunch / dinner / etc.), dietary restrictions and allergies, and seating preference. For large parties (8+) and private-room bookings, confirms minimum spend, set-menu requirement, and deposit policy at booking time.
* **Auto service**: captures make / model / year first (everything else depends on it), captures symptoms in the caller's own words **without diagnosing**, and **never quotes a final repair price over the phone for symptom-driven work**. Routine services (oil, brakes, tires) book straight; symptom-driven calls book a diagnostic first.
* **Generic SMB**: captures name, contact, and what the caller is trying to accomplish. Operator tailors the qualification questions per-tenant in the dashboard.
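
The clinic rules above combine two checks: the agent's own reply must avoid forbidden phrases, and the caller's words must be screened for red-flag symptoms. The sketch below shows the shape of those checks using naive substring matching. The `forbidden_phrases` values are quoted from the clinic bullet; the `red_flag_symptoms` key, the dict layout, both function names, and the matching strategy are hypothetical, and the real guardrails may work quite differently.

```python
# Phrase values are taken from the clinic description above; the layout is illustrative.
CLINIC_PACK = {
    "forbidden_phrases": ["diagnose", "you have", "it's nothing serious"],
    # Hypothetical key: keywords standing in for the red-flag symptom list.
    "red_flag_symptoms": ["chest pain", "severe bleeding", "suicidal", "stroke"],
}


def violated_phrases(reply: str, pack: dict) -> list:
    """Return the forbidden phrases that appear in a draft agent reply."""
    lowered = reply.lower()
    return [phrase for phrase in pack["forbidden_phrases"] if phrase in lowered]


def needs_emergency_handoff(caller_text: str, pack: dict) -> bool:
    """True when the caller mentions any red-flag symptom keyword."""
    lowered = caller_text.lower()
    return any(symptom in lowered for symptom in pack["red_flag_symptoms"])


draft = "It sounds like you have a sprain."
blocked = violated_phrases(draft, CLINIC_PACK)
urgent = needs_emergency_handoff("My father has chest pain since this morning", CLINIC_PACK)
routine = needs_emergency_handoff("I need a routine check-up", CLINIC_PACK)
```

Substring matching is deliberately simple and will over-trigger: a harmless question such as "Do you have insurance?" also contains `you have`. A real check would need to account for context, and for Arabic as well as English phrasing.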

For per-vertical detail: [Real estate](/verticals/real-estate) · [Salon](/verticals/salon) · [Clinic](/verticals/clinic) · [Restaurant](/verticals/restaurant) · [Auto service](/verticals/auto-service) · [Generic](/verticals/generic).

{/* verified-against: handoff/codebase/web-voice.md + app-voice-ws.md (chained pipeline, S2S deferred/single-vendor-rejected; sub-2s binding latency mandate; AWT context-aware fillers) */}

{/* verified-against: config/verticals/real_estate.json + salon.json + clinic.json + restaurant.json + auto_service.json + generic.json prompt_overrides.qualification_extra_rules */}