Voice API
## Welcome to Voice API

A single API for everything voice (**text-to-speech**, **speech-to-text**, and **voice cloning**), built for production workloads.

### Highlights

- 🎙️ **Studio-grade TTS** with multi-speaker scripting, inline emotion cues (e.g. `[excited]`, `[whisper]`, `[sigh]`), and per-speaker model binding. Long scripts within the per-call character cap are auto-chunked and…
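
To make the inline emotion cues concrete, here is a minimal client-side sketch that lists the bracketed cues in a script and previews the text with the cues removed. It assumes a cue is any run of text enclosed in one pair of square brackets, as in the examples above; the helper names `find_emotion_cues` and `strip_emotion_cues` are hypothetical and not part of the API.

```python
import re
from typing import List

# Assumption: a cue is any text inside one pair of square brackets,
# e.g. [excited]. The service's actual parsing rules may differ.
CUE_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_emotion_cues(script: str) -> List[str]:
    """Return the bracketed cues in the order they appear in the script."""
    return [cue.strip() for cue in CUE_PATTERN.findall(script)]


def strip_emotion_cues(script: str) -> str:
    """Return the script with cues removed and whitespace collapsed."""
    without_cues = CUE_PATTERN.sub("", script)
    return " ".join(without_cues.split())
```

For example, `find_emotion_cues("[excited] We shipped! [whisper] Keep it quiet.")` yields the two cue names, which can be checked against whatever cue list the service documents before submitting.
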
## Voice API endpoints
| Method | Endpoint | Description |
|---|---|---|
| **models** | | |
| GET | `models_list` `/v1/models` | Filter by title (fuzzy match), language, tag, and sort order. **Frequently requested pages are cached**; pass `refresh=true` to bypass the cache. **Query parameters**: … |
| **tts** | | |
| POST | `tts_submit` `/v1/tts` | Synthesise text into MP3 audio. **Multi-speaker**: prefix each speaker's text with a marker; each marker maps to the N-th model in `model_ids`. **Emotion hints**: content inside square… |
| GET | `tts_status` `/v1/tts/{task_id}` | Returns the task's current `status` and progress. When `status=succeeded`, download the audio from `/v1/tts/{task_id}/audio`. |
| GET | `tts_audio` `/v1/tts/{task_id}/audio` | Call after `status=succeeded`. Returns `Content-Type: audio/mpeg` with `Content-Disposition` defaulting to `{task_id}.mp3`. |
| **stt** | | |
| POST | `stt_submit` `/v1/stt` | Upload audio (mp3 / wav / m4a, etc.). The backend handles the full "upload → create remote task → poll → fetch result" pipeline; the client only needs to poll once. Optional… |
| GET | `stt_status` `/v1/stt/{task_id}` | `progress` is updated to 50 / 80 during the polling phase and 100 on completion. `result` is always `null` here; fetch the transcript from `/v1/stt/{task_id}/result` to avoid… |
| GET | `stt_result` `/v1/stt/{task_id}/result` | When `plain_text=true`, returns only the merged transcript. When `false` (default), includes segments (timestamps + speaker labels). |
| **clone** | | |
| POST | `clone_submit` `/v1/clone` | Upload a reference audio clip to train a private TTS model. **Reference audio requirements**: duration **≥ 10 s**, recommended 10–90 s. Clips that are too long will be rejected.… |
| GET | `clone_status` `/v1/clone/{task_id}` | Once `status=succeeded`, read `result.model_id` for the trained model ID. |
| **health** | | |
| GET | `health` `/v1/health` | Public endpoint, no authentication required. Suitable for load balancers and Kubernetes health checks. |
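
The TTS, STT, and clone endpoints above share a submit, poll, fetch shape. The sketch below shows one way a client might poll a task until `status=succeeded`. It is not an official client: the base URL and authentication scheme are not given here, so the HTTP call is passed in as a hypothetical `get_json` callable, and the `failed` status name is an assumption, since only `succeeded` is named in the table.

```python
import time
from typing import Any, Callable, Dict

# Where the finished output is served for each task kind, per the table.
# Clone tasks have no separate result endpoint: the status body itself
# carries result.model_id once the task has succeeded.
RESULT_PATHS = {
    "tts": "/v1/tts/{task_id}/audio",
    "stt": "/v1/stt/{task_id}/result",
}


def status_path(kind: str, task_id: str) -> str:
    """Build the status path for a tts, stt, or clone task."""
    if kind not in ("tts", "stt", "clone"):
        raise ValueError(f"unknown task kind: {kind}")
    return f"/v1/{kind}/{task_id}"


def result_path(kind: str, task_id: str) -> str:
    """Build the path that serves the finished output (tts and stt only)."""
    return RESULT_PATHS[kind].format(task_id=task_id)


def wait_for_task(
    get_json: Callable[[str], Dict[str, Any]],  # hypothetical: GETs a path, returns parsed JSON
    kind: str,
    task_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll the status endpoint until the task succeeds, fails, or times out."""
    path = status_path(kind, task_id)
    deadline = clock() + timeout
    while True:
        body = get_json(path)
        status = body.get("status")
        if status == "succeeded":
            return body
        if status == "failed":  # assumption: failure states are not named in the table
            raise RuntimeError(f"task {task_id} failed: {body}")
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} still {status!r} after {timeout}s")
        sleep(interval)
```

After `wait_for_task` returns for a `tts` or `stt` task, the output can be fetched from `result_path(kind, task_id)`; for a `clone` task, the trained model ID is in `result.model_id` of the returned body. Injecting `get_json`, `sleep`, and `clock` keeps the polling logic independent of any particular HTTP library and easy to exercise without network access.
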
## Voice API pricing
| Plan | Price | Rate limit | Quotas |
|---|---|---|---|
| BASIC | Free | — |
| PRO (Recommended) | $15 / month | 2000 / hour | |
| ULTRA | $30 / month | 3000 / hour | |
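
A client can stay under the hourly rate limits by spacing its calls evenly. The sketch below is one simple approach, assuming the limits in the table count requests per hour (the table does not state the unit) and that even pacing is acceptable for the workload. The `Throttle` class and `PLAN_LIMITS_PER_HOUR` mapping are illustrative names, not part of the API.

```python
import time
from typing import Callable, Optional

# Limits copied from the pricing table; assumed to mean requests per hour.
PLAN_LIMITS_PER_HOUR = {"PRO": 2000, "ULTRA": 3000}


def min_interval_seconds(limit_per_hour: int) -> float:
    """Smallest even spacing between calls that stays within the hourly limit."""
    return 3600.0 / limit_per_hour


class Throttle:
    """Blocks just long enough to keep calls at least one interval apart."""

    def __init__(
        self,
        limit_per_hour: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = min_interval_seconds(limit_per_hour)
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the delay applied."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay
```

Calling `Throttle(PLAN_LIMITS_PER_HOUR["PRO"]).wait()` before each request spaces calls about 1.8 seconds apart. Even pacing gives up burst capacity in exchange for simplicity; a token bucket would be the usual alternative if short bursts matter.
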