Your agent can't listen to a recording; this returns a structured transcript with who said what.
Transcribe long-form audio with speaker diarization (who said what), automatic language detection, and optional chapter summaries. Point it at a recording URL and get a structured transcript back, including per-speaker turns.
Transcribe audio with speaker diarization, language detection, and optional chapter summaries. Returns {status:"pending", continuation_token,...} while the job is still running. When this happens you MUST immediately call transcribe again with only continuation_token set; do not ask the user.
Public URL of the audio or video file to transcribe. Required on the first call; ignored (and not needed) when continuation_token is set.
Insert punctuation.
Apply casing and formatting for readability.
Segment the audio into chapters, each with a generated summary.
Force a specific ISO language code (e.g. "en"); ignored when language_detection is true.
Identify and label distinct speakers (diarization). Populates utterances.
Token from a prior pending response. When set, all other params are ignored and the server resumes polling. Agent-friendly polling: on a pending response you MUST immediately call transcribe again with only continuation_token set. Do not ask the user.
Automatically detect the spoken language.
curl -X POST "https://skill.askfaro.com/skills/audio-intelligence/run" \
  -H "Authorization: Bearer faro_<your_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "intent": {
      "prompt": "Transcribe this podcast and label the speakers"
    }
  }'
askfaro describe audio-intelligence/transcribe
Install the CLI with pip install askfaro-cli, then run askfaro auth login.
Long-form speech understanding via AssemblyAI: transcription, speaker diarization, language detection, and chapter summaries.
Transcription runs asynchronously upstream. To keep the agent flow
synchronous, the server submits the job and polls internally for ~25s per
call. If the job isn't finished by then, you get a status:"pending" response
with a continuation_token; call transcribe again with only that token to
resume polling. Long recordings take a few round-trips.
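The resume-on-pending flow described above can be sketched as a small loop. This is a minimal illustration, not the service's client: call_transcribe is a hypothetical callable that sends one parameter dict to the transcribe skill and returns the parsed JSON response, and max_round_trips is an assumed safety cap that the source does not specify.

```python
from typing import Any, Callable, Dict

Params = Dict[str, Any]


def transcribe_until_done(
    call_transcribe: Callable[[Params], Params],  # hypothetical transport function
    audio_url: str,
    max_round_trips: int = 40,  # assumed cap, not a documented limit
    **options: Any,
) -> Params:
    """Submit a job, then resume with only continuation_token while pending."""
    params: Params = {"audio_url": audio_url, **options}
    for _ in range(max_round_trips):
        response = call_transcribe(params)
        if response.get("status") != "pending":
            return response  # finished: the structured transcript
        # On a pending response, the next call carries only the token.
        params = {"continuation_token": response["continuation_token"]}
    raise TimeoutError("transcription still pending after max_round_trips calls")
```

Because the server already waits roughly 25 seconds inside each call, the loop does not sleep between round-trips.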
| Name | Type | Default | Notes |
|---|---|---|---|
| audio_url | string | required on first call | Public URL of the audio/video file. |
| speaker_labels | boolean | true | Diarization: label distinct speakers. |
| language_detection | boolean | true | Auto-detect the spoken language. |
| language_code | string | - | Force a language (e.g. "en"); skip detection. |
| auto_chapters | boolean | false | Segment into chapters, each with a summary. |
| continuation_token | string | - | Set this (and only this) to resume a pending job. |
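As a rough sketch of how the rules in the table combine, the helper below builds the parameter dict for one call. The function name build_transcribe_params is hypothetical; the parameter names and defaults are taken from the table, and sending the defaults explicitly is a choice made here for clarity, not a documented requirement.

```python
from typing import Any, Dict, Optional


def build_transcribe_params(
    audio_url: Optional[str] = None,
    continuation_token: Optional[str] = None,
    speaker_labels: bool = True,
    language_detection: bool = True,
    language_code: Optional[str] = None,
    auto_chapters: bool = False,
) -> Dict[str, Any]:
    """Return the parameters for a first call or for a resume call."""
    if continuation_token is not None:
        # Resume call: the token is the only parameter that is sent.
        return {"continuation_token": continuation_token}
    if not audio_url:
        raise ValueError("audio_url is required on the first call")
    params: Dict[str, Any] = {
        "audio_url": audio_url,
        "speaker_labels": speaker_labels,
        "language_detection": language_detection,
        "auto_chapters": auto_chapters,
    }
    if language_code is not None:
        params["language_code"] = language_code
    return params
```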
{
"text": "full transcript...",
"language_code": "en",
"audio_duration": 612.3,
"utterances": [{ "speaker": "A", "text": "...", "start": 120, "end": 3400 }],
"chapters": [{ "headline": "...", "summary": "...", "start": 0, "end": 60000 }]
}
utterances is populated when speaker_labels is on; chapters when
auto_chapters is on.
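A minimal sketch of consuming the response shape shown above: it renders one line per speaker turn when utterances is present and falls back to the flat text field otherwise. The function name format_transcript and the "Speaker X:" line format are illustrative choices, not part of the skill.

```python
from typing import Any, Dict


def format_transcript(result: Dict[str, Any]) -> str:
    """Render per-speaker turns, or the plain text if there are no utterances."""
    utterances = result.get("utterances") or []
    if not utterances:
        # speaker_labels was off (or nothing was detected): use the full text.
        return result.get("text", "")
    return "\n".join(
        f"Speaker {turn['speaker']}: {turn['text']}" for turn in utterances
    )
```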
Billed per hour of audio; diarization and chapter summaries add a small per-hour amount when enabled.