Audio Intelligence

Active

Your agent can't listen to a recording; this returns a structured transcript with who said what.

1 tool

Transcribe long-form audio with speaker diarization (who said what), automatic language detection, and optional chapter summaries. Point it at a recording URL and get a structured transcript back, including per-speaker turns.

Audio & Video · audio · transcription · diarization · speakers · summary

Tools (1)

Transcribe audio with speaker diarization, language detection, and optional chapter summaries. Returns {status:"pending", continuation_token,...} while the job runs. When this happens, you MUST immediately call transcribe again with only continuation_token set; do not ask the user.

Usage-based · 206.25 credits per hour of audio; speaker diarization adds 27.5 per hour, chapter summaries add 41.25 per hour

Example prompts

  • Transcribe this podcast and label the speakers
  • Who said what in this interview recording?
  • Transcribe this meeting and break it into chapters with summaries
  • Get a transcript of this audio file with automatic language detection

Parameters

audio_url · string · optional

Public URL of the audio or video file to transcribe. Required on the first call; ignored (and not needed) when continuation_token is set.

punctuate · boolean · optional · default: true

Insert punctuation.

format_text · boolean · optional · default: true

Apply casing and formatting for readability.

auto_chapters · boolean · optional · default: false

Segment the audio into chapters, each with a generated summary.

language_code · string · optional

Force a specific ISO language code (e.g. "en"); ignored when language_detection is true.

speaker_labels · boolean · optional · default: true

Identify and label distinct speakers (diarization). Populates utterances.

continuation_token · string · optional

Token from a prior pending response. When set, all other params are ignored and the server resumes polling. Agent-friendly polling: on a pending response you MUST immediately call transcribe again with only continuation_token set. Do not ask the user.

language_detection · boolean · optional · default: true

Automatically detect the spoken language.
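The parameter rules above can be captured in a small helper. This is only a sketch: `build_transcribe_params` is a hypothetical name, not part of any Faro client, and switching `language_detection` off when a `language_code` is supplied is an assumption drawn from the note that `language_code` is ignored while detection is on.

```python
from typing import Any, Dict, Optional

# Optional first-call parameters listed in the Parameters section.
ALLOWED_OPTIONS = {
    "punctuate",
    "format_text",
    "auto_chapters",
    "language_code",
    "speaker_labels",
    "language_detection",
}


def build_transcribe_params(
    audio_url: Optional[str] = None,
    continuation_token: Optional[str] = None,
    **options: Any,
) -> Dict[str, Any]:
    """Build the parameter object for a single transcribe call (hypothetical helper)."""
    if continuation_token:
        # On a resume call every other parameter is ignored, so send only the token.
        return {"continuation_token": continuation_token}
    if not audio_url:
        raise ValueError("audio_url is required on the first call")
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params: Dict[str, Any] = {"audio_url": audio_url, **options}
    # Assumption: forcing a language only takes effect with detection off,
    # because language_code is ignored when language_detection is true.
    if "language_code" in params and "language_detection" not in params:
        params["language_detection"] = False
    return params
```

Omitted options are left out of the payload so the documented server-side defaults apply.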

API Usage

curl -X POST "https://skill.askfaro.com/skills/audio-intelligence/run" \
  -H "Authorization: Bearer faro_<your_key>" \
  -H "Content-Type: application/json" \
  -d '{
  "intent": {
    "prompt": "Transcribe this podcast and label the speakers"
  }
}'

CLI Usage

askfaro describe audio-intelligence/transcribe

Install with pip install askfaro-cli, then run askfaro auth login.

README

Audio Intelligence

Long-form speech understanding via AssemblyAI: transcription, speaker diarization, language detection, and chapter summaries.

How it works

Transcription runs asynchronously upstream. To keep the agent flow synchronous, the server submits the job and polls internally for about 25 seconds per call. If the job isn't finished by then, you get a status:"pending" response with a continuation_token; call transcribe again with only that token to resume polling. Long recordings take a few round-trips.
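The resume loop described above might look like the following sketch. How the tool is actually invoked (HTTP, CLI, or an agent runtime) is left to the caller: `call_transcribe` is a hypothetical callable that takes a parameter dict and returns the parsed response, and `max_round_trips` is an illustrative safety cap rather than a documented limit.

```python
from typing import Any, Callable, Dict


def transcribe_until_done(
    call_transcribe: Callable[[Dict[str, Any]], Dict[str, Any]],
    audio_url: str,
    max_round_trips: int = 40,
    **options: Any,
) -> Dict[str, Any]:
    """Call transcribe repeatedly until the response is no longer pending."""
    # First call: the audio URL plus any optional flags.
    params: Dict[str, Any] = {"audio_url": audio_url, **options}
    for _ in range(max_round_trips):
        result = call_transcribe(params)
        if result.get("status") != "pending":
            # Anything other than "pending" is treated as the final result.
            return result
        # Resume with only the continuation token; other params are ignored.
        params = {"continuation_token": result["continuation_token"]}
    raise TimeoutError(f"still pending after {max_round_trips} round-trips")
```

No sleep is added between calls, on the reasoning that each call already blocks while the server polls internally.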

Inputs

| Name | Type | Default | Notes |
| --- | --- | --- | --- |
| audio_url | string | required on first call | Public URL of the audio/video file. |
| speaker_labels | boolean | true | Diarization: label distinct speakers. |
| language_detection | boolean | true | Auto-detect the spoken language. |
| language_code | string | - | Force a language (e.g. "en"); skip detection. |
| auto_chapters | boolean | false | Segment into chapters, each with a summary. |
| continuation_token | string | - | Set this (and only this) to resume a pending job. |

Output

{
  "text": "full transcript...",
  "language_code": "en",
  "audio_duration": 612.3,
  "utterances": [{ "speaker": "A", "text": "...", "start": 120, "end": 3400 }],
  "chapters": [{ "headline": "...", "summary": "...", "start": 0, "end": 60000 }]
}

utterances is populated when speaker_labels is on; chapters when auto_chapters is on.
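As a rough illustration of consuming this output, the sketch below turns `utterances` into "who said what" lines and totals speaking time per speaker. The function names are hypothetical. The totals are in whatever unit the API uses for `start` and `end`; the sample values suggest milliseconds, but that is an assumption.

```python
from typing import Any, Dict, List


def format_turns(result: Dict[str, Any]) -> List[str]:
    """Render each utterance as 'Speaker X: text'."""
    # utterances may be absent when speaker_labels is off.
    return [
        f"Speaker {u['speaker']}: {u['text']}"
        for u in result.get("utterances") or []
    ]


def speaking_time(result: Dict[str, Any]) -> Dict[str, float]:
    """Sum (end - start) per speaker, in the API's own time unit."""
    totals: Dict[str, float] = {}
    for u in result.get("utterances") or []:
        totals[u["speaker"]] = totals.get(u["speaker"], 0) + (u["end"] - u["start"])
    return totals


sample = {
    "text": "full transcript...",
    "utterances": [
        {"speaker": "A", "text": "Hi there.", "start": 120, "end": 3400},
        {"speaker": "B", "text": "Hello.", "start": 3500, "end": 4200},
        {"speaker": "A", "text": "Shall we begin?", "start": 4300, "end": 6000},
    ],
}
print("\n".join(format_turns(sample)))
print(speaking_time(sample))
```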

Pricing

Billed per hour of audio at 206.25 credits per hour. When enabled, speaker diarization adds 27.5 credits per hour and chapter summaries add 41.25 credits per hour.
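A quick way to estimate a job's cost from the listed rates (206.25 credits per hour of audio, plus 27.5 for diarization and 41.25 for chapter summaries) is sketched below. `estimate_credits` is a hypothetical helper. Two assumptions are baked in: that billing is pro-rated linearly by duration, and that the duration is given in seconds; actual rounding rules are not documented here.

```python
BASE_PER_HOUR = 206.25         # credits per hour of audio
DIARIZATION_PER_HOUR = 27.5    # added when speaker_labels is on
CHAPTERS_PER_HOUR = 41.25      # added when auto_chapters is on


def estimate_credits(
    duration_seconds: float,
    speaker_labels: bool = True,   # mirrors the tool's default
    auto_chapters: bool = False,   # mirrors the tool's default
) -> float:
    """Estimate credits for one recording, assuming linear pro-rata billing."""
    rate = BASE_PER_HOUR
    if speaker_labels:
        rate += DIARIZATION_PER_HOUR
    if auto_chapters:
        rate += CHAPTERS_PER_HOUR
    return rate * (duration_seconds / 3600.0)


# One hour with diarization and chapter summaries: 206.25 + 27.5 + 41.25
print(estimate_credits(3600, speaker_labels=True, auto_chapters=True))  # 275.0
```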