Documentation Index Fetch the complete documentation index at: https://mintlify.com/Bijit-Mondal/VoiceAgent/llms.txt
Use this file to discover all available pages before exploring further.
Overview
The Voice Agent SDK uses WebSocket for bidirectional real-time communication between clients and agents. The protocol supports:
Audio/video frame transmission
Text transcripts
Streaming text responses
Speech audio chunks
Tool execution events
Lifecycle management
Connection Lifecycle
All messages are JSON objects with a type field:
interface BaseMessage {
type : string ;
[ key : string ] : any ;
}
Client → Agent Messages
transcript
Send transcribed text from speech-to-text (bypasses agent transcription).
{
type : 'transcript' ,
text : string , // User speech text
}
Example:
webSocket . send ( JSON . stringify ({
type: 'transcript' ,
text: 'What is the weather in San Francisco?' ,
}));
Agent behavior:
Interrupts current LLM stream and speech
Queues text input for processing
(VideoAgent only) Requests frame capture
audio
Send audio data for transcription by the agent.
{
type : 'audio' ,
data : string , // Base64-encoded audio
format ?: string , // Audio format (e.g., 'webm', 'mp3')
}
Example:
// Record audio from microphone
const audioBlob = await mediaRecorder . stop ();
const base64Audio = await blobToBase64 ( audioBlob );
webSocket . send ( JSON . stringify ({
type: 'audio' ,
data: base64Audio ,
format: 'webm' ,
}));
Agent behavior:
Interrupts current response
Transcribes audio using transcriptionModel
Emits transcription event
Queues transcribed text for processing
(VideoAgent only) Requests frame capture
video_frame
(VideoAgent only) Send a video frame for vision analysis.
{
type : 'video_frame' ,
sessionId : string ,
sequence : number ,
timestamp : number ,
triggerReason : 'scene_change' | 'user_request' | 'timer' | 'initial' ,
previousFrameRef ?: string ,
image : {
data : string , // Base64-encoded image
format : string , // 'webp', 'jpeg', 'png'
width : number ,
height : number ,
},
}
Example:
const canvas = document . createElement ( 'canvas' );
const ctx = canvas . getContext ( '2d' );
canvas . width = video . videoWidth ;
canvas . height = video . videoHeight ;
ctx . drawImage ( video , 0 , 0 );
const frameData = canvas . toDataURL ( 'image/webp' , 0.8 ). split ( ',' )[ 1 ];
webSocket . send ( JSON . stringify ({
type: 'video_frame' ,
sessionId: sessionId ,
sequence: frameCount ++ ,
timestamp: Date . now (),
triggerReason: 'user_request' ,
image: {
data: frameData ,
format: 'webp' ,
width: canvas . width ,
height: canvas . height ,
},
}));
Agent behavior:
Validates frame size (rejects if exceeds maxFrameInputSize)
Updates visual context buffer
Emits frame_received event
Sends frame_ack confirmation
interrupt
Request cancellation of current LLM stream and speech generation.
{
type : 'interrupt' ,
reason ?: string , // Optional reason (e.g., 'user_speaking', 'button_click')
}
Example:
// User starts speaking while agent is responding
webSocket . send ( JSON . stringify ({
type: 'interrupt' ,
reason: 'user_speaking' ,
}));
Agent behavior:
Aborts current LLM stream via AbortController
Clears speech queue
Sends speech_interrupted message
client_ready
(VideoAgent only) Signal that client is ready with capabilities.
{
type : 'client_ready' ,
capabilities ?: {
video: boolean ,
audio: boolean ,
// ... other capabilities
},
}
Agent behavior:
Sends session_init with session ID
Emits client_ready event
Agent → Client Messages
text_delta
Streaming text token from LLM.
{
type : 'text_delta' ,
id : string , // Chunk ID
text : string , // Text fragment
}
Client handling:
let fullText = '' ;
webSocket . addEventListener ( 'message' , ( event ) => {
const msg = JSON . parse ( event . data );
if ( msg . type === 'text_delta' ) {
fullText += msg . text ;
displayText . textContent = fullText ;
}
});
reasoning_delta
Streaming reasoning token (models with reasoning support).
{
type : 'reasoning_delta' ,
id : string ,
text : string ,
}
Tool invocation detected in LLM stream.
{
type : 'tool_call' ,
toolName : string ,
toolCallId : string ,
input : object , // Tool input arguments
}
Example:
{
type : 'tool_call' ,
toolName : 'getWeather' ,
toolCallId : 'call_abc123' ,
input : { location : 'San Francisco' },
}
Tool execution completed.
{
type : 'tool_result' ,
name : string ,
toolCallId : string ,
result : any , // Tool output
}
Example:
{
type : 'tool_result' ,
name : 'getWeather' ,
toolCallId : 'call_abc123' ,
result : { temperature : 72 , conditions : 'sunny' },
}
audio_chunk
Streaming audio chunk (TTS generated).
{
type : 'audio_chunk' ,
chunkId : number , // Sequential chunk ID
data : string , // Base64-encoded audio
format : string , // 'mp3', 'opus', 'wav', etc.
text : string , // Original text for this chunk
}
Client handling:
const audioQueue = [];
let isPlaying = false ;
webSocket . addEventListener ( 'message' , async ( event ) => {
const msg = JSON . parse ( event . data );
if ( msg . type === 'audio_chunk' ) {
const audioBlob = base64ToBlob ( msg . data , `audio/ ${ msg . format } ` );
audioQueue . push ( audioBlob );
if ( ! isPlaying ) {
playNextChunk ();
}
}
});
async function playNextChunk () {
if ( audioQueue . length === 0 ) {
isPlaying = false ;
return ;
}
isPlaying = true ;
const blob = audioQueue . shift ();
const audioUrl = URL . createObjectURL ( blob );
const audio = new Audio ( audioUrl );
audio . onended = () => {
URL . revokeObjectURL ( audioUrl );
playNextChunk ();
};
audio . play ();
}
audio
Full audio response (non-streaming fallback).
{
type : 'audio' ,
data : string , // Base64-encoded audio
format : string , // 'mp3', 'opus', 'wav', etc.
}
speech_stream_start
Speech generation started (streaming mode).
{
type : 'speech_stream_start' ,
}
speech_stream_end
All speech chunks sent.
{
type : 'speech_stream_end' ,
}
speech_interrupted
Speech generation cancelled.
{
type : 'speech_interrupted' ,
reason : string , // 'user_speaking', 'interrupted', etc.
}
Client handling:
webSocket . addEventListener ( 'message' , ( event ) => {
const msg = JSON . parse ( event . data );
if ( msg . type === 'speech_interrupted' ) {
// Stop audio playback immediately
audioQueue . length = 0 ;
currentAudio ?. pause ();
}
});
response_complete
Full LLM response finished.
{
type : 'response_complete' ,
text : string , // Full response text
}
capture_frame
(VideoAgent only) Request client to capture and send a video frame.
{
type : 'capture_frame' ,
reason : 'scene_change' | 'user_request' | 'timer' | 'initial' ,
timestamp : number ,
}
Client handling:
webSocket . addEventListener ( 'message' , ( event ) => {
const msg = JSON . parse ( event . data );
if ( msg . type === 'capture_frame' ) {
captureAndSendFrame ( msg . reason );
}
});
frame_ack
(VideoAgent only) Acknowledgment that frame was received.
{
type : 'frame_ack' ,
sequence : number ,
timestamp : number ,
}
session_init
(VideoAgent only) Session initialized with ID.
{
type : 'session_init' ,
sessionId : string ,
}
Error Handling
Errors are emitted as agent events, not WebSocket messages:
agent . on ( 'error' , ( error ) => {
console . error ( 'Agent error:' , error );
// Optionally send error to client via custom message
});
agent . on ( 'warning' , ( message ) => {
console . warn ( 'Agent warning:' , message );
});
To send errors to clients, implement custom error messages:
agent . on ( 'error' , ( error ) => {
if ( agent . connected ) {
agent . ws . send ({
type: 'error' ,
message: error . message ,
});
}
});
Example: Full Client Implementation
class VoiceClient {
private ws : WebSocket ;
private audioQueue : Blob [] = [];
private isPlaying = false ;
constructor ( url : string ) {
this . ws = new WebSocket ( url );
this . ws . addEventListener ( 'message' , this . handleMessage . bind ( this ));
}
private handleMessage ( event : MessageEvent ) {
const msg = JSON . parse ( event . data );
switch ( msg . type ) {
case 'text_delta' :
this . displayText ( msg . text );
break ;
case 'audio_chunk' :
this . queueAudio ( msg . data , msg . format );
break ;
case 'speech_interrupted' :
this . stopAudio ();
break ;
case 'tool_call' :
this . showToolExecution ( msg . toolName , msg . input );
break ;
}
}
sendAudio ( audioBlob : Blob ) {
const reader = new FileReader ();
reader . onload = () => {
const base64 = ( reader . result as string ). split ( ',' )[ 1 ];
this . ws . send ( JSON . stringify ({
type: 'audio' ,
data: base64 ,
format: 'webm' ,
}));
};
reader . readAsDataURL ( audioBlob );
}
sendText ( text : string ) {
this . ws . send ( JSON . stringify ({
type: 'transcript' ,
text ,
}));
}
interrupt () {
this . ws . send ( JSON . stringify ({
type: 'interrupt' ,
reason: 'user_request' ,
}));
}
private queueAudio ( base64Data : string , format : string ) {
const binary = atob ( base64Data );
const bytes = new Uint8Array ( binary . length );
for ( let i = 0 ; i < binary . length ; i ++ ) {
bytes [ i ] = binary . charCodeAt ( i );
}
const blob = new Blob ([ bytes ], { type: `audio/ ${ format } ` });
this . audioQueue . push ( blob );
if ( ! this . isPlaying ) {
this . playNextChunk ();
}
}
private async playNextChunk () {
if ( this . audioQueue . length === 0 ) {
this . isPlaying = false ;
return ;
}
this . isPlaying = true ;
const blob = this . audioQueue . shift () ! ;
const url = URL . createObjectURL ( blob );
const audio = new Audio ( url );
audio . onended = () => {
URL . revokeObjectURL ( url );
this . playNextChunk ();
};
audio . play ();
}
private stopAudio () {
this . audioQueue . length = 0 ;
this . isPlaying = false ;
}
private displayText ( text : string ) {
// Update UI with streaming text
}
private showToolExecution ( toolName : string , input : any ) {
// Show tool execution in UI
}
}
// Usage
const client = new VoiceClient ( 'ws://localhost:8080' );
client . sendText ( 'What is the weather in Paris?' );
Security Considerations
Always validate message types and data on both sides: // Server-side validation
private async handleMessage ( message : any ): Promise < void > {
if (!message.type || typeof message.type !== 'string' ) {
this . emit ( 'warning' , 'Invalid message format' );
return ;
}
if (message.type === 'audio' ) {
if ( typeof message . data !== 'string' || ! message . data ) {
this . emit ( 'warning' , 'Invalid audio data' );
return ;
}
}
// ...
}
Implement authentication and rate limiting: import WebSocket from 'ws' ;
const wss = new WebSocket . Server ({ port: 8080 });
wss . on ( 'connection' , ( socket , req ) => {
const token = req . headers [ 'authorization' ];
if ( ! validateToken ( token )) {
socket . close ( 1008 , 'Unauthorized' );
return ;
}
// Rate limit messages
let messageCount = 0 ;
const resetInterval = setInterval (() => {
messageCount = 0 ;
}, 60000 ); // Reset every minute
socket . on ( 'message' , ( data ) => {
messageCount ++ ;
if ( messageCount > 100 ) {
socket . close ( 1008 , 'Rate limit exceeded' );
clearInterval ( resetInterval );
return ;
}
// Process message
});
});
Protocol Versions
The current protocol is v1 (implicit). Future versions may include a version field:
{
type : 'connect' ,
protocolVersion : '1.0' ,
}
Next Steps
VoiceAgent Learn about the voice agent architecture
VideoAgent Understand video frame message types
Quick Start Build your first WebSocket client
API Reference Full API documentation