WebSocket Protocol - Voice Agent AI SDK

Overview

The Voice Agent SDK uses WebSocket for bidirectional real-time communication between clients and agents. The protocol supports:

Audio/video frame transmission
Text transcripts
Streaming text responses
Speech audio chunks
Tool execution events
Lifecycle management

Connection Lifecycle

Message Format

All messages are JSON objects with a type field:

interface BaseMessage {
  type: string;
  [key: string]: any;
}

Client → Agent Messages

transcript

Send transcribed text from speech-to-text (bypasses agent transcription).

{
  type: 'transcript',
  text: string,  // User speech text
}

Example:

webSocket.send(JSON.stringify({
  type: 'transcript',
  text: 'What is the weather in San Francisco?',
}));

Agent behavior:

Interrupts current LLM stream and speech
Queues text input for processing
(VideoAgent only) Requests frame capture

audio

Send audio data for transcription by the agent.

{
  type: 'audio',
  data: string,      // Base64-encoded audio
  format?: string,   // Audio format (e.g., 'webm', 'mp3')
}

Example:

// Record audio from microphone
const audioBlob = await mediaRecorder.stop();
const base64Audio = await blobToBase64(audioBlob);

webSocket.send(JSON.stringify({
  type: 'audio',
  data: base64Audio,
  format: 'webm',
}));

Agent behavior:

Interrupts current response
Transcribes audio using transcriptionModel
Emits transcription event
Queues transcribed text for processing
(VideoAgent only) Requests frame capture

video_frame

(VideoAgent only) Send a video frame for vision analysis.

{
  type: 'video_frame',
  sessionId: string,
  sequence: number,
  timestamp: number,
  triggerReason: 'scene_change' | 'user_request' | 'timer' | 'initial',
  previousFrameRef?: string,
  image: {
    data: string,    // Base64-encoded image
    format: string,  // 'webp', 'jpeg', 'png'
    width: number,
    height: number,
  },
}

Example:

const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
canvas.width = video.videoWidth;
canvas.height = video.videoHeight;
ctx.drawImage(video, 0, 0);

const frameData = canvas.toDataURL('image/webp', 0.8).split(',')[1];

webSocket.send(JSON.stringify({
  type: 'video_frame',
  sessionId: sessionId,
  sequence: frameCount++,
  timestamp: Date.now(),
  triggerReason: 'user_request',
  image: {
    data: frameData,
    format: 'webp',
    width: canvas.width,
    height: canvas.height,
  },
}));

Agent behavior:

Validates frame size (rejects if exceeds maxFrameInputSize)
Updates visual context buffer
Emits frame_received event
Sends frame_ack confirmation

interrupt

Request cancellation of current LLM stream and speech generation.

{
  type: 'interrupt',
  reason?: string,  // Optional reason (e.g., 'user_speaking', 'button_click')
}

Example:

// User starts speaking while agent is responding
webSocket.send(JSON.stringify({
  type: 'interrupt',
  reason: 'user_speaking',
}));

Agent behavior:

Aborts current LLM stream via AbortController
Clears speech queue
Sends speech_interrupted message

client_ready

(VideoAgent only) Signal that client is ready with capabilities.

{
  type: 'client_ready',
  capabilities?: {
    video: boolean,
    audio: boolean,
    // ... other capabilities
  },
}

Agent behavior:

Sends session_init with session ID
Emits client_ready event

Agent → Client Messages

text_delta

Streaming text token from LLM.

{
  type: 'text_delta',
  id: string,   // Chunk ID
  text: string, // Text fragment
}

Client handling:

let fullText = '';

webSocket.addEventListener('message', (event) => {
  const msg = JSON.parse(event.data);
  
  if (msg.type === 'text_delta') {
    fullText += msg.text;
    displayText.textContent = fullText;
  }
});

reasoning_delta

Streaming reasoning token (models with reasoning support).

{
  type: 'reasoning_delta',
  id: string,
  text: string,
}

tool_call

Tool invocation detected in LLM stream.

{
  type: 'tool_call',
  toolName: string,
  toolCallId: string,
  input: object,  // Tool input arguments
}

Example:

{
  type: 'tool_call',
  toolName: 'getWeather',
  toolCallId: 'call_abc123',
  input: { location: 'San Francisco' },
}

tool_result

Tool execution completed.

{
  type: 'tool_result',
  name: string,
  toolCallId: string,
  result: any,  // Tool output
}

Example:

{
  type: 'tool_result',
  name: 'getWeather',
  toolCallId: 'call_abc123',
  result: { temperature: 72, conditions: 'sunny' },
}

audio_chunk

Streaming audio chunk (TTS generated).

{
  type: 'audio_chunk',
  chunkId: number,  // Sequential chunk ID
  data: string,     // Base64-encoded audio
  format: string,   // 'mp3', 'opus', 'wav', etc.
  text: string,     // Original text for this chunk
}

Client handling:

const audioQueue = [];
let isPlaying = false;

webSocket.addEventListener('message', async (event) => {
  const msg = JSON.parse(event.data);
  
  if (msg.type === 'audio_chunk') {
    const audioBlob = base64ToBlob(msg.data, `audio/${msg.format}`);
    audioQueue.push(audioBlob);
    
    if (!isPlaying) {
      playNextChunk();
    }
  }
});

async function playNextChunk() {
  if (audioQueue.length === 0) {
    isPlaying = false;
    return;
  }
  
  isPlaying = true;
  const blob = audioQueue.shift();
  const audioUrl = URL.createObjectURL(blob);
  const audio = new Audio(audioUrl);
  
  audio.onended = () => {
    URL.revokeObjectURL(audioUrl);
    playNextChunk();
  };
  
  audio.play();
}

audio

Full audio response (non-streaming fallback).

{
  type: 'audio',
  data: string,   // Base64-encoded audio
  format: string, // 'mp3', 'opus', 'wav', etc.
}

speech_stream_start

Speech generation started (streaming mode).

{
  type: 'speech_stream_start',
}

speech_stream_end

All speech chunks sent.

{
  type: 'speech_stream_end',
}

speech_interrupted

Speech generation cancelled.

{
  type: 'speech_interrupted',
  reason: string,  // 'user_speaking', 'interrupted', etc.
}

Client handling:

webSocket.addEventListener('message', (event) => {
  const msg = JSON.parse(event.data);
  
  if (msg.type === 'speech_interrupted') {
    // Stop audio playback immediately
    audioQueue.length = 0;
    currentAudio?.pause();
  }
});

response_complete

Full LLM response finished.

{
  type: 'response_complete',
  text: string,  // Full response text
}

capture_frame

(VideoAgent only) Request client to capture and send a video frame.

{
  type: 'capture_frame',
  reason: 'scene_change' | 'user_request' | 'timer' | 'initial',
  timestamp: number,
}

Client handling:

webSocket.addEventListener('message', (event) => {
  const msg = JSON.parse(event.data);
  
  if (msg.type === 'capture_frame') {
    captureAndSendFrame(msg.reason);
  }
});

frame_ack

(VideoAgent only) Acknowledgment that frame was received.

{
  type: 'frame_ack',
  sequence: number,
  timestamp: number,
}

session_init

(VideoAgent only) Session initialized with ID.

{
  type: 'session_init',
  sessionId: string,
}

Error Handling

Errors are emitted as agent events, not WebSocket messages:

agent.on('error', (error) => {
  console.error('Agent error:', error);
  // Optionally send error to client via custom message
});

agent.on('warning', (message) => {
  console.warn('Agent warning:', message);
});

To send errors to clients, implement custom error messages:

agent.on('error', (error) => {
  if (agent.connected) {
    agent.ws.send({
      type: 'error',
      message: error.message,
    });
  }
});

Example: Full Client Implementation

class VoiceClient {
  private ws: WebSocket;
  private audioQueue: Blob[] = [];
  private isPlaying = false;

  constructor(url: string) {
    this.ws = new WebSocket(url);
    this.ws.addEventListener('message', this.handleMessage.bind(this));
  }

  private handleMessage(event: MessageEvent) {
    const msg = JSON.parse(event.data);

    switch (msg.type) {
      case 'text_delta':
        this.displayText(msg.text);
        break;

      case 'audio_chunk':
        this.queueAudio(msg.data, msg.format);
        break;

      case 'speech_interrupted':
        this.stopAudio();
        break;

      case 'tool_call':
        this.showToolExecution(msg.toolName, msg.input);
        break;
    }
  }

  sendAudio(audioBlob: Blob) {
    const reader = new FileReader();
    reader.onload = () => {
      const base64 = (reader.result as string).split(',')[1];
      this.ws.send(JSON.stringify({
        type: 'audio',
        data: base64,
        format: 'webm',
      }));
    };
    reader.readAsDataURL(audioBlob);
  }

  sendText(text: string) {
    this.ws.send(JSON.stringify({
      type: 'transcript',
      text,
    }));
  }

  interrupt() {
    this.ws.send(JSON.stringify({
      type: 'interrupt',
      reason: 'user_request',
    }));
  }

  private queueAudio(base64Data: string, format: string) {
    const binary = atob(base64Data);
    const bytes = new Uint8Array(binary.length);
    for (let i = 0; i < binary.length; i++) {
      bytes[i] = binary.charCodeAt(i);
    }
    const blob = new Blob([bytes], { type: `audio/${format}` });
    this.audioQueue.push(blob);

    if (!this.isPlaying) {
      this.playNextChunk();
    }
  }

  private async playNextChunk() {
    if (this.audioQueue.length === 0) {
      this.isPlaying = false;
      return;
    }

    this.isPlaying = true;
    const blob = this.audioQueue.shift()!;
    const url = URL.createObjectURL(blob);
    const audio = new Audio(url);

    audio.onended = () => {
      URL.revokeObjectURL(url);
      this.playNextChunk();
    };

    audio.play();
  }

  private stopAudio() {
    this.audioQueue.length = 0;
    this.isPlaying = false;
  }

  private displayText(text: string) {
    // Update UI with streaming text
  }

  private showToolExecution(toolName: string, input: any) {
    // Show tool execution in UI
  }
}

// Usage
const client = new VoiceClient('ws://localhost:8080');
client.sendText('What is the weather in Paris?');

Security Considerations

Always validate message types and data on both sides:

// Server-side validation
private async handleMessage(message: any): Promise<void> {
  if (!message.type || typeof message.type !== 'string') {
    this.emit('warning', 'Invalid message format');
    return;
  }

  if (message.type === 'audio') {
    if (typeof message.data !== 'string' || !message.data) {
      this.emit('warning', 'Invalid audio data');
      return;
    }
  }
  // ...
}

Implement authentication and rate limiting:

import WebSocket from 'ws';

const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (socket, req) => {
  const token = req.headers['authorization'];
  
  if (!validateToken(token)) {
    socket.close(1008, 'Unauthorized');
    return;
  }
  
  // Rate limit messages
  let messageCount = 0;
  const resetInterval = setInterval(() => {
    messageCount = 0;
  }, 60000); // Reset every minute
  
  socket.on('message', (data) => {
    messageCount++;
    if (messageCount > 100) {
      socket.close(1008, 'Rate limit exceeded');
      clearInterval(resetInterval);
      return;
    }
    // Process message
  });
});

Protocol Versions

The current protocol is v1 (implicit). Future versions may include a version field:

{
  type: 'connect',
  protocolVersion: '1.0',
}

Next Steps

VoiceAgent

Learn about the voice agent architecture

VideoAgent

Understand video frame message types

Quick Start

Build your first WebSocket client

API Reference

Full API documentation

Documentation Index

​Overview

​Connection Lifecycle

​Message Format

​Client → Agent Messages

​transcript

​audio

​video_frame

​interrupt

​client_ready

​Agent → Client Messages

​text_delta

​reasoning_delta

​tool_call

​tool_result

​audio_chunk

​audio

​speech_stream_start

​speech_stream_end

​speech_interrupted

​response_complete

​capture_frame

​frame_ack

​session_init

​Error Handling

​Example: Full Client Implementation

​Security Considerations

​Protocol Versions

​Next Steps

VoiceAgent

VideoAgent

Quick Start

API Reference

Overview

Connection Lifecycle

Message Format

Client → Agent Messages

transcript

audio

video_frame

interrupt

client_ready

Agent → Client Messages

text_delta

reasoning_delta

tool_call

tool_result

audio_chunk

audio

speech_stream_start

speech_stream_end

speech_interrupted

response_complete

capture_frame

frame_ack

session_init

Error Handling

Example: Full Client Implementation

Security Considerations

Protocol Versions

Next Steps