Realtime, Image & Media
OpenAI’s multimodal stack spans realtime voice, image generation, video generation, and speech processing, all available through a single platform.
Realtime API — Voice & Audio
The Realtime API enables low-latency voice interactions, live translation, and streaming speech processing.
Connection Methods
| Method | Protocol | Best For | Latency |
|---|---|---|---|
| WebRTC | Peer-to-peer, browser-native | Browser-based voice apps | Lowest |
| WebSocket | Persistent bidirectional | Server-side, custom clients | Low |
| SIP | Traditional telephony | Phone call integration | Moderate |
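The latency column in the table can be read as a simple selection rule: among the transports a client supports, prefer the one with the lowest latency. The sketch below is illustrative only. `TRANSPORTS` and `pick_transport` are hypothetical names, not part of the OpenAI SDK, and the numeric ranks only encode the table's Lowest / Low / Moderate ordering rather than measured latencies.

```python
# Relative latency ranks taken from the table above (lower is better).
# These are ordering hints, not measured latencies.
TRANSPORTS = {
    "webrtc": {"rank": 0, "best_for": "Browser-based voice apps"},
    "websocket": {"rank": 1, "best_for": "Server-side, custom clients"},
    "sip": {"rank": 2, "best_for": "Phone call integration"},
}


def pick_transport(available):
    """Return the lowest-latency transport among those the client supports."""
    candidates = [name.lower() for name in available if name.lower() in TRANSPORTS]
    if not candidates:
        raise ValueError(f"no supported transport in {sorted(available)!r}")
    return min(candidates, key=lambda name: TRANSPORTS[name]["rank"])
```

For example, `pick_transport(["websocket", "webrtc"])` returns `"webrtc"`, while a telephony-only client passing `["sip"]` gets `"sip"`. In practice the environment usually decides (a phone bridge cannot use WebRTC), so latency is only a tie-breaker.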
GPT Realtime 2 — Voice Agents
```python
# WebRTC connection for realtime voice
import asyncio
from openai import OpenAI

client = OpenAI()

# Create a realtime session
session = client.realtime.sessions.create(
    model="gpt-realtime-2",
    voice="alloy",
    instructions="You are a helpful customer support agent.",
)

# Send audio and receive responses
# Audio in:  $32 / 1M tokens ($0.40 cached)
# Audio out: $64 / 1M tokens
# Text in:   $4 / 1M tokens ($0.40 cached)
# Text out:  $24 / 1M tokens
```

Realtime Translation
Live speech-to-speech translation at $0.034/minute:
```python
session = client.realtime.sessions.create(
    model="gpt-realtime-translate",
    source_language="en",
    target_language="fr",
)
# Translates English speech to French speech in real time
```

Realtime Whisper — Streaming Transcription
Live speech-to-text at $0.017/minute:
```python
session = client.realtime.sessions.create(
    model="gpt-realtime-whisper",
)
# Streams transcription as the speaker talks
```

Traditional Speech-to-Text
```python
# GPT-4o Transcribe: high quality
# The context manager closes the audio file once the upload finishes.
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )
print(transcript.text)
```

Text-to-Speech
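```python
# GPT-4o mini TTS
# Stream the audio to disk instead of buffering the whole response in memory.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="nova",
    input="Welcome to the AI Playbook. Today we'll explore...",
) as response:
    response.stream_to_file("output.mp3")
```

GPT Image 2 — Image Generation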
State-of-the-art image generation with text and image input:
```python
response = client.images.generate(
    model="gpt-image-2",
    prompt=(
        "A futuristic AI research lab with holographic displays showing "
        "neural network architectures, cinematic lighting"
    ),
    size="1024x1024",
    quality="hd",
)
```

| Feature | Detail |
|---|---|
| Input (image) | … ($2 cached) |
| Output (image) | $30 / 1M tokens |
| Input (text) | … ($1.25 cached) |
| Maximum resolution | Up to 2048x2048 |
| Capabilities | Generation, editing, variation, inpainting |
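Since the table caps resolution at 2048x2048, it can be worth checking a `size` string locally before sending a request. The following is a minimal sketch under that assumption. `MAX_SIDE` and `parse_size` are hypothetical helpers, not SDK functions, and the API may well accept only a fixed set of sizes rather than every value under the cap.

```python
MAX_SIDE = 2048  # "Up to 2048x2048" from the table above


def parse_size(size):
    """Parse a 'WIDTHxHEIGHT' string and check it against MAX_SIDE."""
    width_text, sep, height_text = size.lower().partition("x")
    if not sep or not width_text.isdigit() or not height_text.isdigit():
        raise ValueError(f"expected 'WIDTHxHEIGHT', got {size!r}")
    width, height = int(width_text), int(height_text)
    if width == 0 or height == 0:
        raise ValueError("width and height must be positive")
    if width > MAX_SIDE or height > MAX_SIDE:
        raise ValueError(f"{size} exceeds the {MAX_SIDE}x{MAX_SIDE} maximum")
    return width, height
```

Calling `parse_size("1024x1024")` returns `(1024, 1024)`, while `parse_size("4096x4096")` raises `ValueError` before any request is made.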
Image Editing
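```python
# Open both files in a context manager so they are closed after the upload.
with open("original.png", "rb") as image, open("mask.png", "rb") as mask:
    response = client.images.edit(
        model="gpt-image-2",
        image=image,
        mask=mask,
        prompt="Replace the background with a modern office setting",
    )
```

Sora — Video Generation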
Cinematic video generation from text prompts, available via ChatGPT Pro ($200/mo) and API:
```python
response = client.video.generate(
    model="sora",
    prompt=(
        "A drone shot flying over a futuristic city at sunset, "
        "with flying cars and holographic billboards"
    ),
    duration=10,  # seconds
    resolution="1080p",
)
```

| Feature | Detail |
|---|---|
| Max duration | Up to 60 seconds |
| Resolution | Up to 4K |
| Capabilities | Text-to-video, image-to-video, video extension |
| Availability | ChatGPT Pro + API |
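The limits in this table (60 seconds, up to 4K) can be enforced client-side before a generation request is submitted. The sketch below assumes the parameter names used in the example call above. `VideoRequest` is a hypothetical helper, and the set of resolution labels is an assumption: only "1080p" and "4K" appear in this section.

```python
from dataclasses import dataclass

MAX_DURATION_S = 60  # "Up to 60 seconds" from the table above
# Assumed label set; only "1080p" and "4K" are mentioned in this section.
ALLOWED_RESOLUTIONS = {"720p", "1080p", "4k"}


@dataclass(frozen=True)
class VideoRequest:
    prompt: str
    duration: int  # seconds
    resolution: str = "1080p"

    def __post_init__(self):
        # Reject requests that fall outside the documented limits.
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if not 0 < self.duration <= MAX_DURATION_S:
            raise ValueError(f"duration must be 1-{MAX_DURATION_S} seconds")
        if self.resolution.lower() not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"unknown resolution {self.resolution!r}")

    def as_kwargs(self):
        """Keyword arguments in the shape used by the example call above."""
        return {
            "model": "sora",
            "prompt": self.prompt,
            "duration": self.duration,
            "resolution": self.resolution,
        }
```

A validated request can then be unpacked into the call, for example `VideoRequest("A drone shot over a city", 10).as_kwargs()`.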
Whisper / TTS — Traditional Audio
For non-realtime speech workloads, OpenAI offers traditional speech models:
| Model | Use Case | Cost |
|---|---|---|
| Whisper | Batch/offline speech-to-text | $0.006/minute |
| GPT-4o Transcribe | High-quality speech-to-text | Pay-per-token |
| GPT-4o mini Transcribe | Cost-efficient speech-to-text | Pay-per-token |
| GPT-4o mini TTS | Text-to-speech | Pay-per-token |
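The per-minute prices quoted on this page ($0.034 for realtime translation, $0.017 for realtime Whisper, $0.006 for batch Whisper) make rough budgeting straightforward. The sketch below assumes billing scales linearly with audio duration, with no rounding up to whole minutes, which the page does not confirm. `PRICE_PER_MINUTE` and `estimate_cost` are hypothetical names, and the `"whisper"` key is just a label for the batch model.

```python
from decimal import Decimal

# USD per minute of audio, as quoted on this page.
PRICE_PER_MINUTE = {
    "gpt-realtime-translate": Decimal("0.034"),
    "gpt-realtime-whisper": Decimal("0.017"),
    "whisper": Decimal("0.006"),
}


def estimate_cost(model, seconds):
    """Estimate the USD cost of processing `seconds` of audio with `model`.

    Assumes linear per-second billing; actual billing granularity may differ.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    minutes = Decimal(seconds) / 60
    return PRICE_PER_MINUTE[model] * minutes
```

Under these assumptions, one hour of batch transcription (`estimate_cost("whisper", 3600)`) comes to $0.36, compared with $1.02 for the same hour streamed through `gpt-realtime-whisper`. The token-priced models in the table are not covered because their cost depends on token counts rather than duration.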
Use Case Matrix
| Use Case | Best Tool |
|---|---|
| Browser-based voice agent | Realtime API (WebRTC) |
| Phone call integration | Realtime API (SIP) |
| Live translation | Realtime Translate |
| Meeting transcription | Realtime Whisper (streaming) or Whisper (batch) |
| Product images / marketing | GPT Image 2 |
| Video content / ads | Sora |
| Audiobook narration | TTS |
| Podcast transcription | Whisper |
For the MCP protocol and how it connects to OpenAI tools, see MCP & Integrations.