Realtime, Image & Media

OpenAI's multimodal capabilities — Realtime API (voice agents, live translation), GPT Image 2 (image generation), Sora (video), Whisper (speech-to-text), and TTS (text-to-speech).
Key Takeaways
  • Realtime API: Voice agents, live translation, speech transcription via WebRTC, WebSocket, or SIP
  • GPT Image 2: State-of-the-art image generation ($8/$30 per 1M image tokens)
  • Sora: Cinematic video generation available via ChatGPT Pro and API
  • Whisper/TTS: Speech-to-text and text-to-speech for traditional audio workloads

OpenAI's multimodal stack covers realtime voice, image generation, video generation, and speech processing, all available through a single platform and a single API client.

Realtime API — Voice & Audio

The Realtime API enables low-latency voice interactions, live translation, and streaming speech processing.

Connection Methods

Method    | Protocol                     | Best For                   | Latency
WebRTC    | Peer-to-peer, browser-native | Browser-based voice apps   | Lowest
WebSocket | Persistent bidirectional     | Server-side, custom clients | Low
SIP       | Traditional telephony        | Phone call integration     | Moderate
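
The table above can be expressed as a small lookup when an application needs to choose a transport from its deployment context. This is a minimal sketch: the `pick_transport` helper and the context labels (`"browser"`, `"server"`, `"phone"`) are hypothetical names chosen for illustration and are not part of any OpenAI SDK.

```python
# Hypothetical helper: maps a deployment context to the connection
# method suggested in the table above. Not an OpenAI SDK function.
TRANSPORT_BY_CONTEXT = {
    "browser": "WebRTC",    # browser-based voice apps, lowest latency
    "server": "WebSocket",  # server-side or custom clients
    "phone": "SIP",         # phone call integration
}


def pick_transport(context: str) -> str:
    """Return the suggested Realtime connection method for a context."""
    try:
        return TRANSPORT_BY_CONTEXT[context]
    except KeyError:
        raise ValueError(f"unknown context: {context!r}") from None
```

Raising on unknown contexts, rather than falling back to a default, keeps a misconfigured deployment from silently using a higher-latency transport.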

GPT Realtime 2 — Voice Agents

# Create a Realtime session on the server; a browser client can then
# use it to open the WebRTC connection.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Create a realtime session
session = client.realtime.sessions.create(
    model="gpt-realtime-2",
    voice="alloy",
    instructions="You are a helpful customer support agent.",
)

# Send audio and receive responses over the chosen transport.
# Pricing:
#   Audio in:  $32 / 1M tokens ($0.40 cached)
#   Audio out: $64 / 1M tokens
#   Text in:   $4 / 1M tokens ($0.40 cached)
#   Text out:  $24 / 1M tokens
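
To make the pricing above concrete, the per-token rates can be turned into a rough cost estimate for a session. This is a sketch under stated assumptions: the `estimate_cost` function and the category keys are hypothetical, the rates are copied from the comments above, and the split between cached and uncached input tokens is assumed to be known by the caller.

```python
# Per-1M-token rates in USD, copied from the pricing comments above.
RATES = {
    "audio_in": 32.00,
    "audio_in_cached": 0.40,
    "audio_out": 64.00,
    "text_in": 4.00,
    "text_in_cached": 0.40,
    "text_out": 24.00,
}


def estimate_cost(tokens: dict[str, int]) -> float:
    """Return the estimated USD cost for a mapping of category -> token count.

    Unknown categories raise KeyError so typos are not silently priced at zero.
    """
    return sum(count * RATES[kind] / 1_000_000 for kind, count in tokens.items())
```

For example, `estimate_cost({"audio_in": 500_000, "audio_out": 250_000})` prices half a million input audio tokens and a quarter million output audio tokens at the rates listed.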

Realtime Translation

Live speech-to-speech translation at $0.034/minute:

session = client.realtime.sessions.create(
    model="gpt-realtime-translate",
    source_language="en",
    target_language="fr",
)
# Translates English speech to French speech in real time

Realtime Whisper — Streaming Transcription

Live speech-to-text at $0.017/minute:

session = client.realtime.sessions.create(
    model="gpt-realtime-whisper",
)
# Streams transcription as the speaker talks

Traditional Speech-to-Text

# GPT-4o Transcribe — high quality
# Open the file in binary mode; the context manager closes it after upload.
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )
print(transcript.text)
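
When several recordings need transcribing, the same call can be wrapped in a loop. The sketch below assumes `client` is an `openai.OpenAI` instance (or any object exposing the same `audio.transcriptions.create` call); the `transcribe_all` helper is a hypothetical name introduced here for illustration.

```python
from pathlib import Path


def transcribe_all(client, paths, model="gpt-4o-transcribe"):
    """Transcribe each audio file and return {filename: transcript text}.

    `client` is expected to behave like `openai.OpenAI()`.
    """
    results = {}
    for path in map(Path, paths):
        # Open in binary mode; the handle is closed once the request returns.
        with path.open("rb") as audio_file:
            transcript = client.audio.transcriptions.create(
                model=model,
                file=audio_file,
            )
        results[path.name] = transcript.text
    return results
```

Passing the client in as an argument, rather than creating it inside the function, makes the helper easy to exercise with a stub in tests.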

Text-to-Speech

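# GPT-4o mini TTS
# Use the streaming response so audio is written to disk as it arrives.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="nova",
    input="Welcome to the AI Playbook. Today we'll explore...",
) as response:
    response.stream_to_file("output.mp3")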
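
Long inputs, such as audiobook chapters, usually have to be split before being sent to a speech endpoint, because each request accepts a limited amount of text. The sketch below splits at sentence boundaries; the `chunk_text` helper is hypothetical, and the 4096-character default is an assumption that should be replaced with the documented limit for the model in use.

```python
import re


def chunk_text(text: str, max_chars: int = 4096) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    A single sentence longer than max_chars is hard-split as a fallback.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any sentence that cannot fit in one chunk by itself.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed as `input` to its own speech request, and the resulting audio files concatenated in order.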

GPT Image 2 — Image Generation

State-of-the-art image generation with text and image input:

response = client.images.generate(
    model="gpt-image-2",
    prompt="A futuristic AI research lab with holographic displays showing neural network architectures, cinematic lighting",
    size="1024x1024",
    quality="high",  # GPT Image models take low/medium/high/auto; "hd" is a DALL·E 3 value
)

Feature            | Detail
Input (image)      | $8 / 1M tokens ($2 cached)
Output (image)     | $30 / 1M tokens
Input (text)       | $5 / 1M tokens ($1.25 cached)
Maximum resolution | Up to 2048x2048
Capabilities       | Generation, editing, variation, inpainting
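
The generation call returns image data rather than writing a file. Assuming the response carries a base64-encoded payload at `response.data[0].b64_json` (as earlier GPT Image models do; this is an assumption for GPT Image 2), a minimal sketch for saving it could look like the following. The `save_b64_image` helper is a hypothetical name.

```python
import base64
from pathlib import Path


def save_b64_image(b64_data: str, path: str) -> int:
    """Decode a base64-encoded image payload and write it to `path`.

    Returns the number of bytes written.
    """
    image_bytes = base64.b64decode(b64_data)
    return Path(path).write_bytes(image_bytes)
```

Under the assumption above, usage would be `save_b64_image(response.data[0].b64_json, "lab.png")`.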

Image Editing

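# Open both files in binary mode; the context manager closes them after upload.
with open("original.png", "rb") as image, open("mask.png", "rb") as mask:
    response = client.images.edit(
        model="gpt-image-2",
        image=image,
        mask=mask,
        prompt="Replace the background with a modern office setting",
    )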
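
Masked edits generally expect the mask to have the same pixel dimensions as the source image; that requirement is documented for earlier OpenAI image models and is assumed here to carry over. A quick local check before uploading can be sketched by reading the dimensions from each PNG header. The `png_size` and `same_dimensions` helpers are hypothetical names.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from a PNG's IHDR chunk."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a valid PNG header")
    # Width and height are big-endian 32-bit integers at bytes 16-23.
    return struct.unpack(">II", data[16:24])


def same_dimensions(image: bytes, mask: bytes) -> bool:
    """True when the image and its mask have identical pixel dimensions."""
    return png_size(image) == png_size(mask)
```

Only the first 24 bytes of each file are inspected, so the check is cheap even for large images.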

Sora — Video Generation

Cinematic video generation from text prompts, available via ChatGPT Pro ($200/mo) and API:

response = client.video.generate(
    model="sora",
    prompt="A drone shot flying over a futuristic city at sunset, with flying cars and holographic billboards",
    duration=10,  # seconds
    resolution="1080p",
)

Feature      | Detail
Max duration | Up to 60 seconds
Resolution   | Up to 4K
Capabilities | Text-to-video, image-to-video, video extension
Availability | ChatGPT Pro + API
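
Because video generation is comparatively slow and expensive, it can help to validate a request against the limits in the table above before submitting it. This is a sketch only: the `validate_video_request` helper is hypothetical, and the resolution labels are illustrative assumptions rather than values confirmed by the API.

```python
MAX_DURATION_SECONDS = 60  # "Max duration" row in the table above

# Hypothetical resolution labels, up to the 4K ceiling listed above.
RESOLUTIONS = ("720p", "1080p", "4k")


def validate_video_request(duration: int, resolution: str) -> None:
    """Raise ValueError if the request exceeds the limits listed above."""
    if not 0 < duration <= MAX_DURATION_SECONDS:
        raise ValueError(f"duration must be 1-{MAX_DURATION_SECONDS} seconds")
    if resolution.lower() not in RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution!r}")
```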

Whisper / TTS — Traditional Audio

For non-realtime speech workloads, OpenAI offers traditional speech models:

Model                  | Use Case                      | Cost
Whisper                | Batch/offline speech-to-text  | $0.006/minute
GPT-4o Transcribe      | High-quality speech-to-text   | Pay-per-token
GPT-4o mini Transcribe | Cost-efficient speech-to-text | Pay-per-token
GPT-4o mini TTS        | Text-to-speech                | Pay-per-token
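
The per-minute prices quoted in this guide make it straightforward to compare streaming and batch transcription for a given volume of audio. A minimal sketch follows, with hypothetical option labels and the assumption that billing scales linearly with minutes.

```python
# USD per minute, as quoted in this guide.
USD_PER_MINUTE = {
    "realtime-whisper": 0.017,  # streaming transcription
    "whisper-batch": 0.006,     # batch/offline transcription
}


def transcription_cost(option: str, minutes: float) -> float:
    """Return the estimated USD cost of transcribing `minutes` of audio."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return USD_PER_MINUTE[option] * minutes
```

At these rates, streaming costs a little under three times as much as batch per minute of audio, so batch is the cheaper choice whenever live results are not required.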

Use Case Matrix

Use Case                   | Best Tool
Browser-based voice agent  | Realtime API (WebRTC)
Phone call integration     | Realtime API (SIP)
Live translation           | Realtime Translate
Meeting transcription      | Realtime Whisper (streaming) or Whisper (batch)
Product images / marketing | GPT Image 2
Video content / ads        | Sora
Audiobook narration        | TTS
Podcast transcription      | Whisper

For the MCP protocol and how it connects to OpenAI tools, see MCP & Integrations.