The Economics of Replacing Call Center Workers With AIs
TLDR: Voice AIs aren't that much cheaper in the year 2025.

My friend runs a voice agent startup in Canada for walk-in clinics. The AI takes calls and uses tools to book appointments in the EMR (electronic medical record) system. In theory, this helps the clinic hire fewer front desk staff and the startup makes infinite money. In reality, the margins are brutal and they barely charge above cost. This is surprising to me: surely a living, breathing, squishy human costs more per hour than a GPU in a datacenter somewhere?

An industry overview of voice AIs

Broadly speaking, there are 3 types of companies in the voice AI industry:

1. Foundation model companies
   1. These companies actually train the text-to-speech and realtime audio models
   2. OpenAI, ElevenLabs, Cartesia
2. Pipeline companies
   1. Infrastructure companies that aggregate multiple foundation model providers and help you experiment with them, build agents, and connect with SIP and WebRTC transports (think OpenRouter but with extra steps)
   2. Developer-focused: n8n, Bland, Vapi
   3. Enterprise-focused: Ada, Sierra, Fin
3. Vertical startups
   1. Startups that do "voice agents for {healthcare | logistics | real estate | etc.}"
   2. Here are 142 of them

Of course, these categories are fuzzy, and some companies vertically integrate across several layers (e.g. Vapi has its own foundation model for TTS).

The line by line breakdown

Let's dive into the heart of the stack, using Vapi as an example. Vapi works like a sandwich with a few flavors:

Speech to Text (STT) => LLM => Text to Speech (TTS)

* First, Deepgram converts the call audio to text (100 ms)
* Then, GPT-4o does text to text (600 ms)
* Finally, Vapi does text to speech (250 ms)
* Add in some latency sauce from WebRTC transport (100 ms) or Twilio phone service (600 ms)
* At a minimum this costs $0.15/minute
  * $0.05 for Vapi hosting
  * $0.01 for Deepgram Speech to Text
  * $0.07 for GPT-4o
  * $0.022 for Vapi Text to Sp