Over the past few months, I’ve been experimenting with different ways to implement high-performance Voice AI agents for my personal project, Enkitalki, an AI language-tutor app. It’s been quite a journey: starting with quick prototypes, moving through a fully custom backend, and eventually settling on an open-source framework that hit the sweet spot between flexibility, cost, and maintainability.
In this post, I’ll walk you through my three major iterations, what worked, what didn’t, and why I’d recommend the path I ultimately took if you’re building your own real-time conversational AI.
Iteration 1: The Quick Start with Vapi API
When I first started, I wanted to get something running quickly, so I turned to the Vapi API, a service that lets you combine speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) with minimal setup. Vapi wires those parts together into a single AI voice agent, and each stage is highly configurable.
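To give a sense of what “minimal setup” means in practice, here’s roughly what creating an assistant through the API looks like. I’m writing the payload from memory, so treat the field names (transcriber, model, voice, and so on) as assumptions and check Vapi’s API reference before reusing any of it.

```python
import requests

# Illustrative only: the payload fields below are from memory and may not
# match the current Vapi API exactly -- consult the official docs first.
VAPI_API_KEY = "your-private-key"  # placeholder

assistant = {
    "name": "Spanish tutor",
    "transcriber": {"provider": "deepgram", "language": "en"},   # STT stage
    "model": {                                                   # LLM stage
        "provider": "openai",
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "You are a friendly Spanish tutor."}
        ],
    },
    "voice": {"provider": "11labs", "voiceId": "some-voice-id"},  # TTS stage
    "firstMessage": "Hola! Ready to practice?",
}

resp = requests.post(
    "https://api.vapi.ai/assistant",
    headers={"Authorization": f"Bearer {VAPI_API_KEY}"},
    json=assistant,
    timeout=30,
)
print(resp.status_code, resp.json().get("id"))
```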
What I liked:
- Very easy to configure
- Flexible choice of models for each stage (STT, LLM, TTS)
- Nice developer experience for rapid prototyping
The downsides:
- Cost: pricing made it less viable at scale
- Model limitations: newer, faster, or cheaper models often weren’t available yet
- Poor integration with React Native and mobile platforms (Android, iOS)
For an MVP, Vapi is fantastic. But for a production-grade system where cost efficiency and model variety matter, I quickly had to look for other options.
Iteration 2: Building the Backend from Scratch
The next step was… ambitious. I decided to implement the full pipeline myself — connecting STT, LLM, and TTS directly, handling everything from streaming to context management in a Python backend service.
That meant:
- Receiving transcribed speech and grouping it into meaningful chunks
- Sending those chunks to the LLM
- Streaming the LLM’s response into the TTS engine to reduce perceived latency
- Streaming TTS output back to the client
- Implementing interruption handling (barge-in) and noise detection for smooth back-and-forth conversation
The goal was to get human-like conversation speed while keeping everything modular, configurable, and replaceable.
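To show what that meant in practice, here’s a stripped-down sketch of the loop, with the vendor calls (streaming STT, LLM, and TTS) replaced by stubs so it runs on its own. This isn’t my production code, just the shape of the pipeline I was maintaining.

```python
import asyncio

# Stand-ins for the vendor streaming APIs (STT, LLM, TTS) so the control
# flow below runs by itself; the real backend swapped each stub for the
# provider's streaming client.

async def stt_utterances():
    """Yields grouped utterances, as if emitted after end-of-speech detection."""
    for text in ["hi there", "how do you say 'library' in Spanish?"]:
        await asyncio.sleep(0.2)
        yield text

async def llm_tokens(history):
    """Yields a streamed LLM reply for the given conversation history."""
    for token in ["You ", "say ", "'biblioteca'. ", "Want ", "an ", "example?"]:
        await asyncio.sleep(0.05)
        yield token

async def tts_audio(sentence):
    """Yields audio chunks for one sentence."""
    yield f"<audio:{sentence.strip()}>".encode()

async def speak(sentence, send_audio):
    async for chunk in tts_audio(sentence):
        await send_audio(chunk)

async def handle_turn(history, send_audio):
    """Stream the LLM reply and start TTS per sentence to cut perceived latency."""
    reply, sentence = "", ""
    async for token in llm_tokens(history):
        reply += token
        sentence += token
        if sentence.rstrip().endswith((".", "?", "!")):
            await speak(sentence, send_audio)
            sentence = ""
    if sentence.strip():
        await speak(sentence, send_audio)
    history.append({"role": "assistant", "content": reply})

async def main():
    history = [{"role": "system", "content": "You are a friendly language tutor."}]

    async def send_audio(chunk):  # stand-in for the websocket back to the app
        print("->", chunk.decode())

    async for utterance in stt_utterances():
        history.append({"role": "user", "content": utterance})
        turn = asyncio.create_task(handle_turn(history, send_audio))
        # Barge-in in the real service: if STT reports new speech while the
        # agent is talking, cancel this task to stop TTS mid-utterance.
        await turn

asyncio.run(main())
```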
In the end, it worked… eventually. But it was complex, brittle, and time-consuming to maintain. To build a truly performant, production-ready AI voice agent, there are many small features that need to be implemented: barge-in, streaming, passing chunks of audio between client and server, and more.
Iteration 3: Pipecat Framework
Then I discovered Pipecat, an open-source Python framework for building real-time voice and multimodal AI agents.
Pipecat essentially abstracts away all the glue code I had been writing by hand (see the rough sketch after this list):
- Standardized chunk passing between components
- Built-in context storage
- Easy model swapping with many different models and services supported
- Streaming support baked in
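To give a sense of how much glue disappears, here’s a rough sketch of what an agent looks like once those pieces are wired through Pipecat. Module paths and service class names move around between Pipecat releases, and this isn’t my exact Enkitalki configuration, so treat the imports and constructor arguments as approximate and check the docs for the version you install.

```python
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService, OpenAITTSService


async def run_agent(transport):
    """transport = the WebRTC/WebSocket transport that talks to the client app."""
    stt = DeepgramSTTService(api_key="...")
    llm = OpenAILLMService(api_key="...", model="gpt-4o-mini")
    tts = OpenAITTSService(api_key="...", voice="alloy")

    context = OpenAILLMContext(
        [{"role": "system", "content": "You are a friendly language tutor."}]
    )
    aggregators = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),        # audio frames in from the client
        stt,                      # speech -> text frames
        aggregators.user(),       # append user turns to the shared context
        llm,                      # streamed completion frames
        tts,                      # text frames -> audio frames
        transport.output(),       # audio frames out to the client
        aggregators.assistant(),  # append assistant turns to the context
    ])
    await PipelineRunner().run(PipelineTask(pipeline))
```

Each processor passes typed frames to the next one, so streaming, chunking, and context aggregation come from the framework rather than from hand-written glue.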
Rewriting the project with Pipecat was pretty fast. I used “vibe coding” and AI to execute the migration, and it went smoothly. Some parts required digging into Pipecat’s codebase; for example, updating the LLM context on demand wasn’t documented very well. But overall, the experience was good 👍.
As a result, I was able to achieve:
- Cost savings: ~3× cheaper than Vapi for my use case
- Better performance: I could integrate faster and more accurate models for TTS and STT
- Flexibility: Easy to customize specific logic without breaking the pipeline
In my case, for example, switching from OpenAI’s TTS model to minimax/speech-02-turbo or kokoro-82m saved a lot of money, and using Gladia for STT helped as well. However, you need to test models for your own use case, as each comes with trade-offs. I’d suggest using artificialanalysis.ai, which provides valuable comparison data for initial decisions.
Lessons Learned & Recommendations
If you’re thinking about building your own Voice AI agent:
- For prototyping – Start with something like Vapi API to validate the concept quickly.
- For production – I’d suggest Pipecat, though other frameworks can work too; either way, using a framework makes sense even for small projects.
Also:
- Model experimentation is key – The landscape changes fast; choose STT/TTS/LLM models based on your needs for latency, quality, and cost. For example, in my experience, GPT-5 is too slow for Voice AI agent use cases.
- Streaming is essential – The closer you get to real-time, the more natural the user experience. Without streaming in all stages, the experience feels too laggy.
Today, it’s entirely possible to build low-cost, high-performance Voice AI agents without massive infrastructure investment. Frameworks like Pipecat make it both approachable and scalable.
So — dive in, experiment, and have fun. The tools are ready.