Building an AI Voice Agent to Automate a Robot Cafe with Google Gemini Live and MCP
By Gerard Sans
The launch of Alexa+ has sparked renewed excitement around the next generation of AI voice assistants powered by generative AI. With Gemini 2.5, the new Gemini Live API, and the Model Context Protocol (MCP), developers now have the tools to build voice-driven AI agents that integrate seamlessly into web applications, backend services, and third-party APIs.

In this talk we go beyond simple chatbot interactions to explore how AI agents can power real-world automation: in this case, running an entire robot cafe. We'll walk through building a voice-first assistant capable of executing complex workflows with MCP, streaming real-time audio, querying databases, and interacting with external services. This marks a shift from "ask and respond" to a more dynamic "talk, show, and act" experience.

You might assume taking a coffee order is straightforward, but even a basic interaction involves more than 15 distinct states. These include greeting the customer, handling the order flow, confirming selections, applying offer codes, managing exceptions, and supporting cancellations or changes. A sketch of such a state machine appears after the abstract.

Behind the scenes, the AI agent uses MCP to coordinate with multiple systems: fetching menu data, validating inputs, and triggering robotic actions. The second sketch below shows what exposing those systems as MCP tools can look like.

You'll learn how to stream microphone data, integrate Gemini voice responses, and use the GenAI SDK to connect everything together through MCP, as in the final sketch below. Instead of a traditional chat UI, this project creates a fully voice-automated, hands-free experience where the assistant doesn't just chat: it runs the operation.

Join us for a deep dive into the future of AI automation with MCP, where natural voice is the interface and the AI agent takes care of the rest, including your fancy choice of coffee!
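To make the state count concrete, here is a minimal sketch of an order state machine. It is illustrative only: the state names and the transition table are assumptions for this post, not the talk's actual code.

```python
from enum import Enum, auto

class OrderState(Enum):
    # Hypothetical states for a single coffee order; names are illustrative.
    GREETING = auto()
    TAKING_ORDER = auto()
    CLARIFYING_ITEM = auto()
    CONFIRMING_SELECTION = auto()
    APPLYING_OFFER_CODE = auto()
    VALIDATING_OFFER = auto()
    AMENDING_ORDER = auto()
    CANCELLING = auto()
    HANDLING_EXCEPTION = auto()
    CONFIRMING_TOTAL = auto()
    SENDING_TO_ROBOT = auto()
    BREWING = auto()
    READY_FOR_PICKUP = auto()
    FAREWELL = auto()
    ABORTED = auto()

# Partial transition table: the agent may only move along listed edges, so an
# unexpected utterance routes to HANDLING_EXCEPTION instead of derailing the order.
TRANSITIONS: dict[OrderState, set[OrderState]] = {
    OrderState.GREETING: {OrderState.TAKING_ORDER},
    OrderState.TAKING_ORDER: {
        OrderState.CLARIFYING_ITEM,
        OrderState.CONFIRMING_SELECTION,
        OrderState.CANCELLING,
    },
    OrderState.CONFIRMING_SELECTION: {
        OrderState.APPLYING_OFFER_CODE,
        OrderState.AMENDING_ORDER,
        OrderState.CONFIRMING_TOTAL,
    },
    # ... remaining edges elided
}

def advance(current: OrderState, requested: OrderState) -> OrderState:
    """Move to the requested state if the edge is legal, otherwise flag an exception."""
    return requested if requested in TRANSITIONS.get(current, set()) else OrderState.HANDLING_EXCEPTION
```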
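On the backend side, the cafe's systems can be exposed as MCP tools. Below is a minimal sketch using FastMCP from the official MCP Python SDK; the tool names, menu data, and offer codes are all made up for illustration.

```python
# Minimal MCP server exposing the cafe backend as tools.
# Requires the official MCP Python SDK: pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("robot-cafe")

# Stand-in for a real database query.
MENU = {"flat white": 3.50, "espresso": 2.20, "oat latte": 3.90}

@mcp.tool()
def get_menu() -> dict[str, float]:
    """Return the current menu with prices."""
    return MENU

@mcp.tool()
def validate_offer_code(code: str) -> bool:
    """Check whether an offer code is currently valid."""
    return code.upper() in {"FIRSTCUP", "LOYALTY10"}

@mcp.tool()
def start_brew(drink: str) -> str:
    """Trigger the robot barista; returns an order id for status polling."""
    if drink not in MENU:
        raise ValueError(f"Unknown drink: {drink}")
    return f"order-{abs(hash(drink)) % 10_000}"

if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport, so a local client can spawn it
```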
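Finally, here is a sketch of the voice loop itself, using the google-genai Python SDK's Live API. The model name and the two audio helpers are assumptions, and the Live API surface is still evolving across SDK versions, so treat this as the shape of the solution rather than a drop-in implementation.

```python
# pip install google-genai
import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # picks up GOOGLE_API_KEY from the environment
MODEL = "gemini-2.0-flash-live-001"  # assumption: any Live-capable model works

async def run_voice_agent(get_mic_chunks, play_audio):
    """get_mic_chunks yields 16 kHz PCM frames; play_audio plays raw bytes.
    Both are hypothetical helpers supplied by your audio stack."""
    # The config's tools field is where MCP-backed function declarations plug in.
    config = types.LiveConnectConfig(response_modalities=["AUDIO"])
    async with client.aio.live.connect(model=MODEL, config=config) as session:

        async def send_audio():
            # Stream raw microphone PCM to the model as it is captured.
            async for chunk in get_mic_chunks():
                await session.send_realtime_input(
                    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
                )

        async def play_responses():
            # The model streams its spoken reply back as audio bytes.
            async for message in session.receive():
                if message.data:
                    await play_audio(message.data)

        await asyncio.gather(send_audio(), play_responses())
```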