Fluent Enough

Google’s latest translation model preserves tone and intonation across seventy languages, but conversation requires more than words.

Jun 12, 2026

Standing in a restaurant abroad with a menu in an unfamiliar script used to mean pointing at what the next table ordered and hoping for the best. The standard workaround, typing a few halting words into a translation app and showing the screen to the waiter, worked after a fashion, but it turned every exchange into a small production. Several developments in recent months have changed what a phone or a pair of earbuds can do in that moment. The tools now handle the words reliably, and the gap that remains has less to do with language than with the experience of talking to another person.

A trillion words a month

Google released Gemini 3.5 Live Translate this week across the Translate app on Android and iOS, Google Meet, and a developer API. The model translates speech to speech in over seventy languages and more than two thousand language combinations, and it does so continuously, generating output while the speaker is still talking. Earlier systems operated in turns, creating a rhythm closer to a walkie-talkie than a conversation. Gemini 3.5 stays a few seconds behind the speaker throughout, producing translated audio that preserves the original intonation, pacing, and pitch. On Android, a new listening mode lets you hold the phone to your ear like a regular call and hear the translation through the earpiece, no headphones required. Google Meet will use the same model to expand its speech translation from five supported languages to over seventy, a change scheduled to roll out to business customers later this year.

The hardware layer has matured alongside the software. Pixel Buds Pro 2, Samsung Galaxy Buds3 Pro, and dedicated translation earbuds from companies like iFLYTEK and Vasco now function as real-time interpreters, with each speaker wearing one earbud and hearing the other’s words in their own language. Vasco’s E1 supports up to ten simultaneous earbuds for group conversations. For text, Google Lens overlays translations onto menus and street signs through the phone’s camera, and Circle to Search on Android translates anything visible on screen with a long press.

The cumulative effect for anyone traveling this summer is that the transactional layer of communication abroad works. You can order dinner in a language you do not speak, ask a taxi driver for a route change, or read a document in a foreign script, and the tools will get the meaning across. Google Translate turned twenty in April and now processes roughly a trillion words per month across its products. Grab, the Southeast Asian ride-hailing company, is testing Gemini 3.5 Live Translate for driver-rider communication across more than ten million monthly voice calls, a deployment at scale that suggests the accuracy question has been largely settled.

Behind the beat

The friction that remains sits in the timing. Even continuous translation runs a few seconds behind the speaker, and those seconds matter more than they sound. Conversation depends on micro-signals that operate on tight timing, from well-placed nods to brief pauses that invite the other person to speak. A two-second delay collapses the structure of turn-taking. You laugh at a joke after the speaker has already moved on, or signal agreement after the moment for agreement has passed. The exchange functions, but it does not flow.

Translated speech can now carry a rising tone of surprise, a flat delivery of skepticism, or a warm cadence meant to put someone at ease. Those qualities were largely lost a year ago, and their preservation in Gemini 3.5 represents genuine progress. The mismatch that persists is temporal. You hear someone’s warmth a beat after you have already seen their face move on to the next thought, and that small asynchrony leaves the exchange feeling mediated.

For most everyday purposes, this distinction is academic. A taxi ride, a restaurant order, and a request for directions do not require conversational intimacy, and the tools handle those exchanges well. The gap surfaces in sustained interaction, the kind where understanding the person matters as much as understanding the words. A dinner with someone who speaks a different language, a negotiation conducted through earbuds, and a long conversation with a new acquaintance all reveal how much of communication still travels through channels that translation has yet to reach.

Already good enough

The volume of use suggests that people have already decided the tools work well enough to rely on. Most of those interactions are new ones, exchanges that would not have happened at all if someone had to speak the other person’s language or find an interpreter. The temporal gap between meaning and feeling may prove the hardest to close, and seamless cross-language conversation is still some way off. For now, the tools handle the words, and for most of what travel and daily life demand, the words turn out to be enough.

AI Central
