Google Gemini 3.1 Flash TTS: Audio Tags Make AI Voice Control Fun

What are audio tags in Gemini 3.1 Flash TTS?

Audio tags are natural language commands you embed directly into text input to control vocal style, pace, and delivery. You can instruct the AI to whisper, shout, or add dramatic pauses, which gives you granular control over speech output.

How does Gemini 3.1 Flash TTS compare to other TTS models?

It scored an Elo of 1,211 on the Artificial Analysis TTS leaderboard, and Artificial Analysis placed it in its 'most attractive quadrant' for balancing high-quality speech with low cost. It also supports native multi-speaker dialogue and over 70 languages.

Is Gemini 3.1 Flash TTS available to everyone?

It's rolling out in preview starting now. Developers can access it via the Gemini API and Google AI Studio, enterprises can use it on Vertex AI, and Workspace users will see it in Google Vids.

Google just dropped Gemini 3.1 Flash TTS, their latest text-to-speech model, and I have to say — the audio tags feature is the kind of thing I’ve been waiting for. Finally, you can tell an AI to whisper, shout, or slow down without jumping through hoops.

Let’s get the basics out of the way: this thing is rolling out in preview starting today. Developers get it via the Gemini API and Google AI Studio, enterprises can poke at it on Vertex AI, and Workspace users will see it in Google Vids. That’s a pretty wide net.

Speech quality that actually competes

Google claims this is their most natural and expressive TTS model yet. They’re backing it up with numbers: an Elo score of 1,211 on the Artificial Analysis TTS leaderboard. That’s a blind preference benchmark with thousands of human ratings, so it’s not just internal fluff.
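If Elo numbers don't mean much to you, here's a quick back-of-the-envelope sketch in Python. It uses the standard Elo expectation formula on the conventional 400-point scale, which is my assumption about how to read the leaderboard (the announcement doesn't spell out the exact scaling). The rival score of 1,111 is a made-up number for illustration, not a real model's rating.

```python
def expected_preference(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: chance that A is preferred over B in a blind pairing.

    Assumes the conventional 400-point logistic scale (an assumption here,
    not something confirmed for this particular leaderboard).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


gemini_tts = 1211        # Elo score reported for Gemini 3.1 Flash TTS
hypothetical_rival = 1111  # invented rating, 100 points lower, for illustration only

share = expected_preference(gemini_tts, hypothetical_rival)
print(f"{share:.2f}")  # prints 0.64
```

In other words, under that assumption a 100-point lead works out to listeners picking the higher-rated voice roughly 64% of the time. That's a real edge, but not a blowout.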

Artificial Analysis also put Gemini 3.1 Flash TTS in its “most attractive quadrant,” which is a fancy way of saying it balances high-quality speech with low cost. Add native multi-speaker dialogue, 70+ languages, and that granular control I mentioned, and it’s a pretty solid package.

Audio tags: the real star here

This is where it gets interesting. Audio tags let you embed natural language commands directly into the text input. Want the AI to sound excited? Tag it. Need a dramatic pause? Tag it. You can control vocal style, pace, and delivery with a level of precision that previous models just didn’t offer.

I’ve fiddled with enough TTS systems to know that most of them give you a slider for speed and maybe a dropdown for “happy” or “sad.” This is different. You’re writing commands into the flow of the text itself, which feels more like directing a voice actor than programming a robot.
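To make that concrete, here's a rough Python sketch of what writing a tagged script could look like. Big caveat: the square-bracket syntax and the tag names (`excited`, `whispers`) are my assumptions for illustration, not something confirmed in Google's announcement, and `Line` and `render_script` are hypothetical helpers, not part of any SDK. The code only builds the text you'd send to the model; it doesn't call the API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """One chunk of script text with an optional delivery direction."""
    text: str
    tag: Optional[str] = None  # hypothetical tag name, e.g. "whispers"


def render_script(lines: List[Line]) -> str:
    """Join the lines into one input string, prefixing tagged lines with [tag].

    The bracket format is an assumed convention, used here only to show
    the idea of inline direction living inside the text itself.
    """
    parts = []
    for line in lines:
        prefix = f"[{line.tag}] " if line.tag else ""
        parts.append(prefix + line.text.strip())
    return " ".join(parts)


script = render_script([
    Line("Welcome back to the show.", tag="excited"),
    Line("Before we start,"),
    Line("I have a secret to tell you.", tag="whispers"),
])
print(script)
# [excited] Welcome back to the show. Before we start, [whispers] I have a secret to tell you.
```

The point of keeping the direction inline is that the script stays readable as a script: you can hand the same string to a human voice actor and they'd know what to do with it.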

SynthID watermarking is included

Every piece of audio generated with this model gets watermarked with SynthID. That’s Google’s tool for tagging AI-generated content so it can be identified later. Given how realistic these voices are getting, that’s not just a nice-to-have — it’s a necessity. Misinformation concerns are real, and this is a step toward keeping things honest.

What I’m watching for

I’m curious to see how developers actually use these audio tags in production. The potential is there for everything from audiobooks with dynamic narration to customer service bots that don’t sound like they’re reading a script. But we’ve seen plenty of promising TTS models that never quite made it past the demo stage.

The 70+ language support is also a big deal. Most models in this space are heavily English-centric, so seeing broad language coverage out of the box is refreshing.

Is it perfect? Hard to say without spending serious time with it. But the direction is right. Google’s finally treating AI speech as something you direct, not just generate.
