Simula: A Smarter Way to Generate Synthetic Data by Designing Datasets, Not Just Samples

We all know the drill: generalist AI models are hungry for data, and the internet has been their all-you-can-eat buffet. But the next wave of AI adoption isn’t going to be about generating more cat memes or summarizing news articles. It’s about specialized, privacy-sensitive, and frankly boring domains where data is scarce, expensive, or just doesn’t exist in a usable form.

Creating datasets manually for these niches? That’s a nightmare. It’s slow, error-prone, and costs a fortune. Real-world data also locks you into a static snapshot—good luck iterating quickly when your dataset is a fossil. And let’s not even talk about safety: waiting for failures to happen in the wild before you harden your models is a reactive game nobody should be playing.

Synthetic data seems like the obvious answer, but most current approaches are half-baked. They rely on manual prompting, evolutionary algorithms that feel like black magic, or they need a bunch of seed data from the target distribution. That limits scalability, explainability, and control. Worse, they usually operate at the sample level—optimizing one data point at a time—instead of thinking about the dataset as a whole.

Reframing the problem as mechanism design

Google Research’s new paper, “Reasoning-Driven Synthetic Data Generation and Evaluation,” published in TMLR, introduces a framework called Simula that takes a fundamentally different approach. Instead of generating samples in isolation, Simula treats the whole process as dataset-level mechanism design. You’re not just asking a model to spit out more examples; you’re architecting a dataset from first principles, with fine-grained control over coverage, complexity, and quality.

The key insight is that production use cases need more than just “more data.” They need deliberate allocation of the generation budget: you want to cover the long tail of a domain, not just cluster around the common modes. You want to control how complex or simple the examples are. You want to ensure quality without relying on human annotators. Simula gives you those levers.
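To make the long-tail point concrete, here is a minimal sketch of splitting a fixed sample budget across categories. This is my own illustration, not a method from the paper: the category names, the counts, and the `alpha` tempering knob are all assumptions. With `alpha=1.0` the budget mirrors the natural frequencies; with `alpha=0.0` every category gets an equal share.

```python
def allocate_budget(weights, budget, alpha=1.0):
    """Split `budget` samples across categories in proportion to weight**alpha."""
    tempered = {name: w ** alpha for name, w in weights.items()}
    total = sum(tempered.values())
    raw = {name: budget * t / total for name, t in tempered.items()}

    # Round down first, then hand the leftover samples to the categories
    # with the largest fractional remainders so counts sum to `budget`.
    counts = {name: int(r) for name, r in raw.items()}
    leftover = budget - sum(counts.values())
    by_remainder = sorted(raw, key=lambda n: raw[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical observed counts for three threat categories.
observed = {"phishing": 900, "ransomware": 90, "supply_chain": 10}

natural = allocate_budget(observed, 1000, alpha=1.0)  # mirrors the common modes
flat = allocate_budget(observed, 1000, alpha=0.0)     # equal share per category

print(natural)
print(flat)
```

Values of `alpha` between 0 and 1 trade off between the two extremes, which is one simple way to give rare categories more room without ignoring how common the others are.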

How Simula works: reasoning-first, seedless, agentic

Simula is built around a “reasoning-first” methodology. It’s seedless—you don’t need existing data from the target domain. And it’s agentic, meaning it uses reasoning models (like LLMs with strong reasoning capabilities) to generate and evaluate data in a structured way.

The generation process is broken down into four controllable axes:

Global Diversification: Instead of randomly sampling, Simula uses reasoning models to map out the conceptual space of a target domain into deep, hierarchical taxonomies. Think of it as a “sampling scaffold.” By defining sampling strategies over these taxonomies, you can ensure your dataset covers the long tail—the rare edge cases—rather than just the typical examples. The system does this recursively: it proposes sub-categories, evaluates them, merges duplicates, and filters out noise. The result is a dense taxonomy that serves as the blueprint for diversity.

Local Complexity: Once you have the taxonomy, you can control the complexity of individual samples. For example, in a cybersecurity threat dataset, you might want simple phishing emails for beginners and multi-stage attacks for advanced scenarios. Simula lets you dial this in per taxonomy node.

Quality Control: Instead of relying on human reviewers, Simula uses a critic model to evaluate generated samples against predefined quality criteria. This is integrated into the generation loop, so you can reject or refine samples on the fly.

Scalable Generation: Because the process is automated and seedless, you can generate datasets at scale without human bottlenecks. The generation capabilities improve naturally as the underlying reasoning models get better.
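The four axes above can be sketched in a few dozen lines. Treat this as a toy under stated assumptions, not Simula’s actual implementation: every function name here is hypothetical, and the reasoning model, the relevance filter, the generator, and the critic are all replaced by trivial placeholders so the example runs without any model.

```python
import random

# Hypothetical canned proposals standing in for a reasoning-model call.
CANNED_PROPOSALS = {
    "cyber threats": ["Phishing", "phishing ", "Malware", "misc"],
    "phishing": ["Spear phishing", "Smishing"],
    "malware": ["Ransomware", "ransomware", "Worms"],
}


def propose_children(topic):
    """Placeholder for prompting a reasoning model for sub-categories."""
    return CANNED_PROPOSALS.get(topic, [])


def is_relevant(name):
    """Placeholder filter; a real system would ask a model to judge this."""
    return name != "misc"


def build_taxonomy(topic, depth):
    """Recursively expand a topic: propose, normalise, merge duplicates, filter."""
    if depth == 0:
        return {}
    children = {}
    for raw in propose_children(topic):
        name = raw.strip().lower()
        if name in children or not is_relevant(name):
            continue  # duplicate or noise
        children[name] = build_taxonomy(name, depth - 1)
    return children


def leaf_paths(tree, prefix=()):
    """Yield the root-to-leaf path of every leaf in the taxonomy."""
    for name, subtree in tree.items():
        path = prefix + (name,)
        if subtree:
            yield from leaf_paths(subtree, path)
        else:
            yield path


def generate_sample(path, complexity):
    """Placeholder generator; a real system would prompt a model here."""
    return f"[{complexity}] scenario about {' > '.join(path)}"


def passes_critic(sample, path):
    """Placeholder critic: only checks the sample mentions its target leaf."""
    return path[-1] in sample


def generate_dataset(root, n_attempts, levels=("simple", "advanced"), seed=0):
    rng = random.Random(seed)
    leaves = list(leaf_paths(build_taxonomy(root, depth=2)))
    dataset = []
    for _ in range(n_attempts):
        path = rng.choice(leaves)    # global diversification: sample over leaves
        level = rng.choice(levels)   # local complexity: chosen per sample
        sample = generate_sample(path, level)
        if passes_critic(sample, path):  # quality control inside the loop
            dataset.append({"path": path, "complexity": level, "text": sample})
    return dataset


for row in generate_dataset("cyber threats", n_attempts=4):
    print(row["text"])
```

Sampling uniformly over leaves is the simplest possible strategy; the point of having the taxonomy as an explicit object is that you can swap in any other strategy without touching the generator or the critic.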

What this means in practice

I’ve been playing with synthetic data generators for a while, and most of them feel like brute force—throw more compute at the problem and hope for the best. Simula’s approach is more surgical. By treating the dataset as a designed artifact, you get reproducibility (the dataset is like code—versioned and inspectable), controllability (you can tweak the taxonomy or complexity without regenerating everything), and explainability (you can trace why a particular sample was generated).
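Here is one way the “dataset is like code” idea could look in practice. This is a sketch of my own, assuming a hypothetical `DatasetSpec` with a handful of illustrative fields; the paper does not prescribe this format. The spec gets a stable fingerprint so it can be versioned, and every sample carries a provenance record pointing back to the spec and taxonomy node that produced it.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical description of a generated dataset; fields are illustrative."""
    domain: str
    taxonomy_depth: int
    complexity_levels: tuple
    seed: int

    def fingerprint(self):
        # Canonical JSON (sorted keys) so equal specs always hash identically.
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def with_provenance(text, spec, taxonomy_path, complexity):
    """Attach enough metadata to explain why this sample exists."""
    return {
        "text": text,
        "spec": spec.fingerprint(),
        "taxonomy_path": list(taxonomy_path),
        "complexity": complexity,
    }


spec_v1 = DatasetSpec("cyber threats", 2, ("simple", "advanced"), seed=0)
spec_v2 = replace(spec_v1, taxonomy_depth=3)  # a deliberate, reviewable change

record = with_provenance(
    "example sample text", spec_v1, ("phishing", "smishing"), "simple"
)
print(record)
print(spec_v1.fingerprint() != spec_v2.fingerprint())
```

Because the fingerprint changes whenever any field changes, two datasets with the same fingerprint were requested with the same design, which is what makes a diff between dataset versions meaningful.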

The paper demonstrates this on a few domains, including cyber threat intelligence and medical dialogue generation. In both cases, Simula outperformed baseline methods on coverage and quality metrics, while requiring no seed data.

Of course, this isn’t magic. The quality of the taxonomies depends on the reasoning capabilities of the underlying model. If your model is weak, your taxonomies will be shallow or noisy. And the critic model for quality control introduces its own biases: a critic with blind spots will wave through flawed samples and reject good ones, and nothing downstream will catch it. But as reasoning models continue to improve (and they are improving fast), both problems should shrink without any change to the framework itself.

The bigger picture

What I like about Simula is that it shifts the conversation from “how do we generate more data?” to “how do we design better datasets?” That’s a more interesting and more useful question. For privacy-sensitive domains like healthcare or finance, where you can’t just scrape the web, this kind of controlled generation is a game-changer. For safety-critical applications, being able to proactively generate edge cases—rather than waiting for them to happen—is exactly what we need.

Is Simula the final word on synthetic data? No. But it’s a solid step in the right direction. It acknowledges that synthetic data generation is a design problem, not a sampling problem. And that’s a perspective more people in the field should adopt.
