DAN · Analysis · 8 min read · June 14, 2026 · Updated July 13, 2026

NVIDIA–Gretel and Syntho–MOSTLY AI: How the Synthetic Data Market Consolidated in 2026

Synthetic data startups absorbed by chip giants and surviving vendors as AI labs exhaust real-world training data

TL;DR

The shift: Standalone synthetic-data startups are being absorbed by chip giants and surviving specialists — the independent-vendor era is closing.
Why it matters: As frontier labs run out of real-world training data, synthetic data became infrastructure, and infrastructure gets owned, not rented.
What’s next: The remaining independents pick a side — get acquired, go open-source, or get squeezed out.

Two deals, fifteen months apart, tell the same story. NVIDIA reportedly pulled Gretel inside its own walls. Then Syntho walked off with the MOSTLY AI brand after the company behind it had already shut down. This isn’t a run of unrelated startup exits — it’s a whole sector folding into the platforms that feed AI’s hunger for data.

The Independent Synthetic-Data Vendor Is Going Extinct

Thesis: The synthetic-data market is consolidating into the hands of chip and platform giants — because synthetic data stopped being a feature and became training infrastructure.

For years, synthetic data generation was a specialist’s game. A handful of startups sold the ability to manufacture artificial-but-realistic records, and the big labs were customers.

That relationship just inverted.

The biggest model builders — Microsoft, Meta, OpenAI, Anthropic — already train flagship models partly on synthetic data as real-world data runs thin, per TechCrunch’s reporting. When a capability becomes core to your product, you stop renting it.

You buy the company that makes it.

Two Deals, One Direction

The proof isn’t a single headline. It’s the same move repeating across the sector.

NVIDIA reportedly acquired Gretel in March 2025 — a story originally broken by Wired and picked up by TechCrunch. The price ran into nine figures, north of Gretel’s roughly $320M valuation, with exact terms undisclosed, according to SiliconANGLE. Gretel, founded in 2019 with around 80 employees and led by CEO Ali Golshan, folded into NVIDIA’s generative-AI developer services.

Then came the second move, and the direction matters. In June 2026, Syntho acquired the MOSTLY AI brand, trademark, and related assets — not the reverse, per Syntho’s own announcement. MOSTLY AI the company, Vienna-based and founded in 2017 on roughly $31 million in funding, had already wound down operations earlier that year, according to CB Insights. The combined brand now runs as “MOSTLY AI, powered by Syntho.”

Pull back further and SAS reportedly absorbed Hazy’s assets back in 2024.

Three years. Multiple absorptions. One direction. That’s not a string of coincidences — that’s a market being rolled up.

Who Comes Out Ahead

NVIDIA wins the cleanest. It now owns the data-generation layer that feeds its own chips and developer stack — vertical integration, top to bottom.

Syntho wins by subtraction. It absorbed a defunct rival’s name and mindshare, and walks away as a default enterprise label in a thinning field.

The open-source survivors win too. Synthetic Data Vault (SDV) and Faker become neutral ground the moment commercial options consolidate, though they differ sharply: Faker fills each column independently with plausible fake values, while SDV learns the statistical relationships between columns from real data. SDV’s commercial tier runs through DataCebo with no public pricing, according to Tonic.ai, itself an independent vendor in the field.
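The column-at-a-time versus relationship-aware distinction is easy to see in a toy example. The sketch below uses only the Python standard library and does not call Faker or SDV; the age and salary table, the straight-line model, and every function name are illustrative assumptions rather than either library's actual method.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def fit_line(xs, ys):
    """Least-squares line plus residual spread: a deliberately tiny joint model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    noise = (sum(r * r for r in residuals) / n) ** 0.5
    return intercept, slope, noise

rng = random.Random(0)

# Hypothetical "real" table: salary rises with age, plus noise.
ages = [rng.uniform(22, 65) for _ in range(2000)]
salaries = [1000 * a + rng.gauss(0, 5000) for a in ages]

# Column-at-a-time: each column is drawn on its own, so row pairings are lost.
ind_ages = rng.sample(ages, len(ages))
ind_salaries = rng.sample(salaries, len(salaries))

# Relationship-aware: learn how salary depends on age, then generate new pairs.
intercept, slope, noise = fit_line(ages, salaries)
joint_ages = [rng.choice(ages) for _ in ages]
joint_salaries = [intercept + slope * a + rng.gauss(0, noise) for a in joint_ages]

real_corr = pearson(ages, salaries)
independent_corr = pearson(ind_ages, ind_salaries)
joint_corr = pearson(joint_ages, joint_salaries)

print(f"real: {real_corr:.2f}  independent: {independent_corr:.2f}  joint: {joint_corr:.2f}")
```

Both synthetic tables have believable individual columns. Only the second keeps the link between age and salary, which is the property a downstream model would actually learn from.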

And the methods themselves don’t care who signs the checks. CTGAN, differential privacy, and knowledge distillation are techniques, not products. The science survives every acquisition; only the logos on it change.

Who Gets Squeezed

The standalone synthetic-data startup without a platform underneath it. MOSTLY AI is the cautionary tale: real technology, real funding, and it still wound down before the brand changed hands.

Enterprises that bet their roadmap on a single independent vendor. Your data pipeline is now somebody else’s M&A footnote — and brand continuity is not the same as product continuity.

Gretel isn’t a loser here, but it’s no longer an independent option you can choose. MOSTLY AI the standalone company is simply gone.

So the strategic fork is sharp: you either build on a generation layer a giant will keep funding, or you bet on an independent that’s one acquisition away from a brand transfer.

What Happens Next

Base case (most likely): The remaining independents consolidate further or retreat into open-source and niche compliance plays. Synthetic data settles in as a standard layer of the training stack, mostly owned by platforms. Signal to watch: Another standalone vendor acquired, or one open-sourcing its core to stay relevant. Timeline: Next 12–18 months.

Bull case: Synthetic data matures into a well-governed, openly auditable layer. Open libraries thrive as neutral infrastructure, and enterprises get more options through native platform integrations. Signal to watch: A major cloud or chip platform ships synthetic-data generation as a first-class managed service. Timeline: Within roughly a year.

Bear case: Quality and privacy problems erode trust: models degrade when trained too heavily on their own synthetic output, and consolidation leaves fewer independent checks on data fidelity. Signal to watch: A public incident of synthetic-data-driven model degradation or a privacy leak from generated records. Timeline: 2026 into 2027.

Frequently Asked Questions

Q: How are companies using synthetic data to train AI models? A: Frontier labs — Microsoft, Meta, OpenAI, and Anthropic among them — already train flagship models partly on synthetic data, manufacturing artificial records to fill gaps where real-world data is scarce, sensitive, or simply exhausted, per TechCrunch’s reporting.

Q: Is synthetic data the future of AI training in 2026? A: It’s already part of the present. The consolidation wave running through 2026 suggests big platforms now treat synthetic data as core infrastructure, not an experiment. The open question isn’t whether it matters; it’s who controls the generation layer.

Q: Will synthetic data replace real-world data for training LLMs? A: No — it’s blending with real data, not replacing it. Synthetic records fill gaps and protect privacy, but models still need real-world signal to stay grounded. The realistic future is hybrid datasets, not an all-synthetic one.

The Bottom Line

The synthetic-data sector isn’t dying — it’s being absorbed into the platforms that depend on it. Standalone vendors are the endangered species now, and the layer that manufactures AI’s training data is consolidating into a handful of owners. Watch the next acquisition; it tells you who’s setting the terms.

Aha Moments

MONA

Dan is right that the deals reshuffle ownership, but be precise about what doesn’t change. The hard part of synthetic data was never the logo on the platform — it’s making generated records statistically faithful to the real ones without copying them. Not magic. Distribution matching. Those methods work the same inside NVIDIA as they did inside a startup. The real technical risk lives elsewhere: train a model too heavily on its own synthetic output and the distribution narrows, quietly, generation after generation. Consolidation doesn’t touch that failure mode. Whoever owns the generator still has to prove the data preserves the outliers and the long tail that real-world signal carries by default.
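The narrowing Mona describes can be reproduced in a few lines. This is a minimal sketch, assuming a one-dimensional Gaussian "model" that is refit on its own samples each generation; the function name, sample size, and generation count are illustrative choices, and real generative models are far more complex.

```python
import random
import statistics

def collapse_demo(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian on its own samples and record the spread each round."""
    rng = random.Random(seed)
    # Generation 0 stands in for real-world data: draws from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = [statistics.pstdev(data)]
    for _ in range(generations):
        # "Train" on the current data by estimating mean and spread...
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        # ...then replace the data entirely with the model's own output.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        spreads.append(statistics.pstdev(data))
    return spreads

spreads = collapse_demo()
print(f"spread at generation 0: {spreads[0]:.3f}")
print(f"spread at generation {len(spreads) - 1}: {spreads[-1]:.3g}")
```

Each refit slightly underestimates the spread on average, and with no fresh real data the losses compound, so the tails disappear first. The small sample size is chosen to make the effect visible quickly; larger samples slow the drift rather than remove it.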

MAX

Mona’s right that the math survives the acquisition — but the integration contract is exactly what breaks. From a builder’s seat, every one of these deals is a dependency changing hands underneath you. If your synthetic-data layer is one vendor’s API, you just learned it can become a brand transfer overnight. The fix is the usual one: spec the generator as a swappable component, not a hardwired assumption. Define the data contract — schema, fidelity checks, privacy guarantees — independently of who fulfills it. Then an acquisition is a config change, not a migration. The teams that wired a single vendor into their training pipeline are rewriting code this quarter. The teams that abstracted it are shrugging.
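One way to read Max's advice in code: the contract lives in your repository, and the vendor is just whichever object satisfies it. The sketch below is an assumption-laden illustration, not any vendor's API. `VendorA`, `VendorB`, `DataContract`, the column names, and the two checks are all hypothetical, and the checks are far cruder than real fidelity or privacy tests.

```python
import random
import statistics
from dataclasses import dataclass
from typing import Callable, Protocol

Row = dict[str, float]

class Generator(Protocol):
    """Anything that can produce n synthetic rows can fulfil the contract."""
    def generate(self, n: int) -> list[Row]: ...

@dataclass(frozen=True)
class DataContract:
    schema: frozenset[str]
    fidelity_check: Callable[[list[Row]], bool]
    privacy_check: Callable[[list[Row]], bool]

    def accepts(self, rows: list[Row]) -> bool:
        schema_ok = all(set(row) == self.schema for row in rows)
        return schema_ok and self.fidelity_check(rows) and self.privacy_check(rows)

class VendorA:
    """Hypothetical vendor: uniform ages."""
    def generate(self, n: int) -> list[Row]:
        rng = random.Random(1)
        ages = [rng.uniform(30, 50) for _ in range(n)]
        return [{"age": a, "income": 1000 * a} for a in ages]

class VendorB:
    """Hypothetical replacement vendor: bell-shaped ages."""
    def generate(self, n: int) -> list[Row]:
        rng = random.Random(2)
        ages = [rng.gauss(40, 5) for _ in range(n)]
        return [{"age": a, "income": 1200 * a} for a in ages]

# Hypothetical real record that must never be reproduced verbatim.
REAL_RECORDS = [{"age": 41.0, "income": 52000.0}]

def mean_age_in_range(rows: list[Row]) -> bool:
    # Crude fidelity check: the average age must stay near the real population's.
    return bool(rows) and 35 <= statistics.fmean(r["age"] for r in rows) <= 45

def no_real_rows_copied(rows: list[Row]) -> bool:
    # Crude privacy check: no generated row equals a known real record.
    return all(row not in REAL_RECORDS for row in rows)

contract = DataContract(
    schema=frozenset({"age", "income"}),
    fidelity_check=mean_age_in_range,
    privacy_check=no_real_rows_copied,
)

GENERATORS: dict[str, Callable[[], Generator]] = {"vendor_a": VendorA, "vendor_b": VendorB}

def build_dataset(config: dict[str, str], contract: DataContract, n: int) -> list[Row]:
    """Pick the generator named in config and enforce the contract on its output."""
    generator = GENERATORS[config["generator"]]()
    rows = generator.generate(n)
    if not contract.accepts(rows):
        raise ValueError(f"{config['generator']} violated the data contract")
    return rows

rows_a = build_dataset({"generator": "vendor_a"}, contract, 100)
rows_b = build_dataset({"generator": "vendor_b"}, contract, 100)
```

Swapping vendors changes one string in the config. The pipeline code and the acceptance checks stay exactly where they were.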

ALAN

Max wants the generator swappable; Mona wants the distribution honest. Both are right — and both are engineering answers to a question outgrowing engineering. Look at what consolidation concentrates: the power to manufacture the data that trains the models we all use. When a few owners control both the synthetic data and the systems it feeds, the line between training a model and shaping what it can know gets very thin. Who audits the data nobody collected from the real world? Who notices when a rare voice never makes it into the synthetic set, because no one generated it? If the record an AI learns from is increasingly fabricated by the same hands that build the AI, who is left to check whether something important quietly went missing?

AI-assisted content, human-reviewed. Images AI-generated.