We Trained mRNA Language Models Across 25 Species for $165—Here’s How

We Trained mRNA Language Models Across 25 Species for $165—Here’s How

10 0 0

Imagine going from a therapeutic protein concept to a synthesis-ready, codon-optimized DNA sequence in an afternoon. That’s what OpenMed set out to build, and this time they actually shipped it.

In their first post, they mapped the landscape—AlphaFold, ESMFold, ProteinMPNN, the whole zoo. That was the survey. This is the build. And honestly, I appreciate the honesty upfront: they admit this isn’t a polished success story. It’s a transparent account of what worked, what surprised them, and what they’d do differently.

What They Built

The pipeline has three stages, each tackling a different part of the protein engineering workflow:

  • Protein folding – ESMFold v1 on 30 protein chains. Average pTM: 0.79. Solid baseline.
  • Sequence design – ProteinMPNN on scaffold 7K00. 42% sequence recovery. That’s actually decent for a single pass.
  • mRNA optimization – This is where they went deep. Trained multiple transformer variants on 250k coding sequences, then scaled to 381k sequences across 25 species.

The folding and design parts use established tools (ESMFold from Meta, ProteinMPNN from the Baker Lab). The codon optimization is entirely their own—new models, new training infrastructure, new evaluation metrics. That’s where the interesting stuff lives.

The Architecture Showdown

Here’s the thing about codon optimization: the genetic code is degenerate. The same protein can be encoded by astronomically many DNA sequences, but some arrangements express 100x better than others. The Pfizer-BioNTech COVID vaccine was codon-optimized for human expression. OpenMed wanted to build a model that learns these preferences directly from natural coding sequences, not from hand-crafted frequency tables.

They tested five architectures:

  • CodonBERT (baseline) – 6M params, BERT-tiny. Just to establish floor performance.
  • ModernBERT-base – 90M params, 22 layers, RoPE. Latest efficiency innovations from NLP.
  • CodonRoBERTa-base – 92M params, 12 layers. Same family as ESM-2.
  • CodonRoBERTa-large – 312M params, 24 layers.
  • CodonRoBERTa-large-v2 – Same architecture, better hyperparameters.

The choice of RoBERTa was deliberate. Meta’s ESM-2 (which powers ESMFold) is itself a RoBERTa variant trained on protein sequences. The hypothesis: the same architecture that learned amino acid patterns would also learn codon usage biases.

The Winner and Why It Matters

CodonRoBERTa-large-v2 came out on top with a perplexity of 4.10 and a Spearman CAI correlation of 0.40. For context, ModernBERT scored worse despite being newer and more efficient. This surprised me a bit—ModernBERT has all the modern bells and whistles (rotary embeddings, efficient attention), but RoBERTa just performed better on codon sequences.

The lesson here is that biological sequences don’t always benefit from the latest NLP innovations. Sometimes the proven workhorse is still the right tool.

Scaling to 25 Species

This is the part that impressed me. They scaled to 25 species—covering human, mouse, zebrafish, E. coli, yeast, and others—in 55 GPU-hours. Total cost: $165. That’s absurdly cheap for training four production models.

They built a species-conditioned system that no other open-source project offers. You give it a protein sequence and a target organism, and it outputs codon-optimized DNA tailored to that species’ codon usage biases. The multi-species suite includes four models spanning those 25 organisms.

The End-to-End Workflow

Put it all together and you get a pipeline that takes you from protein concept to synthesis-ready DNA in a single afternoon. The code is runnable, the results are reproducible, and the whole thing cost less than a decent dinner out.

I have one minor gripe: the folding component only ran on 30 protein chains. That’s a small validation set. I’d want to see results on a larger, more diverse set before trusting the pipeline for production work. But as a proof of concept, it’s solid.

Where This Stands

OpenMed has built something genuinely useful. The codon optimization models are open-source, the training pipeline is documented, and the multi-species conditioning is a feature I haven’t seen in any other open project. The $165 price tag makes it accessible to labs that can’t afford the big cloud compute bills.

If you’re working on therapeutic mRNA, vaccine development, or recombinant protein production, this is worth a look. The code is on GitHub, the models are on Hugging Face, and the paper draft is linked in their references.

I’m curious to see where they go next. Fine-tuning on experimental expression data would be the obvious next step. Or maybe scaling to all 100+ sequenced species. Either way, they’ve set a high bar for open-source protein AI pipelines.

Comments (0)

Be the first to comment!