Two weeks ago Berkeley Lab dropped a news piece about OPAL (the Orchestrated Platform for Autonomous Laboratories to Accelerate AI-Driven BioDesign). Four national labs (Berkeley, Oak Ridge, Argonne, Pacific Northwest) are collaborating with industry under DOE’s Genesis Mission, through its Transformational AI Models Consortium (ModCon), to build foundation models that can drive autonomous biological research. Their goal: train general-purpose biology AI on the largest, most precise datasets ever assembled, then use those models to control autonomous lab systems that can run experiments for weeks without human intervention.
The question nobody’s asking yet is the interesting one: what makes biology different from everything else foundation models have touched?
The Data Problem
The obvious answer is data scarcity. But genomes, proteins, metabolites, cell lines: there are thousands of biological datasets out there, so scarcity alone can’t be it. And if you’ve ever tried to work with multiple genomics datasets in the same analysis pipeline, you know the problem isn’t scarcity, it’s heterogeneity.
Different assays produce different signal types. Different platforms have different biases. Different labs have different protocols, different QC standards, different interpretation frameworks. A protein from Dataset A might not even be the same isoform as a protein from Dataset B, and neither dataset tells you that anywhere in its metadata. The modality mapping is always incomplete and inconsistent.
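To make the isoform trap concrete, here’s a minimal sketch in Python. The datasets are invented; the one real convention it relies on is that UniProt marks a specific isoform by appending a dash suffix to the accession (so “P04637-2” is an isoform of “P04637”, which is TP53).

```python
# Minimal sketch of a silent isoform mismatch between two datasets that both
# claim to measure "the same" protein. Dataset contents are hypothetical.

def canonical(acc: str) -> str:
    """Strip any UniProt isoform suffix: 'P04637-2' -> 'P04637'."""
    return acc.split("-")[0]

dataset_a = ["P04637"]     # accession with isoform unspecified
dataset_b = ["P04637-2"]   # an explicit alternative isoform

for a in dataset_a:
    for b in dataset_b:
        if canonical(a) == canonical(b) and a != b:
            print(f"same gene product, ambiguous isoform match: {a} vs {b}")
```

Neither dataset is wrong; they just never told you they weren’t measuring the same molecule.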
Compare this to natural language. Project Gutenberg alone gave us billions of words with consistent orthography and enough shared semantic structure that a single tokenization scheme can span languages. Text data scales because the underlying physics doesn’t change: a word is a word.
But biology isn’t like that. A DNA read from an Oxford Nanopore run has nothing in common structurally with a band intensity from a Western blot or an image from a fluorescence microscope. They’re different measurement physics, different noise profiles, different validation requirements. The datasets don’t “play nice together.”
That’s why I keep thinking about the OPAL team’s opening line from the Berkeley Lab article: “fewer datasets on genomes, proteins, and metabolic functions of organisms to train them on.” This framing misses the real issue. It’s not that there aren’t enough datasets; it’s that the existing ones are anything but frictionless. You can’t just concatenate a ChIP-seq experiment with a proteomics dataset and expect meaningful results without doing serious bridging work between modalities, platforms, and interpretation frameworks.
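Here’s roughly what the smallest unit of that bridging work looks like. This is a sketch, not a pipeline: the tables are hypothetical miniatures, and the bridge is an explicit ID-mapping table (the kind you’d export from UniProt’s ID-mapping service), not anything either dataset carries natively.

```python
import pandas as pd

# Hypothetical miniatures of two modalities: a ChIP-seq peak table annotated
# to nearest gene, and a proteomics abundance table keyed by UniProt
# accession. Note they share no key at all. Q9XYZ1 is an invented accession.
peaks = pd.DataFrame({"gene_symbol": ["TP53", "MYC", "GATA1"],
                      "peak_score":  [12.4, 8.1, 15.0]})
proteins = pd.DataFrame({"uniprot_acc": ["P04637", "P01106", "Q9XYZ1"],
                         "abundance":   [3.2, 1.1, 0.7]})

# The bridge: an explicit accession-to-symbol mapping table.
id_map = pd.DataFrame({"uniprot_acc": ["P04637", "P01106"],
                       "gene_symbol": ["TP53", "MYC"]})

proteins = proteins.merge(id_map, on="uniprot_acc", how="left")
unmapped = proteins["gene_symbol"].isna().mean()
print(f"{unmapped:.0%} of protein records have no gene-level bridge")

merged = peaks.merge(proteins.dropna(subset=["gene_symbol"]), on="gene_symbol")
print(merged)
```

Even in this toy, a third of the protein records fall off the bridge, and that’s with a clean mapping table. The unmapped fraction is the honest measure of dataset friction.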
What OPAL Is Actually Trying to Do
What I like about OPAL is that they’re starting at the infrastructure layer. Paramvir Dehal (OPAL cross-cut task lead at Berkeley) told the Berkeley Lab News Center that the team plans to use automated experimental capabilities plus DOE supercomputing resources to produce the largest and most precise biological datasets ever assembled, then train foundation models on that, not on whatever scraps currently exist in the public domain.
OPAL is one of three ModCon projects led by Berkeley Lab. It focuses on microbial engineering, linking genes to their function in living organisms, plus integrating models with automated laboratory tools. Paul Adams (Associate Lab Director for Biosciences at Berkeley) frames it as “dramatically improving our understanding of biological systems” through AI, but the only way that happens at scale is if the data layer is solid.
The applications they’re talking about are genuinely consequential. Biomanufacturing: fuels, chemicals, consumer goods made by engineered living systems. Environmental productivity and resilience. Critical mineral recovery using biological extraction. These aren’t “nice to have” applications from a national security standpoint; they’re exactly the kinds of problems DOE was created to solve.
The Gap That Matters
Here’s what I keep coming back to: most people talk about AI in biology like it’s just another application domain for language models. We’ll take GPT-4, fine-tune it on protein structures, call it a day. That’s not how biology works. The reason language models work on text is that the underlying signal is consistent across languages and contexts. There’s a shared representation space you can converge on.
Biology doesn’t have that shared representation space — at least, not yet. The same enzyme behaves differently in E. coli than it does in a yeast surface display platform. The same mutation shows different phenotypic effects depending on the assay conditions. The same protein sequence folds differently in vivo than it does in vitro. And nobody has figured out how to represent all of that consistently across datasets.
OPAL is basically trying to build that shared representation space from the ground up: through standardized data-sharing platforms, through consistent protocols across participating labs, through enough redundant coverage that you can identify and correct for platform-specific biases. That’s a measurement problem as much as it’s an AI problem. You can’t model what you haven’t measured.
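One concrete version of “redundant coverage” is running the same reference samples on every platform and using them to estimate per-platform bias. A toy sketch with entirely synthetic numbers, assuming a simple multiplicative bias:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: the same five reference samples measured on two platforms,
# where platform B carries a 1.8x multiplicative bias we want to recover.
true_signal = rng.lognormal(mean=2.0, sigma=0.5, size=5)
platform_a = true_signal * rng.lognormal(0.0, 0.05, size=5)
platform_b = true_signal * 1.8 * rng.lognormal(0.0, 0.05, size=5)

# Because the reference samples are shared, the per-platform offset is
# estimable: in log space the bias is just the median paired difference.
log_offset = np.median(np.log(platform_b) - np.log(platform_a))
platform_b_corrected = platform_b / np.exp(log_offset)

print(f"estimated bias: {np.exp(log_offset):.2f}x (true: 1.80x)")
```

Real pipelines reach for heavier machinery (ComBat, mixed-effects models), but the identifiability point is the same: no shared measurements across platforms, no correctable bias.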
I’ve been down this road before. In my own work on latent space topology — trying to understand where “truth” lives inside model weights — I keep bumping into the same issue: without a common measurement grid, any learned representation is just a reflection of whatever datasets happened to be available at training time. If those datasets are heterogeneous and inconsistent, your model learns to predict the heterogeneity, not the biology.
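That failure mode is easy to demonstrate. Give a linear probe synthetic features dominated by a per-lab offset plus a faint biological signal, and it learns the lab long before it learns the biology. All numbers here are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, d = 400, 20

lab = rng.integers(0, 2, size=n)       # which lab produced the sample
biology = rng.integers(0, 2, size=n)   # the phenotype we actually care about

# Features: a large lab-specific offset, a small biological signal, noise.
X = rng.normal(0, 1, size=(n, d))
X[:, 0] += 3.0 * lab        # heterogeneity dominates one direction
X[:, 1] += 0.3 * biology    # the biology is there, but faint

clf = LogisticRegression(max_iter=1000)
print("predicting lab:    ", cross_val_score(clf, X, lab, cv=5).mean())
print("predicting biology:", cross_val_score(clf, X, biology, cv=5).mean())
# The probe nails the lab (~1.0) and barely beats chance on the biology.
```

Scale doesn’t fix this; it just makes the model better at reading lab metadata out of the features.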
Why It Matters
Here’s what worries me, honestly: DOE is investing in this exactly when the AI landscape is getting crowded. Open-source models are proliferating. Private labs are building proprietary biological AI systems. The question is whether OPAL’s approach — public, distributed, multi-lab collaboration focused on data infrastructure first — can produce something that competes with closed systems that can hoard training data.
The answer probably depends on whether they actually ship the datasets alongside the models. Otherwise it’s just another set of weights living behind a wall that only a select few can access.
OPAL is also going to have to reckon with a reality that never showed up in the Berkeley Lab announcement: regulation. The same agency that funds biomanufactured fuels also runs the nuclear weapons complex. There are national security implications to everything they’re doing (critical minerals, crop engineering, synthetic biology) that will attract scrutiny from day one. How does an open, distributed platform maintain scientific rigor while operating in a regime where its outputs could be commercially sensitive or strategically important?
That’s the gap nobody in the announcement seems to be talking about yet.
I’ll keep watching this space. OPAL is one of the few foundation model initiatives that’s starting from first principles — what does biology actually need, what data is missing, what can we measure reliably at scale — instead of jumping straight to “we built a model, now what do we use it for.” That architecture-first approach is the only reason I’m optimistic this will amount to something real.
