AI plays Mother Nature? - TELESC.NET.BR BBS History

BBS:      TELESC.NET.BR
Subject:  AI plays Mother Nature?
From:     Mike Powell
Date:     Sun, 29 Mar 2026 09:12:37 -0500
-----------------------------------------------------------
'Every living thing on Earth runs on the same programming language': How AI
foundation models trained on DNA could transform plant biology

Date: Sun, 29 Mar 2026 12:05:00 +0000

Description:
How AI foundation models are moving from language into the biology of living
systems and plant science.

FULL STORY
Artificial intelligence is already having a
big impact on fields like language processing and computer vision, but 
biology is emerging as one of the next major frontiers. Instead of training
models on text or images, researchers are now turning to DNA, RNA, and other
biological data, treating genetic sequences as information systems that can 
be analyzed at scale. That move comes at a moment when genomic data is 
growing faster than many traditional tools can handle. Sequencing technology
has become cheaper and more widespread over the past two decades, producing
vast collections of biological data that researchers can read but still
struggle to interpret in meaningful ways.

The challenge is no longer gathering genetic information, but understanding
how different sequences interact and influence real-world outcomes.

Enter Living Models

Living Models is part of a growing group of companies attempting to tackle
that gap using transformer-based architectures, the same underlying approach
that powered the recent wave of large language models.

Instead of predicting the next word in a sentence, these systems analyze
patterns across biological sequences, aiming to uncover structural
relationships that traditional statistical tools often miss. 

The company's first model family focuses on plant biology, an area where
genetic data is widely available and where faster insight could directly
affect crop development and climate resilience.

The idea reflects a bigger shift in how researchers think about biology
itself, moving from static catalogs of genetic parts toward systems that can
interpret how those parts work together. 

"Every living thing on Earth runs on the same programming language: DNA codes
for RNA codes for proteins codes for phenotype," said Bertrand Gakire, VP
Biology at Living Models. "We're not building another chatbot. We're
building a model that can read and interpret that code, which is infinitely
more useful than predicting the next word in a sentence." 
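
To make the "DNA codes for RNA codes for proteins" chain in that quote
concrete, here is a minimal Python sketch of the first two steps,
transcription and translation. It is an illustration only: the codon table is
a small subset of the standard genetic code, the function names and the
example sequence are made up for this note, and nothing here comes from Living
Models. The final step, from protein to phenotype, is not modelled.

```python
# Subset of the standard genetic code (RNA codons -> one-letter amino acids).
# Only the codons needed for the example are listed; '*' marks a stop codon.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "GCU": "A",  # alanine
    "UGG": "W",  # tryptophan
    "AAA": "K",  # lysine
    "UAA": "*",  # stop
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # 'X' = not in subset
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Illustrative sequence: ATG GCT TGG AAA TAA
dna = "ATGGCTTGGAAATAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

The mapping from sequence to protein is a fixed lookup; the hard part the
interview goes on to discuss is everything the lookup does not capture, such
as regulation and interactions between distant regions.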

I wanted to understand what that transition could mean in practice, so I 
spoke to Living Models CEO and co-founder Cyril Vran about why biology is
becoming an information problem, and why plants are the starting point.

Living Models wants to build foundation models for biology. But why? Can we
draw parallels with the race back in the 1990s to decode the human genome?

The Human Genome Project gives us a useful before-and-after. Before 2003, we
could not read the code at all. The project's achievement was monumental: a
complete parts list for human biology.

But a parts list is not understanding. After twenty years of remarkable work
(GWAS studies, CRISPR screens, QTL mapping, genomic selection) we have
accumulated enormous amounts of genomic data and produced real results.

What we have not produced, at scale, is generalisation. The tools that exist
today are fundamentally correlative: they learn that certain marker
combinations tend to co-occur with certain phenotypes, within a given
population, in a given environment. 

They do not learn why. Ask them to extrapolate to a novel genetic 
combination, a different environment, or a related species, and the
statistical associations break down. That is the wall the industry has been
hitting for twenty years. 

What changed is the same thing that changed natural language processing:
transformer architecture. When applied to text, transformers stopped
memorising words and started learning the structural relationships between
them: grammar, context, long-range dependencies. That shift is now happening
in biology. 

The question is not whether DNA has 'intention' in the way human language
does. It does not. But it does have structure (regulatory grammar, conserved
motifs, epistatic interactions between distant genomic regions) and that
structure can be learned from sequence data alone, at scale, without 
requiring every relationship to be manually annotated. 

That is the race we are in. Not to sequence more genomes; we have plenty. To
build a model that reads them with sufficient comprehension that a breeder, a
researcher, or a biotech company can ask a meaningful question and get a
biologically grounded answer. 

The HGP was the Apollo Programme: it proved we could get there. What we are
building is the infrastructure that makes the journey routine.

Why plants and why not the other two major domains? I assume this is on your
roadmap given your name is Living Models.

There is a strategic answer and a scientific one, and they point in the same
direction.

The question people usually ask is: why not start with human health, where 
the funding is deeper and the clinical outcomes are more visible? There are
four concrete reasons we went the other way. 

First: data access. Every plant genome we trained on is fully public. No
HIPAA, no GDPR, no patient consent frameworks, no biobank access 
negotiations, no institutional review boards. We assembled training data
covering thousands of plant genomes without a single legal dependency. 

In human genomics, building an equivalent dataset would require years of
regulatory navigation before the first model is trained. That asymmetry is
not a footnote; it is a fundamental structural advantage that let us move at
a speed that would have been impossible in a clinical context. 

Second: regulatory friction. Deploying a genomic model in human medicine 
means navigating the FDA, the EMA, and their equivalents across every market.
The evidentiary bar is rightly very high, and very slow.

In agriculture, the path from model output to field application is governed 
by plant variety registration frameworks that, while meaningful, operate on a
fundamentally different timescale. We can iterate, validate, and deploy in
years, not decades. 

Third: experimental velocity. In human biology, a failed prediction has
consequences that extend far beyond the experiment. 

In plant biology, we can design a trial, grow it out, and measure the result
in a single season. If a variant we predicted to confer drought tolerance
turns out to be irrelevant, we learn that in months, not years, and at a cost
measured in field plots rather than clinical trials. 

The feedback loop that improves the model is dramatically faster. Nobody
regulates what happens to a crop that underperforms. 

Fourth, and perhaps most important: urgency. Agriculture is the industry most
directly, most immediately, and most irreversibly affected by climate change.
Growing seasons are shifting. Drought and heat stress events that were once
rare are becoming baseline conditions in the world's breadbaskets. 

The varieties that will feed ten billion people by 2050 need to be bred for a
climate that does not yet exist at scale, which means we cannot wait for
twenty years of field trials to identify which genomic combinations are
relevant. 

The need for exactly what BOTANIC does (predicting biological function in
conditions outside the historical training distribution) is not a future use
case in agriculture. It is the defining problem of the sector right now.

As for fungi, microbiome, and the rest: Living Models is not a plant company.
We are a foundation model company for living systems. Plants are where the
structural advantages are highest and the urgency is greatest. The
architecture generalises. The name was chosen deliberately.

What prevents Bayer CropScience, Corteva, Syngenta, BASF, and Limagrain from
emulating what you're doing? And how did you match much larger teams? Are you
the DeepSeek of your category?

DeepSeek is a reasonable reference point, with one important clarification:
what made DeepSeek significant was not that it was cheap; it was that it was
architecturally efficient in ways that larger, better-resourced teams had not
prioritised.

The lesson is that in deep learning, the team closest to the problem often
moves faster than the team with the most capital. The same dynamic applies
here. 

The large agrochemical groups are extraordinary organisations. They run 
global breeding programmes, navigate complex regulatory environments across
dozens of markets, and manage supply chains of staggering scale. 

What they are structurally not built to do is frontier AI research, the kind
that requires hiring researchers from Huawei Noah's Ark Lab, Mila, Owkin, and
the École Normale Supérieure, and giving them the autonomy to redesign
training pipelines from scratch. That is a different institutional mode.

You do not acquire it by redirecting an IT budget. You build it over years, 
or you partner with someone who already has it. We expect many of the largest
seed companies to do the latter. 

On the IP question: we released BOTANIC as open weights deliberately, and the
logic is worth explaining precisely. The model weights are a snapshot. The
durable competitive asset is the flywheel that generates the next, better
snapshot: proprietary fine-tuning data accumulated through each customer
partnership, the feedback loops from real breeding programmes, and the
architectural improvements that compound over time. 

Every partnership we close with a major seed group produces training signal
that no competitor can replicate, because that phenotypic data (decades of
field trials, trait measurements, environment interactions) was never public
to begin with. Open weights accelerate the first step of adoption. 
Proprietary data pipelines create the moat that follows. 

As for acquisition: it is a real strategic option for the incumbents, and we
are aware of it. What it would confirm is that the capability cannot be built
internally at the pace required. That is itself a form of validation.

What could be the consequences of biological hallucinations, and what barriers
do you have to mitigate any risks?

I want to be precise here rather than reassuring, because the question
deserves precision.

BOTANIC operates as a hypothesis engine, not a decision system. When the 
model scores genomic variants for their likely contribution to drought
tolerance, it is prioritising a candidate list for experimental validation,
not issuing a planting instruction.

In a research setting, the consequence of an incorrect prediction is a wasted
experiment, typically weeks to months of work. That is a real cost, and we
take it seriously. 

The more significant risk operates at the industrial scale: a seed company
that allocates its R&D programme on the basis of systematically biased
predictions could misallocate resources over a multi-year breeding cycle
before the error surfaces in field data. 

Plant breeding runs on timescales of four to eight years from genomic
hypothesis to commercial variety. That is the error propagation window we
design against. 

Concretely, we do three things. First, uncertainty quantification is built
into every model output: predictions come with calibrated confidence
distributions, not point estimates, and we validate that calibration against
held-out genomic benchmarks documented in our bioRxiv technical report. 

Second, we explicitly flag low-coverage regions of genomic space where the
training distribution is thin and model confidence should be treated
sceptically. 

Third, our commercial deployments are integrated into existing breeding
workflows where domain experts make the consequential decisions. BOTANIC
accelerates the hypothesis generation step; it does not replace the
agronomist or the field trial.
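
The first two safeguards above can be sketched in a few lines of Python. This
is a hedged illustration, not BOTANIC's actual method: the `Prediction`
fields, the `support` count, the 90% interval level, and every number below
are assumptions invented for the example. The idea is simply that a stated
90% interval is well calibrated if, on a large held-out set, roughly 90% of
observed values fall inside it, and that predictions backed by little training
data can be flagged for sceptical treatment.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    variant_id: str
    lower: float      # lower bound of the stated 90% interval (assumed level)
    upper: float      # upper bound of the stated 90% interval
    support: int      # hypothetical count of similar training sequences


def interval_coverage(predictions, observed):
    """Fraction of held-out observations that fall inside their interval."""
    hits = sum(
        1 for p in predictions
        if p.lower <= observed[p.variant_id] <= p.upper
    )
    return hits / len(predictions)


def flag_low_support(predictions, min_support):
    """Return IDs whose training support is below the chosen threshold."""
    return [p.variant_id for p in predictions if p.support < min_support]


# Made-up numbers, purely for illustration.
preds = [
    Prediction("v1", 0.10, 0.40, support=120),
    Prediction("v2", 0.30, 0.70, support=4),
    Prediction("v3", 0.05, 0.20, support=75),
    Prediction("v4", 0.50, 0.90, support=2),
]
held_out = {"v1": 0.25, "v2": 0.65, "v3": 0.30, "v4": 0.55}

coverage = interval_coverage(preds, held_out)
suspect = flag_low_support(preds, min_support=10)
print(coverage, suspect)
```

Four data points are far too few to judge calibration; a real check would use
a large held-out benchmark and compare the observed coverage with the stated
interval level.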

The structural safeguard is the nature of the domain itself. Unlike a 
software system where a model error can propagate at machine speed through
millions of decisions, agricultural biology has human experts and 
multi-season validation cycles built into every step. We design for that
reality rather than trying to substitute it.

Can you give a real application? Would scientists talk to it like ChatGPT?
Can companies combine BOTANIC with proprietary data?

Concrete example: a wheat breeder wants to develop varieties resilient to the
kind of drought that devastated harvests in southern Europe in 2022. The
traditional approach means crossing thousands of
candidate lines, growing them through multiple seasons, and measuring which
ones survive. 

That is a process of five to twelve years from hypothesis to commercial
variety, with most candidates failing late. 

The existing computational toolkit (genomic selection models like GBLUP or
BayesC) already helps narrow that funnel. But those models work by learning
statistical correlations between marker combinations and measured phenotypes
within a specific training population. 

They require hundreds to thousands of phenotyped individuals per trait, they
degrade when you move to a different environment or genetic background, and
they are blind to biological mechanism. 

They will tell you that a particular haplotype block tends to co-occur with
drought tolerance in your historical data. They cannot tell you why, or
whether it will hold in a genetic background they have never seen. 

BOTANIC approaches the same problem from a different direction. Because it is
trained on raw genomic sequence across 1,600 plant genomes (not on
phenotype-marker associations) it learns the underlying biological structure:
regulatory grammar, conserved functional motifs, the long-range epistatic
interactions that classical models treat as noise. 

When applied to the breeder's candidate lines, it can prioritise variants 
that are biologically coherent, not just statistically associated, including
novel combinations absent from any historical training set. The experimental
programme then targets a far smaller, better-grounded set of candidates. 

The breeding cycle does not disappear, but its front end becomes dramatically
more efficient, and its predictions hold up further from the training
distribution. 

On the interface question: the primary environment is the computational
workflow that genomics researchers already use: sequence files, annotation
tracks, variant call formats. That is where the value is highest and the
integration is cleanest. 
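
For readers unfamiliar with the "variant call formats" mentioned here, the
following Python sketch reads the first five fixed columns of a VCF file
(CHROM, POS, ID, REF, ALT), which are tab-separated, with header lines
starting with `#`. The function name and the example records are made up for
illustration; a production workflow would normally rely on an established VCF
library rather than hand-rolled parsing, and nothing here describes how
BOTANIC itself ingests data.

```python
def parse_vcf_lines(lines):
    """Yield one dict per data line of a VCF; header lines start with '#'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, pos, var_id, ref, alt = line.split("\t")[:5]
        yield {
            "chrom": chrom,
            "pos": int(pos),          # 1-based position
            "id": var_id,
            "ref": ref,
            "alt": alt.split(","),    # ALT may list several alleles
        }


# Made-up records for illustration; the chromosome names are placeholders.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1042\t.\tA\tG\t50\tPASS\t.",
    "chr2\t877\trs_demo\tC\tT,G\t99\tPASS\t.",
]
variants = list(parse_vcf_lines(example))
print(variants)
```

Each parsed record is the kind of candidate variant that a scoring model could
then rank for experimental follow-up.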

On hybrid deployment: yes, and this is the architecture we run with 
enterprise customers. A major seed group typically holds decades of
proprietary phenotypic data (field trial results, trait measurements,
environment-specific performance records) that have never been combined with
a model capable of reasoning over the underlying genomics. 

We fine-tune BOTANIC on that dataset in a private deployment: the customer's
data does not leave their environment, the resulting model weights remain
their property, and what they get back is a model that combines general
biological knowledge from 1,600 plant genomes with deep specificity to their
crops, environments, and breeding objectives. 

The difference between that and a genomic database query is the difference
between a fluent domain expert and a search engine.

Link to news story:
https://www.techradar.com/pro/every-living-thing-on-earth-runs-on-the-same-programming-language-how-ai-foundation-models-trained-on-dna-could-transform-plant-biology

$$
--- SBBSecho 3.28-Linux
 * Origin: Capitol City Online (1:2320/107)

-----------------------------------------------------------