How AI can accelerate R&D for cell and gene therapies


Novel modalities carry huge potential.1 Within oncology, for example, cell therapy is expected to become the third-largest segment across all modalities (behind antibodies and small molecules) by 2030, with 35 percent CAGR in sales over 2021–30 (Exhibit 1). Gene- and RNA-based therapies, on the other hand, are unlikely to play a major role in the short to medium term, although there are currently more than one hundred such assets in Phase I–III studies.

Exhibit 1

Bringing novel cell and gene therapy (CGT) modalities to patients successfully remains challenging. Notable headwinds include the complexity and heterogeneity of the solution space, manufacturing and supply chain challenges (especially for personalized therapies), and the difficulty of appropriately matching therapies to the suitable patient endotypes. Moreover, while AI applications are taking off in the wider biopharmaceutical R&D context, companies are only starting to explore how to apply their potential to CGT.

There is significant untapped opportunity in the industry to scale AI within the CGT value chain. Biotechnology companies enabled by machine learning (ML) that focus on novel modalities are still rare. Moderna is perhaps the most mature example, with a strongly articulated ten-year vision to have digital and analytics at its core to boost its mRNA platform.2

In the past three to five years, additional earlier-stage companies—including Modulus Therapeutics, Outpace Bio, and Serotiny in the cell therapy space; Dyno Therapeutics and Patch Biosciences in the gene therapy and adeno-associated virus (AAV) space; and Anima Biotech in mRNA-based therapeutics—have started to emerge. While the fairly limited scale of CGT over the next ten years could slow the acceleration of AI-driven companies that focus purely on these modalities, the upside may be significant, given the recent wider acceleration of AI in biopharma R&D.

Applying AI to R&D for novel therapeutic modalities brings three principal challenges:

  • Limited experimental data availability and expensive data generation. Given the novelty and diversity of CGT, experimental data (both public and commercial) are limited. Generating experimental data for these new modalities from scratch is typically very expensive and time consuming. While this poses challenges to training large AI systems, ML approaches can help explore and exploit the vast design space of these modalities, saving time and avoiding the need to undertake unnecessary costly experiments. Such approaches also highlight the upside of establishing these novel modalities as platform tech, reinforcing learning across candidates.
  • Functional complexity. Because the new modalities are complex, with a potentially huge solution space, it is challenging to establish an accurate relationship between a sequence (DNA, RNA, or amino acid), its structural properties, and its observed functional behaviors, and ultimately to connect a design to the desired therapeutic behavior. Across this myriad of mechanistic layers, AI and ML techniques offer a way to go beyond purely expert-driven intelligence in understanding the drivers of experimental performance and in creating novel designs. They do, however, need to account for the potential compounding of errors across the different layers.
  • Separation between wet-lab and in silico research. In silico drug discovery requires a different skill set than the deep expertise required for CGT wet-lab experimentation. The various teams often work alongside each other in silos rather than together, with different scientific objectives, timelines, and incentives, and with suboptimal sharing of data and insights. To reap the benefits of AI for these complex modalities, a closed-loop research system is required in which wet-lab and in silico research are intricately interwoven and build on each other.

Despite these challenges, using AI in R&D could further accelerate CGT innovation. The field is maturing rapidly and has started to receive an influx of talent and venture funding, with further proof points for its applicability and scalability expected soon. What, then, are the relevant use cases?

Where are the unique opportunities to apply AI along the R&D value chain for novel modalities?

Let’s explore three different novel pharma modalities: mRNA-based therapeutics and vaccines, viral therapeutics (such as AAV gene therapy), and ex vivo therapeutics, focusing on chimeric antigen receptor (CAR) T cells. AI can facilitate development of a novel therapy throughout the R&D value chain in a variety of stages, including target identification, payload design optimization, translational and clinical development, and end-to-end (E2E) digitization (see sidebar, “Summary of major AI use cases across the cell and gene therapy value chain”).

Target identification

Applying AI to R&D for CGT begins with target identification. Here, the biggest challenge centers on selecting the appropriate target to optimize the probability of therapeutic success. Given the heavily personalized nature of most CGT and significant resource investment downstream, it is critical to have robust algorithms that enhance both speed and accuracy at this stage. AI and ML models can be used in various ways.

For viral therapeutics that aim to edit the genome, algorithms to predict CRISPR target sites can help identify genomic sites with genetic sequences or epigenetic features that permit increased efficiency of editing with minimal off-target activity. Older algorithms are hard coded to predict sites based on a set of known binding rules. Newer models based on ML and deep learning are trained on real-world experimental data and outperform older models.3
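To make the ML-based approach concrete, the sketch below trains a tiny logistic regression on one-hot-encoded guide sequences and uses it to rank candidate sites. The "editing efficiency" labels are synthetic (a toy GC-content rule standing in for real experimental data), and the model is deliberately minimal; production models are trained on large experimental datasets with far richer sequence and epigenetic features.

```python
import math
import random

BASES = "ACGT"

def one_hot(guide):
    """Encode a guide sequence as a flat one-hot feature vector."""
    return [1.0 if base == b else 0.0 for base in guide for b in BASES]

def make_synthetic_dataset(n, length=20, seed=0):
    """Toy stand-in for experimental editing-efficiency data: label a
    guide 'efficient' (1) when its GC content exceeds 50 percent."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        g = "".join(rng.choice(BASES) for _ in range(length))
        data.append((g, 1 if sum(b in "GC" for b in g) / length > 0.5 else 0))
    return data

def train_logistic(data, epochs=200, lr=0.1):
    """Minimal logistic regression trained by stochastic gradient descent."""
    dim = len(one_hot(data[0][0]))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for guide, y in data:
            x = one_hot(guide)
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def score(guide, w, b):
    """Predicted probability that a candidate site edits efficiently."""
    x = one_hot(guide)
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

train = make_synthetic_dataset(400)
w, b = train_logistic(train)

# Rank unseen candidate sites by predicted editing efficiency.
candidates = make_synthetic_dataset(50, seed=1)
ranked = sorted(candidates, key=lambda gl: score(gl[0], w, b), reverse=True)
```

The same train-then-rank pattern carries over when the labels come from real editing assays and the features include chromatin accessibility or off-target context.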

For therapies that aim to harness the immune system to target specific cancer cells or pathogens (such as mRNA-based vaccines or CAR T-cell therapies), AI and ML can be used to predict tumor epitopes that could be bound by the therapeutic molecule. For CAR T-cell therapies, for example, AI and ML can be used to facilitate the identification of appropriate antigens and binding sites, thereby enabling the design of CARs that have improved on-target activity and minimal cytotoxicity.4

Algorithms that predict protein structure (such as the AlphaFold Protein Structure Database and system) can be used to model how patient-specific mutations affect protein structure and thus CAR binding. Newer functional foundation models (such as ProteinBERT) go beyond structure to estimate functional properties of interest directly.5 Once a set of possible candidates has been identified, AI and ML can be used to facilitate mass in silico screening of thousands of CAR constructs to identify candidates with high tumor-specific binding affinity and a concomitant ability to activate the immune system.
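A minimal sketch of that screening step might look as follows. The affinity and cytotoxicity scores here are random placeholders for the outputs of trained sequence or structure models, and the thresholds are illustrative assumptions; the point is the shape of the filter-and-rank workflow over thousands of in silico candidates.

```python
import random

rng = random.Random(42)

# Hypothetical model outputs for 5,000 in silico CAR constructs:
# predicted tumor-binding affinity (higher is better) and predicted
# off-tumor cytotoxicity (lower is better). Real scores would come
# from trained sequence/structure models, not random draws.
constructs = [
    {"id": i, "affinity": rng.random(), "cytotoxicity": rng.random()}
    for i in range(5000)
]

def shortlist(constructs, min_affinity=0.8, max_cytotoxicity=0.2, top_k=25):
    """Filter on both criteria, then keep the top-k candidates by affinity."""
    passing = [
        c for c in constructs
        if c["affinity"] >= min_affinity and c["cytotoxicity"] <= max_cytotoxicity
    ]
    passing.sort(key=lambda c: c["affinity"], reverse=True)
    return passing[:top_k]

hits = shortlist(constructs)  # candidates to advance to experimental validation
```

In practice the shortlist would feed directly into wet-lab validation, whose results can retrain the scoring models.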

Similar techniques are relevant for constructing personalized mRNA- or DNA-based cancer vaccines: they identify the antigens of an individual’s tumor that could elicit the desired immune system response (for example, through epitope prediction). Spatial transcriptomics—visualizing gene expression at different tumor locations at single-cell resolution—brings a spatial dimension to these efforts, facilitating the understanding of interactions among cell subtypes to find novel targets for cancer therapy discovery.

Payload design optimization

After the identification of an appropriate lead target, the next stage involves optimizing payload design. Here, the challenge is to modulate the functional activity and tissue specificity of the therapeutic molecule while minimizing unwanted effects (such as activation of the immune system). AI and ML models can be used to screen high numbers of candidates rapidly and select designs that fulfill the desired criteria, similar to their use in target identification.


To be most effective, the models should be part of an AI-enabled closed-loop research system, with initial primary screening results automatically fed into an ML pipeline. This pipeline starts to learn how the assay responds to each payload based on its computational features. It then suggests a next batch of optimized payload candidates for experimentation. Resulting experimental data are in turn automatically fed back to continue the learning, closing the research system.
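The learn-and-propose core of such a loop can be sketched with a toy one-dimensional design space. The "assay" below is a hidden synthetic response curve, the surrogate is a deliberately crude nearest-neighbor model, and the upper-confidence-bound acquisition is one of several common choices; a real system would use richer surrogates (for example, Gaussian processes) and lab automation in place of the simulated assay.

```python
import random

random.seed(0)

def run_assay(x):
    """Stand-in for a wet-lab assay; the true response curve (optimum
    at x = 0.7) is hidden from the learning loop."""
    return -(x - 0.7) ** 2 + random.gauss(0, 0.005)

def predict(x, observed):
    """Crude surrogate: the value of the nearest measured design, with
    uncertainty growing with distance to that measurement."""
    xn, yn = min(observed, key=lambda o: abs(o[0] - x))
    return yn, abs(x - xn)

def propose_batch(candidates, observed, batch_size=3, kappa=1.0):
    """Upper-confidence-bound acquisition: trade off exploiting good
    predictions against exploring poorly sampled regions."""
    def ucb(x):
        mean, uncertainty = predict(x, observed)
        return mean + kappa * uncertainty
    return sorted(candidates, key=ucb, reverse=True)[:batch_size]

candidates = [i / 100 for i in range(101)]             # discretized design space
observed = [(x, run_assay(x)) for x in random.sample(candidates, 5)]

for _ in range(5):                                     # five design-test-learn cycles
    batch = propose_batch(candidates, observed)
    observed.extend((x, run_assay(x)) for x in batch)  # assay results fed back

best_x, best_y = max(observed, key=lambda o: o[1])
```

With only 20 measurements, the loop homes in on the high-activity region far faster than a uniform screen of the full grid would.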

For the closed loop to work, at least three elements should be in place:

  • The pace and throughput of each cycle need to be high enough (with thousands of candidates per step) to enable iteration at speed. The system is only as strong as its weakest link: experimental setup, payload synthesis, assay ordering, experiment execution, data collection, data structuring, and ML analysis should flow seamlessly into each other. The process can often be enabled by E2E digitized workflows.
  • Different teams and capabilities within the research system (for example, computational groups and experimentalists) need to work together effectively, sharing objectives and incentives, and to be open to learning from one another.
  • Scalable tech and data infrastructure (including smart data governance) supporting these workflows need to be in place to allow for large data volumes and high computational loads.

Exhibit 2 illustrates how different computational and ML components could work within a closed loop for CGT lead optimization. Starting from the actual payload design (DNA, RNA, or protein), it is important to be able to explore the allowed design space computationally through in silico mutations. From there, molecular structure can be computationally inferred and a whole range of payload properties predicted. Finally, payload function can be measured through the relevant assays, whether via genome-editing activity assays, transcriptomics, protein expression, or tissue specificity. The results can then be linked back to the original sequence, structure, and properties to understand (via ML) what drives function and to suggest new payload designs to test.

Exhibit 2
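The first component of that loop, computational exploration of the allowed design space, can be as simple as enumerating the single-point mutational neighborhood of a payload sequence. The parent motif below is hypothetical; real pipelines would also enumerate multi-site and insertion or deletion variants and filter candidates by manufacturability constraints.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def single_point_mutants(sequence, positions=None):
    """Enumerate all single-amino-acid substitutions of a payload
    sequence, optionally restricted to designated positions."""
    positions = positions if positions is not None else range(len(sequence))
    mutants = []
    for i in positions:
        for aa in AMINO_ACIDS:
            if aa != sequence[i]:
                mutants.append(sequence[:i] + aa + sequence[i + 1:])
    return mutants

# Explore the design space around a short, hypothetical payload motif:
# 10 positions x 19 alternative residues = 190 candidate designs.
parent = "MKTAYIAKQR"
neighborhood = single_point_mutants(parent)
```

Each mutant in the neighborhood would then be passed to the structure- and property-prediction components before the most promising ones are synthesized.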

Delivery vehicle design could similarly be part of an AI-enabled closed-loop research system. For instance, AI and ML could be used in vehicle design to increase AAV capsids’6 tissue specificity, load capacity, and stability.

A similar concept applies to lipid nanoparticles, although the backbone is chemistry based and exploring the relevant design space is exponentially harder.

The development of chemistry, manufacturing, and controls (CMC) processes for these novel pharma modalities might be particularly well suited to an in silico process development approach, given the modalities’ platform-like nature and the relative independence of each molecule design. This approach encompasses the virtual design of production methods and equipment (instead of extensive lab optimization and screening experiments) to optimize production processes using a digital twin. The digital twin is built using a mechanistic model of each process step and complemented by statistical models based on previous process runs to reduce development costs, enable rapid scale-up and minimal tech transfer, and accelerate time to market.
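A hybrid digital twin of a single process step can be sketched as a mechanistic model corrected by a statistical term fitted on historical runs. The first-order kinetics, parameter values, and observed titers below are all illustrative assumptions; real twins chain many such step models together.

```python
import math

def mechanistic_yield(t, k=0.30, y_max=100.0):
    """Mechanistic model of one process step: first-order product
    formation approaching a maximum titer (toy kinetics)."""
    return y_max * (1.0 - math.exp(-k * t))

# Titers observed in previous runs at harvest time t = 10 h; the real
# process slightly underperforms the idealized mechanistic model.
previous_runs = [88.5, 90.1, 87.8, 89.4]

def fit_correction(t, observations):
    """Statistical complement: a multiplicative correction factor
    estimated from historical runs (least squares for a single scalar
    reduces to the ratio of means)."""
    return sum(observations) / (len(observations) * mechanistic_yield(t))

def digital_twin_yield(t, correction):
    """Hybrid prediction: mechanistic model scaled by the fitted term."""
    return correction * mechanistic_yield(t)

c = fit_correction(10.0, previous_runs)
forecast = digital_twin_yield(10.0, c)  # predicted titer for the next run
```

The mechanistic backbone makes the twin extrapolate sensibly to new scales and conditions, while the data-driven correction absorbs the systematic gap between idealized kinetics and the real process.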

Translational and clinical development

During the translational and clinical development stage, AI and ML can assist in getting CGT to the clinic by minimizing safety risk in clinical trials and increasing the overall probability of success. Preclinically, this starts with finding translational biomarkers indicative of future trial success, as well as a way to simulate patient heterogeneity through more complex preclinical assays. Although using AI to optimize trial design is not specific to novel modalities, it may be of particular importance given their association with typically small patient population sizes, long treatment processes, and potential for severe adverse events.

AI and ML algorithms can help identify the right patients, estimate optimal dosing, and predict severe adverse events based on patient profile and real-world data on response to similar treatments. Models can be trained to screen patient records for comorbidities and to use genetic profiles to identify the patient subgroups that will have the greatest response to the therapy. To enable this type of precision medicine, building up large integrated clinicogenomic databases for disease areas of interest is required.
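At its simplest, the record-screening step reduces to applying eligibility and risk rules over structured clinicogenomic records, as in the sketch below. The records, the exclusionary comorbidity list, and the single genetic marker are all hypothetical; real pipelines would combine such rules with trained response-prediction models over far richer data.

```python
# Hypothetical, highly simplified patient records; a real integrated
# clinicogenomic database would hold far richer structured and
# unstructured data per patient.
patients = [
    {"id": "P01", "comorbidities": ["hypertension"], "marker_positive": True,
     "prior_severe_ae": False},
    {"id": "P02", "comorbidities": ["cardiomyopathy"], "marker_positive": True,
     "prior_severe_ae": False},
    {"id": "P03", "comorbidities": [], "marker_positive": True,
     "prior_severe_ae": True},
    {"id": "P04", "comorbidities": [], "marker_positive": False,
     "prior_severe_ae": False},
    {"id": "P05", "comorbidities": ["hypertension"], "marker_positive": True,
     "prior_severe_ae": False},
]

# Comorbidities assumed (for illustration) to predict severe adverse events.
EXCLUSIONARY = {"cardiomyopathy"}

def eligible_responders(records):
    """Flag patients whose genetic profile suggests response and whose
    record carries no high-risk comorbidity or prior severe event."""
    return [
        p["id"] for p in records
        if p["marker_positive"]
        and not p["prior_severe_ae"]
        and not (set(p["comorbidities"]) & EXCLUSIONARY)
    ]

cohort = eligible_responders(patients)  # ["P01", "P05"]
```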

End-to-end digitization

Finally, digitization across the entire E2E chain can add value—for example, by linking data from preclinical studies to trials, CMC readouts, and manufacturing batch records, allowing the tracing of a therapeutic design from its inception onward. It can also facilitate long-term tracking and certification of patient outcomes, which are important for establishing patient, healthcare provider, and payor confidence.

Long-term follow-up may also become important as innovative payment models arise to address CGT-specific payer challenges. Finally, detailed tracking of the E2E supply process can improve patient safety and outcomes. This is particularly important for personalized CAR T-cell therapy, for which maintaining a clear chain of identity and custody is essential to ensure that a patient receives their own modified cells.8
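One way to make the chain of identity checkable in software is to hash-chain each custody event to the previous one, so that any mix-up or retroactive edit is detectable before infusion. The step names and identifiers below are illustrative; production systems layer this kind of check on validated track-and-trace platforms.

```python
import hashlib

def record_step(log, step, patient_id, batch_id):
    """Append a custody event, hash-chained to the previous one so that
    any later tampering or record mix-up is detectable."""
    prev_hash = log[-1]["hash"] if log else "genesis"
    payload = f"{prev_hash}|{step}|{patient_id}|{batch_id}"
    log.append({
        "step": step, "patient_id": patient_id, "batch_id": batch_id,
        "hash": hashlib.sha256(payload.encode()).hexdigest(),
    })

def verify_chain_of_identity(log, patient_id):
    """Check both chain integrity and that every custody step references
    the same patient before the infusion is released."""
    prev_hash = "genesis"
    for entry in log:
        payload = f"{prev_hash}|{entry['step']}|{entry['patient_id']}|{entry['batch_id']}"
        if hashlib.sha256(payload.encode()).hexdigest() != entry["hash"]:
            return False
        if entry["patient_id"] != patient_id:
            return False
        prev_hash = entry["hash"]
    return True

log = []
for step in ("apheresis", "transduction", "expansion", "release", "infusion"):
    record_step(log, step, patient_id="PT-0042", batch_id="B-17")

ok = verify_chain_of_identity(log, "PT-0042")        # True: intact chain
mismatch = verify_chain_of_identity(log, "PT-9999")  # False: wrong patient
```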

Getting the emerging AI opportunity right: Balancing partnership and internalization

The CGT AI opportunity is predicated on operating within an industrialized framework, allowing for scalability, adaptability, and sustainability. This includes an experimental data generation engine that is both well oiled and tightly embedded in a closed loop to cope with long and expensive manufacturing timelines. Data across the value chain (for example, between research and CMC) need to be easily linkable, as fields are much more interconnected and interdependent than for classic modalities, with potentially significant variations on a batch-by-batch basis. This includes a focus on designing E2E ML operations (MLOps) solutions, integrated into the research system and driven by user experience. Finally, specific data science, engineering, chemistry, functional biology, and disease expertise could come together to tackle challenges at the edge of scientific understanding.


Companies are putting these enablers in place in different ways, each with upsides and downsides. Broadly, they are pursuing three main approaches—externalization of capabilities, selective partnership, and internalization of capabilities—across a spectrum of collaboration with biotech start-ups, each involving different risk profiles, talent considerations, and potential width of capabilities. Of course, a few companies take a mixed approach across archetypes, depending on the modality or therapeutic area.

Externalization of capabilities

Some biopharma companies active in CGT opt to externalize capabilities in applying AI and ML to their R&D processes. Given that these technologies are at an early stage, an advantage of this approach is that it derisks and compartmentalizes: the company leverages the technologies through a partner with the right expertise and talent, within a well-defined scope and set of milestones, sharpening focus and moving more rapidly. This is especially relevant for novel modalities, which have an unproven record and greater inherent drug discovery risk.

However, there is no buildup of internal AI and ML capabilities, plus a risk of the biotech start-up learning and benefiting more from the partnership than the other way around, including potential loss of intellectual property. In short, while outsourcing AI capabilities could be a straightforward strategy in the short term to minimize a company’s risk or could be an option for modalities outside of a company’s core focus, this does pose the real risk of losing scientific edge within a company’s core R&D engine over the long term.

Selective partnership, with future internalization of capabilities

Other biopharma companies use a selective-partnering approach with a clear path toward internalization of capabilities. The approach’s advantages are similar to those of the externalization archetype, offering a way to tap quickly into the best expertise and talent available while derisking and maintaining focus. Moreover, there is a clear (albeit longer) path toward internalizing these capabilities and the talent supporting them. However, it also means there is likely limited incentive to be at the forefront of innovation and, internally, a lack of focus on building company-wide assets and capabilities.

Internalization of capabilities

A third group works to develop and internalize the capabilities to set up AI-enabled closed-loop research systems for novel modalities. If done right, this archetype allows for a broad base of digital, data, and analytics capabilities, which can power a company-wide R&D transformation. The focus could typically be on tech that is transversally applicable and generalizable across many teams, such as automated image segmentation and labeling and protein-structure prediction. This industrialized internal backbone could then allow the company to plug in cutting-edge external technologies.

Disadvantages are typically an overreliance on internal expertise, leading to a slower innovation pace; slow buildup of necessary and sparsely available talent; conflicts with existing R&D priorities; endless proofs of concept that never bring the solution to users at scale; and a tendency toward long, parallel transformation programs at high cost. One way to overcome these is to apply a methodology based on quarterly value releases. It starts from a specific business or scientific need for which there is conviction that a digital or analytics solution could deliver value. It aims to bring horizontal building blocks together vertically across teams (such as blueprint, data, analytics, tech, and change management groups) and to rigorously deliver value to end users in short 90-day cycles. End users are involved along the way to define the need and cocreate the solution.

The road ahead

Opportunities for applying AI are coming of age now—with growing examples of impact—at a tipping point supported by an explosion of biological data, increasing computational power, next-generation in vitro models, wet-lab automation, and strong initial clinical proof points. Moreover, the next five years will be critical to prove the sustainability of CGT as broadly applicable therapeutic modalities.

For oncology alone, more than 500 assets based on complex modalities are currently in preclinical and clinical development, and as many as 80 could get to market by 2030. Embedding digital and analytics in R&D is crucial to making this a success and to capturing value for patients. AI and advanced analytics are poised to become vital enablers for boosting the return on R&D spending in the CGT value chain by increasing speed, reducing clinical failures, cutting costs across the R&D value chain, and enabling sustainable tech platforms.
