Better data for better therapies: The case for building health data platforms


The past decade has seen an important and, for many patients, life-changing rise in the number of innovative new drugs reaching the market to treat diseases such as multiple sclerosis, malaria, and subtypes of certain cancers (such as melanoma or leukemia). In the United States, the Food and Drug Administration approved an average of 41 new molecular entities (including biologic license applications) each year from 2011 to 2020, almost double the number in the previous decade.1 Despite the immense costs of such achievements,2 biopharmaceutical researchers have not yet uncovered a universal, systematic approach to generate many more breakthroughs or predict which interventions will work for which patients.

A major barrier is the daunting challenge of understanding the multifactorial nature of many diseases coupled with the vast set of variables in therapy design. Very few diseases (cystic fibrosis is one) are linked to variants in single genes. Drug development therefore tends to rely on a reductionist, hypothesis-driven approach that narrows the focus to individual cell types or pathways. Focused assays, often based on partial information or informed by animal models that never perfectly reflect human disease, then attempt to identify single molecules that will benefit patients. Inevitably, that approach can be both slow and prone to bias, and most breakthroughs are serendipitous rather than systematic, as companies largely rely on imperfect preclinical models that are not predictive enough even for diseases with relatively simple pathology.3

It has long been hoped that new computational methods from the fields of artificial intelligence (AI) and machine learning (ML) would help scientists bring about a step change. Activity and investments by venture capital firms and pharmaceutical companies in these areas have grown substantially,4 and there have been some astonishing breakthroughs in the understanding of the causes of some indications, often as a result of leveraging new or greater volumes of data generated by whole-genome sequencing, high-throughput assays, and clinical or real-world evidence. Genetic insights, for example, helped identify PCSK9 as a hyperlipidemia modulator and have enabled new, tumor-agnostic oncology therapies. A better understanding of the role of immunological mechanisms has also introduced a new paradigm in cancer treatment, and similar concepts are now being applied to fibrosis, degenerative diseases, and more targeted immunotherapies.5

Such breakthroughs remain relatively rare, however. Despite expanding development pipelines,6 many pharmaceutical companies find themselves focusing on the same limited number of derisked areas and mechanisms of action in, for example, immuno-oncology. This “herding” reflects the challenges of advancing understanding of disease and hence of developing novel therapeutic approaches. The full promise of innovation from data, AI, and ML has not yet materialized.

It is increasingly evident that one of the main reasons for this is insufficient high-quality, interconnected human data that go beyond just genes and corresponding phenotypes—the data needed by scientists to form concepts and hypotheses and by computing systems to uncover patterns too complex for scientists to understand. Only such high-quality human data would allow deployment of AI and ML, combined with human ingenuity, to unravel disease biology and open up new frontiers to prevention and cure. Here, therefore, we suggest a way of overcoming the data impediment and moving toward a systematic, nonreductionist approach to disease understanding and drug development: the establishment of trusted, large-scale platforms that collect and store the health data of volunteering participants. Importantly, such platforms would allow participants to make informed decisions about who could access and use their information to improve the understanding of disease.

Data are the key to a leap in understanding disease

Tech companies have transformed marketing through the use of computational models to analyze huge amounts of data about individual consumers and their buying behavior and then to deploy that information to adjust offers dynamically. Similarly, computational techniques and big data could transform drug R&D by shedding light on biological complexity to build holistic and reproducible disease models and develop more specific scalable therapies.

The power of computational science in drug discovery is becoming clear. DeepMind’s AI algorithm, AlphaFold,7 has made progress in predicting a protein’s structure from its amino acid sequence. Computational techniques, coupled with the ability to collect huge amounts of data on biological systems, have dramatically increased the accuracy of predictions on whether biological molecules will be efficacious, even without an understanding of the underlying biological concepts. For example, generative adversarial networks, an in silico methodology, can suggest molecules that might be effective against a disease, and these predictions can then be verified in vitro and in vivo. Yet until research sheds light on the fundamental mechanisms that explain if and why a drug candidate is efficacious and safe, drug discovery will remain an act of chance because so many candidates will turn out to be neither.

The interplay of AI and ML systems, insights guided by scientists, and the generation of evidence could change that model. The main barrier—according both to traditional pharmaceutical companies and new, technology-driven ones—is the lack of enough reliable, longitudinal, and interconnected health data.

Although there are reams of health data, many are of low quality because they are hard to validate, and they are fragmented because they sit on many different platforms. Moreover, the quantity of these data makes them almost impossible for any purely human effort to categorize, clean, and label, so they are hard to access. Data privacy and security concerns exacerbate the challenge. The hurdles are therefore twofold: to generate and collect sufficient good-quality data and to link or combine the data sets in a single, secure source suitable for ML and AI. In this respect, health data platforms could be a significant and exciting step forward.

The role of health data platforms

Even before the COVID-19 pandemic, individuals generated large amounts of health-relevant data from wearables, geolocation services, online symptom searches, and consumer-grade genetic testing. They also generated large volumes of more structured data, such as insurance claims, pharmacy purchases, and electronic health records. In addition, many people take part in clinical or academic research, including research to collect high-quality multi-omics data, regarded as essential to progress in disease understanding and drug development.

Today, still more data are being generated as a result of the accelerated digitization of healthcare during the pandemic. And there is much wider acceptance—among patients, physicians, health systems, and regulators—of digital health in the form of telemedicine, wearables, and the virtual monitoring of clinical trials, for example. The sense of urgency that the pandemic unleashed has also encouraged more data sharing, which has improved the quality of data thanks to higher levels of scrutiny.

To be clear, there are many national and international health data initiatives to collect structured, high-quality data securely. For example, the 1+Million Genomes project involving 22 EU countries, the United Kingdom, and Norway aims to give access to more than one million sequenced genomes this year.8 And Denmark has a long tradition of collecting data from records of treatment, medicine, and diagnosis that can be linked for every citizen.9 Health data companies are also accumulating and combining existing data from patient records and real-world evidence. Nevertheless, data gaps remain. Relatively little omics data beyond the genome is collected: data on the expressome, proteome, or microbiome, for example (Exhibit 1). Longitudinal data will be key too: data from numerous sources, including social media and wearables, and from different points in time, to capture early symptoms and the manifestation of diseases and comorbidities. Such data can help not only to identify individual novel molecular targets but also (as important and perhaps a bigger challenge) to understand all disease-related signals and hence the optimal therapeutic approach. (Ultimately, greater depth of data on any individual might be more important than data on an ever-greater number of individuals.)

Yet the foundations exist for what could be a hugely valuable homogeneous, longitudinal, quantitative, and qualitative data vault, if existing and new data could be linked and shared with respect for privacy. Hence the requirement for secure, trusted data platforms where participants would agree to deposit their health data on the understanding that they would receive a copy of each recorded data point. That would help them make informed decisions about giving third parties, such as public or private pharmaceutical researchers, the right to access and use this information.

Privacy and security will of course be a challenge. But technologies, such as blockchain, that enable secure data transactions and smart contracts are fast maturing. What’s more, unlike many of today’s data-based applications, which require users to give blanket approval for the use of their data, newer technologies could give participants the choice to approve or deny each data-sharing request. Both developments make sharing data more attractive.
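The difference between blanket approval and per-request approval can be illustrated with a minimal sketch. The class, method names, and request identifiers below are hypothetical and not drawn from any existing platform, and the sketch does not model blockchain or smart contracts; it only shows a default-deny record of individual decisions.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentLedger:
    """Hypothetical per-participant record of data-sharing decisions."""

    participant_id: str
    # Maps a request identifier to the participant's decision.
    decisions: dict = field(default_factory=dict)

    def decide(self, request_id: str, approve: bool) -> None:
        """Record an explicit approve/deny decision for one request."""
        self.decisions[request_id] = approve

    def may_access(self, request_id: str) -> bool:
        """Default deny: a request with no recorded decision grants nothing."""
        return self.decisions.get(request_id, False)


# Illustrative usage with made-up request identifiers.
ledger = ConsentLedger("participant-001")
ledger.decide("dementia-study-omics", approve=True)
ledger.decide("marketing-analytics", approve=False)
```

Under this assumed design, a request the participant has never seen is treated the same as one that was denied, which is the opposite of blanket approval.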

Several initiatives have met with early success. The Swiss nonprofit MIDATA,10 for example, is a platform where people can give research projects access to their data. Estonia aggregates the electronic health records of all its citizens, using blockchain to make them secure and accessible with a simple electronic ID card.11 To fully unlock opportunities for systematic disease understanding, initiatives will need to be on a much larger scale, however.

The benefits of secure health data platforms

The potential benefits of large-scale data platforms are far-reaching.

First, by amassing and monitoring data from both healthy people and those in the early and later stages of disease, these platforms could enable clinicians to distinguish more clearly between disease states and therefore to diagnose or even predict and preempt a disease with the help of a more precise progression taxonomy.

Second, good data would promote a less reductionist approach to understanding disease biology and generating hypotheses for treating previously untreatable conditions, such as dementia, or for addressing the still unmet needs of patients with treatable diseases: drug-resistant disease variants or complications, for example.

Finally, harnessing large-scale human health platforms could broaden the scope of medical innovation by shedding light on how factors (such as behavior, nutrition, surgery, or medical devices) beyond prescribed drugs could be used and sequenced to treat individual patients in more holistic and sustainable ways.

The overall result could be a faster, less risky (that is, more predictive) R&D process in which the knowledge and skills of scientists, combined with in silico and in vitro tests, would drive faster cycles for generating and validating ideas. Higher-quality and more complete real-world evidence could enrich this process further by complementing and extending the health data platform, creating the potential for further insights and unleashing a self-propelling loop of generating evidence and improving the management of disease.

The value at stake: An example from neurology

Certain kinds of data will be relevant for understanding many diseases. But a deep understanding of specific ones—an understanding deep enough to uncover actionable therapeutic insights—would probably require patient data related to specific diseases. Building platforms with sufficient data to cover all diseases would still therefore appear to be a dream. A more feasible starting point would be a disease-specific platform.

Costs are a significant consideration. To illustrate the costs of building such a platform, we looked at neurology, an area of high unmet need. We estimate that it would cost around $27 billion to build a platform with relevant omics measurement data for some one million people (an estimated minimum for systematic and scalable drug discovery), as well as imaging data and qualitative assessments, such as functional or cognitive tests (Exhibit 2). The $27 billion would not include the cost of IT infrastructure, of enrolling participants, or of complementing the generated data with existing data from health records or real-world evidence, all of which are, in comparison, relatively affordable, especially if we assume greater digitization and interoperability of health systems in the future. (The sidebar, “The cost of a better understanding of disease,” explains the calculation in more detail.)

The sum is large: 1.5 times greater than the estimated amount the global biopharmaceutical industry spends each year on neurology R&D, for example.12 However, we estimate that such a platform could have a direct, positive economic impact of some $25 billion a year within 20 to 25 years, even assuming relatively modest progress in capturing the potential for innovation.13 (See Prioritizing health: A prescription for prosperity, July 8, 2020; and The Bio Revolution: Innovations transforming economies, societies, and our lives, May 13, 2020.) Exhibit 3 shows that the biggest opportunity lies in helping people to live longer, healthier lives (with a resulting annual GDP increase of $12.3 billion).

The second-biggest opportunity lies in the discovery and development of novel therapeutics for neurological diseases currently untreatable because the underlying biology isn’t understood. If novel therapies removed just 1 percent of the global disease burden associated with neurological disorders by around 2045, the economic impact would be striking: an estimated $9.7 billion a year.

A third opportunity (of $2.8 billion) lies in more efficient R&D: the identification of additional promising drug candidates, the successful repurposing of drugs, better patient stratification in clinical trials, and less animal testing (which is reported to be a weak indicator of clinical outcomes and is due to be banned by the US Environmental Protection Agency by 2035).14

Gains of such magnitude could eventually amortize the initial costs many times over. Some 90 percent of the 20,000 or so existing proteins have not been touched by pharmacology—a huge potential value pool. Add in improved and novel modalities, and the scientific breakthroughs for many disease-related proteins once considered undruggable could be dramatic.
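The arithmetic behind these figures can be checked with a short sketch. The dollar amounts are those quoted above. The sketch reads "1.5 times greater" as a factor of 1.5, and the recoup calculation is an illustrative assumption of ours (undiscounted, and counted only from the point at which the full annual impact is reached), not part of the original estimate.

```python
# Figures quoted in the text, in US$ billions.
platform_cost = 27.0
annual_impact = {
    "longer, healthier lives (GDP increase)": 12.3,
    "novel therapeutics (1 percent of disease burden)": 9.7,
    "more efficient R&D": 2.8,
}

# The three opportunities should add up to the "some $25 billion a year".
total_annual_impact = round(sum(annual_impact.values()), 1)

# Assumption: the platform cost is taken as 1.5 times annual neurology
# R&D spending, which gives the implied size of that spending.
implied_annual_rnd = platform_cost / 1.5

# Illustrative assumption, not from the article: years of full annual
# impact needed to cover the build cost, ignoring discounting, the
# excluded costs, and the 20-to-25-year ramp-up.
years_to_recoup = platform_cost / total_annual_impact

print(f"Total annual impact: ${total_annual_impact} billion")
print(f"Implied annual neurology R&D spend: ${implied_annual_rnd} billion")
print(f"Years of full impact to recoup the cost: {years_to_recoup:.2f}")
```

On these assumptions the three opportunities sum to $24.8 billion a year, consistent with the rounded $25 billion, and roughly one year of full impact would cover the build cost, which is the sense in which the gains could amortize the initial costs many times over.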

Moving toward health data platforms and a data-driven understanding of disease

The first step toward building health data platforms of the type described here would be to prove the concept by establishing core principles and demonstrating the benefits. There are two possible ways of proceeding. The first would be to coordinate and link existing country-level initiatives to establish standards and platforms. The other would be to set up a pilot in an area of high unmet need, such as dementia. To date, neurological and mental-health disorders have proved extremely challenging to treat, and that has limited R&D efforts in these fields. This makes them a good testing ground, given the societal and health burden they impose, though we are not suggesting the data on all new platforms would have to be as comprehensive (and hence as costly) as the neurological example outlined here. There is reason to believe smaller projects could still deliver significant benefits if well designed.

We also recognize the challenges of building health data platforms: defining a clear and holistic approach to data ownership and usage, data privacy and security, and the potential for monetization, for example. But such challenges can be addressed with the help of a range of organizations, including pharmaceutical and biotech companies, tech giants, governments, foundations, and other nongovernmental organizations. Many of the data collection initiatives afoot have been directed by public, nonprofit, and public–private bodies. But the most effective approaches will probably be collaborative, combining the different strengths of different stakeholders and harnessing their credibility, resources, societal reach, and technological know-how. Researchers too have a part to play, since they and their organizations must start adjusting their approach to drug discovery, moving away from a reductionist approach by harnessing all available data. To this end, interdisciplinary teams will have to be formed, drawing on the power of both science and computational techniques. It’s time to start putting our health data to work for a better understanding of disease.


Breakthroughs in the biological sciences and advances in computing—particularly AI and ML—are accelerating progress at each stage of the pharmaceutical R&D value chain. Yet the coming of a new era of biopharmaceutical innovation arguably requires making much more progress at the very first step in that chain—the understanding of disease. Given the multifactorial nature of disease biology, AI and ML will be needed to help scientists systematically improve that understanding by uncovering the patterns that explain disease. And that in turn will depend upon access to better data that health data platforms could provide. They could be key to ushering in an era of exponential innovation in healthcare.
