Selling Data

TL;DR

These companies build a technology process for generating uniquely valuable datasets. They sell the data itself to pharma, rather than partnering on drugs.

Companies whose datasets unlock entirely new target spaces and/or provide existentially important data on currently difficult-to-drug targets can garner $30M+ upfront deals, as Tempus and Fauna have. But those are the largest deals ever for the space.

Companies

Tempus, UK Biobank, deCode Genetics, 23andMe, Fauna, Basecamp, Ochre, Flatiron Health, Foundation Medicine

Overview

At a high level, these companies can be thought of in terms of their dataset's value per datapoint versus the size of the dataset. Some have created tiny datasets that are nonetheless very difficult to replicate and may hold unique biological insights into the hottest target spaces.

Others have pushed for scale. Epic has the full EHR on 300M+ patients, and Tempus has 350+ petabytes of data from 1M+ sequenced tumors and 9M patients' clinical records.

Those that push for larger scale datasets are either integration plays that connect to existing records (e.g. Epic, Tempus) or those that create a technological platform to massively increase the throughput of data collection on some hard-to-gather data (e.g. Olden Labs, BiomeSense).

Examples of smaller, high-value datasets we at Compound are interested in include ones on the skin microbiome and muscle biopsy biobanks.

If you can run a diagnostics-to-biobank business model (like Tempus and Caris), it is significantly more sustainable: the diagnostics business covers the cost of your data, and you gain high-value data and services contracts once your dataset reaches a sufficient size.

With that said, it has generally been a difficult business model in biotech/pharma because the customer is:

Shrewd: they know quite precisely how much a potential dataset is worth because their core competency is also drug discovery
Stingy: they know what the minimum price is and are reluctant to pay any more than that. Companies we know that have developed step-function improvements in data generation tools have struggled to win $500K milestone-gated contracts from large pharma, or from the top AI labs with $1B+ in funding, for datasets that directly fuel their core competency.
Long sales cycles: large pharma is very slow moving and bureaucratic. Expect 6-18 month sales cycles with constant hand-to-hand combat. Selling to startups may take 1-3 months, but contract sizes will likely be at least 5x smaller.

To better understand why the model is difficult, see the following:

Difficulty of selling data to biotech/pharma as opposed to tech or finance

Consider the Bloomberg business model and the respects in which it differs.

§ Time value of information: the bond or currency trader making multi-hundred million-dollar trades lives or dies based on moment-to-moment information. Last time I checked, the drug discoverer can wait a week (a month? six months?) for the latest genetic sequence or proteomic profile.

§ Value of that information: for the trader, that information, and its currency, is the key to making the investment which is the money-making activity. For the drug discoverer, the information is just the start of a multi-year, multi-disciplinary journey lasting a decade or more to the money-making activity (= discovery, development, approval, and marketing of the drug). Hence, a subscription model, not a royalty/profit-share model of payment, is the only one that will sell.

§ Number of high-paying customers: a Bloomberg terminal subscription costs $30,000/year per user (not per firm). In 2016, there were 325,000 terminals in use. How many pharma and biotech subscribers, in a best case, would there be at what user fee?
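For a sense of scale, the Bloomberg figures quoted above multiply out to roughly $9.75B in annual terminal revenue. The sketch below is a minimal illustration of that arithmetic in Python. The helper names and the 1% target are hypothetical placeholders chosen only to show how the question "how many subscribers at what fee?" can be framed; they are not estimates of the pharma market.

```python
import math

# Figures quoted in the text above.
BLOOMBERG_FEE_PER_SEAT = 30_000   # USD per user per year
BLOOMBERG_SEATS_2016 = 325_000    # terminals in use in 2016


def annual_revenue(seats: int, fee_per_seat: int) -> int:
    """Annual subscription revenue in USD for a per-seat pricing model."""
    return seats * fee_per_seat


def seats_needed(target_revenue: int, fee_per_seat: int) -> int:
    """Smallest number of seats that reaches target_revenue at a given fee."""
    if fee_per_seat <= 0:
        raise ValueError("fee_per_seat must be positive")
    return math.ceil(target_revenue / fee_per_seat)


bloomberg_revenue = annual_revenue(BLOOMBERG_SEATS_2016, BLOOMBERG_FEE_PER_SEAT)
print(f"Implied Bloomberg terminal revenue: ${bloomberg_revenue:,}/year")

# Hypothetical target: 1% of Bloomberg's implied revenue, at Bloomberg's fee.
# Both inputs are placeholders to vary, not market estimates.
hypothetical_target = bloomberg_revenue // 100
hypothetical_seats = seats_needed(hypothetical_target, BLOOMBERG_FEE_PER_SEAT)
print(f"Seats needed for ${hypothetical_target:,}/year: {hypothetical_seats:,}")
```

Varying the fee and target makes the constraint concrete: a data seller needs either many paying seats or a very high fee per seat, and the text above argues pharma offers neither.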

This industry dynamic of pharma paying pennies for preclinical discovery tools, let alone data, may change if and only if pharma starts to view in silico discovery as an existential threat. We're currently seeing this play out with our portfolio company Wayve. Just within the last year, the legacy car OEMs have internalized that AVs are here and are existential to their very survival. Negotiations have shifted from contracts for tiny margins won via hand-to-hand combat over 2-year sales cycles to gargantuan deals over 3-month sales cycles.

This may also suggest a longer-term, radical reshuffling of the industry's profit pools, with the marketing of approved drugs moving from collecting effectively all of the industry's profits to being relatively commoditized. We at Compound broadly presume the commoditization of drug discovery, but expect it to play out incrementally over decades.

It’s possible that the data contracts will scale substantially in the coming years if big tech / frontier AI labs start actually getting serious about curing cancer.

For illustrative purposes, Google invested ~$20B into Waymo’s R&D to ~solve~ autonomous driving. How much might it invest into Isomorphic to also ~solve~ drug discovery?

Best of all, those customers should be far, far less price sensitive than what bio startups are used to.

The tricky part is that it's been largely all talk so far, and a bio startup employing a data-as-a-service business model is betting it'll happen in the next couple of years.

A signal for when the frontier labs get serious about curing cancer may be Twist's annual revenue growth re-inflecting upwards from ~25% to, say, 40% y/y.

Though not quite convincing yet as it's one datapoint and they're guiding towards just 13-15% growth this year, Twist made the following commentary in Q4’25:

Recently, this AI-driven discovery fueled significant growth for Twist. In fiscal 2025, orders from customers working on AI discovery projects grew more than $25 million versus fiscal 2024. And a customer pursuing AI-enabled discovery delivered our single largest purchase order to date.

We’re watching other early signals that the necessary dry powder is being accumulated. OpenAI’s new philanthropic foundation has a $25B budget to “fund work to accelerate health breakthroughs so everyone can benefit from faster diagnostics, better treatments, and cures.” Meanwhile the US announced Genesis and the UK announced:

Renaissance Philanthropy (@RenPhilanthropy) on X

Typical price-points:

Tempus makes $450 per patient slide and clinical outcome it sells
Flatiron had 2.1M patient records at time of its $1.9B acquisition, implying ~$900 per patient record. The data is "curated EHR" as they use human extractors to turn unstructured PDF pathology reports into structured data points. Pharma pays a massive premium for this "clean" Real-World Evidence.
23andMe’s GSK deal of $350M for access to its database of 5M customers suggests on the order of $70 per genome
However, 23andMe's data didn't include whole genomes. Amgen acquired (and therefore also paid the premium for exclusive access to) deCODE's ~140k genotypes and 2k whole genomes for $415M. deCODE was also a uniquely homogeneous and high-quality dataset relative to 23andMe. Even so, that suggests a far higher per-record price at that time, on the order of $2,900 if the genotypes and whole genomes are counted together.
EHR/EMR data has an estimated value typically <$130 per patient record alone but $1,300-$6,500 when combined with genomic and phenotypic data
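The implied per-record prices above come from dividing a headline deal value by the number of records. The sketch below is a minimal illustration of that arithmetic in Python, using only figures quoted in this section. The helper name is hypothetical, and counting deCODE's genotypes and whole genomes together is an assumption. Because acquisitions and equity investments buy more than data, the results are rough upper bounds rather than true data prices.

```python
# Implied price per record = headline deal value / number of records.
# Deal values and record counts are the figures quoted in the text above;
# the helper name `implied_price_per_record` is hypothetical.


def implied_price_per_record(deal_value_usd: float, n_records: float) -> float:
    """Return the headline deal value divided by the number of records."""
    if n_records <= 0:
        raise ValueError("n_records must be positive")
    return deal_value_usd / n_records


deals = {
    "Flatiron (Roche acquisition)": (1.9e9, 2.1e6),
    "23andMe (GSK equity investment)": (350e6, 5e6),
    # Assumption: count deCODE's ~140k genotypes and 2k whole genomes together.
    "deCODE (Amgen acquisition)": (415e6, 140e3 + 2e3),
}

implied = {
    name: implied_price_per_record(value, records)
    for name, (value, records) in deals.items()
}

for name, price in implied.items():
    print(f"{name}: ~${price:,.0f} per record")
```

The spread across these three deals, from roughly $70 to roughly $2,900 per record, is consistent with the point that depth and quality of each record matter far more than raw record count.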

We have further insights that we’d love to share directly. If you’re thinking about this model please reach us at Shelby Compound!

Buyer	Startup	Deal type	Headline terms	Notes on dataset
GSK	Ochre Bio	Data license (multi‑year)	Up to $37.5M	Human liver disease atlases/perturbations; “foundational” dataset. (PMLiVE, Business Wire)
AstraZeneca + Pathos	Tempus	Data licensing + model build	$200M fees	De‑identified multimodal oncology RWD for a foundation model. (Tempus Investors)
Recursion	Tempus	Data access (preferred)	Up to $160M / 5 yrs	Large oncology cohort licensing; explicit “Licensed Data” terms. (Tempus, SEC)
Roche/Genentech	Recursion	Discovery collab w/ dataset access	$150M upfront; up to $12B	10‑K: precedent for selling access to Recursion’s dataset. (SEC, GEN, Fintel)
Boehringer Ingelheim	Ochre Bio	Discovery collab	$35M upfront; up to ~$1B	Builds on Ochre’s human liver datasets for target ID. (Financial Times)
Eli Lilly	Fauna Bio	Target discovery collab	Up to $494M (incl. equity, milestones, royalties)	Uses Fauna’s cross‑species genomics dataset + AI. (Ropes & Gray, PR Newswire)
Novo Nordisk	Fauna Bio	Research collab	Upfront + research support (undisclosed)	Access to hibernation biology datasets for obesity. (BioSpace)
AstraZeneca	Immunai	AI/data collab	$18M	Leverages single‑cell immune atlas to inform trials. (Reuters)
Roche (M&A)	Flatiron Health	Acquisition	$1.9B	Lock‑in of oncology EHR/RWD datasets & products. (BioPharma Dive)
GSK	Relation	Target discovery collab	$45M upfront (incl. $15M equity), up to ~$200M/target in tiered royalties	Relation to run observational studies to create two proprietary functional disease for target ID
Amgen	deCODE	Acquisition	$415M cash acquisition	The genetically homogeneous population of Iceland plus the Icelandic national database of EHRs
GSK	23andMe	Data collab	$350M as an equity investment in parent company	5-yr exclusive discovery collaboration using 23andMe database; 50/50 on certain R&D programs
Regeneron	TriNetX	Data collab / investment	$200M as equity investment	Exclusive opportunity to connect RGC’s internal genomic and proteomic data to TriNetX’s 300M de-identified, anonymous EHRs

Biobanks

Large-scale, largely public efforts like UK Biobank price their data access at cost‑recovery. National and institutional actors increasingly codify cost‑recovery rather than profit (e.g., NHS SDE principles, UK Biobank pricing, and many institutional fee schedules).

Storing, handling, and sharing specimens is the biobank’s core business. If specimens are left unused, the biobank fails to fulfil its mission. Many studies acknowledged that large numbers of underutilised specimens were a major problem for the financial sustainability of biobanks (Campos et al. 2015; Lin et al. 2019). A global survey of 276 biobanks (Henderson et al. 2019b, p. 217) indicated that in over half of the institutions, the utilisation rate was 10% or lower, and the actual annual utilisation rates of samples were by 2.5 to 5 times lower than the target. Henderson et al. (2019b, p. 217) argued that underutilisation ‘breaks the trust between the scientists/biobanks and the donors’ and is a threat to the social sustainability of biobanks.

Economics of Biobanking: Business or Public Good? Literature Review, Structural and Thematic Analysis

We found that researchers placed the greatest relative importance on the quality of specimens (26%), followed by the characterization of specimens (21%). Researchers with prior experience purchasing biological samples also valued access to key endemic in-country sites (11.6%) and low handling fees (5.5%) in biobanks.

Understanding the value of biobank attributes to researchers using a conjoint experiment

[From 2016]: Nowadays IMS automatically receives petabytes (10^15 bytes or more) of data from the computerized records held by pharmacies, insurance companies and other medical organizations—including federal and many state health departments. Three quarters of all retail pharmacies in the U.S. send some portion of their electronic records to IMS. All told, the company says it has assembled half a billion dossiers on individual patients globally.

IMS and other data brokers are not restricted by medical privacy rules in the U.S., because their records are designed to be anonymous—containing only year of birth, gender, partial zip code and doctor's name. HIPAA for instance governs only the transfer of medical information that is tied directly to an individual's identity.

Even anonymized data command meaningful prices. Every year, for example, Pfizer spends $12 million to buy health data from a variety of sources, including IMS, according to Marc Berger, who oversees the analysis of anonymized patient data at Pfizer.

How Data Brokers Make Money Off Your Medical Records

Earlier this week, health data company Truveta, which normally traffics data like patient immunizations, social determinants of health, lab tests, and pharmacy and insurance claims, announced that it will be starting a new Truveta Genome Project to create a massive database of genetic information from 10 million patients over the next five years to pair with their health record data. A crop of companies including Avandra, Gradient Health, Segmed, and Protege offers de-identified patient images to companies and researchers.

Adaptive Biotechnologies Announces Two Immune Receptor Licensing Agreements with Pfizer

More comprehensive pricing and business model info for biobanks:

As of ~2016, many biobanks hadn’t even formalized a business model at all. Almost all relied on non-profit funding.

Assessing and measuring financial sustainability model of the Spanish HIV HGM BioBank - Journal of Translational Medicine

UK Biobank’s pricing structure (Non-profit):

UK Biobank provides its annual financials:

Report And Consolidated Financial Statements 2023, Report And Consolidated Financial Statements 2020, Report And Consolidated Financial Statements 2022, Our funding

IARC Biobank Price List, External, April 2025

How much does it cost to access UK Biobank data?

Other pricing lists:

Region Stockholm Biobank – Price list 2025. Storage/withdrawal charges; stepped discounts after first 1,000 aliquots.

Karolinska (KI) Biobank – Sample services price list 2025. Academic rates for start‑up, storage (LN2, −80 °C), and logistics; external/industry via quotation.

Barcelona Blood & Tissue Bank (BST) – Biobank price list 2025. Per‑unit pricing across blood derivatives and processing.

Princess Margaret Cancer Biobank (Toronto) – Fee schedule (Apr 1, 2024). Admin/services, retrieval, FFPE/frozen block charges for grant‑funded studies.

St Vincent’s (Melbourne) – Biobank pricing (Jul 1, 2024). Split‑cost model (partial up‑front, remainder at retrieval) with CPI uplift guidance—useful structure for staged cost‑recovery.

NSW Health Statewide Biobank – Service models & costs (Feb 1, 2024). States an explicit cost‑recovery model for long‑term viability; outlines service options.

CCPM Biobank (University of Colorado) – FY25 price list. Per‑sample aliquoting/service charges for cost modeling in U.S. core settings.

National Jewish Health – FY‑2026 research support price schedule (Aug 1, 2025). Includes a biobank section with internal vs external rates.

Ontario Tumour Bank – Researcher FAQs. Explains below‑cost access for academics and commercial/academic differentiation (contact for quotes).
