One thing Claude Code has really made too cheap to meter is data analysis. If you have a question that statistics could shed light on, and there’s relevant public data online, you can now get your answer frictionlessly.
One thing I’d been wondering about for a while was predictors of biotech startup success. Turns out, at least for publicly traded companies, SEC filings and stock data can go a long way towards answering those questions.
My repo is here — as always, please do let me know if you see any problems with the code.
Methodology
I used API access to SEC EDGAR for filing data on publicly traded biotech companies; SIC codes 2834 (“Pharmaceutical preparations”) and 2836 (“Biological products”) are the relevant ones for biotech.
Filtering for companies that have ever filed an S-1 form (the registration statement a company files ahead of an IPO) and had a founding date after 2000, I got a dataset of 803 companies.
S-1 filings come with information including founding date, founding location, stock ticker symbol, and stock exchange.
Cross-referencing this with Yahoo Finance data (using the yfinance library) gives additional information on the stock price over time.
Other SEC filings from EDGAR, like the 8-K, give information about whether companies were acquired, went bankrupt, or were otherwise delisted from stock exchanges.
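As a rough illustration of the filtering step, here is a minimal sketch of how one company record might be screened. It assumes the record is a dictionary shaped like EDGAR's company submissions JSON, with a top-level `sic` field and form types listed under `filings.recent.form`; the function name is hypothetical and this is not necessarily how the repo does it. The founding-date filter is left out because that date comes from the S-1 text, and the `recent` block may not cover a company's full filing history.

```python
# SIC codes named in the post as the relevant ones for biotech.
BIOTECH_SIC_CODES = {"2834", "2836"}


def is_biotech_with_s1(submission: dict) -> bool:
    """Return True if the record has a biotech SIC code and at least one S-1.

    `submission` is assumed to look like EDGAR's submissions JSON:
    {"sic": "2836", "filings": {"recent": {"form": ["S-1", "8-K", ...]}}}
    """
    sic = str(submission.get("sic", ""))
    forms = submission.get("filings", {}).get("recent", {}).get("form", [])
    has_s1 = any(form == "S-1" for form in forms)
    return sic in BIOTECH_SIC_CODES and has_s1


if __name__ == "__main__":
    # Synthetic record for illustration only; not real filing data.
    example = {"sic": "2836", "filings": {"recent": {"form": ["8-K", "S-1"]}}}
    print(is_biotech_with_s1(example))
```

Keeping the check as a pure function over an already-downloaded record makes it easy to test without touching the network.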
The “business” section of the S-1 filing gives information about the company’s pipeline at the time of IPO. Here I used Claude to extract categories including:
modality (small molecule? biologic? both? neither?)
and modality sub-type (biologics can include peptides, oligonucleotides, antibodies, vaccines, etc.; “other” can include natural products, radiotherapies, drug formulations, etc.)
disease areas (cardiovascular, oncology, rare disease, immunology, CNS, etc)
drug targets
lead stage (preclinical, Phase I, Phase II, Phase III, or approved)
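One way to keep LLM-extracted fields like these tidy is to parse the model's answer into a fixed record and reject values outside the expected categories. The sketch below assumes the model was asked to reply with a JSON object; the field names, the `PipelineRecord` class, and the exact category labels are hypothetical choices for illustration, not taken from the repo.

```python
import json
from dataclasses import dataclass
from typing import List

# Category labels are illustrative assumptions based on the list above.
MODALITIES = {"small molecule", "biologic", "both", "other"}
LEAD_STAGES = ("preclinical", "Phase I", "Phase II", "Phase III", "approved")


@dataclass
class PipelineRecord:
    modality: str
    modality_subtypes: List[str]
    disease_areas: List[str]
    drug_targets: List[str]
    lead_stage: str


def parse_pipeline(raw_json: str) -> PipelineRecord:
    """Parse one extraction result and validate the categorical fields."""
    data = json.loads(raw_json)
    record = PipelineRecord(
        modality=data["modality"],
        modality_subtypes=list(data.get("modality_subtypes", [])),
        disease_areas=list(data.get("disease_areas", [])),
        drug_targets=list(data.get("drug_targets", [])),
        lead_stage=data["lead_stage"],
    )
    if record.modality not in MODALITIES:
        raise ValueError(f"unexpected modality: {record.modality!r}")
    if record.lead_stage not in LEAD_STAGES:
        raise ValueError(f"unexpected lead stage: {record.lead_stage!r}")
    return record
```

Failing loudly on an unexpected label is a design choice: it surfaces extraction errors early instead of letting a stray category slip into the regression as its own level.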
Then, I classified company outcomes into four categories:
Acquired
Failed (including bankruptcy, delisting, or other keywords relating to going out of business)
Trading+ (currently trading on a stock exchange, with a higher stock price today than at IPO, indicating that the company became more valuable over time)
Trading- (currently trading on a stock exchange, with a lower stock price today than at IPO, indicating that the company became less valuable over time)
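The four categories above, and the CAGR figure used later in the post, can be sketched in a few lines. This is a minimal illustration, not the repo's code: the function names are hypothetical, the boolean flags stand in for whatever the 8-K parsing produces, and two tie-breaking rules are my own assumptions (acquisition takes precedence over failure, and a price exactly equal to the IPO price counts as Trading-).

```python
def cagr(ipo_price: float, latest_price: float, years: float) -> float:
    """Compound annual growth rate between the IPO price and the latest price."""
    if ipo_price <= 0 or years <= 0:
        raise ValueError("ipo_price and years must be positive")
    return (latest_price / ipo_price) ** (1.0 / years) - 1.0


def classify_outcome(acquired: bool, failed: bool,
                     ipo_price: float, latest_price: float) -> str:
    """Map one company to Acquired, Failed, Trading+, or Trading-."""
    if acquired:  # assumption: an acquisition wins over a later delisting flag
        return "Acquired"
    if failed:
        return "Failed"
    # Assumption: a flat price is grouped with Trading-.
    return "Trading+" if latest_price > ipo_price else "Trading-"
```

Because CAGR is positive exactly when the latest price exceeds the IPO price, "positive CAGR" and "Trading+" pick out the same companies under this definition.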
Basic Outcome Stats
About half of the startups in the dataset are still trading, and half have exited. Failures (31%) somewhat outnumber acquisitions (22%). Among currently trading startups, very few have positive CAGR (7% of the full dataset), while most (39.7% of the full dataset) have lost value.
This might reflect the high rate of clinical trial failure. Most biotech startups will IPO while their most advanced candidate is in clinical trials, and their stock will lose value if they get bad news in the clinic. Unsuccessful companies eventually fail; acquisitions are probably a mix of successful companies (acquired on good terms) and moderately unsuccessful companies (acquired on bad terms).
If you look at founding dates, you can see that failed companies skew earlier, while negative-CAGR companies skew later. In other words, the data is consistent with the hypothesis that poor-performing currently-trading companies are just companies that haven’t failed yet.
Also notice that acquisitions stay pretty steady except for a drop in the past five years, which makes sense (the newest companies haven’t had time to be acquired yet).
Predictors of Outcome
If we run a multinomial logistic regression with the variables in the dataset, we get some striking results.
More recently founded firms are more likely to be still trading (vs. acquired or failed). This makes sense; firms have a life cycle.
Don’t be from a flyover state. Companies founded in states other than CA, MA, NY, NJ, or PA — that is, outside California or the Northeast — are much more likely to fail. The benefit (or the selection effect) of being located in a biotech hub is huge.
Don’t pick a modality “other” than biologics or small molecules. The most common “other” modalities are natural products and drug formulations — these are typically lower-value than novel compounds and biologics.
Do pursue rare disease and immunology indications. Companies with a rare disease focus are more likely to be acquired and more likely to trade with positive CAGR; companies with an immunology focus are more likely to trade with positive CAGR.
Do IPO at Phase III. Companies that IPO at the last phase of clinical trials are more likely to be acquired than companies that IPO earlier (more risk) or later (perhaps because those companies are more likely to sell generics?).
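For readers who want the model behind the findings above spelled out, here is the standard form of a multinomial logistic regression. The notation is mine, and the choice of reference outcome $r$ is an assumption (the post does not say which of the four outcomes served as the baseline); $x$ is a company's vector of predictors and $\beta_k$ the coefficient vector for outcome $k$.

```latex
\[
P(Y = k \mid x) = \frac{\exp(\beta_k^\top x)}{1 + \sum_{j \neq r} \exp(\beta_j^\top x)}
\quad (k \neq r),
\qquad
P(Y = r \mid x) = \frac{1}{1 + \sum_{j \neq r} \exp(\beta_j^\top x)}
\]
\[
\log \frac{P(Y = k \mid x)}{P(Y = r \mid x)} = \beta_k^\top x
\]
```

Read this way, a statement like "more likely to fail" means a positive coefficient on that predictor in the "Failed" equation: $\exp(\beta_{k,i})$ is the factor by which the odds of outcome $k$ relative to the reference outcome change when predictor $x_i$ goes up by one unit, holding the other predictors fixed.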
Looking in more detail at pipeline information gives a largely consistent picture.
Disease Areas
Rare diseases and immunology are good (less likely to fail, more likely to have positive CAGR).
Modalities (detailed)
Antibodies are good (less likely to fail, more likely to have positive CAGR).
And PD-1 (the target of immune checkpoint inhibitors for cancer) is by far the best target (more likely to be acquired, less likely to fail).
All of this is pretty much conventional wisdom in the modern biotech world. The immune system (immuno-oncology, antibody therapies, immunology indications) is an exciting place to work these days. Rare diseases generally have unmet medical need, clear genetic causes, and an easier path to approval. And locating your company in a biotech hub region is generally the recommended thing to do.
No huge surprises here — but it’s nice to know when the data can back up the conventional wisdom.
I’m mildly surprised that cell therapies, CAR-T, and peptides aren’t outperforming the baseline, despite conventional wisdom being that some of these novel biologic modalities are “the way of the future.” Maybe it’s still too soon to tell, or maybe exciting science hasn’t translated sufficiently into financial returns.
I’m also a little surprised that oncology (the most popular disease category) isn’t outperforming the baseline, despite having an easier path to approval than most disease areas. But maybe there’s a lot of variance in company quality; lots of firms are drawn to oncology, but treating cancer is easier said than done.
One thing these stats aren’t very informative about is very recent trends. Anything that’s mostly happening in pre-IPO companies isn’t in this dataset, and the most recently-IPO’d companies are too young to really have outcomes. So, “should you continue betting on these trends or are they played out?” is something you’d need more domain expertise (and a bit of luck) to assess.
But, while it’s hard to use the past to predict the future, I think it’s a decent starting point in making sense of the present. Basically, the data backs up the buzzwords; the “cool” locations and research focuses do also tend to be the financially successful ones.