An Open-Source Repo for Embryo Selection, Validated on OpenSNP

Dawn Drain

One of the most exciting emerging technologies right now is genetic enhancement: removing deleterious alleles and promoting beneficial ones so that people are genetically predisposed to avoiding heart disease or cancer and to being much smarter than they would be otherwise. Even without more speculative advances in gene editing or iterated embryo selection, parents undergoing IVF can already reasonably expect gains on the order of half a year of healthy life, or 3 IQ points, from polygenic selection.

How can you actually do embryo selection, though, or run a state-of-the-art polygenic predictor? The most popular open-source resources here are a bit of a mess. The tool pgsc_calc is okay, but it doesn't run any kind of imputation, and it requires you to filter PGS Catalog for the most accurate predictors yourself, which is error-prone and time-consuming. The repo impute.me (which turned into Nucleus Genomics) is more opinionated, but its predictors are from 2019, and there's evidence that they are very underwhelming.

Other embryo selection providers are better, but still cost upwards of ten thousand dollars.

I've heard of a few people trying to develop open-source solutions, but doing so is very error-prone: you have to run imputation on the parents, subtract off the top principal components to remove non-causal confounders, and harmonize different data formats. There's a subtle step of imputing the very sparse reads from embryonic cells taken for PGT-A. Selecting the best predictors for each trait from PGS Catalog is very tricky. And after all this, you still have to work out how to interpret all these numbers.

I recently made my own open-source repo for end-to-end polygenic prediction and embryo selection, which closes all these gaps and which I want to share with more people. Importantly, I've verified that the numbers from PGS Catalog replicate on OpenSNP data (for height and for SAT score), and anyone else can validate those results. Please complain about any bugs that you see, and I'll try to fix them if they're serious. This is one of the biggest benefits of things being open-source!

In the rest of this post I'll do a deeper dive into the datasets and algorithms. In addition to this post, I also have this longer technical essay attached to the repo which goes deeper into the math, as well as these slides.

Data

My repo pulls in data for existing predictors from the PGS (polygenic score) Catalog, and filters to the best weights for each trait using Claude's best judgment (this worked better than using simpler heuristics like recency and dataset size). There are predictors for intelligence, height, and many disease traits. Across adults these correlate with measured phenotype at around 0.3, 0.65, and 0.15-0.3 respectively, after accounting for obvious confounders like sex and age, so pretty nontrivial.

Note that the best predictors here are essentially all linear. Naively you might expect to need some sophisticated deep learning solution, but studies on identical twins and ordinary siblings show that almost all the signal for these polygenic traits is linear in the general population. Obviously this linearity will break down if you try to extrapolate out by ten standard deviations or something, but at present all you need is these linear weights.
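Because the predictors are linear, scoring a genome is just a weighted sum of allele counts. Here is a minimal sketch, assuming each SNP is coded as an effect-allele dosage of 0, 1, or 2. The SNP IDs, weights, and allele frequencies are made up for illustration, the function name is hypothetical rather than anything from the repo, and missing genotypes are filled with the population mean dosage:

```python
from typing import Dict, Optional


def polygenic_score(
    dosages: Dict[str, Optional[float]],
    weights: Dict[str, float],
    allele_freqs: Dict[str, float],
) -> float:
    """Weighted sum of effect-allele dosages over the SNPs in `weights`.

    A SNP that is missing (absent or None) is replaced by its expected
    dosage 2 * p, where p is the effect-allele frequency.
    """
    score = 0.0
    for snp, beta in weights.items():
        dosage = dosages.get(snp)
        if dosage is None:
            dosage = 2.0 * allele_freqs[snp]  # mean imputation
        score += beta * dosage
    return score


# Hypothetical three-SNP predictor; real ones use thousands to millions of SNPs.
weights = {"rs1": 0.02, "rs2": -0.01, "rs3": 0.05}
allele_freqs = {"rs1": 0.3, "rs2": 0.5, "rs3": 0.1}
person = {"rs1": 2, "rs2": 1, "rs3": None}  # rs3 was not genotyped

print(polygenic_score(person, weights, allele_freqs))
```

In practice the raw score is then standardized against a reference population so it can be read in standard deviations.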

In addition to uploading those final weights, researchers will also upload per-SNP correlations for each trait. Remarkably, those open-source GWAS sumstats are sufficient to rederive state-of-the-art predictors. The field has rallied around developing techniques like lassosum or LDpred or SBayesRC for learning PGS weights, each of which assumes that all you have access to is these GWAS sumstats, along with population-level linkage-disequilibrium matrices encoding how frequently neighboring SNPs occur together compared to chance. It's actually kind of a blessing, since you can go further and aggregate open-source GWAS sumstats across biobanks, without ever getting formal portal access.

While I mostly deferred to published weights, I also made a slightly improved intelligence predictor by combining the ~250k intelligence datapoints with the ~750k public education datapoints using a technique called MTAG. This is roughly equivalent to taking a linear combination of the intelligence and education predictors, with the weighting chosen to maximize intelligence prediction. You can get even fancier, especially by doing more feature engineering using traits like reaction time, visual pairs matching, etc., but I wanted to keep things relatively simple given that there isn't enough public data to distinguish different IQ predictors.
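To make the "linear combination" intuition concrete, here is a small sketch of ordinary least-squares weights for combining two standardized predictors. This is not MTAG itself (which works on GWAS summary statistics), only the rough equivalent described above, and the correlations in the example are hypothetical:

```python
import math


def combine_two_predictors(r1: float, r2: float, rho: float):
    """Best linear combination of two standardized predictors.

    r1, r2: correlation of each predictor with the target trait.
    rho:    correlation between the two predictors.
    Returns (w1, w2, combined_r), where combined_r is the correlation of
    w1 * pred1 + w2 * pred2 with the target.
    """
    denom = 1.0 - rho ** 2
    w1 = (r1 - rho * r2) / denom
    w2 = (r2 - rho * r1) / denom
    combined_r = math.sqrt(w1 * r1 + w2 * r2)
    return w1, w2, combined_r


# Hypothetical inputs: an intelligence PGS (r = 0.30 with IQ), an education
# PGS (r = 0.25 with IQ), and a correlation of 0.7 between the two scores.
w_iq, w_edu, r_combined = combine_two_predictors(0.30, 0.25, 0.7)
print(w_iq, w_edu, r_combined)
```

When the two scores are highly correlated, the combined predictor improves on the better single score only modestly, which fits the "slightly improved" framing.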

Importantly, I validated things by trying to predict people's heights based on their OpenSNP data. For the 500 people I have heights for, I got an r² of 0.3 using vanilla mean imputation, and an r² of 0.38 after imputing with Beagle, essentially matching published estimates. I similarly got an r of 0.3 for SAT score for the 100 people who entered that information. You can also do "pseudovalidation" using sumstats as a further check, but I would only think of that as a rough sanity check.
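A validation along these lines boils down to correlating scores with phenotypes after removing obvious confounders. The sketch below is one simple way to do that, not necessarily how the repo does it: it subtracts each sex's mean height (a crude stand-in for a proper covariate regression) and then computes a Pearson correlation. The four data points are invented:

```python
import math
from collections import defaultdict
from typing import List, Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def center_within_groups(values: Sequence[float], groups: Sequence[str]) -> List[float]:
    """Subtract each group's mean, e.g. to remove the average sex difference."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        sums[group] += value
        counts[group] += 1
    return [value - sums[group] / counts[group] for value, group in zip(values, groups)]


# Invented toy data: polygenic scores, heights in cm, and reported sex.
scores = [0.1, 0.5, -0.3, 0.2]
heights = [178.0, 183.0, 160.0, 166.0]
sexes = ["M", "M", "F", "F"]

adjusted = center_within_groups(heights, sexes)
r = pearson_r(scores, adjusted)
print(r, r ** 2)
```

With only a few hundred people, the resulting estimate has wide error bars, so it is best read as a replication check rather than a precise measurement.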

You can run your own 23andMe data through my repo's predictors, although it's kind of navel-gazey in my opinion. You already know how tall, smart, and handsome you are.

Embryo Selection

The main value of polygenic scoring is for prospective parents going through IVF who want to select which embryo to implant based on a polygenic trait, like intelligence or various disease risks. You already get a very good baseline prediction for any given embryo from the parents' phenotypes, but that baseline is the same for every embryo, so all the signal for telling embryos apart comes from polygenic scoring.

Given polygenic scores, my repo converts them to QALYs (quality-adjusted life-years), and then produces a summary table.

Note that there are several arbitrary choices here, especially around time discounting. By default these QALY numbers are computed with no time discounting, and their sizes would shrink if you used something like a 3% per year discount. I don't have a great way to think about discounting here given that AI is going to upend everything.
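To get a feel for how much discounting matters, here is a small sketch that discounts QALY gains back to birth. The function name and the gain profile (half a healthy year, all of it at age 80) are hypothetical choices for illustration, not the repo's actual model:

```python
from typing import Dict


def discounted_qalys(gains_by_age: Dict[int, float], annual_rate: float) -> float:
    """Present value at birth of QALY gains keyed by the age at which they accrue."""
    return sum(
        gain / (1.0 + annual_rate) ** age
        for age, gain in gains_by_age.items()
    )


# Hypothetical profile: half a healthy year gained, all of it at age 80.
gains = {80: 0.5}
undiscounted = discounted_qalys(gains, 0.0)
discounted = discounted_qalys(gains, 0.03)
print(undiscounted, discounted)
```

Under these assumptions a gain that arrives late in life shrinks by roughly a factor of ten at a 3% discount rate, which is why the choice of rate dominates the headline numbers for late-onset diseases.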

This also represents a somewhat optimistic scenario: best of five healthy embryos, with a white couple, for whom the statistics are more accurate. That being said, I am accounting for the within-family attenuation that comes from factors like genetic nurture and assortative mating. From published papers, these multiply the variance explained by a factor of about 0.5 for IQ and 0.4 for years of education, so it's important not to ignore them.

Imputation

Arguably the most subtle part of this project is imputation, i.e. filling in the most likely values for SNPs that are missing.

For SNP data from 23andMe, you can boost a predictor's r² by something like 30% by imputing using a tool like Beagle or the Michigan Imputation Server.

For embryos, imputation is more involved. You can't actually sequence an embryo to high fidelity. Rather, on day five the clinic will slice off a few cells from the part of the embryo that will eventually become the placenta, which is a routine procedure at this point for detecting whether the embryo has the right number of chromosomes. The data you can get from such a small number of cells is very noisy: the reads are very sparse, which is enough to tell if you have three copies of some chromosome, but not nearly enough data for a polygenic scoring model to use. The trick, then, is to get whole genome sequences from the parents (with imputed genomes from SNP data being nearly as good). If you know the parents' haplotypes with certainty, then you can model the embryo's maternal haplotype as coming from the mom's haplotypes plus crossovers, and similarly for the paternal haplotype. A straightforward hidden Markov model (HMM) then gives the most likely imputation for the embryo.
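The basic HMM can be sketched as follows. This is an illustrative toy, not the repo's implementation: the hidden state at each site is which maternal and which paternal haplotype the embryo inherited, each read is assumed to sample one of the embryo's two alleles at random, and the constant per-site recombination probability and read error rate are assumed values. Viterbi decoding then gives the most likely inheritance path:

```python
import math
from itertools import product


def viterbi_embryo(mom_haps, dad_haps, reads, recomb=0.01, err=0.01):
    """Most likely (mom_hap_index, dad_hap_index) per site.

    mom_haps, dad_haps: two phased haplotypes each, as lists of 0/1 alleles.
    reads: per site, a list of observed embryo alleles (empty if no coverage).
    recomb, err: assumed per-site crossover and read-error probabilities.
    """
    states = list(product((0, 1), repeat=2))

    def log_emit(state, i):
        m, p = state
        # Fraction of alt alleles in the embryo genotype implied by this state.
        frac_alt = (mom_haps[m][i] + dad_haps[p][i]) / 2.0
        p_alt = frac_alt * (1 - err) + (1 - frac_alt) * err
        return sum(math.log(p_alt if r == 1 else 1 - p_alt) for r in reads[i])

    def log_trans(a, b):
        # Maternal and paternal crossovers are treated as independent.
        return sum(math.log(recomb if x != y else 1 - recomb) for x, y in zip(a, b))

    score = {s: math.log(0.25) + log_emit(s, 0) for s in states}
    back = []
    for i in range(1, len(reads)):
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda q: score[q] + log_trans(q, s))
            new_score[s] = score[prev] + log_trans(prev, s) + log_emit(s, i)
            pointers[s] = prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def impute_genotypes(mom_haps, dad_haps, path):
    """Alt-allele count (0, 1, or 2) at every site implied by the decoded path."""
    return [mom_haps[m][i] + dad_haps[p][i] for i, (m, p) in enumerate(path)]


# Toy example: six sites, reads at only two of them.
mom = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
dad = [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
sparse_reads = [[1, 0], [], [], [1], [], []]

path = viterbi_embryo(mom, dad, sparse_reads)
print(impute_genotypes(mom, dad, path))
```

Even with reads at only two of the six sites, the alt-allele reads are far better explained by the mother's second haplotype, so every site gets imputed as heterozygous. A real implementation would use a genetic map for position-dependent recombination rates and would typically report posterior dosages rather than a single best path.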

In general, though, there are switching errors when you try to phase someone's genome into its haplotypes, which invalidates the basic HMM. Instead you need a hidden Markov model that tracks all embryos simultaneously, and takes advantage of how a single parental switching error is more likely than simultaneous crossover events across all the embryos. Performance here degrades a bit if you only have a couple of embryos, but by less than you might think. There's also a more brute-force solution for fixing the parental phasing, where you get a grandparent's or another child's genome.

Closing Thoughts

I think it's sobering to keep in mind that the gains with current reprogenetic technology (IVF + SNP sequencing) are not exactly transformative. If you have five people and choose the person who's the most extreme along a given trait, they'll typically be around 1.2 standard deviations above average and fall in something like the 88th percentile. If you're instead selecting based on a PGS with an r² of 0.1 amongst people who already share half their DNA, then two further factors of roughly one half apply to the variance you can select on, and the top person will typically be only around 0.2 standard deviations above average, landing in the 58th percentile. (The first factor follows directly from theory, from only having half as much DNA to select on, and the second comes from empirical results about within-family attenuation for intelligence and years of education.)
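The arithmetic here can be checked with a short calculation. The sketch below assumes the 0.1 figure is the variance explained in the general population, that scores among siblings are normally distributed, and that both the sibling share of variance and the within-family attenuation are one half; the function names are made up for illustration:

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_max_of_normals(n: int, lo: float = -8.0, hi: float = 8.0, steps: int = 16000) -> float:
    """E[max of n iid standard normals], by midpoint-rule integration of
    x * n * pdf(x) * cdf(x)^(n-1)."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += x * n * STD_NORMAL.pdf(x) * STD_NORMAL.cdf(x) ** (n - 1)
    return total * dx


def expected_gain_sd(
    n_embryos: int,
    r2_population: float,
    sibling_variance_share: float = 0.5,  # siblings vary on half the genetic variance
    within_family_factor: float = 0.5,    # assumed within-family attenuation
) -> float:
    """Expected trait gain, in population standard deviations, from picking
    the top-scoring embryo out of n_embryos."""
    usable_r2 = r2_population * sibling_variance_share * within_family_factor
    return expected_max_of_normals(n_embryos) * math.sqrt(usable_r2)


best_of_five = expected_max_of_normals(5)
gain = expected_gain_sd(5, 0.1)
print(best_of_five, STD_NORMAL.cdf(best_of_five))
print(gain, STD_NORMAL.cdf(gain))
```

Because the gain scales with the expected maximum, which grows slowly with the number of embryos, and with the square root of the usable variance, doubling the number of embryos helps much less than doubling predictor accuracy would.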

I think things will get more exciting once we have better biotech and larger datasets. You could imagine having enough selection power to really minimize the risk of twenty different diseases simultaneously, including debilitating psychiatric diseases, and to have children who are essentially arbitrarily smart. I'm not sure how much those things will matter in a post-AGI world, but they seem like part of a good future to me.