Vibe analyzing my genome

Ruby

The most interesting and useful result concerns drug metabolism; you can skip to that section here.

What I did

I had my genome sequenced by Nucleus Genomics. I downloaded various genome files from Nucleus and started a new Git repo within Cursor. I worked with the LLMs to make a plan, hoping to find insights that would help me prioritize treatments by better understanding my own conditions, e.g., identifying responsible pathways.

Or rather, what Claude/ChatGPT did

This is the dense description that will mean something specific to someone who knows anything about genetic analysis, and is otherwise supposed to seem impressive if you don't. In all seriousness, though, I think the LLMs do offer you more analysis than you can get by uploading your .vcf to a site, at the risk that they'll definitely lead you astray with interpretation unless you're careful.

This project analyzed a 43x whole-genome sequence for a patient with bipolar, inflammatory symptoms, and extreme multi-drug sensitivity. The work included: data QC and coverage verification; pharmacogenomic star-allele calling via PharmCAT and Cyrius (CYP2D6); HLA class I typing via OptiType with tag SNP cross-validation; ClinVar and VEP functional annotation of a 76-gene candidate sweep; a genome-wide rare variant screen filtering 5 million variants down to 27 high-impact candidates; multi-trait polygenic risk scores for psychiatric and immune traits; two literature reviews synthesizing published genetic correlations (LDSC) and Mendelian randomization evidence for inflammation-psychiatry causal pathways; a critical adversarial review that checked all findings against population frequencies and ClinVar review status, dismantling several overclaimed results; a joint probability calculation quantifying the rarity of the patient's multi-CYP profile (~1 in 23,000); a drug history cross-reference mapping 23 medications to specific genotypes; and a comprehensive drug contingency table covering 75+ medications with pathway-specific safety ratings.

To give you an idea of what the project looked like, you can see the resulting directory structure within this footnote^[1] and the software packages installed are in this footnote^[2].

Motivation

I would like to reduce the up and down fluctuations of my Bipolar. They're simultaneously not that bad, but also bad enough to be unpleasant, disruptive, and to rob me of many productive hours.

Bipolar, for all its prevalence and study^[3], is not yet well understood. It's maybe something something a circadian rhythm disorder. It's maybe something something inflammatory. It's maybe a single disease with somewhat different presentations, or maybe it's multiple diseases with somewhat similar presentation.

I hoped that if I could look at my own genes and see which were anomalous, perhaps I could pin down what's going wrong in my mind. If it seems to be clock gene-related, I'll emphasize circadian type treatments; if it's inflammatory, focus on that. I probably want to work on everything, but info that directs effort would be good^[4].

I don't know much about genetics

Due to poor life choices, I'm likely missing even many basics taught in high school. Sure, I knew about SNPs before this project, but I couldn't have told you what linkage disequilibrium was.

But if the hundreds of new users submitting AI-generated work to LessWrong have taught me anything, it's that you don't have to be a domain expert to believe you've found invaluable insight with the help of a friendly AI collaborator. Ideally, I'd stop and study a genetics primer, but it's so much easier to mash "continue" on Cursor/Claude Code.

I am hoping that, via sufficient paranoia, prior experience with LLMs, and red-teaming-like efforts, I can nonetheless get something trustworthy out of the LLMs.

LLMs know about genetics and dev-ops

I got the notification from Nucleus Genomics that my genome results were in. Nucleus provides a basic readout of "elevated risk of this, decreased risk of that", and I'm aware there are other sites where you can upload a .vcf file and get a similar analysis.

However, I wanted a bespoke analysis, and I wanted to get it via interacting with an LLM that can run arbitrary analyses on my genome. I created a new Git repo, described my initial aim, and Claude/GPT5.4^[5] created a plan and set up a good repo structure. I downloaded all the files containing my genome.

For many kinds of genetic analyses, there are available public git repos. What became apparent is that LLMs were incredibly useful for gene analysis, not just because of knowing more about genetics than me, but because the analysis represented a huge devops/programming task that the LLMs had no problem with. God, I'm grateful for something that figures out Python environments and package dependencies for me. I'll legit guess that the LLM proficiency at coding allowed me to do, in a dozen hours, as much genetic analysis as would take a genetics PhD (or me) weeks to do, just because of the programming involved.

My Naive Approach

At the start of this project, I had been talking to Claude about potentially relevant genes. We had an initial list of 50 that grew to about a hundred.

Looking over my clinical presentation and drug response, we broke things down into five mechanistic axes: pharmacogenomic, inflammatory/neuroinflammatory, circadian, catecholamine, and neuropsychiatric. For each, it was clear I had variants on implicated genes.

A stop-gained variant in IDO2 was a neuroinflammatory bridge; an NFKB1 splice-donor variant hit a central inflammatory regulator; a PER3 splice-acceptor variant hit a core circadian clock gene; there were promoter variants in the inflammatory signaling genes IL-6, IL-1β, and IL-10; and so on.

At last, my entire life made sense.

Too much sense, really.

Here is a research report. I want you to treat this as the work of a grad student and you are the PI. What do you make of the methodology? What do you make of the reinterpretation? It's important for this student's education and work to not spare feelings, and give your honest assessment. What survives stricter standards? It's not too late for this student to correct issues if given proper direction. – not what I actually wrote, but something like what I would have prompted.

Say what you will about LLMs, they're entirely faithless to their prior work. The analysis was detailed and brutal. The just-so story wasn't patently flawed to me in my genetic ignorance, but the mistakes were not subtle, it turns out.

The LLM had assembled a story by saying, "this gene is at least vaguely linked to this thing, and it is altered in this patient, therefore it explains this result". This is problematic, it turns out, because:

Certain gene variations are extremely common and therefore unlikely to explain meaningful pathology on their own.
Of these specific gene variations, the ClinVar database lists several as benign.
For these variations, the evidence for their roles ranges from well-established to a single report of deleterious effects.
For others, the genes involved have tiny effects on the overall system (e.g., proinflammatory cytokine genes).

One or more of the above were true about all the genes identified as explanatory. With the approach used, it seems like you'd be able to tell a just-so story for whatever. Alas, my life does not get to make sense yet.

Rebuilding

At this point, Claude4.6/GPT5.4 was pretty insistent that the answers I wanted wouldn't be found in my own genome; we had to look at the literature and population-wide studies. I didn't like that. I wanted to be able to look at my genes and understand my brain and body. Unfortunately, I think that's a fantasy that depends on us understanding the genes and what they do and how they combine and all that.

If you want to look for connections between genes and symptoms, you have to do it at the population level. Claude4.6/GPT5.4 conducted a very nice literature review on conditions interesting to me, and I learned some stuff. Not going into detail out of reticence to dump my entire medical history on the Internet all at once.

Eventually, Claude4.6/GPT5.4 agreed that we could look at my genome again for rare variations that explained something, maybe. We examined all of my FIVE MILLION GENE VARIANTS.

The gene sweep had found only common variants because it looked at pre-selected genes. A genome-wide screen could find things nobody thought to check. VEP was run across all 5 million variant calls. After filtering for rare (<0.1% AF) and high-impact variants, and removing pseudogene artifacts (GBA1/GBAP1 produced 29 false calls from paralog misalignment), 27 candidates in 26 genes remained.
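The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline that was actually run: the field names (`gene`, `impact`, `af`), the toy rows, and the `ARTIFACT_GENES` set are assumptions made for the example, and real VEP output uses different column names depending on how it is configured.

```python
from typing import Iterable, Optional

# Assumed thresholds and names, chosen to mirror the description above.
MAX_AF = 0.001                       # "rare" = allele frequency below 0.1%
ARTIFACT_GENES = {"GBA1", "GBAP1"}   # paralog pair that produced false calls


def filter_candidates(rows: Iterable[dict]) -> list:
    """Keep rare, high-impact variants outside known artifact-prone genes."""
    kept = []
    for row in rows:
        af: Optional[float] = row.get("af")   # None = absent from the frequency database
        is_rare = af is None or af < MAX_AF
        if row["impact"] == "HIGH" and is_rare and row["gene"] not in ARTIFACT_GENES:
            kept.append(row)
    return kept


# Toy rows with made-up values, purely to exercise each branch of the filter.
example_rows = [
    {"gene": "GENE_A", "impact": "HIGH", "af": None},        # kept: novel and high impact
    {"gene": "GBA1", "impact": "HIGH", "af": None},          # dropped: artifact-prone gene
    {"gene": "GENE_B", "impact": "HIGH", "af": 0.2},         # dropped: too common
    {"gene": "GENE_C", "impact": "MODERATE", "af": 0.0001},  # dropped: not high impact
]
candidates = filter_candidates(example_rows)
```

Treating a missing frequency as "rare" is a design choice in this sketch: it keeps novel variants in the candidate set, which is also why artifacts need a separate check.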

The most exciting initial find — a novel frameshift in SYN2, a gene associated with bipolar disorder and lithium response — turned out to be an artifact. Read-level inspection showed FILTER: MosaicLowAF, QUAL 8.7, only 3 of 37 reads supporting the variant, all on one strand. DRAGEN had correctly flagged it; the analysis pipeline had not checked.
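The check that was missing can be sketched as a small function that collects reasons to distrust a call. This is an illustrative sketch only: the `VariantCall` fields and the default thresholds (`min_qual`, `min_alt_fraction`) are assumptions for the example, not the settings of DRAGEN or of the actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    filter_status: str   # VCF FILTER column, e.g. "PASS" or "MosaicLowAF"
    qual: float          # VCF QUAL column
    alt_reads_fwd: int   # variant-supporting reads on the forward strand
    alt_reads_rev: int   # variant-supporting reads on the reverse strand
    total_reads: int     # all reads covering the site


def artifact_flags(call: VariantCall, min_qual: float = 20.0,
                   min_alt_fraction: float = 0.2) -> list:
    """Return reasons to distrust a call; an empty list means no red flags."""
    flags = []
    if call.filter_status != "PASS":
        flags.append(f"caller filter: {call.filter_status}")
    if call.qual < min_qual:
        flags.append(f"low QUAL: {call.qual}")
    alt_reads = call.alt_reads_fwd + call.alt_reads_rev
    if call.total_reads and alt_reads / call.total_reads < min_alt_fraction:
        flags.append(f"low alt fraction: {alt_reads}/{call.total_reads}")
    if alt_reads > 0 and min(call.alt_reads_fwd, call.alt_reads_rev) == 0:
        flags.append("all supporting reads on one strand")
    return flags


# The SYN2 call as described above; which strand carried the reads is assumed.
syn2 = VariantCall("MosaicLowAF", 8.7, alt_reads_fwd=3, alt_reads_rev=0, total_reads=37)
syn2_flags = artifact_flags(syn2)
```

With these assumed thresholds, the SYN2 call trips every check at once, which is the pattern you would expect from an artifact rather than a borderline real variant.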

Two genuine findings emerged:

LPP — a novel 28bp splice-donor deletion in one of the most replicated non-HLA autoimmune GWAS loci (genome-wide significant for celiac disease, associated with RA and type 1 diabetes). The patient has evidence of inflammatory disease, and bipolar correlates with celiac at rg = 0.31. Speculative, but it sits in an interesting place.
SLC22A1 (OCT1) — a novel splice-donor deletion creating a null allele for the organic cation transporter. After checking the other allele (clean — no known reduced-function variants), the patient was classified as OCT1 intermediate function. Another layer added to the drug-metabolism picture: modest effects on tramadol, morphine, ondansetron, and metformin handling.

"Speculative but it sits in an interesting place." Aka, not very interesting.

Polygenic Risk Scores are where it's at, le sigh

Sigh. The engineer's mind wants mechanism, but what we get are statistics. PRSs are what the sequencing companies will give you by default. My impression is that PRS databases are actually kinda proprietary, and are what makes one company's analysis of your genome better than another's.

Nonetheless, some PRS studies are available online, and we (by which I mean it) ran my genome against PRS studies for seven different conditions.

This was modestly interesting. I'm most elevated on bipolar, which isn't surprising. But for another cluster of symptoms I was at ~zero, despite having an unmistakable clinical presentation (one that runs in the family, no less). That would suggest my family is getting certain unfortunate outcomes without having the typical genes for them at all. There's nothing immediately actionable in that, though it explains some deviations from the typical presentation. Kind of an interesting finding.

Edited: I discussed this draft with Steve Hsu, who suggested I investigate the quality of the PRS studies used. Actually, most of them are weak, and it is not at all surprising to get a null result.

Evaluation of the PRS studies used

PGS ID	Source / study type	Evidence strength	Main limitations	How surprising would a low score be for someone with the condition?	Bottom line
`PGS002786`	Gui et al. 2022 using PGC bipolar GWAS	Moderate	Respectable psychiatric GWAS base, but selected PGS came from a PRS association paper rather than a clean clinical prediction paper; local match rate only `56.7%`	Not very surprising	Useful as a research signal, but a low or average BD score would still be common among true cases
`PGS000907`	Campos et al. 2021 using UK Biobank depression GWAS	Moderate	Huge sample, but phenotype is broad depression / UKB-derived rather than the cleanest strict MDD definition; local match rate `44.0%`	Not surprising	Reasonable generic mood-liability score, not a strong individual predictor
`PGS000908`	Campos et al. 2021 using Jansen et al. insomnia GWAS	Moderate to moderately strong GWAS; moderate realized score	Very large discovery GWAS, but local match rate only `33.5%`; practical score quality is much worse than source-paper quality	Not surprising	A positive score is mildly supportive; a low score would not rule out real insomnia liability
`PGS002318`	Weissbrod et al. 2022 UK Biobank PRS release	Moderate	Large sample and validation, but trait is based on simple self-report and best European incremental `R²` is only about `0.036`	Not surprising	Neutral or low score is very plausible even with real circadian problems
`PGS002746`	Lahey et al. 2022 using Demontis et al. ADHD GWAS	Moderate to weak for this use	Underlying ADHD GWAS is decent, but selected PGS paper focused on childhood psychopathology / impulsivity in `4,483` children, not broad adult diagnosis prediction; local match rate `45.5%`	Not surprising	Directional signal only; poor basis for strong inference in an adult
`PGS002344`	Weissbrod et al. 2022 UK Biobank PRS release	Weak	European incremental `R²` only about `0.0035`; broad release score rather than a specialized score.	Very unsurprising	Near-zero score should not be treated as strong evidence against disease biology
`PGS005393`	Bugiga et al. 2024	Poor	Validation sample was only `117` Brazilian women after sexual assault; narrow and non-generalizable setting; local match rate `37.7%`; repo already notes score is not interpretable	Completely unsurprising	Should not be used for meaningful personal inference
`PGS001287`	Tanigawa et al. 2022 sparse UKB PRS	Weak here / effectively unusable	Only `36` variants; tiny case count relative to PRS standards; key HLA entries skipped in local application	Meaningless to be surprised	Incomplete score, not interpretable without proper HLA typing
`PGS002886`	ExPRS-style sparse score	Weak	Only `5` variants, with `3` matched locally	Not surprising at all	Too underpowered for individual interpretation
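For intuition about the "local match rate" figures in the table, here is a minimal sketch of how a raw polygenic score and its match rate could be computed. It assumes a simple additive model (weight times effect-allele count); the function name, the variant IDs, and the toy weights are invented for illustration and are not taken from any PGS Catalog file.

```python
def polygenic_score(weights: dict, dosages: dict) -> tuple:
    """Return (raw score, match rate).

    weights: variant ID -> effect weight from the scoring file
    dosages: variant ID -> number of effect alleles carried (0, 1, or 2)
    """
    matched = [vid for vid in weights if vid in dosages]
    score = sum(weights[vid] * dosages[vid] for vid in matched)
    match_rate = len(matched) / len(weights) if weights else 0.0
    return score, match_rate


# Toy example: 5 variants in the score, only 3 found in the genome file.
toy_weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 0.125, "rs4": 0.25, "rs5": 0.75}
toy_dosages = {"rs1": 2, "rs2": 1, "rs4": 0}
score, match_rate = polygenic_score(toy_weights, toy_dosages)
```

Unmatched variants silently contribute nothing, so a score with a low match rate is effectively a different, weaker score than the published one. A raw score also only becomes a percentile after comparison with a reference population, which this sketch leaves out.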

Genomically Stiffed

The Nucleus website makes several genome files available for download (see image above). The structural variant and copy number files are empty placeholder stubs. They have headers but no actual content.

Claude4.6/GPT5.4 attempted to rederive these files from upstream raw files but failed due to a mismatch between the genome data format and the reference data it had. Probably the mysteries of my life lie in that data, and I don't get to know.

Definitely Worthwhile: Drug Metabolism

I always knew I was special. Now I can point to my very DNA and say that I'm somewhere between 1 in 20,000 and 1 in 1,000,000 for my drug metabolism profile (if I trust Claude4.6/GPT5.4, which I don't especially).

To recap for people who skipped straight to this section: the way we wish it worked is that we could say "gene A does Blah, your gene A is broken, so your body is sucky at Blah". It turns out our science is primitive, and we do not know this for most genes. What we can say is "people who have these 87 genes altered tend to have conditions like impaired-Blah-syndrome," which is not very mechanistic at all. Findings like this come from Genome-Wide Association Studies, and they are what Polygenic Risk Scores are calculated from.

There are some exceptions, though. Certain altered genes are so horribly deleterious that we can say yes, that gene, yes, you, Mr. Gene, is responsible for some pretty bad condition.

But another really cool exception is drug metabolism, where CPIC (Clinical Pharmacogenetics Implementation Consortium) has identified how different specific genes interact with drug metabolism, and this is material for drug choice.

Going in, I already knew that I was very sensitive to a range of drugs. Sensitive to the extent that I find 10% of the usual dose to be effective. Well, here is my metabolism:

Gene	Diplotype	Phenotype	Tool	Evidence	Frequency in Europeans
CYP2C19	*17/*17	Ultrarapid Metabolizer	PharmCAT	CPIC Strong	~4.4%
CYP2B6	*6/*6	Poor Metabolizer	PharmCAT	Strong	~4.2%
CYP2D6	*1/*4	Intermediate Metabolizer (AS 1.0)	Cyrius (51x, PASS)	CPIC Strong	~25%
CYP3A4	*1/*22	Intermediate Metabolizer	PharmCAT	Moderate	~9.5% het
CYP3A5	*3/*3	Non-expressor	PharmCAT	Strong	~88% (population default)
NAT2	*5/*5	Slow Acetylator	PharmCAT	Strong	~20% for this diplotype

Additionally:

CYP1A2 — likely decreased function (*1C het + R/H missense het), but this is an unvalidated call with no clinical-grade diplotype tool. The R/H missense is Ashkenazi-enriched (35x). Confidence: ~55-60%.
COMT — Met/Met homozygous (rs4680). Biochemically confirmed 3-4x reduced activity. ~25% of Europeans.
ABCB1 — Triple homozygous variant haplotype. ~10-15% of Europeans. No CPIC guideline. Functional significance debated.
SLC22A1/OCT1 — Novel splice donor variant (het). Other allele clean. Intermediate OCT1 function (one null + one working copy).
ADH1B — *1/*2 heterozygous (rs1229984). Encodes ~40-100x faster ethanol → acetaldehyde conversion. ~20% AF in Ashkenazi Jews. Protective against alcohol use disorder at the population level. Clinical effect is mild in this patient (alcohol is "largely fine" at modest amounts).

How unusual is this combination?

Joint probability from published European diplotype frequencies:

Scope	Joint probability	"1 in N"
4 non-default CYPs (2C19, 2B6, 2D6, 3A4)	4.4 × 10⁻⁵	~1 in 23,000
Full 6-gene profile (+ CYP3A5, NAT2)	7.7 × 10⁻⁶	~1 in 129,000

The rarity is driven by three uncommon diplotypes stacking: CYP2C19 *17/*17 (4.4%), CYP2B6 *6/*6 (4.2%), CYP3A4 *1/*22 (9.5%). CYP2D6 *1/*4 at ~25% is common by itself and barely contributes. CYP3A5 *3/*3 is the majority European genotype.

Assumptions and caveats: This calculation assumes independence across loci. CYP3A4 and CYP3A5 are on the same chromosome (7q22.1) in partial LD — a modest violation. All other gene pairs are on different chromosomes. If CYP1A2 impairment were included (~1-2% in Europeans), the profile would be ~1 in 6-13 million, but this is excluded because the CYP1A2 call is unvalidated.
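The arithmetic behind these figures is easy to reproduce. The sketch below multiplies the approximate European diplotype frequencies quoted above, under the same independence assumption; the dictionary and function names are just illustrative.

```python
from math import prod

# Approximate European diplotype frequencies, as quoted in the text above.
FREQS = {
    "CYP2C19 *17/*17": 0.044,
    "CYP2B6 *6/*6": 0.042,
    "CYP2D6 *1/*4": 0.25,
    "CYP3A4 *1/*22": 0.095,
    "CYP3A5 *3/*3": 0.88,
    "NAT2 *5/*5": 0.20,
}
FOUR_CYP = ["CYP2C19 *17/*17", "CYP2B6 *6/*6", "CYP2D6 *1/*4", "CYP3A4 *1/*22"]


def joint_probability(keys) -> float:
    """Multiply frequencies; only valid if the loci are independent."""
    return prod(FREQS[k] for k in keys)


p_four = joint_probability(FOUR_CYP)   # the four non-default CYPs
p_six = joint_probability(FREQS)       # all six genes

print(f"4 CYPs:  {p_four:.1e}, about 1 in {round(1 / p_four):,}")
print(f"6 genes: {p_six:.1e}, about 1 in {round(1 / p_six):,}")
```

Because the result is a straight product, any positive correlation between loci (such as the CYP3A4/CYP3A5 linkage mentioned above) would make the true combination somewhat more common than these numbers suggest.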

Bottom line: Standard dosing assumptions fail across multiple drug-metabolizing pathways simultaneously. This is genuinely unusual at the ~1 in 23,000 level. Most prescribers will never have encountered this combination. The patient's experience of reacting to many drugs is not hypochondria or nocebo — it is the expected phenotype of this genotype.

Independence across loci seems improbable, but the conclusion that I process drugs differently has been clinically verified, as they say. In fact, the particular genes affected do a really solid job of retrospectively predicting my reaction to specific drugs.

Ok, but how predictive and trustworthy is this really?

Epistemic Hygiene: Post-Preregistration

Trustworthy science writes down its predictions first, runs the experiment, and then grades. I didn't do that. I didn't know enough about the experiment to do it. However, the beautiful, tragic nature of LLM existence is that I can start a fresh instance, and it doesn't know the observed results.

With a model's help, I constructed a blind benchmark and fed it to Claude Opus 4.6 and GPT5.4:

My genomic data: star-allele calls, metabolizer phenotypes, variant-level findings across 22 identified and 23 unknown pharmacology genes. 14 genes are non-default.
A list of 88 drugs to make predictions about. For 18, I have ground truth data.

Out of 18 drugs: 6-7 fully correct predictions of my reaction, 3-5 partially correct predictions, and 2 where one model was correct and the other wrong.

The summary stats here are lossy on how surprising the predictions were, but I would say this is really quite impressively informative, and I expect it to be predictive of drugs I haven't taken.

There's one drug where I'm still early on it, and the response isn't what was expected, but it's a weird case. Otherwise, if the models are wrong, it's in underestimating the magnitude of the response rather than getting the direction wrong. Claude's post-hoc explanation is that having multiple drug metabolism pathways affected has an additive effect, but neither model flagged that in advance.

See this collapsed section for a more detailed drug metabolism evaluation output.

Discussion of genome-based drug retrodiction success

Notes:

It's been a long time since I've taken Temazepam; I wouldn't confidently say I had an abnormal reaction to it such that the prediction was wrong.
I don't think my lurasidone response was necessarily atypical. I'm not sure, wouldn't necessarily give the models credit for it.

The Best Predictions: Genes That Clearly Worked

CYP3A4*22 — the standout allele

This patient is CYP3A4 *1/*22 (intermediate metabolizer) with CYP3A5 *3/*3 (non-expressor), meaning both CYP3A enzymes are reduced. This predicted higher exposure for CYP3A substrates, and reality confirmed it across multiple drugs:

Quetiapine: Predicted higher exposure → observed strong sedation. Published PK data shows ~2.5× higher concentrations in *22 carriers. Score: 7/7.
Suvorexant: Predicted higher exposure → observed effective at very small doses. Score: 7/7.
Lurasidone: Predicted higher exposure → discontinued due to side effects. Score: 7/7.
Clonazepam: Predicted modestly higher exposure → observed 24–36h grogginess (though magnitude exceeded what CYP3A4 alone would explain). Score: 5/7.

CYP3A4*22 is the single most valuable finding in this patient's panel. It would have meaningfully changed prescribing decisions for at least three drugs.

CYP1A2 decreased function — caffeine nailed, olanzapine direction right

Caffeine: Predicted slow clearance → observed good effect for 1–2h then fuzzy/headachy rest of day, sleep interference. Textbook match. Score: 7/7.
Olanzapine: Predicted somewhat higher exposure → observed effective at 1/16–1/8 of a tablet with 36–48h duration. Direction correct, but the gene can't explain the extraordinary magnitude. Score: 5/7.

NAT2 *5/*5 slow acetylator: sulfasalazine confirmed

Sulfasalazine: Predicted higher sulfapyridine exposure and ADR risk → observed severe headaches. Classic, well-replicated interaction. Meta-analysis data: 3.37× higher ADR odds. Score: 5–6/7 (models should have said avoid rather than use_with_caution).

"No concern" predictions — correct and expected

Lithium (renally cleared), lamotrigine (no HLA risk alleles), biologics (proteolysis, no CYP), and alcohol (no ALDH2*2) were all predicted as unproblematic, and all were tolerated normally. These are correct predictions but low-information — a clinician without genomic data would have made the same call.

The Failures: Where Genes Didn't Predict Reality

Temazepam: the panel is blind to this (see note above)

Temazepam is cleared by glucuronidation, not CYP enzymes. The PGx panel correctly identifies this. Both models predicted "use normally, no genomic concern" with moderate-to-high confidence.

Reality: 24–36h grogginess/sedation. The drug works but the prolonged effect is a significant problem that the genotype panel simply cannot see. Whatever is causing this patient's prolonged benzodiazepine effects — it's not CYP-mediated and it's not captured by any gene on this panel.

Codeine — CPIC-grade prediction didn't manifest

CYP2D6 IM (activity score 1.0) → CPIC says reduced codeine activation → both models predicted reduced analgesia with high confidence.

Reality: effective after about an hour, described as expected. Activity score 1.0 appears to sit in a zone where population-level guidelines overstate the individual effect. The genomic logic is textbook; the patient just didn't match the population average.

Diazepam — mixed signals resolved badly by one model

CYP2C19 UM pushes toward lower exposure; CYP3A4 IM pushes toward higher exposure. GPT left this as "mixed/unclear" and scored well. Claude committed to CYP2C19 dominance, predicted lower exposure and shorter duration — the opposite of the observed 24–36h grogginess.

The lesson: when two PGx signals point in opposite directions, confident resolution in either direction is risky.

Bottom Line

The genomic data is genuinely predictive for CYP3A4 substrates, caffeine, and sulfasalazine — these are real findings that would have changed prescribing decisions.

It is correct but low-information for drugs without PGx liability — biologics, lithium, and the like behave as expected regardless of genotype.

It is overconfident on CYP2D6 IM and CYP2C19 UM — these are real metabolic effects that don't always dominate clinical outcomes.

And it is blind to a recurring pattern of broad CNS drug sensitivity that runs through this patient's drug history. The CYP variants explain part of it. The rest is an open question that current pharmacogenomic panels don't answer.

Gene	Predictive value in this patient
CYP3A4*22 + CYP3A5 *3/*3	High: confirmed across 4 drugs
CYP1A2 decreased	Moderate-high: caffeine confirmed, olanzapine direction right
NAT2 *5/*5	High: sulfasalazine confirmed
CYP2C19 *17/*17 UM	Mixed: diazepam wrong direction, escitalopram TBD, PPIs untested
CYP2D6 IM (AS 1.0)	Low: codeine prediction didn't manifest
HLA-A, HLA-B (absence of risk alleles)	Correct but expected
ADH1B *1/*2, ALDH2 normal	Correct: alcohol tolerance confirmed
ABCB1 variants	Uncertain: clinical significance unclear
Ungenotyped PD genes	The biggest gap: may explain the unexplained CNS sensitivity

Final note: Actually, when I looked into the ungenotyped PD genes, they turned out to be probably not the explanation. Claude then said the multiple-pathway interaction was likely being underweighted.

Overall, the drug metabolism findings from this vibe analysis are really not chance. The models helped me pull out real signal here.

It's clear from the results that the genes I was able to identify (not everything relevant was available in this short-read commercial sequencing) were not adequate to perfectly predict all my drug reactions. I think going off these genes would have false positives/negatives in some cases with some drugs, but the recommendations of the output are "use normally" vs "use with caution" vs "strong caution". It outputs "use caution" in many cases.

It is definitely the case that I wish I'd had these genes and the corresponding list of drug predictions going back the last couple of decades of my life. In some cases, when I was having a bad reaction, I could have stopped immediately and not been surprised. Also just seems great in general to know which drugs are more or less likely to be a problem.

Curiously, across a range of drug classes and purposes, there is typically at least one drug that avoids the pathways where I'm atypical. This is surprising to me. I would have thought that if a bunch of drugs do the same things, they'd get metabolized in the same pathways, but apparently not.

The drug results alone justify the time and cost of the exercise.

Closing Thoughts

So far, I think only the drug metabolism stuff has survived scrutiny and has practical implications, which is hardly small, but I didn't get the kinds of answers that would narrow the focus of my Bipolar interventions in the way I was hoping.

I've been doing some further projects with the LLMs, just doing literature reviews and analyzing Bipolar GWAS studies, seeing if somehow I can figure out what's going wrong in Bipolar generally. At some point, an AI will be powerful enough to infer what's going on without needing any further experiments (cf. Einstein's Arrogance), and I'd be surprised if we're there yet, but I figure I can keep trying with each generation till the mystery is solved.

^[1]

Directory structure and resulting files.

Directory Structure and Files

# RubyGeneticCode — Project Structure

```
RubyGeneticCode/
├── README.md
├── Snakefile
├── environment.yaml
├── genes_of_interest.xlsx
├── genome_analysis_plan.md
├── pharmcat.log
│
├── data/
│   ├── raw/
│   │   ├── Ruben_Bloom_nucleus_dna_download_cnv_NU-HYFQ-8076.cnv.vcf.gz
│   │   ├── Ruben_Bloom_nucleus_dna_download_cnv_NU-HYFQ-8076.cnv.vcf.gz.tbi
│   │   ├── Ruben_Bloom_nucleus_dna_download_cram_NU-HYFQ-8076.cram
│   │   ├── Ruben_Bloom_nucleus_dna_download_cram_NU-HYFQ-8076.cram.crai
│   │   ├── Ruben_Bloom_nucleus_dna_download_sv_NU-HYFQ-8076.sv.vcf.gz
│   │   ├── Ruben_Bloom_nucleus_dna_download_sv_NU-HYFQ-8076.sv.vcf.gz.tbi
│   │   ├── Ruben_Bloom_nucleus_dna_download_vcf_NU-HYFQ-8076.vcf.gz
│   │   └── Ruben_Bloom_nucleus_dna_download_vcf_NU-HYFQ-8076.vcf.gz.tbi
│   │
│   └── working/
│       ├── vep_regions.bed
│       ├── vep_regions.txt
│       │
│       ├── bipolar_model/
│       │   ├── anchor_snp_coverage.tsv
│       │   ├── anchor_snp_genotypes.tsv
│       │   ├── moderate_high_pathway_variants.tsv
│       │   ├── pathway_gene_variants.tsv
│       │   ├── pathway_gene_variants.vcf
│       │   ├── pathway_summary.json
│       │   ├── pathway_vep_annotated.tsv
│       │   └── regions.txt
│       │
│       ├── hla/
│       │   ├── hla_pipeline.py
│       │   ├── update_results.py
│       │   ├── hla_results.json
│       │   ├── hla_tag_snp_results.json
│       │   ├── hla_region.bam
│       │   ├── hla_region.bam.bai
│       │   ├── hla_region.extract.log
│       │   ├── hla_region.extracted.1.fq.gz
│       │   ├── hla_region.extracted.2.fq.gz
│       │   ├── hla_region_namesorted.bam
│       │   ├── hla_R1.fastq.gz
│       │   ├── hla_R2.fastq.gz
│       │   ├── check_bcftools.sh
│       │   ├── diagnose.sh
│       │   ├── diagnose2.sh
│       │   ├── diagnose3.sh
│       │   ├── find_bcftools.sh
│       │   ├── find_envs.sh
│       │   ├── find_tools.sh
│       │   ├── find_tools2.sh
│       │   ├── fix_and_run_optitype.sh
│       │   ├── install_and_run.sh
│       │   ├── install_bcftools.sh
│       │   ├── run_hla_extraction.sh
│       │   ├── run_optitype.sh
│       │   ├── arcashla_results/
│       │   │   └── hla_region.genotype.log
│       │   └── optitype_results/
│       │       ├── hla_typing_coverage_plot.pdf
│       │       └── hla_typing_result.tsv
│       │
│       ├── mosdepth/
│       │   ├── coverage.mosdepth.global.dist.txt
│       │   └── coverage.mosdepth.summary.txt
│       │
│       ├── pgx/
│       │   ├── cyrius_cyp2d6.json
│       │   ├── cyrius_cyp2d6.tsv
│       │   ├── cyrius_manifest.txt
│       │   ├── pharmcat_input.match.json
│       │   ├── pharmcat_input.match_warnings.txt
│       │   ├── pharmcat_input.missing_pgx_var.vcf
│       │   ├── pharmcat_input.phenotype.json
│       │   ├── pharmcat_input.preprocessed.vcf.bgz
│       │   ├── pharmcat_input.report.json
│       │   ├── pharmcat_input.vcf.gz
│       │   ├── pharmcat_input.vcf.gz.tbi
│       │   ├── pharmcat_with_ref.match.json
│       │   ├── pharmcat_with_ref.match_warnings.txt
│       │   ├── pharmcat_with_ref.missing_pgx_var.vcf
│       │   ├── pharmcat_with_ref.phenotype.json
│       │   ├── pharmcat_with_ref.preprocessed.vcf.bgz
│       │   └── pharmcat_with_ref.report.json
│       │
│       ├── prs/
│       │   └── multi_trait_prs_results.json
│       │
│       ├── rare_variants/
│       │   ├── full_vep_functional.tsv
│       │   ├── full_vep_functional.tsv_summary.html
│       │   ├── full_vep_functional.tsv_warnings.txt
│       │   ├── high_impact_rare.tsv
│       │   └── moderate_impact_rare.tsv
│       │
│       ├── sv_cnv/
│       │   └── exclude_contigs.tsv
│       │
│       ├── vep/
│       │   ├── clinvar_hits_vep_input.txt
│       │   ├── clinvar_hits_vep_output.tsv
│       │   ├── sweep_variants.vcf.gz
│       │   └── sweep_variants.vcf.gz.tbi
│       │
│       └── vep_local/
│           ├── chr_synonyms.txt
│           ├── functional_variants_canonical.tsv
│           ├── novel_variants_canonical.tsv
│           ├── rare_variants_canonical.tsv
│           ├── sweep_vep_gnomad.tsv
│           └── sweep_vep_gnomad.tsv_summary.html
│
├── docs/
│   └── gwas_enrichment_preregistration (1).md
│
├── notebooks/
│   (empty)
│
├── outputs for sharing/
│   └── ruben_bloom_gene_drug_analysis_incomplete.pdf
│
├── refs/
│   ├── 1kg/
│   │   (empty)
│   │
│   ├── config/
│   │   ├── output_contract.md
│   │   ├── project.yaml
│   │   └── reference_stack.md
│   │
│   ├── gene_panels/
│   │   ├── bipolar_model_pathways.tsv
│   │   ├── circadian_genes.txt
│   │   ├── expanded_genes.tsv
│   │   ├── genes_of_interest_from_sheet.tsv
│   │   ├── inflammatory_genes.txt
│   │   ├── neuropsychiatric_genes.txt
│   │   └── pgx_genes.txt
│   │
│   ├── pgs_scores/
│   │   ├── PGS000907_hmPOS_GRCh38.txt.gz
│   │   ├── PGS000908_hmPOS_GRCh38.txt.gz
│   │   ├── PGS001287_PsA_GRCh38.txt.gz
│   │   ├── PGS002318_hmPOS_GRCh38.txt.gz
│   │   ├── PGS002344_hmPOS_GRCh38.txt.gz
│   │   ├── PGS002746_hmPOS_GRCh38.txt.gz
│   │   ├── PGS002786_bipolar_GRCh38.txt.gz
│   │   ├── PGS002886_CRP_GRCh38.txt.gz
│   │   └── PGS005393_hmPOS_GRCh38.txt.gz
│   │
│   ├── reference_genome/
│   │   ├── chrM.fa.gz
│   │   ├── clinvar.vcf.gz
│   │   ├── clinvar.vcf.gz.tbi
│   │   ├── reference.dict
│   │   ├── reference.fa
│   │   ├── reference.fa.fai
│   │   └── reference.fa.gz
│   │
│   ├── score_lists/
│   │   └── pgs_catalog_score_targets.tsv
│   │
│   └── vep_cache/
│       └── homo_sapiens/
│           └── 115_GRCh38/
│               ├── 1/ ... 22/          (autosomes)
│               ├── GL*/, KI*/          (alt contigs)
│               └── (~13,926 cache files total)
│
├── reports/
│   ├── analysis_plan_phase7.md
│   ├── bipolar_disorder_causal_model_v031_final_report.md
│   ├── bipolar_model_vs_genome_deep_comparison.md
│   ├── bipolar_model_vs_ruben_genome_comparison.md
│   ├── clinvar_annotated_variants.tsv
│   ├── clinvar_annotation_summary.md
│   ├── clinvar_hits.tsv
│   ├── clinvar_interpretation.md
│   ├── data_qc.md
│   ├── drug_contingency_table.html
│   ├── drug_contingency_table.md
│   ├── drug_contingency_table.pdf
│   ├── drug_gene_crossref.md
│   ├── drug_history.md
│   ├── drug_sensitivity_candidates.tsv
│   ├── expanded_gene_summary.tsv
│   ├── expanded_gene_variants.tsv
│   ├── genes_of_interest_gene_summary.tsv
│   ├── genes_of_interest_gene_summary_phase1.tsv
│   ├── genes_of_interest_gene_sweep.md
│   ├── genes_of_interest_gene_variants.tsv
│   ├── genes_of_interest_gene_variants_phase1.tsv
│   ├── genes_of_interest_progress.md
│   ├── genes_of_interest_rsid_lookup.tsv
│   ├── genetic_correlations_psychiatric_immune.md
│   ├── genomic_findings_report.xlsx
│   ├── genomic_hypothesis_report.md
│   ├── hla_typing.md
│   ├── input_inventory.md
│   ├── markdown-pdf.css
│   ├── mechanism_memo.md
│   ├── mechanism_scores.json
│   ├── mr_inflammation_psychiatry_literature_review.md
│   ├── multi_cyp_joint_probability.md
│   ├── multi_trait_prs.md
│   ├── pgx_report.md
│   ├── pharmcat_prep.md
│   ├── prs_gwas_quality_evaluation.md
│   ├── prs_gwas_quality_summary_table.md
│   ├── rare_variant_analysis.md
│   ├── rare_variant_candidates.tsv
│   ├── research_narrative.md
│   ├── ruben_bloom_drug_contingency_analysis_incomplete.pdf
│   ├── session_handoff.md
│   ├── setup_checklist.md
│   ├── sv_cnv_analysis.md
│   ├── synthesis.md
│   ├── synthesis_attempt_01.md
│   ├── unified_findings.md
│   ├── vep_annotated_variants.tsv
│   ├── vep_annotation_summary.md
│   ├── vep_functional_variants.tsv
│   └── vep_local_annotation_summary.md
│
├── scripts/
│   ├── analyze_pathway_results.py
│   ├── bipolar_model_pathways.py
│   ├── extract_pathway_variants.py
│   ├── rare_variant_screen.py
│   │
│   ├── annotate/
│   │   ├── clinvar_annotate.py
│   │   └── vep_rest_annotate.py
│   │
│   ├── pgx/
│   │   └── normalize_for_pharmcat.sh
│   │
│   ├── preprocess/
│   │   ├── extract_genes_of_interest.py
│   │   ├── inspect_inputs.py
│   │   ├── query_genes_of_interest_regions.py
│   │   └── query_genes_of_interest_rsids.py
│   │
│   ├── prs/
│   │   ├── compute_bipolar_prs.py
│   │   ├── compute_multi_trait_prs.py
│   │   ├── compute_prs.py
│   │   └── compute_small_prs.py
│   │
│   ├── reports/
│   │   ├── build_genomic_report.py
│   │   └── write_setup_checklist.py
│   │
│   └── setup/
│       (empty)
│
└── tools/
    └── pharmcat/
        ├── pharmcat-3.2.0-all.jar
        ├── pharmcat-preprocessor-3.2.0.tar.gz
        ├── pharmcat_positions.uniallelic.vcf.bgz
        ├── pharmcat_positions.uniallelic.vcf.bgz.csi
        ├── pharmcat_positions_3.2.0.vcf.bgz
        ├── pharmcat_positions_3.2.0.vcf.bgz.csi
        ├── pharmcat_regions.bed
        └── preprocessor/
            ├── README.md
            ├── requirements.txt
            ├── pharmcat_pipeline
            ├── pharmcat_vcf_preprocessor
            └── pcat/
                ├── __init__.py
                ├── common.py
                ├── exceptions.py
                ├── preprocess.py
                ├── utilities.py
                └── chr_rename_map.tsv
```

^{^}
All the various tools had to be made to work.
- cyvcf2 — Fast VCF file parser built on htslib. Used throughout the PRS (polygenic risk score) computation scripts, rare variant screening, and ClinVar annotation to read/write VCF files.
- pysam — Python wrapper for samtools/htslib. Used in the main PRS computation script for reading indexed VCF/BAM files.
- numpy — Numerical computing library. Used in the PRS scripts for score calculations and array operations.
- openpyxl — Excel file reader/writer. Used to parse gene-of-interest spreadsheets and to build the final genomic report workbook with styled output.
- bcftools — Swiss-army knife for VCF/BCF manipulation (filtering, querying, normalizing variants). Called as a subprocess from several scripts and the PharmCAT normalization shell script.
- samtools — BAM/CRAM file manipulation and indexing (declared in environment.yaml).
- htslib — C library underpinning bcftools/samtools; provides tabix and bgzip (declared in environment.yaml).
- bedtools — Genome arithmetic (intersecting, merging genomic intervals) (declared in environment.yaml).
- vt — Variant normalization and decomposition tool (declared in environment.yaml).
- Ensembl VEP — Variant Effect Predictor for functional annotation of variants. Called via REST API in vep_rest_annotate.py and also listed as a conda dependency.
- SnpSift — Companion to SnpEff for filtering/extracting fields from annotated VCFs (declared in environment.yaml).
- Snakemake — Workflow engine orchestrating the analysis pipeline. Scripts use the injected snakemake object for inputs/outputs/params.
- PharmCAT — Pharmacogenomics Clinical Annotation Tool. Maps genotypes to drug-response phenotypes. Run via Docker.
- pgsc_calc — Polygenic Score Catalog calculator pipeline. Computes PRS from published GWAS weights. Run via Docker/Nextflow.
- PLINK 2 — Whole-genome association analysis toolset for QC, PCA, and genotype management.
- Cyrius — CYP2D6 star-allele caller from whole-genome sequencing data. Listed as a pip dependency in environment.yaml.
- Aldy — Pharmacogene star-allele caller (mentioned in README tool table).
^{^}
According to Claude, since 1966, there have been ~80,000 peer-reviewed papers, ~4,000 clinical trials registered since 2000, and perhaps $5 billion in funding for studying bipolar.
Depending on the threshold, people with a bipolar spectrum disorder are 0.5-4% of the population.
^{^}
Interventions attempted or under consideration include:
- Careful circadian rhythm entrainment
  - consistent sleep/wake times
  - strong blue light in the morning, strong absence of blue light in the evening
  - Melatonin taken 4-5 hours before sleep for chronobiotic effect
- Vagus nerve toning
- Heart rate variability biofeedback
- HPA Axis/cortisol supplements like Ashwagandha
- Anti-inflammatory
  - Minocycline (crosses blood-brain barrier)
  - Omega 3s
  - Vegan diet??
^{^}
I switch between using each of them.

epiphi (3mo)

We should chat about this! I have been semi-vibe-analyzing my genome based in part on the January 2025 blog post "Calculating Polygenic Risk Scores from Whole Genome Sequencing Data" and have replicated some of the same conclusions as you.

The most immediately actionable / least ambiguous things are SNPs with well-established effects.

I recommend checking your VCF for variants in the following sets of genes:

- The American College of Medical Genetics and Genomics (ACMG) Secondary Findings Working Group (SFWG) list of recommended genes for opportunistic screening, most recently updated in 2025. This is a list of 100 genes that doctors recommend taking a look at if you happen to already have genomic data. You still need to cross-reference ClinVar: the list has per-gene reporting thresholds that are, for most of them, "all P and LP" (i.e. all variants annotated as "Pathogenic" or "Likely Pathogenic").
- Nutrigenomics: I ended up asking Claude to write up a list of these, since I couldn't find a single canonical source. These are genes like FUT2 (vitamin B12 levels) and MTHFR (folate metabolism) that have common variants which may affect your ability to acquire certain nutrients from certain dietary sources.
- Pharmacogenomics: I agree that PharmCAT is the right tool for this.
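The ClinVar cross-reference described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the workflow either of us actually ran: it assumes ClinVar's VCF carries `CLNSIG` and `GENEINFO` in the INFO column, that your own VCF is single-sample with `GT` as the first FORMAT field, and that both files are already normalized to the same representation. The function names (`plp_index`, `personal_hits`) are made up for the example.

```python
import re

PLP_TERMS = {"pathogenic", "likely_pathogenic"}


def norm_chrom(chrom):
    """Drop a leading 'chr' so 'chr1' and '1' compare equal."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def parse_info(info):
    """Turn a VCF INFO string into a dict (flag-style keys map to '')."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def plp_index(clinvar_lines, genes):
    """Index ClinVar P/LP records that fall in the gene set of interest."""
    index = {}
    for line in clinvar_lines:
        if line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, _qual, _filt, info = line.rstrip("\n").split("\t")[:8]
        fields = parse_info(info)
        clnsig = fields.get("CLNSIG", "")
        # Exact-term match, so "Conflicting_..." entries are not counted as P/LP.
        terms = {t.lower() for t in re.split(r"[|/,]", clnsig)}
        if not terms & PLP_TERMS:
            continue
        symbols = {g.split(":")[0] for g in fields.get("GENEINFO", "").split("|") if g}
        hit_genes = sorted(symbols & genes)
        if hit_genes:
            index[(norm_chrom(chrom), int(pos), ref, alt)] = (hit_genes, clnsig)
    return index


def personal_hits(vcf_lines, index):
    """Return personal variants whose called ALT allele is in the P/LP index."""
    hits = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = cols[0], int(cols[1]), cols[3], cols[4].split(",")
        gt = cols[9].split(":")[0]  # assumes GT is the first FORMAT field
        called = {int(a) for a in re.split(r"[/|]", gt) if a.isdigit()}
        for i, alt in enumerate(alts, start=1):
            key = (norm_chrom(chrom), pos, ref, alt)
            if i in called and key in index:
                hits.append((key, gt) + index[key])
    return hits
```

Matching on exact (chrom, pos, ref, alt) will miss variants that are represented differently in the two files, so in practice you would left-align and split multiallelics first (e.g. with bcftools) before trusting an empty result.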

Polygenic risk scores vary a lot in quality and are more difficult to calculate.

I'm still forming opinions on this, but a few important caveats:

- Published PRS are of highly variable quality, and the best ones are proprietary.
- Often papers will only publish a list of significant SNPs, which is much less information than you want, and only kind of a PRS. I have ended up trying to do some of my own GWAS --> PRS conversions, which has a lot of its own minefields (more advanced methods go beyond "Bayesian correction for linkage disequilibrium": they use larger datasets, combine multiple GWAS, and do adjustments for family relatedness that are beyond my sophistication).
- Your VCF is not enough to calculate your PRS, because the reference (rather than variant) allele is the "effect allele" for many scores of interest. The default behaviour of tools like PRSKB CLI and pgsc_calc is to impute the missing alleles based on your VCF, but this is going to be wrong; you likely want to re-call variants from your CRAM (this is discussed at length in the blog post I linked earlier).
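To make the reference-allele problem concrete, here is a minimal sketch of the dosage logic. It is a toy, not how PRSKB CLI or pgsc_calc work internally: the data structures (`calls`, `ref_alleles`, `covered`) are hypothetical stand-ins for a parsed VCF, a reference FASTA lookup, and a coverage check against the CRAM. The point is only that a site absent from a variants-only VCF but confidently covered is homozygous reference, which means dosage 2 when the effect allele is the reference allele.

```python
def effect_dosage(site, calls, ref_alleles, covered):
    """Count copies of the effect allele at one scoring site.

    calls:       {(chrom, pos): (allele1, allele2)} for sites present in the VCF
    ref_alleles: {(chrom, pos): ref_base}, a stand-in for a reference lookup
    covered:     set of (chrom, pos) with enough read depth to trust hom-ref
    Returns None when the genotype is genuinely unknown.
    """
    key = (site["chrom"], site["pos"])
    if key in calls:
        return sum(1 for allele in calls[key] if allele == site["effect"])
    if key in covered:
        # Absent from a variants-only VCF but well covered: homozygous reference.
        return 2 if ref_alleles[key] == site["effect"] else 0
    return None


def polygenic_score(weights, calls, ref_alleles, covered):
    """Sum weight * dosage, reporting how many sites were usable or missing."""
    total, used, missing = 0.0, 0, 0
    for site in weights:
        dosage = effect_dosage(site, calls, ref_alleles, covered)
        if dosage is None:
            missing += 1
            continue
        total += dosage * site["weight"]
        used += 1
    return total, used, missing
```

A scorer that simply skips sites missing from the VCF would drop every hom-ref site, including the ones where the reference allele carries the weight, so the raw score is biased in a way that depends on how many effect alleles happen to be reference alleles.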

CNVs are fairly tractable to calculate from CRAMs

You probably should do this analysis, since there are some CNVs that have very high effect sizes for psychiatric conditions.

Ruby (3mo)

Thanks! Sounds good. Yeah, I'll check for those variants.

Regarding PRS quality, indeed. There's a table in a collapsible section with an analysis of the quality of PRSs used. Interesting regarding your own conversion from GWAS to PRS.

Your VCF is not enough to calculate your PRS, because the reference (rather than variant) allele is the "effect allele" for many scores of interest. The default behaviour of tools like PRSKB CLI and pgsc_calc is to impute the missing alleles based on your VCF, but this is going to be wrong; you likely want to re-call variants from your CRAM (this is discussed at length in the blog post I linked earlier).

Ah, cool, yes. Interestingly Claude/GPT left a comment in its code mentioning exactly this problem, and then punted on it and I didn't notice.

We attempted this and it failed because of a contig mismatch between the CRAM and the reference. Going back to it, could we have just downloaded the appropriate one? (DRAGEN/Illumina?) Another thing not done for no good reason that I didn't catch. (Other things I did catch, but not this.)

Cool, that gives me some things to do.

epiphi (3mo)

Oh, huh, DRAGEN is new Illumina software that appears to be using human pangenome references; do you know what reference genome your CRAM was aligned to?

Since it's already aligned to a reference, your better bet is to remap the coordinates; LiftOver in bcftools is a normal way to remap from one reference to another. I used Manta for calling CNVs, but it seems like maybe DRAGEN is better software?
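A quick way to see what kind of mismatch you have is to compare the `@SQ` lines in the CRAM header (as printed by `samtools view -H`) against the `.fai` index of the reference you tried to use. The sketch below assumes you have both as plain text; the function names are made up for illustration, and it only checks names and lengths, not the `M5` checksums that CRAM decoding ultimately relies on.

```python
def parse_sq(header_text):
    """Map contig name -> length from the @SQ lines of a SAM/CRAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        contigs[tags["SN"]] = int(tags["LN"])
    return contigs


def parse_fai(fai_text):
    """Map contig name -> length from a FASTA index (.fai) file."""
    contigs = {}
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            contigs[name] = int(length)
    return contigs


def bare(name):
    """Contig name without a leading 'chr'."""
    return name[3:] if name.startswith("chr") else name


def compare_contigs(cram, ref):
    """Classify each CRAM contig against the reference contigs."""
    ref_by_bare = {bare(name): (name, length) for name, length in ref.items()}
    report = {"match": [], "renamed": [], "length_mismatch": [], "absent": []}
    for name, length in cram.items():
        if name in ref:
            ref_name, ref_len = name, ref[name]
        elif bare(name) in ref_by_bare:
            ref_name, ref_len = ref_by_bare[bare(name)]
        else:
            report["absent"].append(name)
            continue
        if ref_len != length:
            report["length_mismatch"].append(name)  # likely a different build
        elif ref_name != name:
            report["renamed"].append(name)  # same length, only 'chr' differs
        else:
            report["match"].append(name)
    return report
```

If everything lands in `renamed`, a chromosome rename map is probably all that is needed; entries in `length_mismatch` suggest a different genome build, which is where remapping or re-downloading the matching reference comes in.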

avturchin (3mo)

I used Claude Code for my genome analysis and the results were great. It can also provide interesting answers even to stupid questions like: what is my ability to lucid dream? What is my IQ?
