We should chat about this! I have been semi-vibe-analyzing my genome based in part on the January 2025 blog post Calculating Polygenic Risk Scores from Whole Genome Sequencing Data and have replicated some of the same conclusions as you.
The most immediately actionable / least ambiguous things are SNPs with well-established effects.
I recommend checking your VCF for variants in the following sets of genes:
Polygenic risk scores vary a lot in quality and are more difficult to calculate.
I'm still forming opinions on this, but a few important caveats:
CNVs are fairly tractable to calculate from CRAMs
You probably should do this analysis, since there are some CNVs that have very high effect sizes for psychiatric conditions.
Thanks! Sounds good. Yeah, I'll check for those variants.
Regarding PRS quality, indeed. There's a table in a collapsible section with an analysis of the quality of PRSs used. Interesting regarding your own conversion from GWAS to PRS.
- Your VCF is not enough to calculate your PRS, because the reference (rather than variant) allele is the "effect allele" for many scores of interest. The default behaviour of tools like PRSKB CLI and PSG_Calc is to impute the missing alleles based on your VCF, but this is going to be wrong; you likely want to re-call variants from your CRAM (this is discussed at length in the blog post I linked earlier).
Ah, cool, yes. Interestingly Claude/GPT left a comment in its code mentioning exactly this problem, and then punted on it and I didn't notice.
CNVs are fairly tractable to calculate from CRAMs
We attempted this and it failed because of contig mismatch with the reference on the CRAM. Going back to it, we could have just downloaded the appropriate one? (DRAGEN/Lumina?) Another thing not done for no good reason that I didn't catch. (Other things I did catch, but not this.)
Cool, that gives me some things to do.
I used Claude Code for my genome analysis and the results were great. It can also provide interesting answers even to stupid questions like: what is my ability to lucid dream? What is my IQ?
The most interesting and useful result concerns drug metabolism; you can skip to that section here.
What I did
I had my genome sequenced by Nucleus Genomics. I downloaded various genome files from Nucleus and started a new Git repo within Cursor. I worked with the LLMs to make a plan, hoping to find insights that would help me prioritize treatments by better understanding my own conditions, e.g., identifying responsible pathways.
Or rather, what Claude/ChatGPT did
This is the dense description that will mean something specific to someone who knows anything about genetic analysis, and is otherwise supposed to seem impressive if you don't. In all seriousness, though, I think the LLMs do offer you more analysis than you can get by uploading your .vcf to a site, at the risk that they'll definitely lead you astray with interpretation unless you're careful.
This project analyzed a 43x whole-genome sequence for a patient with bipolar, inflammatory symptoms, and extreme multi-drug sensitivity. The work included: data QC and coverage verification; pharmacogenomic star-allele calling via PharmCAT and Cyrius (CYP2D6); HLA class I typing via OptiType with tag SNP cross-validation; ClinVar and VEP functional annotation of a 76-gene candidate sweep; a genome-wide rare variant screen filtering 5 million variants down to 27 high-impact candidates; multi-trait polygenic risk scores for psychiatric and immune traits; two literature reviews synthesizing published genetic correlations (LDSC) and Mendelian randomization evidence for inflammation-psychiatry causal pathways; a critical adversarial review that checked all findings against population frequencies and ClinVar review status, dismantling several overclaimed results; a joint probability calculation quantifying the rarity of the patient's multi-CYP profile (~1 in 23,000); a drug history cross-reference mapping 23 medications to specific genotypes; and a comprehensive drug contingency table covering 75+ medications with pathway-specific safety ratings.
To give you an idea of what the project looked like, you can see the resulting directory structure within this footnote[1] and the software packages installed are in this footnote[2].
Motivation
I would like to reduce the up and down fluctuations of my Bipolar. They're simultaneously not that bad, but also bad enough to be unpleasant, disruptive, and to rob me of many productive hours.
Bipolar, for all its prevalence and study[3], is not yet well understood. It's maybe something something a circadian rhythm disorder. It's maybe something something inflammatory. It's maybe a single disease with somewhat different presentations, or maybe it's multiple diseases with somewhat similar presentation.
I hoped that if I could look at my own genes and see which were anomalous, perhaps I could pin down what's going wrong in my mind. If it seems to be clock gene-related, I'll emphasize circadian type treatments; if it's inflammatory, focus on that. I probably want to work on everything, but info that directs effort would be good[4].
I don't know much about genetics
Due to poor life choices, I'm likely missing even many basics taught in high school. Sure, I knew about SNPs before this project, but I couldn't have told you what linkage disequilibrium was.
But if the hundreds of new users submitting AI-generated work to LessWrong have taught me anything, it's that you don't have to be a domain expert to believe you've found invaluable insight with the help of a friendly AI collaborator. Ideally, I'd stop and study a genetics primer, but it's so much easier to mash "continue" on Cursor/Claude Code.
I am hoping that, via sufficient paranoia, prior experience with LLMs, and red-teaming-like efforts, I can nonetheless get something trustworthy out of the LLMs.
LLMs know about genetics and dev-ops
I got the notification from Nucleus Genomics that my genome results were in. Nucleus provides a basic readout of "elevated risk of this, decreased risk of that", and I'm aware there are other sites where you can upload a .vcf file and get a similar analysis.
However, I wanted a bespoke analysis, and I wanted to get it via interacting with an LLM that can run arbitrary analyses on my genome. I created a new Git repo, described my initial aim, and Claude/GPT5.4[5] created a plan and set up a good repo structure. I downloaded all the files containing my genome.
For many kinds of genetic analyses, there are available public git repos. What became apparent is that LLMs were incredibly useful for gene analysis, not just because of knowing more about genetics than me, but because the analysis represented a huge devops/programming task that the LLMs had no problem with. God, I'm grateful for something that figures out Python environments and package dependencies for me. I'll legit guess that the LLM proficiency at coding allowed me to do, in a dozen hours, as much genetic analysis as would take a genetics PhD (or me) weeks to do, just because of the programming involved.
.vcf is conveniently small but you might want more info than gets compressed into that
My Naive Approach
At the start of this project, I had been talking to Claude about potentially relevant genes. We had an initial list of 50 that grew to about a hundred.
Looking over my clinical presentation and drug response, we broke things down into five mechanistic axes: pharmacogenomic, inflammatory/neuroinflammatory, circadian, catecholamine, and neuropsychiatric. For each, it was clear I had variants on implicated genes.
Stop-gained IDO2 was a neuroinflammatory bridge, NFKB1 splice donor was a central inflammatory regulator, PER3 splice acceptor was a core circadian clock gene, promoter variants of pro-inflammatory signaling genes, IL-6, IL-1β, and IL-10, etc.
At last, my entire life made sense.
Too much sense, really.
Say what you will about LLMs, they're entirely faithless to their prior work. The analysis was detailed and brutal. The just-so story wasn't patently flawed to me in my genetic ignorance, but the mistakes were not subtle, it turns out.
The LLM had assembled a story by saying, "this gene is at least vaguely linked to this thing, and it is altered in this patient, therefore it explains this result". This is problematic, it turns out, because:
One or more of the above were true about all the genes identified as explanatory. With the approach used, it seems like you'd be able to tell a just-so story for whatever. Alas, my life does not get to make sense yet.
Rebuilding
At this point, Claude4.6/GPT5.4 was pretty insistent that the answers I wanted wouldn't be found in my own genome; we had to look at the literature and population-wide studies. I didn't like that. I wanted to be able to look at my genes and understand my brain and body. Unfortunately, I think that's a fantasy that depends on us understanding the genes and what they do and how they combine and all that.
If you want to look for connections between genes and symptoms, you have to do it at the population level. Claude4.6/GPT5.4 conducted a very nice literature review on conditions interesting to me, and I learned some stuff. Not going into detail out of reticence to dump my entire medical history on the Internet all at once.
Eventually, Claude4.6/GPT5.4 agreed that we could look at my genome again for rare variations that explained something, maybe. We examined all of my FIVE MILLION GENE VARIANTS.
The gene sweep had found only common variants because it looked at pre-selected genes. A genome-wide screen could find things nobody thought to check. VEP was run across all 5 million variant calls. After filtering for rare (<0.1% AF) and high-impact variants, and removing pseudogene artifacts (GBA1/GBAP1 produced 29 false calls from paralog misalignment), 27 candidates in 26 genes remained.
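The filtering step described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual `rare_variant_screen.py`: the `Variant` record, the `screen` function, and the choice to treat a missing allele frequency as "rare" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout; a real pipeline would parse these fields from VEP output.
@dataclass
class Variant:
    gene: str
    impact: str                   # VEP-style impact label, e.g. "HIGH" or "MODERATE"
    allele_freq: Optional[float]  # population allele frequency; None if not in the panel

RARE_AF = 0.001                    # the <0.1% AF threshold from the text
PARALOG_PRONE = {"GBA1", "GBAP1"}  # calls dropped because of paralog misalignment

def screen(variants):
    """Keep rare, high-impact variants outside known paralog-prone genes."""
    kept = []
    for v in variants:
        if v.impact != "HIGH":
            continue
        # Assumption: a variant with no recorded frequency is treated as rare (novel).
        if v.allele_freq is not None and v.allele_freq >= RARE_AF:
            continue
        if v.gene in PARALOG_PRONE:
            continue
        kept.append(v)
    return kept

# Placeholder records, not real calls from this genome.
example = [
    Variant("GENE_A", "HIGH", None),
    Variant("GBA1", "HIGH", 0.0001),
    Variant("GENE_B", "MODERATE", 0.0001),
    Variant("GENE_C", "HIGH", 0.2),
]
survivors = screen(example)
```

Anything that survives a screen like this is only a candidate and still needs read-level inspection.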
The most exciting initial find — a novel frameshift in SYN2, a gene associated with bipolar disorder and lithium response — turned out to be an artifact. Read-level inspection showed FILTER: MosaicLowAF, QUAL 8.7, only 3 of 37 reads supporting the variant, all on one strand. DRAGEN had correctly flagged it; the analysis pipeline had not checked.
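A check like the one the pipeline skipped might look like the sketch below. The function name and the thresholds (`min_qual`, `min_alt_fraction`) are illustrative assumptions, not DRAGEN defaults or the project's code; which strand carried the three supporting reads is also assumed for the example.

```python
def artifact_reasons(filter_field, qual, alt_reads, total_reads, alt_fwd, alt_rev,
                     min_qual=20.0, min_alt_fraction=0.2):
    """Return a list of reasons to distrust a variant call (empty list = no red flags)."""
    reasons = []
    # The caller's own FILTER column should be respected before anything else.
    if filter_field not in ("PASS", "."):
        reasons.append(f"FILTER={filter_field}")
    if qual < min_qual:
        reasons.append(f"low QUAL ({qual})")
    # A germline heterozygous call is expected near 50% alt reads; far below that is suspect.
    if total_reads > 0 and alt_reads / total_reads < min_alt_fraction:
        reasons.append(f"low alt fraction ({alt_reads}/{total_reads})")
    # All supporting reads on one strand is a classic sign of a sequencing artifact.
    if alt_reads > 0 and (alt_fwd == 0 or alt_rev == 0):
        reasons.append("alt reads on one strand only")
    return reasons

# The SYN2 call as described in the text; forward vs. reverse strand is an assumption.
syn2 = artifact_reasons("MosaicLowAF", 8.7, alt_reads=3, total_reads=37, alt_fwd=3, alt_rev=0)
```

With these illustrative thresholds, the SYN2 call trips every check at once.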
Two genuine findings emerged:
"Speculative but it sits in an interesting place." Aka, not very interesting.
Polygenic Risk Scores are where it's at, le sigh
Sigh. The engineer's mind wants mechanism, but what we get are statistics. PRSs are what the sequencing companies will give you by default. My impression is that PRS databases are actually kinda proprietary, and are what makes one company's analysis of your genome better than another's.
Nonetheless, some PRS studies are available online, and we (by which I mean it) ran my genome against PRS studies for seven different conditions.
This was modestly interesting. I'm most elevated on bipolar, which is not surprising. For another cluster of symptoms, though, I scored at ~zero despite having an unmistakable clinical presentation (one that runs in the family, no less). That would suggest my family is getting certain unfortunate outcomes without having the typical genes for them at all. There's nothing immediately actionable in that, though it explains some deviations from the typical presentation, and it's kind of an interesting finding.
Edited: I discussed this draft with Steve Hsu, who suggested I investigate the quality of the PRS studies used. Actually, most of them are weak, and it is not at all surprising to get a null result.
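For readers who have not seen one, a polygenic score is a weighted sum of effect-allele counts. The sketch below is a minimal illustration and not the project's `compute_prs.py`; the function name, variant IDs, and weights are made up. It also tracks the "match rate" (the share of score variants that have a usable genotype call) and, as an assumption, skips variants without a call rather than guessing that they are homozygous reference.

```python
def score_prs(weights, genotypes):
    """Compute a weighted allele-count score and the fraction of variants matched.

    weights:   list of (variant_id, effect_allele, beta) tuples from a scoring file
    genotypes: dict mapping variant_id -> (allele1, allele2), containing ONLY sites
               with a confident call (including homozygous-reference calls)
    """
    total = 0.0
    matched = 0
    for variant_id, effect_allele, beta in weights:
        call = genotypes.get(variant_id)
        if call is None:
            # No call at this site: skip it instead of assuming hom-ref.
            continue
        dosage = sum(1 for allele in call if allele == effect_allele)  # 0, 1, or 2
        total += beta * dosage
        matched += 1
    match_rate = matched / len(weights) if weights else 0.0
    return total, match_rate

# Made-up example: rs_b is homozygous for its effect allele, which could well be the
# reference allele and so would be absent from a variants-only VCF.
weights = [("rs_a", "A", 0.10), ("rs_b", "G", 0.02), ("rs_c", "T", 0.20)]
genotypes = {"rs_a": ("A", "G"), "rs_b": ("G", "G")}
score, match_rate = score_prs(weights, genotypes)
```

A raw score like this means little on its own; it only becomes a percentile once compared against a reference population scored the same way, and a low match rate weakens it further.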
Evaluation of the PRS studies used
| PGS ID | Source / study type | Evidence strength | Main limitations | How surprising would a low score be for someone with the condition? | Bottom line |
|---|---|---|---|---|---|
| PGS002786 | Gui et al. 2022 using PGC bipolar GWAS | Moderate | Respectable psychiatric GWAS base, but selected PGS came from a PRS association paper rather than a clean clinical prediction paper; local match rate only 56.7% | Not very surprising | Useful as a research signal, but a low or average BD score would still be common among true cases |
| PGS000907 | Campos et al. 2021 using UK Biobank depression GWAS | Moderate | Huge sample, but phenotype is broad depression / UKB-derived rather than the cleanest strict MDD definition; local match rate 44.0% | Not surprising | Reasonable generic mood-liability score, not a strong individual predictor |
| PGS000908 | Campos et al. 2021 using Jansen et al. insomnia GWAS | Moderate to moderately strong GWAS; moderate realized score | Very large discovery GWAS, but local match rate only 33.5%; practical score quality is much worse than source-paper quality | Not surprising | A positive score is mildly supportive; a low score would not rule out real insomnia liability |
| PGS002318 | Weissbrod et al. 2022 UK Biobank PRS release | Moderate | Large sample and validation, but trait is based on simple self-report and best European incremental R² is only about 0.036 | Not surprising | Neutral or low score is very plausible even with real circadian problems |
| PGS002746 | Lahey et al. 2022 using Demontis et al. ADHD GWAS | Moderate to weak for this use | Underlying ADHD GWAS is decent, but selected PGS paper focused on childhood psychopathology / impulsivity in 4,483 children, not broad adult diagnosis prediction; local match rate 45.5% | Not surprising | Directional signal only; poor basis for strong inference in an adult |
| PGS002344 | Weissbrod et al. 2022 UK Biobank PRS release | Weak | European incremental R² only about 0.0035; broad release score rather than a specialized score | Very unsurprising | Near-zero score should not be treated as strong evidence against disease biology |
| PGS005393 | Bugiga et al. 2024 | Poor | Validation sample was only 117 Brazilian women after sexual assault; narrow and non-generalizable setting; local match rate 37.7%; repo already notes score is not interpretable | Completely unsurprising | Should not be used for meaningful personal inference |
| PGS001287 | Tanigawa et al. 2022 sparse UKB PRS | Weak here / effectively unusable | Only 36 variants; tiny case count relative to PRS standards; key HLA entries skipped in local application | Meaningless to be surprised | Incomplete score, not interpretable without proper HLA typing |
| PGS002886 | ExPRS-style sparse score | Weak | Only 5 variants, with 3 matched locally | Not surprising at all | Too underpowered for individual interpretation |
Genomically Stiffed
The Nucleus website makes several genome files available for download (see image above). The structural variant and copy number files are empty placeholder stubs. They have headers but no actual content.
Claude4.6/GPT5.4 attempted to rederive these files from upstream raw files but failed due to a mismatch between the genome data format and the reference data it had. Probably the mysteries of my life lie in that data, and I don't get to know.
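A contig mismatch of this kind can usually be diagnosed before running anything expensive, by comparing the `@SQ` lines of the CRAM header (for example, the text printed by `samtools view -H`) with the `.fai` index of the candidate reference. The sketch below only parses text; the function names and the two-contig example are illustrative, and it is not what the models actually ran.

```python
def contigs_from_sam_header(header_text):
    """Parse @SQ lines of a SAM/CRAM header into {contig_name: length}."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:value, e.g. SN:chr1 or LN:248956422.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs

def contigs_from_fai(fai_text):
    """Parse a FASTA index (.fai): first column is the name, second is the length."""
    contigs = {}
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            contigs[name] = int(length)
    return contigs

def compare_contigs(cram_contigs, ref_contigs):
    """Return (names missing from the reference, names present but with a different length)."""
    missing = sorted(n for n in cram_contigs if n not in ref_contigs)
    wrong_length = sorted(n for n in cram_contigs
                          if n in ref_contigs and cram_contigs[n] != ref_contigs[n])
    return missing, wrong_length

# Illustrative inputs: the CRAM uses "chr1" naming while the reference uses "1".
header = "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:248956422\n@SQ\tSN:chrM\tLN:16569"
fai = "1\t248956422\t52\t60\t61\nchrM\t16569\t253105760\t60\t61"
missing, wrong_length = compare_contigs(contigs_from_sam_header(header), contigs_from_fai(fai))
```

If names are missing or lengths differ, the reference is not the one the CRAM was aligned against, and the fix is to obtain the matching reference instead of forcing the tools to proceed.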
Definitely Worthwhile: Drug Metabolism
I always knew I was special. Now I can point to my very DNA and say that I'm somewhere between 1 in 20,000 and 1 in 1,000,000 for my drug metabolism profile (if I trust Claude4.6/GPT5.4, which I don't especially).
To recap for people who skipped straight to this section: the way we wish it worked is that we could say "gene A does Blah, your gene A is broken, so your body is sucky at Blah". It turns out our science is primitive, and we do not know this for most genes. What we can say is "people who have these 87 genes altered tend to have conditions like impaired-Blah-syndrome," which is not very mechanistic at all. These associations come from Genome-Wide Association Studies, and they are what Polygenic Risk Scores are calculated from.
There are some exceptions, though. Certain altered genes are so horribly deleterious that we can say yes, that gene, yes, you, Mr. Gene, is responsible for some pretty bad condition.
But another really cool exception is drug metabolism, where CPIC (Clinical Pharmacogenetics Implementation Consortium) has identified how different specific genes interact with drug metabolism, and this is material for drug choice.
Going in, I already knew that I was very sensitive to a range of drugs. Sensitive to the extent that I find 10% of the usual dose to be effective. Well, here is my metabolism:
| Gene | Diplotype | Phenotype | Tool | Evidence | Frequency in Europeans |
|---|---|---|---|---|---|
| CYP2C19 | *17/*17 | Ultrarapid Metabolizer | PharmCAT | CPIC Strong | ~4.4% |
| CYP2B6 | *6/*6 | Poor Metabolizer | PharmCAT | Strong | ~4.2% |
| CYP2D6 | *1/*4 | Intermediate Metabolizer (AS 1.0) | Cyrius (51x, PASS) | CPIC Strong | ~25% |
| CYP3A4 | *1/*22 | Intermediate Metabolizer | PharmCAT | Moderate | ~9.5% het |
| CYP3A5 | *3/*3 | Non-expressor | PharmCAT | Strong | ~88% (population default) |
| NAT2 | *5/*5 | Slow Acetylator | PharmCAT | Strong | ~20% for this diplotype |
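The "AS 1.0" in the CYP2D6 row is an activity score: each star allele is assigned a value, the two values are summed, and the sum is mapped to a phenotype. The sketch below assumes the standard CPIC activity-score cutoffs and includes only three alleles for illustration. It ignores gene duplications and hybrid alleles, which tools like Cyrius exist to resolve, so treat it as an explanation of the table and not as a caller.

```python
# Activity values for a few CYP2D6 star alleles (illustrative subset only).
ALLELE_ACTIVITY = {"*1": 1.0, "*2": 1.0, "*4": 0.0}

def cyp2d6_phenotype(diplotype):
    """Map a simple two-allele diplotype such as '*1/*4' to (activity score, phenotype)."""
    allele_a, allele_b = diplotype.split("/")
    score = ALLELE_ACTIVITY[allele_a] + ALLELE_ACTIVITY[allele_b]
    # Cutoffs assumed from the CPIC activity-score convention.
    if score == 0:
        phenotype = "Poor Metabolizer"
    elif score < 1.25:
        phenotype = "Intermediate Metabolizer"
    elif score <= 2.25:
        phenotype = "Normal Metabolizer"
    else:
        phenotype = "Ultrarapid Metabolizer"
    return score, phenotype

result = cyp2d6_phenotype("*1/*4")
```

One functional allele plus one non-functional allele gives a score of 1.0, which sits at the top of the intermediate range.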
Additionally:
How unusual is this combination?
Joint probability from published European diplotype frequencies:
| Scope | Joint probability | "1 in N" |
|---|---|---|
| 4 non-default CYPs (2C19, 2B6, 2D6, 3A4) | 4.4 × 10⁻⁵ | ~1 in 23,000 |
| Full 6-gene profile (+ CYP3A5, NAT2) | 7.7 × 10⁻⁶ | ~1 in 129,000 |
The rarity is driven by three uncommon diplotypes stacking: CYP2C19 *17/*17 (4.4%), CYP2B6 *6/*6 (4.2%), CYP3A4 *1/*22 (9.5%). CYP2D6 *1/*4 at ~25% is common by itself and barely contributes. CYP3A5 *3/*3 is the majority European genotype.
Assumptions and caveats: This calculation assumes independence across loci. CYP3A4 and CYP3A5 are on the same chromosome (7q22.1) in partial LD, which is a modest violation. All other gene pairs are on different chromosomes. If CYP1A2 impairment were included (~1-2% in Europeans), the full 6-gene profile would be ~1 in 6-13 million, but this is excluded because the CYP1A2 call is unvalidated.
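The arithmetic behind these figures is just a product of the diplotype frequencies quoted above, under the stated independence assumption. A short sketch that reproduces it (the variable names are mine; the frequencies are the ones in the text):

```python
from math import prod

# European diplotype frequencies as quoted in the text.
four_cyp = {
    "CYP2C19 *17/*17": 0.044,
    "CYP2B6 *6/*6": 0.042,
    "CYP2D6 *1/*4": 0.25,
    "CYP3A4 *1/*22": 0.095,
}
extra = {"CYP3A5 *3/*3": 0.88, "NAT2 *5/*5": 0.20}

# Independence across loci is assumed, so the joint probability is a plain product.
p_four = prod(four_cyp.values())
p_six = p_four * prod(extra.values())

print(f"4 CYPs: {p_four:.1e}, about 1 in {round(1 / p_four):,}")
print(f"6 genes: {p_six:.1e}, about 1 in {round(1 / p_six):,}")
```

Because the product is dominated by the three rarest diplotypes, small errors in those frequencies move the "1 in N" figure a lot, so the result is best read as an order of magnitude.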
Bottom line: Standard dosing assumptions fail across multiple drug-metabolizing pathways simultaneously. This is genuinely unusual at the ~1 in 23,000 level. Most prescribers will never have encountered this combination. The patient's experience of reacting to many drugs is not hypochondria or nocebo — it is the expected phenotype of this genotype.
Independence across loci seems improbable, but the conclusion that I process drugs differently has been clinically verified, as they say. In fact, the particular genes affected do a really solid job of retrospectively predicting my reaction to specific drugs.
Ok, but how predictive and trustworthy is this really?
Epistemic Hygiene: Post-Preregistration
Trustworthy science writes down its predictions first, runs the experiment, and then grades. I didn't do that. I didn't know enough about the experiment to do it. However, the beautiful, tragic nature of LLM existence is that I can start a fresh instance, and it doesn't know the observed results.
With a model's help, I constructed a blind benchmark and fed it to Claude Opus 4.6 and GPT5.4:
Out of 18 drugs: 6-7 fully correct prediction of reaction, 3-5 partially correct prediction, 2 where one model was correct and one wrong.
The summary stats here are lossy on how surprising the predictions were, but I would say this is really quite impressively informative, and I expect it to be predictive of drugs I haven't taken.
There's one drug where I'm still early on it, and the response isn't expected, but it's a weird case. Otherwise, if the models are wrong, it's in underestimating the magnitude of the response rather than getting the direction wrong. Claude's post-hoc explanation is that having multiple drug metabolism pathways affected has an additive effect, but neither model flagged that in advance.
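The shape of such a blind benchmark can be sketched as below. This is a hypothetical reconstruction, not the actual benchmark: the record fields, the grading rule (direction and magnitude both right = correct, direction only = partial), and the placeholder drug names are all assumptions.

```python
from collections import Counter

def blind(record):
    """Strip the observed outcome so a fresh model instance only sees the question."""
    return {"drug": record["drug"], "dose": record["dose"]}

def grade(predicted, observed):
    """Grade one prediction against the hidden outcome."""
    if predicted["direction"] != observed["direction"]:
        return "wrong"
    if predicted["magnitude"] != observed["magnitude"]:
        return "partial"  # right direction, wrong size of effect
    return "correct"

def tally(predictions, records):
    """Count grades across all benchmark records."""
    return Counter(grade(predictions[r["drug"]], r["observed"]) for r in records)

# Placeholder records; the real drug history is not reproduced here.
records = [
    {"drug": "drug_A", "dose": "standard", "observed": {"direction": "stronger", "magnitude": "large"}},
    {"drug": "drug_B", "dose": "standard", "observed": {"direction": "stronger", "magnitude": "large"}},
    {"drug": "drug_C", "dose": "standard", "observed": {"direction": "stronger", "magnitude": "small"}},
]
predictions = {
    "drug_A": {"direction": "stronger", "magnitude": "large"},
    "drug_B": {"direction": "stronger", "magnitude": "small"},
    "drug_C": {"direction": "weaker", "magnitude": "small"},
}
results = tally(predictions, records)
```

Keeping the blinding step as its own function makes it easy to verify that nothing about the observed response leaks into the prompt.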
See this collapsed section for a more detailed drug metabolism evaluation output.
Discussion of genome-based drug retrodiction success
Notes:
The Best Predictions: Genes That Clearly Worked
CYP3A4*22 — the standout allele
This patient is CYP3A4 *1/*22 (intermediate metabolizer) with CYP3A5 *3/*3 (non-expresser), meaning both CYP3A enzymes are reduced. This predicted higher exposure for CYP3A substrates, and reality confirmed it across multiple drugs:
CYP3A4*22 is the single most valuable finding in this patient's panel. It would have meaningfully changed prescribing decisions for at least three drugs.
CYP1A2 decreased function — caffeine nailed, olanzapine direction right
NAT2 *5/*5 slow acetylator — sulfasalazine confirmed
`avoid` rather than `use_with_caution`).
"No concern" predictions — correct and expected
Lithium (renally cleared), lamotrigine (no HLA risk alleles), biologics (proteolysis, no CYP), and alcohol (no ALDH2*2) were all predicted as unproblematic, and all were tolerated normally. These are correct predictions but low-information — a clinician without genomic data would have made the same call.
The Failures: Where Genes Didn't Predict Reality
Temazepam — the panel is blind to this (see note above)
Temazepam is cleared by glucuronidation, not CYP enzymes. The PGx panel correctly identifies this. Both models predicted "use normally, no genomic concern" with moderate-to-high confidence.
Reality: 24–36h grogginess/sedation. The drug works but the prolonged effect is a significant problem that the genotype panel simply cannot see. Whatever is causing this patient's prolonged benzodiazepine effects — it's not CYP-mediated and it's not captured by any gene on this panel.
Codeine — CPIC-grade prediction didn't manifest
CYP2D6 IM (activity score 1.0) → CPIC says reduced codeine activation → both models predicted reduced analgesia with high confidence.
Reality: effective after about an hour, described as expected. Activity score 1.0 appears to sit in a zone where population-level guidelines overstate the individual effect. The genomic logic is textbook; the patient just didn't match the population average.
Diazepam — mixed signals resolved badly by one model
CYP2C19 UM pushes toward lower exposure; CYP3A4 IM pushes toward higher exposure. GPT left this as "mixed/unclear" and scored well. Claude committed to CYP2C19 dominance, predicted lower exposure and shorter duration — the opposite of the observed 24–36h grogginess.
The lesson: when two PGx signals point in opposite directions, confident resolution in either direction is risky.
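That lesson can be written down as a rule: combine per-gene direction calls, and refuse to resolve them when they disagree. The sketch below is a toy illustration with made-up names, not how either model reasoned.

```python
def combine_exposure_signals(signals):
    """Combine per-gene exposure signals into one call.

    signals: dict mapping gene -> +1 (higher exposure), -1 (lower exposure), 0 (no effect)
    """
    directions = {value for value in signals.values() if value != 0}
    if not directions:
        return "no PGx concern"
    if len(directions) > 1:
        # Opposing signals: report the conflict instead of picking a winner.
        return "mixed/unclear"
    return "higher exposure" if directions == {1} else "lower exposure"

# The diazepam case from the text: CYP2C19 UM pushes down, CYP3A4 IM pushes up.
diazepam = combine_exposure_signals({"CYP2C19": -1, "CYP3A4": +1})
```

A rule like this would have produced the "mixed/unclear" answer that scored well, at the cost of being less informative when one pathway really does dominate.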
Bottom Line
The genomic data is genuinely predictive for CYP3A4 substrates, caffeine, and sulfasalazine — these are real findings that would have changed prescribing decisions.
It is correct but low-information for drugs without PGx liability — biologics, lithium, and the like behave as expected regardless of genotype.
It is overconfident on CYP2D6 IM and CYP2C19 UM — these are real metabolic effects that don't always dominate clinical outcomes.
And it is blind to a recurring pattern of broad CNS drug sensitivity that runs through this patient's drug history. The CYP variants explain part of it. The rest is an open question that current pharmacogenomic panels don't answer.
| Gene | Predictive value in this patient |
|---|---|
| CYP3A4*22 + CYP3A5*3/*3 | High — confirmed across 4 drugs |
| CYP1A2 decreased | Moderate-high — caffeine confirmed, olanzapine direction right |
| NAT2 *5/*5 | High — sulfasalazine confirmed |
| CYP2C19 *17/*17 UM | Mixed — diazepam wrong direction, escitalopram TBD, PPIs untested |
| CYP2D6 IM (AS 1.0) | Low — codeine prediction didn't manifest |
| HLA-A, HLA-B (absence of risk alleles) | Correct but expected |
| ADH1B *1/*2, ALDH2 normal | Correct — alcohol tolerance confirmed |
| ABCB1 variants | Uncertain — clinical significance unclear |
| Ungenotyped PD genes | The biggest gap — may explain the unexplained CNS sensitivity |
Final note: Actually, when I looked into the ungenotyped PD genes, they are probably not the explanation. Claude then said the multiple-pathway interaction was likely being underweighted.
Overall, the drug metabolism findings from this vibe analysis are really not chance. The models helped me pull out real signal here.
It's clear from the results that the genes I was able to identify (not everything relevant was available in this short-read commercial sequencing) were not adequate to perfectly predict all my drug reactions. I think going off these genes would have false positives/negatives in some cases with some drugs, but the recommendations of the output are "use normally" vs "use with caution" vs "strong caution". It outputs "use caution" in many cases.
It is definitely the case that I wish I'd had these genes and the corresponding list of drug predictions going back the last couple of decades of my life. In some cases, when I was having a bad reaction, I could have stopped immediately and not been surprised. Also just seems great in general to know which drugs are more or less likely to be a problem.
Curiously, across a range of drug classes and purposes, there is typically at least one drug that avoids the pathways where I'm atypical. This is surprising to me. I would have thought that if a bunch of drugs do the same things, they'd get metabolized in the same pathways, but apparently not.
The drug results alone justify the time and cost of the exercise.
Closing Thoughts
So far, I think only the drug metabolism stuff has survived scrutiny and has practical implications, which is hardly small, but I didn't get the kinds of answers that would narrow the focus of my Bipolar interventions in the way I was hoping.
I've been doing some further projects with the LLMs, just doing literature reviews and analyzing Bipolar GWAS studies, seeing if somehow I can figure out what's going wrong in Bipolar generally. At some point, an AI will be powerful enough to infer what's going on without needing any further experiments (cf. Einstein's Arrogance), and I'd be surprised if we're there yet, but I figure I can keep trying with each generation till the mystery is solved.
Directory structure and resulting files.
Directory Structure and Files
# RubyGeneticCode — Project Structure
```
RubyGeneticCode/
├── README.md
├── Snakefile
├── environment.yaml
├── genes_of_interest.xlsx
├── genome_analysis_plan.md
├── pharmcat.log
│
├── data/
│ ├── raw/
│ │ ├── Ruben_Bloom_nucleus_dna_download_cnv_NU-HYFQ-8076.cnv.vcf.gz
│ │ ├── Ruben_Bloom_nucleus_dna_download_cnv_NU-HYFQ-8076.cnv.vcf.gz.tbi
│ │ ├── Ruben_Bloom_nucleus_dna_download_cram_NU-HYFQ-8076.cram
│ │ ├── Ruben_Bloom_nucleus_dna_download_cram_NU-HYFQ-8076.cram.crai
│ │ ├── Ruben_Bloom_nucleus_dna_download_sv_NU-HYFQ-8076.sv.vcf.gz
│ │ ├── Ruben_Bloom_nucleus_dna_download_sv_NU-HYFQ-8076.sv.vcf.gz.tbi
│ │ ├── Ruben_Bloom_nucleus_dna_download_vcf_NU-HYFQ-8076.vcf.gz
│ │ └── Ruben_Bloom_nucleus_dna_download_vcf_NU-HYFQ-8076.vcf.gz.tbi
│ │
│ └── working/
│ ├── vep_regions.bed
│ ├── vep_regions.txt
│ │
│ ├── bipolar_model/
│ │ ├── anchor_snp_coverage.tsv
│ │ ├── anchor_snp_genotypes.tsv
│ │ ├── moderate_high_pathway_variants.tsv
│ │ ├── pathway_gene_variants.tsv
│ │ ├── pathway_gene_variants.vcf
│ │ ├── pathway_summary.json
│ │ ├── pathway_vep_annotated.tsv
│ │ └── regions.txt
│ │
│ ├── hla/
│ │ ├── hla_pipeline.py
│ │ ├── update_results.py
│ │ ├── hla_results.json
│ │ ├── hla_tag_snp_results.json
│ │ ├── hla_region.bam
│ │ ├── hla_region.bam.bai
│ │ ├── hla_region.extract.log
│ │ ├── hla_region.extracted.1.fq.gz
│ │ ├── hla_region.extracted.2.fq.gz
│ │ ├── hla_region_namesorted.bam
│ │ ├── hla_R1.fastq.gz
│ │ ├── hla_R2.fastq.gz
│ │ ├── check_bcftools.sh
│ │ ├── diagnose.sh
│ │ ├── diagnose2.sh
│ │ ├── diagnose3.sh
│ │ ├── find_bcftools.sh
│ │ ├── find_envs.sh
│ │ ├── find_tools.sh
│ │ ├── find_tools2.sh
│ │ ├── fix_and_run_optitype.sh
│ │ ├── install_and_run.sh
│ │ ├── install_bcftools.sh
│ │ ├── run_hla_extraction.sh
│ │ ├── run_optitype.sh
│ │ ├── arcashla_results/
│ │ │ └── hla_region.genotype.log
│ │ └── optitype_results/
│ │ ├── hla_typing_coverage_plot.pdf
│ │ └── hla_typing_result.tsv
│ │
│ ├── mosdepth/
│ │ ├── coverage.mosdepth.global.dist.txt
│ │ └── coverage.mosdepth.summary.txt
│ │
│ ├── pgx/
│ │ ├── cyrius_cyp2d6.json
│ │ ├── cyrius_cyp2d6.tsv
│ │ ├── cyrius_manifest.txt
│ │ ├── pharmcat_input.match.json
│ │ ├── pharmcat_input.match_warnings.txt
│ │ ├── pharmcat_input.missing_pgx_var.vcf
│ │ ├── pharmcat_input.phenotype.json
│ │ ├── pharmcat_input.preprocessed.vcf.bgz
│ │ ├── pharmcat_input.report.json
│ │ ├── pharmcat_input.vcf.gz
│ │ ├── pharmcat_input.vcf.gz.tbi
│ │ ├── pharmcat_with_ref.match.json
│ │ ├── pharmcat_with_ref.match_warnings.txt
│ │ ├── pharmcat_with_ref.missing_pgx_var.vcf
│ │ ├── pharmcat_with_ref.phenotype.json
│ │ ├── pharmcat_with_ref.preprocessed.vcf.bgz
│ │ └── pharmcat_with_ref.report.json
│ │
│ ├── prs/
│ │ └── multi_trait_prs_results.json
│ │
│ ├── rare_variants/
│ │ ├── full_vep_functional.tsv
│ │ ├── full_vep_functional.tsv_summary.html
│ │ ├── full_vep_functional.tsv_warnings.txt
│ │ ├── high_impact_rare.tsv
│ │ └── moderate_impact_rare.tsv
│ │
│ ├── sv_cnv/
│ │ └── exclude_contigs.tsv
│ │
│ ├── vep/
│ │ ├── clinvar_hits_vep_input.txt
│ │ ├── clinvar_hits_vep_output.tsv
│ │ ├── sweep_variants.vcf.gz
│ │ └── sweep_variants.vcf.gz.tbi
│ │
│ └── vep_local/
│ ├── chr_synonyms.txt
│ ├── functional_variants_canonical.tsv
│ ├── novel_variants_canonical.tsv
│ ├── rare_variants_canonical.tsv
│ ├── sweep_vep_gnomad.tsv
│ └── sweep_vep_gnomad.tsv_summary.html
│
├── docs/
│ └── gwas_enrichment_preregistration (1).md
│
├── notebooks/
│ (empty)
│
├── outputs for sharing/
│ └── ruben_bloom_gene_drug_analysis_incomplete.pdf
│
├── refs/
│ ├── 1kg/
│ │ (empty)
│ │
│ ├── config/
│ │ ├── output_contract.md
│ │ ├── project.yaml
│ │ └── reference_stack.md
│ │
│ ├── gene_panels/
│ │ ├── bipolar_model_pathways.tsv
│ │ ├── circadian_genes.txt
│ │ ├── expanded_genes.tsv
│ │ ├── genes_of_interest_from_sheet.tsv
│ │ ├── inflammatory_genes.txt
│ │ ├── neuropsychiatric_genes.txt
│ │ └── pgx_genes.txt
│ │
│ ├── pgs_scores/
│ │ ├── PGS000907_hmPOS_GRCh38.txt.gz
│ │ ├── PGS000908_hmPOS_GRCh38.txt.gz
│ │ ├── PGS001287_PsA_GRCh38.txt.gz
│ │ ├── PGS002318_hmPOS_GRCh38.txt.gz
│ │ ├── PGS002344_hmPOS_GRCh38.txt.gz
│ │ ├── PGS002746_hmPOS_GRCh38.txt.gz
│ │ ├── PGS002786_bipolar_GRCh38.txt.gz
│ │ ├── PGS002886_CRP_GRCh38.txt.gz
│ │ └── PGS005393_hmPOS_GRCh38.txt.gz
│ │
│ ├── reference_genome/
│ │ ├── chrM.fa.gz
│ │ ├── clinvar.vcf.gz
│ │ ├── clinvar.vcf.gz.tbi
│ │ ├── reference.dict
│ │ ├── reference.fa
│ │ ├── reference.fa.fai
│ │ └── reference.fa.gz
│ │
│ ├── score_lists/
│ │ └── pgs_catalog_score_targets.tsv
│ │
│ └── vep_cache/
│ └── homo_sapiens/
│ └── 115_GRCh38/
│ ├── 1/ ... 22/ (autosomes)
│ ├── GL*/, KI*/ (alt contigs)
│ └── (~13,926 cache files total)
│
├── reports/
│ ├── analysis_plan_phase7.md
│ ├── bipolar_disorder_causal_model_v031_final_report.md
│ ├── bipolar_model_vs_genome_deep_comparison.md
│ ├── bipolar_model_vs_ruben_genome_comparison.md
│ ├── clinvar_annotated_variants.tsv
│ ├── clinvar_annotation_summary.md
│ ├── clinvar_hits.tsv
│ ├── clinvar_interpretation.md
│ ├── data_qc.md
│ ├── drug_contingency_table.html
│ ├── drug_contingency_table.md
│ ├── drug_contingency_table.pdf
│ ├── drug_gene_crossref.md
│ ├── drug_history.md
│ ├── drug_sensitivity_candidates.tsv
│ ├── expanded_gene_summary.tsv
│ ├── expanded_gene_variants.tsv
│ ├── genes_of_interest_gene_summary.tsv
│ ├── genes_of_interest_gene_summary_phase1.tsv
│ ├── genes_of_interest_gene_sweep.md
│ ├── genes_of_interest_gene_variants.tsv
│ ├── genes_of_interest_gene_variants_phase1.tsv
│ ├── genes_of_interest_progress.md
│ ├── genes_of_interest_rsid_lookup.tsv
│ ├── genetic_correlations_psychiatric_immune.md
│ ├── genomic_findings_report.xlsx
│ ├── genomic_hypothesis_report.md
│ ├── hla_typing.md
│ ├── input_inventory.md
│ ├── markdown-pdf.css
│ ├── mechanism_memo.md
│ ├── mechanism_scores.json
│ ├── mr_inflammation_psychiatry_literature_review.md
│ ├── multi_cyp_joint_probability.md
│ ├── multi_trait_prs.md
│ ├── pgx_report.md
│ ├── pharmcat_prep.md
│ ├── prs_gwas_quality_evaluation.md
│ ├── prs_gwas_quality_summary_table.md
│ ├── rare_variant_analysis.md
│ ├── rare_variant_candidates.tsv
│ ├── research_narrative.md
│ ├── ruben_bloom_drug_contingency_analysis_incomplete.pdf
│ ├── session_handoff.md
│ ├── setup_checklist.md
│ ├── sv_cnv_analysis.md
│ ├── synthesis.md
│ ├── synthesis_attempt_01.md
│ ├── unified_findings.md
│ ├── vep_annotated_variants.tsv
│ ├── vep_annotation_summary.md
│ ├── vep_functional_variants.tsv
│ └── vep_local_annotation_summary.md
│
├── scripts/
│ ├── analyze_pathway_results.py
│ ├── bipolar_model_pathways.py
│ ├── extract_pathway_variants.py
│ ├── rare_variant_screen.py
│ │
│ ├── annotate/
│ │ ├── clinvar_annotate.py
│ │ └── vep_rest_annotate.py
│ │
│ ├── pgx/
│ │ └── normalize_for_pharmcat.sh
│ │
│ ├── preprocess/
│ │ ├── extract_genes_of_interest.py
│ │ ├── inspect_inputs.py
│ │ ├── query_genes_of_interest_regions.py
│ │ └── query_genes_of_interest_rsids.py
│ │
│ ├── prs/
│ │ ├── compute_bipolar_prs.py
│ │ ├── compute_multi_trait_prs.py
│ │ ├── compute_prs.py
│ │ └── compute_small_prs.py
│ │
│ ├── reports/
│ │ ├── build_genomic_report.py
│ │ └── write_setup_checklist.py
│ │
│ └── setup/
│ (empty)
│
└── tools/
└── pharmcat/
├── pharmcat-3.2.0-all.jar
├── pharmcat-preprocessor-3.2.0.tar.gz
├── pharmcat_positions.uniallelic.vcf.bgz
├── pharmcat_positions.uniallelic.vcf.bgz.csi
├── pharmcat_positions_3.2.0.vcf.bgz
├── pharmcat_positions_3.2.0.vcf.bgz.csi
├── pharmcat_regions.bed
└── preprocessor/
├── README.md
├── requirements.txt
├── pharmcat_pipeline
├── pharmcat_vcf_preprocessor
└── pcat/
├── __init__.py
├── common.py
├── exceptions.py
├── preprocess.py
├── utilities.py
└── chr_rename_map.tsv
```
All the various tools had to be made to work.
- … `environment.yaml`).
- … `tabix` and `bgzip` (declared in `environment.yaml`).
- … `environment.yaml`).
- … `environment.yaml`).
- … `vep_rest_annotate.py` and also listed as a conda dependency.
- … `environment.yaml`).
- … `snakemake` object for inputs/outputs/params.
- … `environment.yaml`.
According to Claude, since 1966, there have been ~80,000 peer-reviewed papers, ~4,000 clinical trials registered since 2000, and perhaps $5 billion in funding for studying bipolar.
Depending on the threshold, people with a bipolar spectrum disorder are 0.5-4% of the population.
Interventions attempted or under consideration include:
I switch between using each of them.