Huh, those brain stimulation methods might actually be practical to use now, thanks for mentioning them!
Regarding skepticism of survey data: if you're imagining it's only an end-of-retreat survey which asks "did you experience the jhana?", then yeah, I'd be skeptical too. But my understanding is that everyone has several meetings w/ instructors, where a not-true-jhana/social-lie wouldn't hold up against scrutiny.
I can ask during my online retreat w/ them in a couple months.
Suppose we were able to gather large amounts of brain scans, let's say from millions of wearable helmets that also capture video and audio.[1] What could we do with that? I'm assuming a pre-training stage similar to LLMs', where models are trained to predict next brain-states (possibly also video and audio), and can then be finetuned or prompted for specific purposes.
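For concreteness, here's a minimal sketch of what I mean by that pre-training stage (my own illustration with made-up shapes and a placeholder architecture, not any existing system):

```python
# Minimal sketch, assuming each scan time-slice is flattened into a vector
# of size d_state; shapes and architecture are placeholders.
import torch
import torch.nn as nn

class BrainStatePredictor(nn.Module):
    def __init__(self, d_state: int = 1024, d_model: int = 512, n_layers: int = 6):
        super().__init__()
        self.embed = nn.Linear(d_state, d_model)   # project a scan slice into model space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, d_state)    # predict the next scan slice

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, time, d_state); causal mask so step t only sees steps <= t
        x = self.embed(states)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.encoder(x, mask=mask))

# Pre-training objective: predict the next brain-state (here just MSE).
# The same backbone could then be finetuned on downstream labels
# (e.g. "is this person in jhana right now?").
model = BrainStatePredictor()
scans = torch.randn(4, 128, 1024)    # dummy batch of scan sequences
pred = model(scans[:, :-1])          # predict slice t+1 from slices <= t
loss = nn.functional.mse_loss(pred, scans[:, 1:])
```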
Jhana is a non-addictive, high-pleasure state. If we can scan people entering it, we might drastically reduce the time it takes to learn to enter it. I predict (maybe just hope) that being able to enter this state easily will lead to an increase in psychological well-being and a reduction in power-seeking behavior.
Jhana is great but only temporary. It is, however, a solid foundation from which to investigate the causes of suffering and remove them.
I currently believe the cause of suffering is an unnecessary mental motion, which should show up in high-fidelity brain scans. Again, eliminating suffering would go a long way toward making decision-makers more selfless (they can still make poor decisions, just likely not out of selfish intent, unless it's habitual maybe? idk)
Additionally, this tech would enable seeing the brain state of enlightened vs. non-enlightened people, which might look like Steven Byrnes's model here. Enlightenment might not be a desirable state to be in, and understanding its structure could give insight.
If we can reliably detect lying, this could be used for better coordination between labs & governments, as well as providing trust in employees and government staff (that they won't e.g. steal algorithmic/state secrets).
It could also be used to maintain power in authoritarian regimes (e.g. repeatedly testing the loyalty of those holding power under you), and I agree w/ this comment that it could destroy relationships that would've otherwise been fine.
Additionally, this is useful for...
Look! PTSD is just your brain repeating this same pattern in response to this trigger. On the CBT side, negative thought spirals could be detected and brought up immediately.
Beyond helping w/ existing brands of therapy, I expect much more therapeutic progress could be made in a short amount of time, which would also be useful for coordination between labs/governments.
You could also produce extremely stimulating images/audio that most saliently capture attention and drive continued engagement.
I'm unsure if there exist mind-state-independent jailbreaks (e.g. a set of pixels that hacks a person's brain); however, I could see different mind states (e.g. just waking up, sleep/food/water deprivation) that, when combined w/ images/sounds, could lead to crazy behavior.
If you could simulate someone's brain state given different inputs, you could try to figure out the arguments (or the surrounding setting, or their current mood) required to get them to do what you want. The use for manipulating others is clear, but it could also be used for better coordination.
A more ethical way to use a simulation for persuasion is to not deceive or leave out important truths. You could even run many simulations to figure out which factors they actually find important.
However, simulating human brains for optimization is also an ethical concern in general due to simulated suffering, especially if you do figure out which specific mental motion suffering is.
Part of Steven Byrnes's agenda. This should be pretty easy to reverse-engineer given this tech (epistemic note: I know nothing in this field), since you could find what the initial reward components are and how specific images/audio/thoughts led to our values.
With the good of learning human values comes the bad of figuring out how humans are so sample efficient!
Simulating a full human brain is already an upload; however, I don't expect there to be optimized hardware ready for this. Additionally, running a human brain would be more complicated than just running normal NNs augmented w/ algorithmic insights from our brains. So this is still possible, but would come with a large alignment tax.
[I believe I first heard this argument from Steve Byrnes]
Beyond trying to understand how the brain works (which is covered already), this would likely help with studying diseases and analyzing the brain's reaction to drugs.
...
These are clearly dual-use; however, I do believe some of them are potentially pivotal acts (e.g. uploads, solving value alignment, sufficient coordination involving human actors), but they would definitely require a lot of trust in whoever pursues this technology (which could be existing power-seekers in power, a potentially unaligned TAI, etc).
Hopefully things like jhana (if the benefits pan out) don't require a high-fidelity helmet. The current SOTA is (AFAIK) 50% of people learning how to get into jhana given 1 week of a silent retreat, which is quite good for pedagogy alone! But maybe not sufficient to get e.g. all relevant decision-makers on board. Additionally, I think this percentage is only for getting into it once, as opposed to being able to enter that state at will.
Why would many people give up their privacy in this way? Well, when the technology actually becomes feasible, I predict most people will be out of jobs, so this would be equivalent to a new job.
it seems unlikely to me that so many talented people went astray
Well, maybe we did go astray, but it's not for any reasons mentioned in this paper!
SAEs have been trained on random weights since Anthropic's first SAE paper in 2023:
To assess the effect of dataset correlations on the interpretability of feature activations, we run dictionary learning on a version of our one-layer model with random weights. The resulting features are here, and contain many single-token features (such as "span", "file", ".", and "nature") and some other features firing on seemingly arbitrary subsets of different broadly recognizable contexts (such as LaTeX or code). However, we are unable to construct interpretations for the non-single-token features that make much sense and invite the reader to examine feature visualizations from the model with randomized weights to confirm this for themselves. We conclude that the learning process for the model creates a richer structure in its activations than the distribution of tokens in the dataset alone.
In my first SAE feature post, I show a clearly positional feature:
which is not a feature you'll find in an SAE trained on a randomly initialized transformer.
The reason the auto-interp metric is similar is likely that SAEs on random weights still have single-token features (i.e. features that activate on a single token). Single-token features are the easiest features to auto-interp, since the hypothesis is "activates on this token", which is easy for an LLM to predict.
When you look at the sampled features for the random-weights model in their appendix, all three are single-token features.
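To make the "single-token feature" notion concrete, here's a rough sketch of how one might flag such features from cached SAE activations (my own illustration; the purity threshold and data format are assumptions, not from either paper):

```python
# Sketch: call a feature "single-token" if one token accounts for nearly all
# of its firings above threshold. These are trivially easy to auto-interp,
# since "activates on token X" is an easy hypothesis for an LLM to verify.
from collections import Counter

def is_single_token_feature(firing_tokens: list[str], purity: float = 0.9) -> bool:
    """`firing_tokens`: the token at each position where the feature fired above threshold."""
    if not firing_tokens:
        return False
    _, top_count = Counter(firing_tokens).most_common(1)[0]
    return top_count / len(firing_tokens) >= purity

# e.g. a feature that fires almost exclusively on "span" would be flagged:
print(is_single_token_feature(["span"] * 47 + ["file"] * 2))        # True
print(is_single_token_feature(["the", "cat", "sat", "on", "mat"]))  # False
```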
However, I do want to clarify that their paper is still novel (they ran random-weights and control experiments over many layers in Pythia 410M) and includes many other experiments: it's a valid contribution to the field, imo.
Also, to clarify: SAEs aren't perfect, and there's a recent paper on their problems (which I don't think captures all of them). I'm really glad Apollo has diversified away from SAEs by pursuing their weight-based interp approach (which I think is currently underrated karma-wise by 3x).
I didn't either, but on reflection it is!
I did change the post based off your comment, so thanks!
I think the fuller context,
Anthropic has put WAY more effort into safety, way way more effort into making sure there are really high standards for safety and that there isn't going to be danger what these AIs are doing
implies it's just that the amount of effort is larger than at other companies (which I agree with), not that the Youtuber believes they've solved alignment or are doing enough; see:
but he's also a realist and is like "AI is going to really potentially fuck up our world"
and
But he's very realistic. There is a lot of bad shit that is going to happen with AI. I'm not denying that at all.
So I'm not confident that it's "giving people a false impression of how good we are doing on actually making things safe" in this case.
I do know DougDoug has recommended Anthropic's Alignment Faking paper to another youtuber, which is more of a "stating a problem" paper than saying they've solved it.
Did you listen to the song first?
Thinking through it more, Sox2-17 (they changed 17 amino acids of Sox2) was your linked paper's result, and Retro's was a modified version of the factors Sox AND KLF. It would be cool if these two results are complementary.
You're right! Thanks
For mice, up to 77%:
Sox2-17 enhanced episomal OKS MEF reprogramming by a striking 150 times, giving rise to high-quality miPSCs that could generate all-iPSC mice with up to 77% efficiency
For human cells, up to ~9% (if I'm understanding this part correctly):
SOX2-17 gave rise to 56 times more TRA1-60+ colonies compared with WT-SOX2: 8.9% versus 0.16% overall reprogramming efficiency.
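As a quick sanity check on my reading of those numbers (my own arithmetic, not from the paper):

$$\frac{8.9\%}{0.16\%} \approx 56$$

so the "56 times more" figure and the 8.9% overall efficiency are consistent.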
So it seems like results can be wildly different depending on the setting (mice, humans, bovine, etc). I don't know what the Retro folks were doing, but this does make their result less impressive.
You're actually right that this is due to meditation for me. AFAIK it's not synesthesia-esque though (i.e. I'm not causing there to be two qualia now); it's more that the distinction between mental qualia and bodily qualia doesn't seem meaningful upon inspection.
So I believe it's a semantic issue, and I really mean "confusion is qualia you can notice and act on" (though I agree I'm using "bodily" in non-standard ways and should stop when communicating w/ non-meditators).
How do you work w/ sleep consolidation?
Sleep consolidation / "sleeping on it" is when you struggle w/ [learning a piano piece], sleep on it, and then you're suddenly much better at it the next day!
This has happened to me for piano, dance, math concepts, video games, & rock climbing, but it varies in effectiveness. Why? Is it:
My current guess is a mix of all four. But I'm unsure whether, if you [practice piano] in the morning, you'll need to remind yourself of the experience before you fall asleep, which would also imply that you can only really consolidate 1 thing a day.
The decision relevance for me: I tend to procrastinate on heavy math papers, but it might be more efficient to spend 20 min a day for 3 days actually trying to understand one (sleeping on it each night) than to spend 1 hour in a single day. This would be neat because it's easier for me to convince myself to really struggle w/ a hard topic for [20] minutes, knowing I'm getting an equivalent [40] minutes if I sleep on it.