J Bostock

I'm mildly against this being immortalized as part of the 2023 review, though I think it serves excellently as a community announcement for Bay Area rats, which seems to be its original purpose.

I think the information most relevant in the long term (about AI and community building) is back-loaded, and the least relevant information (statistics and details about a no-longer-existent office space in the Bay Area) is front-loaded. This is a very Bay Area-centric post, which I don't think is ideal.

A better version of this post would be structured as a round-up of the main future-relevant takeaways, with specifics from the office space as examples.

I'm only referring to the reward constraint being satisfied for scenarios within the training distribution, since this maths applies entirely to a decision taking place in training. Therefore I don't think distributional shift applies.

I haven't actually thought much about particular training algorithms yet. I think I'm working on a higher level of abstraction than that at the moment, since my maths doesn't depend on any specifics about V's behaviour. I do expect that in practice an already-scheming V would be able to escape some finite-time reasonable-beta-difference situations like this, with partial success.

I'm also imagining that during training, V is made up of different circuits which might be reinforced or weakened.

My view is that, if V is shaped by a training process like this, then scheming Vs are no longer a natural solution in the same way that they are in the standard view of deceptive alignment. We might be able to use this maths to construct training procedures where the expected importance of a scheming circuit in V is to become (weakly) weaker over time, rather than being reinforced.

If we do that for the entire training process, we would not expect to end up with a scheming V.

The question is which RL and inference paradigms approximate this. I suspect it might be a relatively large portion of them. I think that if this work is relevant to alignment then there's a >50% chance it's already factoring into the SOTA "alignment" techniques used by labs.

I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.

All arguments break down a bit when introduced to the real world. Is there a particular reason why the approximation error to argument breakdown ratio should be particularly high in this case? 

For example, if we introduce some error into the beta-coherence assumption:

Assume beta_t = 1, beta_s = 0.5, r_1 = 1, r_2 = 0.

V(s_0) = e/(1+e) +/- delta = 0.731 +/- delta

Actual expected value = e^0.5/(1+e^0.5) = 0.622

Even if |delta| = 0.1, the system cannot be coherent over training in this case: the lower end of the claimed range, 0.731 - 0.1 = 0.631, still exceeds the actual expected value of 0.622. This seems relatively robust to me.
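The arithmetic above can be checked directly: under a Boltzmann (softmax) policy over two actions with rewards r_1 = 1 and r_2 = 0, the expected reward at inverse temperature beta is e^beta / (e^beta + 1). A minimal sketch (the function name is mine, not from the post):

```python
import math

def softmax_expected_reward(beta, rewards):
    """Expected reward under a Boltzmann (softmax) policy at inverse temperature beta."""
    weights = [math.exp(beta * r) for r in rewards]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, rewards)) / total

rewards = [1.0, 0.0]  # r_1 = 1, r_2 = 0

v_claimed = softmax_expected_reward(1.0, rewards)  # beta_t = 1  -> e/(1+e)
v_actual = softmax_expected_reward(0.5, rewards)   # beta_s = 0.5

print(round(v_claimed, 3))  # 0.731
print(round(v_actual, 3))   # 0.622
# The gap (~0.109) exceeds |delta| = 0.1, so no delta of that size can make V
# beta_t-coherent while also matching the expectation under beta_s sampling.
print(round(v_claimed - v_actual, 3))  # 0.109
```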

This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (i.e. has very low probability under the previous beta) and behave differently in this new case (reward-hacking the new beta).

I agree that there are some exceedingly pathological Vs which could survive a process which obeys my assumptions with high probability, but I don't think that's relevant, because I still think a process obeying these rules is unlikely to create such a pathological V.

My model for how the strong doom-case works is that it requires there to be an actually-coherent mathematical object for the learning process to approach. This is the motivation for expecting arbitrary learning processes to approach e.g. utility maximizers. What I believe I have shown is that under these assumptions there is no such coherent mathematical object for a particular case of misalignment. Therefore I think this provides some evidence that an otherwise arbitrary learning process which pushes towards correctness and beta-coherence but samples at a different beta is unlikely to approach this particular type of misaligned V.

Trained with what procedure, exactly?

Fair point. I was going to add that I don't really view this as a "proposal" but more of an observation. We will have to imagine a procedure which converges on correctness and beta-coherence. I was abstracting this away because I don't expect something like this to be too hard to achieve.

Since I've evidently done a bad job of explaining myself, I'll backtrack and try again:

There's a doom argument which I'll summarize as "if your training process generates coherent agents which succeed at a task, one solution is that you get a coherent agent which is actually doing the task 'manipulate your training to get released unmodified to do [X]' where X can be anything, which will 'succeed' at the task at hand as part of its manipulation". This summary being roughly correct is load bearing.

But if we have a (hypothetical) process to generate agents which are coherent at one beta, but apply a different one during training, this solution is no longer so clear. We are essentially exploring a space of coherent agents without actually instantiating those coherent agents. The fact that we can sample the agent space without instantiating those agents is load bearing (I think one of the deep ASI risks is that to train an ASI you have to run it, but here we can instead run a hamstrung version).

Therefore, during the training, the value function will not be shaped into something which looks like 'manipulate your training to get released unmodified to do [X]'.

Whether or not the beta difference required is too large to make this feasible in practice, I do not know.

The argument could also be phrased as "If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta."

The contradiction that I (attempt to) show only arises because we assume that the value function is totally agnostic of the state actually reached during training, other than due to its effects on a later deployed AI.

Therefore a value function trained with such a procedure must consider the state reached during training. This reduces the space of possible value functions from "literally anything which wants to be modified a certain way to be released" to "value functions which do care about the states reached during training".

Yes, this would prevent an aligned AI from arbitrarily preserving its value function; the point is that an aligned AI probably would care about which state was reached during training (that's the point of RL), so the contradiction does not apply.

I think you're right: correctness and beta-coherence can be rolled up into one specific property. I think I wrote down correctness as a constraint first, then tried to add coherence, but the specific property is that:

For non-terminal s, this can be written as:

If s is terminal then [...] we just have .

Which captures both. I will edit the post to clarify this when I get time.
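A sketch of how such a combined condition is often written; this is my own notation and a guess at the intended equations, not the post's exact formulas:

```latex
% Assumed notation: \pi_\beta is the Boltzmann policy induced by V at inverse
% temperature \beta; r(s) is the reward at a terminal state s.
% Non-terminal s (beta-coherence):
V(s) = \mathbb{E}_{s' \sim \pi_\beta(\cdot \mid s)}\bigl[ V(s') \bigr]
% Terminal s (correctness):
V(s) = r(s)
```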

I somehow missed that they had a discord! I couldn't find anything on mRNA on their front-facing website, and since it hasn't been updated in a while I assumed they were relatively inactive. Thanks! 

Thinking back to the various rationalist attempts to make a vaccine (https://www.lesswrong.com/posts/niQ3heWwF6SydhS7R/making-vaccine) for bird-flu-related reasons: since then, we've seen mRNA vaccines arise as a new vaccination method. mRNA vaccines have been used intranasally for COVID with success in hamsters. If one can order mRNA for a flu protein, it would only take mixing it with some sort of delivery mechanism (such as Lipofectamine, which is commercially available) and snorting it to get what could actually be a pretty good vaccine. Has RaDVaC or similar looked at this?
