Wiki Contributions


Yeah I agree demonstrating it and examining it is very important, especially in chain-of-thought where the existence of hidden information isn't a certainty.

They're a bit out of date, but the original proposal is ELK Sub - Note-taking in internal rollouts, and the multiple-model protocol is in Note Taking Without Hidden Messages.

Haven't done any proper testing, but it's high on my agenda; I'd be interested in working out which tests would be best - will elaborate later.

For the ELK contest I was working on something similar, but where the problem is more acute (you're trying to force a MuZero-style reinforcement learner to compress its hidden state into words).

The methods I proposed to deal with this are:

  • Cut down the null space by training a language model to rewrite the written words in a different format which preserves the meaning as closely as possible, so that the CoT reasoner is forced to respond to similar sentences in similar ways.
  • Use multiple models which are trained separately but have to respond to each other's thoughts, making it more difficult to have a shared steganographic protocol.
  • Use something like dropout on the chains of thought, so that it can't hide information in the combination of multiple sentences.
  • Have some non-linguistic bits which we don't apply optimization to, but instead focus our interpretability efforts on. The hope is that this soaks up the highest-order bits that aren't easily expressible in language, so there's less need to hide them within words. Applying dropout to these is particularly necessary.
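The dropout idea in the third bullet can be sketched in a few lines. This is a minimal illustration, not an implementation from the proposals above; the function name, masking token, and dropout rate are all my own choices:

```python
import random

def dropout_cot(sentences, p=0.5, rng=None):
    """Randomly mask whole sentences in a chain of thought, so a model
    can't reliably hide information spread across specific combinations
    of sentences - any given sentence may be missing downstream."""
    rng = rng or random.Random(0)  # seeded here only for reproducibility
    return [s if rng.random() > p else "[MASKED]" for s in sentences]

cot = [
    "The box contains a diamond.",
    "The sensor reading is consistent.",
    "Therefore the diamond is present.",
]
print(dropout_cot(cot, p=0.5))
```

In a training setup, the downstream reasoner would only ever see the masked version, so any steganographic scheme would have to survive the loss of arbitrary sentences.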

You can get copies of that on Amazon.

There are also LessWrong essay collections and the like on there.

I'm really interested to see this progress, it would feel very healthy if we could have a solid integrated definition of optimizer to work with.

I'm not sure I understand why you don't agree with the 'small' criterion for the target set. It seems that you should be able to say something about the likelihood of the target in the absence of any agent (or if the agent takes a max-ent distribution over actions or something), and that's the relevant notion of smallness, which then becomes large in the presence of the agent. Or is it that you expect it to be difficult to properly specify what it means to have no agent or random decisions?

On the relationships between the three ways of defining acts - is there a trivial way of connecting (1) and (3) by saying that the action that the agent takes in (1) is just baked into the trajectory as some true fact about the trajectory that doesn't have consequences until the agent acts on it? Or instead of the action itself, we could 'bake in' the mapping from some information about the trajectory to the action. Either way we could see this as being determined initially or at the point of decision without a difference in the resulting trajectories.

'all dependent variables in the system of equations' - I think this should be 'independent'.

Are there any Twitter lists you'd recommend for high % of good AI (safety) content?


Eliezer has huge respect in the community; he has strong, well thought-out opinions (often negative) on a lot of the safety research being done (with exceptions, Chris Olah mentioned a few times); but he's not able to work full time on research directly (or so I understand, could be way off).

Perhaps he should institute some kind of prize for work done, trying to give extra prestige and funding to work going in his preferred direction? Does this exist in some form without my noticing? Is there a reason it'd be bad? Time/energy usage for Eliezer combined with difficulty of delegation?

Looking at:

| Setup property | AWS G4ad.16xlarge | Claimed Eff0 setup |
|---|---|---|
| n GPU | 4 | 4 |
| n CPU | 64 | 96 |
| memory/GPU | 8 GB | 20 GB |

So I'm not sure you could do a perfect replication, but I think you should be able to do a run similar to their 100K-step runs in less than a day.

I would also potentially be interested in collaboration - there are some things I'm not keen to help with, and especially not to publish on, but I think we could probably work something out - I'll send you a DM tomorrow.
