Filip Sondej

Wiki Contributions


  1. I agree that scaffolding can take us a long way towards AGI, but I'd be very surprised if GPT-4 as the core model were enough.

  2. Yup, that wasn't a critique, I just wanted to note something. By "seed of deception" I mean that the model may learn to use this ambiguity more and more, if that's useful for passing some evals while also helping it do computation unwanted by humans.

  3. I see, so maybe in ways which are weird for humans to think about.

we make the very strong assumption throughout that S-LLMs are a plausible and likely path to AGI

It sounds unlikely and unnecessarily strong to say that we can reach AGI by scaffolding alone (if that's what you mean). But I think it's pretty likely that AGI will involve some amount of scaffolding, and that the scaffolding will boost its capabilities significantly.

there is a preexisting discrepancy between how humans would interpret phrases and how the base model will interpret them

To the extent that this is true, I expect it may also make it easier for deception to arise. This discrepancy may serve as a seed of deception.

Systems engaging in self modification will make the interpretation of their natural language data more challenging.

Why? Sure, they will get more complex, but are there any other reasons?

Also, I like the richness of your references in this post :)

I edited the post to make it clearer that Bob throws out the wheel because he didn't notice in time that Alice threw hers.

Yup, side payments are a deviation - that's why I have this disclaimer in the game definition (I edited the post now to emphasize it more):

there also may be some additional actions available, but they are not obvious

Re separating speed of information and negotiations: I think here they are already pretty separate. The first example with 3 protocol rules doesn't allow negotiations and only tackles the information speed problem. The second example with the additional fourth rule enables negotiations. Maybe you could also have a system tackling only negotiations and not the information speed problem, but I'm not sure how it would look, or whether it would be much simpler.

Another problem (closely tied to negotiations) I wanted to tackle is something like "speed of deliberation", where agents make some bad commitments because they didn't have enough time to consider their consequences, and later realize they want to revoke or renegotiate them.

Oh, so the option to choose all of those disease weights is there, it's just a lot of effort for the parents? That's good to know.

Yeah, ideally it shouldn't need to be done by each parent separately - rather, there should be existing analyses ready. And even if those orgs don't provide satisfactory analyses themselves, the analyses could be done independently. E.g. collaborating on that with the Happier Lives Institute could work well, as they have some similar expertise.

each disease is weighted according to its impact on disability-adjusted lifespan

It's a pity they don't use a more accurate well-being metric, e.g. WELLBY (although I think WELLBY isn't ideal either).

How much control do the parents have over which metric will be used to rank the embryos?

Oh yeah, I meant the final locked-in commitment, not the initial tentative one. And my point is that when committing outside is sufficiently more costly, it's not worth doing, even if that would let you commit faster.

Yup, you're totally right - it may be too easy to commit in other ways, outside this protocol. But I still think it may be possible to create such a "main mechanism" for making commitments, where committing is just very easy/cheap/credible compared to other mechanisms. But that would require a crazy amount of cooperation.

The vast majority that I know of use ad-hoc and agent-specific commitment mechanisms

If you have some particular mechanisms in mind, could you list some? I'd like to compile a list of the most relevant commitment mechanisms to try to analyze them.

Love that post!

Can we train ML systems that clearly manifest a collective identity?

I feel like in multi-agent reinforcement learning that's already the case.

Re training setting for creating shared identity: what about a setting where a human and an LLM take turns generating text, like in the current chat setting, but first they receive some task, e.g. "write a good strategy for this startup", and the context for this task. At the end they output the final answer, and some reward model rates the performance of the cyborg (human+LLM) as a whole.

In practice, having real humans in this training loop may be too costly, so we may want to replace them most of the time with an imitation of a human.
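A minimal sketch of how such a training episode could be wired up. Every component here (the LLM policy, the human imitator, and the reward model) is a made-up stub, just to show the turn-taking structure and that the reward scores the joint final answer rather than either agent alone:

```python
# Sketch of a "cyborg" (human + LLM) episode with mocked components.

def imitation_human(task, transcript):
    # Stand-in for a learned imitation of a human.
    return f"[human] my thoughts on: {task}"

def llm(task, transcript):
    # Stand-in for the LLM policy being trained.
    prev = transcript[-1] if transcript else task
    return f"[llm] building on: {prev}"

def reward_model(task, final_answer):
    # Stand-in reward model: rates the cyborg as a whole.
    return 1.0 if final_answer else 0.0

def cyborg_episode(task, n_turns=4):
    transcript = []
    for turn in range(n_turns):
        speaker = imitation_human if turn % 2 == 0 else llm
        transcript.append(speaker(task, transcript))
    final_answer = transcript[-1]
    return transcript, reward_model(task, final_answer)

transcript, reward = cyborg_episode("write a good strategy for this startup")
```

The reward signal could then be used to update the LLM (and possibly the imitator) end-to-end, so the gradient flows through the collaboration rather than through solo performance.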

(Also a minor point to keep in mind: having emergent collective action doesn't mean that the agents have a model of the collective self. E.g. a colony of ants behaves as one, but I doubt ants have any model of the colony - they are rather just executing their ant procedures. Although with powerful AIs, I expect those collective self-models to arise. I just mean that maybe we should be careful in transferring insights from ant colonies, swarms, hives etc. to settings with more cognitively capable agents.)

Oh yeah, definitely. I think such a system shouldn't try to enforce one "truth" about which content is objectively good or bad.

I'd much rather see people forming groups, each with its own moderation rules, and let people be a part of multiple groups. There are a lot of methods that could be tried out - e.g. some groups could use algorithms like EigenTrust to decide how much to trust users.
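To illustrate the EigenTrust idea: each user assigns local trust scores to others, rows are normalized, and global trust is found by power iteration, mixed with a pre-trusted distribution so that sybils who only vouch for themselves get no weight. The data and parameters below are made up for the example:

```python
# Toy EigenTrust-style global trust computation (pure Python, hypothetical data).

def eigentrust(local_trust, pretrusted, alpha=0.15, iters=50):
    n = len(local_trust)
    # Row-normalize local trust; users who trust no one
    # fall back to the pre-trusted distribution.
    C = []
    for row in local_trust:
        s = sum(row)
        C.append([x / s for x in row] if s else pretrusted[:])
    t = pretrusted[:]
    for _ in range(iters):
        # t <- (1 - alpha) * C^T t + alpha * p
        t = [(1 - alpha) * sum(C[j][i] * t[j] for j in range(n))
             + alpha * pretrusted[i]
             for i in range(n)]
    return t

# 3 users: 0 and 1 vouch for each other; 2 only "trusts" itself (a potential sybil).
local = [[0, 1, 0],
         [1, 0, 0],
         [0, 0, 1]]
pretrusted = [0.5, 0.5, 0.0]
trust = eigentrust(local, pretrusted)
```

Here the mutually-vouching users end up with equal global trust, while the self-trusting sybil's score goes to zero because nobody pre-trusted or vouched for it.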

But before we can get to that, I see a more prohibitive problem - that it will be hard to get enough people to get that system off the ground.

Cool post! I think the minimum viable "guardian" implementation would be to

  • embed each post/video/tweet into some high-dimensional space
  • find out which regions of that space are nasty (we can do this collectively - e.g. what's clickbait for me is probably clickbait for you too)
  • filter out those regions
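The three steps above could be sketched like this, with hand-placed 2D points standing in for real ML embeddings and a hypothetical set of collectively flagged items defining the nasty region:

```python
# Toy sketch of the embed -> flag -> filter pipeline (all data hypothetical).

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

# Step 1: each post/video/tweet embedded (normally via an ML encoder).
items = {
    "clickbait1": (0.0, 0.1),
    "clickbait2": (0.1, 0.0),
    "essay":      (5.0, 5.0),
    "tutorial":   (4.8, 5.2),
}

# Step 2: users collectively flag items; flagged points define a nasty region.
flagged = [items["clickbait1"], items["clickbait2"]]
nasty_center, radius = centroid(flagged), 1.0

# Step 3: filter out everything inside the nasty region.
feed = [name for name, emb in items.items()
        if dist(emb, nasty_center) > radius]
```

A real system would need many regions (e.g. clusters over flags) and per-user or per-group thresholds, but the shape of the pipeline is the same.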

I tried to do something along these lines for YouTube:

I couldn't find a good way to embed videos using ML, so I just scraped which videos recommend each other and made a graph from that (which kind of is an embedding). Then I let users narrow down on some particular region of that graph. So you can not only avoid some nasty regions, but also decide what you want to watch right now, instead of the algorithm deciding for you. This gives the user more autonomy.
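A rough sketch of that graph approach, with made-up video IDs: scraped "A recommends B" pairs become edges, and the user narrows down to a region by expanding a neighborhood (BFS up to a few hops) around seed videos they like:

```python
# Sketch: recommendation graph + user-chosen region (hypothetical data).
from collections import deque

# Scraped "A recommends B" pairs.
recommends = [("cats1", "cats2"), ("cats2", "cats3"), ("cats2", "cats1"),
              ("drama1", "drama2"), ("drama2", "drama1"), ("cats3", "drama1")]

graph = {}
for a, b in recommends:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)  # treat recommendations as undirected

def region(seeds, hops):
    # All videos within `hops` recommendation steps of the seed videos.
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, d = frontier.popleft()
        if d == hops:
            continue
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, d + 1))
    return seen

cat_region = region({"cats1"}, hops=2)
```

The hop count (or a community-detection step on the same graph) is what lets the user widen or shrink the region they're browsing.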

The accuracy isn't very satisfying yet. I think the biggest problem with systems like these is the network effect - you could get much better results with some collaborative filtering.
