amaury lorin

Comments

Sounds like those people are victims of a salt-in-pasta-water fallacy.

It's also very old-fashioned. Can't say I've ever heard anyone below 60 say "pétard" unironically.

You might also assign different values to red-choosers and blue-choosers (one commenter I saw said they wouldn't want to live in a world populated only by people who picked red) but I'm going to ignore that complication for now.

Roko has also mentioned they think people who choose blue are bozos, and I think it's fair to assume from their comments that they care less about bozos than about smart people.

I'm very interested in seeing the calculations where you assign different utilities to people depending on their choice (and possibly also depending on yours, e.g. if you only value people who choose the same as you).
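
For concreteness, here is a minimal sketch of the kind of calculation I have in mind (the survival rule, the utility weights, and the numbers below are assumptions of mine for illustration, not taken from the post):

```python
# Sketch (assumptions, not from the post): expected value of picking red vs blue
# when you weight red-choosers' and blue-choosers' survival differently.
# Assumed rule: red-choosers always survive; blue-choosers survive only if
# blue-choosers reach at least 50%.

def expected_value(my_choice, p_blue_threshold, n_red, n_blue,
                   value_red=1.0, value_blue=1.0):
    """Expected total value of survivors, from my point of view.

    my_choice: "red" or "blue" (I get counted in that group).
    p_blue_threshold: my credence that blue-choosers reach the 50% threshold
        (simplification: treated as independent of my own choice).
    n_red, n_blue: expected numbers of other red- and blue-choosers.
    value_red, value_blue: how much I value one red/blue-chooser surviving.
    """
    if my_choice == "red":
        n_red += 1
    else:
        n_blue += 1

    everyone_survives = n_red * value_red + n_blue * value_blue
    only_red_survives = n_red * value_red

    return (p_blue_threshold * everyone_survives
            + (1 - p_blue_threshold) * only_red_survives)

# Purely illustrative numbers: 30% chance blue reaches the threshold,
# and blue-choosers valued somewhat more than red-choosers.
print(expected_value("red", 0.3, n_red=60, n_blue=40, value_blue=1.2))
print(expected_value("blue", 0.3, n_red=60, n_blue=40, value_blue=1.2))
```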

I mean, as an author you can hack through them like butter; it is highly unlikely that, out of all the characters you could write, the only interesting ones are those that generate interesting content iff they (accurately) predict you'll give them value.

I strongly suspect the actual reason you'll spend half of your post's value on buying ads for Olivia (if in fact you do that, which is doubtful as well) is not the proposition that she would only accept this trade if you did, on the grounds that:
- she can predict your actions (as in, you wrote her as unable to act in any way other than accurately predicting your actions), and
- she predicts you'll do that (in exchange for providing you with a fun story).
I suspect that your actual reason is more like staying true to your promise, making a point, having fun, and other such things.


I can imagine acausally trading with humans gone beyond the cosmological horizon, because our shared heritage would make a lot of the critical flaws in the post go away.

This is mostly wishful thinking.
You're throwing away your advantages as an author to bargain with fictionally smart entities. You can totally void the deal with Olivia and she can do nothing about it because she's as dumb as you write her to be.
Likewise, the author writing about space-warring aliens who write about giant-cube-having humans could just consider aliens that have space wars without any consideration for humans at all; you haven't given enough detail for the aliens' model of the humans to be precise enough that their behavior must depend on it.

Basically, you're creating characters that are to you as you are to a superintelligence, but dumbing yourself down to their level for the fun of trading acausally. This is not acausal trading, because you are not actually on their level and their decisions do not in fact depend on reliably predicting that you'll cooperate in the trade.
This is just fiction.

For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.

There is a critical step missing here, which is when the trade-bot makes a "choice" between maximising money or satisfying preferences.
At this point, I see two possibilities:

  • Modelling the trade-bot as an agent does not break down: the trade-bot has an objective which it tries to optimize, plausibly maximising money (since that is what it was trained for) and probably not satisfying human preferences (unless it had some reason to have that as an objective). 
    A comforting possibility is that it is corrigibly aligned, that it optimizes for a pointer to its best understanding of its developers. Do you think this is likely? If so, why?
  • An agentic description of the trade-bot is inadequate. The trade-bot is an adaptation-executer, it follows shards of value, or something. What kind of computation is it making that steers it towards satisfying human preferences?

So I'd be focusing on "do the goals stay safe as the AI gains situational awareness?", rather than "are the goals safe before the AI gains situational awareness?"

This is a false dichotomy. If you can assume that, once the AI gains situational awareness, it will optimize for its developers' goals, then alignment is already solved. And making the goals safe before situational awareness is not that hard: at that point, the AI is not capable enough to pose an X-risk.
(A discussion of X-risk brought about by situationally unaware AIs could be interesting, such as a Christiano-style failure story, but Soares's model is not about that, since it assumes autonomous ASI.)

A new paper, building on the compendium of problems with RLHF, tries to make an exhaustive list of all the issues identified so far: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

That sounds nice but is it true? Like, that's not an argument, and it's not obvious! I'm flabbergasted it received so many upvotes.
Can someone please explain?

Well, I wasn't interested because AIs were better than humans at Go; I was interested because it was evidence of a trend of AIs becoming better than humans at some tasks, with implications for future AI capabilities.
So from this perspective, I guess this article would be a reminder that adversarial training is an unsolved problem for safety, as Gwern said above. Still doesn't feel like that's all there is to it, though.
