Tobias H


How useful do people think it would be to have human-in-the-loop (HITL) AI systems? 

What's the likeliest failure mode of HITL AI systems? And what kinds of policies are most likely to help against those failure modes?

Suppose HITL AI systems are not going to be safer once we reach unaligned AGI. Could they still give us tools/information/time-steps to increase the chance that AGI will be aligned?

Interesting. Can you give us a sense of how much those asks (offer to pay for the extra labour) end up costing you?

Also: The last name is "Von Almen" not "Almen"

I always assumed that "Why don't we give Terence Tao a million dollars to work on AGI alignment?" was using Tao to refer to a class of people. Your comment implies that it would be especially valuable for Tao specifically to work on it. 

Why should we believe that Tao would be especially likely to make progress on AGI alignment (e.g. compared to other recent Fields Medal winners like Peter Scholze)?

[I think this is more of an anthropomorphizing ramble than a concise argument. Feel free to ignore :) ]

I get the impression that in this example the AGI would not actually be satisficing. It is no longer maximizing a goal but still optimizing for this rule. 

For a satisficing AGI, I'd imagine something vague like "Get many paperclips" resulting in the AGI trying to get paperclips but at some point (an inflection point of diminishing marginal returns? some point where it becomes very uncertain about what the next action should be?) doing something else. 

Or for rules like "get 100 paperclips, not more" the AGI might only directionally or opportunistically adhere. Within the rule, this might look like "I wanted to get 100 paperclips, but 98 paperclips are still better than 90, let's move on" or "Oops, I accidentally got 101 paperclips. Too bad, let's move on".

In your example of the AGI taking lots of precautions, the satisficing AGI would not do this because it could be spending its time doing something else.
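
To make the contrast concrete, here is a toy sketch of what I mean. Everything in it is an assumption for illustration only: the function names, the `good_enough` and `min_gain` parameters, and the idea that paperclips arrive in batches are all made up, and this is not a claim about how a real AGI would be built.

```python
from typing import Iterable


def maximizer(batches: Iterable[int]) -> int:
    """Keeps collecting paperclips until the supply is exhausted."""
    total = 0
    for batch in batches:
        total += batch
    return total


def satisficer(
    batches: Iterable[int],
    target: int = 100,          # hypothetical goal: "get 100 paperclips"
    good_enough: float = 0.95,  # hypothetical: being close to the target is fine
    min_gain: int = 1,          # hypothetical: below this, the step isn't worth it
) -> int:
    """Stops once the total is roughly at the target, or once a step yields too little."""
    total = 0
    for batch in batches:
        if total >= good_enough * target:
            break  # "98 paperclips are still better than 90, let's move on"
        if batch < min_gain:
            break  # diminishing returns: go do something else
        total += batch  # may overshoot; no effort is spent correcting that
    return total


print(maximizer([30, 30, 30, 11, 50]))   # takes everything available
print(satisficer([30, 30, 30, 11, 50]))  # overshoots slightly, then stops
print(satisficer([40, 40, 18, 0, 50]))   # stops a bit short of the target
```

The point of the sketch is only that the satisficer never checks whether it hit exactly 100 and never takes precautions to make sure it did. It stops when things look roughly fine or when continuing stops paying off.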

I suspect there are major flaws with it, but an intuition I have goes something like this:

  • Humans have in some sense similar decision-making capabilities to early AGI.
  • The world is incredibly complex and humans are nowhere near understanding and predicting most of it. Early AGI will likely have similar limitations.
  • Humans are mostly not optimizing their actions, mainly because of limited resources, multiple goals, and because of a ton of uncertainty about the future. 
  • So early AGI might also end up not-optimizing its actions most of the time.
  • Suppose we assume that the complexity of the world will continue to be sufficiently big such that the AGI will continue to fail to completely understand and predict the world. In that case, the advanced AGI will continue to not-optimize to some extent.
    • But it might look like near-complete optimization to us. 
  • Would an AGI that only tries to satisfice a solution/goal be safer?
  • Do we have reason to believe that we can/can't get an AGI to be a satisficer?

There has been quite a lot of discussion over on the EA Forum:

Avital Balwit linked to this LessWrong post in the comments of her own response to his longtermism critique (because Phil Torres is currently banned from the forum, afaik):

The whole thing was much more banal than what you're imagining. It was an interim-use building with mainly student residents. There was no coordination between residents that I knew of.

The garden wasn't trashed before the letter. It was just a table and a couple of chairs that didn't fit the house rules. If the city had just said "please take the table out of the garden", I'd have given a 70% chance of it working. If the city had not said a thing, there would not have been (a lot of) additional furniture in the garden.

By issuing the threat, the city introduced an incentive they didn't intend.
Some residents who picked up on the incentive destroyed the garden because they were overconfident in the authority following through with the threat – no matter what. 
