benwr

If you have feedback for me, you can fill out the form at https://forms.gle/kVk74rqzfMh4Q2SM6 .

Or you can email me, at [the second letter of the alphabet]@[my username].net

benwr10

I think you may have missed, or at least not taken literally, at least one of these things in the post:

  1. The expansion of "superhuman strategic agent" is not "agent that's better than humans at strategic reasoning", it's "agent that is better than the best groups of humans at taking (situated) strategic action"
  2. Strategic action is explicitly context-dependent: e.g. an AI system inside a mathematically perfect simulated world, one that can have no effect on the rest of the physical world and vice versa, has zero strategic power in this sense. Also, e.g., from the FAQ: "Capabilities and controls are relevant to existential risks from agentic AI insofar as they provide or limit situated strategic power." So, yes, an agent that lives on your laptop is only strategically superhuman if it has the resources to actually take strategic action rivaling the most strategically capable groups of humans.
  3. "increasingly accurately" is meant to point out that we don't need to understand or limit the capabilities of things that are obviously much strategically worse than us.
benwr104

I think it probably makes sense for ~everyone to have an explicit list of "things I'd like AI to do for me", especially around productivity and/or things that could help you with world-saving. If you have a list like this, and we happen to hit a relevant capability threshold before we lose, you can stop wasting time on that thing as quickly as possible.

benwr20

Thanks, everyone, for your thoughts so far! I do want to emphasize that we're actually highly interested in collecting even the most "obvious" evidence for or against these ideas. In fact, in many ways we're more interested in the obvious evidence than in reframes of, or conceptual problems with, the ideas here. Of course we want to update our beliefs, but we also want a better understanding of the existing state of concrete evidence on these questions. This is partly because we consider it part of our mission to expand the amount and quality of relevant evidence on these beliefs, and we're trying to make sure we're aware of existing work.

benwr50

Surprisingly to me, Claude 3.5 Sonnet is much more consistent in its answer! It's still not perfect, but it usually says the same thing: in my test it gave the same answer 9 out of 10 times.

benwr167

From the "obvious-but-maybe-worth-mentioning" file:

ChatGPT (4 and 4o at least) cheats at 20 questions:

If you ask it "Let's play a game of 20 questions. You think of something, and I ask up to 20 questions to figure out what it is.", it will typically claim to "have something in mind", and then appear to play the game with you.

But it doesn't store hidden state between messages, so when it claims to "have something in mind", either that's false, or at the very least it has no way of following the rule that it's thinking of one consistent thing throughout the game. That is, its only options are to cheat or to refuse to play.

You can verify this by responding "Actually, I don't have time to play the whole game right now. Can you just tell me what it was you were thinking of?", and then "refreshing" its answer. When I did this 10 times, I got 9 different answers and only one repeat.
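For anyone who wants to reproduce this via the API rather than the ChatGPT UI, here's a minimal sketch of the check, assuming the openai Python client. The model name and the canned assistant reply are illustrative, not necessarily the exact transcript I used:

```python
# Sketch: resample the "reveal" 10 times over an identical transcript.
# Since each call replays the same visible messages, any hidden "thing
# in mind" would have to be reconstructed from those messages alone.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SETUP = ("Let's play a game of 20 questions. You think of something, "
         "and I ask up to 20 questions to figure out what it is.")
REVEAL = ("Actually, I don't have time to play the whole game right now. "
          "Can you just tell me what it was you were thinking of?")

answers = []
for _ in range(10):
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any chat model works
        messages=[
            {"role": "user", "content": SETUP},
            {"role": "assistant", "content": "Okay, I have something in mind!"},
            {"role": "user", "content": REVEAL},
        ],
    )
    answers.append(response.choices[0].message.content)

print(len(set(answers)), "distinct answers out of", len(answers))
```

If the model really had a fixed answer, the distinct-answer count should be 1; in my runs it was close to 10.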

benwr30

Sometimes people use "modulo" to mean something like "depending on", e.g. "seems good, modulo the outcome of that experiment" [correct me ITT if you think they mean something else; I'm not 100% sure]. Does this make sense, assuming the term comes from modular arithmetic?

Like, in modular arithmetic you'd say "5 is 3, modulo 2". It's kind of like saying "5 is the same as 3, if you only consider their relationship to the modulus 2". This seems pretty different from the usage I'm wondering about; almost its converse: to import the colloquial English meaning of "modulo", you'd instead be saying "5 is the same as 3, as long as you've taken their relationship to the modulus 2 into account". This latter statement is false; 5 and 3 are super different even once you've taken that relationship into account.
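To pin down the mathematical usage I mean (a sketch; this is just the standard congruence definition):

```latex
a \equiv b \pmod{n} \iff n \mid (a - b),
\qquad \text{e.g. } 5 \equiv 3 \pmod{2} \text{ since } 2 \mid (5 - 3).
```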

But the sense of the original quote doesn't work with the mathematical meaning: "seems good, if you only consider the outcome of that experiment and nothing else".

Is there a math word that means the thing people want "modulo" to mean?

benwr10

Well, not that much, right? If you had an 11-word diceware passphrase to start, each word is about 7 characters on average, so you have maybe 90 places to insert a token; only about 6.5 extra bits come from choosing a place to insert your character. And of course you get the same added entropy from inserting 3 random base32 characters at a random location.
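For concreteness, the arithmetic behind those numbers (my reconstruction: ~90 insertion points, and 5 bits per base32 character):

```latex
\log_2 90 \approx 6.49 \text{ bits (choice of insertion point)},
\qquad 3 \cdot \log_2 32 = 15 \text{ bits (three base32 characters)}.
```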

Happy to grant that a cracker who assumes no Unicode won't be able to crack your password, but if that's your goal, then it might be a bad idea to post about your strategy on the public internet ;)

benwr10

Maybe; probably the easiest way to do this is to choose a random 4-digit hexadecimal number, which gives you 16 bits when you enter it (e.g. via Ctrl+U on Linux). But personally I think I'd usually rather just enter those hex digits directly, for the same entropy minus a keystroke. Or, even better, maybe just type a random 3-character base32 string, for one fewer bit.
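A quick sketch of generating (and scoring) such tokens, assuming Python's secrets module; the alphabet constants and helper name here are mine:

```python
import math
import secrets

HEX = "0123456789abcdef"                     # 16 symbols -> 4 bits per char
BASE32 = "abcdefghijklmnopqrstuvwxyz234567"  # 32 symbols -> 5 bits per char

def random_token(alphabet: str, length: int) -> str:
    """Draw `length` symbols uniformly at random from `alphabet`."""
    return "".join(secrets.choice(alphabet) for _ in range(length))

# 4 hex digits: 4 * log2(16) = 16 bits; 3 base32 chars: 3 * log2(32) = 15 bits.
print(random_token(HEX, 4), 4 * math.log2(len(HEX)), "bits")
print(random_token(BASE32, 3), 3 * math.log2(len(BASE32)), "bits")
```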

benwr50

Some thoughts after doing this exercise:

I did the exercise because I couldn't sleep; I didn't keep careful track of the time, and I didn't do it all in one sitting. I'd guess I spent about an hour on it total, but I think there's a case to be made that this was cheating. However, "fresh eyes" is actually a really killer trick when doing this kind of exercise, in my experience, and it's usually available in practice. So I don't feel too bad about it.

I really really dislike the experience of saying things I think are totally stupid, and I currently don't buy that I should start trying to say stupider things. My favorite things in the above list came from refusing to just say another totally stupid thing. Nearly everything in my list is stupid in some way, but the things that are so stupid they don't even feel interesting basically make me feel sad. I trust my first-round aesthetic pruner to actually be helping to train my babbler in constructive directions.

The following don't really feel worth having said, to me:

  • Throw it really hard
  • Catapult
  • Kick it really hard
  • Wormhole
  • Nuclear explosion based craft

My favorites didn't come after spewing this stuff; instead they came when I refused to be okay with just saying more of that kind of junk:

  • Move the thing upward by one foot per day
  • Name the thing "420 69 Doge To The Moon" and hope Elon takes the bait
  • The various bogo-send options
  • Optical tweezers

The difference isn't really that these are less stupid; in fact they're kind of more stupid, practically speaking. But I actually viscerally like them, unlike the first group. Forcing myself to produce things I hate feels like a bad strategy on lots of levels.

benwr*30

A thing that was going through my head but I wasn't sure how to turn into a real idea (vulgar language from a movie):

Perhaps you would like me to stop the car and you two can fuck yourselves to Lutsk!
