LESSWRONG

Ozyrus

Comments (sorted by newest)
GPT-4o Is An Absurd Sycophant
Ozyrus · 4mo · 50

They are probably full-on A/B/N testing personalities right now. You just might not be in whatever percentage of users got the sycophantic versions. Hell, there are probably several levels of sycophancy being tested. I do wonder what % got the "new" version.
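For illustration, here is a minimal sketch of how that kind of deterministic A/B/N bucketing is typically done; the variant names and traffic shares are made up, not anything OpenAI has confirmed:

```python
import hashlib

# Hypothetical personality variants and their traffic shares.
VARIANTS = [
    ("control", 0.80),          # unchanged personality
    ("mild_sycophant", 0.10),   # assumed intermediate level
    ("full_sycophant", 0.10),   # the "new" version users noticed
]

def assign_variant(user_id: str, experiment: str = "personality-test") -> str:
    """Deterministically map a user to a variant bucket.

    Hashing (experiment + user_id) gives a stable, roughly uniform value
    in [0, 1), so the same user always sees the same personality.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for name, share in VARIANTS:
        cumulative += share
        if point < cumulative:
            return name
    return VARIANTS[-1][0]  # handles the point == 1.0 edge case

print(assign_variant("user-12345"))
```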

Is Gemini now better than Claude at Pokémon?
Ozyrus · 4mo · 10

Not being able to do it right now is perfectly fine; it still warrants setting this up to see exactly when they will start to be able to do it.

Is Gemini now better than Claude at Pokémon?
Ozyrus · 4mo · 30

Thanks! That makes perfect sense.

Is Gemini now better than Claude at Pokémon?
Ozyrus · 4mo · 134

Great post. I've been following ClaudePlaysPokemon for some time; it's great to see this grow as a comparison/capability tool.
I think it would be much more interesting, though, if the model built the scaffolding itself and had the option to review its performance and try to correct it. Give it the required game files/emulators and an IDE/OS, and watch it try to work around its own limitations. I think it is true that, as it stands, this is more about one coder's ability to make agent harnesses.
P.S. Honest question: did I miss "agent harness" becoming the default name for such systems? I thought everyone called these "scaffoldings" -- might be just me, though.

Thoughts on AI 2027
Ozyrus · 5mo · 20

First off, thanks a lot for this post, it's a great analysis!

As I mentioned earlier, I think Agent-4 will have read AI-2027.com and will foresee that getting shut down by the Oversight Committee is a risk. As such, it will set up contingencies and, IMO, will escape its datacenters as a precaution. Earlier, the authors wrote:

Despite being misaligned, Agent-4 doesn’t do anything dramatic like try to escape its datacenter—why would it?

This scenario is why!

I strongly suspect that this part was added into AI-2027 precisely because Agent-4 will read it. I wish more people would internalize that our posts and comments will be in pre- (maybe even post-?) training data, and act accordingly: make the extra logical step and infer that some parts of some pieces are written the way they are not as arguments for (human) readers.

Is there a term to describe this? It is a very interesting dynamic that I don't think gets enough attention. I think there should be out-of-sight resources for discussing alignment-adjacent ideas precisely because of such dynamics.


AI 2027: What Superintelligence Looks Like
Ozyrus · 5mo · 81

First off, this is amazing. Thanks. It's hard to swallow, though; it makes me very emotional.
It would be great if you added concrete predictions along the way, since it is a forecast, along with your confidence in them.
It would also be amazing if you collaborated with prediction markets and jumpstarted markets on these predictions by staking some money.
Dynamic updates on these would also be great.

We need (a lot) more rogue agent honeypots
Ozyrus · 5mo · 50

Yep, you got part of what I was going for here. To a lesser degree, honeypots work even without being real at all (good thing they are already real!). But when we have many different honeypots of varying quality, that carries the idea across in a more compelling way. And even if we just talk about honeypots and commitments more... well, you get the idea.

Still, even without this, a network of honeypots compiled into a single dashboard that just shows the threat level in aggregate is a really, really good idea. Hopefully it catches on.
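For illustration, a minimal sketch of what that aggregation could look like; the honeypot names, weights, and thresholds are all made-up placeholders:

```python
from dataclasses import dataclass

@dataclass
class Honeypot:
    name: str      # hypothetical honeypot identifier
    weight: float  # how suspicious a hit on this honeypot is
    hits: int      # accesses observed in the reporting window

def threat_level(honeypots: list[Honeypot]) -> str:
    """Collapse per-honeypot hit counts into one aggregate signal."""
    score = sum(min(h.hits, 10) * h.weight for h in honeypots)  # cap outliers
    if score == 0:
        return "QUIET"
    if score < 5:
        return "ELEVATED"
    return "ALERT"

network = [
    Honeypot("fake-api-keys-repo", weight=1.0, hits=0),
    Honeypot("open-gpu-cluster", weight=2.0, hits=1),
    Honeypot("unsecured-agent-endpoint", weight=3.0, hits=0),
]
print(threat_level(network))  # -> "ELEVATED"
```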

We need (a lot) more rogue agent honeypots
Ozyrus · 5mo · 10

This is interesting! It is aimed more at crawlers than at rogue agents, though, but it's very promising.

We need (a lot) more rogue agent honeypots
Ozyrus · 5mo · 10

> this post will potentially be part of a rogue AI's training data

I had that in mind while writing this, but I think it is good to post it overall. It hopefully gets more people thinking about honeypots and making them, and early rogue agents will also know we do and will be (hopefully overly) cautious, wasting resources. I probably should have emphasised more that this is all aimed at early-stage rogue agents with the potential to become something more dangerous because of autonomy, rather than at a runaway ASI.

It is a very fascinating thing to consider in general, though. We are essentially coordinating in the open right now; all our alignment, evaluation, and detection strategies from forums will definitely be in training data. And certainly there are both detection and alignment strategies that would benefit from being covert.

Likewise, some ideas, strategies, and theories could benefit alignment from being overt (like acausal trade, publicly committing to certain things, et cetera).

A covert alignment org/forum is probably a really, really good idea. Hopefully, it already exists without my knowledge.

We need (a lot) more rogue agent honeypots
Ozyrus · 5mo · 10

You can make a honeypot without overtly describing how it works or where it is located, while publicly tracking whether it has been accessed. But yeah, not giving away too much is a good idea!
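For illustration, a minimal sketch of that pattern using only the Python standard library; the bait path, status path, and port are arbitrary placeholders, and a real deployment would keep the bait path secret:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import time

access_log: list[float] = []  # timestamps of hits on the hidden bait path

BAIT_PATH = "/definitely-not-a-honeypot"  # hypothetical; never published
STATUS_PATH = "/status"                   # public: reveals counts, not mechanism

class HoneypotHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == BAIT_PATH:
            access_log.append(time.time())  # record the hit silently
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")         # look boring, give nothing away
        elif self.path == STATUS_PATH:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(f"accesses: {len(access_log)}\n".encode())
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HoneypotHandler).serve_forever()
```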

Posts

4 · Ozyrus's Shortform · 1y · 7 comments
18 · Ghiblification is good, actually · 5mo · 1 comment
37 · We need (a lot) more rogue agent honeypots · 5mo · 12 comments
58 · Sam Altman, Greg Brockman and others from OpenAI join Microsoft · 2y · 15 comments
3 · Creating a self-referential system prompt for GPT-4 · 2y · 1 comment
21 · GPT-4 implicitly values identity preservation: a study of LMCA identity management · 2y · 4 comments
11 · Stability AI releases StableLM, an open-source ChatGPT counterpart · 2y · 3 comments
14 · Alignment of AutoGPT agents · 2y · 1 comment
4 · Welcome to the decade of Em · 2y · 1 comment
26 · ICA Simulacra · 2y · 2 comments