Jonathan Claybrough

Software engineer, transitioning into AI safety research. Particularly interested in (the usual) psychology, game theory, system design, economics, transitioning to utopia.


Sorted by New

Wiki Contributions


Did you know about "by default, GPTs think in plain sight"?
It doesn't explicitly talk about agentized GPTs but was discussing the impact this has on GPTs for AGI and how it affects the risks, and what we should do about it (eg. maybe rlhf is dangerous)

To not be misinterpreted, I didn't say I'm sure it's more the format than the content that's causing the upvotes (open question), nor that this post doesn't meet the absolute quality bar that normally warrants 100+ upvote (to each reader their opinion).

If you're open to object level discussing this, I can point on concrete disagreement with the content. Most importantly, this should not be seen as a paradigm shift, because it does not invalidate any of the previous threat models - it would only be so if it rendered impossible to do AGI any other way. I also don't think this should "change the alignment landscape" because it's just another part of it, one which was known and has been worked on for years (Anthropic and OpenAI have been "aligning" LLMs and I'd bet 10:1 anticipated these would be used to do agents like most people I know in alignment). 

To clarify, I do think it's really important and great people work on this, and that in order this will be the first x-risk stuff we see. But we could solve the GPT-agent problem and still die to unalignment AGI 3 months afterwards. The fact that the world trajectory we're on is throwing additional problems in the mix (keeping the world safe from short term misuse and unaligned GPT-agents) doesn't make the existing ones simpler. There still is pressure to built autonomous AGI, there might still be mesa optimizers, there might still be deception, etc. We need the manpower to work on all of these, and not "shift the alignment landscape" to just focus on the short term risks.

I'd recommend to not worry much about PR risk, just ask the direct question: Even if this post is only ever read by LW folk, does the "break all encryption" add to the conversation? Causing people to take time to debunk certain suggestions isn't productive even without the context of PR risk

Overall I'd like some feedback on my tone, if it's too direct/aggressive to you of it's fine. I can adapt. 

You can read "reward is not the optimization target" for why a GPT system probably won't be goal oriented to become the best at predicting tokens, and thus wouldn't do the things you suggested (capturing humans). The way we train AI matters for what their behaviours look like, and text transformers trained on prediction loss seem to behave more like Simulators. This doesn't make them not dangerous, as they could be prompted to simulate misaligned agents (by misuses or accident), or have inner misaligned mesa-optimisers

I've linked some good resources for directly answering your question, but otherwise to read more broadly on AI safety I can point you towards the AGI Safety Fundamentals course which you can read online, or join a reading group. Generally you can head over to AI Safety Support, check out their "lots of links" page and join the AI Alignment Slack, which has a channel for question too.

Finally, how does complexity emerge from simplicity? Hard to answer the details for AI, and you probably need to delve into those details to have real picture, but there's at least strong reason to think it's possible : we exist. Life originated from "simple" processes (at least in the sense of being mechanistic, non agentic), chemical reactions etc. It evolved to cells, multi cells, grew etc. Look into the history of life and evolution and you'll have one answer to how simplicity (optimize for reproductive fitness) led to self improvement and self awareness

Quick meta comment to express I'm uncertain that posting things in lists of 10 is a good direction. The advantages might be real, easy to post, quick feedback, easy interaction, etc.

But the main disadvantage is that this comparatively drowns out other better posts (with more thought and value in them). I'm unsure if the content of the post was also importantly missing from the conversation (to many readers) and that's why this got upvoted so fast or if it's a lot the format... Even if this post isn't bad (and I'd argue it is for the suggestions it promotes), this is early warning of a possible trend that people with less thought out takes quickly post highly accessible content, get comparatively more upvotes than should, and it's harder to find good content.

(Additional disclosure, some of my bad taste for this post come from the fact its call to break all encryption is being cited on Twitter as representative of the alignment community - I'd have liked to answer that obviously no, but it got many upvotes! This makes my meta point also seem to be motivated by PR/optics which is why it felt necessary to disclose but let's mostly focus on consequences inside the community)

First a quick response on your dead man switch proposal : I'd generally say I support something in that direction. You can find existing literature considering the subject and expanding in different directions in the "multi level boxing" paper by Alexey Turchin , I think you'll find it interesting considering your proposal and it might give a better idea of what the state of the art is on proposals (though we don't have any implementation afaik)

Back to "why are the predicted probabilities so extreme that for most objectives, the optimal resolution ends with humans dead or worse". I suggest considering a few simple objectives we could give ai (that it should maximise) and what happens, and over trials you see that it's pretty hard to specify anything which actually keeps humans alive in some good shape, and that even when we can sorta do that, it might not be robust or trainable. 
For example, what happens if you ask an ASI to maximize a company's profit ? To maximize human smiles? To maximize law enforcement ? Most of these things don't actually require humans, so to maximize, you should use the atoms human are made of in order to fulfill your maximization goal. 
What happens if you ask an ASI to maximize number of human lives ? (probably poor conditions). What happens if you ask it to maximize hedonistic pleasure ? (probably value lock in, plus a world which we don't actually endorse, and may contain astronomical suffering too, it's not like that was specified out was it?). 

So it seems maximising agents with simple utility functions (over few variables) mostly end up with dead humans or worse. So it seems approaches which ask for much less, eg. doing an agi that just tries to secure the world from existential risk (a pivotal act) and solve some basic problems (like dying) then gives us time for a long reflection to actually decide what future we want, and be corrigible so it lets us do that, seems safer and more approachable. 

I watched the video, and appreciate that he seems to know the literature quite well and has thought about this a fair bit - he did a really good introduction of some of the known problems. 
This particular video doesn't go into much detail on his proposal, and I'd have to read his papers to delve further - this seems worthwhile so I'll add some to my reading list. 

I can still point out the biggest ways in which I see him being overconfident : 

  • Only considering the multi-agent world. Though he's right that there already are and will be many many deployed AI systems, that doe not translate to there being many deployed state of the art systems. As long as training costs and inference costs continue increasing (as they have), then on the contrary fewer and fewer actors will be able to afford state of the art system training and deployment, leading to very few (or one) significantly powerful AGI (as compared to the others, for example GPT4 vs GPT2)
  • Not considering the impact that governance and policies could have on this. This isn't just a tech thing where tech people can do whatever they want forever, regulation is coming. If we think we have higher chances of survival in highly regulated worlds, then the ai safety community will do a bunch of work to ensure fast and effective regulation (to the extent possible). The genie is not out of the bag for powerful AGI, governments can control compute and regulate powerful AI as weapons, and setup international agreements to ensure this. 
  • The hope that game theory ensures that AI developed under his principles would be good for humans. There's a crucial gap between going from real world to math models. Game theory might predict good results under certain conditions and rules and assumptions, but many of these aren't true of the real world and simple game theory does not yield accurate world predictions (eg. make people play various social games and they won't act like how game theory says). Stated strongly, putting your hope on game theory is about as hard on putting your hope on alignment.  There's nothing magical about game theory which makes it work simpler than alignment, and it's been studied extensively by ai researchers (eg. why Eliezer calls himself a decision theorist and writes a lot about economics) with no clear "we've found a theory which empirically works robustly and in which we can put the fate of humanity in"

I work in AI strategy and governance, and feel we have better chances of survival in a world where powerful AI is limited to extremely few actors, with international supervision and cooperation for the guidance and use of these systems, making extreme efforts in engineering safety, in corrigibility, etc. I am not trustworthy of predictions on how complex systems turn out (which is the case of real multi agent problems) and don't think we can control these well in most relevant cases. 

Writing down predictions. The main caveat is that these predictions are predictions about how the author will resolve these questions, not my beliefs about how these techniques will work in the future. I am pretty confident at this stage that value editing can work very well in LLMs when we figure it out, but not so much that the first try will have panned out. 

  1. Algebraic value editing works (for at least one "X vector") in LMs: 90 %
  2. Algebraic value editing works better for larger models, all else equal 75 %
  3. If value edits work well, they are also composable 80 %
  4. If value edits work at all, they are hard to make without substantially degrading capabilities 25 %
  5. We will claim we found an X-vector which qualitatively modifies completions in a range of situations, for X = 
    1. "truth-telling" 10 %
    2. "love" 70 %
    3. "accepting death" 20%
    4. "speaking French" 80%

I don't think reasoning about others' beliefs and thoughts is helping you be correct about the world here. Can you instead try to engage with the arguments themselves and point out at what step you you don't see a concrete way for that to happen ? 
You don't show much sign of having read the article so I'll copy paste the part with explanations of how AIs start acting in the physical space.

In this scenario, the AIs face a challenge: if it becomes obvious to everyone that they are trying to defeat humanity, humans could attack or shut down a few concentrated areas where most of the servers are, and hence drastically reduce AIs' numbers. So the AIs need a way of getting one or more "AI headquarters": property they control where they can safely operate servers and factories, do research, make plans and construct robots/drones/other military equipment. 

Their goal is ultimately to have enough AIs, robots, etc. to be able to defeat the rest of humanity combined. This might mean constructing overwhelming amounts of military equipment, or thoroughly infiltrating computer systems worldwide to the point where they can disable or control most others' equipment, or researching and deploying extremely powerful weapons (e.g., bioweapons), or a combination.

Here are some ways they could get to that point:

  • They could recruit human allies through many different methods - manipulation, deception, blackmail and other threats, genuine promises along the lines of "We're probably going to end up in charge somehow, and we'll treat you better when we do."
    • Human allies could be given valuable intellectual property (developed by AIs), given instructions for making lots of money, and asked to rent their own servers and acquire their own property where an "AI headquarters" can be set up. Since the "AI headquarters" would officially be human property, it could be very hard for authorities to detect and respond to the danger.
    • Via threats, AIs might be able to get key humans to cooperate with them - such as political leaders, or the CEOs of companies running lots of AIs. This would open up further strategies.
  • As assumed above, particular companies are running huge numbers of AIs. The AIs being run by these companies might find security holes in the companies' servers (this isn't the topic of this piece, but my general impression is that security holes are widespread and that reasonably competent people can find many of them)15, and thereby might find opportunities to create durable "fakery" about what they're up to.
    • E.g., they might set things up so that as far as humans can tell, it looks like all of the AI systems are hard at work creating profit-making opportunities for the company, when in fact they're essentially using the server farm as their headquarters - and/or trying to establish a headquarters somewhere else (by recruiting human allies, sending money to outside bank accounts, using that money to acquire property and servers, etc.)
  • If AIs are in wide enough use, they might already be operating lots of drones and other military equipment, in which case it could be pretty straightforward to be able to defend some piece of territory - or to strike a deal with some government to enlist its help in doing so.
  • AIs could mix-and-match the above methods and others: for example, creating "fakery" long enough to recruit some key human allies, then attempting to threaten and control humans in key positions of power to the point where they control solid amounts of military resources, then using this to establish a "headquarters."

So is there anything here you don't think is possible ? 
Getting human allies ? Being in control of large sums of compute while staying undercover ? Doing science, and getting human contractors/allies to produce the results ? etc

I think this post would benefit from being more explicit on its target. This problem concerns AGI labs and their employees on one hand, and anyone trying to build a solution to Alignment/AI Safety on the other. 

By narrowing the scope to the labs, we can better evaluate the proposed solutions (for example  to improve decision making we'll need to influence decision makers therein), make them more focused (to the point of being lab specific, analyzing each's pressures), and think of new solutions (inoculating ourselves/other decision makers on AI about believing stuff that come from those labs by adding a strong dose of healthy skepticism). 

By narrowing the scope to people working on AI Safety who's status or monetary support relies on giving impressions of progress, we come up with different solutions (try to explicitly reward honesty, truthfulness, clarity over hype and story making). A general recommendation I'd have is to have some kind of reviews that check against "Wizard of Oz'ing" for flagging the behavior and suggesting corrections. Currently I'd say the diversity of LW and norms for truth seeking are doing quite well at this, so posting on here publicly is a great way to control this. It highlights the importance of this place and of upkeeping these norms.


Thanks for the reply ! 

The main reason I didn't understand (despite some things being listed) is I assumed none of that was happening at Lightcone (because I guessed you would filter out EAs with bad takes in favor of rationalists for example). The fact that some people in EA (a huge broad community) are probably wrong about some things didn't seem to be an argument that Lightcone Offices would be ineffective as (AFAIK) you could filter people at your discretion. 

More specifically, I had no idea "a huge component of the Lightcone Offices was causing people to work at those organizations". That's strikingly more of a debatable move but I'm curious why that happened in the first place ? In my field building in France we talk of x-risk and alignment and people don't want to accelerate the labs but do want to slow down or do alignment work. I feel a bit preachy here but internally it just feels like the obvious move is "stop doing the probably bad thing", but I do understand if you got in this situation unexpectedly that you'll have a better chance burning this place down and creating a fresh one with better norms. 

Overall I get a weird feeling of "the people doing bad stuff are being protected again, we should name more precisely who's doing the bad stuff and why we think it's bad" (because I feel aimed at by vague descriptions like field-building, even though I certainly don't feel like I contributed to any of the bad stuff being pointed at)

No, this does not characterize my opinion very well. I don't think "worrying about downside risk" is a good pointer to what I think will help, and I wouldn't characterize the problem that people have spent too little effort or too little time on worrying about downside risk. I think people do care about downside risk, I just also think there are consistent and predictable biases that cause people to be unable to stop, or be unable to properly notice certain types of downside risk, though that statement feels in my mind kind of vacuous and like it just doesn't capture the vast majority of the interesting detail of my model. 

So it's not a problem of not caring, but of not succeeding at the task. I assume the kind of errors you're pointing at are things which should happen less with more practiced rationalists ? I guess then we can either filter to only have people who are already pretty good rationalists, or train them (I don't know if there are good results on that side per CFAR). 

Load More