All of plex's Comments + Replies

Your probabilities are not independent; your estimates mostly flow from a world model which seems to me to be flatly and clearly wrong.

The plainest examples seem to be assigning

We invent a way for AGIs to learn faster than humans: 40%
AGI inference costs drop below $25/hr (per human equivalent): 16%

despite current models learning vastly faster than humans (training time of LLMs is not a human lifetime, and covers vastly more data), current models nearing AGI, and inference being dramatically cheaper and plummeting in cost with algorithmic improvements. There is a general...

Nice! Glad to see more funding options entering the space, and excited to see the S-process rolled out to more grantmakers.

Added you to the map of AI existential safety:

One thing which might be nice as part of improving grantee experience would be being able to submit applications as a Google Doc (with a template which gives the sections you list) rather than just by form. This increases re-usability and decreases stress, as it's easy to make updates later on so it's less of a worry that you ended up missing something crucial. Might be more hassle than it'...

Yeah, I do think there are a bunch of benefits to doing things in Google Docs, though it is often quite useful to have more structured data on the evaluation side.

This increases re-usability and decreases stress, as it's easy to make updates later on so it's less of a worry that you ended up missing something crucial.

You can actually update your application any time. When you submit the application you get a link that allows you to edit your submission as well as see its current status in the evaluation process. Seems like we should sign-post this better.

Cool! Feel free to add it with the form

that was me for context:

core claim seems reasonable and worth testing, though I'm not very hopeful that it will reliably scale through the sharp left turn

my guess is the intuitions don't hold in the new domain, and radical superintelligence requires intuitions that you can't develop on relatively weak systems, but it's a source of data for our intuition models which might help with other stuff so seems reasonable to attempt.

Meta's previous LLM, OPT-175B, seemed good by benchmarks but was widely agreed to be much, much worse than GPT-3 (not even necessarily better than GPT-Neo-20b). It's an informed guess, not a random dunk, and does leave open the possibility that they've turned it around and have a great model this time rather than something which goodharts the benchmarks.

Just a note, I googled a bit and couldn't find anything regarding poor OPT-175B performance.

This is a Heuristic That Almost Always Works, and it's the one most likely to cut off our chances of solving alignment. Almost all clever schemes are doomed, but if we as a community let that meme stop us from assessing the object level question of how (and whether!) each clever scheme is doomed then we are guaranteed not to find one.

Security mindset means look for flaws, not assume all plans are so doomed you don't need to look.

If this is, in fact, a utility function which if followed would lead to a good future, that is concrete progress and lays out a n...

I don't think security mindset means "look for flaws." That's ordinary paranoia. Security mindset is something closer to "you better have a really good reason to believe that there aren't any flaws whatsoever." My model is something like "A hard part of developing an alignment plan is figuring out how to ensure there aren't any flaws, and coming up with flawed clever schemes isn't very useful for that. Once we know how to make robust systems, it'll be more clear to us whether we should go for melting GPUs or simulating researchers or whatnot."

That said, I ...

Not inspired by them, no. Those did not have, as far as I'm aware, a clear outlet for use of the outputs. We have a whole platform we've been building towards for three years (starting on the FAQ long before those contests), and the ability to point large numbers of people at that platform once it has great content thanks to Rob Miles.

As I said over on your Discord, this feels like it has a shard of hope, and the kind of thing that could plausibly work if we could hand AIs utility functions.

I'd be interested to see the explicit breakdown of the true names you need for this proposal.

Agreed, incentives probably block this from being picked up by megacorps. I had thought to try and get Musk's Twitter to adopt it at one point when he was talking about bots a lot; it would be very effective, but doesn't allow rent extraction in the same way as the solution he settled on (paid Twitter Blue).

Websites which have the slack to allow users to improve their experience even if it costs engagement might be better adopters; LessWrong has shown they will do this with e.g. batching karma daily by default to avoid dopamine addiction.

Would it be possible to make an opt-in EigenKarma layer on top of twitter (but independent from it)? I can imagine parsing say 100k most popular twitter accounts plus all of personal tweets and likes of people who opted in to the EigenKarma layer, and then building a customised twitter feed for them
Paid Twitter blue seems to be quite competitive. The algorithm could just weigh paid Twitter blue users more highly than users that aren't Twitter blue. As far as Megacorps go, Youtube likely wouldn't want this for its video ranking, but it might want it for the comment sections of videos. If the EigenKarma from the owner of a Youtube channel would set the ranking of comments within Youtube that would increase the comment quality by a lot. 
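
To make the opt-in layer idea concrete, here is a minimal sketch of how a per-user trust score could be computed. It assumes EigenKarma behaves roughly like personalized PageRank over a graph of upvotes or likes, which is an assumption of this sketch rather than something stated in the thread. The function name `personalized_trust`, the `votes` structure and the example accounts are all hypothetical.

```python
def personalized_trust(votes, source, damping=0.85, iterations=50):
    """Propagate trust outward from `source` along weighted vote edges.

    votes: {voter: {target: weight}}, e.g. counts of likes or upvotes.
    Returns {user: trust as seen from `source`}.
    Accounts with no outgoing votes simply absorb the trust they receive.
    """
    users = set(votes) | {t for targets in votes.values() for t in targets} | {source}
    trust = {u: 0.0 for u in users}
    trust[source] = 1.0
    for _ in range(iterations):
        nxt = {u: 0.0 for u in users}
        nxt[source] += 1.0 - damping  # restart: trust always flows from the viewer
        for voter, targets in votes.items():
            total = sum(targets.values())
            if total == 0:
                continue
            for target, weight in targets.items():
                nxt[target] += damping * trust[voter] * weight / total
        trust = nxt
    return trust


# Hypothetical data: alice likes bob three times and carol once;
# mallory and a sock puppet only vote for each other.
votes = {
    "alice": {"bob": 3, "carol": 1},
    "bob": {"dave": 1},
    "mallory": {"mallory_sock": 10},
    "mallory_sock": {"mallory": 10},
}
trust = personalized_trust(votes, "alice")
```

Because trust only flows along edges reachable from the viewer, accounts that vote exclusively for each other get no weight in alice's feed, which is the property that would matter for bots. A feed or comment ranking could then sort items by the trust score of their author.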

Hypothesis #2: These bits of history are wrong for reasons you can check with simpler learned structures.

Maybe these historical patterns are easier to disprove with simple exclusions, like "these things were in different places"?

That is true, but it is also true for the Nash thing, I would say.

And if you use common but obviously wrong science or maths, it is less likely to.

Yeah, my guess is if you use really niche and plausible-sounding historical examples it is much more likely to hallucinate.

Hm... I am unsure what "really niche and plausible-sounding historical examples" would be... Example 1: Example 2: Note that this is a question about a real historic incident and ChatGPT gives a wrong answer, and you can find the answer on Wikipedia: "The participation of East Germany was cancelled just hours before the invasion. The decision for the non-participation of the East German National People's Army in the invasion was made on short notice by Brezhnev at the request of high-ranking Czechoslovak opponents of Dubček who feared much larger Czechoslovak resistance if German troops were present, due to previous experience with the German occupation."
And if you use common but obviously wrong science or maths, it is less likely to.

Maybe the RLHF agent selected for expects the person giving feedback to correct it for the history example, but not know the latter example is false. If you asked a large sample of humans, more would be able to confidently say the first example is false than the latter one.

Thanks, that sounds plausible. But how should I imagine that process, given that I could have made up arbitrary pseudo-historic cases like the one with the Vikings, and not one of them appeared in the training data?

Yeah, that makes a lot of sense and fits my experience of what works.

Hell yeah!

This matches my internal experience that caused me to bring a ton of resources into existence in the alignment ecosystem (with various collaborators):

  • Man, there really should be a single point of access that lets people self-onboard into the effort. (Helped massively by Rob Miles's volunteer community, soon to launch a paid distillation fellowship)
  • Maybe we should have a unified place with all the training programs and conferences so people can find what to apply to? (AI Safety Support had a great database that...

I've kept updating in the direction of doing a bunch of little things that don't seem blocked/tangled on anything, even if they seem trivial in the grand scheme of things. In the process of doing those you will free up memory and learn a bunch about the nature of the bigger things that are blocked, while simultaneously revving your own success spiral and action-bias.

This is some great advice. Especially 1 and 2 seem foundational for anyone trying to reliably shift the needle by a notable amount in the right direction.

My favorite frame is based on In The Future Everyone Will Be Famous To Fifteen People. If we as a civilization pass this test, we who lived at the turn of history will be outnumbered trillions to one, and the future historians will construct a pretty good model of how everyone contributed. We'll get to read about it, if we decide that that's part of our personal utopia.

I'd like to be able to look back from eternity and see that I shifted things a little in the right direction. That perspective helps defuse some of the local status-seeking drives, I think.

I can verify that the owner of the blaked[1] account is someone I have known for a significant amount of time, that he is a person with a serious, long-standing concern with AI safety (and all other details verifiable by me fit), and that based on the surrounding context I strongly expect him to have presented the story as he experienced it.

This isn't a troll.

  1. ^

    (also I get to claim memetic credit for coining the term "blaked" for being affected by this class of AI persuasion)

And for encouraging me to post it to LW in the first place! I certainly didn't expect it to blow up.

Agree with Jim, and suggest starting with some Rob Miles videos. The Computerphile ones, and those on his main channel, are a good intro.

2 Program Den 5mo
I'm familiar with AGI, and the concepts herein (why the OP likes the proposed definition of CT better than PONR), it was just a curious post, what with having "decisions in the past cannot be changed" and "does X concept exist" and all. I think maybe we shouldn't muddy the waters more than we already have with "AI" (like AGI is probably a better term for what was meant here, or was it? Are we talking about losing millions of call center jobs to "AI" (not AGI) and how that will impact the economy/whatnot? I'm not sure if that's transformatively up there with the agricultural and industrial revolutions (as automation seems industrial-ish?). But I digress.), by saying "maybe crunch time isn't a thing? Or it's relative?". I mean, yeah, time is relative, and doesn't "actually" exist, but if indeed we live in a causal universe (up for debate) then indeed, "crunch time" exists, even if by nature it's fuzzy, as lots of things contribute to making Stuff Happen. (The butterfly effect, chaos theory, game theory &c.) "The avalanche has already started. It is too late for the pebbles to vote." - Ambassador Kosh

Nice! Would you be up for putting this in the Google Drive folder too, with a question-shaped title?

3 Jakub Kraus 5mo
Done, the current title is "What are some exercises and projects I can try?"

Yes, this is a robustly good intervention on the critical path. Have had it on the Alignment Ecosystem Development ideas list for most of a year now.

Some approaches to solving alignment go through teaching ML systems about alignment and getting research assistance from them. Training ML systems needs data, but we might not have enough alignment research to sufficiently fine tune our models, and we might miss out on many concepts which have not been written up. Furthermore, training on the final outputs (AF posts, papers, etc) might be less good at capturin

...

I'm working on it. and /map are still a WIP, but very useful already.


  1. Have been continually adding to this, up to 49 online communities.
  2. Added a student groups section, with 12 entries and some extra fields (website, contact, calendar, mailing list), based on the AGISF list (in talks with the maintainer to set up painless syncing between the two systems).
  3. Made a form where you can easily add entries.

Still getting consistent traffic, happy to see it getting used :)

fwiw that strange philosophical bullet fits remarkably well with a set of thoughts I had while reading Anthropic Bias about 'amount of existence' being the fundamental currency of reality (a bunch of the anthropic paradoxes felt like they were showing that if you traded sufficiently large amounts of "patterns like me exist more" then you could get counterintuitive results like bending the probabilities of the world around you without any causal pathway), and infraBayes requiring it actually updated me a little towards infraBayes being on the right track.

My...

I classified the first as Outer misalignment, and the second as Deceptive outer misalignment, before reading on.

I agree with 

Another use of the terms “outer” and “inner” is to describe the situation in which an “outer” optimizer like gradient descent is used to find a learned model that is itself performing optimization (the “inner” optimizer). This usage seems fine to me.

being the worthwhile use of the term inner alignment as opposed to the ones you argue against, and could imagine that the term is being blurred and used in less helpful ways by many ...

For developing my hail mary alignment approach, the dream would be to be able to load enough of the context of the idea into a LLM that it could babble suggestions (since the whole doc won't fit in the context window, maybe randomizing which parts beyond the intro are included for diversity?), then have it self-critique those suggestions automatically in different threads in bulk and surface the most promising implementations of the idea to me for review. In the perfect case I'd be able to converse with the model about the ideas and have that be not totally useless, and pump good chains of thought back into the fine-tuning set.
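
The "randomize which parts beyond the intro are included" step can be sketched without committing to any particular model API. The following is a minimal illustration under stated assumptions: the budget is measured in characters rather than tokens, and `build_prompt`, `intro`, `sections` and the budget value are hypothetical names and numbers chosen for the example. The babble and self-critique calls to the LLM are deliberately left out.

```python
import random


def build_prompt(intro, sections, budget_chars, rng):
    """Keep the intro, then add randomly chosen sections while they fit.

    Sections are visited in a random order so repeated calls with different
    random states yield different context mixes; the chosen sections are
    then emitted in their original document order.
    """
    separator = "\n\n"
    chosen = []
    used = len(intro)
    for idx in rng.sample(range(len(sections)), len(sections)):
        cost = len(separator) + len(sections[idx])
        if used + cost <= budget_chars:
            chosen.append(idx)
            used += cost
    chosen.sort()  # restore document order for readability
    return separator.join([intro] + [sections[i] for i in chosen])


# Hypothetical document split into an intro plus six later sections.
intro = "Intro: the core idea."
sections = [f"Section {i}: " + "x" * 40 for i in range(6)]
prompts = [build_prompt(intro, sections, 150, random.Random(seed)) for seed in range(3)]
```

Each prompt could then be sent to the model in its own thread for suggestions, with a second pass asking the model to critique them, so that only the most promising ones are surfaced for review.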

Answer by plex Jan 04, 2023 60

From the outside, it looks like Tesla and SpaceX are doing an unusually good (though likely not perfect) job of resisting mazedom and staying in touch with physical reality. Things like having a rule where any employee can talk to Musk directly, and blocking that is a fireable offence, promoting self-management rather than layers of middle management, over-the-top work ethic as a norm to keep out people who don't actually want to build things, and checking for people who have solved hard technical problems in the hiring process.

I'd be interested to hear from any employees of those organizations how it is on the ground.

There's something extra which is closely connected with this: The time and motivation of central nodes is extremely valuable. Burning the resources of people who are in positions of power and aligned with your values is much more costly than burning other people's resources[1], and saving or giving them extra resources is proportionately more valuable.

People who have not been a central node generally underestimate how many good things those people would be able to unblock or achieve with a relatively small amount of their time/energy/social favors that sim...

Very excited by this agenda, was discussing my hope that someone finetunes LLMs on the alignment archive soon today!

Also, from MIT CSAIL and Meta: Gradient Descent: The Ultimate Optimizer

Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as its step size. Recent work has shown how the step size can itself be optimized alongside the model parameters by manually deriving expressions for "hypergradients" ahead of time.
We show how to automatically compute hypergradients with a simple and elegant modification to backpropagation. This allows us to easily apply the method to other optimizers and

...
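
For readers unfamiliar with hypergradients, here is a minimal sketch of the hand-derived approach that the abstract contrasts itself with, not the automatic method the paper proposes. It assumes plain SGD on a toy one-dimensional quadratic. Since w_t = w_{t-1} - alpha * g_{t-1}, the derivative of the current loss with respect to the step size is -g_t * g_{t-1}, so gradient descent on alpha adds kappa * g_t * g_{t-1}. The function names, the toy loss and the constants are illustrative choices.

```python
def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def hypergradient_sgd(w, alpha, kappa, steps):
    """SGD where the step size alpha is itself updated by gradient descent.

    kappa is the learning rate for alpha (the "hyper" step size).
    """
    prev_g = 0.0  # no previous gradient on the first step, so alpha is unchanged
    for _ in range(steps):
        g = grad(w)
        # d f(w_t) / d alpha = -g_t * g_{t-1}; descend on it.
        alpha += kappa * g * prev_g
        w -= alpha * g
        prev_g = g
    return w, alpha


w_final, alpha_final = hypergradient_sgd(w=0.0, alpha=0.01, kappa=0.001, steps=200)
```

Starting from a deliberately small step size, alpha grows while successive gradients point the same way, which is the behavior that makes hand-tuning less critical. Deriving that update by hand for every optimizer is the tedium the quoted paper aims to remove.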
1 Vael Gates 5mo
Thanks! (credit also to Collin :))

Yup, another instance of this is the longtermist census, that likely has the most entries but is not public. Then there's AI Safety Watch, the EA Hub (with the right filters), the mailing list of people who went through AGISF, I'm sure SERI MATS has one, other mailing lists like AISS's opportunities one, other training programs, student groups, people in various entries on

Yeah, there's some organizing to do. Maybe the EA forum's proposed new profile features will end up being the killer app?

I used Metaphor to find many entries on the upcoming map of AI safety.

Value alignment here means being focused on improving humanity's long term future by reducing existential risk, not other specific cultural markers (identifying as EA or rationalist, for example, is not necessary). Having people working towards the same goal seems vital for organizational cohesion, and I think alignment orgs would rightly not hire people who are not focused on trying to solve alignment. Upskilling people who are happy to do capabilities jobs without pushing hard internally for capabilities orgs to be more safety focused seems net negative.

These proposals all seem robustly good. What would you think of adding an annual AI existential safety conference?

3 Ryan Kidd 5mo
Currently, it seems that the AI Safety Unconference and maybe the Stanford Existential Risks Conference are filling this niche. I think neither of these conferences is optimizing entirely for the AI safety researcher experience; the former might aim to appeal to NeurIPS attendees outside the field of AI safety, and the latter includes speakers on a variety of x-risks. I think a dedicated AI safety conference/unconference detached from existing ML conferences might be valuable. I also think boosting the presence of AI safety at NeurIPS or other ML conferences might be good, but I haven't thought deeply about this.
4 Shoshannah Tekofsky 5mo
Oh my, this looks really great. I suspect between this and the other list of AIS researchers, we're all just taking different cracks at generating a central registry of AIS folk so we can coordinate at all different levels on knowing what people are doing and knowing who to contact for which kind of connection. However, maintaining such an overarching registry is probably a full time job for someone with high organizational and documentation skills.

Thanks for reporting, should be fixed :)

This seems like it could be useful!

As you've identified, databases of this kind often run into the problem of becoming outdated. For a similar project, EA houses, I wrote a script which automatically sorts the most recently edited entries to the top. If you allow people to edit the sheet, this allows people to easily bump their entry, ensuring that at least the most recent entries are still active. I'd be happy to implement this script on your sheet if you give the email I'm DMing to you access.
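
The bump-to-top behavior described above can be sketched in a few lines. This is an illustration only, not the script used for EA houses: it assumes each entry carries a last-edited timestamp, and the `last_edited` field, the `sort_by_last_edit` name and the example rows are hypothetical. Reading from and writing back to an actual spreadsheet is left out.

```python
from datetime import datetime


def sort_by_last_edit(entries):
    """Return entries with the most recently edited first.

    Entries without a timestamp are treated as oldest and go to the bottom.
    """
    def key(entry):
        ts = entry.get("last_edited")
        return (ts is not None, ts or datetime.min)

    return sorted(entries, key=key, reverse=True)


# Hypothetical rows from a community sheet.
entries = [
    {"name": "House A", "last_edited": datetime(2022, 3, 1)},
    {"name": "House B", "last_edited": None},
    {"name": "House C", "last_edited": datetime(2022, 9, 15)},
]
ordered = [e["name"] for e in sort_by_last_edit(entries)]
```

With this ordering, anyone who touches their own row moves it to the top, so stale entries sink without anyone having to delete them.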

I also suggest that anyone filling out this sheet check out aisa...

3 Shoshannah Tekofsky 5mo
Great idea! So my intuition is that letting people edit a file that is publicly linked is inviting a high probability of undesirable results (like accidental wipes, unnoticed changes to the file, etc). I'm open to looking into this if the format gains a lot of traction and people find it very useful. For the moment, I'll leave the file as-is so no one's entry can be accidentally affected by someone else's edits. Thank you for the offer though!
The Slack invite seems to no longer work

Yes, seems correct, it's been a merge candidate for some time.

I've had >50% hit rate for "this person now takes AI x-risk seriously after one conversation" from people at totally non-EA parties (subculturally alternative/hippie-ish, in not particularly tech-y parts of the UK). I think it's mostly about having a good pitch (but not throwing it at them until there is some rapport, ask them about their stuff first), being open to their world, modeling their psychology, and being able to respond to their first few objections clearly and concisely in a way they can frame within their existing world-model.

Edit: Since I've...

This series of posts has clarified a set of thoughts that I've been exploring recently. I think this formulation of distributed agency might end up being critical to understanding the next stages of AI.

I encourage anyone who is confused by this framing to look into the Free Energy Principle framework.

Connor from Conjecture made a similar point about GPT-3:

I'll raise you [].

This seems like critical work for the most likely path to an existential win that I can see. Keep it up!

Thanks! More to come!

I've personally found intentionally feeding attention / strength to a part which can play the mediator / meta-part role extremely valuable, and have seen it spark changes in people with otherwise serious, long-standing, treatment-resistant internal tangles. Not everyone has a part with enough neural weight and trust of their system to navigate the most severe internal conflicts, and encouraging them to empower some part of themselves can be transformative.

Seems like a good high level structure for internal work.

Some editing notes:

Theoretically, one could use techniques which build on ICF as tools to coerce some parts, or help parts to achieve their goals at the expense of others. We think that doing this is highly prone to bad consequences for you in aggregate, and strongly discourage practising ICF techniques if this is what you want to do with them.

This paragraph is repeated a few lines down.

secret unblended part which has a lot of power over the power over the council,

Power over repeats.

The process of un

...
Thanks for spotting these; I've made the changes!


  • Simplified columns (link is now in the name of each entry, rather than a separate column, and made the names bold)
  • Reordered to put some of the entries added by volunteers nearer the top (unknown people are adding stuff! excellent!)
  • Moved one now-inactive project to the bottom
  • Added a note about EA Gather Town's invite link breaking every month, and to go via the Discord instead.
  • Updated icon
  • Added note about who to talk to for one invite-only Slack
  • Successfully applied for $1k credit for the software, which means we have edit histories enabled on this and all docs for about three years for free