Nice! Glad to see more funding options entering the space, and excited to see the S-process rolled out to more grantmakers.
Added you to the map of AI existential safety:
One thing which might be nice as part of improving the grantee experience would be the ability to submit applications as a Google Doc (with a template which gives the sections you list) rather than just by form. This increases re-usability and decreases stress, as it's easy to make updates later on so it's less of a worry that you ended up missing something crucial. Might be more hassle than it'...
Yeah, I do think there are a bunch of benefits to doing things in Google Docs, though it is often quite useful to have more structured data on the evaluation side.
This increases re-usability and decreases stress, as it's easy to make updates later on so it's less of a worry that you ended up missing something crucial.
You can actually update your application any time. When you submit the application you get a link that allows you to edit your submission as well as see its current status in the evaluation process. Seems like we should signpost this better.
that was me for context:
core claim seems reasonable and worth testing, though I'm not very hopeful that it will reliably scale through the sharp left turn
my guess is the intuitions don't hold in the new domain, and radical superintelligence requires intuitions that you can't develop on relatively weak systems, but it's a source of data for our intuition models which might help with other stuff, so it seems reasonable to attempt.
Meta's previous LLM, OPT-175B, seemed good by benchmarks but was widely agreed to be much, much worse than GPT-3 (not even necessarily better than GPT-NeoX-20B). It's an informed guess, not a random dunk, and does leave open the possibility that they've turned it around and have a great model this time rather than something which goodharts the benchmarks.
This is a Heuristic That Almost Always Works, and it's the one most likely to cut off our chances of solving alignment. Almost all clever schemes are doomed, but if we as a community let that meme stop us from assessing the object level question of how (and whether!) each clever scheme is doomed then we are guaranteed not to find one.
Security mindset means "look for flaws", not "assume all plans are so doomed you don't need to look".
If this is, in fact, a utility function which if followed would lead to a good future, that is concrete progress and lays out a n...
I don't think security mindset means "look for flaws." That's ordinary paranoia. Security mindset is something closer to "you better have a really good reason to believe that there aren't any flaws whatsoever." My model is something like "A hard part of developing an alignment plan is figuring out how to ensure there aren't any flaws, and coming up with flawed clever schemes isn't very useful for that. Once we know how to make robust systems, it'll be more clear to us whether we should go for melting GPUs or simulating researchers or whatnot."
That said, I ...
Not inspired by them, no. Those did not have, as far as I'm aware, a clear outlet for use of the outputs. We have a whole platform we've been building towards for three years (starting on the FAQ long before those contests), and, thanks to Rob Miles, the ability to point large numbers of people at that platform once it has great content.
As I said over on your Discord, this feels like it has a shard of hope, and the kind of thing that could plausibly work if we could hand AIs utility functions.
I'd be interested to see the explicit breakdown of the true names you need for this proposal.
Agreed, incentives probably block this from being picked up by megacorps. At one point, when Musk was talking about bots a lot, I had thought to try and get Twitter to adopt it; it would be very effective, but it doesn't allow rent extraction in the same way as the solution he settled on (paid Twitter Blue).
Websites which have the slack to let users improve their experience even if it costs engagement might be better adopters; LessWrong has shown they will do this with e.g. batching karma notifications daily by default to avoid dopamine addiction.
Hypothesis #2: These bits of history are wrong for reasons you can check with simpler learned structures.
Maybe these historical patterns are easier to disprove with simple exclusions, like "these things were in different places"?
Yeah, my guess is if you use really niche and plausible-sounding historical examples it is much more likely to hallucinate.
Maybe the agent that RLHF selects for expects the person giving feedback to correct it on the history example, but not to know that the latter example is false. If you asked a large sample of humans, more would be able to confidently say the first example is false than the latter one.
Hell yeah!
This matches my internal experience that caused me to bring a ton of resources into existence in the alignment ecosystem (with various collaborators):
I've kept updating in the direction of do a bunch of little things that don't seem blocked/tangled on anything even if they seem trivial in the grand scheme of things. In the process of doing those you will free up memory and learn a bunch about the nature of the bigger things that are blocked while simultaneously revving your own success spiral and action-bias.
This is some great advice. Especially 1 and 2 seem foundational for anyone trying to reliably shift the needle by a notable amount in the right direction.
My favorite frame is based on In The Future Everyone Will Be Famous To Fifteen People. If we as a civilization pass this test, we who lived at the turn of history will be outnumbered trillions to one, and the future historians will construct a pretty good model of how everyone contributed. We'll get to read about it, if we decide that that's part of our personal utopia.
I'd like to be able to look back from eternity and see that I shifted things a little in the right direction. That perspective helps defuse some of the local status-seeking drives, I think.
I can verify that the owner of the blaked[1] account is someone I have known for a significant amount of time, that he is a person with a serious, long-standing concern with AI safety (and all other details verifiable by me fit), and that based on the surrounding context I strongly expect him to have presented the story as he experienced it.
This isn't a troll.
(also I get to claim memetic credit for coining the term "blaked" for being affected by this class of AI persuasion)
Agree with Jim, and suggest starting with some Rob Miles videos. The Computerphile ones, and those on his main channel, are a good intro.
Nice! Would you be up for putting this in the aisafety.info Google Drive folder too, with a question-shaped title?
Yes, this is a robustly good intervention on the critical path. Have had it on the Alignment Ecosystem Development ideas list for most of a year now.
...Some approaches to solving alignment go through teaching ML systems about alignment and getting research assistance from them. Training ML systems needs data, but we might not have enough alignment research to sufficiently fine tune our models, and we might miss out on many concepts which have not been written up. Furthermore, training on the final outputs (AF posts, papers, etc) might be less good at capturin
Updates!
Still getting consistent traffic, happy to see it getting used :)
I do things, such as aisafety.community, aisafety.info, aisafety.world, aisafety.training, ea.domains, alignment.dev and about a dozen others. Come hang out on Alignment Ecosystem Development.
fwiw that strange philosophical bullet fits remarkably well with a set of thoughts I had while reading Anthropic Bias about 'amount of existence' being the fundamental currency of reality (a bunch of the anthropic paradoxes felt like they were showing that if you traded sufficiently large amounts of "patterns like me exist more" you could get counterintuitive results, like bending the probabilities of the world around you without any causal pathway), so infraBayes requiring something like this actually updated me a little towards infraBayes being on the right track.
My...
I classified the first as Outer misalignment, and the second as Deceptive outer misalignment, before reading on.
I agree with
Another use of the terms “outer” and “inner” is to describe the situation in which an “outer” optimizer like gradient descent is used to find a learned model that is itself performing optimization (the “inner” optimizer). This usage seems fine to me.
being the worthwhile use of the term inner alignment as opposed to the ones you argue against, and could imagine that the term is being blurred and used in less helpful ways by many ...
For developing my hail mary alignment approach, the dream would be to be able to load enough of the context of the idea into a LLM that it could babble suggestions (since the whole doc won't fit in the context window, maybe randomizing which parts beyond the intro are included for diversity?), then have it self-critique those suggestions automatically in different threads in bulk and surface the most promising implementations of the idea to me for review. In the perfect case I'd be able to converse with the model about the ideas and have that be not totally useless, and pump good chains of thought back into the fine-tuning set.
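For concreteness, here's a rough sketch of the loop I'm imagining. Everything in it is a placeholder: ask_llm stands in for whatever chat API you'd actually call, parse_score is a crude scoring heuristic, and the sampling numbers are arbitrary.

```python
import random
import re

def ask_llm(prompt: str) -> str:
    # Placeholder: wire this up to whichever chat-completion API you use.
    raise NotImplementedError

def sample_context(intro: str, sections: list[str], k: int = 3) -> str:
    # Always include the intro, plus a random subset of the later sections,
    # so repeated calls see different slices of a doc too big for the context window.
    return "\n\n".join([intro] + random.sample(sections, k=min(k, len(sections))))

def parse_score(critique: str) -> float:
    # Crude: take the last number in the critique as its promisingness score.
    numbers = re.findall(r"\d+(?:\.\d+)?", critique)
    return float(numbers[-1]) if numbers else 0.0

def babble_and_critique(intro: str, sections: list[str],
                        n_candidates: int = 20, n_keep: int = 3):
    scored = []
    for _ in range(n_candidates):
        context = sample_context(intro, sections)
        suggestion = ask_llm(
            f"{context}\n\nPropose one concrete way to implement this idea."
        )
        # Critique in a separate call so the model isn't anchored on defending its own babble.
        critique = ask_llm(
            f"Critically evaluate this proposal:\n\n{suggestion}\n\n"
            "End with a promisingness score from 0 to 10."
        )
        scored.append((parse_score(critique), suggestion, critique))
    # Surface only the most promising candidates for human review.
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:n_keep]
```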
From the outside, it looks like Tesla and SpaceX are doing an unusually good (though likely not perfect) job of resisting mazedom and staying in touch with physical reality. Things like having a rule where any employee can talk to Musk directly, and blocking that is a fireable offence, promoting self-management rather than layers of middle management, over-the-top work ethic as a norm to keep out people who don't actually want to build things, and checking for people who have solved hard technical problems in the hiring process.
I'd be interested to hear from any employees of those organizations how it is on the ground.
There's something extra which is closely connected with this: The time and motivation of central nodes is extremely valuable. Burning the resources of people who are in positions of power and aligned with your values is much more costly than burning other people's resources[1], and saving or giving them extra resources is proportionately more valuable.
People who have not been a central node generally underestimate how many good things those people would be able to unblock or achieve with a relatively small amount of their time/energy/social favors that sim...
Very excited by this agenda; just today I was discussing my hope that someone finetunes LLMs on the alignment archive soon!
Also, from MIT CSAIL and Meta: Gradient Descent: The Ultimate Optimizer
...Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as its step size. Recent work has shown how the step size can itself be optimized alongside the model parameters by manually deriving expressions for "hypergradients" ahead of time.
We show how to automatically compute hypergradients with a simple and elegant modification to backpropagation. This allows us to easily apply the method to other optimizers and
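For intuition, here's a toy numpy sketch of the hand-derived step-size hypergradient for plain SGD that the abstract describes as prior work (the paper's contribution is computing such hypergradients automatically inside backprop). The quadratic loss and all the constants here are arbitrary.

```python
# Toy sketch: adapt SGD's step size using the hand-derived hypergradient
# d f(w_t)/d alpha = -grad_t . grad_{t-1}, on an arbitrary quadratic loss.
import numpy as np

def grad(w):
    # Gradient of f(w) = 0.5 * ||w||^2
    return w

w = np.array([5.0, -3.0])     # model parameters
alpha = 0.01                  # step size, itself being optimized
beta = 0.001                  # "hyper" step size for updating alpha
prev_grad = np.zeros_like(w)

for _ in range(100):
    g = grad(w)
    hypergrad = -np.dot(g, prev_grad)  # how the previous alpha choice affected the current loss
    alpha -= beta * hypergrad          # gradient step on the step size
    w -= alpha * g                     # ordinary SGD step with the adapted step size
    prev_grad = g

print(f"final loss {0.5 * np.dot(w, w):.6f}, adapted alpha {alpha:.4f}")
```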
Yup, another instance of this is the longtermist census, which likely has the most entries but is not public. Then there's AI Safety Watch, the EA Hub (with the right filters), the mailing list of people who went through AGISF, I'm sure SERI MATS has one, other mailing lists like AISS's opportunities one, other training programs, student groups, people in various entries on aisafety.community...
Yeah, there's some organizing to do. Maybe the EA forum's proposed new profile features will end up being the killer app?
Value alignment here means being focused on improving humanity's long term future by reducing existential risk, not on other specific cultural markers (identifying as EA or rationalist, for example, is not necessary). Having people working towards the same goal seems vital for organizational cohesion, and I think alignment orgs would rightly not hire people who are not focused on trying to solve alignment. Upskilling people who are happy to do capabilities jobs without pushing hard internally for capabilities orgs to be more safety focused seems net negative.
These proposals all seem robustly good. What would you think of adding an annual AI existential safety conference?
Yeah, that is a risk.
Have you checked out ASAP? Seems pretty related
https://airtable.com/shrhjo857neCToCNW/tblXj7gik84xGIZly/viwaKxHhBEmIyEcSr?blocks=hide
https://airtable.com/shrhjo857neCToCNW/tblXj7gik84xGIZly/viwB4nnuzhGLAEONY?blocks=hide
https://asap-homepage.notion.site/asap-homepage/Home-b38ba079d3dd4d258baa7cd1ae4eb68f
This seems like it could be useful!
As you've identified, databases of this kind often run into the problem of becoming outdated. For a similar project, EA houses, I wrote a script which automatically sorts the most recently edited entries to the top. If you allow people to edit the sheet, this lets people easily bump their entry, ensuring that at least the most recent entries are still active. I'd be happy to implement this script on your sheet if you give edit access to the email address I'm DMing you.
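For anyone wanting to replicate the idea on their own sheet, here's roughly what it looks like as a Python sketch using gspread (not my actual script; the sheet name, credentials path, and column layout are made-up placeholders, and it assumes a "Last updated" timestamp column that gets filled in on edits):

```python
# Sketch: float the most recently edited rows to the top of a Google Sheet.
# Assumes a "Last updated" timestamp column (here column 5) that editors or an
# on-edit trigger keep filled in; sheet name and credentials are placeholders.
import gspread

gc = gspread.service_account(filename="credentials.json")
worksheet = gc.open("YOUR SHEET NAME").sheet1

LAST_UPDATED_COL = 5        # 1-indexed column holding the last-edited timestamp
DATA_RANGE = "A2:Z1000"     # data rows only, so the header stays put

# Sort newest-first, so recently bumped entries surface and stale ones sink.
worksheet.sort((LAST_UPDATED_COL, "des"), range=DATA_RANGE)
```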
I also suggest anyone filling out this sheet to check out aisa...
Christmas wiki editing! Lots of new important tags in the past year.
I added:
Gradient Hacking
Symbol Grounding
AI Success Models
AXRP
Agent Foundations
Eliciting Latent Knowledge (ELK)
Anthropic
Conjecture (org)
AI Persuasion
DALL-E
Multipolar Scenarios
AI Risk Concrete Stories
AI Alignment Fieldbuilding
SERI MATS
AI Safety Public Materials
AI-assisted Alignment
Encultured AI (org)
Shard Theory
Sharp Left Turn
Deceptive Alignment
Apart Research
AI Questions Open Thread
Power Seeking (AI)
AI Alignment Intro Ma...
I've had >50% hit rate for "this person now takes AI x-risk seriously after one conversation" from people at totally non-EA parties (subculturally alternative/hippieish, in not particularly tech-y parts of the UK). I think it's mostly about having a good pitch (but not throwing it at them until there is some rapport; ask them about their stuff first), being open to their world, modeling their psychology, and being able to respond to their first few objections clearly and concisely in a way they can frame within their existing world-model.
Edit: Since I've...
This seems like critical work for the most likely path to an existential win that I can see. Keep it up!
I've personally found intentionally feeding attention / strength to a part which can play the mediator / meta-part role extremely valuable, and have seen it spark changes in people with otherwise serious, long-standing, treatment-resistant internal tangles. Not everyone has a part with enough neural weight and trust of their system to navigate the most severe internal conflicts, and encouraging them to empower some part of themselves can be transformative.
Seems like a good high level structure for internal work.
Some editing notes:
Theoretically, one could use techniques which build on ICF as tools to coerce some parts, or help parts to achieve their goals at the expense of others. We think that doing this is highly prone to bad consequences for you in aggregate, and strongly discourage practising ICF techniques if this is what you want to do with them.
This paragraph is repeated a few lines down.
secret unblended part which has a lot of power over the power over the council,
"Power over" repeats.
...The process of un
Updates:
Your probabilities are not independent; your estimates mostly flow from a world model which seems to me to be flatly and clearly wrong.
The plainest examples seem to be assigning
despite current models learning vastly faster than humans (training time of LLMs is not a human lifetime, and covers vastly more data), current systems nearing AGI, and inference being dramatically cheaper and plummeting with algorithmic improvements. There is a general...