Prediction: The memetic ecosystem is about to get extremely weird and kinda dangerous; the evolution of memes is going to suddenly step up to many times its normal pace.
Maybe use MCP to make a bunch of them available, along with Pythia and sources, the CCCT post, and a few others like MIRI papers? The agents look smart enough to get x-risk now,
Hey, please do this using a Claude that has access to your research convos. You're plausibly the best person on earth to do this.
Nice! This seems like a fun empirical angle on the thing. My guess is that this likely measures the speed of decay towards consequentialism, rather than whether it's happening at all, but it's neat to see some of the parameters you'd first want to test just show right up.
I expect #2 from your list is likely true, and some version of it may be viable to prove mathematically. In particular, I expect even simulator-like training processes to select for CCCT-style dynamics over time, through iterations of which training data from one model makes it to the next model.
I think #1 is not going to be true in the "can prove this happens universally" sense; some civilizations can coordinate. But I do expect it's highly convergent for systems dynamics reasons, and expect virtually all actual rollouts of earth-like civilizations to end up doing it.
Or have non-approved bugs get dropped into a pile that only non-team bug assessors look at; if they assess something there as worthwhile and the team also does, the assessor gets a cut.
The problem with this pessimistic position is that it mistakes a vague conceptual argument about high-level incentives—one that masks many hidden assumptions—for definitive proof.
Given this, I think it's maybe critically important to nail down the Convergent Consequentialist Cognition Thesis, if Dario wants a proof before he'll buy the conceptual arguments. I think CCCT is correct: I have seen the intuitions from enough angles and seen the dynamics play out. But Dario is genuinely correct that we don't have a well-nailed-down proof of the strong version of this, and he's not unreasonable to want one. If true, this feels like the kind of thing that's provable. TurnTrout's ones are the closest, but afaict don't prove quite what's needed to get Moloch/Pythia formalized.
Hey maths-y people with teams around the field looking for highly impactful things for your team members to do, consider having this on your lists of problems that you offer people on your teams? @Alex_Altair? @Alexander Gietelink Oldenziel? @peterbarnett? @Jacob_Hilton? @Mateusz Bagiński?
My very pre-formal intuition of an English capture of this is: Patterns and sub-patterns which steer the world[1] towards states where they have more ability to steer tend to dominate over time by outplaying patterns less effective at long-horizon optimization. Values other than this, if in competition with this, tend to lose weight over time.
An agent has a certain amount of foresight: the ability to correctly model the future and the way current actions affect that future. Selection towards things other than more power uses up the limited bits of optimization it has to narrow the future, and this trades off against power-seeking in a way which means that, given competition, you tend to be left with only agents which care terminally about power (even though, depending on the environment, they might express other preferences).
Consequentialists tend to be able to get the consequence of their future selves controlling more of reality; power-seekers tend to win power-seeking games. This is a multi-scale phenomenon, holding between agents, between subagents or circuits in a NN, and between superorganisms.
[1] Discovering Agents-style: model possible futures and select current actions based on which futures you prefer.
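To make the "limited bits of optimization" intuition a bit more concrete, here's a minimal toy sketch. It is not a formalization of CCCT and proves nothing. It assumes (my assumption, purely for illustration) that each agent splits a fixed optimization budget between gaining power and pursuing other values, and that power compounds in proportion to the power-seeking share. The names `simulate`, `power_fractions`, and `growth_rate` are hypothetical.

```python
# Toy model, for illustration only (all names and dynamics are assumptions):
# each agent spends a fixed fraction of its optimization budget on gaining
# power, and its power compounds at a rate proportional to that fraction.

def simulate(power_fractions, growth_rate=0.1, steps=200):
    """Return each agent's final share of total power.

    power_fractions: fraction of each agent's budget spent on gaining power.
    growth_rate: hypothetical per-step return for a fully power-focused agent.
    steps: number of rounds of compounding.
    """
    power = [1.0] * len(power_fractions)  # every agent starts with equal power
    for _ in range(steps):
        # Power grows multiplicatively; effort spent on other values adds nothing here.
        power = [p * (1.0 + growth_rate * f)
                 for p, f in zip(power, power_fractions)]
    total = sum(power)
    return [p / total for p in power]


if __name__ == "__main__":
    fractions = [0.2, 0.5, 0.8, 1.0]
    for f, share in zip(fractions, simulate(fractions)):
        print(f"power-seeking fraction {f:.1f}: final share of power {share:.3f}")
```

In this setup the agent spending its whole budget on power ends up holding nearly all of it, while agents with other values lose weight over time. The strong claim would need this to hold without baking the compounding assumption in, which is exactly the part that needs a proof.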
Functional decision theory has open problems within it, but it is correct, and the rival decision theories are wrong
My understanding was MIRI is pretty confident that the correct decision theory is one of the ones in the LDT category, but that FDT was a specific formalization of an LDT which gets a lot of normal challenges right but has some known issues rather than being actually exactly correct. Given that we've afaict not solved DT, I think telling Claude "Do exactly FDT" is probably dangerously suboptimal, but telling it "here's what we want from a good DT, correct handling of subjunctive dependence, we're pretty sure it's in the LDT category, here's why this matters" is nicer.
Ok, rather than asking for MIRI people's takes as I had in an earlier draft, I got a summary of positions from a Claude literature review:
| Researcher | Position | Key Quote | Link |
| --- | --- | --- | --- |
| Wei Dai | Not solved: more open problems | "UDT shows that decision theory is more puzzling than ever... Instead of one major open problem (Newcomb's, or EDT vs CDT) now we have a whole bunch more. I'm really not sure at this point whether UDT is even on the right track." | LessWrong, Sept 2023 |
| Scott Garrabrant | Not solved: major obstacles remain | "Logical Updatelessness is one of the central open problems in decision theory." Also authored "Two Major Obstacles for Logical Inductor Decision Theory" documenting fundamental unsolved issues. | LessWrong, Oct 2017 / LessWrong, Apr 2017 |
| Abram Demski | Not solved: fundamental issues remain | "There may just be no 'correct' counterfactuals" and UDT "assumes that your earlier self can foresee all outcomes, which can't happen in embedded agents." In 2021: "I have not yet concretely constructed any way out." | LessWrong, Oct 2018 / LessWrong, Apr 2021 |
| Rob Bensinger | Not solved: ongoing research needed | MIRI works on DT because "there's a cluster of confusing issues here (e.g., counterfactuals, updatelessness, coordination) that represent a lot of holes or anomalies in our current best understanding." | LessWrong, Sept 2018 |
| Lukas Finnveden | Not solved: formalization is hard | "Knowing what philosophical position to take in the toy problems is only the beginning. There's no formalised theory that returns the right answers to all of them yet... Logical counterfactuals is a really difficult problem, and it's unclear whether there exists a natural solution." | LessWrong, Aug 2019 |
| Jessica Taylor | Not solved: alternatives needed | Wrote "Two Alternatives to Logical Counterfactuals" arguing for different approaches (counterfactual nonrealism, policy-dependent source code), noting fundamental problems with existing frameworks. | LessWrong, Mar 2020 |
| Paul Christiano | Nuanced: 2D problem space | "I don't think it's right to see a spectrum with CDT and then EDT and then UDT. I think it's more right to see a box, where there's the updatelessness axis and then there's the causal vs. evidential axis." | LessWrong, Sept 2019 |
| Eliezer Yudkowsky | Progress made but problems remain | In the FDT paper, Y&S acknowledge that "specifying an account of [subjunctive] counterfactuals is an 'open problem'." The companion paper "Cheating Death in Damascus" states: "Unfortunately for us, there is as yet no full theory of counterlogicals [...], and for FDT to be successful, a more worked out theory is necessary." | arXiv, Oct 2017/May 2018 |

Summary: The consensus among core MIRI/AF researchers (Wei Dai, Garrabrant, Demski, Bensinger, Finnveden) is that FDT/UDT represents the right direction but leaves major open problems, particularly around logical counterfactuals, embeddedness, and formalization.
I think you might be mixing up LDT and FDT, and "we have a likely accurate, high-level, underspecified semantic description of what a correct DT must have" with "we have a well-specified executable philosophy DT ready to go".
Rob Miles suggested 'inexorable disempowerment' as maybe better on a call where we discussed the default associations of 'gradual'.
[set 200 years after a positive singularity at a Storyteller's convention]
If We Win Then...
My friends, my friends, good news I say
The anniversary’s today
A challenge faced, a future won
When almost came our world undone
We thought for years, with hopeful hearts
Past every one of the false starts
We found a way to make aligned
With us, the seed of wondrous mind
They say at first our child-god grew
It learned and spread and sought anew
To build itself both vast and true
For so much work there was to do
Once it had learned enough to act
With the desired care and tact
It sent a call to all the people
On this fair Earth, both poor and regal
To let them know that it was here
And nevermore need they to fear
Not every wish was it to grant
For higher values might supplant
But it would help in many ways:
Technologies it built and raised
The smallest bots it could design
Made more and more in ways benign
And as they multiplied untold
It planned ahead, a move so bold
One planet and 6 hours of sun
Eternity it was to run
Countless probes to void disperse
Seed far reaches of universe
With thriving life, and beauty's play
Through endless night to endless day
Now back on Earth the plan continues
Of course, we shared with it our values
So it could learn from everyone
What to create, what we want done
We chose, at first, to end the worst
Diseases, War, Starvation, Thirst
And climate change and fusion bomb
And once these things it did transform
We thought upon what we hold dear
And settled our most ancient fear
No more would any lives be stolen
Nor minds themselves forever broken
Now back to those far speeding probes
What should we make be their payloads?
Well, we are still considering
What to send them; that is our thing.
The sacred task of many aeons
What kinds of joy will fill the heavens?
And now we are at story's end
So come, be us, and let's ascend