Theoretical Computer Science Msc student at the University of [Redacted] in the United Kingdom.
I'm an aspiring alignment theorist; my research vibes are descriptive formal theories of intelligent systems (and their safety properties) with a bias towards constructive theories.
I think it's important that our theories of intelligent systems remain rooted in the characteristics of real world intelligent systems; we cannot develop adequate theory from the null string as input.
In fact, I'm pretty sure that's how humans work most of the time. We use the general-intelligence machinery to "steer" ourselves at a high level, and most of the time, we operate on autopilot.
Yeah, I agree with this. But I don't think the human system aggregates into any kind of coherent total optimiser. Humans don't have an objective function (not even approximately?).
A human is not well modelled as a wrapper mind; do you disagree?
Thus, any greedy optimization algorithm would convergently shape its agent to not only pursue , but to maximize for 's pursuit — at the expense of everything else.
Conditional on:
I'm pretty sceptical of #2. I'm sceptical that systems that perform inference via direct optimisation over their outputs are competitive in rich/complex environments.
Such optimisation is very computationally intensive compared to executing learned heuristics, and it seems likely that the selection process would have access to much more compute than the selected system.
See also: "Consequentialism is in the Stars not Ourselves".
Do please read the post. Being able to predict human text requires vastly superhuman capabilities, because predicting human text requires predicting the processes that generated said text. And large tracts of text are just reporting on empirical features of the world.
Alternatively, just read the post I linked.
It is not clear how they could ever develop strongly superhuman intelligence by being superhuman at predicting human text.
which is indifferent to the simplicify of the architecture the insight lets you find.
The bolded should be "simplicity".
Sorry, please where can I get access to the curriculum (including the reading material and exercises) if I want to study it independently?
The chapter pages on the website doesn't seem to list full curricula.
If you define your utility function over histories, then every behaviour is maximising an expected utility function no?
Even behaviour that is money pumped?
I mean you can't money pump any preference over histories anyway without time travel.
The Dutchbook arguments apply when your utility function is defined over your current state with respect to some resource?
I feel like once you define utility function over histories, you lose the force of the coherence arguments?
What would it look like to not behave as if maximising an expected utility function for a utility function defined over histories.
Bolded should be sufficient.