Theoretical Computer Science MSc student at the University of [Redacted] in the United Kingdom.
I'm an aspiring alignment theorist; my research vibes are descriptive formal theories of intelligent systems (and their safety properties) with a bias towards constructive theories.
I think it's important that our theories of intelligent systems remain rooted in the characteristics of real world intelligent systems; we cannot develop adequate theory from the null string as input.
We aren’t offering these criteria as necessary for “knowledge”—we could imagine a breaker proposing a counterexample where all of these properties are satisfied but where intuitively M didn’t really know that A′ was a better answer. In that case the builder will try to make a convincing argument to that effect.
The bolded criteria should be sufficient.
In fact, I'm pretty sure that's how humans work most of the time. We use the general-intelligence machinery to "steer" ourselves at a high level, and most of the time, we operate on autopilot.
Yeah, I agree with this. But I don't think the human system aggregates into any kind of coherent total optimiser. Humans don't have an objective function (not even approximately?).
A human is not well modelled as a wrapper mind; do you disagree?
Thus, any greedy optimization algorithm would convergently shape its agent to not only pursue its goal, but to maximize for that goal's pursuit — at the expense of everything else.
Conditional on:
I'm pretty sceptical of #2. I'm sceptical that systems that perform inference via direct optimisation over their outputs are competitive in rich/complex environments.
Such optimisation is very computationally intensive compared to executing learned heuristics, and it seems likely that the selection process would have access to much more compute than the selected system.
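The compute asymmetry can be illustrated with a toy sketch (a hypothetical example of mine, not from the linked post): direct optimisation re-runs a search over the output space on every inference call, whereas a learned heuristic amortises that search into a cached policy paid for once by the selection process.

```python
# Toy setting: choose a 10-bit action maximising a reward function.
def reward(action):
    # Hypothetical reward: bits matching a fixed target pattern.
    target = 0b1011001110
    return 10 - bin(action ^ target).count("1")

# Direct optimisation: search all 2**10 outputs at inference time.
def optimise():
    return max(range(1024), key=reward)

# Learned heuristic: the "selection process" (here, one offline search)
# amortises the cost into a cached policy; inference is a cheap lookup.
CACHED_POLICY = max(range(1024), key=reward)  # paid once, "at training time"
def heuristic():
    return CACHED_POLICY

assert optimise() == heuristic()  # same behaviour on this task...
# ...but optimise() re-pays the full search cost on every single call,
# and that cost grows exponentially with the size of the output space.
```

In richer environments the output space is far too large to search exhaustively, which is the point of the comment above: the per-inference cost of direct optimisation scales with the environment, while the heuristic's does not.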
See also: "Consequentialism is in the Stars not Ourselves".
Do please read the post. Being able to predict human text requires vastly superhuman capabilities, because predicting human text requires predicting the processes that generated said text. And large tracts of text are just reporting on empirical features of the world.
Alternatively, just read the post I linked.
It is not clear how they could ever develop strongly superhuman intelligence by being superhuman at predicting human text.
It's working now!
https://podcasts.google.com/feed/aHR0cHM6Ly9heHJwb2RjYXN0LmxpYnN5bi5jb20vcnNz/episode/ODVlM2RkNmItMTdkZi00MWYwLTg2YjAtOWIxY2JkOTBlYjgw?ep=14