We aren’t offering these criteria as necessary for “knowledge”—we could imagine a breaker proposing a counterexample where all of these properties are satisfied but where intuitively M didn’t really know that A′ was a better answer. In that case the builder will try to make a convincing argument to that effect.
The bolded text should be sufficient.
In fact, I'm pretty sure that's how humans work most of the time. We use the general-intelligence machinery to "steer" ourselves at a high level, and most of the time, we operate on autopilot.
Yeah, I agree with this. But I don't think the human system aggregates into any kind of coherent total optimiser. Humans don't have an objective function (not even approximately?).
A human is not well modelled as a wrapper mind; do you disagree?
Thus, any greedy optimization algorithm would convergently shape its agent to not only pursue its goal, but to maximize for that goal's pursuit — at the expense of everything else.
Conditional on:
I'm pretty sceptical of #2. I'm sceptical that systems that perform inference via direct optimisation over their outputs are competitive in rich/complex environments.
Such o...
Do please read the post. Being able to predict human text requires vastly superhuman capabilities, because predicting human text requires predicting the processes that generated said text. And large tracts of text are just reporting on empirical features of the world.
Alternatively, just read the post I linked.
It is not clear how they could ever develop strongly superhuman intelligence by being superhuman at predicting human text.
which is indifferent to the simplicify of the architecture the insight lets you find.
The bolded should be "simplicity".
Sorry, where can I get access to the curriculum (including the reading material and exercises) if I want to study it independently?
The chapter pages on the website don't seem to list full curricula.
If you define your utility function over histories, then isn't every behaviour maximising some expected utility function?
Even behaviour that is money pumped?
I mean, you can't money pump any preference over histories anyway without time travel.
The Dutch book arguments apply when your utility function is defined over your current state with respect to some resource?
I feel like once you define utility function over histories, you lose the force of the coherence arguments?
What would it look like to not behave as if maximising an expected utility function, for a utility function defined over histories?
My contention is that I don't think the preconditions hold.
Agents don't fail to be VNM coherent by having incoherent preferences given the axioms of VNM. They fail to be VNM coherent by violating the axioms themselves.
Completeness is wrong for humans, and with incomplete preferences you can be non-exploitable even without admitting a single fixed utility function over world states.
I'm not at all convinced that "strong agents pursuing a coherent goal" is a viable form for generally capable systems that operate in the real world, and the assumption that it is hasn't been sufficiently motivated.
What are the best arguments that expected utility maximisers are adequate (descriptive if not mechanistic) models of powerful AI systems?
[I want to address them in my piece arguing the contrary position.]
If you're not vNM-coherent you will get Dutch-booked if there are Dutch-bookers around.
This especially applies to multipolar scenarios with AI systems in competition.
I have an intuition that this also applies in degrees: if you are more vNM-coherent than I am (which I think I can define), then I'd guess that you can Dutch-book me pretty easily.
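The classic money-pump argument gestured at above can be made concrete with a toy sketch (illustrative numbers and names only, not a claim about any particular agent design):

```python
# An agent with cyclic preferences A < B < C < A will pay a small fee
# for each "upgrade", and a Dutch-booker can walk it around the cycle
# indefinitely, extracting money while leaving it where it started.

prefers = {("B", "A"), ("C", "B"), ("A", "C")}  # cyclic: A < B < C < A

def accepts_trade(have: str, offered: str) -> bool:
    # The agent accepts any trade up its (intransitive) preference order.
    return (offered, have) in prefers

holding, money, fee = "A", 100.0, 1.0
for offered in ["B", "C", "A"] * 3:  # walk the cycle three times
    if accepts_trade(holding, offered):
        holding, money = offered, money - fee

print(holding, money)  # back to "A", nine fees poorer: 91.0
```

A VNM-coherent agent (with complete, transitive preferences) refuses at least one trade in any such cycle, which is the sense in which coherence is exactly non-exploitability here.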
The solution is IMO just to consider the number of computations performed per generated token as some function of the model size, and once we've identified a suitable asymptotic order on the function, we can say intelligent things like "the smallest network capable of solving a problem in complexity class C of size N is X".
Or if our asymptotic bounds are not tight enough:
"No economically feasible LLM can solve problems in complexity class C of size >= N".
(Where economically feasible may be something defined by aggregate global economic resources or similar, depending on how tight you want the bound to be.)
Regardless, we can still obtain meaningful impossibility results.
Very big caveat: the LLM doesn't actually perform O(1) computations per generated token.
The number of computational steps performed per generated token scales with network size: https://www.lesswrong.com/posts/XNBZPbxyYhmoqD87F/llms-and-computation-complexity?commentId=QWEwFcMLFQ678y5Jp
Strongly upvoted.
Short but powerful.
Tl;Dr: LLMs perform O(1) computational steps per generated token, and this is true regardless of which token is being generated.
The LLM sees each token in its context window when generating the next token, so it can compute problems in O(n^2) [where n is the context window size].
LLMs can get around the computational requirements by "showing their working" and simulating a mechanical computer (one without backtracking, so not Turing complete) in their context window.
This only works if the context window is large enough to contain the work...
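The counting argument above can be sketched as a toy model (the function names and the per-dimension constant are hypothetical; a real transformer's per-token cost also scales with model size and layer count):

```python
# Toy model of per-token compute in an attention-style LLM.
# Generating token i attends over the i tokens already in context, so
# per-token work is O(i) in sequence length (times a model-size
# constant), and generating n tokens within the window costs O(n^2).

def attention_ops_for_token(context_len: int, d_model: int = 64) -> int:
    # One query attends to every key in context: context_len dot products.
    return context_len * d_model

def total_generation_ops(n_tokens: int, d_model: int = 64) -> int:
    # Sum the per-token cost as the context grows token by token.
    return sum(attention_ops_for_token(i, d_model) for i in range(1, n_tokens + 1))

# Doubling the number of generated tokens roughly quadruples total work:
ratio = total_generation_ops(200) / total_generation_ops(100)
print(ratio)  # approaches 4 as n grows
```

For a fixed model and fixed window, the per-token cost is bounded by a constant, which is the O(1)-per-token claim; "showing working" spends more tokens (and hence more total compute) on harder problems.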
A reason I mood affiliate with shard theory so much is that like...
I'll have some contention with the orthodox ontology for technical AI safety and be struggling to adequately communicate it, and then I'll later listen to a post/podcast/talk by Quintin Pope/Alex Turner, or someone else trying to distill shard theory and then see the exact same contention I was trying to present expressed more eloquently/with more justification.
One example is that like I had independently concluded that "finding an objective function that was existentially safe when optimis...
"All you need is to delay doom by one more year per year and then you're in business" — Paul Christiano.
Took this to drafts for a few days with the intention of refining it and polishing the ontology behind the post.
I ended up not doing that as much, because the improvements I was making to the underlying ontology felt better presented as a standalone post, so I mostly factored them out of this one.
I'm not satisfied with this post as is, but there's some kernel of insight here that I think is valuable, and I'd want to be able to refer to the basic thrust of this post/some arguments made in it elsewhere.
I may make further edits to it in future.
...It should be noted, however, that while inner alignment is a robustness problem, the occurrence of unintended mesa-optimization is not. If the base optimizer's objective is not a perfect measure of the human's goals, then preventing mesa-optimizers from arising at all might be the preferred outcome. In such a case, it might be desirable to create a system that is strongly optimized for the base objective within some limited domain without that system engaging in open-ended optimization in new environments.(11) One possible way to accomplish this might be t
Is this a correct representation of corrigible alignment:
March 22nd is when my first exam starts.
It finishes June 2nd.
Is it possible for me to delay my start a bit?
I'm gestating on this post. I suspect part of my original framing was confused, so I'll just let the ideas ferment some more.
Yeah, for humans in particular, I think the statement is not true of biological evolution alone.
But also, I'm not sure you're looking at it on the right level. Any animal presumably does many bits' worth of selection in a given day, but the durable/macroscale effects are better explained by evolutionary forces acting on the population than by the actions of individual animals within their lifetimes.
Or maybe this is just a confused way to think/talk about it.
I could change that. I was thinking of work done in terms of bits of selection.
Though I don't think that statement is true of humans unless you also include cultural memetic evolution (which I think you should).
Yeah, I'm aware.
I would edit the post once I have better naming/terminology for the distinction I was trying to draw.
It happened as something like "humans optimise for local objectives/specific tasks" which eventually collapsed to "local optimisation".
[Do please suggest better adjectives!]
Hmm, the etymology was that I was using "local optimisation" to refer to the kind of task specific optimisation humans do.
And global was the natural term to refer to the kind of optimisation I was claiming humans don't do but which an expected utility maximiser does.
The "global" here means that all actions/outputs are optimising towards the same fixed goal(s):
...Local Optimisation
- Involves deploying optimisation (search, planning, etc.) to accomplish specific tasks (e.g., making a good move in chess, winning a chess game, planning a trip, solving a puzzle).
- The choice of local tasks is not determined as part of this framework; local tasks could be subproblems of another optimisation problem (e.g., picking a good next move as part of winning a chess game), generated via learned heuristics, etc.
Global Optimisation
- Entai
Still thinking about consequentialism and optimisation. I've argued that global optimisation for an objective function is so computationally intractable as to be prohibited by the laws of physics of our universe. Yet it's clearly the case that e.g. evolution is globally optimising for inclusive genetic fitness (or perhaps patterns that more successfully propagate themselves if you're taking a broader view). I think examining why evolution is able to successfully globally optimise for its objective function wou...
Strongly upvoted that comment. I think your point about needing to understand the mechanistic details of the selection process is true/correct.
That said, I do have some contrary thoughts:
To: @Quintin Pope, @TurnTrout
I think "Reward is not the Optimisation Target" generalises straightforwardly to any selection metric.
Tentatively, something like: "the selection process selects for cognitive components that historically correlated with better performance according to the metric in the relevant contexts."
From "Contra "Strong Coherence"":
...Many observed values in humans and other mammals (e.g. fear, play/boredom, friendship/altruism, love, etc.) seem to be values that were instrumental for promoting inclusive genetic fitness (promotin
Given that the optimisation performed by intelligent systems in the real world is local/task specific, I'm wondering if it would be more sensible to model the learned model as containing (multiple) mesa-optimisers rather than being a single mesa-optimiser.
My main reservation is that I think this may promote a different kind of confused thinking; it's not the case that the learned optimisers are constantly competing for influence and their aggregate behaviour determines the overall behaviour of the learned algorithm. Rather, the learned algorithm employs optimisation towards different local/task-specific objectives.
I've come around to the view that global optimisation for a non-trivial objective function in the real world is grossly intractable, so mechanistic utility maximisers are not actually permitted by the laws of physics[1][2].
My remaining uncertainty around expected utility maximisers as a descriptive model of consequentialist systems is whether the kind of hybrid optimisation (mostly learned heuristics, some local/task specific planning/search) that real world agents perform converges towards better approximating...
I think mesa-optimisers should not be thought of as learned optimisers, but systems that employ optimisation/search as part of their inference process.
The simplest case is that pure optimisation during inference is computationally intractable in rich environments (e.g. the real world), so systems (e.g. humans) operating in the real world, do not perform inference solely by directly optimising over outputs.
Rather, optimisation is sometimes employed as one part of their inference strategy. That is, systems o...
A lot of LessWrong actually relies on just trusting users not to abuse the site/features.
I make judgment calls on when to repost keeping said trust in mind.
And if reposts were a nuisance people could just mass downvote reposts.
But in general, I think it's misguided to try and impose a top down moderation solution given that the site already relies heavily on user trust/judgment calls.
This repost hasn't actually been a problem and is only becoming an issue because we're discussing whether it's a problem or not.
My claim is mostly that real world intelligent systems do not have values that can be well described by a single fixed utility function over agent states.
I do not see this answer as engaging with that claim at all.
If you define utility functions over agent histories, then everything is an expected utility maximiser for the function that assigns positive utility to whatever action the agent actually took and zero utility to every other action.
I think such a definition of utility function is useless.
If however you define utility functions over agent states, ...
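The triviality construction above can be sketched concretely (a minimal toy; the function names are hypothetical):

```python
# For any behaviour (a policy mapping histories to actions), define a
# utility function over (history, action) pairs that assigns 1 to the
# action the agent actually takes and 0 to everything else. The agent
# then vacuously "maximises expected utility" under this function.

def trivialising_utility(policy):
    def u(history, action):
        return 1.0 if action == policy(history) else 0.0
    return u

# Example: an arbitrary policy, which could even be money-pumpable.
policy = lambda history: "A"  # always picks "A", whatever happened
u = trivialising_utility(policy)

history = ("B", "C")
best = max(["A", "B", "C"], key=lambda a: u(history, a))
print(best)  # "A": the policy's own action maximises u by construction
```

Since the construction works for literally any policy, "is an expected utility maximiser over histories" places no constraint on behaviour, which is the sense in which the definition is useless.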
I mean, I think it can be abused, and the use case where I was informed of it was a different one (making substantial edits to a post). I do not know that they necessarily approve of republishing for this particular use case.
But the alternative to republishing for this particular use case is just reposting the question as an entirely new post which seems strictly worse.
By the nature of coherence, the ensemble of coherent and aligned parts would get to their purpose much more efficiently than the other parts are not-getting to that purpose and being a hindrance, assuming the purpose was reachable enough. This means that coherent agents are not just reflectively consistent, but also stable: once there's some seed of coherence, it can win over the non-coherent parts.
I think this fails to adequately engage with the hypothesis that values are inherently contextual.
Alternatively, the kind of cooperation you describe where ...
I mean I think it's fine.
I have not experienced the feature being abused.
In this case, I didn't get any answers the last time I posted it and ended up needing answers, so I'm reposting.
Better than posting the entire post again as a new post and losing the previous conversation (which is what would happen if not for this feature).
Like what's the argument that it's defecting? There are just legitimate reasons to repost stuff and you can't really stop users from reposting stuff.
FWIW, it was a mod that informed me of this feature.
Reposted it because I didn't get any good answers last time, and I'm working on a post that's a successor to this one currently and would really appreciate the good answers I did not get.
My main takeaway is that I'm going to be co-authoring posts with people I'm trying to get into AI safety, so they aren't stonewalled by moderation.
Realised later on, thanks.
I guess in this formalism you'd need to consider the empty string/similar null token a valid token, so the prompt/completion is prefixed/suffixed with empty strings (to pad to the size of the context window).
Otherwise, you'd need to define the domain as a union over the set of all strings with token lengths ≤ the context window size.
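The two formalisations can be sketched as follows (names and the tiny window size are illustrative only):

```python
from itertools import product

WINDOW = 8
NULL = ""  # the designated null/empty token

def pad_to_window(tokens):
    """Option 1: pad with null tokens so every input has length WINDOW."""
    assert len(tokens) <= WINDOW
    return tuple(tokens) + (NULL,) * (WINDOW - len(tokens))

def domain_option_2(vocab):
    """Option 2: the union of all token strings of length <= WINDOW.

    Note this domain is exponential in WINDOW: sum_{n=0}^{W} |vocab|^n.
    """
    return [seq for n in range(WINDOW + 1) for seq in product(vocab, repeat=n)]

print(pad_to_window(["the", "cat"]))
# -> ('the', 'cat', '', '', '', '', '', '')
```

Option 1 keeps the function's domain a single fixed-arity product (at the cost of making the null token a first-class vocabulary element); option 2 avoids the null token but makes the domain a union over arities.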
It's working now!
https://podcasts.google.com/feed/aHR0cHM6Ly9heHJwb2RjYXN0LmxpYnN5bi5jb20vcnNz/episode/ODVlM2RkNmItMTdkZi00MWYwLTg2YjAtOWIxY2JkOTBlYjgw?ep=14