Ryan Kidd

Comments

SERI ML Alignment Theory Scholars Program 2022

Yes, that is currently the plan. If we experience a massive influx of applications in the next two days, it is possible this might change slightly, but I doubt it. We will work hard to keep to the announcement and commencement deadlines.

SERI ML Alignment Theory Scholars Program 2022

Application deadlines have been extended to May 22! Feel free to message me or Victor if you have any questions.

Ensembling the greedy doctor problem

The problem with this solution is that the doctors can, through acausal trade, precommit to always diagnosing "ill", so that they are rewarded in the case that they are the one who is randomly chosen for treatment (assuming they know the game).

Ensembling the greedy doctor problem

Treating health as a continuous rather than binary variable does complicate this problem and, I think, breaks my solution. If the doctors agree on an ordinal ranking of all patients from "most ill" to "most hale", they can coordinate their diagnoses much more easily, as they can search over a smaller space of ensemble diagnoses. If there are lots of "degenerate cases" (i.e. people with the same degree of illness) this might be harder. Requiring a certain minimum Hamming distance (based on some prior) from the all-ill Schelling point doesn't help at all in the case of a nondegenerate ordinal ranking.
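
As a concrete version of the "smaller search space" point, here is a minimal Python sketch; the patient count, labels, and minimum-distance value are illustrative assumptions rather than anything from the original problem. With a shared ordinal ranking, the only ranking-consistent ensemble diagnoses are the n + 1 threshold vectors, and a minimum Hamming distance from the all-ill vector removes only a handful of them.

```python
import itertools

n_patients = 10

# Hypothetical shared ordinal ranking: index 0 is "most ill", index n-1 is "most hale".
# A ranking-consistent diagnosis marks the k most-ill patients as ill (1) for some k.
threshold_diagnoses = [
    tuple([1] * k + [0] * (n_patients - k)) for k in range(n_patients + 1)
]

# Without a shared ranking, the doctors would have to coordinate over every
# possible ill/hale labelling of the patients.
all_diagnoses = list(itertools.product([0, 1], repeat=n_patients))

print(f"coordination space with a shared ranking: {len(threshold_diagnoses)} diagnoses")
print(f"coordination space without one:           {len(all_diagnoses)} diagnoses")

# A minimum Hamming distance from the all-ill Schelling point only forces the
# doctors to call at least `min_distance` patients hale; most thresholds survive.
min_distance = 3  # arbitrary prior-based floor, chosen for illustration
surviving = [d for d in threshold_diagnoses if d.count(0) >= min_distance]
print(f"threshold diagnoses meeting the distance requirement: {len(surviving)}")
```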

Ensembling the greedy doctor problem

If only a few of the patients are ambiguously ill, it might be possible to discern this by observing an almost perfect agreement between ensemble diagnoses. In the iterated game, this might cause the doctors to preferentially flip diagnoses that they think are more ambiguous or not flip diagnoses that they think are more certain. This is not a perfect guarantee, of course.
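
As a toy illustration of that agreement signal (the patient count, the ambiguous set, and the diagnosing rule are all invented for the sketch): if two independent greedy doctors agree on every clear-cut patient and effectively coin-flip on the ambiguous ones, the overseer can read off a subset of the ambiguous patients just from where the ensemble diagnoses disagree.

```python
import random

random.seed(0)

n_patients = 20
ambiguous = {3, 11, 17}  # patients whose true state is genuinely unclear

def diagnose():
    # Clear-cut patients get a fixed, "obvious" label; ambiguous ones are a coin flip.
    return [random.random() < 0.5 if i in ambiguous else (i % 2 == 0)
            for i in range(n_patients)]

doctor_a, doctor_b = diagnose(), diagnose()
disagreements = [i for i, (a, b) in enumerate(zip(doctor_a, doctor_b)) if a != b]

# Any disagreement must fall inside the ambiguous set, so near-perfect agreement
# plus the location of the few disagreements points the overseer at those patients.
print("doctors disagree on patients:", disagreements)
```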

Ensembling the greedy doctor problem

I'm assuming that race and sex and all other discernible features are independent and identically distributed, and also that the game overseer will not pay the doctors if they coordinate on obvious patterns involving race, sex, etc.

Is Fisherian Runaway Gradient Hacking?

I think it's important to distinguish between "fitness as evaluated on the training distribution" (i.e. the set of environments ancestral peacocks roamed) and "fitness as evaluated on a hypothetical deployment distribution" (i.e. the set of possible predation and resource scarcity environments peacocks might suddenly face). Also important is the concept of "path-dependent search": the fitness landscape can have a basin around a trait value x_1 that biases local search towards x_1, even though the global optimum lies at some other value x_2.

  1. In this case, I'm imagining that Fisherian runaway boosts x as long as it still indicates good fitness on-distribution. However, it could be that x_1 is only the "local optimum for fitness" and x_2 is in reality the global optimum for fitness. In this case, the search process has chosen an initial x-direction that biases sexual selection towards x_1. This is equivalent to gradient descent finding a local minimum (see the toy sketch after this list).
  2. I think I agree with your thoughts here. I do wonder if sexual selection in humans has reached a point where we are deliberately immune to natural selection pressure due to such a distributional shift and acquired capabilities.
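
Here is a toy picture of that last analogy; the loss function and both start points are arbitrary choices for illustration, not anything from the post. Plain gradient descent on a non-convex one-dimensional loss converges to whichever basin the initialisation points toward, which is the sense in which an unlucky initial x-direction locks the search into x_1.

```python
# Non-convex toy loss with a local minimum near x ~ +1.1 and the global minimum near x ~ -1.3.
def f(x):
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

# The same search procedure ends up in different basins depending only on where it starts.
for x0 in (0.5, -0.5):
    x_final = gradient_descent(x0)
    print(f"start x0 = {x0:+.1f} -> converges to x = {x_final:+.3f}, loss f(x) = {f(x_final):+.3f}")
```
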
Is Fisherian Runaway Gradient Hacking?

In the context of your model, I see two potential ways that Fisherian runaway might occur:

  1. Within each generation, males that survive with higher x are consistently fitter on average than males that survive with lower x, because the fitness required to survive monotonically increases with x. Therefore, in every generation, choosing males with higher x is a good proxy for local improvements in fitness. However, the performance detriments of high x "off-distribution" are never signalled. In an ML context, this is basically distributional shift via proxy misalignment (see the toy simulation after this list).
  2. Positive feedback that negatively impacts fitness "on-distribution" might occur temporarily if selection for higher x is so strong that it has "acquired momentum" that ensures females will select for higher-x males for several generations past the point where the trait becomes net costly for fitness. This is possible if the negative effects of the trait take longer to manifest as selection pressure than the time window during which sexual selection boosts the trait via preferential mating. This mechanism is only temporary, but I can see search processes halting prematurely in an ML context.
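
A minimal simulation of mechanism 1; all numbers, the selection rule, and both fitness functions are invented for illustration. Selecting each generation on the observable trait x, which perfectly tracks fitness in the "training" environment, drives the population towards trait values that would be net costly in an off-distribution environment where large ornaments carry a penalty, and the off-distribution cost is never signalled during selection.

```python
import random

random.seed(0)

def on_distribution_fitness(x):
    # In the ancestral environment, a higher trait value always signals higher fitness.
    return x

def off_distribution_fitness(x):
    # Off-distribution, ornaments beyond a threshold carry a steep survival penalty.
    return x - 0.5 * max(0.0, x - 3.0) ** 2

population = [random.gauss(1.0, 0.5) for _ in range(200)]

for _ in range(30):
    # Sexual selection on the proxy: keep the top half by trait value, then refill
    # the population with noisy copies of the selected parents (preferential mating).
    # Note that off_distribution_fitness never enters the selection step.
    population.sort(key=on_distribution_fitness, reverse=True)
    parents = population[:100]
    population = parents + [p + random.gauss(0.0, 0.3) for p in parents]

def mean(values):
    return sum(values) / len(values)

print(f"mean trait value after selection: {mean(population):.2f}")
print(f"mean on-distribution fitness:     {mean([on_distribution_fitness(x) for x in population]):.2f}")
print(f"mean off-distribution fitness:    {mean([off_distribution_fitness(x) for x in population]):.2f}")
```
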
Introduction to inaccessible information

I think I agree. To the extent that a 'world model' is an appropriate abstraction, I think the levers to pull for resolving world model mismatches seem to be:

  • Post-facto: train an already capable (prosaic?) AI to explain itself in a way that accounts for world model mismatches via a clever training mechanism and hope that only accessible consequences matter for preserving human option value; or
  • Ex-ante: build AI systems in an architecturally transparent manner such that properties of their world model can be inspected and tuned, and hope that the training process makes these AI systems competitive.

I think you are advocating for the latter, or have I misrepresented the levers?

Introduction to inaccessible information

I think world model mismatches are possibly unavoidable with prosaic AGI, which might reasonably bias one against this AGI pathway. It seems possible that much of human and AGI world models would be similar by default if 'tasks humans are optimised for' is a similar set to 'tasks AGI is optimised for' and compute is not a performance-limiting factor, but I'm not at all confident that this is likely (e.g. maybe an AGI draws coarser- or finer-grained symbolic Markov blankets). Even if we build systems that represent the things we want and the things we do to get them as distinct symbolic entities in the same way humans do, they might fail to be competitive with systems that build their world models in an alien way (e.g. draw Markov blankets around symbolic entities that humans cannot factor into their world model due to processing or domain-specific constraints).

Depending on how one thinks AGI development will happen (e.g. whether the strategy-stealing assumption is important), resolving world model mismatches seems more or less of a priority for alignment. If near-term performance competitiveness heavily influences deployment, I think it's reasonably likely that prosaic AGI is prioritised and world model mismatches occur by default because, for example, compute is likely a performance-limiting factor for humans on tasks we optimise AGI for, or the symbolic entities humans use are otherwise nonuniversal. I think AGI might generally require incorporating alien features into world models to be maximally competitive, but I'm very new to this field.
